PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-10-15 19:54:52 +08:00

Author	SHA1	Message	Date
BobLd	097692f1cb	Move ArtifactType inside PdfArtifactMarkedContent	2020-01-09 11:24:32 +00:00
Eliot Jones	4976fa1027	handle incorrect end image detected: since an inline image's data stream may contain the characters 'ei' as a result of compression, it's possible to read an end image operator mid-data. this results in the next operator also being end image and the content stream being in an invalid state. to recover when we detect this situation, we remove the previous operator, read to the current operator and replace the operator and data bytes in the list of operations.	2020-01-08 12:17:30 +00:00
BobLd	7be36fdc58	Update PublicApiScannerTests 2	2020-01-08 11:07:27 +00:00
BobLd	4b929482cc	Update PublicApiScannerTests	2020-01-08 10:46:49 +00:00
BobLd	84bab1b627	Add basic marked content extraction capabilities	2020-01-08 10:34:01 +00:00
Eliot Jones	10dc5a8eed	don't cache invalid offsets unless brute forced: don't cache parsed objects if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because 1 object may lie in 2 streams, 1 valid and 1 invalid. if the invalid stream is parsed first for another object then the valid stream will never be read.	2020-01-07 14:54:12 +00:00
Eliot Jones	0b048fde57	handle eof further back in file: an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding. we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.	2020-01-07 11:48:09 +00:00
Eliot Jones	00bd285262	add support for quadpoints to annotations: highlight, link, strikeout, squiggly and underline annotation types may define a set of quadrilaterals using the quadpoints entry. this defines the regions to show/activate the annotation. the order of points in the quadpoints array does not match the specification so we provide a convenience class to access the point data rather than interpreting it as a rectangle: https://stackoverflow.com/questions/9855814/pdf-spec-vs-acrobat-creation-quadpoints	2020-01-05 16:23:07 +00:00
Eliot Jones	b29354e3e6	move compact font format fonts to fonts project	2020-01-05 12:08:01 +00:00
Eliot Jones	bbde38f656	move tokenizers to their own project: since both pdfs and Adobe Type1 fonts use postscript type objects, tokenization is needed by the main project and the fonts project	2020-01-05 10:40:44 +00:00
Eliot Jones	d09b33af4d	move tokens to new project	2020-01-05 10:07:01 +00:00
Eliot Jones	1c38a2ae8a	move pdfline to the core project	2020-01-05 09:33:59 +00:00
Eliot Jones	15525acbaa	move document layout analysis and export to new project	2020-01-05 09:19:58 +00:00
Eliot Jones	a6541f1cfc	fix test references: update references for unit tests to reference new core and fonts projects. all tests except the public api scanner tests now run successfully.	2020-01-04 22:56:41 +00:00
Eliot Jones	74774995d6	complete move of truetype, afm and standard14 fonts: the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.	2020-01-04 22:39:13 +00:00
Eliot Jones	cf1b8651d6	make adler32checksum public: there's no reason to keep adler32checksum internal so it is made public in case people find it useful.	2020-01-04 10:27:07 +00:00
Eliot Jones	b15a3a9b57	tidy up truetype tables: improves the naming of truetype related classes; uses correct numeric type for the loca table; makes a few related classes public.	2020-01-04 10:27:07 +00:00
Eliot Jones	90f8f97bfd	add simple test case for subsetting issue #98: adds a single test which proves that the invalid truetype subsetting with roboto is related to our font subsetting code. since we can subset the same text correctly with windows calibri, we must be reading roboto incorrectly.	2020-01-04 10:27:07 +00:00
Eliot Jones	fe315be2ef	fix truetype subsetting for composite glyphs #98: each glyph included in the subset must count towards the number of glyphs, the horizontal metrics and the maximum profile table for the output truetype font. each glyph must also lie on a 4 byte boundary in the output file. the output file is valid for the windows system font calibri containing accented characters but the roboto subset files are still invalid. moves all subsetting related classes into their own namespace which will be made public.	2020-01-04 10:27:07 +00:00
Eliot Jones	336947db73	add writing methods to truetype tables #98: since we have verified the problem with the characters not appearing in acrobat reader isn't the checksum (other files also have invalid checksums but work), it seems likely the issue is with the os/2 table. this change moves the logic for writing out the cmap table, the format 6 cmap sub-table, truetype table headers and the os/2 table into the classes themselves. now that we can write an os/2 table and we've tested that the output matches the input, we can overwrite the os/2 table in order to work out which of the os/2 errors is causing our font to be invalid. the writeable interface should be added to more and more parts of the codebase so that writing, editing and document creation become first class citizens rather than hardcoded additions. this change also adds the macroman (1,0) cmap subtable to edited fonts so that it is present for consumers which expect it.	2020-01-04 10:27:07 +00:00
Eliot Jones	9fff879bd4	fix tests by using custom equality comparers: since we now round glyph widths for truetype fonts in the widths array of the pdf, some values are out by a very small amount from the expected value. since we don't care about such fractional inaccuracy we use a custom comparer for these tests.	2020-01-04 10:27:07 +00:00
Eliot Jones	59c43cc526	truetype encoding replacer and checksum calculator #98: we need to provide a custom cmap for our overridden fonts when creating a document using truetype fonts. in order to do this without writing a complete subsetter (yet) we simply rearrange the font by moving the cmap table to the end of the font. in order to keep a valid font we need to recalculate the offsets and checksums for all table headers. this adds a calculator which can calculate per-table checksums as well as the whole-font checksum used to calculate the checksum adjustment recorded in the head table. now that the cmap table has been moved to the end of the font file we can overwrite it with a different-length custom cmap table without further invasive changes to the rest of the truetype file. this isn't implemented yet in this commit but will be the next thing to implement. in truetype writing font we've temporarily reverted the change which maps characters to bytes until the custom cmap is written, so we can ensure that, for this change, the output font file is still valid and can be interpreted by pdf consumers. once the custom cmap is written we can uncomment the mapping logic and it should all just work.	2020-01-04 10:27:07 +00:00
Eliot Jones	f319e7f4b5	adds per character byte mapping to truetype #98: this starts to add logic for per-character mapping of unicode characters to byte values for truetype fonts in the pdf document builder. in order to support unicode characters outside the 0-255 range when creating new pdf documents without using composite fonts, we need to map values outside this range into it. to do this we start at 1 and map each character we encounter to the next code, up to a maximum of 255. we provide a custom tounicode cmap in the font dictionary which maps these byte values, 0-255, back to unicode code points (short). we also provide a custom firstchar, lastchar and widths array for the font mapping just the values we use. since fonts no longer contain just the latin character set the font descriptor enum is set to have the symbolic flag set. this means values will be looked up in either the mac-roman (1, 0) or windows-symbol (3, 0) cmap tables (these cmap tables are distinct from cmap tables in the pdf file) inside the actual truetype font bytes. this means the currently generated font file is invalid, because while the widths array and tounicode cmap return the correct values, the actual font itself returns whatever values were in those positions before the remapping occurred. in order to fix this we will need to override the windows-symbol cmap contained in the underlying truetype font to match our mapping. this will be a lot of work and involve significant rewriting of the font file itself, in order to preserve checksum integrity.	2020-01-04 10:27:07 +00:00
BobLd	07f51712c6	Update PublicApiScannerTests	2020-01-03 12:31:23 +00:00
BobLd	3a060d9769	Update PublicApiScannerTests	2019-12-28 14:43:09 +00:00
Eliot Jones	87528199c6	use byte values when showing text for document builder #98: when writing text content the current show text operator was just writing the unicode string value and hoping it produced the correct value in the resulting document despite the values being consumed in a different encoding. this change adds a method to retrieve the corresponding byte value for a unicode character and uses that to write a hex show text operator to the page content. this is only implemented for standard14 fonts in this change. for standard14 fonts we look up the corresponding name for the unicode value from the adobe glyph list. once we find the corresponding glyph name we look up the code value in the encoding we have chosen when writing standard14 fonts (macromanencoding). this value is then the byte value written to the show text operator. if the value does not appear in any of the lookups we throw a not supported exception. this also adds a test case which will still fail for czech characters in a truetype font, the issue reported in #98.	2019-12-28 14:42:27 +00:00
Eliot Jones	1e29c298cf	use correct numeric types when parsing truetype fonts	2019-12-24 12:22:17 +00:00
Eliot Jones	935d182888	use doubles where calculations are being run	2019-12-24 12:22:17 +00:00
Eliot Jones	e984180b3d	add method to retrieve any embedded files	2019-12-21 16:16:36 +00:00
Eliot Jones	4d697e3669	allow the user to supply multiple passwords for decryption: previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.	2019-12-20 15:11:05 +00:00
Eliot Jones	7296c3c125	merge pull request #105 from BobLd/master: whitespace covering algorithm and #104	2019-12-20 11:57:31 +00:00
Eliot Jones	c30cd1b96d	use cid font subroutines where applicable. add ucs 2 cmap support for type 1 fonts: cid cff fonts have multiple sub-fonts and multiple private dictionaries, in addition to a top level font and private dictionary. this fix uses the specific sub-dictionary when getting local subroutines on a per-glyph basis. chinese, japanese or korean fonts can use a ucs-2 encoding cmap for retrieving unicode values. add support for the additional glyph list for unicode values in true type fonts. adds nonmarkingreturn mapping to carriage return. makes font parsing classes static where there's no reason for them to be per-instance.	2019-12-19 13:33:44 +00:00
BobLd	6dba5bb2b4	update PublicApiScannerTests	2019-12-18 11:43:39 +00:00
Eliot Jones	1fb416eee3	add convenience method to retrieve all hyperlinks and their text from annotations on a page	2019-12-18 11:41:02 +00:00
BobLd	5cf1f6c58c	Modifications and adding some tests	2019-12-16 14:36:52 +00:00
BobLd	1656411fcb	Improving Geometry classes with Tests	2019-12-14 11:41:11 +00:00
Eliot Jones	75a6260501	make cropbox public	2019-12-06 17:34:51 +00:00
Eliot Jones	e38da0a403	add support for alternative colorspace in separation colorspaces #89	2019-12-06 17:23:15 +00:00
Eliot Jones	e01d77b93a	add negative test case and make tests non-lenient	2019-12-05 13:56:12 +00:00
Eliot Jones	2e5c995322	make external nodes different to document nodes and finish reimplementation	2019-12-05 13:21:19 +00:00
Eliot Jones	ecf0b8743b	make bookmarknode immutable and use scanner when retrieving bookmarks	2019-12-05 12:03:30 +00:00
Eliot Jones	928347bcce	merge pull request #84 from BobLd/master: add basic bookmarks extraction capabilities.	2019-12-04 14:24:10 +00:00
Eliot Jones	a967e0898a	handle missing width and height correctly for compact font format fonts #75	2019-12-04 14:19:06 +00:00
Eliot Jones	80f024dbed	make form access public	2019-11-27 16:36:25 +00:00
Eliot Jones	df3cb43cfc	update coverage libraries	2019-11-27 16:16:11 +00:00
Eliot Jones	ed53773c7b	handle checked state of radio buttons and checkboxes	2019-11-27 15:34:28 +00:00
Eliot Jones	910e22a4e9	wrap checkboxes and radiobuttons in their own form field types with access to the child collections	2019-11-26 16:33:24 +00:00
BobLd	89daa2818e	update PublicApiScannerTests	2019-11-04 15:17:25 +00:00
BobLd	99f260befb	Enhancing NearestNeighbourWordExtractor: making the code easier to read; using 20% of Width instead of 60%; making DefaultWordExtractor public	2019-10-21 20:51:27 +01:00
Eliot Jones	efe7896824	#75 support vertical writing mode fonts	2019-10-17 15:57:04 +01:00
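
Commit 59c43cc526 above mentions recalculating per-table checksums and the whole-font checksum used for the checksum adjustment recorded in the head table. As a rough illustration of that arithmetic as the TrueType/OpenType format defines it, a minimal Python sketch might look like the following. This is not PdfPig's own C# code; the function names and the `adjustment_offset` parameter are made up for this example.

```python
import struct

# Constant defined by the TrueType/OpenType format for checksumAdjustment.
MAGIC = 0xB1B0AFBA


def table_checksum(data: bytes) -> int:
    """Sum the bytes as big-endian uint32 words, modulo 2**32.

    Data that is not a multiple of 4 bytes long is treated as if it were
    zero-padded up to the next 4-byte boundary.
    """
    padded = data + b"\x00" * (-len(data) % 4)
    total = 0
    for (word,) in struct.iter_unpack(">I", padded):
        total = (total + word) & 0xFFFFFFFF
    return total


def checksum_adjustment(font_bytes: bytes, adjustment_offset: int) -> int:
    """Compute the head table's checksumAdjustment for a whole font file.

    adjustment_offset (hypothetical parameter) is the absolute byte offset
    of the checksumAdjustment field, i.e. the head table's offset plus 8.
    The field is zeroed before summing, as the format requires.
    """
    zeroed = bytearray(font_bytes)
    zeroed[adjustment_offset:adjustment_offset + 4] = b"\x00\x00\x00\x00"
    return (MAGIC - table_checksum(bytes(zeroed))) & 0xFFFFFFFF
```

Because the checksum is computed over 4-byte words, moving or resizing a table (such as moving cmap to the end of the file) means recomputing that table's checksum and then the whole-font adjustment, which is why the commit treats the two together.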

333 Commits