PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-09-21 04:17:57 +08:00

Author	SHA1	Message	Date
Eliot Jones	4150881be9	recover from invalid acro-form references we add a try/catch to the direct object finder's tryget method so it returns false rather than throwing. if we have an acro-form reference in the catalog but no corresponding object in the document we instead scan all objects in the document to find form fields and reconstruct the acro-form dictionary.	2020-02-27 12:08:40 +00:00
Eliot Jones	f415c3116e	cross reference offset is in the xref table we ignore the error previously we checked the offset was not inside the table (correct thing to check), however this is only a special case of the more general issue (cross reference offsets are wrong). we move handling for this into the pdf token scanner. if we attempt to read an object at an offset and it fails we brute force the entire file to find correct offsets. we also needed to add handling to make sure we don't attempt to use stream length tokens if we're brute-forcing since we can't look up indirect references for length.	2020-02-26 14:03:46 +00:00
Eliot Jones	7d0d5806a9	fix reverse xref location search when brute force searching for the start of the cross-reference table (xref) we read 5 byte buffers, previously if the 'x' of 'xref' was the first character of the buffer we skipped it. this checks when 'x' is the first character of the buffer.	2020-02-26 12:55:11 +00:00
Eliot Jones	f07e2dfb84	more tolerant handling of endimage recovery fixes the recorded offset when an endimage is recovered from the first time. it was off by one so if the subsequent end image was also the wrong tag then the second attempt at recovery failed. also allows recovery when other tags appear after an endimage as long as they're not block ending operations (end text, perhaps pop/push in future).	2020-02-26 12:41:39 +00:00
Eliot Jones	d6d3869fe2	fix brute force searcher offsets the brute force searcher offsets were off by one. this change means the offset returned is now aligned with the object number in the object number/generation/operator triple.	2020-02-24 12:24:18 +00:00
Eliot Jones	8ab2838063	recover from invalid cross reference position if we are reading a cross reference offset which contains a number we assumed it was a stream object. if it's not we now brute-force the entire file looking for an 'xref' token. this should be combined with a search for cross-reference streams and should run when we read neither the numeric token or an 'xref' token but for now this fixes the observed issue. also adds number of images to the page api to prevent consumers needing to enumerate.	2020-01-28 18:07:05 +00:00
Eliot Jones	693a3d5958	use offset to file header to correct cross references if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.	2020-01-26 15:30:20 +00:00
Eliot Jones	a561c8954e	handle the format header being preceded by nonsense some files seem to have the format header preceded by large amounts of junk but this appears to be valid for chrome and acrobat reader. this change ups the amount of nonsense to be read prior to the version header. also makes parsing of the version header culture invariant which may be related to #85.	2020-01-25 16:53:41 +00:00
Eliot Jones	ba09a13d08	more end image recovery logic since inline image data may contain the end image "ei" token inside the data stream there's no reliable way to actually determine if we've read all the data. for this reason if we end up with an invalid state parsing operations after we've read the end image token we try to recover by reading from the previous token to the next end image token if any. we supply log information to let the consumer know this is what we're doing. it's still not bullet-proof but it should be good enough. also support negative page rotation values by adding them to a 360 degree rotation so -90 degrees clockwise is 270 degrees clockwise.	2020-01-25 15:53:08 +00:00
Eliot Jones	efc258b0f0	use tokenscanner when converting array to rectangle an array of 4 items representing a rectangle may define its values as indirect references. when converting to a rectangle we pass a pdf token scanner to resolve any indirect references.	2020-01-13 10:20:08 +00:00
Eliot Jones	4976fa1027	handle incorrect end image detected since an inline image's data stream may contain the characters 'ei' as a result of compression it's possible to read an end image operator mid-data, this results in the next operator also being end image and the content stream being in an invalid state. to recover from this when we detect this situation we remove the previous operator, read to the current operator and replace the operator and data bytes in the list of operations.	2020-01-08 12:17:30 +00:00
Eliot Jones	a083214da2	handle missing mediabox irrespective of parsing type since pdfbox defaults to us letter if the mediabox is missing rather than throwing we remove the behaviour where uselenientparsing is false which used to throw, now we log an error. throwing didn't provide any benefit to consumers.	2020-01-08 11:34:35 +00:00
Eliot Jones	63b118b141	handle type1 fonts disguised as truetype if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser. also handles a closesubpath command appearing without any path construction operators.	2020-01-07 16:49:21 +00:00
Eliot Jones	903d71a93d	skip cross references outside file if the previous cross-reference location points to an offset outside the file size we skip it. also makes cid font factory more resilient by skipping missing descriptors.	2020-01-07 12:37:41 +00:00
Eliot Jones	0b048fde57	handle eof further back in file an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding. we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.	2020-01-07 11:48:09 +00:00
Eliot Jones	b29354e3e6	move compact font format fonts to fonts project	2020-01-05 12:08:01 +00:00
Eliot Jones	74774995d6	complete move of truetype, afm and standard14 fonts the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.	2020-01-04 22:39:13 +00:00
Eliot Jones	7c0ef111ea	move classes to new projects to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.	2020-01-04 16:38:18 +00:00
Eliot Jones	e048bb8c2c	performance tuning for numeric tokens and parsing	2019-12-24 12:22:17 +00:00
Eliot Jones	e984180b3d	add method to retrieve any embedded files	2019-12-21 16:16:36 +00:00
Eliot Jones	4d697e3669	allow the user to supply multiple passwords for decryption previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.	2019-12-20 15:11:05 +00:00
Eliot Jones	c30cd1b96d	use cid font subroutines where applicable. add ucs 2 cmap support for type 1 fonts * cid cff fonts have multiple sub-fonts and multiple private dictionaries, in addition to a top level font and private dictionary. this fix uses the specific sub-dictionary when getting local subroutines on a per-glyph basis. * chinese, japanese or korean fonts can use a ucs-2 encoding cmap for retrieving unicode values. * add support for the additional glyph list for unicode values in true type fonts. adds nonmarkingreturn mapping to carriage return. * makes font parsing classes static where there's no reason for them to be per-instance.	2019-12-19 13:33:44 +00:00
Eliot Jones	dab64ec406	handle newlines before inline images and support larger data streams in brute force search	2019-12-18 12:02:07 +00:00
Eliot Jones	1fb416eee3	add convenience method to retrieve all hyperlinks and their text from annotations on a page	2019-12-18 11:41:02 +00:00
Eliot Jones	f2ead37134	handle missing whitespaces before the start of the object #88	2019-12-09 12:24:20 +00:00
Eliot Jones	75a6260501	make cropbox public	2019-12-06 17:34:51 +00:00
Eliot Jones	e38da0a403	add support for alternative colorspace in separation colorspaces #89	2019-12-06 17:23:15 +00:00
Eliot Jones	ecf0b8743b	make bookmarknode immutable and use scanner when retrieving bookmarks	2019-12-05 12:03:30 +00:00
Eliot Jones	677d2b5e8f	#82 make resource store state local to the page and operation being processed resources such as fonts are linked to page content operations using name labels, e.g. "/F1", these resource labels can be reassigned on different pages or inside form xobjects. we now clear the entire resource state for each page which is parsed and after form xobject operations which use resource dictionaries.	2019-11-25 14:34:02 +00:00
Eliot Jones	84990722ca	#76 add infinite loop protection for brute force search also treats 'm' or 'j' in endstream/endobj as a valid object number start character	2019-10-17 16:50:01 +01:00
Eliot Jones	3f1321141a	#73 process xobject form content when extracting text and images	2019-10-16 14:59:16 +01:00
Eliot Jones	dec4c31a33	fix bug where cross reference stream subsections were skipped a single cross-reference stream may contain multiple disjoint runs of object numbers, previously we only took the first now we load all objects. adds indexer to array token for ease-of-use. adds page number and bounds information to all form fields.	2019-10-10 16:05:21 +01:00
Eliot Jones	2ef45f71d5	make missing acroform types public and start improving data also changes pages to use a proper tree structure since this will be required for resource inheritance and for acroform widget dictionaries.	2019-10-09 14:28:37 +01:00
Eliot Jones	68bcaf3901	#55 move support for images to page and add inline images support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page. treat all warnings as errors.	2019-10-08 14:04:36 +01:00
Eliot Jones	e02e130947	#57 add creation and modified date to document information this enables users to check if xmp metadata is outdated	2019-10-03 12:56:48 +01:00
Eliot Jones	d98b8b43c1	small performance tweaks and remove package license expression package license url is deprecated in favour of package license expression but nuget doesn't seem to support expressions properly for published packages yet so we'll keep the deprecated url for the time being. having both url and expression causes the build to fail. small obvious performance improvements for file header parsing and getting the encoding information using the existing reverse name to code map.	2019-08-18 13:47:01 +01:00
Eliot Jones	0349bedd3e	#57 add access to document metadata and expose wrapper type	2019-08-11 12:42:30 +01:00
Eliot Jones	364bd25fa8	#48 add handling of inline image data to pdf content parsing an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.	2019-08-03 15:42:19 +01:00
Eliot Jones	0dfe742770	continue searching for xref tokens even if an %%eof is encountered #38	2019-07-06 14:26:38 +01:00
Eliot Jones	c495065178	support gs operator, fix systemfonts, apply rotation to glyphs - begin adding support for extended graphics state (the 'gs' operator) including setting the font #39. - apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41. - wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42. - fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.	2019-07-06 14:03:23 +01:00
Eliot Jones	88e02cabab	include rotation in page object #42 we need to apply rotation to the crop and media box and therefore find the correct width and height. but for now correctly deriving the rotation from the page tree should help consumers.	2019-07-05 19:18:14 +01:00
Eliot Jones	41eddca0bf	handle incorrect xref offsets #34 previously if the cross reference did not exist at exactly the provided offset we'd immediately throw, now we assume we can read a few more tokens to find the xref table or stream start. this won't work in the case where the provided offset is past the start of the table or nowhere near the table but in those cases there's not much we can do. there's some more work to do to provide a fallback xref parser which finds the xref tables and streams using a brute-force scan of the whole document.	2019-06-23 12:05:21 +01:00
Eliot Jones	7b96483664	include raw dictionary token in the document information class #38	2019-06-19 21:23:06 +01:00
Eliot Jones	caf1a0c233	use invariant culture for parsing all numbers #37	2019-06-18 19:12:51 +01:00
Eliot Jones	ffa7b3bcc7	generate synthetic encoding where not present and use direct object finder to lookup cropbox and mediabox	2019-05-18 15:20:07 +01:00
Eliot Jones	9afceed1c5	correctly delimit content streams when concatenating arrays	2019-05-11 10:49:04 +01:00
Eliot Jones	23c033c788	implement validation of owner password and throw more descriptive exception for encrypted documents	2019-05-09 19:02:39 +01:00
Eliot Jones	bad57763a1	finish initial support for rc4 encryption with blank user password	2019-05-06 15:41:29 +01:00
Eliot Jones	be394f5bba	start adding support for reading encrypted documents	2019-05-04 15:36:13 +01:00
Eliot Jones	5c091aeba7	#29 skip mediaboxes or cropboxes with the wrong dimensions and log an error.	2019-01-21 18:34:24 +00:00
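
Commit 693a3d5958 above describes correcting cross-reference offsets when the %PDF version header does not sit at byte 0 of the file. The idea can be sketched roughly as follows. This is an illustrative Python sketch, not PdfPig's C# implementation: the function names and the `search_limit` parameter are hypothetical, and it assumes every declared offset is wrong by exactly the number of junk bytes before the header.

```python
def find_header_offset(data: bytes, search_limit: int = 4096) -> int:
    """Return the byte index of the '%PDF-' version header.

    search_limit is a hypothetical cap on how much leading junk is
    tolerated; it is not a value taken from PdfPig.
    """
    index = data.find(b"%PDF-", 0, search_limit)
    if index < 0:
        raise ValueError("no %PDF- version header found in the searched range")
    return index


def corrected_offset(data: bytes, declared_offset: int) -> int:
    """Shift an offset declared relative to the header so that it is
    relative to the true start of the file."""
    return declared_offset + find_header_offset(data)


# Build a tiny PDF-like byte string whose xref offset is valid on its own.
body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
declared_xref = len(body)  # offset of 'xref' measured from the header
body += b"xref\n0 1\n0000000000 65535 f \n"

# Prepend junk, as in the files the commit describes.
junk = b"JUNKJUNKJUNK"
data = junk + body

fixed = corrected_offset(data, declared_xref)
```

With the junk prefix in place, reading at `declared_xref` lands in the wrong part of the file, while `data[fixed:fixed + 4]` gives the `xref` keyword again.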


81 Commits