Commit Graph

33 Commits

Author SHA1 Message Date
Eliot Jones
693a3d5958 use offset to file header to correct cross references
if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.
2020-01-26 15:30:20 +00:00
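The correction described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code: `find_header_offset` and `correct_xref_offset` are hypothetical names, and it assumes cross-reference offsets are meant to be measured from the `%PDF-` version comment.

```python
def find_header_offset(data: bytes) -> int:
    """Return the position of the '%PDF-' version comment, or 0 if it is not found."""
    # bytes.find returns -1 when the marker is absent; treat that as "no shift".
    return max(data.find(b"%PDF-"), 0)


def correct_xref_offset(declared_offset: int, header_offset: int) -> int:
    """Shift an offset declared relative to the header into an absolute file position."""
    return declared_offset + header_offset


if __name__ == "__main__":
    # Five bytes of junk precede the header, so every declared offset is 5 too small.
    data = b"junk\n%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
    shift = find_header_offset(data)
    absolute = correct_xref_offset(9, shift)
    print(shift, absolute, data[absolute:absolute + 7])
```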
Eliot Jones
e588b2bc50 support documents without endobj for stream
some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.
2020-01-07 15:27:01 +00:00
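A minimal sketch of the recovery described in this commit, written in Python for illustration only. The function name `position_after_stream` and the regular expressions are assumptions, not the scanner's real implementation: after `endstream`, either consume `endobj`, or, if the next tokens look like the start of a new object, leave the position where it is so the object number token is read again.

```python
import re

# Hypothetical patterns: "endobj", or the "<number> <generation> obj" start of a new object.
_ENDOBJ = re.compile(rb"\s*endobj\b")
_NEXT_OBJ = re.compile(rb"\s*\d+\s+\d+\s+obj\b")


def position_after_stream(data: bytes, after_endstream: int) -> int:
    """Return where scanning should resume once a stream has been read."""
    match = _ENDOBJ.match(data, after_endstream)
    if match:
        # Well-formed case: skip past the endobj marker.
        return match.end()
    if _NEXT_OBJ.match(data, after_endstream):
        # endobj is missing: stay put so the next object number is not swallowed.
        return after_endstream
    raise ValueError("expected 'endobj' or the start of a new object")
```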
Eliot Jones
10dc5a8eed don't cache invalid offsets unless brute forced
don't cache parsed objects if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because one object may lie in two streams, one valid and one invalid. if the invalid stream is parsed first for another object then the valid stream will never be read.
2020-01-07 14:54:12 +00:00
Eliot Jones
5114b2da2c avoid overwriting cache for valid objects
some objects may be defined in more than one stream. parsing both streams would overwrite the object in the cache. to prevent this we avoid overwriting the existing object in the cache if it has the expected offset from the cross reference table.
2020-01-07 11:48:09 +00:00
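The two caching rules from this commit and the one listed above it (10dc5a8eed) might be combined along these lines. This is a hedged Python sketch; `ParsedObject`, `ObjectCache` and `try_add` are hypothetical names, and the real cache may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ParsedObject:
    number: int    # object number
    offset: int    # byte offset at which the object was actually found
    value: object  # parsed content (placeholder)


class ObjectCache:
    def __init__(self, xref: Dict[int, int]):
        self.xref = xref  # object number -> offset expected by the cross-reference table
        self.cache: Dict[int, ParsedObject] = {}

    def try_add(self, obj: ParsedObject, brute_forced: bool = False) -> bool:
        expected = self.xref.get(obj.number)
        # Rule 1: skip objects at an unexpected offset unless a brute-force search found them.
        if obj.offset != expected and not brute_forced:
            return False
        # Rule 2: never overwrite a cached object that sits at the expected offset.
        existing = self.cache.get(obj.number)
        if existing is not None and existing.offset == expected:
            return False
        self.cache[obj.number] = obj
        return True
```

With both rules in place, the order in which the valid and invalid streams are parsed should no longer affect which copy of an object ends up in the cache.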
Eliot Jones
bbde38f656 move tokenizers to their own project
since both pdfs and Adobe Type1 fonts use postscript type objects, tokenization is needed by the main project and the fonts project
2020-01-05 10:40:44 +00:00
Eliot Jones
d09b33af4d move tokens to new project 2020-01-05 10:07:01 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, and pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion, as well as releasing a standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00
Eliot Jones
23c7e44fc8 handle stream length being an object stream value 2019-12-24 15:22:47 +00:00
Eliot Jones
9c9a08c6a7 make numeric tokenizer threadsafe by removing cache 2019-12-24 12:24:40 +00:00
Eliot Jones
ba9fe40bc1 cache some more common values and improve performance of tokenizers 2019-12-24 12:22:17 +00:00
Eliot Jones
3084a9aab6 support streams containing only carriage returns. handle comments in arrays and dictionaries
* while the pdf specification says stream data should begin after a newline following the stream operator, some files have only a carriage return following the stream operator.
* comment tokens may appear inside an array or dictionary; we ignore them there because they would otherwise break interpretation of the dictionary or array contents.
2019-12-20 14:04:58 +00:00
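The first bullet of this commit, tolerating a lone carriage return after the stream operator, could look something like the Python sketch below. `stream_data_start` is a hypothetical helper, and the sketch assumes the caller already knows the index just past the `stream` keyword.

```python
def stream_data_start(data: bytes, after_keyword: int) -> int:
    """Return the index of the first stream data byte after the 'stream' keyword."""
    # Standard forms: carriage return + line feed, or a single line feed.
    if data[after_keyword:after_keyword + 2] == b"\r\n":
        return after_keyword + 2
    # Lenient form: also accept a lone carriage return, as some files use it.
    if data[after_keyword:after_keyword + 1] in (b"\n", b"\r"):
        return after_keyword + 1
    # No end-of-line marker at all: assume the data starts immediately.
    return after_keyword
```

Accepting a lone carriage return is a trade-off: a stream whose data genuinely begins with a carriage return byte would lose it, so a length check against the stream dictionary is a sensible companion.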
Eliot Jones
e37e4c37b3 require end image token to be followed by at least 1 whitespace 2019-12-19 17:34:40 +00:00
Eliot Jones
82c2ee7026 handle ei end image token appearing in inline image data 2019-12-19 16:29:44 +00:00
Eliot Jones
dab64ec406 handle newlines before inline images and support larger data streams in brute force search 2019-12-18 12:02:07 +00:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
Eliot Jones
bbe5409f94 #62 use length value of stream directly to read the full stream once 2019-08-20 21:08:06 +01:00
Eliot Jones
364bd25fa8 #48 add handling of inline image data to pdf content parsing
an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.
2019-08-03 15:42:19 +01:00
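The special-case handling described here can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the parser's real logic: `read_inline_image_data` is a hypothetical function, and the rule that `EI` must be surrounded by whitespace is one possible heuristic for telling the end marker apart from the same two bytes occurring inside the image data.

```python
# PDF whitespace bytes: null, tab, line feed, form feed, carriage return, space.
WHITESPACE = b"\x00\t\n\x0c\r "


def read_inline_image_data(content: bytes, id_end: int):
    """Return (image_bytes, index_after_EI); id_end is the index just past the ID operator."""
    # A single whitespace byte separates ID from the raw image data.
    has_separator = id_end < len(content) and content[id_end] in WHITESPACE
    start = id_end + 1 if has_separator else id_end

    i = start
    while True:
        i = content.find(b"EI", i)
        if i == -1:
            raise ValueError("inline image has no EI end marker")
        preceded = i > start and content[i - 1] in WHITESPACE
        followed = i + 2 == len(content) or content[i + 2] in WHITESPACE
        if preceded and followed:
            # Exclude the whitespace byte in front of EI from the image data.
            return content[start:i - 1], i + 2
        i += 1  # these bytes were part of the image data; keep searching
```

Treating the bytes between `ID` and `EI` as opaque data avoids running them through the normal tokenizer, which is what caused the errors this commit mentions.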
Eliot Jones
caf1a0c233 use invariant culture for parsing all numbers #37 2019-06-18 19:12:51 +01:00
Eliot Jones
98424b32aa special case handling for faulty offsets in xref with missing whitespace between eof and object number 2019-06-14 20:40:24 +01:00
Eliot Jones
2b486dccab prevent infinite loops where a stream token's length entry references itself. perform brute force scans in case of a faulty xref table #33 2019-06-08 16:45:02 +01:00
Eliot Jones
03af28ed6d fix bug with compact font format font matrix reading and where endstream token is missed if immediately following 'e' 2019-05-10 20:02:29 +01:00
Eliot Jones
bad57763a1 finish initial support for rc4 encryption with blank user password 2019-05-06 15:41:29 +01:00
Eliot Jones
be394f5bba start adding support for reading encrypted documents 2019-05-04 15:36:13 +01:00
Eliot Jones
2fa781b8e9 #10 make all token classes public and expose via a public structure member on pdf document 2018-11-24 19:02:06 +00:00
Eliot Jones
3172596b7c remove all old cos objects 2018-01-21 14:56:50 +00:00
Eliot Jones
7d90f4858a continue migrating code to tokenizer 2018-01-20 18:42:29 +00:00
Eliot Jones
3d2a66cbf9 fix bug with endstream appearing without line break 2018-01-20 11:53:24 +00:00
Eliot Jones
c5e3ce7ec7 finish moving all parsing to token scanner 2018-01-20 00:49:53 +00:00
Eliot Jones
615ee88a46 start passing the pdf scanner in to read the type 1 files 2018-01-14 15:33:22 +00:00
Eliot Jones
36c0eedd7c move the usages of cos object key to indirect reference 2018-01-14 14:48:54 +00:00
Eliot Jones
b19b96604d make the pdf object scanner work with streams 2018-01-14 10:53:01 +00:00
Eliot Jones
8dcea9b37f create a pdf object scanner which sits on top of the core token scanner to provide complete object parsing 2018-01-13 22:30:15 +00:00
Eliot Jones
ec62542b64 change the project name to something silly 2018-01-10 19:49:32 +00:00