Commit Graph

80 Commits

Author SHA1 Message Date
Eliot Jones
19462d79f0 add support for jpeg images in pdf document builder
since jpegs can be trivially embedded in pdf documents without changes to the data stream this is the first image format we will support. currently this is a naive approach which doesn't share image resources between pages. ideally we will either de-duplicate images when they are added, return a re-usable key once an image is added, or both.
2020-03-16 19:32:57 +00:00
Eliot Jones
48d166276d remove islenientparsing from contentstreamprocessor 2020-02-28 11:44:13 +00:00
Eliot Jones
7b09999a3f remove islenientparsing from the font handlers
we're removing islenientparsing to make the code simpler to maintain and use as well as more resilient.
2020-02-28 11:37:18 +00:00
BobLd
0afaa19d15 Handle null CurrentPath 2020-02-24 11:20:56 +00:00
BobLd
1d095af974 Implement Modify Clipping operations 2020-02-24 11:20:56 +00:00
BobLd
ac1e2c49ba Fix bounding box for artifact
Add tests
2020-02-10 11:23:19 +00:00
BobLd
588648d30b Fix #133 Marked content extraction issue 2020-02-10 11:23:19 +00:00
Eliot Jones
29061b1fd2 handle unexpected adobe type 1 format
an encoding array in an adobe type 1 font may be missing its declaration ending in 'for'. if we encounter 'dup' while looking for the 'for' token we have a special case to go straight into reading the encoding.

also handles a case where the page content stream contains a path-closing operator without any path being active.
2020-01-28 16:05:53 +00:00
Eliot Jones
ba09a13d08 more end image recovery logic
since inline image data may contain the end image "ei" token inside the data stream there's no reliable way to actually determine if we've read all the data. for this reason if we end up with an invalid state parsing operations after we've read the end image token we try to recover by reading from the previous token to the next end image token if any. we supply log information to let the consumer know this is what we're doing. it's still not bullet-proof but it should be good enough.

also support negative page rotation values by adding them to a 360 degree rotation so -90 degrees clockwise is 270 degrees clockwise.
2020-01-25 15:53:08 +00:00
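
The rotation handling described in this commit can be illustrated with a minimal sketch. It is written in Python rather than the project's own language, and `normalize_rotation` is a hypothetical name, not a function from the repository. It assumes rotations are whole degrees measured clockwise.

```python
def normalize_rotation(degrees: int) -> int:
    """Map a clockwise rotation, possibly negative, onto the range 0-359.

    Adding a negative value to a full 360 degree turn gives the equivalent
    positive rotation, so -90 becomes 270. Python's modulo operator does
    this in one step and also handles values beyond a full turn.
    """
    return degrees % 360
```

With this sketch, `normalize_rotation(-90)` evaluates to `270`, matching the example in the commit message.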
Eliot Jones
b4d917dcdc merge pull request #122 from uglytoad/marked-content
marked content
2020-01-10 17:07:21 +00:00
Eliot Jones
41cc7abd1b prevent negative point size for fonts 2020-01-10 14:40:28 +00:00
Eliot Jones
17b7cf2f61 load images eagerly for marked content
when a marked content region contains an image we load it eagerly since we won't have access to the necessary classes at evaluation time. we also default image colorspace to the active graphics state colorspace if the dictionary doesn't contain a valid entry.
2020-01-10 13:52:21 +00:00
Eliot Jones
2a579afd4d add missing doc comments for operation context marked content 2020-01-09 15:35:55 +00:00
Eliot Jones
d011f37316 merge master 2020-01-09 15:32:10 +00:00
Eliot Jones
43574097f1 rename marked content elements and use factory
since the properties in marked content may be indirect references or belong to the page resources array, the value should be calculated during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.
2020-01-09 15:30:16 +00:00
Eliot Jones
4976fa1027 handle incorrect end image detected
since an inline image's data stream may contain the characters 'ei' as a result of compression it's possible to read an end image operator mid-data. this results in the next operator also being end image and the content stream being in an invalid state. to recover, when we detect this situation we remove the previous operator, read to the current operator and replace the operator and data bytes in the list of operations.
2020-01-08 12:17:30 +00:00
BobLd
84bab1b627 Add basic marked content extraction capabilities 2020-01-08 10:34:01 +00:00
Eliot Jones
63b118b141 handle type1 fonts disguised as truetype
if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser.

also handles a closesubpath command appearing without any path construction operators.
2020-01-07 16:49:21 +00:00
vadik299
f00eb5efa2 Update AppendRectangle.cs
(fix) Rectangle width and height should be also transformed by CurrentTransformationMatrix
2020-01-07 00:23:10 -05:00
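
A rough illustration of why the width and height need transforming, not the code from AppendRectangle.cs: if only the origin of a rectangle is passed through the current transformation matrix, a scaled page produces a rectangle of the wrong size. The sketch below is in Python, uses the standard PDF matrix layout `[a b c d e f]`, and transforms all four corners. `Matrix` and `rectangle_corners` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Matrix(NamedTuple):
    """A PDF transformation matrix [a b c d e f]."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    def transform(self, x: float, y: float) -> Tuple[float, float]:
        # Standard PDF point transform: x' = a*x + c*y + e, y' = b*x + d*y + f.
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)


def rectangle_corners(ctm: Matrix, x: float, y: float,
                      width: float, height: float) -> List[Tuple[float, float]]:
    """Return the four corners of an 're' rectangle after applying the matrix."""
    return [
        ctm.transform(x, y),
        ctm.transform(x + width, y),
        ctm.transform(x + width, y + height),
        ctm.transform(x, y + height),
    ]
```

With a matrix that scales by 2 and translates by (10, 20), a 3 by 4 rectangle at (1, 1) comes out 6 by 8, which is the behaviour the fix asks for.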
Eliot Jones
74774995d6 complete move of truetype, afm and standard14 fonts
the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.
2020-01-04 22:39:13 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00
Eliot Jones
f319e7f4b5 adds per character byte mapping to truetype #98
this starts to add logic for per-character mapping of unicode characters to byte values for truetype fonts in the pdf document builder. in order to support unicode characters outside the 0-255 range when creating new pdf documents without using composite fonts, we need to map values outside this range into this range. to do this we start at 1 and map each character we encounter to the next code, up to a maximum of 255. we provide a custom tounicode cmap in the font dictionary which maps these byte values, 0-255, back to unicode code points (short).

we also provide a custom firstchar, lastchar and widths array for the font mapping just the values we use.

since fonts no longer contain just the latin character set the font descriptor enum is set to have the symbolic flag set. this means values will be looked up in either the mac-roman (1, 0) or windows-symbol (3, 0) cmap tables (these cmap tables are distinct from cmap tables in the pdf file) inside the actual truetype font bytes. this means the currently generated font file is invalid, because while the widths array and tounicode cmap return the correct values the actual font itself returns whatever values were in those positions before the remapping occurred.

in order to fix this we will need to override the windows-symbol cmap contained in the underlying truetype font to match our mapping. this will be a lot of work and involve significant rewriting of the font file itself, in order to preserve checksum integrity.
2020-01-04 10:27:07 +00:00
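
The code assignment described in this commit (start at 1, give each new character the next code, stop at 255) can be sketched as below. This is an illustrative Python sketch, not the document builder's implementation. `CharacterCodeMap` and its members are hypothetical names, and raising an error when the codes run out is an assumption.

```python
from typing import Dict


class CharacterCodeMap:
    """Assigns single-byte codes to characters in the order they are first seen."""

    MAX_CODE = 255

    def __init__(self) -> None:
        self._code_by_char: Dict[str, int] = {}

    def code_for(self, ch: str) -> int:
        """Return the byte code for a character, allocating the next free one."""
        code = self._code_by_char.get(ch)
        if code is None:
            # Codes start at 1, so the next code is the current count plus one.
            code = len(self._code_by_char) + 1
            if code > self.MAX_CODE:
                raise ValueError("no single-byte codes left for this font")
            self._code_by_char[ch] = code
        return code

    def to_unicode(self) -> Dict[int, int]:
        """Reverse mapping from byte code to code point, as a tounicode cmap needs."""
        return {code: ord(ch) for ch, code in self._code_by_char.items()}
```

Because codes are allocated on first use, a character repeated in the text keeps the code it was first given, and only the codes actually used need entries in the widths array.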
Eliot Jones
ec060ae81b add hardcoded switch branches for more content operations
also adds a gitignore entry for the 'benchmark' subfolder in tools where custom benchmarking applications can be built and run without being added to source control.
2019-12-24 23:12:04 +00:00
Eliot Jones
935d182888 use doubles where calculations are being run 2019-12-24 12:22:17 +00:00
Eliot Jones
3c0cd17a8b use correct defaults for separation colorspace #89 2019-12-10 14:10:50 +00:00
Eliot Jones
c89928d976 remove inefficient approach to checking if content stream path has been added #99 2019-12-10 13:20:57 +00:00
Eliot Jones
e38da0a403 add support for alternative colorspace in separation colorspaces #89 2019-12-06 17:23:15 +00:00
Eliot Jones
677d2b5e8f #82 make resource store state local to the page and operation being processed
resources such as fonts are linked to page content operations using name labels, e.g. "/F1", these resource labels can be reassigned on different pages or inside form xobjects. we now clear the entire resource state for each page which is parsed and after form xobject operations which use resource dictionaries.
2019-11-25 14:34:02 +00:00
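
One possible way to picture resource state that is local to a page or form xobject is a stack of name-to-resource scopes, sketched here in Python. `ResourceStore`, `push_scope`, `pop_scope` and `get_font` are hypothetical names. Falling back to the enclosing scope when a name is not found is an assumption of this sketch and is not taken from the commit.

```python
from typing import Dict, List


class ResourceStore:
    """Keeps font labels such as "F1" scoped to the page or form being processed."""

    def __init__(self) -> None:
        self._scopes: List[Dict[str, str]] = []

    def push_scope(self, fonts: Dict[str, str]) -> None:
        # Called when starting a page, or a form xobject with its own resources.
        self._scopes.append(dict(fonts))

    def pop_scope(self) -> None:
        # Called when the page or form xobject has been fully processed.
        self._scopes.pop()

    def get_font(self, name: str) -> str:
        # Innermost scope wins, so a form can reassign a label used by its page.
        for scope in reversed(self._scopes):
            if name in scope:
                return scope[name]
        raise KeyError(name)
```

Popping the scope after each page means a label such as "F1" on the next page can never resolve to the previous page's font.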
Eliot Jones
80fc404b10 #47 improve performance by caching truetype bounding boxes
also uses less reflection when parsing the page content stream
2019-10-18 15:56:28 +01:00
Eliot Jones
efe7896824 #75 support vertical writing mode fonts 2019-10-17 15:57:04 +01:00
Eliot Jones
3f1321141a #73 process xobject form content when extracting text and images 2019-10-16 14:59:16 +01:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
Eliot Jones
38b6f8e812 add current geometry path to page content when it is not explicitly closed #66 2019-09-11 15:38:57 +01:00
Eliot Jones
6878d9a82d #64 use decimal values directly rather than from array for transformation matrix 2019-08-20 22:51:00 +01:00
vadik299
cc767b8cd6 Merge branch 'master' into master 2019-08-16 18:34:57 -04:00
Eliot Jones
f55091f3d2 make color types public and add stream based tests to prevent future breaking as observed in #52 2019-08-13 20:48:22 +01:00
Vasya
22278f64c4 Added TextSequence 2019-08-11 14:55:59 -04:00
Eliot Jones
fc2d532b82 use single instances of black and white for rgb/gray colors 2019-08-10 14:58:02 +01:00
Eliot Jones
c5d03bca97 move application of transformation matrix outside path 2019-08-08 21:19:18 +01:00
Eliot Jones
4dde4ca0c1 add colors to letters based on current font and graphics state 2019-08-05 19:26:10 +01:00
Eliot Jones
0df35b8488 fix naming of color space to be 2 words 2019-08-05 18:32:44 +01:00
Eliot Jones
0b9ae1db13 add color information to the operation context. create color classes for letters and paths to use 2019-08-04 16:47:47 +01:00
Eliot Jones
1d551d6de3 add and document core classes for colorspace information 2019-08-04 12:57:06 +01:00
Eliot Jones
364bd25fa8 #48 add handling of inline image data to pdf content parsing
an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.
2019-08-03 15:42:19 +01:00
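
The bi / id / ei structure described in this commit can be sketched as a naive splitter. This is an illustrative Python sketch rather than the parser's actual logic, and `split_inline_image` is a hypothetical name. It assumes the image dictionary itself does not contain the bytes `ID`, that a single whitespace byte follows `ID`, and that a valid `EI` has whitespace before it and whitespace or the end of the stream after it.

```python
from typing import Tuple

# The six PDF whitespace bytes: NUL, tab, line feed, form feed, carriage return, space.
WHITESPACE = b"\x00\t\n\x0c\r "


def split_inline_image(content: bytes, start: int = 0) -> Tuple[bytes, bytes, int]:
    """Return (dictionary bytes, image data, index just past EI)."""
    bi = content.index(b"BI", start)
    id_pos = content.index(b"ID", bi + 2)
    data_start = id_pos + 3  # skip 'ID' and the single whitespace byte after it
    search = data_start
    while True:
        ei = content.find(b"EI", search)
        if ei < 0:
            raise ValueError("no end image token found")
        preceded = content[ei - 1] in WHITESPACE
        followed = ei + 2 == len(content) or content[ei + 2] in WHITESPACE
        if preceded and followed:
            dictionary = content[bi + 2:id_pos].strip()
            # Drop the whitespace byte that separates the data from EI.
            return dictionary, content[data_start:ei - 1], ei + 2
        search = ei + 1  # 'EI' was part of the image data, keep looking
```

Binary image data can still contain `EI` surrounded by whitespace bytes, so a check like this reduces false matches without eliminating them.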
vadimy
7d3a0929b6 Refactoring and fixing according to Eliot's comments 2019-07-24 00:00:00 -04:00
vadimy
b9d0cca2a6 Added "Paths" collection to Page object.
Added matrix transformation to path operators.
2019-07-16 00:35:29 -04:00
Eliot Jones
453faf50af start adding colorspace path operations to the operation context 2019-07-10 21:31:23 +01:00
Eliot Jones
557d8bc948 map missing character codes directly #44
previously if no matching unicode was found for a character code we would return a null letter. instead we now map from the character code directly to a character. this seems to work for most documents, except where there are ligatures, e.g. fi or ff, but is still better than not returning anything.
2019-07-07 13:53:25 +01:00
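
The fallback described in this commit can be shown with a short Python sketch. `code_to_text` is a hypothetical name, and treating the character code as a unicode code point is an assumption about what mapping "directly" means.

```python
from typing import Mapping


def code_to_text(code: int, to_unicode: Mapping[int, str]) -> str:
    """Return the mapped text for a character code, or fall back to the code itself."""
    mapped = to_unicode.get(code)
    if mapped is not None:
        return mapped
    # No unicode mapping found: treat the character code as a code point.
    return chr(code)
```

A ligature code with no mapping would come out as a single unrelated character rather than "fi" or "ff", which is the limitation the commit message notes.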
Eliot Jones
198cca1336 change point size calculation to use rotation #41
point size was previously only calculated based on the transformation matrix but now uses the transformation matrix, the rotation matrix and the font matrix values. the calculated value still seems unlikely to be correct so it is exposed using the page's experimental access for now, rather than as a public getter.
2019-07-07 12:12:09 +01:00
Eliot Jones
c495065178 support gs operator, fix systemfonts, apply rotation to glyphs
- begin adding support for extended graphics state (the 'gs' operator) including setting the font #39.
- apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41.
- wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42.
- fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.
2019-07-06 14:03:23 +01:00