PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-10-14 19:05:01 +08:00

Author	SHA1	Message	Date
Eliot Jones	19462d79f0	add support for jpeg images in pdf document builder. since jpegs can be trivially embedded in pdf documents without changes to the data stream this is the first image format we will support. currently this is a naive approach which doesn't share image resources between pages. ideally we will either de-duplicate images when added, return a re-usable key once an image is added, or both.	2020-03-16 19:32:57 +00:00
Eliot Jones	48d166276d	remove islenientparsing from contentstreamprocessor	2020-02-28 11:44:13 +00:00
Eliot Jones	7b09999a3f	remove islenientparsing from the font handlers we're removing islenientparsing to make the code simpler to maintain and use as well as more resilient.	2020-02-28 11:37:18 +00:00
BobLd	0afaa19d15	Handle null CurrentPath	2020-02-24 11:20:56 +00:00
BobLd	1d095af974	Implement Modify Clipping operations	2020-02-24 11:20:56 +00:00
BobLd	ac1e2c49ba	Fix bounding box for artifact. Add tests	2020-02-10 11:23:19 +00:00
BobLd	588648d30b	Fix #133 Marked content extraction issue	2020-02-10 11:23:19 +00:00
Eliot Jones	29061b1fd2	handle unexpected adobe type 1 format an encoding array in an adobe type 1 font may be missing its declaration ending in 'for', if we encounter 'dup' while looking for the 'for' token we have a special case to go straight into reading the encoding. also handles a case where the page content stream contains a path-closing operator without any path being active.	2020-01-28 16:05:53 +00:00
Eliot Jones	ba09a13d08	more end image recovery logic since inline image data may contain the end image "ei" token inside the data stream there's no reliable way to actually determine if we've read all the data. for this reason if we end up with an invalid state parsing operations after we've read the end image token we try to recover by reading from the previous token to the next end image token if any. we supply log information to let the consumer know this is what we're doing. it's still not bullet-proof but it should be good enough. also support negative page rotation values by adding them to a 360 degree rotation so -90 degrees clockwise is 270 degrees clockwise.	2020-01-25 15:53:08 +00:00
Eliot Jones	b4d917dcdc	merge pull request #122 from uglytoad/marked-content marked content	2020-01-10 17:07:21 +00:00
Eliot Jones	41cc7abd1b	prevent negative point size for fonts	2020-01-10 14:40:28 +00:00
Eliot Jones	17b7cf2f61	load images eagerly for marked content when a marked content region contains an image we load it eagerly since we won't have access to the necessary classes at evaluation time. we also default image colorspace to the active graphics state colorspace if the dictionary doesn't contain a valid entry.	2020-01-10 13:52:21 +00:00
Eliot Jones	2a579afd4d	add missing doc comments for operation context marked content	2020-01-09 15:35:55 +00:00
Eliot Jones	d011f37316	merge master	2020-01-09 15:32:10 +00:00
Eliot Jones	43574097f1	rename marked content elements and use factory since the properties in marked content may be indirect references or belong to the page resources array, the value should be calculated during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.	2020-01-09 15:30:16 +00:00
Eliot Jones	4976fa1027	handle incorrect end image detected since an inline image's data stream may contain the characters 'ei' as a result of compression it's possible to read an end image operator mid-data, this results in the next operator also being end image and the content stream being in an invalid state. to recover from this when we detect this situation we remove the previous operator, read to the current operator and replace the operator and data bytes in the list of operations.	2020-01-08 12:17:30 +00:00
BobLd	84bab1b627	Add basic marked content extraction capabilities	2020-01-08 10:34:01 +00:00
Eliot Jones	63b118b141	handle type1 fonts disguised as truetype if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser. also handles a closesubpath command appearing without any path construction operators.	2020-01-07 16:49:21 +00:00
vadik299	f00eb5efa2	Update AppendRectangle.cs (fix) Rectangle width and height should be also transformed by CurrentTransformationMatrix	2020-01-07 00:23:10 -05:00
Eliot Jones	74774995d6	complete move of truetype, afm and standard14 fonts the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.	2020-01-04 22:39:13 +00:00
Eliot Jones	7c0ef111ea	move classes to new projects to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts are moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing standalone version of the font library as pdfbox does with fontbox.	2020-01-04 16:38:18 +00:00
Eliot Jones	f319e7f4b5	adds per character byte mapping to truetype #98. this starts to add logic for per-character mapping of unicode characters to byte values for truetype fonts in the pdf document builder. in order to support unicode characters outside the 0-255 range when creating new pdf documents without using composite fonts, we need to map values outside this range into this range. to do this we start at 1 and map each character we encounter to the next code, up to a maximum of 255. we provide a custom tounicode cmap in the font dictionary which maps these byte values, 0-255, back to unicode code points (short). we also provide a custom firstchar, lastchar and widths array for the font mapping just the values we use. since fonts no longer contain just the latin character set the font descriptor enum is set to have the symbolic flag set. this means values will be looked up in either the mac-roman (1, 0) or windows-symbol (3, 0) cmap tables (these cmap tables are distinct from cmap tables in the pdf file) inside the actual truetype font bytes. this means the currently generated font file is invalid, because while the widths array and tounicode cmap return the correct values the actual font itself returns whatever values were in those positions before the remapping occurred. in order to fix this we will need to override the windows-symbol cmap contained in the underlying truetype font to match our mapping. this will be a lot of work and involve significant rewriting of the font file itself, in order to preserve checksum integrity.	2020-01-04 10:27:07 +00:00
Eliot Jones	ec060ae81b	add hardcoded switch branches for more content operations also adds a gitignore entry for the 'benchmark' subfolder in tools where custom benchmarking applications can be built and run without being added to source control.	2019-12-24 23:12:04 +00:00
Eliot Jones	935d182888	use doubles where calculations are being run	2019-12-24 12:22:17 +00:00
Eliot Jones	3c0cd17a8b	use correct defaults for separation colorspace #89	2019-12-10 14:10:50 +00:00
Eliot Jones	c89928d976	remove inefficient approach to checking if content stream path has been added #99	2019-12-10 13:20:57 +00:00
Eliot Jones	e38da0a403	add support for alternative colorspace in separation colorspaces #89	2019-12-06 17:23:15 +00:00
Eliot Jones	677d2b5e8f	#82 make resource store state local to the page and operation being processed resources such as fonts are linked to page content operations using name labels, e.g. "/F1", these resource labels can be reassigned on different pages or inside form xobjects. we now clear the entire resource state for each page which is parsed and after form xobject operations which use resource dictionaries.	2019-11-25 14:34:02 +00:00
Eliot Jones	80fc404b10	#47 improve performance by caching truetype bounding boxes also uses less reflection when parsing the page content stream	2019-10-18 15:56:28 +01:00
Eliot Jones	efe7896824	#75 support vertical writing mode fonts	2019-10-17 15:57:04 +01:00
Eliot Jones	3f1321141a	#73 process xobject form content when extracting text and images	2019-10-16 14:59:16 +01:00
Eliot Jones	68bcaf3901	#55 move support for images to page and add inline images support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page. treat all warnings as errors.	2019-10-08 14:04:36 +01:00
Eliot Jones	38b6f8e812	add current geometry path to page content when it is not explicitly closed #66	2019-09-11 15:38:57 +01:00
Eliot Jones	6878d9a82d	#64 use decimal values directly rather than from array for transformation matrix	2019-08-20 22:51:00 +01:00
vadik299	cc767b8cd6	Merge branch 'master' into master	2019-08-16 18:34:57 -04:00
Eliot Jones	f55091f3d2	make color types public and add stream based tests to prevent future breaking as observed in #52	2019-08-13 20:48:22 +01:00
Vasya	22278f64c4	Added TextSequence	2019-08-11 14:55:59 -04:00
Eliot Jones	fc2d532b82	use single instances of black and white for rgb/gray colors	2019-08-10 14:58:02 +01:00
Eliot Jones	c5d03bca97	move application of transformation matrix outside path	2019-08-08 21:19:18 +01:00
Eliot Jones	4dde4ca0c1	add colors to letters based on current font and graphics state	2019-08-05 19:26:10 +01:00
Eliot Jones	0df35b8488	fix naming of color space to be 2 words	2019-08-05 18:32:44 +01:00
Eliot Jones	0b9ae1db13	add color information to the operation context. create color classes for letters and paths to use	2019-08-04 16:47:47 +01:00
Eliot Jones	1d551d6de3	add and document core classes for colorspace information	2019-08-04 12:57:06 +01:00
Eliot Jones	364bd25fa8	#48 add handling of inline image data to pdf content parsing an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.	2019-08-03 15:42:19 +01:00
vadimy	7d3a0929b6	Refactoring and fixing according to Eliot's comments	2019-07-24 00:00:00 -04:00
vadimy	b9d0cca2a6	Added "Paths" collection to Page object. Added matrix transformation to path operators.	2019-07-16 00:35:29 -04:00
Eliot Jones	453faf50af	start adding colorspace path operations to the operation context	2019-07-10 21:31:23 +01:00
Eliot Jones	557d8bc948	map missing character codes directly #44 previously if no matching unicode was found for a character code we would return a null letter. instead we now map from the character code directly to a character. this seems to work for most documents, except where there are ligatures, e.g. fi or ff, but is still better than not returning anything.	2019-07-07 13:53:25 +01:00
Eliot Jones	198cca1336	change point size calculation to use rotation #41 point size was previously only calculated based on the transformation matrix but now uses the transformation matrix, the rotation matrix and the font matrix values. the calculated value still seems unlikely to be correct so it is exposed using the page's experimental access for now, rather than as a public getter.	2019-07-07 12:12:09 +01:00
Eliot Jones	c495065178	support gs operator, fix systemfonts, apply rotation to glyphs - begin adding support for extended graphics state (the 'gs' operator) including setting the font #39. - apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41. - wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42. - fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.	2019-07-06 14:03:23 +01:00
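
Commit ba09a13d08 above describes folding negative page rotations into the clockwise range, so that -90 degrees becomes 270 degrees. The sketch below shows only that arithmetic. It is written in Python for illustration (PdfPig itself is a C# library), the function name `normalize_rotation` is hypothetical, and the right-angle check is an assumption taken from the description in commit c495065178 rather than from the library's source.

```python
def normalize_rotation(degrees: int) -> int:
    """Return the equivalent clockwise rotation in the range 0..359.

    Hypothetical helper: negative values are treated as 360 + value,
    so -90 becomes 270, as the commit message describes.
    """
    # Assumption: only right-angle rotations are accepted.
    if degrees % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {degrees}")
    # Python's % with a positive modulus never returns a negative result,
    # so this also handles values below -360 or above 360.
    return degrees % 360


if __name__ == "__main__":
    for value in (-90, 0, 90, -270, 450):
        print(value, "->", normalize_rotation(value))
```

Using the modulo operator rather than a single `360 + value` addition keeps the result in range even for inputs such as -450, which a one-off addition would leave negative.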

80 Commits
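
Commit f319e7f4b5 in the list above describes mapping each new Unicode character to the next free single-byte code, starting at 1 and stopping at 255, and emitting a reverse table for the ToUnicode CMap. The following is a minimal sketch of that bookkeeping only. It is written in Python for illustration (PdfPig is C#), and the class and method names (`CharacterByteMapper`, `code_for`, `to_unicode`) are hypothetical, not the library's API. It does not rewrite the font's own cmap table, which the commit message notes is still required for a valid font file.

```python
class CharacterByteMapper:
    """Assigns single-byte codes (1..255) to characters in order of first use."""

    MAX_CODE = 255

    def __init__(self) -> None:
        self._codes: dict[str, int] = {}
        self._next_code = 1  # code 0 is left unused, as in the commit description

    def code_for(self, character: str) -> int:
        """Return the byte code for a character, allocating one if needed."""
        existing = self._codes.get(character)
        if existing is not None:
            return existing
        if self._next_code > self.MAX_CODE:
            raise ValueError("more than 255 distinct characters used with this font")
        code = self._next_code
        self._codes[character] = code
        self._next_code += 1
        return code

    def to_unicode(self) -> dict[int, int]:
        """Return the reverse table: byte code -> Unicode code point."""
        return {code: ord(ch) for ch, code in self._codes.items()}


if __name__ == "__main__":
    mapper = CharacterByteMapper()
    encoded = bytes(mapper.code_for(ch) for ch in "aλa€")
    print(list(encoded))        # byte codes in order of use
    print(mapper.to_unicode())  # data for a ToUnicode-style reverse lookup
```

Repeated characters reuse their existing code, so the limit applies to the number of distinct characters used with the font rather than to the length of the text.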