PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-08-20 06:38:07 +08:00

Author	SHA1	Message	Date
EliotJones	efb8c2a803	i merged a pr which broke the build, this updates the build to work. move all arguments to add page to a setting object so it can be extended in future in a non-breaking api change	2025-07-20 17:36:19 +01:00
jan-sutter	e636212ec8	check for cycles during indirect reference resolution (#1097) Co-authored-by: Jan Sutter <jan@suttermail.de>	2025-07-20 11:12:55 -05:00
EnraH	3b318e1944	add option to strip annotation (#492) * add option to strip annotation * fix implementation and tests Co-authored-by: arne.hansen <arne.hansen@digitecgalaxus.ch> Co-authored-by: Eliot Jones <elioty@hotmail.co.uk>	2025-07-20 11:10:15 -05:00
EliotJones	377eb507e8	when writing content to an existing page invert any global transform #614. when adding a page to a builder from an existing document using either addpage or copyfrom methods, the added page's content stream can contain a global transform matrix change that will subsequently change all the locations of any modifications made by the user. here, whenever using an existing stream, we apply the inverse of any active transformation matrix. there could be a bug here where if you use 'copy from' with a global transform active, we then apply the inverse, and you use 'copy from' again to the same destination page, our inverse transform is now active and could potentially affect the second stream, but I don't think it will	2025-07-20 00:53:03 +01:00
BobLd	ff4e763192	Update hack for 1bpc + DeviceGray	2025-07-19 21:45:41 +01:00
BobLd	6a06452103	Remove decode parameter application from Stencil color space for consistency	2025-07-19 13:45:06 +01:00
BobLd	a5e92cd11c	Update run_common_crawl_tests.yml	2025-07-19 12:21:10 +01:00
EliotJones	4bf746c747	add new action to run integration against common crawl corpus	2025-07-19 11:49:34 +01:00
EliotJones	bffd51425d	support bfrange having incorrect length in a cmap. the corpus file 0001413.pdf has an off-by-one error in its count for cmap bfranges. here we exit early if an unexpected endbfrange operator is encountered early. this matches the pdfbox behavior: `067d56e4db/fontbox/src/main/java/org/apache/fontbox/cmap/CMapParser.java (L373)`	2025-07-19 11:48:04 +01:00
Eliot Jones	e3388ec6b6	fix colorspace error when form xobject contains a transparency group (#1088) * fix colorspace error when form xobject contains a transparency group. when a form xobject contains a reference to a group xobject this can only be used to change attributes of the transparency imaging model. the old code was setting the main colorspaces incorrectly, causing errors when the transparency component had a different number of channels. this was causing #1071 in addition to the failure in file 0000355.pdf of the test corpus * add master integration tests for corpus group 0000 * tidy up actions * remove invalid reference in echo * move new action to different branch	2025-07-19 11:46:56 +01:00
EliotJones	31658ca020	allow reading to continue if encountering an invalid surrogate pair. investigating the corpus at https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/ the input file 0000000.pdf contained a utf-16 surrogate pair in an input defined as ucs2. the approach of various parsers varies here: adobe acrobat seems to hard crash, pdf js returns the same text we now parse, chrome parses the intended text (2 invalid characters and "ib exam"). we don't care too much about matching chrome exactly so doing the same as firefox is fine here	2025-07-16 07:45:40 +01:00
EliotJones	1021729727	fall back to times-roman as standard 14 font when lenient. if parsing in lenient mode and encountering a malformed base name (in this case 'helveticai') we fall back to times-roman as the adobe font metrics file for a standard 14 font. this aligns with the behavior of pdfbox. we also log a more informative error in non-lenient modes. this fixes document 0000086.pdf from the corpus	2025-07-16 07:43:49 +01:00
Eliot Jones	9503f9c137	fix off-by-one and optimize brute force xref search #1078 (#1079) * fix off-by-one and optimize brute force xref search #1078. when performing a brute force xref search we were ending up off-by-one; update the search to use a ring buffer to reduce seeking and fix xref detection * make method testable and add test coverage * normalize test input on other platforms * seal circular buffer class	2025-07-16 07:35:24 +01:00
Eliot Jones	016b754c5b	back-calculate first char if last char and widths present (#1081) * back-calculate first char if last char and widths present. when a truetype font has a last char and widths array in its font dictionary the first char can be calculated #644 * fix off by 1 in last char calculation	2025-07-14 21:57:01 +01:00
Eliot Jones	de3b6ac6f4	use correct bounding boxes for standard 14 glyphs #850 (#1080) * use correct bounding boxes for standard 14 glyphs #850. previously every bounding box for type 1 standard 14 fonts was assumed to start at 0,0 and ignored the bounding box data in the font metrics file. now we correctly read the glyph bounding box while preserving the existing advance width values for advancing the renderer position * update test case for new logic	2025-07-14 21:54:42 +01:00
EliotJones	b11f936f22	fix copying of sub-dictionary when keys collide. when copying from an ancestor node of a page's resource dictionary we were incorrectly writing nested nodes of e.g. /fonts to the root of the target dictionary; here we write to the intended target node correctly	2025-07-10 18:32:20 +01:00
EliotJones	7fe60ff8c3	skip single letter final blocks. align with the behavior of pdfbox and c implementations where single character final blocks are ignored rather than being written. also makes the error more informative in case it is ever encountered again. add more test cases. it is possible this is hiding the problem and will move the error elsewhere but this matches the implementation behavior of the 2 reference implementations. one other potential source for the error is if pdf supports '<~' as a start of data marker which i can't find in the spec but wikipedia says might be possible? without documents to trigger the error i think this is the best fix for now	2025-07-09 07:33:12 +01:00
EliotJones	781991b6bf	fix #670 by ignoring duplicate endstream definitions. when parsing a stream object with multiple endstream tokens the last parsed token was selected instead of the actual stream token, so instead we just skip all following tokens if the first is a stream and the following tokens are `endstream` operators only	2025-07-07 20:34:26 +01:00
EliotJones	daaac9350d	writer util did not follow reference links #1032. when copying various dictionaries from a source document to the builder, any indirect references in the source document would throw because the code expected the dictionary token directly. now we follow the list of indirect references until we find a non-indirect leaf token. also changes the exception type.	2025-07-06 07:15:35 +01:00
EliotJones	f099dd5827	add test coverage to stream scanning	2025-07-06 07:13:55 +01:00
EliotJones	0586713da3	skip comments in pdf object streams #926. the file provided in issue #926 contains the following syntax in pdf object streams: `% 750 0 obj << >>`. currently we read the comment token and skip the rest, however this producer is writing nonsense to the stream. comment tokens are only valid outside streams in pdf files, so we align to the behavior of pdfbox here by skipping the entire line containing a comment inside a stream, which fixes parsing this file.	2025-07-06 07:13:55 +01:00
xufeng	62612588c8	Fix bug in PngFromPdfImageFactory where softmask is wrongly referenced.	2025-07-06 07:10:11 +01:00
BobLd	bf664c3f0b	Use ReadOnlyMemory<byte> in ShowText operators and implement MoveToNextLineShowTextWithSpacing parsing	2025-06-29 14:27:14 +01:00
BobLd	6a50160e65	Prevent RunLengthFilter malicious OOM	2025-06-29 13:57:01 +01:00
BobLd	73ce5bbb73	Make classes related to page content parsing public	2025-06-28 13:17:40 +01:00
BobLd	d1d79b0b4c	Check ColorSpace token as dictionary and fix issue #1061	2025-06-25 19:20:02 +01:00
BobLd	89abf6de54	Skip creating IndirectReference in CrossReferenceTablePartBuilder when generationNumber is more than 65,535	2025-06-01 14:16:22 +01:00
BobLd	24431b1f9f	Optimize internal representation of IndirectReference	2025-06-01 12:02:29 +01:00
BobLd	8f9194c9a4	Miscellaneous minor changes	2025-05-31 23:02:46 +01:00
BobLd	fe3d15d5db	Add extension method to get Memory<byte> from MemoryStream, attempting to do it without allocation and update CMapParser	2025-05-30 13:02:55 +01:00
BobLd	b5b58434e9	Make the Diacritics class public for use in external StreamProcessors	2025-05-30 09:25:03 +01:00
BobLd	d9b3891eb3	Do not throw if the Mask dictionary contains a ColorSpace key	2025-05-30 07:53:25 +01:00
BobLd	4bdb85d1ff	Modernise PngPredictor and refactor LzwFilter and FlateFilter to reduce memory allocation	2025-05-29 22:43:46 +01:00
BobLd	f84f2aceec	Improve memory allocation by changing IFilter.Decode() signature to use Memory<byte> instead of ReadOnlyMemory/ReadOnlySpan	2025-05-29 12:41:50 +01:00
BobLd	2b54a546d3	Check for infinite recursion in ObjectLocationProvider.TryGetOffset() and fix #1050	2025-05-28 20:24:31 +01:00
BobLd	5b566b53da	Only reset missed attempts count if table is found in CrossReferenceParser.Parse() and fix #1047	2025-05-27 20:57:38 +02:00
BobLd	ca9f70ffb0	Skip control chars in CoreTokenScanner.MoveNext() and fix #1048	2025-05-27 20:57:38 +02:00
BobLd	67d3dde04a	Handle TrueType case in CidFontFactory where the font is CFF, implement missing members in PdfCidCompactFontFormatFont and fix #554	2025-05-19 00:27:51 +01:00
BobLd	e4d7805a1f	Add test to ensure #822 is fixed	2025-05-18 22:32:07 +01:00
BobLd	6911f31b49	Try to repair xref offset by looking for all startxref and fix #1040	2025-05-18 17:32:27 +01:00
BobLd	bf7c3c01d0	Fix bug introduced in #1039	2025-05-13 18:44:31 +01:00
ricflams	c3c477a2ba	Bugfix and optimize GetStartXrefPosition. The bugfix was the important part but the optimization is pretty nice too. - Bugfix: If startxref was found so far back (e.g. in the very beginning, which can be the case for Linearized PDFs) that we ended up setting actualStartOffset to 0, then the loop would exit immediately without actually searching that part. - Optimization: GetStartXrefPosition would search for startxref in the last 2048 bytes and then double that search-range (looking back 4096, 8192, etc. bytes) to look for startxref until the entire file was searched. This was rather inefficient since each step would search the same parts over and over again. This has been changed to properly search (still increasingly larger) chunks that don't overlap. On a test of 5000 PDFs that reduced their load-time by 10%. - Change: No need for the exception to say that startxref couldn't be found "in the last 2048 characters" since the entire file was searched anyway.	2025-05-13 18:21:31 +01:00
BobLd	4dab2ef239	Add early support for Stencil masking, rename SoftMaskImage property into MaskImage and make sure IsInlineImage is true for InlineImage	2025-05-11 16:05:55 +01:00
BobLd	0bed135bad	Make sure the value of the ImageMask / Im token is checked in ColorSpaceDetailsParser	2025-05-11 14:34:40 +01:00
BobLd	47584716ec	Add support for MacCatalyst in SystemFontFinder	2025-04-24 19:17:48 +01:00
BobLd	afdd1f8924	Fix issue #1013	2025-04-20 18:03:04 +01:00
BobLd	580858348b	Seal PdfSubpath class and IPathCommand implementations, fix Close.GetHashCode() and fix #1027	2025-04-12 16:39:48 +01:00
BobLd	24902f1839	Update README.md	2025-04-06 12:06:36 +01:00
BobLd	87f5735b26	Refactor AesEncryptionHelper and check for string length when using < net8	2025-04-06 12:04:24 +01:00
BobLd	eeac910e44	Fix CanFilterClippedLetters() failing on MacOS because font is not available	2025-04-06 12:04:24 +01:00

1691 Commits