PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-12-27 15:15:47 +08:00

Author	SHA1	Message	Date
BobLd	ee0cb1dc4a	Use file header offset when doing brute force find and fix #1223	2025-12-07 13:43:22 +00:00
BobLd	2a6ee918b7	Revert "Avoid a lot of seeks by making most tokenizers no longer read to far by using seek." This reverts commit `e11dc6bf40`.	2025-11-07 22:31:43 +00:00
Bert Huijben	e11dc6bf40	Avoid a lot of seeks by making most tokenizers no longer read to far by using seek. Optimize the FirstPassParser to just fetch a final chunk before doing things char-by-char backwards.	2025-10-28 06:48:41 +00:00
Bert Huijben	6fba565d66	Avoid doing a true file seek for simple peeking the next char in the token parser	2025-10-20 06:33:34 +01:00
Eliot Jones	07df6fd740	read last line of ignore file (#1155) * read last line of ignore file - do not cancel other matrix jobs if one test fails - read all lines of the ignore list even if it doesn't end with a newline - add ignore list for 0008 and 0009 * support missing object numbers when brute-forcing the file 10404 (ironically) contains not found references with number 43 0 for its info dictionary. changes brute-force code so that objects can be entirely missing * fix test since document is now opened successfully but mediabox is broken	2025-09-13 16:57:35 +02:00
BobLd	a43b968ea9	Lower max search depth in preventing StackOverflow in ParseTrailer	2025-08-10 10:06:23 +01:00
BobLd	1031dcc221	Prevent StackOverflow in ParseTrailer and fix #1122	2025-08-09 08:46:04 +01:00
EliotJones	2b11961c8c	remove debug asserts causing test failures we encountered a fence constructed in the middle of a field for an unknown reason so we demolished it. i think this was intended to catch flaws in the parser logic but the reality is in a pdf anything can happen so we no longer want to catch these issues and this restores a green test run in debug mode. fix for #915	2025-07-20 17:42:34 +01:00
EliotJones	781991b6bf	fix #670 by ignoring duplicate endstream definitions when parsing a stream object with multiple endstream tokens the last parsed token was selected instead of the actual stream token so instead we just skip all following tokens if the first is a stream and the following tokens are `endstream` operators only	2025-07-07 20:34:26 +01:00
EliotJones	0586713da3	skip comments in pdf objects streams #926 the file provided in issue #926 contains the following syntax in pdf object streams: ``` % 750 0 obj << >> ``` currently we read the comment token and skip the rest however this producer is writing nonsense to the stream. comment tokens are only valid outside streams in pdf files so we align to the behavior of pdfbox here by skipping the entire line containing a comment inside a stream which fixes parsing this file.	2025-07-06 07:13:55 +01:00
BobLd	89abf6de54	Skip creating IndirectReference in CrossReferenceTablePartBuilder when generationNumber is more than 65,535	2025-06-01 14:16:22 +01:00
BobLd	2b54a546d3	Check for infinite recursion in ObjectLocationProvider.TryGetOffset() and fix #1050	2025-05-28 20:24:31 +01:00
BobLd	afdd1f8924	Fix issue #1013	2025-04-20 18:03:04 +01:00
BobLd	f26e7d90a3	Pass IFilterProvider to IFilter.Decode() and handle null in PdfExtensions.Resolve()	2025-02-23 09:37:25 +00:00
Arnaud TAMAILLON	abce843923	Fix message	2024-09-03 05:09:03 +01:00
Arnaud TAMAILLON	fc3cd81c96	Support relaxed parsing of missing or garbage-prepended endobj/endtream tokens	2024-09-03 05:09:03 +01:00
Sylvain Bruyere	65a18b200f	Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams (#838) * Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams - Fix of Stream invalid Length issue causing stream data being cut off: fix https://github.com/UglyToad/PdfPig/issues/809 - Improve Stream Token read performance by: - simplifying TryReadStream(), avoiding use of MemoryStream, with benefice of already existing Memory Span of "inputBytes" - removing the unecessary List<> * Add Stream with Invalid Length unit test * Use of Memory<> instead of direct Span to avoid byte array allocation .ToArray. Suggestion from (`4153e4a1b4 (r1619509165)`)	2024-05-31 07:16:56 +01:00
Jason Nelson	6d54355754	Spanify filters	2024-04-12 07:42:19 +01:00
Jason Nelson	f62929eb7c	Spanify work 1 (#812) * Add GetString(ReadOnlySpan<byte>) polyfill * Add ArrayPoolBufferWriter * Use Utf8.IsValid & char.IsAsciiHexDigit on NET8.0+ * Optimize HexTokenizer * Eliminate various Tuple allocations * Eliminate List allocation in CrossReferenceTable * Eliminate various allocations in Ascii85Filter * Spanify HexToken * Spanify Palette * Spanify various Cmap & font methods * Spanify Type1Charstring classes * Spanify PdfDocEncoding.TryConvertBytesToString * Spanify OctalHelpers.FromOctalDigits * Add missing braces * React to HexToken.Byte type changes * Cleanup * [Tests] React to span changes * Add ArgumentNullException check back to Type1CharstringDecryptedBytes * Remove unsafe code * Seal HexToken * Avoid allocation when passing an empty span	2024-04-01 09:18:01 +01:00
Jason Nelson	a412a239be	Enable nullable annotations (#803) * Enable nullable annotations * Remove unused Jetbrain annotations * Ensure system using statements are first * Improve nullability annotations * Annotate encryptionDictionary is non-null when IsEncrypted is true * Disable nullable for PdfTokenScanner.Get * Improve nullability annotations for ObjectLocationProvider.TryGetCached * Revert changes to RGBWorkingSpace * Update UglyToad.PdfPig.Package with new framework targets (fixes nightly builds)	2024-03-17 18:51:40 +00:00
Jason Nelson	834fb350a3	Use Array.Empty	2024-03-15 13:10:25 +00:00
BobLd	acfe8b5fdd	Allow lenient parsing in DictionaryTokenizer and fix #791	2024-03-11 20:01:07 +00:00
BobLd	096ebdbf70	Replace Trace by Debug	2024-01-16 19:24:59 +00:00
Eliot Jones	d7a34c69ce	handle duplicated invalid closing array/dict tokens in objects #6	2024-01-11 16:00:46 +00:00
Eliot Jones	6f59bed9a2	use pdfdocencoding when parsing strings	2023-06-04 16:40:43 +01:00
romain v	47a0a62eee	\r only in token scanner An edge case was lost with this commit `31ca3640d2` when scanner is only followed by \r (without \n)	2021-08-17 16:14:59 +02:00
InusualZ	31ca3640d2	Tolerate any white-space bytes after the `stream` operator.	2021-06-27 15:02:20 -04:00
Eliot Jones	9ae0a5ec15	allow stream filters to contain indirect references to name tokens	2021-04-25 16:22:22 -04:00
Eliot Jones	25cc3c8634	Merge pull request #279 from plaisted/edit-docs-v2 Add pages from PdfDocument to PdfDocumentBuilder	2021-02-14 09:58:08 -04:00
Eliot Jones	6f49b2e29e	Fix buggy font (#283) * Adds checksum on font file reading * fix name table parsing on broken table * only warn if checksum invalid, avoid exception with bounds check #258 also returns a null object when the object generation number exceeds ushort.maxvalue since this is the maximum allowed value and this broke tests attempting to parse all objects in the file from #258 * remove potentially problematic document it might be sensitive data * use ttf from file to test without including full file Co-authored-by: romain v <rvergnory@lucca.fr>	2021-02-07 12:23:11 -04:00
Plaisted	7f42ad0af9	refactored previous work to fit pr #250	2021-02-06 12:24:53 -06:00
Eliot Jones	fa5e37dc8c	handle presence of endobj markers in object stream #235	2020-11-22 12:51:38 -04:00
Eliot Jones	6359ba5df1	handle objects without endobj markers #198	2020-08-21 18:15:30 +01:00
romain v	5a82c36631	FIX : undefined references is a valid use case. I tried to mitigate the breaking change by keep on throwing in most uses of the change method.	2020-08-17 11:10:44 +02:00
Eliot Jones	ec9e425712	use length from stream dictionary if directly available when brute forcing we use the length available in the stream's dictionary token if it is a direct number rather than an indirect reference.	2020-02-27 17:17:49 +00:00
Eliot Jones	f415c3116e	cross reference offset is in the xref table we ignore the error previously we checked the offset was not inside the table (correct thing to check), however this is only a special case of the more general issue (cross reference offsets are wrong). we move handling for this into the pdf token scanner. if we attempt to read an object at an offset and it fails we brute force the entire file to find correct offsets. we also needed to add handling to make sure we don't attempt to use stream length tokens if we're brute-forcing since we can't look up indirect references for length.	2020-02-26 14:03:46 +00:00
Eliot Jones	693a3d5958	use offset to file header to correct cross references if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.	2020-01-26 15:30:20 +00:00
Eliot Jones	e588b2bc50	support documents without endobj for stream some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.	2020-01-07 15:27:01 +00:00
Eliot Jones	10dc5a8eed	don't cache invalid offsets unless brute forced don't cache objects parsed if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because 1 object may lie in 2 streams, 1 valid and 1 invalid. If the invalid stream is parsed first for another object then the valid stream will never be read.	2020-01-07 14:54:12 +00:00
Eliot Jones	7c0ef111ea	move classes to new projects to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts are moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing standalone version of the font library as pdfbox does with fontbox.	2020-01-04 16:38:18 +00:00
Eliot Jones	23c7e44fc8	handle stream length being an object stream value	2019-12-24 15:22:47 +00:00
Eliot Jones	3084a9aab6	support streams containing only carriage returns. handle comments in arrays and dictionaries * while the pdf specification says stream data should follow a newline following a stream operator some files have only a carriage return following the stream operator. * since comment tokens may appear inside an array or dictionary we ignore them if they occur here since they will break interpretation of the dictionary or array contents.	2019-12-20 14:04:58 +00:00
Eliot Jones	68bcaf3901	#55 move support for images to page and add inline images support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page. treat all warnings as errors.	2019-10-08 14:04:36 +01:00
Eliot Jones	bbe5409f94	#62 use length value of stream directly to read the full stream once	2019-08-20 21:08:06 +01:00
Eliot Jones	caf1a0c233	use invariant culture for parsing all numbers #37	2019-06-18 19:12:51 +01:00
Eliot Jones	98424b32aa	special case handling for faulty offsets in xref with missing whitespace between eof and object number	2019-06-14 20:40:24 +01:00
Eliot Jones	2b486dccab	prevent infinite loops where a stream token's length entry references itself. perform brute force scans in case of a faulty xref table #33	2019-06-08 16:45:02 +01:00
Eliot Jones	03af28ed6d	fix bug with compact font format font matrix reading and where endstream token is missed if immediately following 'e'	2019-05-10 20:02:29 +01:00
Eliot Jones	bad57763a1	finish initial support for rc4 encryption with blank user password	2019-05-06 15:41:29 +01:00
Eliot Jones	be394f5bba	start adding support for reading encrypted documents	2019-05-04 15:36:13 +01:00

56 Commits