56 Commits

Author SHA1 Message Date
BobLd
ee0cb1dc4a Use file header offset when doing brute force find and fix #1223
2025-12-07 13:43:22 +00:00
BobLd
2a6ee918b7 Revert "Avoid a lot of seeks by making most tokenizers no longer read too far by using seek."
This reverts commit e11dc6bf40.
2025-11-07 22:31:43 +00:00
Bert Huijben
e11dc6bf40 Avoid a lot of seeks by making most tokenizers no longer read too far by using seek.
Optimize the FirstPassParser to just fetch a final chunk before doing things char-by-char backwards.
2025-10-28 06:48:41 +00:00
Bert Huijben
6fba565d66 Avoid doing a true file seek when simply peeking at the next char in the token parser
2025-10-20 06:33:34 +01:00
Eliot Jones
07df6fd740 read last line of ignore file (#1155)
* read last line of ignore file

- do not cancel other matrix jobs if one test fails
- read all lines of the ignore list even if it doesn't end with a newline
- add ignore list for 0008 and 0009

* support missing object numbers when brute-forcing

the file 10404 (ironically) contains not found references with number 43 0
for its info dictionary. changes brute-force code so that objects can be
entirely missing

* fix test since document is now opened successfully but mediabox is broken
2025-09-13 16:57:35 +02:00
BobLd
a43b968ea9 Lower max search depth in preventing StackOverflow in ParseTrailer 2025-08-10 10:06:23 +01:00
BobLd
1031dcc221 Prevent StackOverflow in ParseTrailer and fix #1122 2025-08-09 08:46:04 +01:00
EliotJones
2b11961c8c remove debug asserts causing test failures
we encountered a fence constructed in the middle of a field for an unknown
reason so we demolished it. i think this was intended to catch flaws in the
parser logic but the reality is in a pdf anything can happen so we no longer
want to catch these issues and this restores a green test run in debug mode.

fix for #915
2025-07-20 17:42:34 +01:00
EliotJones
781991b6bf fix #670 by ignoring duplicate endstream definitions
when parsing a stream object with multiple endstream tokens
the last parsed token was selected instead of the actual stream
token so instead we just skip all following tokens if the first
is a stream and the following tokens are `endstream` operators
only
2025-07-07 20:34:26 +01:00
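
The rule described in 781991b6bf can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code: the `(kind, value)` token model and the name `select_object_token` are assumptions made for the example. It shows only the selection step: when the first token read for an object is a stream and everything after it is an `endstream` operator, the stream is kept; otherwise the last token is used, as before.

```python
from typing import List, Tuple

# Hypothetical token model for illustration: (kind, value).
Token = Tuple[str, str]

ENDSTREAM: Token = ("operator", "endstream")


def select_object_token(tokens: List[Token]) -> Token:
    """Choose the token that represents an object's value.

    Assumption: `tokens` holds everything read between `obj` and `endobj`.
    """
    if not tokens:
        raise ValueError("object contains no tokens")

    first, rest = tokens[0], tokens[1:]

    # A stream followed only by stray `endstream` operators: keep the stream
    # and ignore the duplicates.
    if first[0] == "stream" and rest and all(t == ENDSTREAM for t in rest):
        return first

    # Otherwise fall back to the last token read.
    return tokens[-1]


chosen = select_object_token([("stream", "data"), ENDSTREAM, ENDSTREAM])
```

Here `chosen` is the stream token, not the trailing `endstream` operator.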
EliotJones
0586713da3 skip comments in pdf objects streams #926
the file provided in issue #926 contains the following syntax
in pdf object streams:

```
% 750 0 obj
<< >>
```

currently we read the comment token and skip the rest
however this producer is writing nonsense to the stream.
comment tokens are only valid outside streams in pdf files
so we align to the behavior of pdfbox here by skipping the
entire line containing a comment inside a stream which fixes
parsing this file.
2025-07-06 07:13:55 +01:00
BobLd
89abf6de54 Skip creating IndirectReference in CrossReferenceTablePartBuilder when generationNumber is more than 65,535
2025-06-01 14:16:22 +01:00
BobLd
2b54a546d3 Check for infinite recursion in ObjectLocationProvider.TryGetOffset() and fix #1050 2025-05-28 20:24:31 +01:00
BobLd
afdd1f8924 Fix issue #1013 2025-04-20 18:03:04 +01:00
BobLd
f26e7d90a3 Pass IFilterProvider to IFilter.Decode() and handle null in PdfExtensions.Resolve()
2025-02-23 09:37:25 +00:00
Arnaud TAMAILLON
abce843923 Fix message 2024-09-03 05:09:03 +01:00
Arnaud TAMAILLON
fc3cd81c96 Support relaxed parsing of missing or garbage-prepended endobj/endstream tokens 2024-09-03 05:09:03 +01:00
Sylvain Bruyere
65a18b200f Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams (#838)
* Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams

- Fix of Stream invalid Length issue causing stream data being cut off: fix https://github.com/UglyToad/PdfPig/issues/809

- Improve Stream Token read performance by:
  - simplifying TryReadStream(), avoiding use of MemoryStream, with the benefit of the already existing Memory Span of "inputBytes"
  - removing the unnecessary List<>

* Add Stream with Invalid Length unit test

* Use of Memory<> instead of direct Span to avoid byte array allocation .ToArray.
Suggestion from (4153e4a1b4 (r1619509165))
2024-05-31 07:16:56 +01:00
Jason Nelson
6d54355754 Spanify filters 2024-04-12 07:42:19 +01:00
Jason Nelson
f62929eb7c Spanify work 1 (#812)
* Add GetString(ReadOnlySpan<byte>) polyfill

* Add ArrayPoolBufferWriter

* Use Utf8.IsValid & char.IsAsciiHexDigit on NET8.0+

* Optimize HexTokenizer

* Eliminate various Tuple allocations

* Eliminate List allocation in CrossReferenceTable

* Eliminate various allocations in Ascii85Filter

* Spanify HexToken

* Spanify Palette

* Spanify various Cmap & font methods

* Spanify Type1Charstring classes

* Spanify PdfDocEncoding.TryConvertBytesToString

* Spanify OctalHelpers.FromOctalDigits

* Add missing braces

* React to HexToken.Byte type changes

* Cleanup

* [Tests] React to span changes

* Add ArgumentNullException check back to Type1CharstringDecryptedBytes

* Remove unsafe code

* Seal HexToken

* Avoid allocation when passing an empty span
2024-04-01 09:18:01 +01:00
Jason Nelson
a412a239be Enable nullable annotations (#803)
* Enable nullable annotations

* Remove unused JetBrains annotations

* Ensure system using statements are first

* Improve nullability annotations

* Annotate encryptionDictionary is non-null when IsEncrypted is true

* Disable nullable for PdfTokenScanner.Get

* Improve nullability annotations for ObjectLocationProvider.TryGetCached

* Revert changes to RGBWorkingSpace

* Update UglyToad.PdfPig.Package with new framework targets (fixes nightly builds)
2024-03-17 18:51:40 +00:00
Jason Nelson
834fb350a3 Use Array.Empty 2024-03-15 13:10:25 +00:00
BobLd
acfe8b5fdd Allow lenient parsing in DictionaryTokenizer and fix #791 2024-03-11 20:01:07 +00:00
BobLd
096ebdbf70 Replace Trace by Debug 2024-01-16 19:24:59 +00:00
Eliot Jones
d7a34c69ce handle duplicated invalid closing array/dict tokens in objects #6 2024-01-11 16:00:46 +00:00
Eliot Jones
6f59bed9a2 use pdfdocencoding when parsing strings 2023-06-04 16:40:43 +01:00
romain v
47a0a62eee \r only in token scanner
An edge case was lost with this commit 31ca3640d2

when scanner is only followed by \r (without \n)
2021-08-17 16:14:59 +02:00
InusualZ
31ca3640d2 Tolerate any white-space bytes after the stream operator. 2021-06-27 15:02:20 -04:00
Eliot Jones
9ae0a5ec15 allow stream filters to contain indirect references to name tokens 2021-04-25 16:22:22 -04:00
Eliot Jones
25cc3c8634 Merge pull request #279 from plaisted/edit-docs-v2
Add pages from PdfDocument to PdfDocumentBuilder
2021-02-14 09:58:08 -04:00
Eliot Jones
6f49b2e29e Fix buggy font (#283)
* Adds checksum on font file reading

* fix name table parsing on broken table

* only warn if checksum invalid, avoid exception with bounds check #258

also returns a null object when the object generation number exceeds
ushort.maxvalue since this is the maximum allowed value and this
broke tests attempting to parse all objects in the file from #258

* remove potentially problematic document

it might be sensitive data

* use ttf from file to test without including full file

Co-authored-by: romain v <rvergnory@lucca.fr>
2021-02-07 12:23:11 -04:00
Plaisted
7f42ad0af9 refactored previous work to fit pr #250 2021-02-06 12:24:53 -06:00
Eliot Jones
fa5e37dc8c handle presence of endobj markers in object stream #235 2020-11-22 12:51:38 -04:00
Eliot Jones
6359ba5df1 handle objects without endobj markers #198 2020-08-21 18:15:30 +01:00
romain v
5a82c36631 FIX : undefined references is a valid use case.
I tried to mitigate the breaking change by continuing to throw in most uses of the changed method.
2020-08-17 11:10:44 +02:00
Eliot Jones
ec9e425712 use length from stream dictionary if directly available
when brute forcing we use the length available in the stream's dictionary token if it is a direct number rather than an indirect reference.
2020-02-27 17:17:49 +00:00
Eliot Jones
f415c3116e cross reference offset is in the xref table we ignore the error
previously we checked the offset was not inside the table (correct thing to check), however this is only a special case of the more general issue (cross reference offsets are wrong). we move handling for this into the pdf token scanner. if we attempt to read an object at an offset and it fails we brute force the entire file to find correct offsets. we also needed to add handling to make sure we don't attempt to use stream length tokens if we're brute-forcing since we can't look up indirect references for length.
2020-02-26 14:03:46 +00:00
Eliot Jones
693a3d5958 use offset to file header to correct cross references
if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.
2020-01-26 15:30:20 +00:00
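
The correction described in 693a3d5958 amounts to adding the header's position to each declared offset. The Python sketch below illustrates that arithmetic only; the function names are hypothetical and the header search is simplified to a plain byte search for `%PDF-`, which may differ from how the library locates the version header.

```python
def find_header_offset(data: bytes) -> int:
    """Return the byte position of the `%PDF-` header, or 0 if none is found."""
    position = data.find(b"%PDF-")
    return position if position >= 0 else 0


def corrected_offset(declared_offset: int, header_offset: int) -> int:
    """Shift an offset recorded relative to the header to an absolute file position."""
    return declared_offset + header_offset


# A file with leading junk before the header.
junk = b"\x00\x00junk\n"
body = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
data = junk + body

# Offset as a writer would have recorded it, relative to the header.
declared = body.find(b"1 0 obj")

header_offset = find_header_offset(data)
actual = corrected_offset(declared, header_offset)

# The corrected offset lands on the object in the full file.
assert data[actual:actual + 7] == b"1 0 obj"
```

Without the shift, the declared offset would point into the header line of this file instead of the object.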
Eliot Jones
e588b2bc50 support documents without endobj for stream
some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.
2020-01-07 15:27:01 +00:00
Eliot Jones
10dc5a8eed don't cache invalid offsets unless brute forced
don't cache objects parsed if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because 1 object may lie in 2 streams, 1 valid and 1 invalid. If the invalid stream is parsed first for another object then the valid stream will never be read.
2020-01-07 14:54:12 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts are moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00
Eliot Jones
23c7e44fc8 handle stream length being an object stream value 2019-12-24 15:22:47 +00:00
Eliot Jones
3084a9aab6 support streams containing only carriage returns. handle comments in arrays and dictionaries
* while the pdf specification says stream data should follow a newline following a stream operator some files have only a carriage return following the stream operator.
* since comment tokens may appear inside an array or dictionary we ignore them if they occur here since they will break interpretation of the dictionary or array contents.
2019-12-20 14:04:58 +00:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
Eliot Jones
bbe5409f94 #62 use length value of stream directly to read the full stream once 2019-08-20 21:08:06 +01:00
Eliot Jones
caf1a0c233 use invariant culture for parsing all numbers #37 2019-06-18 19:12:51 +01:00
Eliot Jones
98424b32aa special case handling for faulty offsets in xref with missing whitespace between eof and object number 2019-06-14 20:40:24 +01:00
Eliot Jones
2b486dccab prevent infinite loops where a stream token's length entry references itself. perform brute force scans in case of a faulty xref table #33 2019-06-08 16:45:02 +01:00
Eliot Jones
03af28ed6d fix bug with compact font format font matrix reading and where endstream token is missed if immediately following 'e' 2019-05-10 20:02:29 +01:00
Eliot Jones
bad57763a1 finish initial support for rc4 encryption with blank user password 2019-05-06 15:41:29 +01:00
Eliot Jones
be394f5bba start adding support for reading encrypted documents 2019-05-04 15:36:13 +01:00