Commit Graph

79 Commits

Author SHA1 Message Date
BobLd
7c4f5e2424 Introduce StackDepthGuard class to check for stack depth in CoreTokenScanner and fix #1217
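The idea of a depth guard can be sketched as follows. This is an illustrative Python sketch only, not the project's code: the commit names a StackDepthGuard class but does not describe its interface, so the constructor argument, the context-manager shape, and the toy `parse_array` function are all assumptions.

```python
class StackDepthGuard:
    """Hypothetical guard: refuses to nest deeper than max_depth."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.depth = 0

    def __enter__(self):
        self.depth += 1
        if self.depth > self.max_depth:
            # __exit__ is not called when __enter__ raises, so undo here.
            self.depth -= 1
            raise RecursionError(f"nesting deeper than {self.max_depth}")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.depth -= 1
        return False  # never swallow exceptions


def parse_array(tokens, pos, guard):
    """Toy recursive parser for nested '[' ... ']' token lists.

    Returns (parsed_list, next_position). Every level of nesting
    passes through the guard, so hostile input fails fast instead
    of exhausting the real call stack.
    """
    with guard:
        items = []
        pos += 1  # skip the opening '['
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                child, pos = parse_array(tokens, pos, guard)
                items.append(child)
            else:
                items.append(tokens[pos])
                pos += 1
        return items, pos + 1
```

A caller would create one guard per scan and share it across all recursive calls, so the limit applies to the whole nesting chain rather than to each call separately.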
2025-12-23 16:24:04 +01:00
BobLd
ee0cb1dc4a Use file header offset when doing brute force find and fix #1223
2025-12-07 13:43:22 +00:00
BobLd
40bcc22ea1 Add CMap caching at document level and add MurmurHash3 hashing function
2025-10-26 16:20:27 +00:00
Bert Huijben
6fba565d66 Avoid doing a true file seek for simply peeking at the next char in the token parser
2025-10-20 06:33:34 +01:00
Bert Huijben
3592fc8438 Use zlib information to verify compressed content before using it
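One way to check compressed content before using it is to validate the two-byte zlib header defined in RFC 1950. The sketch below is in Python and is only an assumption about what "zlib information" could mean here; the commit does not say which fields are inspected, and `looks_like_zlib` is a hypothetical helper name.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Return True if data starts with a plausible zlib (RFC 1950) header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    if cmf & 0x0F != 8:        # CM: 8 is the only defined method (deflate)
        return False
    if cmf >> 4 > 7:           # CINFO: window sizes above 32K are not allowed
        return False
    # FCHECK: the header, read as a big-endian 16-bit value, must be
    # a multiple of 31.
    return (cmf * 256 + flg) % 31 == 0


compressed = zlib.compress(b"example stream content")
print(looks_like_zlib(compressed))
```

A header check like this is cheap and rejects most non-zlib data, but it cannot prove that the rest of the stream decompresses cleanly.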
2025-10-15 18:46:36 +01:00
BobLd
eb906a776d Handle non seekable stream by copying it into a memory stream and fix #1146 2025-09-14 14:42:59 +01:00
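The approach named in this commit, copying a non-seekable stream into memory so it can be read with random access, can be sketched in Python as below. `ensure_seekable` and `ForwardOnly` are hypothetical names used for illustration; the project's own implementation is in C# and is not shown on this page.

```python
import io


def ensure_seekable(stream):
    """Return the stream itself if seekable, else an in-memory copy."""
    if stream.seekable():
        return stream
    # Reads the remaining bytes once; costs memory proportional to file size.
    return io.BytesIO(stream.read())


class ForwardOnly(io.BytesIO):
    """Test double that pretends to be a forward-only stream."""

    def seekable(self):
        return False


source = ForwardOnly(b"%PDF-1.7\n...")
usable = ensure_seekable(source)
usable.seek(0, io.SEEK_END)   # now allowed, e.g. to search from the end
```

The trade-off is memory: the whole input is buffered, which is acceptable for typical documents but worth knowing for very large ones.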
BobLd
22eab422a3 First create the StreamInputBytes in PdfDocument.Open() to check the stream CanRead and CanSeek
2025-09-09 19:12:58 +01:00
Eliot Jones
0afe021ad3 move file parsing to single-pass static methods (#1102)
* move file parsing to single-pass static methods

for the file 0002973.pdf in the test corpus we need to completely overhaul
how initial xref parsing is done since we need to locate the xref stream by
brute-force and this is currently broken. i wanted to take this opportunity to
change the logic to be more imperative and less like the pdfbox methods with
instance data and classes.

currently the logic is split between the xref offset validator and parser methods
and we call the validator logic twice, followed by brute-force searching again
in the actual parser. we're going to move to a single method that performs
the following steps:

1. find the first (from the end) occurrence of "startxref" and pull out the location
in bytes. this will also support "startref" since some files in the wild have that
2. go to that offset if found and parse the chain of tables or streams by /prev
reference
3. if any element in step 2 fails then we perform a single brute-force over the
entire file and, like pdfbox, treat xrefs later in the file as the ultimate arbiter
of the object positions. while we do this we can potentially capture the actual
object offsets since the xref positions are probably incorrect too.
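
the three steps above can be sketched roughly as follows. this is an illustrative python sketch under stated assumptions, not the c# code from this change: the function names are hypothetical, step 2 (walking the /prev chain) is left out, and the brute-force pass records "N G obj" markers directly as a stand-in for the real recovery logic.

```python
import re


def find_startxref(data: bytes):
    """Step 1: offset after the last 'startxref' (or 'startref'), else None."""
    matches = list(re.finditer(rb"startx?ref\s+(\d+)", data))
    return int(matches[-1].group(1)) if matches else None


def brute_force_objects(data: bytes):
    """Step 3: scan once for 'N G obj' markers.

    Later occurrences overwrite earlier ones, so definitions further
    into the file win. A real scanner must also avoid false matches
    inside stream data, which this sketch does not attempt.
    """
    offsets = {}
    for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b", data):
        offsets[(int(m.group(1)), int(m.group(2)))] = m.start()
    return offsets


def locate(data: bytes):
    """Use the startxref offset if it is in bounds, else brute-force."""
    offset = find_startxref(data)
    if offset is not None and 0 <= offset < len(data):
        return ("xref", offset)      # step 2 would parse from here
    return ("brute-force", brute_force_objects(data))
```
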

the aim with this is to avoid as much seeking and re-reading of bytes as
possible. while this won't technically be single-pass it gets us much closer. it
also removes the more strict logic requiring a "startxref" token to exist and be
valid, since we can repair this by brute-force anyway.

we will surface as much information as possible from the static method so that
we could in future support an object explorer ui for pdfs.

this will also be more resilient to invalid xref formats with e.g. comment tokens
or missing newlines.

* move more parsing to the static classes

* plumb through the new parsing results

* plug in new parser and remove old classes, port tests to new classes

* update tests to reflect logic changes

* apply correction when file header has offset

* ignore console runner launch settings

* skip offsets outside of file bounds

* fix parsing tables missing a line break

* use brute forced locations if they're already present

* only treat line breaks and spaces as whitespace for stream content

* address review comments

---------

Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
2025-09-02 19:41:00 +01:00
EliotJones
1021729727 fall back to times-roman as standard 14 font when lenient
if parsing in lenient mode and encountering a malformed base name
(in this case 'helveticai') we fallback to times-roman as the adobe font
metrics file for a standard 14 font. this aligns with the behavior of pdfbox.
we also log a more informative error in non-lenient modes

this fixes document 0000086.pdf from the corpus
2025-07-16 07:43:49 +01:00
BobLd
73ce5bbb73 Make classes related to page content parsing public 2025-06-28 13:17:40 +01:00
BobLd
e10609e4e1 Use pdfScanner in ReadVerticalDisplacements and fix #693 and return 0 in CMap on exception in ReadByte() if useLenientParsing is true and fix #692
2024-10-19 00:29:42 +01:00
BobLd
8cee4f480f Introduce ParsingOptions.FilterProvider and BaseFilterProvider and make CcittFaxCompressionType a byte 2024-10-17 20:27:24 +01:00
Jason Nelson
6d54355754 Spanify filters 2024-04-12 07:42:19 +01:00
Jason Nelson
a412a239be Enable nullable annotations (#803)
* Enable nullable annotations

* Remove unused JetBrains annotations

* Ensure system using statements are first

* Improve nullability annotations

* Annotate encryptionDictionary is non-null when IsEncrypted is true

* Disable nullable for PdfTokenScanner.Get

* Improve nullability annotations for ObjectLocationProvider.TryGetCached

* Revert changes to RGBWorkingSpace

* Update UglyToad.PdfPig.Package with new framework targets (fixes nightly builds)
2024-03-17 18:51:40 +00:00
Jason Nelson
95f0459900 Prefer is null to == null
ensures that an equals overload isn't used, and we don't compare structs
2024-03-16 12:37:51 +00:00
BobLd
acfe8b5fdd Allow lenient parsing in DictionaryTokenizer and fix #791 2024-03-11 20:01:07 +00:00
BobLd
63096de210 Add IPageFactory to the public API, remove InternalParsingOptions 2023-10-25 20:03:02 +01:00
BobLd
ba865b340e Make IResourceStore part of the public API and pass InternalParsingOptions to the ResourceStore constructor 2023-10-22 19:16:41 +01:00
Mark van 't Zet
e3f281435a Fix for #662: when encountering invalid content, try to continue parsing
if option "useLenientParsing" is in effect.
2023-09-29 10:44:54 +01:00
BobLd
9aaf20ceb4 Address #672 to ignore errors while reading the descriptor file in CidFontFactory 2023-08-05 16:21:01 +01:00
Eliot Jones
6f59bed9a2 use pdfdocencoding when parsing strings 2023-06-04 16:40:43 +01:00
Eliot Jones
fba1cbc13c skip missing objects if skip fonts is true #298
if skip missing fonts is set we want to read the file
as much as possible so we will also skip any missing
xobjects like images, forms or postscript code
2023-05-27 10:46:29 +01:00
Eliot Jones
0b8252e930 do not require tounicode to be valid even if present #354 #619
these issues reported that parsing was failing due to a missing
token being referenced in the tounicode entry. since neither
issue included a sample file it's impossible to determine the
right fix accurately, however since the tounicode entry is
optional in the spec we can try being more lenient here, this
might just result in more errors once we try to use the font
but the logger will at least prevent parsing the entire document
failing
2023-05-21 12:34:27 +01:00
Eliot Jones
6b9c3be9f8 tidy up some small formatting issues 2023-05-21 12:20:57 +01:00
BobLD
b8a98fbed2 Properly implement color spaces 2023-04-12 07:25:09 +01:00
mvantzet
0e39bc0b76 Annotations named destinations (#579)
* Add Named Destinations to Catalog so that bookmarks and links can access
them.

The named destinations require access to page nodes, so created Pages object
that is made using PagesFactory (which contains the page-related code from
Catalog).

* Further implementation of destinations:
- Implement NamedDestinations in AnnotationProvider, so that we can look
  up named destinations for annotations and turn them into explicit destinations.
  Reused existing code inside BookmarksProvider to get destinations/actions.
- Added GoToE action
- According to the PDF reference, destinations are also required for
  external destinations and hence for ExternalBookmarkNode. This allows us
  to push up DocumentBookmarkNode.Destination to BookmarkNode.

* Implemented stateful appearance streams and integration test

* Added AppearanceStream to public API because it is used in the (public)
Annotation constructor

* After #552, must push down ExplicitDestination to DocumentBookmarkNode since it
does not apply to UriBookmarkNode.

* Added actions, which fits the PDF model better and works well with the
new bookmarks code (after PR #552)

* Rename Action to PdfAction + removed unused using in ActionProvider.cs

---------

Co-authored-by: mvantzet <mark@radialsg.com>
2023-04-10 17:14:14 +01:00
Eliot Jones
e2246a88bb #482 add skip missing fonts option and pass parsing options to content stream processor
this doesn't fix the reported issue since the pdf itself is corrupted on page 8 however it will
allow recovery in some scenarios where text content isn't important.

also adds more informative error when stream unintentionally passed with non zero offset
2022-10-09 13:44:05 -04:00
Eliot Jones
2a68670896 #443 handle case where file version comment token included in string by tokenization
instead just brute force the raw content
2022-04-24 12:37:26 -04:00
Eliot Jones
cbd02a270f don't throw if no information dictionary if lenient parsing 2022-04-14 20:46:36 -04:00
Eliot Jones
83948f42d7 #405 check encryption token value for null 2022-01-11 16:13:52 +00:00
Eliot Jones
9ae0a5ec15 allow stream filters to contain indirect references to name tokens 2021-04-25 16:22:22 -04:00
BobLd
f91acefcfa Set ClipPaths to false if no ParsingOptions given (consistent behaviour) 2020-04-27 17:21:52 +01:00
Eliot Jones
27e251f921 make filter provider and filter public and use tryget for image bytes 2020-04-25 09:42:24 +01:00
BobLd
a759a99389 Move ClipPaths option from GetPage() to ParsingOptions 2020-04-05 17:58:57 +01:00
Eliot Jones
f1be6634a7 add a bunch more performance improvements
filter provider becomes single instance and no longer has constructor parameters.

tokenizers use list and stringbuilder pools to reduce allocations.

system font finder becomes static to preserve file cache across all documents.
2020-04-05 15:34:47 +01:00
Eliot Jones
58972de7cb begin to rework cross-reference parsing
most of the cross-reference code is the earliest code in the project and hasn't been revisited since then. the issue #88 has been reopened due to a bug with brute-force searching so this tidies up the code in this area ahead of trying to fix the bug.
2020-03-03 15:21:11 +00:00
Inusual
013cbd14e0 Make CrossReferenceTableParser a static class 2020-03-02 17:00:16 +00:00
Eliot Jones
c864fa512c remove islenientparsing from page classes 2020-02-28 11:50:18 +00:00
Eliot Jones
746cbfa30c remove lenient parsing from font related classes
lenient parsing gives us more code to maintain for no real benefit, parsing should always be as lenient as possible. remove the flag from some of the font code.
2020-02-27 18:10:02 +00:00
Eliot Jones
4150881be9 recover from invalid acro-form references
we add a try/catch to the direct object finder's tryget method so it returns false rather than throwing.

if we have an acro-form reference in the catalog but no corresponding object in the document we instead scan all objects in the document to find form fields and reconstruct the acro-form dictionary.
2020-02-27 12:08:40 +00:00
Eliot Jones
693a3d5958 use offset to file header to correct cross references
if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.
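
the correction described here amounts to adding the header's position to every stored offset. a minimal python sketch, assuming the header is located by searching for the "%PDF-" marker (the function names are hypothetical and the real change is in c#):

```python
def header_offset(data: bytes) -> int:
    """Number of junk bytes before the '%PDF-' version header (0 if none)."""
    index = data.find(b"%PDF-")
    return index if index >= 0 else 0


def corrected_offset(stored_offset: int, data: bytes) -> int:
    """Shift an offset recorded relative to the header to a file position."""
    return stored_offset + header_offset(data)


data = b"junk\n%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
# The object starts 9 bytes after the header, so a writer that assumed
# the header was at byte 0 would have stored 9.
print(corrected_offset(9, data), data.find(b"1 0 obj"))
```
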
2020-01-26 15:30:20 +00:00
Eliot Jones
63b118b141 handle type1 fonts disguised as truetype
if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser.

also handles a closesubpath command appearing without any path construction operators.
2020-01-07 16:49:21 +00:00
Eliot Jones
0b048fde57 handle eof further back in file
an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding.
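
searching a fixed window at the end of the file can be sketched like this in python. the 2048 default follows from doubling the 1024 bytes mentioned above; `find_eof` is a hypothetical name and the sketch is not the project's c# code.

```python
def find_eof(data: bytes, window: int = 2048):
    """Position of the last '%%EOF' within the final `window` bytes, or None."""
    tail_start = max(0, len(data) - window)
    index = data.rfind(b"%%EOF", tail_start)
    return index if index >= 0 else None


# A file with 1500 bytes of trailing padding after the marker:
padded = b"%PDF-1.4\n" + b"%%EOF" + b"\n" * 1500
print(find_eof(padded, window=1024))  # marker is outside the old range
print(find_eof(padded))               # found with the doubled range
```
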

we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.
2020-01-07 11:48:09 +00:00
Eliot Jones
b29354e3e6 move compact font format fonts to fonts project 2020-01-05 12:08:01 +00:00
Eliot Jones
74774995d6 complete move of truetype, afm and standard14 fonts
the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.
2020-01-04 22:39:13 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00
Eliot Jones
4d697e3669 allow the user to supply multiple passwords for decryption
previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.
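
the behaviour described here, trying every supplied password and failing only once all have been rejected, can be sketched in python as below. `first_valid_password` and the `check` callback are hypothetical; the real option lives on the c# parsing options and its exact shape is not shown on this page.

```python
def first_valid_password(candidates, check):
    """Return the first candidate accepted by `check`.

    Raises ValueError only after every candidate has been tried,
    instead of failing on the first wrong one.
    """
    for candidate in candidates:
        if check(candidate):
            return candidate
    raise ValueError("none of the supplied passwords were accepted")


# A single password is just a list with one item.
accepts = lambda password: password == "s3cret"
print(first_valid_password(["guess", "s3cret", "other"], accepts))
```
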
2019-12-20 15:11:05 +00:00
Eliot Jones
c30cd1b96d use cid font subroutines where applicable. add ucs 2 cmap support for type 1 fonts
* cid cff fonts have multiple sub-fonts and multiple private dictionaries, in addition to a top level font and private dictionary. this fix uses the specific sub-dictionary when getting local subroutines on a per-glyph basis.
* chinese, japanese or korean fonts can use a ucs-2 encoding cmap for retrieving unicode values.
* add support for the additional glyph list for unicode values in true type fonts. adds nonmarkingreturn mapping to carriage return.
* makes font parsing classes static where there's no reason for them to be per-instance.
2019-12-19 13:33:44 +00:00
Eliot Jones
ecf0b8743b make bookmarknode immutable and use scanner when retrieving bookmarks 2019-12-05 12:03:30 +00:00
Eliot Jones
2ef45f71d5 make missing acroform types public and start improving data
also changes pages to use a proper tree structure since this will be required for resource inheritance and for acroform widget dictionaries.
2019-10-09 14:28:37 +01:00