* move file parsing to single-pass static methods
for the file 0002973.pdf in the test corpus we need to completely overhaul
how initial xref parsing is done since we need to locate the xref stream by
brute-force and this is currently broken. i wanted to take this opportunity to
change the logic to be more imperative and less like the pdfbox methods with
instance data and classes.
currently the logic is split between the xref offset validator and parser methods
and we call the validator logic twice, followed by brute-force searching again
in the actual parser. we're going to move to a single method that performs
the following steps:
1. find the first (from the end) occurrence of "startxref" and pull out the location
in bytes. this will also support "startref" since some files in the wild have that
2. go to that offset if found and parse the chain of tables or streams by /prev
reference
3. if any element in step 2 fails then we perform a single brute-force search over
the entire file and, like pdfbox, treat xrefs located later in the file as the
ultimate arbiter of the object positions. while we do this we can potentially
capture the actual object offsets since the xref positions are probably incorrect too.
the aim with this is to avoid as much seeking and re-reading of bytes as
possible. while this won't technically be single-pass it gets us much closer. it
also removes the more strict logic requiring a "startxref" token to exist and be
valid, since we can repair this by brute-force anyway.
we will surface as much information as possible from the static method so that
we could in future support an object explorer ui for pdfs.
this will also be more resilient to invalid xref formats with e.g. comment tokens
or missing newlines.
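
the three steps above can be sketched roughly as follows. this is a minimal python illustration rather than the pdfpig code: the function names are hypothetical, it only recognises classic "xref" tables (not xref streams), and it does not follow the /prev chain.

```python
import re

# Step 1: accept the misspelt "startref" as well as "startxref".
START_XREF = re.compile(rb"startx?ref\s+(\d+)")
# Step 3: an "xref" keyword that is not the tail of "startxref".
XREF_TABLE = re.compile(rb"(?<!start)xref\b")


def find_start_xref(data: bytes):
    """Return the offset declared by the last startxref/startref, or None."""
    matches = list(START_XREF.finditer(data))
    return int(matches[-1].group(1)) if matches else None


def brute_force_xrefs(data: bytes):
    """Scan the whole file once and return xref table offsets in file order."""
    return [m.start() for m in XREF_TABLE.finditer(data)]


def locate_xref(data: bytes):
    """Use the declared offset when it points at a table, else brute-force."""
    declared = find_start_xref(data)
    if declared is not None and data[declared:declared + 4] == b"xref":
        return declared, "declared"
    found = brute_force_xrefs(data)
    # Later offsets win: the last table found in the file is the arbiter.
    return (found[-1], "brute-force") if found else (None, "missing")
```

a missing or invalid "startxref" is not an error here, it only changes which branch supplies the offset.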
* move more parsing to the static classes
* plumb through the new parsing results
* plug in new parser and remove old classes, port tests to new classes
* update tests to reflect logic changes
* apply correction when file header has offset
* ignore console runner launch settings
* skip offsets outside of file bounds
* fix parsing tables missing a line break
* use brute forced locations if they're already present
* only treat line breaks and spaces as whitespace for stream content
* address review comments
---------
Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
if parsing in lenient mode and encountering a malformed base name
(in this case 'helveticai') we fall back to times-roman as the adobe font
metrics file for a standard 14 font. this aligns with the behavior of pdfbox.
we also log a more informative error in non-lenient modes
this fixes document 0000086.pdf from the corpus
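
a rough sketch of the fallback described above, in python for illustration only. the function name and the exact-match lookup are assumptions; the real lookup may normalise or alias names differently.

```python
import logging

log = logging.getLogger("fonts")

# The fourteen standard PDF font base names.
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})
FALLBACK = "Times-Roman"


def resolve_standard14(base_name: str, lenient: bool = True) -> str:
    """Return the metrics name to use for a standard 14 font (hypothetical helper)."""
    if base_name in STANDARD_14:
        return base_name
    if lenient:
        log.warning("unknown base font %r, using %s metrics", base_name, FALLBACK)
        return FALLBACK
    # Non-lenient: fail, but say which name was rejected.
    raise ValueError(f"base font {base_name!r} is not a standard 14 font")
```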
* Enable nullable annotations
* Remove unused JetBrains annotations
* Ensure system using statements are first
* Improve nullability annotations
* Annotate encryptionDictionary is non-null when IsEncrypted is true
* Disable nullable for PdfTokenScanner.Get
* Improve nullability annotations for ObjectLocationProvider.TryGetCached
* Revert changes to RGBWorkingSpace
* Update UglyToad.PdfPig.Package with new framework targets (fixes nightly builds)
if skip missing fonts is set we want to read the file
as much as possible so we will also skip any missing
xobjects like images, forms or postscript code
these issues reported that parsing was failing due to a missing
token being referenced in the tounicode entry. since neither
issue included a sample file it's impossible to determine the
right fix accurately. however, since the tounicode entry is
optional in the spec, we can try being more lenient here. this
might just result in more errors once we try to use the font,
but logging the error will at least prevent parsing of the
entire document from failing
* Add Named Destinations to Catalog so that bookmarks and links can access
them.
The named destinations require access to page nodes, so created Pages object
that is made using PagesFactory (which contains the page-related code from
Catalog).
* Further implementation of destinations:
- Implement NamedDestinations in AnnotationProvider, so that we can look
up named destinations for annotations and turn them into explicit destinations.
Reused existing code inside BookmarksProvider to get destinations/actions.
- Added GoToE action
- According to the PDF reference, destinations are also required for
external destinations and hence for ExternalBookmarkNode. This allows us
to push up DocumentBookmarkNode.Destination to BookmarkNode.
* Implemented stateful appearance streams and integration test
* Added AppearanceStream to public API because it is used in the (public)
Annotation constructor
* After #552, must push down ExplicitDestination to DocumentBookmarkNode since it
does not apply to UriBookmarkNode.
* Added actions, which fits the PDF model better and works well with the
new bookmarks code (after PR #552)
* Rename Action to PdfAction + removed unused using in ActionProvider.cs
---------
Co-authored-by: mvantzet <mark@radialsg.com>
this doesn't fix the reported issue since the pdf itself is corrupted on page 8 however it will
allow recovery in some scenarios where text content isn't important.
also adds a more informative error when a stream is unintentionally passed with a non-zero offset
filter provider becomes single instance and no longer has constructor parameters.
tokenizers use list and stringbuilder pools to reduce allocations.
system font finder becomes static to preserve file cache across all documents.
most of the cross-reference code is the earliest code in the project and hasn't been revisited since then. the issue #88 has been reopened due to a bug with brute-force searching so this tidies up the code in this area ahead of trying to fix the bug.
lenient parsing gives us more code to maintain for no real benefit, parsing should always be as lenient as possible. remove the flag from some of the font code.
we add a try/catch to the direct object finder's tryget method so it returns false rather than throwing.
if we have an acro-form reference in the catalog but no corresponding object in the document we instead scan all objects in the document to find form fields and reconstruct the acro-form dictionary.
if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.
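
as a rough python illustration of the correction (the helper names and the 1024-byte search limit are assumptions, not taken from the change itself):

```python
def header_offset(data: bytes, search_limit: int = 1024) -> int:
    """Offset of the %PDF- header, or 0 when it is not found near the start."""
    index = data.find(b"%PDF-", 0, search_limit)
    return max(index, 0)


def corrected_offset(declared: int, data: bytes) -> int:
    """Shift a declared cross reference offset by the header's position."""
    return declared + header_offset(data)
```

for a file with 7 junk bytes before the header, a declared offset of 100 would be read from byte 107.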
if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c (compact font format) font, so we fall back to using the type1 parser.
also handles a closesubpath command appearing without any path construction operators.
an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding.
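
a small python sketch of the widened search, assuming the doubled range is 2048 bytes and using a hypothetical helper name:

```python
def find_eof_marker(data: bytes, search_range: int = 2048):
    """Return the offset of the last %%EOF within the final search_range bytes."""
    start = max(len(data) - search_range, 0)
    index = data.rfind(b"%%EOF", start)
    return index if index != -1 else None
```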
we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.
the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.
previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.
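
the behaviour can be pictured with this python sketch. the names are hypothetical and the `is_valid` callback stands in for the real password check against the encryption dictionary.

```python
class WrongPasswordError(Exception):
    """Raised only after every supplied password has been rejected."""


def first_valid_password(candidates, is_valid):
    """Return the first candidate accepted by is_valid, else raise once."""
    candidates = list(candidates)
    for password in candidates:
        if is_valid(password):
            return password
    raise WrongPasswordError(
        f"none of the {len(candidates)} supplied passwords were valid"
    )


def with_single_password(password, is_valid):
    """The single password option is just a one-item list."""
    return first_valid_password([password], is_valid)
```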
* cid cff fonts have multiple sub-fonts and multiple private dictionaries, in addition to a top level font and private dictionary. this fix uses the specific sub-dictionary when getting local subroutines on a per-glyph basis.
* chinese, japanese or korean fonts can use a ucs-2 encoding cmap for retrieving unicode values.
* add support for the additional glyph list for unicode values in true type fonts. adds nonmarkingreturn mapping to carriage return.
* makes font parsing classes static where there's no reason for them to be per-instance.