in document 10122 the font and xobject names are the same, so the
xobject overwrote references to the font for the page content. separate
the dictionaries so the two name spaces cannot collide
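a minimal sketch of the idea with hypothetical types (the real resource classes differ): font and xobject names live in separate pdf namespaces, so they must not share a single name-keyed lookup.

```csharp
using System.Collections.Generic;

// hypothetical sketch: per-category lookups mean "/F1" the font and
// "/F1" the xobject can coexist, instead of one dictionary keyed by
// name where the later entry overwrites the earlier one.
sealed class PageResourceStore
{
    public Dictionary<string, object> Fonts { get; } = new Dictionary<string, object>();

    public Dictionary<string, object> XObjects { get; } = new Dictionary<string, object>();
}
```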
this file contains corrupt content following an inline image, but other parsers
just treat this content as part of the image and parse the rest of the file
successfully
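a sketch of that lenient behaviour with hypothetical names (not the parser's actual code):

```csharp
using System;

static class InlineImageScanner
{
    // treat everything up to the next standalone "EI" token as image
    // data, so corrupt bytes after the image don't abort the parse.
    public static int FindEndImage(ReadOnlySpan<byte> data, int start)
    {
        for (var i = start; i < data.Length - 1; i++)
        {
            var isEi = data[i] == (byte)'E' && data[i + 1] == (byte)'I';
            var boundedBefore = i == 0 || IsWhitespace(data[i - 1]);
            var boundedAfter = i + 2 >= data.Length || IsWhitespace(data[i + 2]);

            if (isEi && boundedBefore && boundedAfter)
            {
                return i; // bytes before this offset belong to the image
            }
        }

        return -1;
    }

    private static bool IsWhitespace(byte b)
        => b == ' ' || b == '\r' || b == '\n' || b == '\t' || b == '\f' || b == 0;
}
```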
* remove alpha suffix; releases will increment the version
* update the master build job to draft a release
* add publish action to publish full release
* enable setting assembly and file version
* bump assembly and file version for package project
---------
Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
* add test for filebufferingreadstream
* #1124 do not trust reported stream length if bytes can be read at end
the filebufferingreadstream input stream does not report a length greater than
the bytes read so far. the change to seek the xref in a sliding window from
the end assumed the reported length was correct, which broke for such
streams. here we read the window as before but continue reading if we can
read beyond the stream's initially reported length while seeking the
startxref marker
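roughly the idea, with hypothetical names (the real code searches backwards in a sliding window rather than buffering the tail):

```csharp
using System;
using System.IO;
using System.Text;

static class StartXrefLocator
{
    // read a window at the reported end of the stream but keep reading
    // while bytes remain, since wrappers like FileBufferingReadStream
    // under-report Length until the body has been fully consumed.
    public static int FindStartXref(Stream stream, int windowSize = 2048)
    {
        stream.Seek(Math.Max(0, stream.Length - windowSize), SeekOrigin.Begin);

        byte[] tail;
        using (var memory = new MemoryStream())
        {
            var buffer = new byte[windowSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                memory.Write(buffer, 0, read); // may run past the reported length
            }

            tail = memory.ToArray();
        }

        // index within the tail window; the real code maps this back to
        // a file offset and parses the number following the marker.
        return LastIndexOf(tail, Encoding.ASCII.GetBytes("startxref"));
    }

    private static int LastIndexOf(byte[] haystack, byte[] needle)
    {
        for (var i = haystack.Length - needle.Length; i >= 0; i--)
        {
            var match = true;
            for (var j = 0; j < needle.Length && match; j++)
            {
                match = haystack[i + j] == needle[j];
            }

            if (match)
            {
                return i;
            }
        }

        return -1;
    }
}
```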
* remove rogue newlines
* move file parsing to single-pass static methods
for the file 0002973.pdf in the test corpus we need to completely overhaul
how initial xref parsing is done since we need to locate the xref stream by
brute-force and this is currently broken. i wanted to take this opportunity to
change the logic to be more imperative and less like the pdfbox methods with
instance data and classes.
currently the logic is split between the xref offset validator and parser methods
and we call the validator logic twice, followed by brute-force searching again
in the actual parser. we're going to move to a single method that performs
the following steps:
1. find the last occurrence of "startxref" (i.e. the first when scanning from the end)
and pull out the byte offset it gives. this will also support "startref" since some
files in the wild have that
2. go to that offset if found and parse the chain of tables or streams by /prev
reference
3. if any part of step 2 fails then we perform a single brute-force pass over the
entire file and, like pdfbox, treat xrefs appearing later in the file as the ultimate
arbiter of the object positions. while we do this we can also capture the actual
object offsets, since the declared xref positions are probably incorrect too.
the aim with this is to avoid as much seeking and re-reading of bytes as
possible. while this won't technically be single-pass it gets us much closer. it
also removes the stricter logic requiring a "startxref" token to exist and be
valid, since we can repair this by brute-force anyway.
we will surface as much information as possible from the static method so that
we could in future support an object explorer ui for pdfs.
this will also be more resilient to invalid xref formats with e.g. comment tokens
or missing newlines.
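a compilable skeleton of the intended shape; every name is hypothetical and the helpers are stubs standing in for the real logic:

```csharp
using System.Collections.Generic;

static class CrossReferenceParser
{
    public static Dictionary<long, long> Parse(byte[] file)
    {
        // step 1: find the last "startxref" (or malformed "startref")
        // and read the byte offset that follows it.
        var startXref = FindLastStartXrefOffset(file);

        // step 2: follow the /Prev chain of tables or streams.
        if (startXref >= 0 && TryParseChain(file, startXref, out var offsets))
        {
            return offsets;
        }

        // step 3: one brute-force pass over the whole file; xrefs later
        // in the file win, and the true object offsets observed along
        // the way are kept since the declared ones are suspect.
        return BruteForceScan(file);
    }

    private static long FindLastStartXrefOffset(byte[] file) => -1; // stub

    private static bool TryParseChain(byte[] file, long offset, out Dictionary<long, long> offsets)
    {
        offsets = new Dictionary<long, long>(); // stub
        return false;
    }

    private static Dictionary<long, long> BruteForceScan(byte[] file) => new Dictionary<long, long>(); // stub
}
```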
* move more parsing to the static classes
* plumb through the new parsing results
* plug in new parser and remove old classes, port tests to new classes
* update tests to reflect logic changes
* apply correction when file header has offset
* ignore console runner launch settings
* skip offsets outside of file bounds
* fix parsing tables missing a line break
* use brute-forced locations if they're already present
* only treat line breaks and spaces as whitespace for stream content
* address review comments
---------
Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
* Refactor letter handling by orientation for efficiency
Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
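Roughly the shape of the single pass, sketched with PdfPig's `Letter` and `TextOrientation` types (not the exact diff):

```csharp
// one iteration over the letters, bucketing by orientation, instead of
// filtering the whole collection once per orientation.
var horizontal = new List<Letter>(letters.Count);
var rotate270 = new List<Letter>();
var rotate180 = new List<Letter>();
var rotate90 = new List<Letter>();
var other = new List<Letter>();

foreach (var letter in letters)
{
    switch (letter.TextOrientation)
    {
        case TextOrientation.Horizontal: horizontal.Add(letter); break;
        case TextOrientation.Rotate270: rotate270.Add(letter); break;
        case TextOrientation.Rotate180: rotate180.Add(letter); break;
        case TextOrientation.Rotate90: rotate90.Add(letter); break;
        default: other.Add(letter); break;
    }
}
```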
* Update target frameworks to include net9.0
Expanded compatibility in `UglyToad.PdfPig.csproj` by adding
`net9.0` to the list of target frameworks, alongside existing
versions.
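Illustratively, in `UglyToad.PdfPig.csproj` (the pre-existing targets shown here are a guess, not the project's exact set):

```xml
<PropertyGroup>
  <!-- net9.0 appended to the existing list of targets -->
  <TargetFrameworks>netstandard2.0;net6.0;net8.0;net9.0</TargetFrameworks>
</PropertyGroup>
```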
* Add .NET 9.0 support and refactor key components
Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.
Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance.
Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.
Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support.
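The bucket-and-parallelise idea, sketched with hypothetical locals (`GetWords` here stands in for the per-orientation word builder):

```csharp
// one bucket per orientation, each processed in parallel with its
// output written to a matching slot (no shared mutable state).
var buckets = new List<Letter>[5];
for (var i = 0; i < buckets.Length; i++)
{
    buckets[i] = new List<Letter>();
}

// ...fill the buckets in a single pass as sketched earlier...

var results = new IReadOnlyList<Word>[buckets.Length];
Parallel.For(0, buckets.Length, i =>
{
    results[i] = GetWords(buckets[i]);
});
```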
* Improve null checks and optimize list handling
- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.
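In miniature (illustrative fragment; `words` and `results` stand in for the locals in those methods):

```csharp
// materialize the input once as an array rather than re-enumerating...
Word[] wordArray = words as Word[] ?? words.ToArray();

// ...and release the result list's spare capacity once it is complete.
results.TrimExcess();
```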
---------
Co-authored-by: Chuck Beasley <CBeasley@kilpatricktownsend.com>
as long as there is a pages entry we accept this in lenient parsing mode. this
is to fix document 006705.pdf in the corpus, which had '/calalog' as the dictionary
entry.
also adds a test for some weird content in 0006324.pdf where numbers
seem to get split in the content stream at the decimal point. this is
just to check that our parser doesn't hard crash
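the lenient check is roughly this shape (hypothetical names; the real token and exception types differ):

```csharp
using System.Collections.Generic;

// accept a root whose /Type is wrong or missing, e.g. the misspelled
// /calalog, as long as a /Pages entry is still present.
static bool IsAcceptableRoot(IReadOnlyDictionary<string, object> root, bool isLenientParsing)
{
    if (root.TryGetValue("Type", out var type) && "Catalog".Equals(type))
    {
        return true;
    }

    return isLenientParsing && root.ContainsKey("Pages");
}
```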
* update readme to avoid people using `page.Text` or asking about editing docs
we need to be more clear because beloved chat gpt falls into the trap of
recommending `page.Text` when asked about the library even though this
text is usually the wrong field to use
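for example, the kind of usage the readme should steer people toward:

```csharp
using System;
using UglyToad.PdfPig;

using (var document = PdfDocument.Open("file.pdf"))
{
    var page = document.GetPage(1);

    // page.Text is a raw concatenation and usually the wrong field;
    // words and letters carry the positions most callers actually need.
    foreach (var word in page.GetWords())
    {
        Console.WriteLine($"{word.Text} at {word.BoundingBox}");
    }
}
```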
* tabs to spaces
* rogue tab
- a file contained 2 indices pointing to '.notdef' for the character name, so
we just take the first rather than requiring a single match
- a file contained '/' (an empty name) as the subtype declaration, so we fall back
to trying type 1 and truetype parsing in this situation
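a sketch of the subtype fallback (hypothetical shape; the parsers are passed in here just to keep the example self-contained):

```csharp
using System;

// an empty subtype name ('/') means the declared format cannot be
// trusted, so try type 1 parsing and fall back to truetype.
static T ParseFontProgram<T>(
    byte[] data,
    string subtype,
    Func<byte[], T> parseType1,
    Func<byte[], T> parseTrueType)
{
    if (subtype == "Type1") return parseType1(data);
    if (subtype == "TrueType") return parseTrueType(data);

    try
    {
        return parseType1(data);
    }
    catch
    {
        return parseTrueType(data);
    }
}
```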
in #1082 and other issues relating to annotations we're running into
constraints of the current model of building a pdf document. currently
we skip all link-type annotations; i think we can support copying of links
where the link destination is outside the current document. however, the
more i look at this code the more i think we need a radical redesign of
how document building is done, because it has been pushed far beyond
its original design. i'll detail my thinking in the related pr
the existing numeric tokenizer involved allocations and string parsing. since
the number formats in pdf files are fairly predictable we can improve this
substantially
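a simplified sketch of parsing straight from the bytes, with no substring or double.Parse in the common case (the real tokenizer covers more malformed inputs):

```csharp
using System;

// parse a pdf integer/real directly from the byte span.
static double ReadNumber(ReadOnlySpan<byte> input, ref int i)
{
    var negative = false;
    if (i < input.Length && (input[i] == '-' || input[i] == '+'))
    {
        negative = input[i] == '-';
        i++;
    }

    double value = 0;
    while (i < input.Length && input[i] >= '0' && input[i] <= '9')
    {
        value = (value * 10) + (input[i] - '0');
        i++;
    }

    if (i < input.Length && input[i] == '.')
    {
        i++;
        var scale = 0.1;
        while (i < input.Length && input[i] >= '0' && input[i] <= '9')
        {
            value += (input[i] - '0') * scale;
            scale *= 0.1;
            i++;
        }
    }

    return negative ? -value : value;
}
```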
in order to remove reflection from core content stream operator
construction we ensure all types covered by the operations dictionary
have corresponding switch statement support. this moves the remaining
2 operators to the switch statement. fix for #1062
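the pattern in miniature, with stand-in types (the operator names follow the pdf spec but the dispatch shown is illustrative, not the library's exact code):

```csharp
// explicit switch dispatch instead of reflection over operator types;
// every type in the operations dictionary must have a case here.
interface IOperation { }

sealed class BeginText : IOperation { public static readonly BeginText Value = new BeginText(); }

sealed class EndText : IOperation { public static readonly EndText Value = new EndText(); }

static IOperation Create(string op)
{
    switch (op)
    {
        case "BT": return BeginText.Value;
        case "ET": return EndText.Value;
        // ...remaining operators...
        default: return null;
    }
}
```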
we encountered a fence constructed in the middle of a field for an unknown
reason, so we demolished it. i think this was intended to catch flaws in the
parser logic, but the reality is that in a pdf anything can happen, so we no
longer want to catch these issues. this restores a green test run in debug mode.
fix for #915
when adding a page to a builder from an existing document, using either the
addpage or copyfrom methods, the added page's content stream can contain
a global transform matrix change that will subsequently change the locations
of any modifications made by the user. here, whenever using an existing stream,
we apply the inverse of any active transformation matrix.
there could be a bug here: if you use 'copy from' with a global transform
active, we then apply the inverse, and if you use 'copy from' again to the same
destination page our inverse transform is now active and could potentially
affect the second stream, but I don't think it will
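for reference, the inverse of an affine pdf matrix [a b 0; c d 0; e f 1], which is what would be emitted as a cm operation to cancel the active transform (sketch):

```csharp
// invert the active affine transform so emitting it as "cm" restores
// default page space before any user drawing.
static double[] InvertAffine(double a, double b, double c, double d, double e, double f)
{
    var det = (a * d) - (b * c);

    return new[]
    {
        d / det, -b / det,
        -c / det, a / det,
        ((c * f) - (d * e)) / det, ((b * e) - (a * f)) / det,
    };
}
```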