1749 Commits

Author SHA1 Message Date
ricflams
c28d114b79 Guard against circular references in XRef tables/streams
- Detect and prevent an xref table/stream at a certain offset from being read twice; malformed xref tables with circular references could otherwise cause the table-reading to loop forever.
- Another approach could be to prevent TryReadTableAtOffset from moving the bytes' CurrentOffset to lastObjPosition while it attempts to read a table (e.g. by restoring CurrentOffset after the attempt), so the outer bytes-loop could continue its search through the entire input unaffected.
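The guard described in the first bullet can be sketched as follows. This is a minimal illustration in Python, not PdfPig's C# code: `read_xref_chain`, `read_section_at`, and the `"prev"` key are hypothetical names, and the sketch assumes the loop follows previous-section links from one offset to the next.

```python
def read_xref_chain(start_offset, read_section_at):
    """Follow links between xref sections, never reading one offset twice."""
    visited = set()
    sections = []
    offset = start_offset
    while offset is not None:
        if offset in visited:
            # Circular reference: this offset was already read, so stop
            # instead of looping forever.
            break
        visited.add(offset)
        section = read_section_at(offset)
        if section is None:
            break
        sections.append(section)
        offset = section.get("prev")
    return sections


# Hypothetical malformed file: the section at 100 points to 200,
# and the section at 200 points back to 100.
fake_file = {100: {"prev": 200}, 200: {"prev": 100}}
chain = read_xref_chain(100, fake_file.get)
```

Without the `visited` set the loop above would never terminate on `fake_file`; with it, each section is read exactly once.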
nightly-latest
2025-10-01 06:32:38 +01:00
Richard Flamsholt
d7d01f842e Update test Issue874: No longer missing a font
Including the stream-xref means that the formerly missing font is no longer missing, so simply run the two test-cases under the (stricter) assumption of SkipMissingFonts=false.
2025-09-30 18:35:45 +01:00
Richard Flamsholt
33a8d829ee Update test Issue874: Also more text on page 2
Page two has had four more characters added, which is now detected by this xref-stream fix
2025-09-30 18:35:45 +01:00
Richard Flamsholt
57921c7e9b Update test Issue874: Now finds more text on page 1
With the fix for including associated streams, this test now finds more text on the first page. I've verified using Aspose.PDF and by viewing the ErcotFacts.pdf file being tested that yes, it was indeed missing part of the text before.
2025-09-30 18:35:45 +01:00
ricflams
5a6b3970f0 Add table-xref's associated stream-xrefs
- If an XrefTable has an associated stream, as indicated via the XrefStm-property, then read and add that XrefStream
- Any table can have 0 or 1 such associated streams
- A caveat: such an associated stream might also theoretically be part of the Parts-sequence in which case it would be encountered both by looping through all those parts along with all the regular tables and now also by association to any of those tables. It doesn't seem harmful since the offsets are flattened eventually anyway and stored by their offset-key into a mapping-table.
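Why the duplicate in the caveat is probably harmless can be shown with a small sketch. It is written in Python rather than PdfPig's C#, and the part layout (`"offsets"` mapping object numbers to byte offsets) and the `flatten` helper are assumptions made for illustration only.

```python
def flatten(parts):
    """Merge xref parts into a single object-number -> offset mapping."""
    merged = {}
    for part in parts:
        merged.update(part["offsets"])
    return merged


table = {"kind": "table", "offsets": {1: 15, 2: 120}}
stream = {"kind": "stream", "offsets": {7: 900, 8: 950}}

# The stream is seen once as a regular part...
once = flatten([table, stream])
# ...or twice: as a regular part and again via the table's XRefStm entry.
twice = flatten([table, stream, stream])
```

Because the merge is keyed, adding the same stream a second time rewrites identical entries and leaves the result unchanged.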
2025-09-30 18:35:45 +01:00
ricflams
397ccb15d6 Add xref-streams tied to any parts, not just the first
On a large sample of pdf-files PdfPig failed to read the correct StructTree-object for about 1% of them. The StructTree object was simply missing in the CrossReferenceTable.CrossReferenceTable.
It turned out that the constructed CrossReferenceTable could miss Stream-parts if there were multiple Table-parts because a stream will only be added if it's associated with the very first Table-part. The remedy would seem to be to check for and add streams that are associated with any of the Table-parts, not just the first one.
On a sample of 72 files where this failed, this change fixed the StructTree for all of them.
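The difference between the old and the new behaviour can be sketched like this. The code is Python, not the library's C#; `collect_parts`, the `"xref_stm"` key, and the `only_first` flag are hypothetical and exist only to contrast the two behaviours described above.

```python
def collect_parts(tables, streams_by_offset, only_first=False):
    """Return tables plus their associated xref streams, in order."""
    parts = []
    for index, table in enumerate(tables):
        parts.append(table)
        stm_offset = table.get("xref_stm")
        if stm_offset is None:
            continue
        if only_first and index > 0:
            # Behaviour before the fix, as described in the message:
            # streams of later tables are silently skipped.
            continue
        stream = streams_by_offset.get(stm_offset)
        if stream is not None:
            parts.append(stream)
    return parts


tables = [{"name": "t0"}, {"name": "t1", "xref_stm": 500}]
streams_by_offset = {500: {"name": "s500", "objects": ["StructTreeRoot"]}}

old = collect_parts(tables, streams_by_offset, only_first=True)
new = collect_parts(tables, streams_by_offset)
```

In this made-up input the stream holding the structure tree hangs off the second table, so only the `new` result contains it.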
2025-09-30 18:35:45 +01:00
BobLd
ca284e0cb9 Use pageFactoryCache.Clear() in Pages dispose and fix #1170 2025-09-28 17:18:00 +01:00
BobLd
b2f4ca8839 Add GetDescent() and GetAscent() methods to IFont, improve font matrix for TrueTypeSimpleFont and TrueTypeStandard14FallbackSimpleFont and add loose bounding box to Letter
2025-09-21 15:07:52 +01:00
BobLd
008959457a Expose letter's font via GetFont(), mark Font property as obsolete and use FontDetails instead
2025-09-20 17:11:38 +01:00
BobLd
a53d96cb73 Use record struct in FileHeaderOffset 2025-09-20 13:45:50 +01:00
EliotJones
efdedb9495 handle case where offsets are out of range
default to returning empty glyph where the offset is out of the
file length range, this fixes file 12623 where the truetype file
is completely broken
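A bounds check of this kind might look like the sketch below. It is Python, not the actual TrueType parser; `read_glyph` and `EMPTY_GLYPH` are hypothetical names, and representing the empty glyph as an empty byte string is an assumption.

```python
EMPTY_GLYPH = b""


def read_glyph(font_bytes, offset, length):
    """Return the glyph bytes, or an empty glyph if the range is invalid."""
    if offset < 0 or length < 0 or offset + length > len(font_bytes):
        # Offset points outside the file: default to an empty glyph
        # rather than failing on a broken font.
        return EMPTY_GLYPH
    return font_bytes[offset:offset + length]


inside = read_glyph(b"abcdef", 2, 3)
outside = read_glyph(b"abcdef", 10, 3)
```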
2025-09-14 15:26:12 +01:00
BobLd
eb906a776d Handle non seekable stream by copying it into a memory stream and fix #1146 2025-09-14 14:42:59 +01:00
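The idea of buffering a non-seekable input can be sketched in Python (the library itself is C#, and `ensure_seekable` and `NonSeekable` are hypothetical names used only for this illustration):

```python
import io


class NonSeekable(io.BytesIO):
    """Test double that behaves like a forward-only stream."""

    def seekable(self):
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")


def ensure_seekable(stream):
    """Return the stream itself, or an in-memory copy if it cannot seek."""
    if stream.seekable():
        return stream
    # Copy everything that can be read into memory so later code can seek.
    return io.BytesIO(stream.read())


buffered = ensure_seekable(NonSeekable(b"%PDF-1.7"))
buffered.seek(0, io.SEEK_END)
size = buffered.tell()
```

The trade-off is memory: the whole input is held in RAM, which is the price of random access on a forward-only source.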
BobLd
44e638ee4d Add initial support to process CFF fonts contained inside a TrueType font 2025-09-14 11:32:32 +01:00
BobLd
304d7dde5a Use correct font matrix when transforming the width in Type 0 font and fix #1156 2025-09-14 08:22:58 +01:00
Eliot Jones
07df6fd740 read last line of ignore file (#1155)
* read last line of ignore file

- do not cancel other matrix jobs if one test fails
- read all lines of the ignore list even if it doesn't end with a newline
- add ignore list for 0008 and 0009

* support missing object numbers when brute-forcing

the file 10404 (ironically) contains not found references with number 43 0
for its info dictionary. changes brute-force code so that objects can be
entirely missing

* fix test since document is now opened successfully but mediabox is broken
2025-09-13 16:57:35 +02:00
Eliot Jones
c96880ac61 handle case where xobjects use same key as fonts (#1154)
in document 10122 the font and xobject names are the same so the
xobject overwrote references to the font for the page content, separate
the dictionaries
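The name collision is easy to reproduce in a sketch. The resource name `F1` and the values below are invented for illustration, and the code is Python rather than the library's C#.

```python
# One shared dictionary: the XObject registered under "F1" replaces the font.
shared = {}
shared["F1"] = ("font", "Helvetica")
shared["F1"] = ("xobject", "Image 12")

# Separate dictionaries per resource type: the same name can exist in both.
fonts = {"F1": "Helvetica"}
xobjects = {"F1": "Image 12"}
```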
2025-09-13 16:49:24 +02:00
EliotJones
77db6c6b54 add test jobs for common crawl 0000 to 0007 2025-09-13 14:52:04 +01:00
EliotJones
e886ae648f copy other parser behavior by treating end of stream as valid end inline image
this file contains corrupt content following an inline image but other parsers
just treat this content as part of the image and parse the rest of the file
successfully
2025-09-13 14:36:14 +01:00
BobLd
c4f442c0cd Properly fix #1148 by always parsing optional tables in TrueTypeFontParser and remove Type 0 font hack 2025-09-13 12:48:20 +01:00
BobLd
0ef120dc5c Properly handle CompactFontFormatCidFont font matrix and fix #1149 2025-09-13 10:38:35 +01:00
BobLd
d5b97065bd Fix #1148 2025-09-13 10:38:35 +01:00
BobLd
22eab422a3 First create the StreamInputBytes in PdfDocument.Open() to check the stream CanRead and CanSeek
2025-09-09 19:12:58 +01:00
Eliot Jones
8408c98aec Draft release on master build (#1145)
* remove alpha postfix, releases will increment version

* update the master build job to draft a release

* add publish action to publish full release

* enable setting assembly and file version

* bump assembly and file version for package project

---------

Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
2025-09-08 20:07:36 +01:00
Eliot Jones
dd5aa46c75 File buffering read stream investigation (#1140)
* add test for filebufferingreadstream

* #1124 do not trust reported stream length if bytes can be read at end

the filebufferingreadstream input stream does not report more than the read
length. the change to seek the xref in a sliding window from the end broke
with the assumption that the reported length was correct. here we switch to
reading the window or continue reading if we can read beyond the stream's
initially reported length while seeking the startxref marker
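a rough sketch of the idea, in python rather than the library's c#. `UnderReportingStream`, `reported_length` and `find_startxref` are hypothetical names, and the real code uses a sliding window from the end, which this simplified version does not reproduce: it reads the last window before the reported length and then keeps reading for as long as bytes are available.

```python
import io


class UnderReportingStream(io.BytesIO):
    """Test double: claims fewer bytes than it can actually deliver."""

    def __init__(self, data, reported_length):
        super().__init__(data)
        self.reported_length = reported_length


def find_startxref(stream, window=1024):
    """Return the offset of the last 'startxref' marker, or None."""
    start = max(0, stream.reported_length - window)
    stream.seek(start)
    tail = stream.read(window)
    # Do not trust the reported length: keep reading while data arrives.
    while True:
        chunk = stream.read(window)
        if not chunk:
            break
        tail += chunk
    index = tail.rfind(b"startxref")
    return None if index < 0 else start + index


data = b"%PDF-1.7\n" + b"x" * 50 + b"startxref\n9\n%%EOF"
position = find_startxref(UnderReportingStream(data, 20), window=16)
```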

* remove rogue newlines
2025-09-07 14:39:46 +01:00
BobLd
e4ed4d1b39 Add early version of IOSSystemFontLister
2025-09-02 19:53:12 +01:00
Eliot Jones
0afe021ad3 move file parsing to single-pass static methods (#1102)
* move file parsing to single-pass static methods

for the file 0002973.pdf in the test corpus we need to completely overhaul
how initial xref parsing is done since we need to locate the xref stream by
brute-force and this is currently broken. i wanted to take this opportunity to
change the logic to be more imperative and less like the pdfbox methods with
instance data and classes.

currently the logic is split between the xref offset validator and parser methods
and we call the validator logic twice, followed by brute-force searching again
in the actual parser. we're going to move to a single method that performs
the following steps:

1. find the first (from the end) occurrence of "startxref" and pull out the location
in bytes. this will also support "startref" since some files in the wild have that
2. go to that offset if found and parse the chain of tables or streams by /prev
reference
3. if any element in step 2 fails then we perform a single brute-force over the
entire file and, like pdfbox, treat xrefs located later in the file as the ultimate arbiter
of the object positions. while we do this we potentially can capture the actual
object offsets since the xref positions are probably incorrect too.
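steps 1 and 3 above can be sketched roughly as follows. this is python, not the actual static methods; `find_declared_offset` and `locate_xrefs` are hypothetical names, step 2 (following the /prev chain) is left out, and only classic `xref` tables are searched for, not xref streams.

```python
import re

# Accepts both "startxref" and the misspelt "startref" seen in the wild.
START_MARKER = re.compile(rb"startx?ref\s+(\d+)")
# An "xref" keyword that is not the tail of "startxref".
XREF_KEYWORD = re.compile(rb"(?<!start)xref")


def find_declared_offset(data):
    """Step 1: the offset named by the last start marker, if any."""
    matches = START_MARKER.findall(data)
    return int(matches[-1]) if matches else None


def locate_xrefs(data):
    """Use the declared offset if it is valid, else brute-force the file."""
    declared = find_declared_offset(data)
    if declared is not None and data[declared:declared + 4] == b"xref":
        return "declared", [declared]
    # Step 3: one pass over the whole file; later entries come last so a
    # caller can let them override earlier ones.
    return "brute-force", [m.start() for m in XREF_KEYWORD.finditer(data)]


good = b"%PDF-1.4\nxref\n0 1\ntrailer\nstartxref\n9\n%%EOF"
bad = b"%PDF-1.4\nxref\n0 1\ntrailer\nstartref\n999\n%%EOF"
good_result = locate_xrefs(good)
bad_result = locate_xrefs(bad)
```

in `bad` the marker is misspelt and points outside the file, so the declared offset is rejected and the table is found by the brute-force pass instead.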

the aim with this is to avoid as much seeking and re-reading of bytes as
possible. while this won't technically be single-pass it gets us much closer. it
also removes the more strict logic requiring a "startxref" token to exist and be
valid, since we can repair this by brute-force anyway.

we will surface as much information as possible from the static method so that
we could in future support an object explorer ui for pdfs.

this will also be more resilient to invalid xref formats with e.g. comment tokens
or missing newlines.

* move more parsing to the static classes

* plumb through the new parsing results

* plug in new parser and remove old classes, port tests to new classes

* update tests to reflect logic changes

* apply correction when file header has offset

* ignore console runner launch settings

* skip offsets outside of file bounds

* fix parsing tables missing a line break

* use brute forced locations if they're already present

* only treat line breaks and spaces as whitespace for stream content

* address review comments

---------

Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
2025-09-02 19:41:00 +01:00
Karl
3650e27432 add container node support for BookmarksProvider.cs (#1133)
* add container node support for BookmarksProvider.cs

* move position

* fixed unittest error

* revert package name

* remove duplicated package info.
2025-08-14 21:17:58 +01:00
BobLd
a43b968ea9 Lower max search depth in preventing StackOverflow in ParseTrailer 2025-08-10 10:06:23 +01:00
BobLd
1031dcc221 Prevent StackOverflow in ParseTrailer and fix #1122 2025-08-09 08:46:04 +01:00
BobLd
0f641774e6 Update build_and_test_macos.yml 2025-08-09 08:33:34 +01:00
BobLd
a3edc926c8 Update build_and_test_macos.yml 2025-08-09 08:21:21 +01:00
BobLd
f1923fcbcd Increase FlateFilter multiplier when preventing malicious OOM and fix #1125 2025-08-08 19:04:31 +01:00
EliotJones
7ff58893af only run tests if nightly publish needed 2025-08-04 21:46:13 -05:00
EliotJones
bee6f13888 fix tag fetching and parse behavior 2025-08-04 21:40:28 -05:00
EliotJones
e6dd2d15c2 use gemini to mark ched gpt's work and improve the action 2025-08-04 21:00:12 -05:00
EliotJones
7dd5d68be3 prevent duplicate package publish on manual run, attempt 1 2025-08-04 20:49:18 -05:00
BobLd
bdf3b8e2b4 Update nightly_release.yml 2025-08-03 20:03:13 +01:00
BobLd
c8dff885bd Update run_common_crawl_tests.yml 2025-08-03 08:56:17 +01:00
BobLd
0b228c57b7 Update run_integration_tests.yml 2025-08-03 08:52:27 +01:00
BobLd
ef21227b3c Update run_integration_tests.yml 2025-08-03 08:46:40 +01:00
BobLd
b9f2230a0a Add global.json in tools 2025-08-03 08:43:58 +01:00
BobLd
b6950a5fb0 Update run_integration_tests.yml (#1117) 2025-08-03 08:34:50 +01:00
Chuck B.
1ed9e017f4 Performance improvements and .Net 9 support (#1116)
* Refactor letter handling by orientation for efficiency

Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
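The single-pass bucketing could look roughly like this. The sketch is Python, not the library's C#; `bucket_by_orientation` and the dictionary-based letters are hypothetical, and only the five orientation names come from the description above.

```python
ORIENTATIONS = ("horizontal", "rotate270", "rotate180", "rotate90", "other")


def bucket_by_orientation(letters):
    """Sort letters into one list per orientation in a single iteration."""
    buckets = {name: [] for name in ORIENTATIONS}
    for letter in letters:
        orientation = letter["orientation"]
        key = orientation if orientation in buckets else "other"
        buckets[key].append(letter)
    return buckets


letters = [
    {"value": "a", "orientation": "horizontal"},
    {"value": "b", "orientation": "rotate90"},
    {"value": "c", "orientation": "skewed"},  # unknown, lands in "other"
]
buckets = bucket_by_orientation(letters)
```

Each bucket can then be handed to word extraction once, instead of filtering the full collection separately for every orientation.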

* Update target frameworks to include net9.0

Expanded compatibility in `UglyToad.PdfPig.csproj` by adding
`net9.0` to the list of target frameworks, alongside existing
versions.

* Add .NET 9.0 support and refactor key components

Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.

Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance.

Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.

Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support.

* Improve null checks and optimize list handling

- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.

---------

Co-authored-by: Chuck Beasley <CBeasley@kilpatricktownsend.com>
2025-08-01 22:24:16 +01:00
EliotJones
83d6fc6cc2 allow missing catalog type definition for catalog dictionary
as long as there is a pages entry we accept this in lenient parsing mode. this
is to fix document 006705.pdf in the corpus that had '/calalog' as the dictionary
entry.

also adds a test for some weird content stream content in 0006324.pdf where
numbers seem to get split in the content stream on a decimal place. this is
just to check that our parser doesn't hard crash
2025-07-27 02:55:29 +01:00
theolivenbaum
febfa4d4b3 Fix usage of List.Contains 2025-07-27 02:52:56 +01:00
Eliot Jones
0ebbe0540d add nullability to core project (#1111) 2025-07-27 02:48:58 +01:00
EliotJones
52c0635273 support performance profiling information in console runner 2025-07-26 15:04:03 -05:00
EliotJones
b6bd0a3169 bump version to 0.1.12-alpha001 2025-07-26 13:43:28 -05:00
EliotJones
3d2e12cb16 version 0.1.11 v0.1.11 2025-07-26 13:16:01 -05:00
Eliot Jones
9cb3b71e62 update readme to avoid people using page.Text or asking about editing docs (#1109)
* update readme to avoid people using `page.Text` or asking about editing docs

we need to be more clear because beloved chat gpt falls into the trap of
recommending `page.Text` when asked about the library even though this
text is usually the wrong field to use

* tabs to spaces

* rogue tab
2025-07-26 18:58:35 +01:00