Commit Graph

1723 Commits

Author SHA1 Message Date
Karl
3650e27432
add container node support for BookmarksProvider.cs (#1133)
* add container node support for BookmarksProvider.cs

* move position

* fixed unittest error

* revert package name

* remove duplicated package info.
2025-08-14 21:17:58 +01:00
BobLd
a43b968ea9 Lower max search depth to prevent StackOverflow in ParseTrailer 2025-08-10 10:06:23 +01:00
BobLd
1031dcc221 Prevent StackOverflow in ParseTrailer and fix #1122 2025-08-09 08:46:04 +01:00
BobLd
0f641774e6 Update build_and_test_macos.yml 2025-08-09 08:33:34 +01:00
BobLd
a3edc926c8 Update build_and_test_macos.yml 2025-08-09 08:21:21 +01:00
BobLd
f1923fcbcd Increase FlateFilter multiplier when preventing malicious OOM and fix #1125 2025-08-08 19:04:31 +01:00
EliotJones
7ff58893af only run tests if nightly publish needed 2025-08-04 21:46:13 -05:00
EliotJones
bee6f13888 fix tag fetching and parse behavior 2025-08-04 21:40:28 -05:00
EliotJones
e6dd2d15c2 use gemini to mark chat gpt's work and improve the action 2025-08-04 21:00:12 -05:00
EliotJones
7dd5d68be3 prevent duplicate package publish on manual run, attempt 1 2025-08-04 20:49:18 -05:00
BobLd
bdf3b8e2b4 Update nightly_release.yml 2025-08-03 20:03:13 +01:00
BobLd
c8dff885bd Update run_common_crawl_tests.yml 2025-08-03 08:56:17 +01:00
BobLd
0b228c57b7 Update run_integration_tests.yml 2025-08-03 08:52:27 +01:00
BobLd
ef21227b3c Update run_integration_tests.yml 2025-08-03 08:46:40 +01:00
BobLd
b9f2230a0a Add global.json in tools 2025-08-03 08:43:58 +01:00
BobLd
b6950a5fb0
Update run_integration_tests.yml (#1117) 2025-08-03 08:34:50 +01:00
Chuck B.
1ed9e017f4
Performance improvements and .Net 9 support (#1116)
* Refactor letter handling by orientation for efficiency

Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.

* Update target frameworks to include net9.0

Expanded compatibility in `UglyToad.PdfPig.csproj` by adding
`net9.0` to the list of target frameworks, alongside existing
versions.

* Add .NET 9.0 support and refactor key components

Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.

Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance.

Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.

Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support.

* Improve null checks and optimize list handling

- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.

---------

Co-authored-by: Chuck Beasley <CBeasley@kilpatricktownsend.com>
2025-08-01 22:24:16 +01:00
EliotJones
83d6fc6cc2 allow missing catalog type definition for catalog dictionary
as long as there is a pages entry we accept this in lenient parsing mode. this
is to fix document 006705.pdf in the corpus that had '/calalog' as the dictionary
entry.

also adds a test for some weird content stream content in 0006324.pdf where
numbers seem to get split in the content stream on a decimal place. this is
just to check that our parser doesn't hard crash
2025-07-27 02:55:29 +01:00
theolivenbaum
febfa4d4b3 Fix usage of List.Contains 2025-07-27 02:52:56 +01:00
Eliot Jones
0ebbe0540d
add nullability to core project (#1111) 2025-07-27 02:48:58 +01:00
EliotJones
52c0635273 support performance profiling information in console runner 2025-07-26 15:04:03 -05:00
EliotJones
b6bd0a3169 bump version to 0.1.12-alpha001 2025-07-26 13:43:28 -05:00
EliotJones
3d2e12cb16 version 0.1.11 2025-07-26 13:16:01 -05:00
Eliot Jones
9cb3b71e62
update readme to avoid people using page.Text or asking about editing docs (#1109)
* update readme to avoid people using `page.Text` or asking about editing docs

we need to be more clear because beloved chat gpt falls into the trap of
recommending `page.Text` when asked about the library even though this
text is usually the wrong field to use

* tabs to spaces

* rogue tab
2025-07-26 18:58:35 +01:00
EliotJones
27df4af5f9 handle additional broken pdf files in the common crawl set
- a file contained 2 indices pointing to '.notdef' for the character name so
we just take the first rather than requiring a single
- a file contained '/' (empty name) as the subtype declaration, so we fall back
to trying type 1 and truetype parsing in this situation
2025-07-26 18:55:29 +01:00
EliotJones
50f878b2ba restore copy link func logic 2025-07-25 18:18:22 +01:00
EliotJones
2a10b6c285 make link copying more tolerant when adding page
in #1082 and other issues relating to annotations we're running into
constraints of the current model of building a pdf document. currently
we skip all link type annotations, i think we can support copying of links
where the link destination is outside the current document. however the
more i look at this code the more i think we need a radical redesign of
how document building is done because it has been pushed far beyond
its current capabilities. i'll detail my thinking in the related pr
2025-07-25 18:18:22 +01:00
EliotJones
85fc63d585 rework numeric tokenizer hot path
the existing numeric tokenizer involved allocations and string parsing. since
the number formats in pdf files are fairly predictable we can improve this
substantially
2025-07-25 18:12:43 +01:00
EliotJones
5abdfcb96c fix test case due to field renaming 2025-07-20 20:33:46 +01:00
EliotJones
00ca268092 move last uncovered operators to switch statement
in order to remove reflection from the core content stream operators
construction we ensure all types covered by the operations dictionary
have corresponding switch statement support. this moves the remaining
2 operators to the switch statement. fix for #1062
2025-07-20 20:33:46 +01:00
BobLd
813d3baa18 Track IndirectReference instead of only ObjectNumber when checking for cycles during indirect reference resolution and add test 2025-07-20 19:24:31 +01:00
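The idea behind tracking the full reference rather than only the object number can be sketched as follows. This is a minimal illustration in Python, not the library's C# code: the `IndirectReference` dataclass, the `resolve` helper and the dictionary-based object table are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndirectReference:
    """Hypothetical stand-in for a PDF indirect reference (object number + generation)."""
    object_number: int
    generation: int


def resolve(reference, objects):
    """Follow a chain of indirect references until a direct value is reached.

    `objects` maps IndirectReference -> value (assumed structure for this sketch).
    The visited set holds whole references, so two references that share an
    object number but differ in generation are not mistaken for a cycle.
    """
    visited = set()
    current = reference
    while isinstance(current, IndirectReference):
        if current in visited:
            raise ValueError(f"cycle detected at {current}")
        visited.add(current)
        current = objects[current]
    return current
```

Keying the visited set on the object number alone would report a false cycle for a chain such as `1 0 R` pointing to `1 1 R`, which is the case the pair-based check avoids.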
EliotJones
2b11961c8c remove debug asserts causing test failures
we encountered a fence constructed in the middle of a field for an unknown
reason so we demolished it. i think this was intended to catch flaws in the
parser logic but the reality is in a pdf anything can happen so we no longer
want to catch these issues and this restores a green test run in debug mode.

fix for #915
2025-07-20 17:42:34 +01:00
EliotJones
efb8c2a803 i merged a pr which broke the build, this updates the build to work
move all arguments to add page to a setting object so it can be extended
in future in a non-breaking api change
2025-07-20 17:36:19 +01:00
jan-sutter
e636212ec8
check for cycles during indirect reference resolution (#1097)
Co-authored-by: Jan Sutter <jan@suttermail.de>
2025-07-20 11:12:55 -05:00
EnraH
3b318e1944
add option to strip annotation (#492)
* add option to strip annotation

* fix implementation and tests

---------

Co-authored-by: arne.hansen <arne.hansen@digitecgalaxus.ch>
Co-authored-by: Eliot Jones <elioty@hotmail.co.uk>
2025-07-20 11:10:15 -05:00
EliotJones
377eb507e8 when writing content to an existing page invert any global transform #614
when adding a page to a builder from an existing document using either
addpage or copyfrom methods the added page's content stream can contain
a global transform matrix change that will subsequently change all the locations
of any modifications made by the user. here whenever using an existing stream
we apply the inverse of any active transformation matrix

there could be a bug here where if you use 'copy from' with a global transform
active, we then apply the inverse, and you use 'copy from' again to the same
destination page our inverse transform is now active and could potentially
affect the second stream, but I don't think it will
2025-07-20 00:53:03 +01:00
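For readers unfamiliar with the matrix arithmetic involved, the inverse of a PDF-style affine matrix `[a b c d e f]` can be sketched as below. This is an illustrative Python sketch under the usual PDF convention (x' = a·x + c·y + e, y' = b·x + d·y + f); the function names `invert` and `apply` are hypothetical and do not come from the library.

```python
def invert(matrix):
    """Return the inverse of an affine matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return (
        d / det,
        -b / det,
        -c / det,
        a / det,
        (c * f - d * e) / det,
        (b * e - a * f) / det,
    )


def apply(matrix, point):
    """Transform a point (x, y) using x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)
```

Applying a matrix and then its inverse returns the original point, which is why emitting the inverse before user content cancels a global transform left active by the copied stream.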
BobLd
ff4e763192 Update hack for 1bpc + DeviceGray 2025-07-19 21:45:41 +01:00
BobLd
6a06452103 Remove decode parameter application from Stencil color space for consistency 2025-07-19 13:45:06 +01:00
BobLd
a5e92cd11c Update run_common_crawl_tests.yml 2025-07-19 12:21:10 +01:00
EliotJones
4bf746c747 add new action to run integration against common crawl corpus 2025-07-19 11:49:34 +01:00
EliotJones
bffd51425d support bfrange having incorrect length in a cmap
the corpus file 0001413.pdf has an off-by-one error in its count
for cmap bfranges. here we exit early if an unexpected
endbfrange operator is encountered. this matches the pdfbox
behavior:

067d56e4db/fontbox/src/main/java/org/apache/fontbox/cmap/CMapParser.java (L373)
2025-07-19 11:48:04 +01:00
Eliot Jones
e3388ec6b6
fix colorspace error when form xobject contains a transparency group (#1088)
* fix colorspace error when form xobject contains a transparency group

when a form xobject contains a reference to a group xobject this can only
be used to change attributes of the transparency imaging model. the old
code was setting the main colorspaces incorrectly causing errors when the
transparency component had a different number of channels. this was
causing #1071 in addition to the failure in file 0000355.pdf of the test corpus

* add master integration tests for corpus group 0000

* tidy up actions

* remove invalid reference in echo

* move new action to different branch
2025-07-19 11:46:56 +01:00
EliotJones
31658ca020 allow reading to continue if encountering an invalid surrogate pair
investigating the corpus at
https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/

the input file 0000000.pdf contained a utf-16 surrogate pair in an input
defined as ucs2. the approach of various parsers varies here, adobe
acrobat seems to hard crash, pdf js returns the same text we now
parse, chrome parses the intended text (2 invalid characters and
"ib exam"). we don't care too much about matching chrome exactly
so doing the same as firefox is fine here
2025-07-16 07:45:40 +01:00
EliotJones
1021729727 fall back to times-roman as standard 14 font when lenient
if parsing in lenient mode and encountering a malformed base name
(in this case 'helveticai') we fallback to times-roman as the adobe font
metrics file for a standard 14 font. this aligns with the behavior of pdfbox.
we also log a more informative error in non-lenient modes

this fixes document 0000086.pdf from the corpus
2025-07-16 07:43:49 +01:00
Eliot Jones
9503f9c137
fix off-by-one and optimize brute force xref search #1078 (#1079)
* fix off-by-one and optimize brute force xref search #1078

when performing a brute force xref search we were ending up
off-by-one, update the search to use a ring buffer to reduce
seeking and fix xref detection

* make method testable and add test coverage

* normalize test input on other platforms

* seal circular buffer class
2025-07-16 07:35:24 +01:00
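A ring buffer search of the kind described can be sketched roughly as follows. This is a hedged Python illustration, not the project's implementation: `find_token_offsets` and its parameters are hypothetical, and the sketch deliberately leaves out any filtering of `startxref`, which also contains the bytes `xref`.

```python
import io


def find_token_offsets(stream, token: bytes, chunk_size: int = 4096) -> list:
    """Scan a binary stream forward once and return every offset where token starts.

    The last len(token) bytes are kept in a fixed-size ring buffer, so the
    stream is never seeked backwards to re-read bytes.
    """
    n = len(token)
    if n == 0:
        raise ValueError("token must not be empty")
    ring = bytearray(n)
    offsets = []
    position = 0  # total number of bytes consumed so far
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for byte in chunk:
            ring[position % n] = byte
            position += 1
            if position >= n:
                start = position % n  # index of the oldest byte in the ring
                if ring[start:] + ring[:start] == token:
                    # The match began n bytes before the current position.
                    offsets.append(position - n)
    return offsets
```

The offset is derived from the byte count at the moment of the match minus the token length, which is the place where an off-by-one error is easiest to introduce.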
Eliot Jones
016b754c5b
back-calculate first char if last char and widths present (#1081)
* back-calculate first char if last char and widths present

when a truetype font has a last char and widths array in its font
dictionary the first char can be calculated #644

* fix off by 1 in last char calculation
2025-07-14 21:57:01 +01:00
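The back-calculation follows from the widths array holding one entry per character code from the first char to the last char inclusive. A minimal Python sketch, with the hypothetical helper name `back_calculate_first_char`:

```python
def back_calculate_first_char(last_char: int, widths: list) -> int:
    """Derive the first char code from the last char code and the widths array.

    The widths array has (last_char - first_char + 1) entries, so
    first_char = last_char - len(widths) + 1.
    """
    if not widths:
        raise ValueError("widths must contain at least one entry")
    return last_char - len(widths) + 1
```

Dropping the `+ 1` is the off-by-one that the follow-up fix in this commit refers to: a font with a single width and last char 65 has first char 65, not 64.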
Eliot Jones
de3b6ac6f4
use correct bounding boxes for standard 14 glyphs #850 (#1080)
* use correct bounding boxes for standard 14 glyphs #850

previously every bounding box for type 1 standard 14 fonts was assumed
to start at 0,0 and ignored the bounding box data in the font metrics file.
now we correctly read the glyph bounding box while preserving the
existing advance width values for advancing the renderer position

* update test case for new logic
2025-07-14 21:54:42 +01:00
EliotJones
b11f936f22 fix copying of sub-dictionary when keys collide
when copying from an ancestor node of a page's resource dictionary
we were incorrectly writing nested nodes of e.g. /fonts to the root
of the target dictionary, here we write to the intended target node
correctly
2025-07-10 18:32:20 +01:00
EliotJones
7fe60ff8c3 skip single letter final blocks
align with the behavior of pdfbox and c implementations where
single character final blocks are ignored rather than being written.
also makes the error more informative in case it is ever encountered
again.

add more test cases.

it is possible this is hiding the problem and will move the error elsewhere
but this matches the implementation behavior of the 2 reference
implementations. one other potential source for the error is if pdf supports
'<~' as a start of data marker which i can't find in the spec but wikipedia
says might be possible? without documents to trigger the error i think
this is the best fix for now
2025-07-09 07:33:12 +01:00
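The behavior described, ignoring a final group that holds only one character, can be sketched in a small ASCII85 decoder. This is an illustrative Python sketch written for this note, not the library's filter; the function name `decode_ascii85` is hypothetical, and error handling for out-of-range groups is left to Python's own exceptions.

```python
def decode_ascii85(data: bytes) -> bytes:
    """Decode ASCII85 data, silently skipping a single-character final group."""
    out = bytearray()
    group = []
    for byte in data:
        ch = chr(byte)
        if ch.isspace():
            continue
        if ch == "~":
            break  # start of the '~>' end-of-data marker
        if ch == "z" and not group:
            out += b"\x00\x00\x00\x00"  # 'z' is shorthand for four zero bytes
            continue
        if not "!" <= ch <= "u":
            raise ValueError(f"invalid ASCII85 character: {ch!r}")
        group.append(byte - 33)
        if len(group) == 5:
            out += _group_value(group).to_bytes(4, "big")
            group = []
    if len(group) > 1:
        # A final group of n characters (2 to 4) encodes n - 1 bytes;
        # pad with the highest digit (84, i.e. 'u') before decoding.
        n = len(group)
        padded = group + [84] * (5 - n)
        out += _group_value(padded).to_bytes(4, "big")[: n - 1]
    # A final group of exactly one character cannot encode any byte,
    # so it is ignored rather than treated as an error.
    return bytes(out)


def _group_value(digits) -> int:
    """Interpret five base-85 digits as one big-endian 32-bit value."""
    value = 0
    for digit in digits:
        value = value * 85 + digit
    return value
```

A single trailing character carries fewer than eight bits of information, so there is no byte it could decode to; skipping it loses nothing that a stricter decoder could have recovered.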
EliotJones
781991b6bf fix #670 by ignoring duplicate endstream definitions
when parsing a stream object with multiple endstream tokens
the last parsed token was selected instead of the actual stream
token so instead we just skip all following tokens if the first
is a stream and the following tokens are `endstream` operators
only
2025-07-07 20:34:26 +01:00