PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-10-08 00:14:35 +08:00

Author	SHA1	Message	Date
EliotJones	14af2a3858	fix test since document is now opened successfully but mediabox is broken	2025-09-13 16:48:39 +02:00
EliotJones	853ce8b93e	support missing object numbers when brute-forcing the file 10404 (ironically) contains not found references with number 43 0 for its info dictionary. changes brute-force code so that objects can be entirely missing	2025-09-13 16:27:09 +02:00
EliotJones	c57cd5008b	read last line of ignore file - do not cancel other matrix jobs if one test fails - read all lines of the ignore list even if it doesn't end with a newline - add ignore list for 0008 and 0009	2025-09-13 16:20:01 +02:00
EliotJones	77db6c6b54	add test jobs for common crawl 0000 to 0007	2025-09-13 14:52:04 +01:00
EliotJones	e886ae648f	copy other parser behavior by treating end of stream as valid end inline image this file contains corrupt content following an inline image but other parsers just treat this content as part of the image and parse the rest of the file successfully	2025-09-13 14:36:14 +01:00
BobLd	c4f442c0cd	Properly fix #1148 by always parsing optional tables in TrueTypeFontParser and remove Type 0 font hack	2025-09-13 12:48:20 +01:00
BobLd	0ef120dc5c	Properly handle CompactFontFormatCidFont font matrix and fix #1149	2025-09-13 10:38:35 +01:00
BobLd	d5b97065bd	Fix #1148	2025-09-13 10:38:35 +01:00
BobLd	22eab422a3	First create the StreamInputBytes in PdfDocument.Open() to check the stream CanRead and CanSeek	2025-09-09 19:12:58 +01:00
Eliot Jones	8408c98aec	Draft release on master build (#1145 ) * remove alpha postfix, releases will increment version * update the master build job to draft a release * add publish action to publish full release * enable setting assembly and file version * bump assembly and file version for package project --------- Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>	2025-09-08 20:07:36 +01:00
Eliot Jones	dd5aa46c75	File buffering read stream investigation (#1140) * add test for filebufferingreadstream * #1124 do not trust reported stream length if bytes can be read at end the filebufferingreadstream input stream does not report more than the read length. the change to seek the xref in a sliding window from the end broke with the assumption that the reported length was correct. here we switch to reading the window or continue reading if we can read beyond the stream's initially reported length while seeking the startxref marker * remove rogue newlines	2025-09-07 14:39:46 +01:00
BobLd	e4ed4d1b39	Add early version of IOSSystemFontLister	2025-09-02 19:53:12 +01:00
Eliot Jones	0afe021ad3	move file parsing to single-pass static methods (#1102 ) * move file parsing to single-pass static methods for the file 0002973.pdf in the test corpus we need to completely overhaul how initial xref parsing is done since we need to locate the xref stream by brute-force and this is currently broken. i wanted to take this opportunity to change the logic to be more imperative and less like the pdfbox methods with instance data and classes. currently the logic is split between the xref offset validator and parser methods and we call the validator logic twice, followed by brute-force searching again in the actual parser. we're going to move to a single method that performs the following steps: 1. find the first (from the end) occurrence of "startxref" and pull out the location in bytes. this will also support "startref" since some files in the wild have that 2. go to that offset if found and parse the chain of tables or streams by /prev reference 3. if any element in step 2 fails then we perform a single brute-force over the entire file and like pdfbox treat later in file-length xrefs as the ultimate arbiter of the object positions. while we do this we potentially can capture the actual object offsets since the xref positions are probably incorrect too. the aim with this is to avoid as much seeking and re-reading of bytes as possible. while this won't technically be single-pass it gets us much closer. it also removes the more strict logic requiring a "startxref" token to exist and be valid, since we can repair this by brute-force anyway. we will surface as much information as possible from the static method so that we could in future support an object explorer ui for pdfs. this will also be more resilient to invalid xref formats with e.g. comment tokens or missing newlines. 
* move more parsing to the static classes * plumb through the new parsing results * plug in new parser and remove old classes, port tests to new classes * update tests to reflect logic changes * apply correction when file header has offset * ignore console runner launch settings * skip offsets outside of file bounds * fix parsing tables missing a line break * use brute forced locations if they're already present * only treat line breaks and spaces as whitespace for stream content * address review comments --------- Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>	2025-09-02 19:41:00 +01:00
Karl	3650e27432	add container node support for BookmarksProvider.cs (#1133 ) * add container node support for BookmarksProvider.cs * move position * fixed unittest error * revert package name * remove duplicated package info.	2025-08-14 21:17:58 +01:00
BobLd	a43b968ea9	Lower max search depth in preventing StackOverflow in ParseTrailer	2025-08-10 10:06:23 +01:00
BobLd	1031dcc221	Prevent StackOverflow in ParseTrailer and fix #1122	2025-08-09 08:46:04 +01:00
BobLd	0f641774e6	Update build_and_test_macos.yml	2025-08-09 08:33:34 +01:00
BobLd	a3edc926c8	Update build_and_test_macos.yml	2025-08-09 08:21:21 +01:00
BobLd	f1923fcbcd	Increase FlateFilter multiplier when preventing malicious OOM and fix #1125	2025-08-08 19:04:31 +01:00
EliotJones	7ff58893af	only run tests if nightly publish needed	2025-08-04 21:46:13 -05:00
EliotJones	bee6f13888	fix tag fetching and parse behavior	2025-08-04 21:40:28 -05:00
EliotJones	e6dd2d15c2	use gemini to mark ched gpt's work and improve the action	2025-08-04 21:00:12 -05:00
EliotJones	7dd5d68be3	prevent duplicate package publish on manual run, attempt 1	2025-08-04 20:49:18 -05:00
BobLd	bdf3b8e2b4	Update nightly_release.yml	2025-08-03 20:03:13 +01:00
BobLd	c8dff885bd	Update run_common_crawl_tests.yml	2025-08-03 08:56:17 +01:00
BobLd	0b228c57b7	Update run_integration_tests.yml	2025-08-03 08:52:27 +01:00
BobLd	ef21227b3c	Update run_integration_tests.yml	2025-08-03 08:46:40 +01:00
BobLd	b9f2230a0a	Add global.json in tools	2025-08-03 08:43:58 +01:00
BobLd	b6950a5fb0	Update run_integration_tests.yml (#1117 )	2025-08-03 08:34:50 +01:00
Chuck B.	1ed9e017f4	Performance improvements and .Net 9 support (#1116 ) * Refactor letter handling by orientation for efficiency Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection. * Update target frameworks to include net9.0 Expanded compatibility in `UglyToad.PdfPig.csproj` by adding `net9.0` to the list of target frameworks, alongside existing versions. * Add .NET 9.0 support and refactor key components Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features. Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance. Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency. Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support. * Improve null checks and optimize list handling - Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance. - Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration. - Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage. --------- Co-authored-by: Chuck Beasley <CBeasley@kilpatricktownsend.com>	2025-08-01 22:24:16 +01:00
EliotJones	83d6fc6cc2	allow missing catalog type definition for catalog dictionary as long as there is a pages entry we accept this in lenient parsing mode. this is to fix document 006705.pdf in the corpus that had '/calalog' as the dictionary entry. also adds a test for some weird content stream content in 0006324.pdf where numbers seem to get split in the content stream on a decimal place. this is just to check that our parser doesn't hard crash	2025-07-27 02:55:29 +01:00
theolivenbaum	febfa4d4b3	Fix usage of List.Contains	2025-07-27 02:52:56 +01:00
Eliot Jones	0ebbe0540d	add nullability to core project (#1111)	2025-07-27 02:48:58 +01:00
EliotJones	52c0635273	support performance profiling information in console runner	2025-07-26 15:04:03 -05:00
EliotJones	b6bd0a3169	bump version to 0.1.12-alpha001	2025-07-26 13:43:28 -05:00
EliotJones	3d2e12cb16	version 0.1.11 v0.1.11	2025-07-26 13:16:01 -05:00
Eliot Jones	9cb3b71e62	update readme to avoid people using `page.Text` or asking about editing docs (#1109 ) * update readme to avoid people using `page.Text` or asking about editing docs we need to be more clear because beloved chat gpt falls into the trap of recommending `page.Text` when asked about the library even though this text is usually the wrong field to use * tabs to spaces * rogue tab	2025-07-26 18:58:35 +01:00
EliotJones	27df4af5f9	handle additional broken pdf files in the common crawl set - a file contained 2 indices pointing to '.notdef' for the character name so we just take the first rather than requiring a single - a file contained '/' (empty name) as the subtype declaration, so we fall back to trying type 1 and truetype parsing in this situation	2025-07-26 18:55:29 +01:00
EliotJones	50f878b2ba	restore copy link func logic	2025-07-25 18:18:22 +01:00
EliotJones	2a10b6c285	make link copying more tolerant when adding page in #1082 and other issues relating to annotations we're running into constraints of the current model of building a pdf document. currently we skip all link type annotations, i think we can support copying of links where the link destination is outside the current document. however the more i look at this code the more i think we need a radical redesign of how document building is done because it has been pushed far beyond its current capabilities, i'll detail my thinking in the related pr in more detail	2025-07-25 18:18:22 +01:00
EliotJones	85fc63d585	rework numeric tokenizer hot path the existing numeric tokenizer involved allocations and string parsing. since the number formats in pdf files are fairly predictable we can improve this substantially	2025-07-25 18:12:43 +01:00
EliotJones	5abdfcb96c	fix test case due to field renaming	2025-07-20 20:33:46 +01:00
EliotJones	00ca268092	move last uncovered operators to switch statement in order to remove reflection from the core content stream operators construction we ensure all types covered by the operations dictionary have corresponding switch statement support. this moves the remaining 2 operators to the switch statement. fix for #1062	2025-07-20 20:33:46 +01:00
BobLd	813d3baa18	Track IndirectReference instead of only ObjectNumber when checking for cycles during indirect reference resolution and add test	2025-07-20 19:24:31 +01:00
EliotJones	2b11961c8c	remove debug asserts causing test failures we encountered a fence constructed in the middle of a field for an unknown reason so we demolished it. i think this was intended to catch flaws in the parser logic but the reality is in a pdf anything can happen so we no longer want to catch these issues and this restores a green test run in debug mode. fix for #915	2025-07-20 17:42:34 +01:00
EliotJones	efb8c2a803	i merged a pr which broke the build, this updates the build to work move all arguments to add page to a setting object so it can be extended in future in a non-breaking api change	2025-07-20 17:36:19 +01:00
jan-sutter	e636212ec8	check for cycles during indirect reference resolution (#1097 ) Co-authored-by: Jan Sutter <jan@suttermail.de>	2025-07-20 11:12:55 -05:00
EnraH	3b318e1944	add option to strip annotation (#492 ) * add option to strip annotation * fix implementation and tests --------- Co-authored-by: arne.hansen <arne.hansen@digitecgalaxus.ch> Co-authored-by: Eliot Jones <elioty@hotmail.co.uk>	2025-07-20 11:10:15 -05:00
EliotJones	377eb507e8	when writing content to an existing page inverse any global transform #614 when adding a page to a builder from an existing document using either addpage or copyfrom methods the added page's content stream can contain a global transform matrix change that will subsequently change all the locations of any modifications made by the user. here whenever using an existing stream we apply the inverse of any active transformation matrix there could be a bug here where if you use 'copy from' with a global transform active, we then apply the inverse, and you use 'copy from' again to the same destination page our inverse transform is now active and could potentially affect the second stream, but I don't think it will	2025-07-20 00:53:03 +01:00
BobLd	ff4e763192	Update hack for 1bpc + DeviceGray	2025-07-19 21:45:41 +01:00


1736 Commits
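
Note on commit 0afe021ad3: the message describes locating the last "startxref" (or misspelt "startref") marker, skipping offsets outside the file bounds, and falling back to a brute-force scan when that fails. The following is a minimal, language-neutral sketch of that idea in Python. It is not PdfPig's C# implementation: the function names, the regular expressions, and the choice to let later object definitions override earlier ones during the scan are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical patterns, not taken from PdfPig.
# "startxref" or the misspelt "startref", whitespace, then a decimal offset.
_START_MARKER = re.compile(rb"startx?ref\s+(\d+)")
# An object header such as "12 0 obj" (object number, generation, keyword).
_OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def find_startxref_offset(data: bytes) -> Optional[int]:
    """Return the offset declared by the last marker in the file, or None.

    None means the caller should fall back to the brute-force scan.
    """
    last = None
    for match in _START_MARKER.finditer(data):
        last = match  # keep the occurrence closest to the end of the file
    if last is None:
        return None
    offset = int(last.group(1))
    # Offsets outside the file bounds cannot be trusted.
    if offset >= len(data):
        return None
    return offset


def brute_force_object_offsets(data: bytes) -> Dict[Tuple[int, int], int]:
    """Scan the whole file once and record where each object header starts.

    Assumption: a later definition of the same (number, generation) pair
    replaces an earlier one.
    """
    offsets: Dict[Tuple[int, int], int] = {}
    for match in _OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start(1)
    return offsets
```

A caller would try `find_startxref_offset` first and only call `brute_force_object_offsets` when it returns `None` or when parsing at the returned offset fails, which keeps re-reading of the file to a minimum.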