PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2026-03-10 00:23:29 +08:00

Author	SHA1	Message	Date
BobLd	7c4f5e2424	Introduce StackDepthGuard class to check for stack depth in CoreTokenScanner and fix #1217	2025-12-23 16:24:04 +01:00
BobLd	ee0cb1dc4a	Use file header offset when doing brute force find and fix #1223	2025-12-07 13:43:22 +00:00
BobLd	40bcc22ea1	Add CMap caching at document level and add MurmurHash3 hashing function	2025-10-26 16:20:27 +00:00
Bert Huijben	6fba565d66	Avoid doing a true file seek for simple peeking the next char in the token parser	2025-10-20 06:33:34 +01:00
Bert Huijben	3592fc8438	Use zlib information to verify compressed content before using it	2025-10-15 18:46:36 +01:00
BobLd	eb906a776d	Handle non seekable stream by copying it into a memory stream and fix #1146	2025-09-14 14:42:59 +01:00
BobLd	22eab422a3	First create the StreamInputBytes in PdfDocument.Open() to check the stream CanRead and CanSeek	2025-09-09 19:12:58 +01:00
Eliot Jones	0afe021ad3	move file parsing to single-pass static methods (#1102 ) * move file parsing to single-pass static methods for the file 0002973.pdf in the test corpus we need to completely overhaul how initial xref parsing is done since we need to locate the xref stream by brute-force and this is currently broken. i wanted to take this opportunity to change the logic to be more imperative and less like the pdfbox methods with instance data and classes. currently the logic is split between the xref offset validator and parser methods and we call the validator logic twice, followed by brute-force searching again in the actual parser. we're going to move to a single method that performs the following steps: 1. find the first (from the end) occurrence of "startxref" and pull out the location in bytes. this will also support "startref" since some files in the wild have that 2. go to that offset if found and parse the chain of tables or streams by /prev reference 3. if any element in step 2 fails then we perform a single brute-force over the entire file and like pdfbox treat later in file-length xrefs as the ultimate arbiter of the object positions. while we do this we potentially can capture the actual object offsets since the xref positions are probably incorrect too. the aim with this is to avoid as much seeking and re-reading of bytes as possible. while this won't technically be single-pass it gets us much closer. it also removes the more strict logic requiring a "startxref" token to exist and be valid, since we can repair this by brute-force anyway. we will surface as much information as possible from the static method so that we could in future support an object explorer ui for pdfs. this will also be more resilient to invalid xref formats with e.g. comment tokens or missing newlines. 
* move more parsing to the static classes * plumb through the new parsing results * plug in new parser and remove old classes, port tests to new classes * update tests to reflect logic changes * apply correction when file header has offset * ignore console runner launch settings * skip offsets outside of file bounds * fix parsing tables missing a line break * use brute forced locations if they're already present * only treat line breaks and spaces as whitespace for stream content * address review comments --------- Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>	2025-09-02 19:41:00 +01:00
EliotJones	1021729727	fall back to times-roman as standard 14 font when lenient if parsing in lenient mode and encountering a malformed base name (in this case 'helveticai') we fallback to times-roman as the adobe font metrics file for a standard 14 font. this aligns with the behavior of pdfbox. we also log a more informative error in non-lenient modes this fixes document 0000086.pdf from the corpus	2025-07-16 07:43:49 +01:00
BobLd	73ce5bbb73	Make classes related to page content parsing public	2025-06-28 13:17:40 +01:00
BobLd	e10609e4e1	Use pdfScanner in ReadVerticalDisplacements and fix #693 and return 0 in CMap on exception in ReadByte() if useLenientParsing is true and fix #692	2024-10-19 00:29:42 +01:00
BobLd	8cee4f480f	Introduce ParsingOptions.FilterProvider and BaseFilterProvider and make CcittFaxCompressionType a byte	2024-10-17 20:27:24 +01:00
Jason Nelson	6d54355754	Spanify filters	2024-04-12 07:42:19 +01:00
Jason Nelson	a412a239be	Enable nullable annotations (#803 ) * Enable nullable annotations * Remove unused Jetbrain annotations * Ensure system using statements are first * Improve nullability annotations * Annotate encryptionDictionary is non-null when IsEncrypted is true * Disable nullable for PdfTokenScanner.Get * Improve nullability annotations for ObjectLocationProvider.TryGetCached * Revert changes to RGBWorkingSpace * Update UglyToad.PdfPig.Package with new framework targets (fixes nightly builds)	2024-03-17 18:51:40 +00:00
Jason Nelson	95f0459900	Prefer is null to == null ensures that an equals overload isn't used, and we don't compare structs	2024-03-16 12:37:51 +00:00
BobLd	acfe8b5fdd	Allow lenient parsing in DictionaryTokenizer and fix #791	2024-03-11 20:01:07 +00:00
BobLd	63096de210	Add IPageFactory to the public API, remove InternalParsingOptions	2023-10-25 20:03:02 +01:00
BobLd	ba865b340e	Make IResourceStore part of the public API and pass InternalParsingOptions to the ResourceStore constructor	2023-10-22 19:16:41 +01:00
Mark van 't Zet	e3f281435a	Fix for #662: when encountering invalid content, try to continue parsing if option "useLenientParsing" is in effect.	2023-09-29 10:44:54 +01:00
BobLd	9aaf20ceb4	Address #672 to ignore errors while reading the descriptor file in CidFontFactory	2023-08-05 16:21:01 +01:00
Eliot Jones	6f59bed9a2	use pdfdocencoding when parsing strings	2023-06-04 16:40:43 +01:00
Eliot Jones	fba1cbc13c	skip missing objects if skip fonts is true #298 if skip missing fonts is set we want to read the file as much as possible so we will also skip any missing xobjects like images, forms or postscript code	2023-05-27 10:46:29 +01:00
Eliot Jones	0b8252e930	do not require tounicode to be valid even if present #354 #619 these issues reported that parsing was failing due to a missing token being reference in the tounicode entry. since neither issue included a sample file it's impossible to determine the right fix accurately, however since the tounicode entry is optional in the spec we can try being more lenient here, this might just result in more errors once we try to use the font but the logger will at least prevent parsing the entire document failing	2023-05-21 12:34:27 +01:00
Eliot Jones	6b9c3be9f8	tidy up some small formatting issues	2023-05-21 12:20:57 +01:00
BobLD	b8a98fbed2	Properly implement color spaces	2023-04-12 07:25:09 +01:00
mvantzet	0e39bc0b76	Annotations named destinations (#579 ) * Add Named Destinations to Catalog so that bookmarks and links can access them. The named destinations require access to page nodes, so created Pages object that is made using PagesFactory (which contains the page-related code from Catalog). * Further implementation of destinations: - Implement NamedDestinations in AnnotationProvider, so that we can look up named destinations for annotations and turn them into explicit destinations. Reused existing code inside BookmarksProvider to get destinations/actions. - Added GoToE action - According to the PDF reference, destinations are also required for external destinations and hence for ExternalBookmarkNode. This allows us to push up DocumentBookmarkNode.Destination to BookmarkNode. * Implemented stateful appearance streams and integration test * Added AppearanceStream to public API because it is used in the (public) Annotation constructor * After #552, must push down ExplicitDestination do DocumentBookmarkNode since it does not apply to UriBookmarkNode. * Added actions, which fits the PDF model better and works well with the new bookmarks code (after PR #552) * Rename Action to PdfAction + removed unused using in ActionProvider.cs --------- Co-authored-by: mvantzet <mark@radialsg.com>	2023-04-10 17:14:14 +01:00
Eliot Jones	e2246a88bb	#482 add skip missing fonts option and pass parsing options to content stream processor this doesn't fix the reported issue since the pdf itself is corrupted on page 8 however it will allow recovery in some scenarios where text content isn't important. also adds more informative error when stream unintentionally passed with non zero offset	2022-10-09 13:44:05 -04:00
Eliot Jones	2a68670896	#443 handle case where file version comment token included in string by tokenization instead just brute force the raw content	2022-04-24 12:37:26 -04:00
Eliot Jones	cbd02a270f	don't throw if no information dictionary if lenient parsing	2022-04-14 20:46:36 -04:00
Eliot Jones	83948f42d7	#405 check encryption token value for null	2022-01-11 16:13:52 +00:00
Eliot Jones	9ae0a5ec15	allow stream filters to contain indirect references to name tokens	2021-04-25 16:22:22 -04:00
BobLd	f91acefcfa	Set ClipPaths to false if no ParsingOptions given (consistent behaviour)	2020-04-27 17:21:52 +01:00
Eliot Jones	27e251f921	make filter provider and filter public and use tryget for image bytes	2020-04-25 09:42:24 +01:00
BobLd	a759a99389	Move ClipPaths option from GetPage() to ParsingOptions	2020-04-05 17:58:57 +01:00
Eliot Jones	f1be6634a7	add a bunch more performance improvements filter provider becomes single instance and no longer has constructor parameters. tokenizers use list and stringbuilder pools to reduce allocations. system font finder becomes static to preserve file cache across all documents.	2020-04-05 15:34:47 +01:00
Eliot Jones	58972de7cb	begin to rework cross-reference parsing most of the cross-reference code is the earliest code in the project and hasn't been revisited since then. the issue #88 has been reopened due to a bug with brute-force searching so this tidies up the code in this area ahead of trying to fix the bug.	2020-03-03 15:21:11 +00:00
Inusual	013cbd14e0	Make CrossReferenceTableParser a static class	2020-03-02 17:00:16 +00:00
Eliot Jones	c864fa512c	remove islenientparsing from page classes	2020-02-28 11:50:18 +00:00
Eliot Jones	746cbfa30c	remove lenient parsing from font related classes lenient parsing gives us more code to maintain for no real benefit, parsing should always be as lenient as possible. remove the flag from some of the font code.	2020-02-27 18:10:02 +00:00
Eliot Jones	4150881be9	recover from invalid acro-form references we add a try/catch to the direct object finder's tryget method so it returns false rather than throwing. if we have an acro-form reference in the catalog but no corresponding object in the document we instead scan all objects in the document to find form fields and reconstruct the acro-form dictionary.	2020-02-27 12:08:40 +00:00
Eliot Jones	693a3d5958	use offset to file header to correct cross references if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.	2020-01-26 15:30:20 +00:00
Eliot Jones	63b118b141	handle type1 fonts disguised as truetype if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser. also handles a closesubpath command appearing without any path construction operators.	2020-01-07 16:49:21 +00:00
Eliot Jones	0b048fde57	handle eof further back in file an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding. we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.	2020-01-07 11:48:09 +00:00
Eliot Jones	b29354e3e6	move compact font format fonts to fonts project	2020-01-05 12:08:01 +00:00
Eliot Jones	74774995d6	complete move of truetype, afm and standard14 fonts the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.	2020-01-04 22:39:13 +00:00
Eliot Jones	7c0ef111ea	move classes to new projects to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts are moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing standalone version of the font library as pdfbox does with fontbox.	2020-01-04 16:38:18 +00:00
Eliot Jones	4d697e3669	allow the user to supply multiple passwords for decryption previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.	2019-12-20 15:11:05 +00:00
Eliot Jones	c30cd1b96d	use cid font subroutines where applicable. add ucs 2 cmap support for type 1 fonts * cid cff fonts have multiple sub-fonts and multiple private dictionaries, in addition to a top level font and private dictionary. this fix uses the specific sub-dictionary when getting local subroutines on a per-glyph basis. * chinese, japanese or korean fonts can use a ucs-2 encoding cmap for retrieving unicode values. * add support for the additional glyph list for unicode values in true type fonts. adds nonmarkingreturn mapping to carriage return. * makes font parsing classes static where there's no reason for them to be per-instance.	2019-12-19 13:33:44 +00:00
Eliot Jones	ecf0b8743b	make bookmarknode immutable and use scanner when retrieving bookmarks	2019-12-05 12:03:30 +00:00
Eliot Jones	2ef45f71d5	make missing acroform types public and start improving data also changes pages to use a proper tree structure since this will be required for resource inheritance and for acroform widget dictionaries.	2019-10-09 14:28:37 +01:00


79 Commits