* read last line of ignore file
- do not cancel other matrix jobs if one test fails
- read all lines of the ignore list even if it doesn't end with a newline
- add ignore list for 0008 and 0009
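the trailing-newline problem above can be sketched as follows. this is a minimal illustration in python, not the project's test tooling; the function name `read_ignore_list` and the sample entries are hypothetical.

```python
def read_ignore_list(text: str) -> list[str]:
    """Return every non-empty entry of an ignore list.

    Hypothetical helper: str.splitlines() yields the final line whether or
    not the text ends with a newline, so the last entry is never dropped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


# The second entry has no trailing newline and is still returned.
entries = read_ignore_list("first-entry\nsecond-entry")
```

a reader that only yields newline-terminated lines would silently lose `second-entry` here.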
* support missing object numbers when brute-forcing
the file 10404 (ironically) contains a reference, 43 0, to an object that
cannot be found, for its info dictionary. this changes the brute-force code
so that objects can be entirely missing
* fix test since document is now opened successfully but mediabox is broken
we encountered a fence constructed in the middle of a field for an unknown
reason, so we demolished it. i think this was intended to catch flaws in the
parser logic, but the reality is that in a pdf anything can happen, so we no
longer want to catch these issues. this restores a green test run in debug mode.
fix for #915
when parsing a stream object with multiple endstream tokens,
the last parsed token was selected instead of the actual stream
token. instead we now skip all following tokens if the first
is a stream and the following tokens are `endstream` operators
only
the file provided in issue #926 contains the following syntax
in pdf object streams:
```
% 750 0 obj
<< >>
```
currently we read the comment token and skip the rest.
however, this producer is writing nonsense to the stream:
comment tokens are only valid outside streams in pdf files.
so we align to the behavior of pdfbox here by skipping the
entire line containing a comment inside a stream, which fixes
parsing this file.
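a minimal sketch of the line-skipping idea, in python rather than the project's own code. the function name `skip_comment_lines` is hypothetical, and the sketch makes a simplifying assumption: it only drops lines that begin with `%` (after optional whitespace) and does not handle a `%` that appears later in a line or inside a string literal.

```python
def skip_comment_lines(data: bytes) -> bytes:
    """Drop whole lines that start with a '%' comment marker.

    Simplified illustration: lines are split on CR, LF or CRLF and kept
    unchanged unless their first non-whitespace byte is '%'.
    """
    kept = []
    for line in data.splitlines(keepends=True):
        if line.lstrip().startswith(b"%"):
            continue  # the entire comment line is skipped
        kept.append(line)
    return b"".join(kept)


# The object stream content from the issue, with the comment line removed.
cleaned = skip_comment_lines(b"% 750 0 obj\n<< >>\n")
```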
* Improve TryReadStream with simplification & fix of Stream Invalid Length cutting off Streams
- Fix Stream invalid Length issue causing stream data to be cut off: fixes https://github.com/UglyToad/PdfPig/issues/809
- Improve Stream Token read performance by:
  - simplifying TryReadStream(), avoiding use of MemoryStream, with the benefit of the already existing Memory Span of "inputBytes"
  - removing the unnecessary List<>
* Add Stream with Invalid Length unit test
* Use Memory<> instead of Span directly to avoid the byte array allocation from .ToArray().
Suggestion from (4153e4a1b4 (r1619509165))
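One way to picture the invalid-Length handling described in these commits is the sketch below. It is written in Python and is an assumption about the general approach, not a transcription of TryReadStream: the declared length is trusted only if an `endstream` keyword follows it, otherwise the data runs up to the first `endstream` found. The name `read_stream_data` and its parameters are hypothetical.

```python
def read_stream_data(buf: bytes, start: int, declared_length: int) -> bytes:
    """Return stream data beginning at `start`, tolerating a wrong length."""
    end = start + declared_length
    if 0 <= declared_length and end <= len(buf):
        # Trust the declared length only if 'endstream' really follows it.
        tail = buf[end:end + 16].lstrip(b"\r\n ")
        if tail.startswith(b"endstream"):
            return buf[start:end]

    # Fallback: scan forward for the keyword instead of cutting data short.
    marker = buf.find(b"endstream", start)
    if marker == -1:
        raise ValueError("no endstream keyword found")
    data_end = marker
    # Remove one end-of-line sequence that precedes the keyword.
    if data_end > start and buf[data_end - 1:data_end] == b"\n":
        data_end -= 1
    if data_end > start and buf[data_end - 1:data_end] == b"\r":
        data_end -= 1
    return buf[start:data_end]


sample = b"stream\nHello World\nendstream"
short = read_stream_data(sample, 7, 5)    # declared length is too small
exact = read_stream_data(sample, 7, 11)   # declared length is correct
```

Scanning for the first `endstream` can itself be fooled by binary data that contains those bytes, so this fallback is only a heuristic.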
* Enable nullable annotations
* Remove unused JetBrains annotations
* Ensure system using statements are first
* Improve nullability annotations
* Annotate encryptionDictionary is non-null when IsEncrypted is true
* Disable nullable for PdfTokenScanner.Get
* Improve nullability annotations for ObjectLocationProvider.TryGetCached
* Revert changes to RGBWorkingSpace
* Update UglyToad.PdfPig.Package with new framework targets (fixes nightly builds)
* Adds checksum on font file reading
* fix name table parsing on broken table
* only warn if checksum invalid, avoid exception with bounds check #258
also returns a null object when the object generation number exceeds
ushort.MaxValue since this is the maximum allowed value and this
broke tests attempting to parse all objects in the file from #258
* remove potentially problematic document
it might be sensitive data
* use ttf from file to test without including full file
Co-authored-by: romain v <rvergnory@lucca.fr>
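for the font checksum commits above, a minimal python sketch of a truetype table checksum with a warn-only check. the checksum rule (sum of big-endian 32-bit words, zero-padded, modulo 2^32) comes from the truetype/opentype format; the function names are hypothetical and this is not the library's code. the `head` table needs extra handling that is omitted here.

```python
import struct
import warnings


def table_checksum(table: bytes) -> int:
    """Sum the table as big-endian uint32 words, modulo 2**32."""
    padded = table + b"\0" * (-len(table) % 4)  # pad to a 4-byte boundary
    total = 0
    for (word,) in struct.iter_unpack(">I", padded):
        total = (total + word) & 0xFFFFFFFF
    return total


def checksum_matches(table: bytes, expected: int) -> bool:
    """Warn instead of raising when the stored checksum does not match."""
    actual = table_checksum(table)
    if actual != expected:
        warnings.warn(f"table checksum mismatch: {actual:#010x} != {expected:#010x}")
        return False
    return True
```

returning a flag and emitting a warning lets a caller keep parsing a font whose directory checksum is simply wrong.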
previously we checked that the offset was not inside the table (the correct thing to check). however, this is only a special case of the more general issue (cross reference offsets are wrong), so we move handling for this into the pdf token scanner. if we attempt to read an object at an offset and it fails, we brute force the entire file to find correct offsets. we also needed to add handling to make sure we don't attempt to use stream length tokens if we're brute-forcing, since we can't look up indirect references for length.
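a brute-force offset search of the kind described above can be sketched like this. it is a python illustration under stated assumptions, not the scanner's implementation: the regular expression, the function name `brute_force_offsets`, and the choice to let a later definition overwrite an earlier one are all hypothetical.

```python
import re

# "<number> <generation> obj", not preceded by another digit or letter.
OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")


def brute_force_offsets(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of its header."""
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start()  # assumption: the last definition wins
    return offsets


pdf = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
found = brute_force_offsets(pdf)
```

a plain byte scan like this can also match text inside stream data, so real offsets found this way still need to be validated when the object is parsed.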
if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.
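the adjustment can be illustrated with a short python sketch. the helper names are hypothetical, and treating a missing header as offset 0 is an assumption made for the example.

```python
def header_offset(data: bytes) -> int:
    """Return the position of the '%PDF-' version header, or 0 if absent."""
    index = data.find(b"%PDF-")
    return index if index != -1 else 0


def adjusted_offset(declared_offset: int, data: bytes) -> int:
    """Shift an offset from the cross reference data by the header position."""
    return declared_offset + header_offset(data)


# Five bytes of leading junk shift every declared offset by five.
shifted = adjusted_offset(100, b"junk\n%PDF-1.7\n")
```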
some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.
don't cache objects parsed if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because one object may lie in two streams, one valid and one invalid. if the invalid stream is parsed first for another object then the valid stream will never be read.
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion, as well as releasing a standalone version of the font library as pdfbox does with fontbox.
* while the pdf specification says stream data should follow a newline following a stream operator, some files have only a carriage return following the stream operator.
* comment tokens may appear inside an array or dictionary. we ignore them if they occur there since they would break interpretation of the dictionary or array contents.
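the carriage-return case from the first bullet can be sketched as below. this is a python illustration, not the scanner's code; the function name `stream_data_start` and its parameters are hypothetical.

```python
def stream_data_start(buf: bytes, after_keyword: int) -> int:
    """Return the index of the first data byte after the 'stream' keyword.

    Accepts CRLF, a lone LF, or (leniently) a lone CR as the line ending.
    """
    if buf[after_keyword:after_keyword + 2] == b"\r\n":
        return after_keyword + 2
    if buf[after_keyword:after_keyword + 1] in (b"\n", b"\r"):
        return after_keyword + 1
    return after_keyword  # no line ending at all, data starts immediately


cr_only = stream_data_start(b"stream\rDATA", 6)
```

the lenient lone-cr case is ambiguous when the data itself begins with a line feed, which is one reason to treat it as a tolerance for broken files rather than the normal path.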
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.
treat all warnings as errors.