if parsing in lenient mode and encountering a malformed base name
(in this case 'helveticai') we fall back to times-roman as the adobe font
metrics file for a standard 14 font. this aligns with the behavior of pdfbox.
we also log a more informative error in non-lenient mode.
this fixes document 0000086.pdf from the corpus.
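
a minimal python sketch of the fallback described above, not the library's actual code. the function name `resolve_standard_14_name` and the use of `logging` are assumptions made for illustration; only the fourteen font names and the times-roman fallback come from the pdf standard and this message.

```python
import logging

# The fourteen standard font names defined by the PDF specification.
STANDARD_14_NAMES = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})
FALLBACK_NAME = "Times-Roman"


def resolve_standard_14_name(base_name: str, lenient: bool) -> str:
    """Hypothetical helper: pick the metrics file name for a base font name."""
    if base_name in STANDARD_14_NAMES:
        return base_name
    if lenient:
        # Lenient mode: keep going with the fallback metrics.
        logging.warning("unknown standard 14 base name %r, using %s",
                        base_name, FALLBACK_NAME)
        return FALLBACK_NAME
    # Strict mode: fail with a message that names the offending value.
    raise ValueError(f"'{base_name}' is not one of the standard 14 font names")
```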
* fix off-by-one and optimize brute force xref search #1078
when performing a brute force xref search we were ending up
off by one. update the search to use a ring buffer to reduce
seeking and fix xref detection.
* make method testable and add test coverage
* normalize test input on other platforms
* seal circular buffer class
* back-calculate first char if last char and widths present
when a truetype font has a last char and widths array in its font
dictionary the first char can be calculated #644
* fix off by 1 in last char calculation
* use correct bounding boxes for standard 14 glyphs #850
previously every bounding box for type 1 standard 14 fonts was assumed
to start at 0,0 and ignored the bounding box data in the font metrics file.
now we correctly read the glyph bounding box while preserving the
existing advance width values for advancing the renderer position
* update test case for new logic
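
for the first char back-calculation listed above, the arithmetic can be sketched as follows. this assumes the widths array holds one entry per character code from first char to last char inclusive, as the pdf specification describes; the function name is hypothetical.

```python
def back_calculate_first_char(last_char: int, widths: list) -> int:
    """Hypothetical helper: derive FirstChar from LastChar and Widths."""
    if not widths:
        raise ValueError("widths array is empty")
    # Inclusive range: count = last - first + 1, so first = last - count + 1.
    # Dropping the "+ 1" is the kind of off-by-one the follow-up commit fixes.
    return last_char - len(widths) + 1
```

for example, a font with last char 255 and 224 widths covers codes 32 to 255.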
when copying from an ancestor node of a page's resource dictionary
we were incorrectly writing nested nodes of e.g. /fonts to the root
of the target dictionary. here we write to the intended target node
correctly.
align with the behavior of pdfbox and c implementations where
single character final blocks are ignored rather than being written.
also makes the error more informative in case it is ever encountered
again.
add more test cases.
it is possible this hides the underlying problem and moves the error elsewhere,
but it matches the behavior of the two reference implementations.
another potential source of the error is pdf supporting
'<~' as a start-of-data marker, which i can't find in the spec but which
wikipedia suggests might be possible. without documents to trigger the error
i think this is the best fix for now.
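
a rough python sketch of an ascii85 decoder that ignores a single-character final block, as described above. it is an illustration written for this note, not the library's code; the function name is an assumption, and it only recognises the `~>` end-of-data marker.

```python
def ascii85_decode(data: bytes) -> bytes:
    """Illustrative decoder: a lone trailing character is ignored, not written."""
    out = bytearray()
    group = []
    for byte in data:
        ch = chr(byte)
        if ch.isspace():
            continue
        if ch == "~":  # start of the '~>' end-of-data marker
            break
        if ch == "z" and not group:  # shorthand for four zero bytes
            out += b"\x00\x00\x00\x00"
            continue
        if not "!" <= ch <= "u":
            raise ValueError(f"invalid ascii85 character {ch!r}")
        group.append(byte - 33)
        if len(group) == 5:
            out += _group_to_bytes(group)
            group = []
    if len(group) == 1:
        # One character cannot encode a whole byte, so drop it.
        return bytes(out)
    if group:
        count = len(group)
        padded = group + [84] * (5 - count)  # pad with 'u'
        out += _group_to_bytes(padded)[: count - 1]
    return bytes(out)


def _group_to_bytes(group) -> bytes:
    value = 0
    for digit in group:
        value = value * 85 + digit
    if value > 0xFFFFFFFF:
        raise ValueError("ascii85 group out of range")
    return value.to_bytes(4, "big")
```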
when parsing a stream object with multiple endstream tokens
the last parsed token was selected instead of the actual stream
token. now, if the first token is a stream and all following
tokens are `endstream` operators, we skip the following tokens
and keep the stream.
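
the selection rule can be sketched in python as below. the token classes and the function name are hypothetical stand-ins, and the fallback to the last token is only an assumption about the previous behavior based on this message.

```python
from dataclasses import dataclass


@dataclass
class StreamToken:  # hypothetical stand-in for a parsed stream
    data: bytes


@dataclass
class OperatorToken:  # hypothetical stand-in for an operator such as endstream
    name: str


def select_object_token(tokens):
    """Pick the token that represents the object's value."""
    if not tokens:
        raise ValueError("no tokens were read for the object")
    first, rest = tokens[0], tokens[1:]
    only_endstream = all(
        isinstance(t, OperatorToken) and t.name == "endstream" for t in rest
    )
    if isinstance(first, StreamToken) and only_endstream:
        # Extra endstream operators carry no data, so keep the stream itself.
        return first
    return tokens[-1]
```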
when copying various dictionaries from a source document
to the builder, any indirect reference in the source document
would throw because the code expected the dictionary token
directly. now we follow the chain of indirect references until we
find a non-indirect leaf token. also changes the exception type.
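
a small python sketch of following indirect references to a leaf token. `IndirectReference`, `resolve_token` and the `lookup` callback are hypothetical names, and the cycle check is an extra safeguard that this message does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndirectReference:  # hypothetical stand-in for "12 0 R"
    object_number: int
    generation: int


def resolve_token(token, lookup):
    """Follow indirect references until a non-indirect token is reached.

    `lookup` maps an IndirectReference to the token stored for that object.
    """
    seen = set()
    while isinstance(token, IndirectReference):
        if token in seen:
            raise ValueError(f"circular reference at {token}")
        seen.add(token)
        token = lookup(token)
    return token
```

usage: with a dictionary `objects` keyed by reference, `resolve_token(ref, objects.__getitem__)` returns the first non-reference value in the chain.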
the file provided in issue #926 contains the following syntax
in pdf object streams:
```
% 750 0 obj
<< >>
```
currently we read the comment token and skip the rest.
however this producer is writing nonsense to the stream:
comment tokens are only valid outside streams in pdf files.
so we align to the behavior of pdfbox here by skipping the
entire line containing a comment inside a stream, which fixes
parsing this file.
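
as a simplified python illustration of skipping comment lines inside an object stream: the sketch below works on decoded text and only handles lines whose first non-whitespace character is `%`. a real tokenizer works on bytes and must not treat `%` inside a string literal as a comment; the function name is hypothetical.

```python
def strip_comment_lines(stream_text: str) -> str:
    """Drop every line that starts with a '%' comment marker."""
    kept = []
    for line in stream_text.splitlines():
        if line.lstrip().startswith("%"):
            continue  # skip the whole comment line
        kept.append(line)
    return "\n".join(kept)
```

applied to the snippet from the issue, only the `<< >>` line remains for the parser.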
The bugfix was the important part but the optimization is pretty nice too.
- Bugfix: If startxref was found so far back (e.g. at the very beginning, which can be the case for Linearized PDFs) that we ended up setting actualStartOffset to 0, the loop would exit immediately without actually searching that part.
- Optimization: GetStartXrefPosition would search for startxref in the last 2048 bytes and then keep doubling the search range (looking back 4096, 8192, etc. bytes) until the entire file was searched. This was rather inefficient since each step searched the same parts over and over again. It now searches increasingly larger chunks that don't overlap. On a test of 5000 PDFs this reduced load time by 10%.
- Change: No need for the exception to say that startxref couldn't be found "in the last 2048 characters" since the entire file was searched anyway.
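
A minimal Python sketch of the non-overlapping backward search, assuming the whole file is available as `bytes`. The function name and the `initial_chunk` parameter are illustrative, and the real GetStartXrefPosition may differ. Each window is extended to the right by `len(token) - 1` bytes so that a token straddling a chunk boundary is still found; that small overlap is an addition of this sketch.

```python
def find_last_startxref(data: bytes, initial_chunk: int = 2048) -> int:
    """Return the offset of the last 'startxref' in data, or -1 if absent."""
    token = b"startxref"
    end = len(data)
    chunk = initial_chunk
    while end > 0:
        start = max(0, end - chunk)
        # Allow a match that begins in this chunk and ends in the previous one.
        window_end = min(len(data), end + len(token) - 1)
        index = data.rfind(token, start, window_end)
        if index != -1:
            return index
        # Move left: the next chunk ends where this one began, and is larger.
        end = start
        chunk *= 2
    return -1
```

Because the loop only stops once `end` reaches 0, the chunk that starts at offset 0 is searched before the function gives up, which is the case the bugfix addresses.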