since jpegs can be trivially embedded in pdf documents without changes to the data stream this is the first image format we will support. currently this is a naive approach which doesn't share image resources between pages. ideally we will either de-duplicate images when they are added, return a re-usable key once an image is added, or both.
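as a sketch of why no transcoding is needed, something like the following is enough to wrap a jpeg in an image xobject stream. this is illustrative only, not the project's api: the helper name, the hard-coded colour space and the width/height parameters (normally read from the jpeg's frame header) are all assumptions.

```csharp
// minimal sketch, not the real implementation: a jpeg is embedded by wrapping
// its unmodified bytes in an image xobject stream using the DCTDecode filter.
using System.IO;
using System.Text;

public static class JpegXObjectSketch
{
    public static byte[] Wrap(byte[] jpegBytes, int width, int height)
    {
        // the dictionary tells the reader how to interpret the stream;
        // DCTDecode means "the stream data is a raw jpeg".
        var header = "<< /Type /XObject /Subtype /Image"
                   + $" /Width {width} /Height {height}"
                   + " /ColorSpace /DeviceRGB /BitsPerComponent 8"
                   + $" /Filter /DCTDecode /Length {jpegBytes.Length} >>\nstream\n";

        using (var output = new MemoryStream())
        {
            var headerBytes = Encoding.ASCII.GetBytes(header);
            output.Write(headerBytes, 0, headerBytes.Length);
            // the data stream is byte-for-byte the source jpeg file.
            output.Write(jpegBytes, 0, jpegBytes.Length);
            var footerBytes = Encoding.ASCII.GetBytes("\nendstream");
            output.Write(footerBytes, 0, footerBytes.Length);
            return output.ToArray();
        }
    }
}
```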
fixes the case where the brute-force searcher becomes stuck in an infinite loop. the problem pdf from #88 may have a newline or some other whitespace between its object and generation numbers, so it may still cause a failure elsewhere.
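a rough sketch of tolerant object header matching; the regex approach and names here are illustrative, not necessarily how the searcher actually works:

```csharp
// illustrative only: match "12 0 obj" style headers while allowing any
// whitespace (including newlines) between the object number, the generation
// number and the obj keyword.
using System.Text.RegularExpressions;

public static class ObjectHeaderSketch
{
    private static readonly Regex Header =
        new Regex(@"(\d+)\s+(\d+)\s+obj", RegexOptions.Compiled);

    public static bool TryParse(string text, out int objectNumber, out int generation)
    {
        var match = Header.Match(text);
        objectNumber = match.Success ? int.Parse(match.Groups[1].Value) : 0;
        generation = match.Success ? int.Parse(match.Groups[2].Value) : 0;
        return match.Success;
    }
}
```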
most of the cross-reference code is the earliest code in the project and hasn't been revisited since it was written. issue #88 has been reopened due to a bug with brute-force searching, so this tidies up the code in this area ahead of trying to fix the bug.
checks that we correctly handle the case where a comment appears inside a dictionary; this was handled by commit 3084a9. uses a list internally in the dictionary tokenizer to avoid the performance penalty of interface dispatch.
renames fields to match the expected conventions for resharper. removes fully qualified names from using statements since resharper marks these as not required.
adds a check to the pdf merger test to make sure the second page has the expected content. this check currently fails since we lose the resources node on the pages tree.
the cross reference parser tests' behaviour had changed; this fixes a compilation error from merge conflicts. also updates the merger tests to account for the new version behaviour and checks the output document text. adds pdfmerger to the public api in the tests.
I decided to move the part that creates the page node into its own method. This made it easier to see where I was supposed to get the correct reference from.
a lenient parsing flag gives us more code to maintain for no real benefit, since parsing should always be as lenient as possible. removes the flag from some of the font code.
when a close parenthesis that does not balance the current nesting precedes a line break followed by '/' or '>', we assume an earlier bracket was unbalanced and finish reading the string.
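a sketch of the termination heuristic described above; the names and structure are hypothetical, not the tokenizer's actual code:

```csharp
// illustrative only: called when a ')' is read that does not bring the paren
// nesting back into balance. peek ahead and, if a line break followed by '/'
// or '>' comes next, treat the string as finished rather than reading on.
public static class StringHeuristicSketch
{
    public static bool LooksLikeStringEnd(string input, int indexAfterParen)
    {
        var i = indexAfterParen;

        // skip the line break (\r, \n or \r\n) that must follow the ')'.
        var sawBreak = false;
        while (i < input.Length && (input[i] == '\r' || input[i] == '\n'))
        {
            sawBreak = true;
            i++;
        }

        // '/' starts a name (e.g. a dictionary key), '>' closes a dictionary;
        // either strongly suggests the string really ended at the ')'.
        return sawBreak && i < input.Length && (input[i] == '/' || input[i] == '>');
    }
}
```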
previously we checked that the offset was not inside the cross-reference table itself (a correct thing to check), however this is only a special case of the more general issue: cross-reference offsets can be wrong. we move handling for this into the pdf token scanner. if we attempt to read an object at an offset and it fails, we brute-force the entire file to find the correct offsets. we also needed to add handling to make sure we don't attempt to use stream length tokens while brute-forcing, since we can't look up indirect references for the length.
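a minimal sketch of that fallback, with hypothetical delegate parameters standing in for the scanner's real internals:

```csharp
// illustrative skeleton of the fallback: trust the declared offset first,
// and only on failure pay the cost of scanning the whole file once.
using System;
using System.Collections.Generic;

public sealed class OffsetFallbackSketch
{
    private IReadOnlyDictionary<long, long> bruteForcedOffsets;

    public bool TryReadObjectAt(
        long objectNumber,
        long declaredOffset,
        Func<long, long, bool> tryReadAt,                        // (offset, objectNumber) -> success
        Func<IReadOnlyDictionary<long, long>> bruteForceAllOffsets)
    {
        if (tryReadAt(declaredOffset, objectNumber))
        {
            return true;
        }

        // the declared cross-reference offset was wrong: scan the file once
        // for "N G obj" headers and reuse the resulting map for later reads.
        if (bruteForcedOffsets == null)
        {
            bruteForcedOffsets = bruteForceAllOffsets();
        }

        return bruteForcedOffsets.TryGetValue(objectNumber, out var actual)
            && tryReadAt(actual, objectNumber);
    }
}
```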
the brute force searcher offsets were off by one. this change means the offset returned is now aligned with the object number in the object number/generation/operator triple, e.g. it points at the first digit in '12 0 obj' rather than the byte before it.
now that the rectangle constructor uses the order [ llx, lly, urx, ury ] and does not apply correction for points, constructor parameters must be passed in the correct order. this change fixes the hyperlink factory, which was passing them in the wrong order.
in addition the pdfpath bounding box was using left, right, top and bottom to calculate the minimum bounding box. this produced incorrect values once individual path operator bounding boxes were rotated, since for a rotated rectangle top may be less than bottom.
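a minimal sketch of the corner-based alternative, assuming a simple point struct rather than the project's own types: taking the min and max over all four corners stays correct even when the rectangle is rotated and its nominal "top" sits below its "bottom".

```csharp
using System;

public readonly struct Point
{
    public double X { get; }
    public double Y { get; }
    public Point(double x, double y) { X = x; Y = y; }
}

public static class BoundingBoxSketch
{
    // min/max over every corner, with no assumption about which corner is
    // "top-left" after a rotation has been applied.
    public static (Point Min, Point Max) MinimumBoundingBox(Point p1, Point p2, Point p3, Point p4)
    {
        var minX = Math.Min(Math.Min(p1.X, p2.X), Math.Min(p3.X, p4.X));
        var minY = Math.Min(Math.Min(p1.Y, p2.Y), Math.Min(p3.Y, p4.Y));
        var maxX = Math.Max(Math.Max(p1.X, p2.X), Math.Max(p3.X, p4.X));
        var maxY = Math.Max(Math.Max(p1.Y, p2.Y), Math.Max(p3.Y, p4.Y));
        return (new Point(minX, minY), new Point(maxX, maxY));
    }
}
```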
however, performance seems to have taken a hit due to these changes.
if the %pdf version header comment is offset from the start of the file, the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset of the located version header.
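the adjustment itself is a one-liner; the method name here is hypothetical:

```csharp
// illustrative: offsets in the xref table were written assuming "%PDF-" sits
// at byte 0, so when the header is preceded by junk every declared offset
// must be shifted by the header's actual position.
public static class OffsetAdjustmentSketch
{
    public static long ToAbsoluteOffset(long declaredOffset, long versionHeaderPosition)
    {
        return declaredOffset + versionHeaderPosition;
    }
}
```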
some files have the format header preceded by large amounts of junk, yet these still appear to be valid for chrome and acrobat reader. this change increases the amount of junk we will read through prior to the version header.
also makes parsing of the version header culture invariant, which may be related to #85.
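the change amounts to pinning the culture when parsing the version number; in locales where ',' is the decimal separator, "1.7" can parse differently or fail:

```csharp
using System.Globalization;

public static class VersionParseSketch
{
    public static double ParseVersion(string headerNumber)
    {
        // e.g. headerNumber = "1.7" taken from "%PDF-1.7"; the invariant
        // culture guarantees '.' is treated as the decimal separator.
        return double.Parse(headerNumber, CultureInfo.InvariantCulture);
    }
}
```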
since inline image data may contain the end image ("EI") token inside the data stream, there's no reliable way to determine whether we've actually read all the data. for this reason, if we end up in an invalid state parsing operations after we've read the end image token, we try to recover by reading from the previous token to the next end image token, if any. we log information to let the consumer know this is what we're doing. it's still not bullet-proof but it should be good enough.
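a rough sketch of the recovery idea, with hypothetical names throughout: if parsing fails right after an inline image, assume the "EI" we stopped at was really part of the image data and extend the image to the next "EI" instead.

```csharp
using System;

public static class InlineImageRecoverySketch
{
    public static byte[] RecoverImageData(
        byte[] content,
        int imageStart,
        int firstEiEnd,
        Func<int, int> findNextEi,   // returns index of next "EI" after the argument, or -1
        Action<string> log)
    {
        var nextEi = findNextEi(firstEiEnd);
        if (nextEi < 0)
        {
            // no later "EI" exists: keep the original, possibly truncated, data.
            return Slice(content, imageStart, firstEiEnd);
        }

        log("inline image data appeared to contain a spurious EI token; re-reading to the next EI.");
        return Slice(content, imageStart, nextEi);
    }

    private static byte[] Slice(byte[] source, int start, int end)
    {
        var result = new byte[end - start];
        Array.Copy(source, start, result, 0, result.Length);
        return result;
    }
}
```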
also supports negative page rotation values by adding them to a 360 degree rotation, so -90 degrees clockwise becomes 270 degrees clockwise.
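the normalization is a one-liner; the double-modulo form below is an illustrative generalization that also covers values beyond a single negative turn:

```csharp
public static class RotationSketch
{
    // maps any rotation into [0, 360): -90 -> 270, -450 -> 270, 450 -> 90.
    public static int Normalize(int degrees) => ((degrees % 360) + 360) % 360;
}
```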
our bounding rectangle values still seem to be wrong for rotated letters. this change adds some test cases for common transformation matrix operations on a rectangle: scale, translate and rotate.
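an illustrative test in the spirit of those described, using plain math rather than the project's matrix type: rotating the unit square 90 degrees counter-clockwise about the origin should land it in [-1, 0] x [0, 1].

```csharp
using System;

public static class RotationTestSketch
{
    public static void Rotate90MovesUnitSquareLeft()
    {
        var corners = new[] { (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0) };

        for (var i = 0; i < corners.Length; i++)
        {
            var (x, y) = corners[i];
            corners[i] = (-y, x); // 90 degree ccw rotation: (x, y) -> (-y, x)
        }

        foreach (var (x, y) in corners)
        {
            if (x < -1 || x > 0 || y < 0 || y > 1)
            {
                throw new Exception($"corner ({x}, {y}) outside expected bounds");
            }
        }
    }
}
```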
since the properties in marked content may be indirect references or may belong to the page resources, the value should be resolved during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.
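a hypothetical sketch of why resolution has to wait until content processing (names and shapes here are illustrative, not the project's api): the properties operand may be a name pointing into the page resources, a direct dictionary, or an indirect reference to one.

```csharp
using System;
using System.Collections.Generic;

public static class MarkedContentPropertiesSketch
{
    public static IReadOnlyDictionary<string, object> Resolve(
        object operand,
        IReadOnlyDictionary<string, IReadOnlyDictionary<string, object>> pageProperties,
        Func<object, IReadOnlyDictionary<string, object>> dereference)
    {
        if (operand is string name && pageProperties.TryGetValue(name, out var fromResources))
        {
            return fromResources; // named entry looked up in the page resources
        }

        if (operand is IReadOnlyDictionary<string, object> direct)
        {
            return direct; // inline dictionary operand
        }

        return dereference(operand); // indirect reference: follow it
    }
}
```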