Commit Graph

697 Commits

Author SHA1 Message Date
BobLd
bff18d81ca Improve minimum bounding box orientation 2020-01-31 16:24:59 +00:00
BobLd
483b30f44d Remove rounding 2020-01-31 16:24:59 +00:00
BobLd
253ae32193 Remove ordering from minimal bounding rectangle 2020-01-31 16:24:59 +00:00
BobLd
0dad611cb1 Implement minimum bounding box algorithm 2020-01-31 16:24:59 +00:00
BobLd
36c03459a7 first and last letter 2020-01-31 16:24:59 +00:00
BobLd
f221b58936 Remove useless code 2020-01-31 16:24:59 +00:00
BobLd
ea27820ca4 Improve Word bounding box TextDirection.Other case 2020-01-31 16:24:59 +00:00
BobLd
2e5fdb5867 Fix PdfRectangle's Centroid and Translate() 2020-01-31 16:24:59 +00:00
BobLd
adaccf97b3 Add files via upload 2020-01-31 16:24:59 +00:00
BobLd
380c36918b Remove unnecessary code 2020-01-31 16:24:59 +00:00
BobLd
0cbf3434bc Remove 'orderFunc' from 'NearestNeighbourWordExtractor' to use the order found by clustering algo 2020-01-31 16:24:59 +00:00
BobLd
3b90370f28 Using Math.Min(letter.Width, letter.GlyphRectangle.Width) for rotated 180 word bounding box 2020-01-31 16:24:59 +00:00
BobLd
c4b6bbc8e5 Using Math.Max(letter.Width, letter.GlyphRectangle.Width) for word bounding box 2020-01-31 16:24:59 +00:00
BobLd
6d8744e722 More decimals to Width and Height
+ handle the case where both bottom points are identical
2020-01-31 16:24:59 +00:00
BobLd
bc69376743 Increase max distance for TextDirection.Other in NearestNeighbourWordExtractor 2020-01-31 16:24:59 +00:00
BobLd
a326d7e9d9 TextDirection.Unknown -> TextDirection.Other
Improve NearestNeighbourWordExtractor for TextDirection.Other
2020-01-31 16:24:59 +00:00
BobLd
9bcafdaa98 Update word bounding box computation 2020-01-31 16:24:59 +00:00
BobLd
27edf6cf77 Handle Width and Height for rotated rectangles 2020-01-31 16:24:59 +00:00
BobLd
75bd94e538 Better handling of TextDirection.Unknown word bounding box 2020-01-31 16:24:59 +00:00
BobLd
75821919a7 Fix NearestNeighbourWordExtractor for rotated text 2020-01-31 16:24:59 +00:00
Eliot Jones
8ab2838063 recover from invalid cross reference position
if we are reading a cross reference offset which contains a number we assumed it was a stream object. if it's not we now brute-force the entire file looking for an 'xref' token. this should be combined with a search for cross-reference streams and should run when we read neither the numeric token nor an 'xref' token but for now this fixes the observed issue.

also adds number of images to the page api to prevent consumers needing to enumerate.
2020-01-28 18:07:05 +00:00
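As a rough illustration of the brute-force recovery described in the commit above, here is a minimal Python sketch. It is not the library's implementation: the function name `find_xref_offsets` is hypothetical, and the sketch assumes the whole file is available as bytes and that any 'xref' not immediately preceded by 'start' is a candidate table.

```python
import re
from typing import List


def find_xref_offsets(data: bytes) -> List[int]:
    """Return byte offsets of every standalone 'xref' keyword in the file."""
    offsets = []
    for match in re.finditer(rb"xref", data):
        start = match.start()
        # Skip the 'xref' that is only the tail of the 'startxref' keyword.
        if start >= 5 and data[start - 5:start] == b"start":
            continue
        offsets.append(start)
    return offsets


sample = b"%PDF-1.4\n1 0 obj\nendobj\nxref\n0 1\ntrailer\nstartxref\n24\n%%EOF"
candidates = find_xref_offsets(sample)
```

A caller would typically try the last candidate first, since incremental updates append newer tables to the end of the file.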
Eliot Jones
29061b1fd2 handle unexpected adobe type 1 format
an encoding array in an adobe type 1 font may be missing its declaration ending in 'for', if we encounter 'dup' while looking for the 'for' token we have a special case to go straight into reading the encoding.

also handles a case where the page content stream contains a path-closing operator without any path being active.
2020-01-28 16:05:53 +00:00
Eliot Jones
6292fc256d handle direct font objects in the resource dictionary
fonts can appear as dictionary objects rather than indirect references in the resource dictionary for a page. if we encounter this we parse and store the font by name for retrieval during content parsing.
2020-01-27 18:07:51 +00:00
Eliot Jones
6cf257a331 strings record encoding used to create them.
in order to recreate the valid bytes for use in decryption it is necessary to know which encoding was used to read a string token. this is because utf16-be encoding has a byte-order marker which should be included in the resulting bytes.
2020-01-26 17:07:58 +00:00
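The reasoning in the commit above can be sketched as follows. This is a simplified Python illustration, not the library's code: `string_token_bytes` is a hypothetical helper, and latin-1 stands in for PDFDocEncoding as an assumption.

```python
import codecs


def string_token_bytes(text: str, encoding: str) -> bytes:
    """Recreate the raw bytes of a string token from its decoded text."""
    if encoding == "utf-16-be":
        # The byte-order marker was consumed while decoding, so put it back.
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    # Assumption: single-byte strings are approximated with latin-1 here.
    return text.encode("latin-1")


raw = string_token_bytes("A", "utf-16-be")
```

Without the recorded encoding the two leading marker bytes would be lost, and the decryption input would no longer match the bytes in the file.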
Eliot Jones
693a3d5958 use offset to file header to correct cross references
if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.
2020-01-26 15:30:20 +00:00
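A minimal Python sketch of the correction described in the commit above. The helper name `corrected_offset` is hypothetical and the sketch assumes the whole file is in memory.

```python
def corrected_offset(data: bytes, declared_offset: int) -> int:
    """Shift a declared cross reference offset by the header's position."""
    header_position = data.find(b"%PDF-")
    if header_position < 0:
        raise ValueError("no %PDF- version header found")
    # Offsets in the file are relative to the header, not to byte 0.
    return declared_offset + header_position


shifted = corrected_offset(b"junk\n%PDF-1.4\nxref\n", 9)
```

In the sample the header starts at byte 5, so a declared offset of 9 points at byte 14, where the 'xref' keyword actually sits.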
Eliot Jones
a561c8954e handle the format header being preceded by nonsense
some files seem to have the format header preceded by large amounts of junk but this appears to be valid for chrome and acrobat reader. this change ups the amount of nonsense to be read prior to the version header.

also makes parsing of the version header culture invariant which may be related to #85.
2020-01-25 16:53:41 +00:00
Eliot Jones
d9492ab2f8 handle empty encrypted portion in adobe type 1 font
the encrypted portion of an adobe type 1 font can be empty in which case we default to a blank private dictionary and charstrings set.
2020-01-25 16:41:54 +00:00
Eliot Jones
736f83e227 handle null charstring names
it appears charstring definitions in adobe type 1 fonts can omit the charstring name. in this case we set the name to the string value of the charstring index.
2020-01-25 16:35:08 +00:00
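The fallback naming in the commit above could look like this in Python. It is a sketch only: `name_charstrings` and the list-of-pairs input shape are assumptions, not the parser's real types.

```python
from typing import Dict, List, Optional, Tuple


def name_charstrings(entries: List[Tuple[Optional[str], bytes]]) -> Dict[str, bytes]:
    """Map charstring names to data, using the index when a name is missing."""
    named = {}
    for index, (name, data) in enumerate(entries):
        key = name if name is not None else str(index)
        named[key] = data
    return named


charstrings = name_charstrings([(".notdef", b"\x0e"), (None, b"\x0b"), ("A", b"\x0d")])
```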
Eliot Jones
ba09a13d08 more end image recovery logic
since inline image data may contain the end image "ei" token inside the data stream there's no reliable way to actually determine if we've read all the data. for this reason if we end up with an invalid state parsing operations after we've read the end image token we try to recover by reading from the previous token to the next end image token if any. we supply log information to let the consumer know this is what we're doing. it's still not bullet-proof but it should be good enough.

also support negative page rotation values by adding them to a 360 degree rotation so -90 degrees clockwise is 270 degrees clockwise.
2020-01-25 15:53:08 +00:00
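The rotation handling in the second part of the commit above amounts to a modulo operation. A small Python sketch, where the function name `normalise_rotation` and the multiple-of-90 check are assumptions:

```python
def normalise_rotation(degrees: int) -> int:
    """Map any multiple of 90 degrees onto 0, 90, 180 or 270 clockwise."""
    if degrees % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90 degrees")
    # Python's % returns a non-negative result for a positive modulus,
    # which is equivalent to adding 360 to a negative rotation.
    return degrees % 360


rotation = normalise_rotation(-90)
```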
Eliot Jones
3ac8d7ed91 update the github pages site
updates the information on the github pages site for the new api changes. includes some more seo friendly terms to improve discoverability, more engaging images as well as comprehensive code examples to improve onboarding.
2020-01-25 14:36:07 +00:00
Eliot Jones
3243be3ec5 change rectangle drawing logic for tests
support rotated output rectangles in the visual verification tests.
2020-01-22 13:45:52 +00:00
Eliot Jones
0ed4e58556 add test cases for rectangle transforms
our bounding rectangle values still seem to be wrong for rotated letters. this change adds some test cases for common transformation matrix operations on a rectangle, scale, translate and rotate.
2020-01-22 13:28:47 +00:00
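To make the three operations named in the commit above concrete, here is a Python sketch that applies a PDF-style matrix (a, b, c, d, e, f) to the corners of a rectangle. The helper names are hypothetical and this is not the test code added by the commit.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)


def transform_point(m: Matrix, p: Point) -> Point:
    """Apply x' = a*x + c*y + e and y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    x, y = p
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rectangle(m: Matrix, corners: List[Point]) -> List[Point]:
    """Transform each corner so a rotated rectangle keeps its true shape."""
    return [transform_point(m, p) for p in corners]


def rotation(degrees: float) -> Matrix:
    """Counter-clockwise rotation about the origin."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r), -math.sin(r), math.cos(r), 0.0, 0.0)


unit = [(0, 0), (1, 0), (1, 1), (0, 1)]
scaled = transform_rectangle((2, 0, 0, 3, 0, 0), unit)
moved = transform_rectangle((1, 0, 0, 1, 5, 7), unit)
turned = transform_rectangle(rotation(90), unit)
```

Tracking all four corners rather than only two opposite ones is what allows a rotated rectangle to be represented without collapsing it to an axis-aligned box.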
Eliot Jones
f29170fef8 use default width if present
if no widths array entry exists for the character and no font program is present for a true type simple font then use the 0 index glyph width if present in the widths array.
2020-01-14 15:18:07 +00:00
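The fallback order in the commit above can be sketched in Python as below. The function `simple_font_width` and its parameters are hypothetical; the real font classes are not modelled.

```python
from typing import Optional, Sequence


def simple_font_width(
    char_code: int,
    first_char: int,
    widths: Sequence[float],
    has_font_program: bool = False,
) -> Optional[float]:
    """Look up a glyph width, falling back to the 0 index entry."""
    index = char_code - first_char
    if 0 <= index < len(widths):
        return widths[index]
    # No entry for this character: use widths[0] only when there is
    # no font program to ask and the widths array is not empty.
    if not has_font_program and widths:
        return widths[0]
    return None


fallback = simple_font_width(200, 65, [500, 600])
```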
Eliot Jones
b50f476c31 update local tests
we set the file type filter to only pick up pdfs.
2020-01-14 14:59:14 +00:00
Eliot Jones
f6e12f40d8 support named tounicode cmaps rather than streams for type 0 fonts
tounicode cmap may refer to a known cmap name rather than an embedded cmap stream.
2020-01-14 14:58:20 +00:00
Eliot Jones
a36f5a3af3 handle missing embedded cid font for type 0 fonts
all font file entries in the font descriptor for type 0 fonts are optional. if the font is missing we default to returning the default bounding box.
2020-01-14 14:52:51 +00:00
Eliot Jones
e8401b87cf version 0.1.0 0.1.0 2020-01-13 10:46:47 +00:00
Eliot Jones
efc258b0f0 use tokenscanner when converting array to rectangle
an array of 4 items representing a rectangle may define its values as indirect references. when converting to a rectangle we pass a pdf token scanner to resolve any indirect references.
2020-01-13 10:20:08 +00:00
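A simplified Python sketch of the idea in the commit above. `IndirectReference`, `array_to_rectangle` and the `resolve` callback are stand-ins invented for illustration; in the library the token scanner plays the resolver's role.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Union


@dataclass(frozen=True)
class IndirectReference:
    object_number: int
    generation: int = 0


Number = Union[int, float]
Item = Union[Number, IndirectReference]


def array_to_rectangle(
    array: Sequence[Item],
    resolve: Callable[[IndirectReference], Number],
) -> Tuple[Number, Number, Number, Number]:
    """Convert a 4 item array to (x1, y1, x2, y2), resolving references."""
    if len(array) != 4:
        raise ValueError("a rectangle array must contain exactly 4 items")
    values = [
        resolve(item) if isinstance(item, IndirectReference) else item
        for item in array
    ]
    return (values[0], values[1], values[2], values[3])


objects = {IndirectReference(7): 612}
media_box = array_to_rectangle([0, 0, IndirectReference(7), 792], objects.__getitem__)
```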
BobLd
47672d3f90 Make TextBlock.SetReadingOrder(int) public 2020-01-13 09:25:57 +00:00
BobLd
fd014cfaa7 Add files via upload 2020-01-12 11:15:58 +00:00
BobLd
e8216b29c5 Add reading order in PageXml export 2020-01-12 11:15:58 +00:00
BobLd
e7417be75a ReadingOrderDetector and tidying DLA project 2020-01-11 11:18:11 +00:00
Eliot Jones
b4d917dcdc merge pull request #122 from uglytoad/marked-content
marked content
2020-01-10 17:07:21 +00:00
Eliot Jones
41cc7abd1b prevent negative point size for fonts 2020-01-10 14:40:28 +00:00
Eliot Jones
17b7cf2f61 load images eagerly for marked content
when a marked content region contains an image we load it eagerly since we won't have access to the necessary classes at evaluation time. we also default image colorspace to the active graphics state colorspace if the dictionary doesn't contain a valid entry.
2020-01-10 13:52:21 +00:00
Eliot Jones
2a579afd4d add missing doc comments for operation context marked content 2020-01-09 15:35:55 +00:00
Eliot Jones
d011f37316 merge master 2020-01-09 15:32:10 +00:00
Eliot Jones
43574097f1 rename marked content elements and use factory
since the properties in marked content may be indirect references or belong to the page resources array, the value should be calculated during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.
2020-01-09 15:30:16 +00:00
BobLd
097692f1cb Move ArtifactType inside PdfArtifactMarkedContent 2020-01-09 11:24:32 +00:00
Eliot Jones
6c1e3c76a8 version 0.1.0-beta002 0.1.0-beta002 2020-01-08 14:26:45 +00:00