38 Commits

Author SHA1 Message Date
Eliot Jones
5fb04582a7 0.1.2-alpha003 2020-06-20 12:54:31 +01:00
Eliot Jones
256c2833ab 0.1.2-alpha002 2020-05-10 16:36:14 +01:00
Eliot Jones
09b951f667 expose font details on individual letters
also fixes a regression for image extraction
2020-04-25 17:15:26 +01:00
Eliot Jones
98dd736f94 0.1.2-alpha001 2020-04-25 15:20:07 +01:00
Adam Busbin
00b9d416df added check for bad fonts; see 61ceca8376/fontbox/src/main/java/org/apache/fontbox/ttf/HorizontalMetricsTable.java line 67 for matching code. 2020-04-25 08:40:12 +01:00
Eliot Jones
407ee5ca51 add content order text extractor and example of use 2020-04-19 17:06:34 +01:00
Eliot Jones
b122bf0ca6 inline transformation code and cache afm strings 2020-04-18 13:56:39 +01:00
BobLd
20c4b9594b Rename PdfSubpath.ClosePath() to PdfSubpath.CloseSubpath() to avoid confusion 2020-04-05 17:58:57 +01:00
BobLd
ab6a0f11fc Change name from PdfPath to PdfSubpath 2020-04-05 17:58:57 +01:00
Eliot Jones
f1be6634a7 add a bunch more performance improvements
filter provider becomes single instance and no longer has constructor parameters.

tokenizers use list and stringbuilder pools to reduce allocations.

system font finder becomes static to preserve file cache across all documents.
2020-04-05 15:34:47 +01:00
Eliot Jones
9abe9f4b2f #158 add strong naming to the solution 2020-04-04 16:59:51 +01:00
Eliot Jones
4ed1600cab version 0.1.1 2020-03-18 20:10:51 +00:00
Eliot Jones
8ac4195b83 0.1.1-alpha001 2020-03-15 16:52:28 +00:00
Eliot Jones
aa9df30722 handle invalid charstring sequences
it is possible for a file with an adobe type 1 font to contain an invalid charstring sequence. if this happens we handle it and return false from trygenerate.
2020-03-08 14:33:26 +00:00
Eliot Jones
8df2f9cf6b generate all xml docs and pack them #148
after we split the solution into multiple projects the xml doc comments were no longer packed in the generated nuget package. in addition they were only generated for the net standard 2.0 target framework.

this change generates comments for all target frameworks and makes sure they're included in the generated package. it also adds missing doc comments where they weren't included on the public api and clears up a couple of minor formatting issues in the affected files.
2020-03-08 13:44:09 +00:00
Eliot Jones
4442a69a97 use tryget rather than lambdas for union type
avoid the allocations caused by lambda expressions for performance reasons.
2020-02-28 16:02:20 +00:00
Eliot Jones
f7cabe5d12 ignore invalid postscript format type truetype
when reading the format type of a postscript table in a truetype font ignore an invalid format value.
2020-02-27 16:10:19 +00:00
Eliot Jones
0fcc4e54c8 add istestproject setting to all projects
indicates which projects are test projects to the test runner.
2020-02-27 12:35:40 +00:00
Eliot Jones
50c17f7951 make compact font format parser thread safe
the individual cff parser uses a cff dictionary reader inside it which has a per-instance operands list. for this reason it is not thread-safe and cannot be shared. this change creates a new individual font parser for each call to the top-level cff parser.
2020-02-25 14:24:29 +00:00
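The thread-safety fix described in the commit above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself is a .NET library), and every name in it (`IndividualFontParser`, `TopLevelParser`, `parse_all`) is hypothetical rather than taken from the codebase. The point it shows is only the pattern: a parser with mutable per-instance state is created fresh for each call instead of being shared.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class IndividualFontParser:
    """Hypothetical parser with mutable per-instance state (an operands list)."""

    def __init__(self) -> None:
        self._operands: List[int] = []

    def parse(self, data: bytes) -> int:
        # The operands list is mutated during parsing, so one instance
        # must not be used by two threads at the same time.
        self._operands.clear()
        for value in data:
            self._operands.append(value)
        return sum(self._operands)


class TopLevelParser:
    """Hypothetical top-level parser that is safe to share between threads."""

    def parse(self, data: bytes) -> int:
        # A new individual parser per call means no state is shared.
        return IndividualFontParser().parse(data)


def parse_all(parser: TopLevelParser, inputs: List[bytes]) -> List[int]:
    """Parse several inputs concurrently using one shared top-level parser."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(parser.parse, inputs))
```

The trade-off in this sketch is one extra small allocation per call in exchange for not needing any locking.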
Eliot Jones
9f488809ac #141 cast adobe type 2 char string value to short
where the value is 28 the next two bytes indicate a short, not a 16 bit two's complement number, apparently, or i've misunderstood what the two's complement bit is about...
2020-02-25 13:56:26 +00:00
Eliot Jones
f7c6de4118 #141 fix two's complement in adobe type 2 charstring
the byte value of 28 indicates the next 2 bytes are a 16 bit two's complement number rather than just a short. this changes the calculation to generate the two's complement value correctly.
2020-02-25 13:19:47 +00:00
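The two #141 commits above both concern the operand introduced by byte value 28 in a type 2 charstring. A 16 bit two's complement number read in big-endian order and a signed short are the same value, which is what the sketch below demonstrates. It is written in Python for illustration only; the function name `read_shortint` is hypothetical and is not taken from the project.

```python
def read_shortint(data: bytes, offset: int) -> int:
    """Decode the two-byte operand that follows byte value 28 at data[offset]."""
    if data[offset] != 28:
        raise ValueError("expected byte value 28 at offset")
    if offset + 3 > len(data):
        raise ValueError("not enough bytes for a 16 bit operand")
    # Big-endian, signed: identical to casting (b1 << 8 | b2) to a 16 bit short.
    return int.from_bytes(data[offset + 1:offset + 3], byteorder="big", signed=True)
```

For example, the bytes `28, 0xFF, 0xFE` decode to a negative number under this reading, whereas treating them as an unsigned 16 bit value would give a large positive one.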
Eliot Jones
28faf1c22c default to .notdef for type 2 charstrings
if the glyph with a specific name isn't found in the set of type 2 charstrings we default to using the .notdef glyph if present.
2020-02-21 10:37:58 +00:00
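The fallback described in the commit above amounts to a lookup with a default. A minimal sketch in Python (for illustration only; `find_charstring` and the dictionary layout are assumptions, not the project's API) might look like this:

```python
from typing import Dict, Optional


def find_charstring(charstrings: Dict[str, bytes], name: str) -> Optional[bytes]:
    """Return the charstring for name, else the .notdef charstring if present."""
    if name in charstrings:
        return charstrings[name]
    # Returns None when the font has no .notdef glyph either.
    return charstrings.get(".notdef")
```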
Eliot Jones
29061b1fd2 handle unexpected adobe type 1 format
an encoding array in an adobe type 1 font may be missing its declaration ending in 'for'. if we encounter 'dup' while looking for the 'for' token we have a special case to go straight into reading the encoding.

also handles a case where the page content stream contains a path-closing operator without any path being active.
2020-01-28 16:05:53 +00:00
Eliot Jones
d9492ab2f8 handle empty encrypted portion in adobe type 1 font
the encrypted portion of an adobe type 1 font can be empty in which case we default to a blank private dictionary and charstrings set.
2020-01-25 16:41:54 +00:00
Eliot Jones
736f83e227 handle null charstring names
it appears charstring definitions in adobe type 1 fonts can omit the charstring name. in this case we set the name to the string value of the charstring index.
2020-01-25 16:35:08 +00:00
Eliot Jones
a496daf0ce ignore hflex when calculating hint bytes
hflex and hflex1 should not count towards the hint byte count for a hintmask operator in type 2 charstrings.
2020-01-08 13:27:33 +00:00
Eliot Jones
d267d7501a use encoding specified in base font if present
if the font uses a named encoding which is not recognised, use the corresponding encoding based on the base font name, or fall back to windows ansi encoding.
2020-01-07 16:01:45 +00:00
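The resolution order described in the commit above (recognised named encoding, then an encoding based on the base font name, then windows ansi) can be sketched as below. This is Python for illustration only. The tables and the name `resolve_encoding` are hypothetical placeholders and do not reflect which encodings or base font names the project actually maps.

```python
from typing import Dict, Optional

# Hypothetical lookup tables; the string values stand in for encoding objects.
NAMED_ENCODINGS: Dict[str, str] = {
    "WinAnsiEncoding": "win-ansi",
    "MacRomanEncoding": "mac-roman",
    "StandardEncoding": "standard",
}
BASE_FONT_ENCODINGS: Dict[str, str] = {
    "Symbol": "symbol",
    "ZapfDingbats": "zapf-dingbats",
}
DEFAULT_ENCODING = "win-ansi"


def resolve_encoding(encoding_name: Optional[str], base_font: Optional[str]) -> str:
    """Pick an encoding: recognised name, then base font, then windows ansi."""
    if encoding_name in NAMED_ENCODINGS:
        return NAMED_ENCODINGS[encoding_name]
    if base_font in BASE_FONT_ENCODINGS:
        return BASE_FONT_ENCODINGS[base_font]
    return DEFAULT_ENCODING
```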
Eliot Jones
fc9c1b6ff5 add method to retrieve single glyph bounds from truetype
this improves performance since we only need to load a single rectangle rather than the entire glyphs array including all points.
2020-01-06 14:43:51 +00:00
Eliot Jones
09c72a2fb2 handle 0 length glyph in true type font 2020-01-06 14:12:46 +00:00
Eliot Jones
02f9166c00 use lazy loading for glyph data
glyph data in TrueType fonts can be very large and slow to parse. to avoid this we store the raw table data at parsing time and enable lazy loading of glyph descriptions.
2020-01-05 15:42:23 +00:00
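The lazy loading approach in the commit above (keep the raw table bytes at parse time, decode a glyph only when it is requested) can be sketched as follows. This is Python for illustration only. The class name `LazyGlyphTable`, the `parse` callback and the `parse_count` counter are all hypothetical, and the offsets list assumes one more entry than there are glyphs, so that glyph `i` spans `offsets[i]` to `offsets[i + 1]`.

```python
from typing import Callable, Dict, List


class LazyGlyphTable:
    """Hypothetical wrapper that decodes glyph descriptions on first access."""

    def __init__(self, raw: bytes, offsets: List[int],
                 parse: Callable[[bytes], object]) -> None:
        self._raw = raw          # raw table data stored at parsing time
        self._offsets = offsets  # assumed to hold glyph_count + 1 entries
        self._parse = parse      # decoder for a single glyph's bytes
        self._cache: Dict[int, object] = {}
        self.parse_count = 0     # only here to make the laziness observable

    def glyph(self, index: int) -> object:
        """Return the glyph at index, decoding and caching it if needed."""
        if index not in self._cache:
            start, end = self._offsets[index], self._offsets[index + 1]
            self.parse_count += 1
            # A glyph with start == end has zero length and yields empty bytes.
            self._cache[index] = self._parse(self._raw[start:end])
        return self._cache[index]
```

Constructing the table does no decoding at all, and repeated requests for the same glyph are served from the cache.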
Eliot Jones
e0a45e3774 include dependencies as dlls in the published nuget
by default nuget pack does not include project dependencies. this is suboptimal since it would require managing at least 5 nuget packages. this uses a workaround detailed here https://github.com/nuget/home/issues/3891 to copy the dependent dlls to the generated nuget package. this doesn't resolve the issue of how we publish the documentlayoutanalysis project, since it is the top of the dependency tree and we publish its parent, rather than it.
2020-01-05 13:56:14 +00:00
Eliot Jones
e1b39983d0 handle missing encodings in cff fonts 2020-01-05 13:16:31 +00:00
Eliot Jones
b29354e3e6 move compact font format fonts to fonts project 2020-01-05 12:08:01 +00:00
Eliot Jones
bbde38f656 move tokenizers to their own project
since both pdfs and adobe type 1 fonts use postscript type objects, tokenization is needed by the main project and the fonts project.
2020-01-05 10:40:44 +00:00
Eliot Jones
d09b33af4d move tokens to new project 2020-01-05 10:07:01 +00:00
Eliot Jones
a6541f1cfc fix test references
update references for unit tests to reference new core and fonts projects. all tests except the public api scanner tests now run successfully.
2020-01-04 22:56:41 +00:00
Eliot Jones
74774995d6 complete move of truetype, afm and standard14 fonts
the 3 font types mentioned are moved to the new fonts project; any referenced types are moved to the core project. most truetype classes are made public #8.
2020-01-04 22:39:13 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, and pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00