Commit Graph

693 Commits

Author SHA1 Message Date
Eliot Jones
336947db73 add writing methods to truetype tables #98
since we have verified that the problem with the characters not appearing in acrobat reader isn't the checksum (other files also have invalid checksums but work), it seems likely the issue is with the os/2 table.

this change moves the logic for writing out the cmap table, the format 6 cmap sub-table, truetype table headers and the os/2 table into the classes themselves. now that we can write an os/2 table, and have tested that the output matches the input, we can overwrite the os/2 table in order to work out which of the os/2 errors is causing our font to be invalid.

the writeable interface should be added to more and more parts of the codebase so that writing, editing and document creation become first class citizens rather than hardcoded additions.

this change also adds the macroman (1,0) cmap subtable to edited fonts so that it is present for consumers which expect it.
2020-01-04 10:27:07 +00:00
Eliot Jones
9fff879bd4 fix tests by using custom equality comparers
since we now round glyph widths for truetype fonts in the widths array of the pdf, some values are out by a very small amount from the expected value. since we don't care about such fractional inaccuracy, we use a custom comparer for these tests.
2020-01-04 10:27:07 +00:00
Eliot Jones
3ad03ff3ee finish implementation of truetype cmap replacer #98
this overwrites the cmap table which is moved to the end of the truetype font file. the new table contains a single windows symbol subtable (3,0) of the format 6 type which maps the character codes in the single range 0 -> glyphcount to the corresponding glyph indices in the font. the new cmap table is then written to the new font file and the header value for length is updated.
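
as a rough illustration of the table layout described above, here is a minimal python sketch of a cmap table holding a single (3,0) format 6 subtable. it follows the published opentype layout for format 6 (format, length, language, firstCode, entryCount, glyphIdArray). the function names and the `glyph_ids` parameter are hypothetical and this is not the code from this commit.

```python
import struct


def build_format6_subtable(glyph_ids, first_code=0):
    """Serialise a format 6 (trimmed table mapping) cmap subtable.

    Character code first_code + i maps to glyph_ids[i].
    """
    # Five uint16 header fields (10 bytes) plus one uint16 per entry.
    length = 10 + 2 * len(glyph_ids)
    header = struct.pack(">HHHHH", 6, length, 0, first_code, len(glyph_ids))
    return header + struct.pack(f">{len(glyph_ids)}H", *glyph_ids)


def build_cmap_table(glyph_ids):
    """Build a cmap table with one windows symbol (3,0) encoding record."""
    subtable = build_format6_subtable(glyph_ids)
    # cmap header: version 0, one encoding record.
    header = struct.pack(">HH", 0, 1)
    # Encoding record: platform 3, encoding 0, offset from the table start.
    # The subtable follows the 4 byte header and the 8 byte record.
    record = struct.pack(">HHI", 3, 0, len(header) + 8)
    return header + record + subtable
```

the length of the returned bytes is the value that would then be written to the cmap entry in the table directory.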

this also changes many truetype classes to use the corresponding ushort datatype rather than ints, to save space.

the generated file is displaying correctly in most pdf viewers and passes all tests but in adobe acrobat reader the text is present but invisible. this was not a problem with the previous approach to file generation. there is no log information as to why this might be the case but it seems like the answer must be related to the validity of the overwritten truetype file. we might need to provide an additional macroman cmap subtable in case this is required by acrobat reader.

running the produced font for andada-regular through fontvalidator https://github.com/HinTak/Font-Validator/releases indicates a number of issues that may cause the file to be an invalid font (cannot open with microsoft font viewer). the next step is to compare the errors present in the unmodified andada regular file with the errors in our version. the most likely candidates seem to be:
* os/2: font is a symbol font but panose byte 1, familytype, is not set to latin symbol.
* os/2: a unicode range was indicated in ulunicoderange but the font has no characters in that range.
* os/2: the usfirstcharindex/uslastcharindex is not valid.
* os/2: the font contains a 3,0 cmap but the codepagerange bit 31 is clear.
* os/2: the usbreakchar is not mapped to a glyph.
* head: font checksum is incorrect (this can also be the case for working fonts so seems unlikely to be the cause).
2020-01-04 10:27:07 +00:00
Eliot Jones
59c43cc526 truetype encoding replacer and checksum calculator #98
we need to provide a custom cmap for our overridden fonts when creating a document using truetype fonts. in order to do this without writing a complete subsetter (yet) we simply rearrange the font by moving the cmap table to the end of the font.

in order to keep a valid font we need to recalculate the offsets and checksums for all table headers. this adds a calculator which can calculate per-table checksums as well as the whole-font checksum used to calculate the checksum adjustment recorded in the head table.
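
for reference, the checksum scheme defined by the truetype/opentype specification can be sketched in python as below. this is an illustration of the standard algorithm, not the calculator added in this commit, and the function names are hypothetical.

```python
import struct


def table_checksum(data: bytes) -> int:
    """Sum the data as big-endian uint32 words, modulo 2**32.

    Tables are treated as zero-padded to a multiple of 4 bytes.
    """
    padded = data + b"\0" * (-len(data) % 4)
    total = 0
    for (word,) in struct.iter_unpack(">I", padded):
        total = (total + word) & 0xFFFFFFFF
    return total


def checksum_adjustment(font_bytes: bytes) -> int:
    """Compute the head table's checkSumAdjustment value.

    Assumes font_bytes is the whole font with the checkSumAdjustment
    field already set to zero.
    """
    return (0xB1B0AFBA - table_checksum(font_bytes)) & 0xFFFFFFFF
```

the per-table value goes into each table directory entry, while the whole-font value is only used to derive the adjustment stored in the head table.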

now that the cmap table has been moved to the end of the font file we can overwrite it with a different-length custom cmap table without further invasive changes to the rest of the truetype file. this isn't implemented yet in this commit but will be the next thing to implement.

in truetype writing font we've temporarily reverted the change which maps characters to bytes until the custom cmap is written so we can ensure for this change the output font file is still valid and can be interpreted by pdf consumers. once the custom cmap is written we can uncomment the mapping logic and it should all just work.
2020-01-04 10:27:07 +00:00
Eliot Jones
f319e7f4b5 adds per character byte mapping to truetype #98
this starts to add logic for per-character mapping of unicode characters to byte values for truetype fonts in the pdf document builder. in order to support unicode characters outside the 0-255 range when creating new pdf documents without using composite fonts, we need to map values outside this range into this range. to do this we start at 1 and map each character we encounter to the next code, up to a maximum of 255. we provide a custom tounicode cmap in the font dictionary which maps these byte values, 0-255, back to unicode code points (short).
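
the allocation scheme can be sketched in python as follows. the class and method names are hypothetical and this is only an illustration of the idea, assuming codes are handed out in order of first use; it is not the document builder code itself.

```python
class ByteCodeMapper:
    """Assigns each distinct character the next free single-byte code."""

    MAX_CODE = 255

    def __init__(self):
        self._codes = {}

    def code_for(self, ch: str) -> int:
        """Return the byte code for ch, allocating a new one if needed."""
        if ch in self._codes:
            return self._codes[ch]
        next_code = len(self._codes) + 1  # codes start at 1
        if next_code > self.MAX_CODE:
            raise ValueError("no single-byte codes left for this font")
        self._codes[ch] = next_code
        return next_code

    def to_unicode(self) -> dict:
        """Map each allocated byte code back to its unicode code point."""
        return {code: ord(ch) for ch, code in self._codes.items()}
```

the dictionary returned by `to_unicode` holds the data a tounicode cmap would need, and its smallest and largest keys correspond to the first and last codes in use.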

we also provide a custom firstchar, lastchar and widths array for the font mapping just the values we use.

since fonts no longer contain just the latin character set the font descriptor enum is set to have the symbolic flag set. this means values will be looked up in either the mac-roman (1, 0) or windows-symbol (3, 0) cmap tables (these cmap tables are distinct from cmap tables in the pdf file) inside the actual truetype font bytes. this means the currently generated font file is invalid, because while the widths array and tounicode cmap return the correct values the actual font itself returns whatever values were in those positions before the remapping occurred.

in order to fix this we will need to override the windows-symbol cmap contained in the underlying truetype font to match our mapping. this will be a lot of work and involve significant rewriting of the font file itself, in order to preserve checksum integrity.
2020-01-04 10:27:07 +00:00
BobLd
f67cce31b5 Adding a 'minimumEditDistanceNormalised' parameter to allow for other edit distance implementations. 2020-01-03 12:31:23 +00:00
BobLd
e46df38f4d Make numbersPattern private 2020-01-03 12:31:23 +00:00
BobLd
39f275aaeb Improve numbers pattern matching and include roman numerals 2020-01-03 12:31:23 +00:00
BobLd
07f51712c6 Update PublicApiScannerTests 2020-01-03 12:31:23 +00:00
BobLd
a233cc627c Add headers/footers (decoration) classifier for TextBlock 2020-01-03 12:31:23 +00:00
BobLd
d246bf5c74 - remove unnecessary casts
- make PageXmlTextExporter.Deserialize() public
2019-12-31 10:43:07 +00:00
BobLd
c421f3f85e Correct NearestNeighbourWordExtractor file name (remove whitespace) 2019-12-29 15:25:19 +00:00
BobLd
3a060d9769 Update PublicApiScannerTests 2019-12-28 14:43:09 +00:00
BobLd
c810ca74a6 Format and tidy up PAGE xml export autogenerated code. 2019-12-28 14:43:09 +00:00
Eliot Jones
87528199c6 use byte values when showing text for document builder #98
when writing text content the current show text operator was just writing the unicode string value and hoping it produced the correct value in the resulting document despite the values being consumed in a different encoding. this change adds a method to retrieve the corresponding byte value for a unicode character and uses that to write a hex show text operator to the page content. this is only implemented for standard14 fonts in this change.

for standard14 fonts we look up the corresponding name for the unicode value from the adobe glyph list. once we find the corresponding glyph name we look up the code value in the encoding we have chosen when writing standard14 fonts (macromanencoding). this value is then the byte value written to the show text operator. if the value does not appear in any of the lookups we throw a not supported exception.
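
the two-step lookup can be sketched in python as below. the tables are tiny hand-picked excerpts of the adobe glyph list and of macromanencoding, and every name in the block is hypothetical; it only illustrates the lookup order, not the library's implementation.

```python
# Excerpt of the Adobe Glyph List: unicode code point -> glyph name.
GLYPH_NAMES = {0x0041: "A", 0x00C4: "Adieresis", 0x00E9: "eacute"}

# Excerpt of MacRomanEncoding: glyph name -> character code.
MAC_ROMAN_CODES = {"A": 0x41, "Adieresis": 0x80, "eacute": 0x8E}


def byte_for(ch: str) -> int:
    """Resolve a character to its MacRoman byte via its glyph name."""
    name = GLYPH_NAMES.get(ord(ch))
    code = MAC_ROMAN_CODES.get(name) if name is not None else None
    if code is None:
        raise NotImplementedError(f"no byte value for U+{ord(ch):04X}")
    return code


def show_text_operator(text: str) -> str:
    """Write the text as a hex string operand followed by Tj."""
    hex_bytes = "".join(f"{byte_for(ch):02X}" for ch in text)
    return f"<{hex_bytes}> Tj"
```

a character missing from either table, such as a czech "č" in this excerpt, raises instead of silently writing a wrong byte.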

this also adds a test case which will still fail for czech characters in a truetype font, the issue reported in #98.
2019-12-28 14:42:27 +00:00
BobLd
5e3f5651b8 Update NearestNeighbourWordExtractor .cs
Removing the font name check (`string.Equals(l1.FontName, l2.FontName, StringComparison.OrdinalIgnoreCase)`) because some special characters or ligatures may belong to different subsets.
2019-12-27 13:08:44 +00:00
BobLd
3b79ebc5d5 Update PageXmlTextExporter.cs
Set the `PageXmlTextRegion` type to default `PageXmlTextSimpleType.Paragraph` to avoid a crash in LayoutEvalGUI 1.9
2019-12-27 13:08:04 +00:00
BobLd
b5bab67889 Update MathExtensions.cs
Handling null or empty double array.
2019-12-27 10:51:32 +00:00
Eliot Jones
a4805ce97d improve performance of the truetype name table parsing 2019-12-25 10:52:00 +00:00
Eliot Jones
815705494a cache last loaded font from the resource store during content parsing 2019-12-24 23:24:52 +00:00
Eliot Jones
ec060ae81b add hardcoded switch branches for more content operations
also adds a gitignore entry for the 'benchmark' subfolder in tools where custom benchmarking applications can be built and run without being added to source control.
2019-12-24 23:12:04 +00:00
Eliot Jones
ce38238f2c ignore invalid minfeature values for type 1 fonts 2019-12-24 16:56:46 +00:00
Eliot Jones
23c7e44fc8 handle stream length being an object stream value 2019-12-24 15:22:47 +00:00
Eliot Jones
9c9a08c6a7 make numeric tokenizer threadsafe by removing cache 2019-12-24 12:24:40 +00:00
Eliot Jones
3bef786d5c use performant hasflag method for truetype simple glyphs 2019-12-24 12:22:17 +00:00
Eliot Jones
649abdf966 use named constants for relevant type2 charstring command bytes 2019-12-24 12:22:17 +00:00
Eliot Jones
526af82e1a fix naming and tostring for type2 charstring sequence 2019-12-24 12:22:17 +00:00
Eliot Jones
be00c3b1b7 remove union types from charstring parser to prevent allocations 2019-12-24 12:22:17 +00:00
Eliot Jones
4f9eb1a25a use short to save space when storing the set of glyph points 2019-12-24 12:22:17 +00:00
Eliot Jones
ba9fe40bc1 cache some more common values and improve performance of tokenizers 2019-12-24 12:22:17 +00:00
Eliot Jones
e048bb8c2c performance tuning for numeric tokens and parsing 2019-12-24 12:22:17 +00:00
Eliot Jones
1e29c298cf use correct numeric types when parsing truetype fonts 2019-12-24 12:22:17 +00:00
Eliot Jones
935d182888 use doubles where calculations are being run 2019-12-24 12:22:17 +00:00
Eliot Jones
e984180b3d add method to retrieve any embedded files 2019-12-21 16:16:36 +00:00
Eliot Jones
4d697e3669 allow the user to supply multiple passwords for decryption
previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.
2019-12-20 15:11:05 +00:00
Eliot Jones
5e68720495 add support for type1c cid fonts 2019-12-20 14:46:25 +00:00
Eliot Jones
f401ab3ba0 handle case insensitive truetype table tags and missing tables for postscript fonts 2019-12-20 14:40:25 +00:00
Eliot Jones
3084a9aab6 support streams containing only carriage returns. handle comments in arrays and dictionaries
* while the pdf specification says stream data should follow a newline following a stream operator some files have only a carriage return following the stream operator.
* since comment tokens may appear inside an array or dictionary we ignore them if they occur here since they will break interpretation of the dictionary or array contents.
2019-12-20 14:04:58 +00:00
Eliot Jones
3e6fa4b694 correctly map character code to glyph id when retrieving bounding boxes for truetype fonts
previously we just treated character codes as glyph ids when getting the bounding box from the truetype font program itself. this change uses the code for character code to glyph id mapping from pdfbox, with some changes, to retrieve the correct bounding box where possible. since this relies in some places on using the unicode value or name, rather than character code, we add a cache to the individual truetype fonts to store the character code to unicode mapping which should have the benefit of improving performance.
2019-12-20 12:48:00 +00:00
Eliot Jones
7296c3c125 merge pull request #105 from BobLd/master
whitespace covering algorithm and #104
2019-12-20 11:57:31 +00:00
Eliot Jones
e37e4c37b3 require end image token to be followed by at least 1 whitespace 2019-12-19 17:34:40 +00:00
Eliot Jones
03a28287e9 handle missing widths in cid fonts correctly 2019-12-19 16:59:17 +00:00
Eliot Jones
82c2ee7026 handle ei end image token appearing in inline image data 2019-12-19 16:29:44 +00:00
Eliot Jones
528df5c396 handle malformed cmap base character listings 2019-12-19 15:27:12 +00:00
Eliot Jones
c30cd1b96d use cid font subroutines where applicable. add ucs 2 cmap support for type 1 fonts
* cid cff fonts have multiple sub-fonts and multiple private dictionaries, in addition to a top level font and private dictionary. this fix uses the specific sub-dictionary when getting local subroutines on a per-glyph basis.
* chinese, japanese or korean fonts can use a ucs-2 encoding cmap for retrieving unicode values.
* add support for the additional glyph list for unicode values in true type fonts. adds nonmarkingreturn mapping to carriage return.
* makes font parsing classes static where there's no reason for them to be per-instance.
2019-12-19 13:33:44 +00:00
Eliot Jones
a167d4c1dd fix bug where hex tokens for document identifier lost bytes due to encoding 2019-12-18 14:54:56 +00:00
Eliot Jones
dab64ec406 handle newlines before inline images and support larger data streams in brute force search 2019-12-18 12:02:07 +00:00
BobLd
6dba5bb2b4 update PublicApiScannerTests 2019-12-18 11:43:39 +00:00
BobLd
47b4428562 Adding Whitespace covering algorithm
Adding support for MaxDegreeOfParallelism in DocumentLayoutAnalysis
2019-12-18 11:41:39 +00:00
Eliot Jones
1fb416eee3 add convenience method to retrieve all hyperlinks and their text from annotations on a page 2019-12-18 11:41:02 +00:00