738 Commits

Author SHA1 Message Date
BobLd
84bab1b627 Add basic marked content extraction capabilities 2020-01-08 10:34:01 +00:00
Eliot Jones
63b118b141 handle type1 fonts disguised as truetype
if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser.

also handles a closesubpath command appearing without any path construction operators.
2020-01-07 16:49:21 +00:00
Eliot Jones
d267d7501a use encoding specified in base font if present
if the font uses a named encoding which is not recognised, use the corresponding encoding based on the base font name, or fall back to windows ansi encoding.
2020-01-07 16:01:45 +00:00
Eliot Jones
e588b2bc50 support documents without endobj for stream
some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.
2020-01-07 15:27:01 +00:00
Eliot Jones
10dc5a8eed don't cache invalid offsets unless brute forced
don't cache objects parsed if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because 1 object may lie in 2 streams, 1 valid and 1 invalid. If the invalid stream is parsed first for another object then the valid stream will never be read.
2020-01-07 14:54:12 +00:00
Eliot Jones
903d71a93d skip cross references outside file
if the previous cross-reference location points to an offset outside the file size we skip it.

also makes cid font factory more resilient by skipping missing descriptors.
2020-01-07 12:37:41 +00:00
Eliot Jones
5114b2da2c avoid overwriting cache for valid objects
some objects may be defined in more than one stream. parsing both streams would overwrite the object in the cache. to prevent this we avoid overwriting the existing object in the cache if it has the expected offset from the cross reference table.
2020-01-07 11:48:09 +00:00
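The caching rule in this commit can be illustrated with a small sketch. This is not PdfPig's implementation: the `ObjectCache` class, its method names and the plain dictionary standing in for the cross-reference table are all hypothetical, chosen only to show the "do not overwrite an entry that already matches the expected offset" idea.

```python
from typing import Any, Dict, Optional, Tuple


class ObjectCache:
    """Hypothetical cache keyed by object number (illustration only)."""

    def __init__(self, xref_offsets: Dict[int, int]) -> None:
        # Assumed shape: object number -> offset recorded in the cross-reference table.
        self._xref_offsets = xref_offsets
        self._entries: Dict[int, Tuple[int, Any]] = {}

    def store(self, object_number: int, offset: int, value: Any) -> bool:
        """Cache value unless a copy at the expected offset is already cached.

        Returns True when the value was stored, False when it was skipped.
        """
        existing = self._entries.get(object_number)
        if existing is not None and existing[0] == self._xref_offsets.get(object_number):
            # The cached copy came from the offset the cross-reference table
            # expects, so a later duplicate must not replace it.
            return False
        self._entries[object_number] = (offset, value)
        return True

    def get(self, object_number: int) -> Optional[Any]:
        entry = self._entries.get(object_number)
        return None if entry is None else entry[1]
```

With this sketch, a copy parsed at an unexpected offset can still be replaced later by the copy at the expected offset, but never the other way round.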
Eliot Jones
0b048fde57 handle eof further back in file
an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding.

we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.
2020-01-07 11:48:09 +00:00
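The widened end-of-file search can be sketched as below. This is a minimal Python illustration rather than the project's code: `find_eof_offset` is a hypothetical name, and the default window of 2048 bytes is simply double the 1024 bytes mentioned in the message.

```python
from typing import Optional

EOF_MARKER = b"%%EOF"


def find_eof_offset(data: bytes, search_range: int = 2048) -> Optional[int]:
    """Return the offset of the last %%EOF marker that starts within the
    final search_range bytes of data, or None if there is none."""
    start = max(0, len(data) - search_range)
    index = data.rfind(EOF_MARKER, start)
    return index if index != -1 else None
```

A file with more than 1024 bytes of trailing junk after the marker is missed by the narrower window but found by the wider one.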
Eliot Jones
3c19b988e2 merge pull request #120 from vadik299/master
Fix for rectangle width/height incorrectly parsed
2020-01-07 08:44:47 +00:00
vadik299
f00eb5efa2 Update AppendRectangle.cs
(fix) Rectangle width and height should be also transformed by CurrentTransformationMatrix
2020-01-07 00:23:10 -05:00
vadik299
6ca2190f67 Merge pull request #2 from UglyToad/master
update
2020-01-07 00:20:12 -05:00
Eliot Jones
fc9c1b6ff5 add method to retrieve single glyph bounds from truetype
this improves performance since we only need to load a single rectangle rather than the entire glyphs array including all points.
2020-01-06 14:43:51 +00:00
Eliot Jones
09c72a2fb2 handle 0 length glyph in true type font 2020-01-06 14:12:46 +00:00
Eliot Jones
80845863a8 version 0.1.0-beta001 0.1.0-beta001 2020-01-06 12:31:18 +00:00
Eliot Jones
e2c3b6dc8b update package icon #96 and readme
updates nuget package definition to use new format of package icon as required by #96. add readme information for hyperlinks and truetype fonts #8.
2020-01-06 12:28:54 +00:00
Eliot Jones
0183c0af5f add project for nuget package #119
in order to include all projects from the solution we create a new solution with an entry-point assembly which references all projects. calling dotnet pack on this single project then packages all assemblies into the produced nuget package.

also remove old glyph list references from the main project since they have moved to the fonts project.
2020-01-06 11:31:41 +00:00
Eliot Jones
00bd285262 add support for quadpoints to annotations
highlight, link, strikeout, squiggly and underline annotation types may define a set of quadrilaterals using the quadpoints entry. this defines the regions to show/activate the annotation. the order of points in the quadpoints array does not match the specification so we provide a convenience class to access the point data rather than interpreting it as a rectangle: https://stackoverflow.com/questions/9855814/pdf-spec-vs-acrobat-creation-quadpoints.
2020-01-05 16:23:07 +00:00
Eliot Jones
e064d39671 remove unused project references from document layout analysis 2020-01-05 15:44:02 +00:00
Eliot Jones
02f9166c00 use lazy loading for glyph data
glyph data in TrueType fonts can be very large and slow to parse. to avoid this we store the raw table data at parsing time and enable lazy loading of glyph descriptions.
2020-01-05 15:42:23 +00:00
Eliot Jones
1948b4ad9f merge pull request #117 from uglytoad/refactor-font-project
refactor font project
2020-01-05 14:31:22 +00:00
Eliot Jones
e0a45e3774 include dependencies as dlls in the published nuget
by default nuget pack does not include project dependencies. this is suboptimal since it would require managing at least 5 nuget packages. this uses a workaround detailed here https://github.com/nuget/home/issues/3891 to copy the dependent dlls to the generated nuget package. this doesn't resolve the issue of how we publish the documentlayoutanalysis project, since it is the top of the dependency tree and we publish its parent, rather than it.
2020-01-05 13:56:14 +00:00
Eliot Jones
e1b39983d0 handle missing encodings in cff fonts 2020-01-05 13:16:31 +00:00
Eliot Jones
b29354e3e6 move compact font format fonts to fonts project 2020-01-05 12:08:01 +00:00
Eliot Jones
bbde38f656 move tokenizers to their own project
since both pdfs and Adobe Type1 fonts use postscript type objects, tokenization is needed by the main project and the fonts project
2020-01-05 10:40:44 +00:00
Eliot Jones
d09b33af4d move tokens to new project 2020-01-05 10:07:01 +00:00
Eliot Jones
1c38a2ae8a move pdfline to the core project 2020-01-05 09:33:59 +00:00
Eliot Jones
15525acbaa move document layout analysis and export to new project 2020-01-05 09:19:58 +00:00
Eliot Jones
a6541f1cfc fix test references
update references for unit tests to reference new core and fonts projects. all tests except the public api scanner tests now run successfully.
2020-01-04 22:56:41 +00:00
Eliot Jones
74774995d6 complete move of truetype, afm and standard14 fonts
the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.
2020-01-04 22:39:13 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00
Eliot Jones
cf1b8651d6 make adler32checksum public
there's no reason to keep adler32checksum internal so it is made public in case people find it useful.
2020-01-04 10:27:07 +00:00
Eliot Jones
b355a31ae8 write valid zlib stream for flate
since c# only produces a deflate stream when compressing it is necessary to provide the header and footer bytes to convert this to a valid zlib stream. this involves setting the correct 2 bytes for the header and appending a 4 byte adler checksum for the uncompressed data after the compressed data stream.
2020-01-04 10:27:07 +00:00
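The wrapping described in this commit (2 header bytes, the raw deflate data, then a 4 byte Adler-32 checksum of the uncompressed data) can be sketched in Python. The function names are hypothetical and the header bytes `0x78 0x9C` are one common valid choice, not necessarily the ones the project writes.

```python
import struct
import zlib


def deflate_raw(data: bytes) -> bytes:
    """Compress to a raw deflate stream (negative wbits means no zlib framing)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def wrap_as_zlib(raw_deflate: bytes, uncompressed: bytes) -> bytes:
    """Add the zlib header and Adler-32 trailer around a raw deflate stream."""
    # CMF = 0x78 (deflate, 32K window), FLG = 0x9C; 0x789C is a multiple of 31,
    # which is the header check the zlib format requires.
    header = b"\x78\x9c"
    # The trailer is the Adler-32 of the *uncompressed* data, big-endian.
    trailer = struct.pack(">I", zlib.adler32(uncompressed) & 0xFFFFFFFF)
    return header + raw_deflate + trailer
```

A standard zlib decoder accepts the wrapped result, which makes for an easy round-trip check: `zlib.decompress(wrap_as_zlib(deflate_raw(data), data))` returns `data`.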
Eliot Jones
b7554e2838 compress page content stream
when writing a new pdf document we now use the flate filter to compress the page content stream. we also move letters in the same word into the same showtext operation.
2020-01-04 10:27:07 +00:00
Eliot Jones
694388f9b6 add more tables to subset
add additional optional and required (but really optional) tables to the truetype subset generated. adds cvt, fpgm and name tables to the output font file. also pads tables so they correctly appear on 4 byte boundaries.
2020-01-04 10:27:07 +00:00
Eliot Jones
b15a3a9b57 tidy up truetype tables
* improves the naming of truetype related classes.
* uses correct numeric type for the loca table.
* makes a few related classes public.
2020-01-04 10:27:07 +00:00
Eliot Jones
0b103ce8ad copy entire composite glyph to subset #98
our subsetted font was invalid because composite glyphs may include hinting instructions following the components. we use the existing glyph offsets to read the full length of the composite glyph data. the output files for roboto are now valid in all tested readers.

this completes the initial work required for truetype font subsetting and non-ascii support, some further work remains to tidy up the generated file and compress the page content stream.
2020-01-04 10:27:07 +00:00
Eliot Jones
90f8f97bfd add simple test case for subsetting issue #98
adds a single test which proves that the invalid truetype subsetting with roboto is related to our font subsetting code, since we can subset the same text correctly with windows calibri we must be reading roboto incorrectly.
2020-01-04 10:27:07 +00:00
Eliot Jones
fe315be2ef fix truetype subsetting for composite glyphs #98
each glyph included in the subset must count towards the number of glyphs, the horizontal metrics and the maximum profile table for the output truetype font. each glyph must also lie on a 4 byte boundary in the output file.

the output file is valid for the windows system font calibri containing accented characters but the roboto subset files are still invalid.

moves all subsetting related classes into their own namespace which will be made public.
2020-01-04 10:27:07 +00:00
Eliot Jones
bb5677c11e implement composite glyph support for subsetter #98
first pass at implementing composite glyph (glyphs formed by combining other simple glyphs) support for the subsetter. the produced file is valid as a pdf but does not display correctly for any composite glyphs. we need to check we're copying the full run of the composite glyph data as well as correctly setting any glyph indices, one idea is to try parsing the resulting font in pdfbox to see if fontbox can handle the subset we produce. next step is to add a test case with a single composite glyph and see what we're missing.

also remove the old cmap replacer code because it is obsoleted by the full subsetter.
2020-01-04 10:27:07 +00:00
Eliot Jones
f2f33dc5cf tidy up code in subsetting classes #98
removes some debugging code in the cmap replacer and moves the main glyph parsing logic in the glyph table subsetter to a method.

in the next commit we will delete the cmap replacer since it doesn't work and is no longer needed but we want a clean version of it in the commit history for reference.
2020-01-04 10:27:07 +00:00
Eliot Jones
1dd46bace2 generate full subset of truetype font when writing #98
* add writeable support for format 0 cmap subtable, index to location table and horizontal metrics table.
* add fix for writing cmap table offsets.
* add subsetter which copies only the glyphs required to the output font file. this ensures the output font uses valid indices, only includes required glyph data and is compliant with adobe acrobat reader.

a couple of tests are still failing because support for composite glyphs needs to be added to the subsetter.
2020-01-04 10:27:07 +00:00
Eliot Jones
336947db73 add writing methods to truetype tables #98
since we have verified the problem with the characters not appearing in acrobat reader isn't the checksum (other files also have invalid checksums but work) it seems likely the issue is with the os/2 table.

this change moves the logic for writing out the cmap table, the format 6 cmap sub-table, truetype table headers and the os/2 table into the classes themselves. now we can write an os/2 table and we've tested that the output matches the input, we can overwrite the os/2 table in order to work out which of the os/2 errors is causing our font to be invalid.

the writeable interface should be added to more and more parts of the codebase so that writing, editing and document creation become first class citizens rather than hardcoded additions.

this change also adds the macroman (1,0) cmap subtable to edited fonts so that it is present for consumers which expect it.
2020-01-04 10:27:07 +00:00
Eliot Jones
9fff879bd4 fix tests by using custom equality comparers
since we now round glyph widths for truetype fonts in the widths array of the pdf some values are out by a very small amount from the expected value. since we don't care about such fractional inaccuracy we use a custom comparer for these tests.
2020-01-04 10:27:07 +00:00
Eliot Jones
3ad03ff3ee finish implementation of truetype cmap replacer #98
this overwrites the cmap table which is moved to the end of the truetype font file. the new table contains a single windows symbol subtable (3,0) of the format 6 type which maps the character codes in the single range 0 -> glyphcount to the corresponding glyph indices in the font. the new cmap table is then written to the new font file and the header value for length is updated.

this also changes many truetype classes to use the corresponding ushort datatype rather than ints, to save space.

the generated file is displaying correctly in most pdf viewers and passes all tests but in adobe acrobat reader the text is present but invisible. this was not a problem with the previous approach to file generation. there is no log information as to why this might be the case but it seems like the answer must be related to the validity of the overwritten truetype file. we might need to provide an additional macroman cmap subtable in case this is required by acrobat reader.

running the produced font for andada-regular through fontvalidator https://github.com/HinTak/Font-Validator/releases indicates a number of issues that may cause the file to be an invalid font (cannot open with microsoft font viewer). the next step is to compare the errors present in the unmodified andada regular file with the errors in our version. the most likely candidates seem to be:
* os/2: font is a symbol font but panose byte 1, familytype, is not set to latin symbol.
* os/2: a unicode range was indicated in ulunicoderange but the font has no characters in that range.
* os/2: the usfirstcharindex/uslastcharindex is not valid.
* os/2: the font contains a 3,0 cmap but the codepagerange bit 31 is clear.
* os/2: the usbreakchar is not mapped to a glyf.
* head: font checksum is incorrect (this can also be the case for working fonts so seems unlikely to be the cause).
2020-01-04 10:27:07 +00:00
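For reference, the format 6 ("trimmed table mapping") cmap subtable mentioned in this commit has a simple layout: five big-endian 16-bit fields (format, length, language, first code, entry count) followed by one 16-bit glyph index per code. The sketch below builds such a subtable in Python; `build_format6_subtable` is a hypothetical helper and does not reproduce the project's writer.

```python
import struct
from typing import Sequence


def build_format6_subtable(first_code: int, glyph_ids: Sequence[int]) -> bytes:
    """Serialize a cmap format 6 subtable mapping the contiguous code range
    first_code .. first_code + len(glyph_ids) - 1 to the given glyph indices."""
    entry_count = len(glyph_ids)
    length = 10 + 2 * entry_count  # 5 header fields of 2 bytes, then the array
    language = 0                   # assumption: language-independent subtable
    header = struct.pack(">5H", 6, length, language, first_code, entry_count)
    body = struct.pack(f">{entry_count}H", *glyph_ids)
    return header + body
```

Mapping codes 0, 1 and 2 to glyph indices 0, 1 and 2 yields a 16 byte subtable.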
Eliot Jones
59c43cc526 truetype encoding replacer and checksum calculator #98
we need to provide a custom cmap for our overridden fonts when creating a document using truetype fonts. in order to do this without writing a complete subsetter (yet) we simply rearrange the font by moving the cmap table to the end of the font.

in order to keep a valid font we need to recalculate the offsets and checksums for all table headers. this adds a calculator which can calculate per-table checksums as well as the whole-font checksum used to calculate the checksum adjustment recorded in the head table.

now that the cmap table has been moved to the end of the font file we can overwrite it with a different-length custom cmap table without further invasive changes to the rest of the truetype file. this isn't implemented yet in this commit but will be the next thing to implement.

in truetype writing font we've temporarily reverted the change which maps characters to bytes until the custom cmap is written so we can ensure for this change the output font file is still valid and can be interpreted by pdf consumers. once the custom cmap is written we can uncomment the mapping logic and it should all just work.
2020-01-04 10:27:07 +00:00
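The two checksums named in this commit follow the TrueType rules: a table checksum is the sum of the table's big-endian 32-bit words (zero-padded to a multiple of 4 bytes) modulo 2^32, and the checksum adjustment stored in the head table is `0xB1B0AFBA` minus the checksum of the whole font. A minimal Python sketch, with hypothetical function names:

```python
import struct


def table_checksum(data: bytes) -> int:
    """Sum the data as big-endian uint32 words, modulo 2**32."""
    padded = data + b"\x00" * (-len(data) % 4)  # pad to a 4 byte boundary
    total = 0
    for (word,) in struct.iter_unpack(">I", padded):
        total = (total + word) & 0xFFFFFFFF
    return total


def checksum_adjustment(font_bytes: bytes) -> int:
    """Compute the head table adjustment value.

    Assumes the caller has already zeroed the checkSumAdjustment field
    inside font_bytes, as the format requires before summing.
    """
    return (0xB1B0AFBA - table_checksum(font_bytes)) & 0xFFFFFFFF
```

Because every table is summed independently, moving a table (as this commit does with cmap) changes only the offsets in the table directory and the whole-font checksum, not the per-table checksums.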
Eliot Jones
f319e7f4b5 adds per character byte mapping to truetype #98
this starts to add logic for per-character mapping of unicode characters to byte values for truetype fonts in the pdf document builder. in order to support unicode characters outside the 0-255 range when creating new pdf documents without using composite fonts, we need to map values outside this range into this range. to do this we start at 1 and map each character we encounter to the next code, up to a maximum of 255. we provide a custom tounicode cmap in the font dictionary which maps these byte values, 0-255, back to unicode code points (short).

we also provide a custom firstchar, lastchar and widths array for the font mapping just the values we use.

since fonts no longer contain just the latin character set the font descriptor enum is set to have the symbolic flag set. this means values will be looked up in either the mac-roman (1, 0) or windows-symbol (3, 0) cmap tables (these cmap tables are distinct from cmap tables in the pdf file) inside the actual truetype font bytes. this means the currently generated font file is invalid, because while the widths array and tounicode cmap return the correct values the actual font itself returns whatever values were in those positions before the remapping occurred.

in order to fix this we will need to override the windows-symbol cmap contained in the underlying truetype font to match our mapping. this will be a lot of work and involve significant rewriting of the font file itself, in order to preserve checksum integrity.
2020-01-04 10:27:07 +00:00
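The per-character mapping described in this commit can be sketched as follows. This is a Python illustration under stated assumptions, not the document builder's code: `CharacterCodeMapper` and its methods are hypothetical, and the reverse dictionary merely stands in for the data a ToUnicode CMap would carry.

```python
from typing import Dict


class CharacterCodeMapper:
    """Assign each new character the next single-byte code, starting at 1."""

    MAX_CODE = 255

    def __init__(self) -> None:
        self._codes: Dict[str, int] = {}

    def code_for(self, character: str) -> int:
        """Return the byte code for character, allocating one on first use."""
        existing = self._codes.get(character)
        if existing is not None:
            return existing
        next_code = len(self._codes) + 1
        if next_code > self.MAX_CODE:
            raise ValueError("more than 255 distinct characters in one font")
        self._codes[character] = next_code
        return next_code

    def to_unicode(self) -> Dict[int, int]:
        """Map each allocated byte code back to its Unicode code point."""
        return {code: ord(ch) for ch, code in self._codes.items()}
```

Repeated characters reuse their code, so the text "héllo" is encoded as the codes 1, 2, 3, 3, 4, and code 2 maps back to U+00E9.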
BobLd
f67cce31b5 Adding a 'minimumEditDistanceNormalised' parameter to allow for other edit distance implementations. 2020-01-03 12:31:23 +00:00
BobLd
e46df38f4d Make numbersPattern private 2020-01-03 12:31:23 +00:00
BobLd
39f275aaeb Improve numbers pattern matching and include roman numerals 2020-01-03 12:31:23 +00:00
BobLd
07f51712c6 Update PublicApiScannerTests 2020-01-03 12:31:23 +00:00