when a marked content region contains an image we load it eagerly since we won't have access to the necessary classes at evaluation time. we also default image colorspace to the active graphics state colorspace if the dictionary doesn't contain a valid entry.
since the properties in marked content may be indirect references or belong to the page resources array, the value should be calculated during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.
since an inline image's data stream may contain the characters 'ei' as a result of compression, it's possible to read an end image operator mid-data. this results in the next operator also being end image and leaves the content stream in an invalid state. to recover, when we detect this situation we remove the previous operator, read up to the current operator and replace the operator and data bytes in the list of operations.
pdfbox defaults to us letter if the mediabox is missing rather than throwing, so we remove the behaviour where a missing mediabox threw when uselenientparsing was false. we now log an error instead. throwing didn't provide any benefit to consumers.
if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c (compact font format) font, so we fall back to using the type1 parser.
also handles a closesubpath command appearing without any path construction operators.
if the font uses a named encoding which is not recognised, use the corresponding encoding based on the base font name, or fall back to windows ansi encoding.
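the fallback order above can be sketched roughly as follows. this is an illustrative python sketch rather than the library's code: `resolve_encoding` and the `BASE_FONT_ENCODINGS` mapping are hypothetical names, and the assumption that only symbol and zapfdingbats have their own built-in encodings is ours.

```python
from typing import Optional

# the named encodings defined by the pdf specification.
KNOWN_ENCODINGS = {
    "StandardEncoding",
    "MacRomanEncoding",
    "MacExpertEncoding",
    "WinAnsiEncoding",
}

# hypothetical mapping from base font name to a built-in encoding
# (assumption: only these two base fonts carry their own encoding).
BASE_FONT_ENCODINGS = {
    "Symbol": "SymbolEncoding",
    "ZapfDingbats": "ZapfDingbatsEncoding",
}


def resolve_encoding(named_encoding: Optional[str], base_font: Optional[str]) -> str:
    """Pick an encoding name, falling back when the named one is unknown."""
    if named_encoding in KNOWN_ENCODINGS:
        return named_encoding
    # unrecognised name: try the encoding implied by the base font.
    if base_font in BASE_FONT_ENCODINGS:
        return BASE_FONT_ENCODINGS[base_font]
    # last resort, as described in the entry above.
    return "WinAnsiEncoding"
```

for example, `resolve_encoding("NotARealEncoding", "Symbol")` picks the symbol encoding, while an unknown name with an unknown base font ends up as windows ansi.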
some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.
don't cache parsed objects if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because one object may lie in two streams, one valid and one invalid. if the invalid stream is parsed first for another object then the valid stream will never be read.
if the previous cross-reference location points to an offset outside the file size we skip it.
also makes cid font factory more resilient by skipping missing descriptors.
some objects may be defined in more than one stream. parsing both streams would overwrite the object in the cache. to prevent this we avoid overwriting the existing object in the cache if it has the expected offset from the cross reference table.
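a minimal sketch of the overwrite rule, assuming a cache keyed by object number that remembers the offset each object was read from. `ObjectCache` and its methods are hypothetical names for illustration in python, not the library's types.

```python
from typing import Any, Dict, Optional, Tuple


class ObjectCache:
    """Caches parsed objects and protects entries that match the xref table."""

    def __init__(self, xref_offsets: Dict[int, int]) -> None:
        # object number -> offset expected by the cross reference table
        self._xref_offsets = xref_offsets
        # object number -> (parsed value, offset it was read from)
        self._entries: Dict[int, Tuple[Any, int]] = {}

    def store(self, object_number: int, value: Any, offset: int) -> bool:
        """Store the object; return False if an existing entry was kept."""
        existing = self._entries.get(object_number)
        expected = self._xref_offsets.get(object_number)
        if existing is not None and expected is not None and existing[1] == expected:
            # the cached copy already sits at the expected offset, keep it.
            return False
        self._entries[object_number] = (value, offset)
        return True

    def get(self, object_number: int) -> Optional[Any]:
        entry = self._entries.get(object_number)
        return None if entry is None else entry[0]
```

with this rule the order in which the two streams are parsed no longer matters: a copy read from the expected offset replaces a stray copy, but never the other way round.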
an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding.
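as a rough illustration of the wider search, the python sketch below looks for the last %%eof marker in the final 2048 bytes of a file. `find_eof_offset` and its `search_range` parameter are hypothetical names, not the library's api.

```python
def find_eof_offset(data: bytes, search_range: int = 2048) -> int:
    """Return the offset of the last %%EOF marker found within the final
    `search_range` bytes of `data`, or -1 if there is none."""
    start = max(0, len(data) - search_range)
    # rfind returns an absolute index into data, or -1 when not found.
    return data.rfind(b"%%EOF", start)
```

a file with more than 1024 bytes of trailing junk after the marker is missed with `search_range=1024` but found with the doubled default.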
we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.
in order to include all projects from the solution we create a new solution with an entry-point assembly which references all projects. calling dotnet pack on this single project then packages all assemblies into the produced nuget package.
also remove old glyph list references from the main project since they have moved to the fonts project.
highlight, link, strikeout, squiggly and underline annotation types may define a set of quadrilaterals using the quadpoints entry. this defines the regions to show/activate the annotation. the order of points in the quadpoints array does not match the specification so we provide a convenience class to access the point data rather than interpreting it as a rectangle: https://stackoverflow.com/questions/9855814/pdf-spec-vs-acrobat-creation-quadpoints.
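a sketch of what such a convenience class could look like, in python for illustration. it assumes the quadpoints array is a flat list of numbers with 8 values (4 x/y pairs) per quadrilateral, and it keeps the points in the order they were stored rather than assigning them to named corners. `Quadrilateral` and `read_quad_points` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Quadrilateral:
    """Four points in the order they appear in the quadpoints array."""

    points: Tuple[Point, Point, Point, Point]

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Return (min x, min y, max x, max y), independent of point order."""
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        return (min(xs), min(ys), max(xs), max(ys))


def read_quad_points(values: Sequence[float]) -> List[Quadrilateral]:
    """Group a flat quadpoints array into quadrilaterals of 4 points each."""
    if len(values) % 8 != 0:
        raise ValueError("quadpoints length must be a multiple of 8")
    quads = []
    for start in range(0, len(values), 8):
        chunk = values[start:start + 8]
        points = tuple(
            (float(chunk[i]), float(chunk[i + 1])) for i in range(0, 8, 2)
        )
        quads.append(Quadrilateral(points))
    return quads
```

because consumers get the raw points plus an order-independent bounding box, they don't need to rely on which corner comes first.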
glyph data in TrueType fonts can be very large and slow to parse. to avoid this we store the raw table data at parsing time and enable lazy loading of glyph descriptions.
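the idea can be sketched as below, in python for illustration: keep the raw glyph table bytes and the glyph offsets, and only decode a glyph when it is first requested. `LazyGlyphTable` is a hypothetical name, and for brevity the sketch only decodes the 10 byte glyph header (contour count and bounding box as big-endian 16 bit integers) rather than a full glyph description.

```python
import struct
from typing import Dict, Optional, Sequence, Tuple

GlyphHeader = Tuple[int, int, int, int, int]


class LazyGlyphTable:
    """Holds raw glyph table bytes and decodes glyph headers on demand."""

    def __init__(self, raw: bytes, offsets: Sequence[int]) -> None:
        self._raw = raw
        # offsets has one more entry than there are glyphs.
        self._offsets = offsets
        self._cache: Dict[int, Optional[GlyphHeader]] = {}
        self.parse_count = 0  # only here to show that caching works

    def glyph_header(self, index: int) -> Optional[GlyphHeader]:
        """Return (contours, x min, y min, x max, y max) or None if empty."""
        if index in self._cache:
            return self._cache[index]
        start, end = self._offsets[index], self._offsets[index + 1]
        if start == end:
            header = None  # zero length entry: glyph has no outline
        else:
            header = struct.unpack_from(">hhhhh", self._raw, start)
            self.parse_count += 1
        self._cache[index] = header
        return header
```

nothing is decoded at construction time, so opening the font stays cheap even when the glyph table is large.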
by default nuget pack does not include project dependencies. this is suboptimal since it would require managing at least 5 nuget packages. this uses a workaround detailed here https://github.com/nuget/home/issues/3891 to copy the dependent dlls to the generated nuget package. this doesn't resolve the issue of how we publish the documentlayoutanalysis project, since it is the top of the dependency tree and we publish its parent, rather than it.
the 3 font types mentioned are moved to the new fonts project; any referenced types are moved to the core project. most truetype classes are made public (#8).
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion and pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion, as well as releasing a standalone version of the font library as pdfbox does with fontbox.
since c# only produces a deflate stream when compressing it is necessary to provide the header and footer bytes to convert this to a valid zlib stream. this involves setting the correct 2 bytes for the header and appending a 4 byte adler checksum for the uncompressed data after the compressed data stream.
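the wrapping step can be illustrated in python, whose zlib module can also emit a raw deflate stream. this is a sketch of the technique rather than the library's c# code: `deflate_raw` and `wrap_as_zlib` are hypothetical names, and the header bytes 0x78 0x9c are one common valid choice (deflate, 32k window, default compression), not necessarily the ones the library writes.

```python
import struct
import zlib


def deflate_raw(data: bytes) -> bytes:
    """Compress to a raw deflate stream with no zlib header or checksum."""
    compressor = zlib.compressobj(level=6, wbits=-15)
    return compressor.compress(data) + compressor.flush()


def wrap_as_zlib(raw_deflate: bytes, uncompressed: bytes) -> bytes:
    """Add the 2 byte zlib header and the 4 byte adler-32 footer."""
    header = bytes([0x78, 0x9C])
    # the checksum covers the uncompressed data and is stored big-endian.
    checksum = struct.pack(">I", zlib.adler32(uncompressed) & 0xFFFFFFFF)
    return header + raw_deflate + checksum


content = b"BT /F1 12 Tf (hello) Tj ET\n" * 20
wrapped = wrap_as_zlib(deflate_raw(content), content)
# a standard zlib decoder accepts the wrapped stream.
assert zlib.decompress(wrapped) == content
```

the header value must be a multiple of 31 when read as a 16 bit big-endian number, which holds for 0x789c.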
when writing a new pdf document we now use the flate filter to compress the page content stream. we also move letters in the same word into the same showtext operation.
add additional optional and required (but really optional) tables to the truetype subset generated. adds cvt, fpgm and name tables to the output font file. also pads tables so they correctly appear on 4 byte boundaries.
our subsetted font was invalid because composite glyphs may include hinting instructions following the components. we use the existing glyph offsets to read the full length of the composite glyph data. the output files for roboto are now valid in all tested readers.
this completes the initial work required for truetype font subsetting and non-ascii support. some further work remains to tidy up the generated file and compress the page content stream.
adds a single test which proves that the invalid truetype subsetting with roboto is related to our font subsetting code. since we can subset the same text correctly with windows calibri, we must be reading roboto incorrectly.
each glyph included in the subset must count towards the number of glyphs, the horizontal metrics and the maximum profile table for the output truetype font. each glyph must also lie on a 4 byte boundary in the output file.
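the alignment requirement can be sketched as follows, in python for illustration. `pad_to_four_bytes` and `build_glyph_table` are hypothetical helpers, and padding with zero bytes is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def pad_to_four_bytes(data: bytes) -> bytes:
    """Append zero bytes so the length is a multiple of 4."""
    remainder = len(data) % 4
    if remainder == 0:
        return data
    return data + b"\x00" * (4 - remainder)


def build_glyph_table(glyphs: Sequence[bytes]) -> Tuple[bytes, List[int]]:
    """Concatenate glyph data so every glyph starts on a 4 byte boundary.

    Returns the table bytes and the offsets for an index to location
    table, which has one more entry than there are glyphs."""
    table = bytearray()
    offsets = [0]
    for glyph in glyphs:
        table += pad_to_four_bytes(glyph)
        offsets.append(len(table))
    return bytes(table), offsets
```

for three glyphs of 5, 0 and 8 bytes this yields offsets 0, 8, 8 and 16: the empty glyph takes no space and the others are padded up to the next boundary.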
the output file is valid for the windows system font calibri containing accented characters but the roboto subset files are still invalid.
moves all subsetting related classes into their own namespace which will be made public.
first pass at implementing composite glyph (glyphs formed by combining other simple glyphs) support for the subsetter. the produced file is valid as a pdf but does not display correctly for any composite glyphs. we need to check we're copying the full run of the composite glyph data as well as correctly setting any glyph indices. one idea is to try parsing the resulting font in pdfbox to see if fontbox can handle the subset we produce. next step is to add a test case with a single composite glyph and see what we're missing.
also remove the old cmap replacer code because it is obsoleted by the full subsetter.
removes some debugging code in the cmap replacer and moves the main glyph parsing logic in the glyph table subsetter to a method.
in the next commit we will delete the cmap replacer since it doesn't work and is no longer needed but we want a clean version of it in the commit history for reference.
* add writeable support for format 0 cmap subtable, index to location table and horizontal metrics table.
* add fix for writing cmap table offsets.
* add subsetter which copies only the glyphs required to the output font file. this ensures the output font uses valid indices, only includes required glyph data and is compliant with adobe acrobat reader.
a couple of tests are still failing because support for composite glyphs needs to be added to the subsetter.