Commit Graph

34 Commits

Author SHA1 Message Date
Eliot Jones
efc258b0f0 use tokenscanner when converting array to rectangle
an array of 4 items representing a rectangle may define its values as indirect references. when converting to a rectangle we pass a pdf token scanner to resolve any indirect references.
2020-01-13 10:20:08 +00:00
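The conversion described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `IndirectReference`, `to_rectangle` and the `objects` lookup table (standing in for the pdf token scanner) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple, Union


@dataclass(frozen=True)
class IndirectReference:
    """Hypothetical stand-in for a PDF indirect reference such as '12 0 R'."""
    object_number: int
    generation: int


Number = Union[int, float]
Item = Union[Number, IndirectReference]


def to_rectangle(
    items: Sequence[Item],
    objects: Dict[IndirectReference, Number],
) -> Tuple[float, float, float, float]:
    """Convert a 4-item array to (x1, y1, x2, y2), resolving references first.

    `objects` is an assumption: a plain dictionary playing the role of the
    scanner that looks up the object an indirect reference points to.
    """
    if len(items) != 4:
        raise ValueError("a rectangle array must contain exactly 4 items")

    resolved = []
    for item in items:
        if isinstance(item, IndirectReference):
            # Replace the reference with the numeric object it points to.
            item = objects[item]
        resolved.append(float(item))

    return (resolved[0], resolved[1], resolved[2], resolved[3])
```

For example, `to_rectangle([0, IndirectReference(12, 0), 612, 792], {IndirectReference(12, 0): 0})` resolves the second entry before building the rectangle.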
Eliot Jones
0b048fde57 handle eof further back in file
an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding.

we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.
2020-01-07 11:48:09 +00:00
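A rough sketch of the backwards search the commit above describes, assuming the whole file is held in memory as bytes. The function name `find_eof_marker` and its `search_range` parameter are hypothetical; the 2048 default is simply the doubled 1024-byte range mentioned in the message.

```python
def find_eof_marker(data: bytes, search_range: int = 2048) -> int:
    """Return the offset of the last '%%EOF' within the trailing window.

    Only the final `search_range` bytes are searched. Returns -1 when the
    marker does not start inside that window.
    """
    start = max(0, len(data) - search_range)
    # rfind returns an absolute offset into `data`, or -1 if not found.
    return data.rfind(b"%%EOF", start)
```

With trailing padding of around 1500 bytes after the marker, a 1024-byte window misses it while the doubled window finds it.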
Eliot Jones
bbde38f656 move tokenizers to their own project
since both pdfs and Adobe Type1 fonts use postscript type objects, tokenization is needed by the main project and the fonts project
2020-01-05 10:40:44 +00:00
Eliot Jones
74774995d6 complete move of truetype, afm and standard14 fonts
the 3 font types mentioned are moved to the new fonts project, any referenced types are moved to the core project. most truetype classes are made public #8.
2020-01-04 22:39:13 +00:00
Eliot Jones
7c0ef111ea move classes to new projects
to make the project more useful and expose more usable classes we're rearchitecting in the following way. code used to read fonts from external file formats like truetype, adobe font metrics (afm) and adobe type 1 fonts is moving to a new project which doesn't reference most of the pdf logic. the shared logic is moving to a new flat-structured project called core. this is a sort-of onion type architecture, with core being the... core, fonts being the next layer of the onion, pdfpig itself the next. this will then support additional libraries/projects as outer layers of the onion as well as releasing a standalone version of the font library as pdfbox does with fontbox.
2020-01-04 16:38:18 +00:00
Eliot Jones
b355a31ae8 write valid zlib stream for flate
since c# only produces a deflate stream when compressing it is necessary to provide the header and footer bytes to convert this to a valid zlib stream. this involves setting the correct 2 bytes for the header and appending a 4 byte adler checksum for the uncompressed data after the compressed data stream.
2020-01-04 10:27:07 +00:00
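The framing described in the commit above can be illustrated in Python, whose `zlib` module can emit a raw deflate stream when given negative `wbits`. This is a sketch of the general technique rather than the project's C# code; the helper names are hypothetical, and the header bytes `0x78 0x9C` are one common valid choice (deflate, 32K window, default compression level), not necessarily the ones the commit uses.

```python
import struct
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Produce a headerless deflate stream (negative wbits disables framing)."""
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    return compressor.compress(data) + compressor.flush()


def wrap_deflate_as_zlib(deflated: bytes, uncompressed: bytes) -> bytes:
    """Add the 2-byte zlib header and 4-byte Adler-32 trailer.

    The checksum is computed over the UNCOMPRESSED data and stored
    big-endian after the compressed bytes.
    """
    header = bytes([0x78, 0x9C])  # (0x78 * 256 + 0x9C) is a multiple of 31
    trailer = struct.pack(">I", zlib.adler32(uncompressed) & 0xFFFFFFFF)
    return header + deflated + trailer
```

A stream built this way should be accepted by any standard zlib decoder, for example `zlib.decompress(wrap_deflate_as_zlib(raw_deflate(data), data))`.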
Eliot Jones
336947db73 add writing methods to truetype tables #98
since we have verified the problem with the characters not appearing in acrobat reader isn't the checksum (other files also have invalid checksums but work) it seems likely the issue is with the os/2 table.

this change moves the logic for writing out the cmap table, the format 6 cmap sub-table, truetype table headers and the os/2 table into the classes themselves. now that we can write an os/2 table and we've tested that the output matches the input, we can overwrite the os/2 table in order to work out which of the os/2 errors is causing our font to be invalid.

the writeable interface should be added to more and more parts of the codebase so that writing, editing and document creation become first class citizens rather than hardcoded additions.

this change also adds the macroman (1,0) cmap subtable to edited fonts so that it is present for consumers which expect it.
2020-01-04 10:27:07 +00:00
Eliot Jones
935d182888 use doubles where calculations are being run 2019-12-24 12:22:17 +00:00
Eliot Jones
a967e0898a handle missing width and height correctly for compact font format fonts #75 2019-12-04 14:19:06 +00:00
Eliot Jones
677d2b5e8f #82 make resource store state local to the page and operation being processed
resources such as fonts are linked to page content operations using name labels, e.g. "/F1", these resource labels can be reassigned on different pages or inside form xobjects. we now clear the entire resource state for each page which is parsed and after form xobject operations which use resource dictionaries.
2019-11-25 14:34:02 +00:00
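One way to picture the scoping problem in the commit above is a store whose name-to-resource mapping is reset for every page and restored after a form xobject finishes. The sketch below is an assumption-heavy illustration: the `ResourceStore` class, its method names and the use of a stack of dictionaries are all hypothetical and are not taken from the project.

```python
from typing import Dict, List


class ResourceStore:
    """Hypothetical store mapping labels such as '/F1' to font names."""

    def __init__(self) -> None:
        self._scopes: List[Dict[str, str]] = [{}]

    def load_page(self, fonts: Dict[str, str]) -> None:
        # Discard everything from the previous page so labels cannot leak.
        self._scopes = [dict(fonts)]

    def enter_form(self, fonts: Dict[str, str]) -> None:
        # A form xobject's own resource dictionary shadows the page's labels.
        self._scopes.append(dict(fonts))

    def exit_form(self) -> None:
        # Drop the form's labels once its operations have been processed.
        self._scopes.pop()

    def get_font(self, label: str) -> str:
        # Innermost scope wins; fall back to the page-level labels.
        for scope in reversed(self._scopes):
            if label in scope:
                return scope[label]
        raise KeyError(label)
```

Under this sketch, the same label `/F1` can point at different fonts on consecutive pages, or inside a form, without one definition overwriting another.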
BobLd
99f260befb Enhancing NearestNeighbourWordExtractor
- Making the code easier to read
- Using 20% of Width instead of 60%
- Making DefaultWordExtractor public
2019-10-21 20:51:27 +01:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
Eliot Jones
e02e130947 #57 add creation and modified date to document information
this enables users to check if xmp metadata is outdated
2019-10-03 12:56:48 +01:00
Eliot Jones
d98b8b43c1 small performance tweaks and remove package license expression
package license url is deprecated in favour of package license expression but nuget doesn't seem to support expressions properly for published packages yet so we'll keep the deprecated url for the time being. having both url and expression causes the build to fail.

small obvious performance improvements for file header parsing and getting the encoding information using the existing reverse name to code map.
2019-08-18 13:47:01 +01:00
Eliot Jones
3c49371c68 test hex to string implementation and remove unused method 2019-07-07 17:30:54 +01:00
Eliot Jones
f375cb6f04 keep letters in word when using default word extractor 2019-05-30 20:07:52 +01:00
BobLd
65647febcf - Adding a TextDirection enum.
- In the Letter class:
     - Renaming 'Location' to 'StartBaseLine' and adding 'EndBaseLine' for better localisation of the letter ('Location' is also kept).
     - Adding TextDirection.
2019-04-19 21:33:31 +01:00
Eliot Jones
575953c0ed add multi targeting frameworks in the same project for net 4.5 through net 4.7 and net standard 2.0 2019-01-06 11:06:02 +00:00
Eliot Jones
21aa964e0b #24 add different field types and code to read them 2019-01-02 22:28:50 +00:00
Eliot Jones
20e843f5ae #24 start adding classes for the acroform api 2019-01-01 17:44:46 +00:00
Eliot Jones
a5349dd77a start adding retrieval of annotations 2018-12-20 18:18:32 +00:00
Eliot Jones
a5ce43774b revert change to public api of letter. update readme 2018-11-26 20:18:00 +00:00
Eliot Jones
fdd48b25d8 #15 change default word extraction for latex test 2018-11-25 10:10:28 +00:00
Eliot Jones
17909f8565 #15 add classes to extract words and initial tests 2018-11-24 20:51:27 +00:00
Eliot Jones
2fa781b8e9 #10 make all token classes public and expose via a public structure member on pdf document 2018-11-24 19:02:06 +00:00
Eliot Jones
2c159f71e8 #6 rename some cff classes, change protection levels and start fixing bugs with charstrings which include hints in routine calls 2018-11-18 16:32:28 +00:00
Eliot Jones
0f68dfeb19 #10 move tokens to the root namespace for discoverability. upgrade xunit versions. there is a bug with test discovery for stringtokenizertests 2018-11-16 20:00:12 +00:00
Eliot Jones
904f773525 add code for drawing type 1 glyphs and converting to svg 2018-11-13 20:45:54 +00:00
Eliot Jones
1d4dc7767d change type1 commands to be static and lazily evaluated and return the command sequences from the parser 2018-11-01 19:34:22 +00:00
Eliot Jones
e24a306c31 remove all old parsing logic 2018-01-21 14:48:49 +00:00
Eliot Jones
da7d83d863 finish the migration 2018-01-20 20:20:40 +00:00
Eliot Jones
7d90f4858a continue migrating code to tokenizer 2018-01-20 18:42:29 +00:00
Eliot Jones
a0deab446b switch classes still using the cos object approach to the tokenization approach initially used for parsing cmap files. 2018-01-19 00:35:04 +00:00
Eliot Jones
ec62542b64 change the project name to something silly 2018-01-10 19:49:32 +00:00