753 Commits

Author SHA1 Message Date
Eliot Jones
8ca947542f skip unrelated entries in document name tree 2019-12-05 13:47:42 +00:00
Eliot Jones
2e5c995322 make external nodes different to document nodes and finish reimplementation 2019-12-05 13:21:19 +00:00
Eliot Jones
2ea71ce3bb fix off-by-one error in format 4 cmap subtable for truetype #91 2019-12-05 12:21:58 +00:00
Eliot Jones
ecf0b8743b make bookmarknode immutable and use scanner when retrieving bookmarks 2019-12-05 12:03:30 +00:00
Eliot Jones
928347bcce merge pull request #84 from BobLd/master
add basic bookmarks extraction capabilities.
2019-12-04 14:24:10 +00:00
Eliot Jones
a967e0898a handle missing width and height correctly for compact font format fonts #75 2019-12-04 14:19:06 +00:00
Eliot Jones
8a51795e99 update codecov version for azure pipeline 2019-11-27 16:45:05 +00:00
Eliot Jones
80f024dbed make form access public 2019-11-27 16:36:25 +00:00
Eliot Jones
df3cb43cfc update coverage libraries 2019-11-27 16:16:11 +00:00
Eliot Jones
ed53773c7b handle checked state of radio buttons and checkboxes 2019-11-27 15:34:28 +00:00
Eliot Jones
910e22a4e9 wrap checkboxes and radiobuttons in their own form field types with access to the child collections 2019-11-26 16:33:24 +00:00
BobLd
9da0623fab Merge branch 'master' of https://github.com/UglyToad/PdfPig 2019-11-26 12:16:43 +00:00
Eliot Jones
677d2b5e8f #82 make resource store state local to the page and operation being processed
resources such as fonts are linked to page content operations using name labels, e.g. "/F1", these resource labels can be reassigned on different pages or inside form xobjects. we now clear the entire resource state for each page which is parsed and after form xobject operations which use resource dictionaries.
2019-11-25 14:34:02 +00:00
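The scoping problem described in the commit above can be sketched roughly as follows. This is a minimal illustration in Python, not PdfPig's actual code: `ResourceStore`, `resolve_per_page`, and the font names are hypothetical, and only the per-page clearing is modelled (form XObject handling is left out).

```python
from typing import Dict, List, Optional


class ResourceStore:
    """Hypothetical store mapping resource labels such as "/F1" to fonts."""

    def __init__(self) -> None:
        self._fonts: Dict[str, str] = {}

    def clear(self) -> None:
        # Drop every label so nothing leaks from a previously parsed page.
        self._fonts.clear()

    def load(self, fonts: Dict[str, str]) -> None:
        # Register the labels declared by the current resource dictionary.
        self._fonts.update(fonts)

    def get_font(self, label: str) -> Optional[str]:
        # Returns None when the current scope does not define the label.
        return self._fonts.get(label)


def resolve_per_page(pages: List[Dict[str, str]], label: str) -> List[Optional[str]]:
    """Resolve `label` on each page, clearing the store before each page."""
    store = ResourceStore()
    resolved = []
    for page_fonts in pages:
        store.clear()           # state is local to the page being processed
        store.load(page_fonts)
        resolved.append(store.get_font(label))
    return resolved
```

With the `clear()` call in place, a page that reassigns "/F1" gets its own font, and a page that never defines "/F1" no longer inherits the previous page's mapping.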
Eliot Jones
9028f932b2 #83 decrypt dictionary hex values 2019-11-25 12:42:32 +00:00
BobLd
89daa2818e update PublicApiScannerTests 2019-11-04 15:17:25 +00:00
BobLd
a8559c1167 Add basic bookmarks extraction capabilities. 2019-11-04 15:11:54 +00:00
Eliot Jones
ef6d509f44 Merge pull request #80 from BobLd/master
Enhancing NearestNeighbourWordExtractor
2019-11-04 09:56:21 +00:00
BobLd
99f260befb Enhancing NearestNeighbourWordExtractor
- Making the code easier to read
- Using 20% of Width instead of 60%
- Making DefaultWordExtractor public
2019-10-21 20:51:27 +01:00
Eliot Jones
0e39c88008 Merge pull request #77 from BobLd/master
AltoDocument: make all `xxxSpecified` setters public to allow `Deserialize`
2019-10-20 12:58:01 +01:00
BobLd
0b2a0f4bc7 AltoDocument: make all xxxSpecified setter public to allow Deserialize. 2019-10-20 12:25:34 +01:00
Eliot Jones
80fc404b10 #47 improve performance by caching truetype bounding boxes
also uses less reflection when parsing the page content stream
2019-10-18 15:56:28 +01:00
Eliot Jones
84990722ca #76 add infinite loop protection for brute force search
also treats 'm' or 'j' in endstream/endobj as a valid object number start character
2019-10-17 16:50:01 +01:00
Eliot Jones
efe7896824 #75 support vertical writing mode fonts 2019-10-17 15:57:04 +01:00
Eliot Jones
a2147902a0 merge pull request #72 from uglytoad/fix-export-formatting
fix export formatting
2019-10-17 11:28:06 +01:00
Eliot Jones
09b26c43e0 #74 add intersectswith method to rectangle 2019-10-17 11:21:49 +01:00
Eliot Jones
57dfee3211 move alto xml exporter to root export namespace 2019-10-17 10:46:43 +01:00
Eliot Jones
3f1321141a #73 process xobject form content when extracting text and images 2019-10-16 14:59:16 +01:00
Eliot Jones
6174877892 #71 ignore malformed dates in true type header table. fix reading of dates from bytes 2019-10-16 10:51:02 +01:00
Eliot Jones
f14c52a05a fix tests for renaming and validating generate alto xml 2019-10-15 13:59:09 +01:00
Eliot Jones
d68bd88824 format and tidy up alto export autogenerated code. tidy up docstrum 2019-10-14 18:30:18 +01:00
Eliot Jones
e2c9db8d50 merge pull request #69 from BobLd/master
Support for hOCR, Alto Xml and Page Xml output formats
2019-10-14 10:10:47 +01:00
BobLd
e76badaeaf Update PublicApiScannerTests with new public classes 2019-10-11 08:57:16 +01:00
BobLd
e9b3db7102 Make ITextExporter implementations public 2019-10-11 08:55:03 +01:00
BobLd
f886411e12 Merge https://github.com/UglyToad/PdfPig 2019-10-10 16:52:45 +01:00
Eliot Jones
dec4c31a33 fix bug where cross reference stream subsections were skipped
a single cross-reference stream may contain multiple disjoint runs of object numbers; previously we only took the first, now we load all objects.

adds indexer to array token for ease-of-use.

adds page number and bounds information to all form fields.
2019-10-10 16:05:21 +01:00
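For context on the cross-reference fix above: in the PDF specification, a cross-reference stream can carry an `/Index` array of (first object number, entry count) pairs, one pair per subsection, defaulting to `[0 Size]` when absent. The commit message does not name `/Index`, so treating it as the source of the "disjoint runs" is an assumption. The sketch below is illustrative Python with hypothetical function names, not PdfPig's implementation.

```python
from typing import Iterator, List, Optional, Tuple


def iter_subsections(index: Optional[List[int]], size: int) -> Iterator[Tuple[int, int]]:
    """Yield (first_object_number, count) for every subsection.

    `index` stands in for the stream's /Index array; when it is missing
    the single default subsection [0, size] is used.
    """
    if index is None:
        index = [0, size]
    if len(index) % 2 != 0:
        raise ValueError("/Index must contain pairs of integers")
    # Walk every pair, not just the first one.
    for i in range(0, len(index), 2):
        yield index[i], index[i + 1]


def object_numbers(index: Optional[List[int]], size: int) -> List[int]:
    """Expand all subsections into the full list of object numbers."""
    numbers: List[int] = []
    for first, count in iter_subsections(index, size):
        numbers.extend(range(first, first + count))
    return numbers
```

Reading only the first pair of `[0, 3, 10, 2]` would miss objects 10 and 11, which matches the kind of skipped-subsection bug the commit describes.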
BobLd
a15f56a6ac Better handling of UTF8 in XmlWriter 2019-10-10 14:14:05 +01:00
BobLd
fe1a3c4b8b updated from comments
- still need to look at XmlWriter
2019-10-10 12:29:28 +01:00
Eliot Jones
2ef45f71d5 make missing acroform types public and start improving data
also changes pages to use a proper tree structure since this will be required for resource inheritance and for acroform widget dictionaries.
2019-10-09 14:28:37 +01:00
Eliot Jones
81ab414c56 add is supported flag to filters and add missing doc comment 2019-10-08 15:53:42 +01:00
BobLd
bf09aee99c Adding images regions 2019-10-08 15:29:18 +01:00
BobLd
9ab943e1f9 Merge branch 'master' of https://github.com/UglyToad/PdfPig 2019-10-08 14:16:59 +01:00
Eliot Jones
77f968b6ea merge pull request #70 from uglytoad/add-images
#55 move support for images to page and add inline images
2019-10-08 14:11:19 +01:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
BobLd
eb5400e01b Correct PageXmlTextExporter's Height and Width 2019-10-08 12:00:04 +01:00
BobLd
d939be1b9c update PublicApiScannerTests 2 2019-10-07 16:09:30 +01:00
BobLd
f4f2b0e3fd update PublicApiScannerTests 2019-10-07 16:02:11 +01:00
BobLd
93313118e9 Support for hOCR, AltoXml and PageXml output formats
Tested with:
- 'hocrjs' for hOCR (see https://unpkg.com/hocrjs)
- 'PAGE Viewer' for hOCR, AltoXml and PageXml (see http://www.primaresearch.org/tools/PAGEViewer)
2019-10-07 15:19:30 +01:00
Eliot Jones
c3da10055b Merge pull request #68 from BobLd/master
Improve PdfPath
2019-10-07 11:46:11 +01:00
BobLd
1c3519fd51 Update PdfPath.cs
Need to account for the case where a `Close` command is called but the first and last commands are not connected.
2019-10-06 12:47:12 +01:00
BobLd
1975db4752 correct typo 2019-10-04 14:50:22 +01:00