576 Commits

Author SHA1 Message Date
Eliot Jones
6174877892 #71 ignore malformed dates in true type header table. fix reading of dates from bytes 2019-10-16 10:51:02 +01:00
Eliot Jones
f14c52a05a fix tests for renaming and validating generate alto xml 2019-10-15 13:59:09 +01:00
Eliot Jones
d68bd88824 format and tidy up alto export autogenerated code. tidy up docstrum 2019-10-14 18:30:18 +01:00
Eliot Jones
e2c9db8d50 merge pull request #69 from BobLd/master
Support for hOCR, Alto Xml and Page Xml output formats
2019-10-14 10:10:47 +01:00
BobLd
e76badaeaf Update PublicApiScannerTests with new public classes 2019-10-11 08:57:16 +01:00
BobLd
e9b3db7102 Make ITextExporter implementations public 2019-10-11 08:55:03 +01:00
BobLd
f886411e12 Merge https://github.com/UglyToad/PdfPig 2019-10-10 16:52:45 +01:00
Eliot Jones
dec4c31a33 fix bug where cross reference stream subsections were skipped
a single cross-reference stream may contain multiple disjoint runs of object numbers. previously we only took the first; now we load all objects.

adds indexer to array token for ease-of-use.

adds page number and bounds information to all form fields.
2019-10-10 16:05:21 +01:00
BobLd
a15f56a6ac Better handling of UTF8 in XmlWriter 2019-10-10 14:14:05 +01:00
BobLd
fe1a3c4b8b updated from comments
- still need to look at XmlWriter
2019-10-10 12:29:28 +01:00
Eliot Jones
2ef45f71d5 make missing acroform types public and start improving data
also changes pages to use a proper tree structure since this will be required for resource inheritance and for acroform widget dictionaries.
2019-10-09 14:28:37 +01:00
Eliot Jones
81ab414c56 add is supported flag to filters and add missing doc comment 2019-10-08 15:53:42 +01:00
BobLd
bf09aee99c Adding images regions 2019-10-08 15:29:18 +01:00
BobLd
9ab943e1f9 Merge branch 'master' of https://github.com/UglyToad/PdfPig 2019-10-08 14:16:59 +01:00
Eliot Jones
77f968b6ea merge pull request #70 from uglytoad/add-images
#55 move support for images to page and add inline images
2019-10-08 14:11:19 +01:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
BobLd
eb5400e01b Correct PageXmlTextExporter's Height and Width 2019-10-08 12:00:04 +01:00
BobLd
d939be1b9c update PublicApiScannerTests 2 2019-10-07 16:09:30 +01:00
BobLd
f4f2b0e3fd update PublicApiScannerTests 2019-10-07 16:02:11 +01:00
BobLd
93313118e9 Support for hOCR, AltoXml and PageXml output formats
Tested with:
- 'hocrjs' for hOCR (see https://unpkg.com/hocrjs)
- 'PAGE Viewer' for hOCR, AltoXml and PageXml (see http://www.primaresearch.org/tools/PAGEViewer)
2019-10-07 15:19:30 +01:00
Eliot Jones
c3da10055b Merge pull request #68 from BobLd/master
Improve PdfPath
2019-10-07 11:46:11 +01:00
BobLd
1c3519fd51 Update PdfPath.cs
Need to account for the case where a `Close` command is called but the first and last commands are not connected.
2019-10-06 12:47:12 +01:00
BobLd
1975db4752 correct typo 2019-10-04 14:50:22 +01:00
BobLd
5d3e4cd4e1 Improve PdfPath
- Determine if Closed path
- Determine if Clockwise or CounterClockwise
- Add Centroid
2019-10-04 14:37:41 +01:00
Eliot Jones
e02e130947 #57 add creation and modified date to document information
this enables users to check if xmp metadata is outdated
2019-10-03 12:56:48 +01:00
Eliot Jones
38b6f8e812 add current geometry path to page content when it is not explicitly closed #66 2019-09-11 15:38:57 +01:00
Eliot Jones
f822ad48ea merge pull request #67 from BobLd/master
Fix error in DocstrumBB
2019-09-11 12:30:22 +01:00
BobLd
d36dee0e25 Adding handling when pageWords count = 0 for IPageSegmenters 2019-09-04 22:14:08 +01:00
BobLd
68e04603c0 Fix error in DocstrumBB 2019-09-02 19:07:27 +01:00
Eliot Jones
d089a34aa4 lazily evaluate page text and remove linq from word constructor 2019-08-25 15:06:37 +01:00
Eliot Jones
0cd7795bff add method to get all pages from document 2019-08-23 19:09:33 +01:00
Eliot Jones
3fbfc1130e lazily evaluate centroid of rectangle 2019-08-20 23:03:27 +01:00
Eliot Jones
6878d9a82d #64 use decimal values directly rather than from array for transformation matrix 2019-08-20 22:51:00 +01:00
Eliot Jones
613af46472 #62 use byte array instance rather than interface for input bytes 2019-08-20 21:37:31 +01:00
Eliot Jones
bbe5409f94 #62 use length value of stream directly to read the full stream once 2019-08-20 21:08:06 +01:00
Eliot Jones
e0a32a701b #63 make cache of parsed system fonts static and read the whole file up-front rather than using a filestream 2019-08-19 20:09:07 +01:00
Eliot Jones
0fa3b27ad3 #47 improve flate filter performance by streaming all data in single operation
also improves page constructor performance by removing linq and invoking stringbuilder directly. removes page rotation overhead by skipping multiplication for non-rotated pages and using cached transformation matrices for rotations. removes linq from filter provider and shares instances of filter types.
2019-08-19 19:48:02 +01:00
Eliot Jones
11b244eda1 remove thread-unsafe stringbuilder access from adobe font metrics parser
this also hoists the char arrays used for string splits since these will be allocated per call if declared inline
2019-08-18 14:10:38 +01:00
Eliot Jones
d98b8b43c1 small performance tweaks and remove package license expression
package license url is deprecated in favour of package license expression but nuget doesn't seem to support expressions properly for published packages yet so we'll keep the deprecated url for the time being. having both url and expression causes the build to fail.

small obvious performance improvements for file header parsing and getting the encoding information using the existing reverse name to code map.
2019-08-18 13:47:01 +01:00
Eliot Jones
3ff8637bb0 keep license url in the nuget info even though it is deprecated 2019-08-18 11:59:02 +01:00
Eliot Jones
4548ae934b Merge pull request #61 from vadik299/master
Adding TextSequence number to each letter to determine if letters belong to the same Tj operation
2019-08-17 12:59:46 +01:00
Eliot Jones
8c100efe04 Merge pull request #60 from BobLd/master
Improve ClusteringAlgorithms.GroupIndexes() and add Equals() to PdfLine
2019-08-17 12:58:06 +01:00
vadik299
cc767b8cd6 Merge branch 'master' into master 2019-08-16 18:34:57 -04:00
BobLd
afa2b7baa1 Improve ClusteringAlgorithms.GroupIndexes()
Add Equals() to PdfLine
2019-08-14 19:58:31 +01:00
Eliot Jones
ac62b7247b version 0.0.9 2019-08-13 21:24:54 +01:00
Eliot Jones
f5e025aa70 merge pull request #58 from uglytoad/colors
adds colors to letters and prepares code to add colors to paths.
2019-08-13 20:50:06 +01:00
Eliot Jones
f55091f3d2 make color types public and add stream based tests to prevent future breaking as observed in #52 2019-08-13 20:48:22 +01:00
Vasya
22278f64c4 Added TextSequence 2019-08-11 14:55:59 -04:00
Eliot Jones
980e67fabe Merge pull request #56 from BobLd/master
Document Layout Analysis - IPageSegmenter, Docstrum
2019-08-11 14:04:39 +01:00
BobLd
9f13739add correcting typo 2019-08-11 13:54:47 +01:00