Commit Graph

48 Commits

Author SHA1 Message Date
BobLd
99f260befb Enhancing NearestNeighbourWordExtractor
- Making the code easier to read
- Using 20% of Width instead of 60%
- Making DefaultWordExtractor public
2019-10-21 20:51:27 +01:00
Eliot Jones
57dfee3211 move alto xml exporter to root export namespace 2019-10-17 10:46:43 +01:00
Eliot Jones
f14c52a05a fix tests for renaming and validating generate alto xml 2019-10-15 13:59:09 +01:00
BobLd
e76badaeaf Update PublicApiScannerTests with new public classes 2019-10-11 08:57:16 +01:00
BobLd
fe1a3c4b8b updated from comments
- still need to look at XmlWriter
2019-10-10 12:29:28 +01:00
Eliot Jones
2ef45f71d5 make missing acroform types public and start improving data
also changes pages to use a proper tree structure since this will be required for resource inheritance and for acroform widget dictionaries.
2019-10-09 14:28:37 +01:00
BobLd
9ab943e1f9 Merge branch 'master' of https://github.com/UglyToad/PdfPig 2019-10-08 14:16:59 +01:00
Eliot Jones
68bcaf3901 #55 move support for images to page and add inline images
support both xobject and inline images. adds unsupported filters so that exceptions are only thrown when accessing lazily evaluated image.bytes property rather than when opening the page.

treat all warnings as errors.
2019-10-08 14:04:36 +01:00
BobLd
d939be1b9c update PublicApiScannerTests 2 2019-10-07 16:09:30 +01:00
BobLd
f4f2b0e3fd update PublicApiScannerTests 2019-10-07 16:02:11 +01:00
BobLd
93313118e9 Support for hOCR, AltoXml and PageXml output formats
Tested with:
- 'hocrjs' for hOCR (see https://unpkg.com/hocrjs)
- 'PAGE Viewer' for hOCR, AltoXml and PageXml (see http://www.primaresearch.org/tools/PAGEViewer)
2019-10-07 15:19:30 +01:00
Eliot Jones
f5e025aa70 merge pull request #58 from uglytoad/colors
adds colors to letters and prepares code to add colors to paths.
2019-08-13 20:50:06 +01:00
Eliot Jones
f55091f3d2 make color types public and add stream based tests to prevent future breaking as observed in #52 2019-08-13 20:48:22 +01:00
Eliot Jones
980e67fabe Merge pull request #56 from BobLd/master
Document Layout Analysis - IPageSegmenter, Docstrum
2019-08-11 14:04:39 +01:00
Eliot Jones
0349bedd3e #57 add access to document metadata and expose wrapper type 2019-08-11 12:42:30 +01:00
BobLd
c14d77e414 PublicApiScannerTests updated 2019-08-10 16:36:50 +01:00
BobLd
eb9a9fd00e Document Layout Analysis - IPageSegmenter, Docstrum
- Create a TextBlock class
- Creates IPageSegmenter
- Add other useful distances: angle, etc.
- Update RecursiveXYCut
 - With IPageSegmenter and TextBlock
 - Make XYNode and XYLeaf internal
- Optimise (faster) NearestNeighbourWordExtractor and isolate the clustering algorithms for use outside of this class
- Implement a Docstrum inspired page segmentation algorithm
2019-08-10 16:01:27 +01:00
BobLd
801ea3ba7f Modified PublicApiScannerTests 2019-08-07 14:22:39 +01:00
BobLd
83889cfb52 Document Layout Analysis - Text edges extractor
Text edges are where words have their BoundingBox's left, right or mid coordinate aligned on the same vertical line.
Useful to detect tables, justified text, lists, etc.
2019-08-06 15:24:16 +01:00
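The text-edge idea in commit 83889cfb52 can be illustrated with a small sketch. This is not PdfPig's C# implementation: the `Box` type, the `find_text_edges` function, and the `min_words` and `tolerance` parameters are hypothetical names chosen for illustration. The sketch assumes that "aligned" means the x-coordinates agree after snapping to a grid of size `tolerance`.

```python
from collections import defaultdict
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned word bounding box."""
    left: float
    bottom: float
    right: float
    top: float


def find_text_edges(boxes, min_words=3, tolerance=1.0):
    """Group boxes whose left, mid or right x-coordinate shares a vertical line.

    Coordinates are snapped to a grid of size `tolerance` (an assumption of
    this sketch); a group counts as an edge once it has `min_words` boxes.
    """
    buckets = {
        "left": defaultdict(list),
        "mid": defaultdict(list),
        "right": defaultdict(list),
    }
    for box in boxes:
        coords = {
            "left": box.left,
            "mid": (box.left + box.right) / 2,
            "right": box.right,
        }
        for kind, x in coords.items():
            buckets[kind][round(x / tolerance)].append(box)

    # Keep only the vertical lines shared by enough words.
    return {
        kind: [group for group in groups.values() if len(group) >= min_words]
        for kind, groups in buckets.items()
    }


words = [Box(10, 0, 30, 10), Box(10, 20, 45, 30), Box(10, 40, 52, 50), Box(70, 0, 90, 10)]
edges = find_text_edges(words)
print(len(edges["left"]), len(edges["mid"]), len(edges["right"]))
```

Three of the four sample words share a left x-coordinate, so only a single left edge is reported. A column of such edges is the kind of signal the commit message describes as useful for detecting tables, justified text and lists.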
Eliot Jones
0b9ae1db13 add color information to the operation context. create color classes for letters and paths to use 2019-08-04 16:47:47 +01:00
Eliot Jones
1d551d6de3 add and document core classes for colorspace information 2019-08-04 12:57:06 +01:00
Eliot Jones
364bd25fa8 #48 add handling of inline image data to pdf content parsing
an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.
2019-08-03 15:42:19 +01:00
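The BI / ID / EI layout described in commit 364bd25fa8 can be sketched as follows. This is an illustration only, not the parser PdfPig uses: `split_inline_image` is a hypothetical helper, and it assumes the first whitespace-delimited `EI` after `ID` ends the image. Real image bytes can contain that sequence by chance, which is one reason the commit needed special-case handling.

```python
import re

# PDF whitespace bytes: NUL, tab, line feed, form feed, carriage return, space.
_WS = rb"[\x00\t\n\x0c\r ]"

# BI <image dictionary entries> ID <one whitespace byte> <raw data> <whitespace> EI
_INLINE_IMAGE = re.compile(
    rb"\bBI\b(?P<header>.*?)\bID" + _WS
    + rb"(?P<data>.*?)" + _WS + rb"EI(?=" + _WS + rb"|$)",
    re.DOTALL,
)


def split_inline_image(stream: bytes):
    """Return (header, data) for the first inline image in `stream`, or None.

    `header` holds the key/value entries between BI and ID; `data` holds the
    raw bytes between ID and EI, which must not be tokenized as operators.
    """
    match = _INLINE_IMAGE.search(stream)
    if match is None:
        return None
    return match.group("header").strip(), match.group("data")


content = b"q BI /W 2 /H 1 /BPC 8 /CS /G ID \x00\xff EI Q"
print(split_inline_image(content))
```

The point of the sketch is the split itself: everything between `ID` and `EI` is handed back as opaque bytes instead of being fed to the normal token parser.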
Eliot Jones
453faf50af start adding colorspace path operations to the operation context 2019-07-10 21:31:23 +01:00
Eliot Jones
283e1d38fa use azure pipelines instead of appveyor for builds
* trial azure pipelines

[skip ci]

* use vs2017

* build pr commits

* include codecov and update test nuget

* add codecov call

* add publish test results step

* include coverlet package for test coverage and allow coverlet dynamic public types

* add azure pipelines badge and remove appveyor badge

* add nuget pack step

* use build configuration variable for nuget pack and move after build

* fix path to package to pack

* change nuget to dotnet pack

* remove old codecov related tools
2019-07-09 21:21:11 +01:00
Eliot Jones
c495065178 support gs operator, fix systemfonts, apply rotation to glyphs
- begin adding support for extended graphics state (the 'gs' operator) including setting the font #39.
- apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41.
- wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42.
- fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.
2019-07-06 14:03:23 +01:00
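The third point of commit c495065178, wrapping page rotation in a value type restricted to right angles, can be sketched like this. The class below is hypothetical and written in Python rather than the project's C#. The normalisation to the range 0 to 270 and the `swaps_height_and_width` member are assumptions of this sketch, not confirmed members of the real type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRotation:
    """Hypothetical immutable rotation that only accepts multiples of 90 degrees."""

    degrees: int

    def __post_init__(self):
        if self.degrees % 90 != 0:
            raise ValueError(f"rotation must be a multiple of 90, got {self.degrees}")
        # Normalise to 0, 90, 180 or 270 (an assumption of this sketch).
        object.__setattr__(self, "degrees", self.degrees % 360)

    @property
    def swaps_height_and_width(self) -> bool:
        """True when the rotation turns a portrait page into a landscape one."""
        return self.degrees in (90, 270)


print(PageRotation(-90).degrees, PageRotation(90).swaps_height_and_width)
```

Validating once in the constructor means code that later applies the rotation to glyph bounding boxes does not need to re-check the value.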
BobLd
080354dc54 Corrected PublicApiScannerTests 2019-06-18 21:32:14 +01:00
BobLd
f8d0883da5 Update with corrections 2019-06-18 20:48:49 +01:00
BobLd
4416793f6d Corrected PublicApiScannerTests 2019-06-16 19:19:44 +01:00
BobLd
a0c864e8af Adding Document Layout Analysis:
- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm, useful for multi-column pdf documents
2019-06-16 13:57:30 +01:00
BobLd
f4ec425bf0 - Correction of the PdfLine's length formula;
- Moving Line to TextLine
2019-05-15 19:44:47 +01:00
BobLd
de421d65a1 Adding Line, PdfLine 2019-05-12 19:39:58 +01:00
Eliot Jones
23c033c788 implement validation of owner password and throw more descriptive exception for encrypted documents 2019-05-09 19:02:39 +01:00
BobLd
70852c2855 - Adding a TextDirection enum.
- In the Letter class:
     - Renaming 'Location' to 'StartBaseLine' and adding 'EndBaseLine' for better localisation of the letter ('Location' is also kept).
     - Adding TextDirection.
- Fixed Test
2019-04-20 10:52:15 +01:00
Eliot Jones
cdf5546a1b #24 add the missing operations for the graphics state 2019-01-06 15:47:33 +00:00
Eliot Jones
406f0e6184 #24 add missing marked content operators, still 2 to go 2019-01-06 11:34:25 +00:00
Eliot Jones
4e37222729 #24 fix tests for public types 2019-01-05 15:17:24 +00:00
Eliot Jones
f1621b3924 #24 make some fields public 2019-01-05 15:13:32 +00:00
Eliot Jones
cd84edbdc8 #26 add missing operation and expose the content stream directly to the user through the page 2019-01-04 19:54:55 +00:00
Eliot Jones
2a30631ab7 #26 make all operation classes public and test 2019-01-04 18:55:36 +00:00
Eliot Jones
5c8a77bf33 #26 make almost all operators public 2019-01-03 22:20:53 +00:00
Eliot Jones
d9052e1388 update readme and document public api for document creation 2018-12-28 16:55:46 +00:00
Eliot Jones
d8b5f00fa0 #7 improve api documentation and make font descriptor public 2018-12-22 17:58:07 +00:00
Eliot Jones
d572af8a52 finish first pass of annotation api 2018-12-22 15:54:32 +00:00
Eliot Jones
997979cc92 #11 early access to the raw xobjects for images.
temporary 'safe' untested implementation of seac for type 1 charstrings.
make structure public
bump version of package and project to 0.0.3 (it had accidentally increased to 0.0.5)
2018-11-26 19:46:41 +00:00
Eliot Jones
17909f8565 #15 add classes to extract words and initial tests 2018-11-24 20:51:27 +00:00
Eliot Jones
2fa781b8e9 #10 make all token classes public and expose via a public structure member on pdf document 2018-11-24 19:02:06 +00:00
modest-as
564e32e072 Return bounding boxes for letters 2018-03-30 23:16:54 +01:00
Eliot Jones
ec62542b64 change the project name to something silly 2018-01-10 19:49:32 +00:00