21 Commits

Author SHA1 Message Date
Eliot Jones
d68bd88824 format and tidy up alto export autogenerated code. tidy up docstrum 2019-10-14 18:30:18 +01:00
BobLd
93313118e9 Support for hOCR, AltoXml and PageXml output formats
Tested with:
- 'hocrjs' for hOCR (see https://unpkg.com/hocrjs)
- 'PAGE Viewer' for hOCR, AltoXml and PageXml (see http://www.primaresearch.org/tools/PAGEViewer)
2019-10-07 15:19:30 +01:00
BobLd
d36dee0e25 Adding handling when pageWords count = 0 for IPageSegmenters 2019-09-04 22:14:08 +01:00
BobLd
68e04603c0 Fix error in DocstrumBB 2019-09-02 19:07:27 +01:00
BobLd
afa2b7baa1 Improve ClusteringAlgorithms.GroupIndexes()
Add Equals() to PdfLine
2019-08-14 19:58:31 +01:00
BobLd
9f13739add correcting typo 2019-08-11 13:54:47 +01:00
BobLd
7e8b3bdc85 Update DocstrumBB to account for middle point of the overlapping area distance. For this, using distance between 2 lines. 2019-08-11 13:45:08 +01:00
BobLd
eb9a9fd00e Document Layout Analysis - IPageSegmenter, Docstrum
- Create a TextBlock class
- Create IPageSegmenter
- Add other useful distances: angle, etc.
- Update RecursiveXYCut
 - With IPageSegmenter and TextBlock
 - Make XYNode and XYLeaf internal
- Optimise (faster) NearestNeighbourWordExtractor and isolate the clustering algorithms for use outside of this class
- Implement a Docstrum inspired page segmentation algorithm
2019-08-10 16:01:27 +01:00
BobLd
5399456919 Making the RecursiveXYCut class static. 2019-08-09 18:50:20 +01:00
Eliot Jones
31e15ea097 remove unused docstrum class 2019-08-08 21:21:27 +01:00
BobLd
801ea3ba7f Modified PublicApiScannerTests 2019-08-07 14:22:39 +01:00
BobLd
7de6de3780 Updating with comments 2019-08-07 13:50:07 +01:00
BobLd
e19b03035e Updating with comments 2019-08-07 13:49:05 +01:00
BobLd
85d5bb7c7e Adding enum EdgeType 2019-08-07 13:45:57 +01:00
BobLd
9694b1f8e8 Update TextEdgesExtractor.cs 2019-08-06 15:27:16 +01:00
BobLd
83889cfb52 Document Layout Analysis - Text edges extractor
Text edges are where words have either their BoundingBox's left, right or mid coordinate aligned on the same vertical line.
Useful to detect tables, justified text, lists, etc.
2019-08-06 15:24:16 +01:00
Eliot Jones
f86c2545bd treat encryption entries as optional for revisions 5+ #34
the revision 5 and 6 encryption algorithms specify the presence of additional encryption material named 'oe' and 'ue'. it turns out this is not always required so will now default to null if not present. this also adds support for those values being in hex rather than normal string format.

tidies up some commenting on the xynode class, moves public methods below constructors and adds xy to the resharper list of abbreviations for the solution.
2019-06-23 13:52:12 +01:00
BobLd
00233fa5d0 Update with corrections - 2 2019-06-20 22:10:05 +01:00
BobLd
f8d0883da5 Update with corrections 2019-06-18 20:48:49 +01:00
BobLd
2525cd243f Typo correction 2019-06-16 14:03:12 +01:00
BobLd
a0c864e8af Adding Document Layout Analysis:
- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm, useful for multi-column pdf documents
2019-06-16 13:57:30 +01:00