Updated Document Layout Analysis (markdown)

BobLd 2020-01-20 10:05:26 +00:00
parent 2ff56344cc
commit 6be9a382a3

@ -18,7 +18,18 @@ TO DO
## [Nearest Neighbour method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/NearestNeighbourWordExtractor.cs)
### Description
TO DO
The nearest neighbour word extractor is useful for extracting words from PDF documents with complex layouts.
It seeks to connect each glyph bounding box's `EndBaseLine` point with the closest glyph bounding box's `StartBaseLine` point.
In order to decide whether two glyphs are _close enough_ to each other, the algorithm uses the larger of the two candidates' `Width` values as a reference distance.
- For glyphs with known text direction, the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) is used and the threshold is set to 20% of the width.
- For glyphs with unknown text direction, the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is used and the threshold is set to 50% of the width.
If the measured distance between the two glyphs is below this threshold, they are deemed to be connected.
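
The threshold test described above can be sketched as follows. This is a minimal illustration, not PdfPig's actual implementation: the `GlyphProximity` class, the `AreConnected` method and its parameters are hypothetical names introduced here, and points are plain tuples rather than PdfPig types.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the PdfPig API.
public static class GlyphProximity
{
    public static bool AreConnected(
        (double X, double Y) endBaseLine,    // EndBaseLine of the first glyph
        (double X, double Y) startBaseLine,  // StartBaseLine of the candidate glyph
        double firstWidth,
        double secondWidth,
        bool knownTextDirection)
    {
        // Reference distance: the larger of the two glyph widths.
        double reference = Math.Max(firstWidth, secondWidth);

        double dx = endBaseLine.X - startBaseLine.X;
        double dy = endBaseLine.Y - startBaseLine.Y;

        // Manhattan distance for known text direction, Euclidean otherwise.
        double distance = knownTextDirection
            ? Math.Abs(dx) + Math.Abs(dy)
            : Math.Sqrt(dx * dx + dy * dy);

        // 20% of the width for known direction, 50% otherwise.
        double threshold = (knownTextDirection ? 0.2 : 0.5) * reference;

        return distance < threshold;
    }
}
```

For example, with widths 10 and 8 and a known text direction, the threshold is 2, so two points 1 unit apart are connected while two points 3 units apart are not.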
Once glyphs are connected, they are grouped to form words via a [depth-first search algorithm](https://en.wikipedia.org/wiki/Depth-first_search). The ordering of the glyphs within each word is also handled by the algorithm.
It seems that both [left-to-right and right-to-left](https://en.wikipedia.org/wiki/Right-to-left) scripts have their glyph `StartBaseLine` on the left and `EndBaseLine` on the right.
So for right-to-left scripts, the word's glyph ordering should be in reverse order.
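
The grouping step can be illustrated with a generic depth-first search over glyph indices. This is a sketch under stated assumptions rather than the extractor's real code: `GlyphGrouping`, `GroupConnected` and the `(From, To)` link tuples are hypothetical names, glyphs are represented only by their index, and the sketch does not attempt the glyph ordering or the right-to-left reversal mentioned above.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the PdfPig API.
public static class GlyphGrouping
{
    // Groups glyph indices [0, count) into connected components,
    // given the links found by the nearest neighbour step.
    public static List<List<int>> GroupConnected(
        int count, IEnumerable<(int From, int To)> links)
    {
        // Build an undirected adjacency list from the links.
        var adjacency = new List<int>[count];
        for (int i = 0; i < count; i++)
        {
            adjacency[i] = new List<int>();
        }
        foreach (var (from, to) in links)
        {
            adjacency[from].Add(to);
            adjacency[to].Add(from);
        }

        var visited = new bool[count];
        var groups = new List<List<int>>();

        for (int start = 0; start < count; start++)
        {
            if (visited[start]) continue;

            // Iterative depth-first search using an explicit stack.
            var group = new List<int>();
            var stack = new Stack<int>();
            stack.Push(start);
            visited[start] = true;

            while (stack.Count > 0)
            {
                int current = stack.Pop();
                group.Add(current);
                foreach (int next in adjacency[current])
                {
                    if (!visited[next])
                    {
                        visited[next] = true;
                        stack.Push(next);
                    }
                }
            }

            groups.Add(group);
        }

        return groups;
    }
}
```

Each returned group corresponds to one candidate word; a glyph with no links ends up in a group of its own.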
### Usage
```csharp
@ -39,7 +50,10 @@ using (var document = PdfDocument.Open(@"document.pdf"))
```
### Results
TO DO
The algorithm was run on this [map](https://upload.wikimedia.org/wikipedia/commons/6/64/APISmap1.pdf), which has a complex layout with glyphs and words in widely varying text directions.
The algorithm is able to rebuild words regardless of their direction.
TO DO: Image
# Page segmenters
Page segmenters deal with the task of finding blocks of text in a page. Three different methods are currently available: