mirror of
https://github.com/UglyToad/PdfPig.git
synced 2025-08-20 09:21:57 +08:00
Updated Document Layout Analysis (markdown)
parent
2ff56344cc
commit
6be9a382a3
@ -18,7 +18,18 @@ TO DO
## [Nearest Neighbour method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/NearestNeighbourWordExtractor.cs)
### Description
TO DO
The nearest neighbour word extractor is useful for extracting words from PDF documents with complex layouts.
It seeks to connect each glyph bounding box's `EndBaseLine` point with the closest glyph bounding box's `StartBaseLine` point.
To decide whether two glyphs are _close enough_ to each other, the algorithm uses the larger of the two candidates' widths as a reference distance.
- For glyphs with a known text direction, the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) is used and the threshold is set to 20% of that width.
- For glyphs with an unknown text direction, the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is used and the threshold is set to 50% of that width.

If the measured distance between the two glyphs is below this threshold, they are deemed to be connected.
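
The threshold rule described above can be sketched roughly as follows. This is only an illustration, not the library's implementation: `BaselinePoint`, `GlyphConnection`, `AreConnected` and the single `directionKnown` flag are hypothetical names introduced for this example, and the 20% / 50% factors are taken from the description above.

```csharp
using System;

// Hypothetical point type used only for this sketch (not a PdfPig class).
public readonly struct BaselinePoint
{
    public BaselinePoint(double x, double y) { X = x; Y = y; }
    public double X { get; }
    public double Y { get; }
}

// Hypothetical helper illustrating the "close enough" test.
public static class GlyphConnection
{
    // Manhattan (taxicab) distance: sum of absolute coordinate differences.
    public static double Manhattan(BaselinePoint a, BaselinePoint b)
        => Math.Abs(a.X - b.X) + Math.Abs(a.Y - b.Y);

    // Euclidean (straight-line) distance.
    public static double Euclidean(BaselinePoint a, BaselinePoint b)
    {
        double dx = a.X - b.X;
        double dy = a.Y - b.Y;
        return Math.Sqrt(dx * dx + dy * dy);
    }

    // Returns true when the distance between one glyph's EndBaseLine and the
    // candidate's StartBaseLine is below the threshold.
    // Assumption: one flag says whether the text direction is known.
    public static bool AreConnected(
        BaselinePoint endBaseLine, double width,
        BaselinePoint candidateStartBaseLine, double candidateWidth,
        bool directionKnown)
    {
        // Reference distance: the larger of the two glyph widths.
        double reference = Math.Max(width, candidateWidth);

        double distance = directionKnown
            ? Manhattan(endBaseLine, candidateStartBaseLine)
            : Euclidean(endBaseLine, candidateStartBaseLine);

        // 20% of the reference width when the direction is known, 50% otherwise.
        double threshold = (directionKnown ? 0.2 : 0.5) * reference;

        return distance < threshold;
    }
}
```

For instance, with widths 10 and 8 and a known direction, the threshold in this sketch is 2, so an `EndBaseLine` at (10, 0) and a `StartBaseLine` at (11.5, 0) would be treated as connected, while a `StartBaseLine` at (13, 0) would not.
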

Once glyphs are connected, they are grouped into words via a [depth-first search algorithm](https://en.wikipedia.org/wiki/Depth-first_search). The algorithm also orders the glyphs within each word.
It seems that both [left-to-right and right-to-left](https://en.wikipedia.org/wiki/Right-to-left) scripts have their glyph `StartBaseLine` on the left and `EndBaseLine` on the right.
So for right-to-left scripts, the word's glyph ordering should be reversed.
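
As a rough illustration of the grouping step, the sketch below collects connected glyph indices into groups with an iterative depth-first search. It is a minimal sketch under stated assumptions rather than the extractor's actual code: `GlyphGrouping`, `GroupConnected` and the representation of connections as index pairs are hypothetical, and the order of indices inside a group is simply the DFS visit order, not necessarily the reading order.

```csharp
using System.Collections.Generic;

// Hypothetical helper: groups glyph indices that are linked to each other.
public static class GlyphGrouping
{
    // glyphCount: total number of glyphs.
    // links: pairs of glyph indices that were deemed connected.
    public static List<List<int>> GroupConnected(
        int glyphCount, IEnumerable<(int A, int B)> links)
    {
        // Build an undirected adjacency list from the links.
        var adjacency = new List<int>[glyphCount];
        for (int i = 0; i < glyphCount; i++)
        {
            adjacency[i] = new List<int>();
        }
        foreach (var (a, b) in links)
        {
            adjacency[a].Add(b);
            adjacency[b].Add(a);
        }

        var visited = new bool[glyphCount];
        var groups = new List<List<int>>();

        for (int start = 0; start < glyphCount; start++)
        {
            if (visited[start]) continue;

            // Depth-first search from an unvisited glyph, using an explicit stack.
            var group = new List<int>();
            var stack = new Stack<int>();
            stack.Push(start);
            visited[start] = true;

            while (stack.Count > 0)
            {
                int current = stack.Pop();
                group.Add(current);
                foreach (int neighbour in adjacency[current])
                {
                    if (!visited[neighbour])
                    {
                        visited[neighbour] = true;
                        stack.Push(neighbour);
                    }
                }
            }

            groups.Add(group);
        }

        return groups;
    }
}
```

With five glyphs and the links (0, 1), (1, 2) and (3, 4), this sketch yields two groups, one for glyphs 0 to 2 and one for glyphs 3 and 4. For a right-to-left word, a caller could reverse a group once it is in reading order, for example with `List<int>.Reverse()`.
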
### Usage
```csharp
@ -39,7 +50,10 @@ using (var document = PdfDocument.Open(@"document.pdf"))
```
### Results
TO DO
The algorithm was run on this [map](https://upload.wikimedia.org/wikipedia/commons/6/64/APISmap1.pdf), which has a complex layout with glyphs and words in widely varying text directions.
The algorithm is able to rebuild words regardless of their direction.
TO DO: Image
# Page segmenters
Page segmenters deal with the task of finding blocks of text in a page. Three different methods are currently available: