Updated Document Layout Analysis (markdown)

2025-08-20 09:21:57 +08:00 · 2020-01-20 10:05:26 +00:00 · 2020-01-20 10:05:26 +00:00 · 6be9a382a3
commit 6be9a382a3
parent 2ff56344cc
1 changed files with 16 additions and 2 deletions
--- a/Document-Layout-Analysis.md
+++ b/Document-Layout-Analysis.md
@@ -18,7 +18,18 @@ TO DO
 ## [Nearest Neighbour method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/NearestNeighbourWordExtractor.cs)

 ### Description
-TO DO
+The nearest neighbour word extractor is useful for extracting words from PDF documents with complex layouts.
+It seeks to connect each glyph bounding box's `EndBaseLine` point to the closest `StartBaseLine` point of another glyph bounding box.
+In order to decide whether two glyphs are _close enough_ to each other, the algorithm uses the larger of the two candidates' widths as a reference distance.
+- For glyphs with known text direction, the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) is used and the threshold is set to 20% of the reference width.
+- For glyphs with unknown text direction, the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is used and the threshold is set to 50% of the reference width.
+
+If the measured distance between the two glyphs is below this threshold, they are deemed to be connected.
+
+Once glyphs are connected, they are grouped into words via a [depth-first search algorithm](https://en.wikipedia.org/wiki/Depth-first_search). The algorithm also orders the glyphs within each word.
+It seems that both [left-to-right and right-to-left](https://en.wikipedia.org/wiki/Right-to-left) scripts have their glyph `StartBaseLine` on the left and `EndBaseLine` on the right.
+So for right-to-left scripts, the glyph ordering within a word should be reversed.
+

 ### Usage
 ```csharp
@@ -39,7 +50,10 @@ using (var document = PdfDocument.Open(@"document.pdf"))
 ```

 ### Results
-TO DO
+The algorithm was run on this [map](https://upload.wikimedia.org/wikipedia/commons/6/64/APISmap1.pdf), which has a complex layout with glyphs and words in very diverse text directions.
+The algorithm is able to rebuild words regardless of their text direction.
+
+TO DO: Image

 # Page segmenters
 Page segmenters deal with the task of finding blocks of text in a page. Three different methods are currently available: