Updated Document Layout Analysis (markdown)

BobLd
2020-02-06 18:28:09 +00:00
parent e7dddb2c89
commit 3d45e72f44

@@ -115,7 +115,7 @@ The method can be tailored by providing a __minimum block width__, and __horizon
- Minimum block width is set to 1/3 of page width:
```csharp
-var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3m);
+var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3.0);
```
- The average height and width of the page's letters are used as the gap size (the values won't change), and the minimum block width is set to 1/3 of the page width:
```csharp
@@ -125,7 +125,7 @@ var blocks = RecursiveXYCut.Instance.GetBlocks(words,
page.Letters.Average(l => l.GlyphRectangle.Height));
```
-- A function that will be applied to the letter heights and widths of each block can also be provided. Here we use the average, and the minimum block width is set to 1/3 of the page width (useful to handle isolated bullet points):
+- A function that will be applied to the letter heights and widths of each block can also be provided. Here we use the average, and the minimum block width is set to 1/3 of the page width:
```csharp
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
@@ -137,6 +137,8 @@ var blocks = RecursiveXYCut.Instance.GetBlocks(words,
### Results
![recursive xy cut example](https://github.com/UglyToad/PdfPig/blob/master/documentation/Document%20Layout%20Analysis/recursive%20xy%20cut%20example.png)
+__NB__: Isolated bullet points can be handled by setting a minimum block width, e.g. `RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3.0)`.
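To illustrate why a minimum block width keeps a bullet attached to its text, here is a minimal, self-contained sketch of a single vertical-cut pass. It is not PdfPig's implementation: `FindVerticalCuts`, `minGap`, and the `(Left, Right)` tuples standing in for word bounding boxes are all hypothetical names chosen for this example, and the rule "cut only when the gap is wide enough and the block to the left of the gap is at least `minBlockWidth` wide" is an assumed simplification. It uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of PdfPig): returns the x-positions at which a
// row of words would be split by vertical cuts. A cut is accepted only if the
// horizontal gap is at least minGap AND the block accumulated to the left of
// the gap is at least minBlockWidth wide (an assumed simplification).
static List<double> FindVerticalCuts(
    IReadOnlyList<(double Left, double Right)> words,
    double minGap,
    double minBlockWidth)
{
    var cuts = new List<double>();
    if (words.Count == 0) return cuts;

    var sorted = words.OrderBy(w => w.Left).ToList();
    double blockLeft = sorted[0].Left;      // left edge of the block being built
    double coveredRight = sorted[0].Right;  // right-most edge covered so far

    foreach (var word in sorted.Skip(1))
    {
        double gap = word.Left - coveredRight;
        bool leftBlockWideEnough = coveredRight - blockLeft >= minBlockWidth;

        if (gap >= minGap && leftBlockWideEnough)
        {
            cuts.Add(coveredRight + gap / 2);  // cut in the middle of the gap
            blockLeft = word.Left;             // start a new block
        }

        // If the cut was rejected, the word is simply merged into the current block.
        coveredRight = Math.Max(coveredRight, word.Right);
    }

    return cuts;
}

// A bullet glyph (x from 50 to 55) followed by its paragraph (x from 70 to 500)
// on a page that is 600 units wide.
var bulletLine = new List<(double Left, double Right)> { (50.0, 55.0), (70.0, 500.0) };

// No minimum width: the bullet is split off into its own block (one cut).
Console.WriteLine(FindVerticalCuts(bulletLine, minGap: 5.0, minBlockWidth: 0.0).Count);

// Minimum width of 1/3 of the page: the bullet is too narrow to form a block (no cut).
Console.WriteLine(FindVerticalCuts(bulletLine, minGap: 5.0, minBlockWidth: 600 / 3.0).Count);
```

With these inputs the first call reports one cut and the second reports none, while two genuine columns that are each wider than a third of the page would still be separated.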
## [Docstrum for bounding boxes method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DocstrumBoundingBoxes.cs)
### Description
Paraphrasing the abstract of the original paper, _the document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbour clustering of page components. The method yields accurate within-line and between-line spacings, and locates text lines and text blocks._