Mirror of https://github.com/UglyToad/PdfPig.git (synced 2026-03-10 00:23:29 +08:00)
Updated Document Layout Analysis (markdown)
@@ -115,7 +115,7 @@ The method can be tailored by providing a __minimum block width__, and __horizon
- Minimum block width is set to 1/3 of page width:
```csharp
-var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3m);
+var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3.0);
```
- Average of the page letters' height and width is used as gap size (the values won’t change), and minimum block width is set to 1/3 of page width:
```csharp
@@ -125,7 +125,7 @@ var blocks = RecursiveXYCut.Instance.GetBlocks(words,
page.Letters.Average(l => l.GlyphRectangle.Height));
```
-- A function that will be applied to each block letters’ height and width can also be provided. Here we use the average and minimum block width is set to 1/3 of page width (useful to handle isolated bullet points):
+- A function that will be applied to each block letters’ height and width can also be provided. Here we use the average and minimum block width is set to 1/3 of page width:
```csharp
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
@@ -137,6 +137,8 @@ var blocks = RecursiveXYCut.Instance.GetBlocks(words,
### Results

+__NB__: Isolated bullet points can be handled by setting a minimum block width, e.g. `RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3.0)`
+
## [Docstrum for bounding boxes method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DocstrumBoundingBoxes.cs)
### Description
Paraphrasing the abstract of the original paper, _the document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbour clustering of page components. The method yields accurate within-line and between-line spacings, and locates text lines and text blocks._
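
The nearest-neighbour idea in the description above can be illustrated with a minimal, self-contained sketch. It does not use PdfPig and is not the library's `DocstrumBoundingBoxes` implementation: the class `DocstrumSketch`, its methods, and the parameters `k` and `toleranceDegrees` are hypothetical names introduced only for this example, and the median is used as a simplified stand-in for the histogram peaks described in the paper. Each point is assumed to be the centroid of a page component such as a word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: hypothetical names, not part of PdfPig.
public static class DocstrumSketch
{
    // For every point, collect (distance, angle) to its k nearest neighbours.
    // The angle is folded into [0, 90] degrees: 0 = horizontal, 90 = vertical.
    public static List<(double Distance, double Angle)> NearestNeighbourPairs(
        IReadOnlyList<(double X, double Y)> points, int k)
    {
        var pairs = new List<(double Distance, double Angle)>();
        for (int i = 0; i < points.Count; i++)
        {
            var p = points[i];
            var neighbours = points
                .Where((_, j) => j != i)                          // skip the point itself
                .Select(q => (Dx: q.X - p.X, Dy: q.Y - p.Y))
                .OrderBy(d => d.Dx * d.Dx + d.Dy * d.Dy)          // closest first
                .Take(k);

            foreach (var (dx, dy) in neighbours)
            {
                double distance = Math.Sqrt(dx * dx + dy * dy);
                double angle = Math.Atan2(Math.Abs(dy), Math.Abs(dx)) * 180.0 / Math.PI;
                pairs.Add((distance, angle));
            }
        }
        return pairs;
    }

    // Median of a sequence; NaN when the sequence is empty.
    public static double Median(IEnumerable<double> values)
    {
        var sorted = values.OrderBy(v => v).ToArray();
        if (sorted.Length == 0) return double.NaN;
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Assumption: the page is not skewed, so near-horizontal pairs approximate
    // within-line spacing and near-vertical pairs approximate between-line spacing.
    public static (double WithinLine, double BetweenLine) EstimateSpacings(
        IReadOnlyList<(double X, double Y)> points, int k = 4, double toleranceDegrees = 30.0)
    {
        var pairs = NearestNeighbourPairs(points, k);
        double within = Median(pairs.Where(p => p.Angle <= toleranceDegrees).Select(p => p.Distance));
        double between = Median(pairs.Where(p => p.Angle >= 90.0 - toleranceDegrees).Select(p => p.Distance));
        return (within, between);
    }
}
```

Calling `DocstrumSketch.EstimateSpacings(centroids)` on a list of component centroids returns the two estimates; either value is `NaN` when no neighbour pair falls inside the corresponding angular window. A real page would also need the skew estimation step that this sketch leaves out.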