From 3d45e72f446f4aa4c092bddc7b6fd74f8b1e8404 Mon Sep 17 00:00:00 2001
From: BobLd <38405645+BobLd@users.noreply.github.com>
Date: Thu, 6 Feb 2020 18:28:09 +0000
Subject: [PATCH] Updated Document Layout Analysis (markdown)

---
 Document-Layout-Analysis.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Document-Layout-Analysis.md b/Document-Layout-Analysis.md
index 77516a5..81f2891 100644
--- a/Document-Layout-Analysis.md
+++ b/Document-Layout-Analysis.md
@@ -115,7 +115,7 @@ The method can be tailored by providing a __minimum block width__, and __horizon
 - Minimum block width is set to 1/3 of page width:
 ```csharp
-var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3m);
+var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3.0);
 ```
 - Average of the page letters' height and width is used as gap size (the values won’t change), and minimum block width is set to 1/3 of page width:
 ```csharp
@@ -125,7 +125,7 @@ var blocks = RecursiveXYCut.Instance.GetBlocks(words,
 page.Letters.Average(l => l.GlyphRectangle.Height));
 ```
-- A function that will be applied to each block letters’ height and width can also be provided. Here we use the average and minimum block width is set to 1/3 of page width (useful to handle isolated bullet points):
+- A function that will be applied to each block letters’ height and width can also be provided. Here we use the average and minimum block width is set to 1/3 of page width:
 ```csharp
 var blocks = RecursiveXYCut.Instance.GetBlocks(words,
@@ -137,6 +137,8 @@ var blocks = RecursiveXYCut.Instance.GetBlocks(words,
 ### Results
 ![recursive xy cut example](https://github.com/UglyToad/PdfPig/blob/master/documentation/Document%20Layout%20Analysis/recursive%20xy%20cut%20example.png)
+__NB__: Isolated bullet points can be handled by setting a minimum block width, e.g. `RecursiveXYCut.Instance.GetBlocks(words, page.Width / 3.0)`
+
 ## [Docstrum for bounding boxes method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DocstrumBoundingBoxes.cs)
 ### Description
 Paraphrasing the abstract of the original paper, _the document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbour clustering of page components. The method yields an accurate within-line, and between-line spacings and locates text lines and text blocks._