From e40f9f13028d439836366dd9ab3c8eade85bb81a Mon Sep 17 00:00:00 2001
From: davebrokit <87085235+davebrokit@users.noreply.github.com>
Date: Sun, 5 Jan 2025 17:25:28 +0000
Subject: [PATCH] Updated Document Layout Analysis (markdown)

---
 Document-Layout-Analysis.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Document-Layout-Analysis.md b/Document-Layout-Analysis.md
index e627c4f..aea4ee7 100644
--- a/Document-Layout-Analysis.md
+++ b/Document-Layout-Analysis.md
@@ -324,7 +324,8 @@ var blocks = recursiveXYCut.GetBlocks(words);
 
 __NB__: Isolated bullet points can be handled by setting a minimum block width, e.g. `RecursiveXYCut.Instance.GetBlocks(words, new RecursiveXYCut.RecursiveXYCutOptions() { MinimumWidth = page.Width / 3.0 })`
 
-__NB__: DominantFontHeightFunc: The examples above use the average letter glyph height for that page. But using the median glyph height would generally produce better results (the median ignores extremes that can impact the results). You may also want to consider using the median glyph height across all the pages in certain situations.
+__NB__: DominantFontHeightFunc: The examples above use the average letter glyph height for that page. But using the median glyph height would generally produce better results (the median ignores extremes that can impact the results). You may also want to consider using the median glyph height across all the pages in certain situations. With text where the gap between lines may be wider than the font height consider using the mean/median distance between the text lines.
+
 
 ## [Docstrum for bounding boxes method](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DocstrumBoundingBoxes.cs)
 ### Description