mirror of
https://github.com/UglyToad/PdfPig.git
synced 2026-03-10 00:23:29 +08:00
Updated Document Layout Analysis (markdown)
@@ -259,10 +259,11 @@ TO DO
TO DO: image

## [Unsupervised Reading Order Detector](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/ReadingOrderDetector/UnsupervisedReadingOrderDetector.cs)

### Description
As per the original paper (see References): _We follow the approach of Aiello et al., who defined a set of binary relations for intervals in X and Y direction that allow a certain amount of tolerance for the coordinate values. In total there are [13 relations in both X and Y direction](https://en.wikipedia.org/wiki/Allen%27s_interval_algebra), and for each pair of two-dimensional bounding boxes exactly one X relation and exactly one Y relation is true. This tolerance is implemented by a parameter `T`; if two coordinates are closer than `T` they are considered equal. This flexibility is necessary because due to the inherent noise in the PDF extraction text blocks in the same column might not be exactly aligned (here we choose `T = 5`). Aiello et al. then defined the `BeforeInReading` relation as a Boolean combination of binary relations for intervals in X and Y direction, which states for any pair of bounding boxes whether the first one occurs at some point (not necessarily immediately) before the other in a column-wise reading order._

_In addition to [the above], we also define the `BeforeInRendering` relation that tells whether a block is rendered at some time before another block in the PDF. We incorporate both relations into a single partial ordering of blocks by specifying a directed graph with an edge between every pair of blocks for which at least one of the two relations hold._ – S. Klampfl et al.
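
To make the role of the tolerance `T` more concrete, here is a minimal, self-contained sketch of how one of Allen's 13 interval relations could be chosen for a pair of intervals on a single axis. The names `IntervalRelation`, `TolerantIntervals`, `Compare` and `Classify` are hypothetical and chosen for illustration; this is not PdfPig's implementation, and it does not reproduce the `BeforeInReading` combination from the paper.

```csharp
using System;

// Allen's 13 interval relations; the "I" suffix marks the inverse relation.
public enum IntervalRelation
{
    Precedes, Meets, Overlaps, Starts, During, Finishes, Equal,
    PrecedesI, MeetsI, OverlapsI, StartsI, DuringI, FinishesI
}

public static class TolerantIntervals
{
    // Returns 0 when x and y are closer than t (treated as equal),
    // otherwise -1 if x is below y and +1 if x is above y.
    public static int Compare(double x, double y, double t)
    {
        if (x == y || Math.Abs(x - y) < t) return 0;
        return x < y ? -1 : 1;
    }

    // Classifies interval A = [a1, a2] against B = [b1, b2] on one axis.
    // The checks are ordered so that exactly one relation is returned.
    public static IntervalRelation Classify(double a1, double a2, double b1, double b2, double t)
    {
        int endVsStart = Compare(a2, b1, t); // A's end against B's start
        if (endVsStart < 0) return IntervalRelation.Precedes;
        if (endVsStart == 0) return IntervalRelation.Meets;

        int startVsEnd = Compare(a1, b2, t); // A's start against B's end
        if (startVsEnd > 0) return IntervalRelation.PrecedesI;
        if (startVsEnd == 0) return IntervalRelation.MeetsI;

        // From here on the two intervals share more than a boundary.
        int starts = Compare(a1, b1, t);
        int ends = Compare(a2, b2, t);

        if (starts == 0 && ends == 0) return IntervalRelation.Equal;
        if (starts == 0) return ends < 0 ? IntervalRelation.Starts : IntervalRelation.StartsI;
        if (ends == 0) return starts > 0 ? IntervalRelation.Finishes : IntervalRelation.FinishesI;
        if (starts < 0 && ends < 0) return IntervalRelation.Overlaps;
        if (starts > 0 && ends > 0) return IntervalRelation.OverlapsI;
        return starts > 0 ? IntervalRelation.During : IntervalRelation.DuringI;
    }
}
```

For example, with `t = 5` the call `TolerantIntervals.Classify(0, 10, 12, 20, 5)` yields `Meets` rather than `Precedes`, because the gap of 2 between the two intervals is below the tolerance.
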
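
The quoted passage describes a directed graph over blocks but not how that graph is turned into a single sequence. One common way to linearise a partial order is a topological sort; the sketch below uses Kahn's algorithm and is an assumption for illustration, not necessarily what PdfPig or the paper does. The class `ReadingOrderSketch` and the `before` predicate are hypothetical; `before(i, j)` stands in for "`BeforeInReading` or `BeforeInRendering` holds for blocks `i` and `j`".

```csharp
using System;
using System.Collections.Generic;

public static class ReadingOrderSketch
{
    // count: number of blocks, indexed 0..count-1.
    // before(i, j): true when block i should come before block j
    // (hypothetical predicate standing in for the combined relations).
    public static List<int> Order(int count, Func<int, int, bool> before)
    {
        var successors = new List<int>[count];
        var inDegree = new int[count];
        for (var i = 0; i < count; i++) successors[i] = new List<int>();

        // Add an edge i -> j for every ordered pair where the predicate holds.
        for (var i = 0; i < count; i++)
        {
            for (var j = 0; j < count; j++)
            {
                if (i != j && before(i, j))
                {
                    successors[i].Add(j);
                    inDegree[j]++;
                }
            }
        }

        // Blocks with no remaining predecessor; the lowest index is taken first.
        var ready = new SortedSet<int>();
        for (var i = 0; i < count; i++)
        {
            if (inDegree[i] == 0) ready.Add(i);
        }

        var emitted = new bool[count];
        var result = new List<int>(count);
        while (result.Count < count)
        {
            int next;
            if (ready.Count > 0)
            {
                next = ready.Min;
                ready.Remove(next);
            }
            else
            {
                // The graph contains a cycle: fall back to the lowest index not yet emitted.
                next = Array.IndexOf(emitted, false);
            }

            emitted[next] = true;
            result.Add(next);
            foreach (var s in successors[next])
            {
                if (!emitted[s] && --inDegree[s] == 0) ready.Add(s);
            }
        }

        return result;
    }
}
```

With three blocks and edges `2 -> 0` and `0 -> 1`, `ReadingOrderSketch.Order(3, (i, j) => (i == 2 && j == 0) || (i == 0 && j == 1))` returns the indices in the order 2, 0, 1. The cycle fallback is a design choice of this sketch: noisy relations can contradict each other, and falling back to the lowest remaining index keeps the result deterministic.
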
### References
- Section 5.1 of [_Unsupervised document structure analysis of digital scientific articles_](http://www.know-center.tugraz.at/download_extern/papers/ijdl-2013.pdf) by S. Klampfl, M. Granitzer, K. Jack, R. Kern
@@ -290,7 +291,27 @@ using (var document = PdfDocument.Open(@"document.pdf"))
```

#### Advanced case
- Set the tolerance parameter `T` to 10.
```csharp
using (var document = PdfDocument.Open(@"document.pdf"))
{
    for (var i = 0; i < document.NumberOfPages; i++)
    {
        // Page numbers are 1-based.
        var page = document.GetPage(i + 1);

        var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

        // Use a tolerance T of 10 instead of the value of 5 chosen in the paper.
        var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(10);
        var orderedBlocks = unsupervisedReadingOrderDetector.Get(blocks);

        foreach (var block in orderedBlocks)
        {
            // Do something
        }
    }
}
```

### Result
Viewing the exported XML file in PRImA Research Lab's LayoutEvalGUI:
@@ -388,7 +409,7 @@ using (var document = PdfDocument.Open(@"document.pdf"))


## [Decoration Text Block Classifier](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DecorationTextBlockClassifier.cs)
The algorithm returns text blocks that are classified as decoration blocks. From the referenced paper: _Many digital documents have archival information such as author names, publication titles, page numbers, and release dates printed repeatedly at the border of each page. Most prominently this content is placed inside headers or footers, but sometimes also at the left or right edge of the page. We refer to text blocks containing this type of information as decoration blocks._ – S. Klampfl et al.

The document must contain more than 2 pages for the classifier to work.