package license url is deprecated in favour of package license expression but nuget doesn't seem to support expressions properly for published packages yet so we'll keep the deprecated url for the time being. having both url and expression causes the build to fail.
small obvious performance improvements for file header passing and getting the encoding information using the existing reverse name to code map.
- Create a TextBlock class
- Creates IPageSegmenter
- Add other useful distances: angle, etc.
- Update RecursiveXYCut
- With IPageSegmenter and TextBlock
- Make XYNode and XYLeaf internal
- Optimise (faster) NearestNeighbourWordExtractor and isolate the clustering algorithms for use outside of this class
- Implement a Docstrum inspired page segmentation algorithm
Text edges are where words have either there BoundingBox's left, right or mid coordinate aligned on the same vertical line.
Useful to detect tables, justified text, lists, etc.
an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.
previously if no matching unicode was found for a character code we would return a null letter. instead we now map from the character code directly to a character. this seems to work for most documents, except where there are ligatures, e.g. fi or ff, but is still better than not returning anything.
point size was previously only calculated based on the transformation matrix but now uses the transformation matrix, the rotation matrix and the font matrix values. the calculated value still seems unlikely to be correct so it is exposed using the page's experimental access for now, rather than as a public getter.
- begin adding support for extended graphics state (the 'gs' operator) including setting the font #39.
- apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41.
- wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42.
- fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.
we need to apply rotation to the crop and media box and therefore find the correct width and height. but for now correctly deriving the rotation from the page tree should help consumers.