Commit Graph

520 Commits

Author SHA1 Message Date
Eliot Jones
8c100efe04 Merge pull request #60 from BobLd/master
Improve ClusteringAlgorithms.GroupIndexes() and add Equals() to PdfLine
2019-08-17 12:58:06 +01:00
vadik299
cc767b8cd6 Merge branch 'master' into master 2019-08-16 18:34:57 -04:00
BobLd
afa2b7baa1 Improve ClusteringAlgorithms.GroupIndexes()
Add Equals() to PdfLine
2019-08-14 19:58:31 +01:00
Eliot Jones
ac62b7247b version 0.0.9 2019-08-13 21:24:54 +01:00
Eliot Jones
f5e025aa70 merge pull request #58 from uglytoad/colors
adds colors to letters and prepares code to add colors to paths.
2019-08-13 20:50:06 +01:00
Eliot Jones
f55091f3d2 make color types public and add stream based tests to prevent future breaking as observed in #52 2019-08-13 20:48:22 +01:00
Vasya
22278f64c4 Added TextSequence 2019-08-11 14:55:59 -04:00
Eliot Jones
980e67fabe Merge pull request #56 from BobLd/master
Document Layout Analysis - IPageSegmenter, Docstrum
2019-08-11 14:04:39 +01:00
BobLd
9f13739add correcting typo 2019-08-11 13:54:47 +01:00
BobLd
7e8b3bdc85 Update DocstrumBB to account for middle point of the overlapping area distance. For this, using distance between 2 lines. 2019-08-11 13:45:08 +01:00
Eliot Jones
0349bedd3e #57 add access to document metadata and expose wrapper type 2019-08-11 12:42:30 +01:00
BobLd
eb9a9fd00e Document Layout Analysis - IPageSegmenter, Docstrum
- Create a TextBlock class
- Creates IPageSegmenter
- Add other useful distances: angle, etc.
- Update RecursiveXYCut
 - With IPageSegmenter and TextBlock
 - Make XYNode and XYLeaf internal
- Optimise (faster) NearestNeighbourWordExtractor and isolate the clustering algorithms for use outside of this class
- Implement a Docstrum inspired page segmentation algorithm
2019-08-10 16:01:27 +01:00
Eliot Jones
fc2d532b82 use single instances of black and white for rgb/gray colors 2019-08-10 14:58:02 +01:00
BobLd
9b24223190 Removing ToDouble() 2019-08-10 13:52:01 +01:00
BobLd
bd58879e32 Update from comments 2019-08-10 13:05:25 +01:00
BobLd
474ce9a442 Improving PdfPoint 2019-08-09 19:58:48 +01:00
BobLd
5399456919 Making the RecursiveXYCut class static. 2019-08-09 18:50:20 +01:00
BobLd
ac065e988a Adding Centroid to PdfRectangle. 2019-08-09 17:22:16 +01:00
Richard Webb
f70b7c69a0 Change StreamInputBytes.Seek to reset isAtEnd to false 2019-08-08 23:14:16 +01:00
Eliot Jones
31e15ea097 remove unused docstrum class 2019-08-08 21:21:27 +01:00
Eliot Jones
c5d03bca97 move application of transformation matrix outside path 2019-08-08 21:19:18 +01:00
BobLd
801ea3ba7f Modified PublicApiScannerTests 2019-08-07 14:22:39 +01:00
BobLd
7de6de3780 Updating with comments 2019-08-07 13:50:07 +01:00
BobLd
e19b03035e Updating with comments 2019-08-07 13:49:05 +01:00
BobLd
85d5bb7c7e Adding enum EdgeType 2019-08-07 13:45:57 +01:00
BobLd
9694b1f8e8 Update TextEdgesExtractor.cs 2019-08-06 15:27:16 +01:00
BobLd
83889cfb52 Document Layout Analysis - Text edges extractor
Text edges are where words have their BoundingBox's left, right or mid coordinate aligned on the same vertical line.
Useful to detect tables, justified text, lists, etc.
2019-08-06 15:24:16 +01:00
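The text-edge idea in the commit above can be illustrated with a rough sketch. This is not the project's TextEdgesExtractor. The `Word` type, the `find_text_edges` function, and the `min_count` and `precision` parameters are hypothetical names chosen for illustration, and the sketch ignores vertical proximity, which a real extractor would likely take into account.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    """Hypothetical stand-in for a word with a horizontal bounding box."""
    text: str
    left: float
    right: float

    @property
    def mid(self) -> float:
        return (self.left + self.right) / 2


def find_text_edges(
    words: List[Word], min_count: int = 3, precision: int = 0
) -> Dict[str, Dict[float, List[Word]]]:
    """Group words whose left, mid or right x coordinate coincide after rounding.

    Only groups with at least `min_count` words are kept as edges.
    """
    edges = {"left": defaultdict(list), "mid": defaultdict(list), "right": defaultdict(list)}
    for word in words:
        for kind in edges:
            # Rounding lets coordinates that differ by a fraction of a point share a key.
            key = round(getattr(word, kind), precision)
            edges[kind][key].append(word)
    return {
        kind: {x: group for x, group in groups.items() if len(group) >= min_count}
        for kind, groups in edges.items()
    }
```

Three words whose left coordinates are 10.0, 10.2 and 9.9 would all fall under the key 10.0 and form a left edge, which is the kind of alignment seen in lists and table columns.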
Eliot Jones
4dde4ca0c1 add colors to letters based on current font and graphics state 2019-08-05 19:26:10 +01:00
Eliot Jones
0df35b8488 fix naming of color space to be 2 words 2019-08-05 18:32:44 +01:00
Eliot Jones
0b9ae1db13 add color information to the operation context. create color classes for letters and paths to use 2019-08-04 16:47:47 +01:00
Eliot Jones
1d551d6de3 add and document core classes for colorspace information 2019-08-04 12:57:06 +01:00
Eliot Jones
f07ab7d2c3 version 0.0.7 2019-08-03 16:14:58 +01:00
Eliot Jones
364bd25fa8 #48 add handling of inline image data to pdf content parsing
an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.
2019-08-03 15:42:19 +01:00
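A minimal sketch of the special case described in the commit above, assuming the tokenizer has already located the ID operator. The function name `split_inline_image` is hypothetical and this is not the project's code. It treats everything between ID and the next whitespace-delimited EI as raw image bytes instead of tokenizing it. Binary data that happens to contain whitespace, EI, whitespace would defeat this simple scan, so a production parser needs stronger checks.

```python
import re
from typing import Tuple

# "EI" preceded by whitespace and followed by whitespace or the end of the stream.
_END_IMAGE = re.compile(rb"\s(EI)(?=\s|$)")


def split_inline_image(content: bytes, id_index: int) -> Tuple[bytes, int]:
    """Return (image_bytes, index_just_after_EI) for the ID operator at id_index."""
    if content[id_index:id_index + 2] != b"ID":
        raise ValueError("expected the ID operator at id_index")
    # Skip "ID" and the single whitespace byte that separates it from the data.
    data_start = id_index + 3
    match = _END_IMAGE.search(content, data_start)
    if match is None:
        raise ValueError("inline image is not terminated by EI")
    # The whitespace byte before EI is a delimiter, not part of the image data.
    return content[data_start:match.start()], match.end(1)
```

Normal token parsing can resume from the returned index, so operators following the image are read as usual.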
BobLd
2b43867e19 check if 'fontProgram' is null in Type2CidFont.GetWidthFromFont() 2019-07-30 14:53:29 +01:00
vadimy
7d3a0929b6 Refactoring and fixing according to Eliot's comments 2019-07-24 00:00:00 -04:00
vadik299
7c50733cbc Update src/UglyToad.PdfPig/Content/PageContent.cs
Co-Authored-By: Eliot Jones <elioty@hotmail.co.uk>
2019-07-23 21:05:00 -04:00
vadik299
f2a64d9362 Update src/UglyToad.PdfPig/Content/Page.cs
Co-Authored-By: Eliot Jones <elioty@hotmail.co.uk>
2019-07-23 21:04:28 -04:00
vadimy
b9d0cca2a6 Added "Paths" collection to Page object.
Added matrix transformation to path operators.
2019-07-16 00:35:29 -04:00
Eliot Jones
453faf50af start adding colorspace path operations to the operation context 2019-07-10 21:31:23 +01:00
Eliot Jones
3c49371c68 test hex to string implementation and remove unused method 2019-07-07 17:30:54 +01:00
Eliot Jones
557d8bc948 map missing character codes directly #44
previously if no matching unicode was found for a character code we would return a null letter. instead we now map from the character code directly to a character. this seems to work for most documents, except where there are ligatures, e.g. fi or ff, but is still better than not returning anything.
2019-07-07 13:53:25 +01:00
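The fallback described in the commit above might look roughly like the following sketch. The function `code_to_text` and the dictionary-based `to_unicode` map are hypothetical simplifications, not the project's API.

```python
from typing import Mapping


def code_to_text(code: int, to_unicode: Mapping[int, str]) -> str:
    """Return the mapped unicode text for a character code, or fall back to the code itself."""
    mapped = to_unicode.get(code)
    if mapped is not None:
        return mapped
    # No mapping found: interpret the character code directly as a code point.
    # This gives wrong results for ligature glyphs such as fi or ff.
    return chr(code)
```

With an empty map, code 65 comes back as "A" rather than as a missing letter.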
Eliot Jones
198cca1336 change point size calculation to use rotation #41
point size was previously only calculated based on the transformation matrix but now uses the transformation matrix, the rotation matrix and the font matrix values. the calculated value still seems unlikely to be correct so it is exposed using the page's experimental access for now, rather than as a public getter.
2019-07-07 12:12:09 +01:00
Eliot Jones
0dfe742770 continue searching for xref tokens even if an %%eof is encountered #38 2019-07-06 14:26:38 +01:00
Eliot Jones
c495065178 support gs operator, fix systemfonts, apply rotation to glyphs
- begin adding support for extended graphics state (the 'gs' operator) including setting the font #39.
- apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41.
- wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42.
- fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.
2019-07-06 14:03:23 +01:00
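The rotation wrapper mentioned in the commit above can be sketched as a small immutable type. The class name `PageRotation` and its members are hypothetical, and normalising out-of-range values into 0 to 359 is an assumption of this sketch rather than documented behaviour of the library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRotation:
    """Immutable page rotation restricted to multiples of 90 degrees."""
    degrees: int

    def __post_init__(self) -> None:
        normalised = self.degrees % 360  # e.g. -90 becomes 270
        if normalised % 90 != 0:
            raise ValueError(f"rotation must be a multiple of 90, got {self.degrees}")
        # frozen=True blocks normal assignment, so set the field explicitly.
        object.__setattr__(self, "degrees", normalised)

    @property
    def swaps_height_and_width(self) -> bool:
        """True when the page is turned onto its side."""
        return self.degrees in (90, 270)

    @property
    def radians(self) -> float:
        return math.radians(self.degrees)
```

Validating once in the constructor means code that receives a `PageRotation` does not have to re-check the value.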
Eliot Jones
88e02cabab include rotation in page object #42
we need to apply rotation to the crop and media box and therefore find the correct width and height. but for now correctly deriving the rotation from the page tree should help consumers.
2019-07-05 19:18:14 +01:00
Eliot Jones
52d2a90dfc finish revision 5 and 6 owner password handling #34
moves owner password check first. correctly calculates encryption key for owner password for revision 5 and 6.
2019-06-25 19:44:26 +01:00
Eliot Jones
76f8222f74 start adding support for undocumented revision 6 encryption
revision 6 was added in the pdf 2.0 specification which is document iso 32000-2:2017. because iso are rent-seeking they charge money to view this specification so it is effectively undocumented. this site details some of the algorithm https://web.archive.org/web/20180311160224/esec-lab.sogeti.com/posts/2011/09/14/the-undocumented-password-validation-algorithm-of-adobe-reader-x.html. the code in this change ports the pdfbox logic line by line. it doesn't implement the correct behaviour for owner password yet.
2019-06-24 20:37:25 +01:00
Eliot Jones
cc98bf1089 remove byte order marks from unicode strings #32 2019-06-23 15:22:37 +01:00
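One way to drop byte order marks when decoding string bytes is sketched below. The function `decode_text_string` is hypothetical, and Latin-1 stands in for PDFDocEncoding only to keep the example short. The actual change in the commit above may differ.

```python
import codecs


def decode_text_string(raw: bytes) -> str:
    """Decode string bytes, consuming a leading UTF-16 byte order mark if present."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Skip the two BOM bytes so U+FEFF does not end up in the decoded text.
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    # Assumption: Latin-1 is used here as a rough substitute for PDFDocEncoding.
    return raw.decode("latin-1")
```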
Eliot Jones
f86c2545bd treat encryption entries as optional for revisions 5+ #34
the revision 5 and 6 encryption algorithms specify the presence of additional encryption material named 'oe' and 'ue'. it turns out this is not always required so will now default to null if not present. this also adds support for those values being in hex rather than normal string format.

tidies up some commenting on the xynode class, moves public methods below constructors and adds xy to the resharper list of abbreviations for the solution.
2019-06-23 13:52:12 +01:00
Eliot Jones
ff9e2ad83f handle hex registry and ordering. decrypt hex tokens #34
cid fonts may contain a registry, ordering and supplement to identify the font. we were checking for string registry and ordering tokens but failing on hex tokens.

for encrypted documents we now decrypt hex data.
2019-06-23 13:27:32 +01:00