Commit Graph

399 Commits

Author SHA1 Message Date
BobLd
e19b03035e Updating with comments 2019-08-07 13:49:05 +01:00
BobLd
85d5bb7c7e Adding enum EdgeType 2019-08-07 13:45:57 +01:00
BobLd
9694b1f8e8 Update TextEdgesExtractor.cs 2019-08-06 15:27:16 +01:00
BobLd
83889cfb52 Document Layout Analysis - Text edges extractor
Text edges are vertical lines on which several words have their BoundingBox's left, right or mid coordinate aligned.
Useful to detect tables, justified text, lists, etc.
2019-08-06 15:24:16 +01:00
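The text edge idea in the commit above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in TextEdgesExtractor.cs: the `Word` record, the `text_edges` function, the rounding `precision` and the `min_words` threshold are all assumptions made for the example, and it ignores how far apart the words are vertically.

```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical word record: text plus the x extent of its bounding box.
    text: str
    left: float
    right: float


def text_edges(words, min_words=3, precision=0):
    """Group words whose left, mid or right x coordinate coincide.

    Returns {edge_type: {x: [word texts]}}, keeping only x positions shared
    by at least `min_words` words. Coordinates are rounded to `precision`
    decimals so that near-identical values fall into one bucket.
    """
    buckets = {
        "left": defaultdict(list),
        "mid": defaultdict(list),
        "right": defaultdict(list),
    }
    for w in words:
        buckets["left"][round(w.left, precision)].append(w.text)
        buckets["mid"][round((w.left + w.right) / 2, precision)].append(w.text)
        buckets["right"][round(w.right, precision)].append(w.text)
    # Drop x positions that too few words share.
    return {
        edge: {x: ws for x, ws in by_x.items() if len(ws) >= min_words}
        for edge, by_x in buckets.items()
    }
```

A left-aligned list or table column would show up as a single entry under "left" containing every word that starts at that x position.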
Eliot Jones
f07ab7d2c3 version 0.0.7 v0.0.7 2019-08-03 16:14:58 +01:00
Eliot Jones
364bd25fa8 #48 add handling of inline image data to pdf content parsing
an inline image in a pdf content stream starts with the BI operator, then ID declares the start of the image data and EI the end. attempting to parse the bytes after the ID operator as ordinary tokens resulted in errors. this change adds special case handling for inline images.
2019-08-03 15:42:19 +01:00
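A minimal Python sketch of the special case described above: once the start of an inline image is seen, the bytes between the ID and EI operators are sliced out as raw data instead of being tokenized. The function name `split_inline_image` and its return shape are assumptions for illustration, not the library's API. The sketch also assumes the image bytes never contain a whitespace-delimited "EI" themselves, which a real parser cannot rely on.

```python
import re

# ID must be surrounded by whitespace; EI must be preceded by whitespace
# and followed by whitespace or the end of the data.
_ID = re.compile(rb"\sID\s")
_EI = re.compile(rb"\s(EI)(?=\s|$)")


def split_inline_image(stream: bytes, bi_index: int):
    """Return (dictionary_bytes, image_bytes, next_index) for the inline
    image whose BI operator starts at `bi_index` in `stream`."""
    if stream[bi_index:bi_index + 2] != b"BI":
        raise ValueError("expected BI at the given index")
    id_match = _ID.search(stream, bi_index + 2)
    if id_match is None:
        raise ValueError("no ID operator after BI")
    data_start = id_match.end()
    # Start one byte early so the whitespace after ID can also serve as the
    # whitespace before EI when the image data is empty.
    ei_match = _EI.search(stream, data_start - 1)
    if ei_match is None:
        raise ValueError("no EI operator after ID")
    dictionary = stream[bi_index + 2:id_match.start()].strip()
    image = stream[data_start:ei_match.start()]
    return dictionary, image, ei_match.end(1)
```

Normal tokenizing would resume at the returned `next_index`, just after EI.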
Eliot Jones
5ee9c49f8a merge pull request #49 from BobLd/master
check if 'fontProgram' is null in Type2CidFont.GetWidthFromFont()
2019-07-30 19:16:09 +01:00
BobLd
2b43867e19 check if 'fontProgram' is null in Type2CidFont.GetWidthFromFont() 2019-07-30 14:53:29 +01:00
Eliot Jones
413ebe35f9 merge pull request #46 from vadik299/master
Adding Paths collection
2019-07-29 18:07:12 +01:00
vadimy
7d3a0929b6 Refactoring and fixing according to Eliot's comments 2019-07-24 00:00:00 -04:00
vadik299
7c50733cbc Update src/UglyToad.PdfPig/Content/PageContent.cs
Co-Authored-By: Eliot Jones <elioty@hotmail.co.uk>
2019-07-23 21:05:00 -04:00
vadik299
f2a64d9362 Update src/UglyToad.PdfPig/Content/Page.cs
Co-Authored-By: Eliot Jones <elioty@hotmail.co.uk>
2019-07-23 21:04:28 -04:00
vadimy
6ded47da92 Updated tests 2019-07-16 00:36:57 -04:00
vadimy
b9d0cca2a6 Added "Paths" collection to Page object.
Added matrix transformation to path operators.
2019-07-16 00:35:29 -04:00
Eliot Jones
453faf50af start adding colorspace path operations to the operation context 2019-07-10 21:31:23 +01:00
Eliot Jones
283e1d38fa use azure pipelines instead of appveyor for builds
* trial azure pipelines [skip ci]
* use vs2017
* build pr commits
* include codecov and update test nuget
* add codecov call
* add publish test results step
* include coverlet package for test coverage and allow coverlet dynamic public types
* add azure pipelines badge and remove appveyor badge
* add nuget pack step
* use build configuration variable for nuget pack and move after build
* fix path to package to pack
* change nuget to dotnet pack
* remove old codecov related tools
2019-07-09 21:21:11 +01:00
Eliot Jones
3c49371c68 test hex to string implementation and remove unused method 2019-07-07 17:30:54 +01:00
Eliot Jones
50bf1784bd merge pull request #43 from Numpsy/test_document_information
Add a unit test for reading document information
2019-07-07 17:01:42 +01:00
Eliot Jones
869bb1828b add contributing guide with set-up help 2019-07-07 14:41:48 +01:00
Eliot Jones
557d8bc948 map missing character codes directly #44
previously if no matching unicode was found for a character code we would return a null letter. instead we now map from the character code directly to a character. this seems to work for most documents, except where there are ligatures, e.g. fi or ff, but is still better than not returning anything.
2019-07-07 13:53:25 +01:00
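The fallback described above amounts to something like the following Python sketch. The function name `letter_for_code` and the plain dictionary standing in for the font's unicode mapping are assumptions for illustration only.

```python
def letter_for_code(code: int, to_unicode: dict) -> str:
    """Return the mapped unicode text for a character code, falling back to
    treating the code itself as a unicode code point when no mapping exists."""
    mapped = to_unicode.get(code)
    if mapped is not None:
        return mapped
    # Fallback: previously this case produced no letter at all.
    return chr(code)
```

The fallback is only right when the font's character codes happen to coincide with unicode code points, which is why ligatures such as fi or ff come out wrong.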
Eliot Jones
198cca1336 change point size calculation to use rotation #41
point size was previously only calculated based on the transformation matrix but now uses the transformation matrix, the rotation matrix and the font matrix values. the calculated value still seems unlikely to be correct so it is exposed using the page's experimental access for now, rather than as a public getter.
2019-07-07 12:12:09 +01:00
Richard Webb
10dcbd0cc4 Add a unit test for reading document information 2019-07-06 22:18:18 +01:00
Eliot Jones
0dfe742770 continue searching for xref tokens even if an %%eof is encountered #38 2019-07-06 14:26:38 +01:00
Eliot Jones
c495065178 support gs operator, fix systemfonts, apply rotation to glyphs
- begin adding support for extended graphics state (the 'gs' operator) including setting the font #39.
- apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41.
- wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42.
- fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.
2019-07-06 14:03:23 +01:00
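The idea of wrapping the page rotation in a value type restricted to right angles might look like the Python sketch below. The class name `PageRotation` and its members are hypothetical and are not taken from the library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRotation:
    """Immutable rotation restricted to 0, 90, 180 or 270 degrees."""

    degrees: int

    def __post_init__(self):
        normalized = self.degrees % 360
        if normalized % 90 != 0:
            raise ValueError(f"rotation must be a multiple of 90, got {self.degrees}")
        # Frozen dataclasses need object.__setattr__ to store the normalized value.
        object.__setattr__(self, "degrees", normalized)

    @property
    def swaps_width_and_height(self) -> bool:
        # Convenience member: 90 and 270 degree rotations swap the page axes.
        return self.degrees in (90, 270)

    @property
    def radians(self) -> float:
        return math.radians(self.degrees)
```

Validating once in the constructor means every consumer can rely on the value being a right angle without re-checking it.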
Eliot Jones
88e02cabab include rotation in page object #42
we need to apply rotation to the crop and media box and therefore find the correct width and height. but for now correctly deriving the rotation from the page tree should help consumers.
2019-07-05 19:18:14 +01:00
Eliot Jones
52d2a90dfc finish revision 5 and 6 owner password handling #34
moves owner password check first. correctly calculates encryption key for owner password for revision 5 and 6.
2019-06-25 19:44:26 +01:00
Eliot Jones
76f8222f74 start adding support for undocumented revision 6 encryption
revision 6 was added in the pdf 2.0 specification which is document iso 32000-2:2017. because iso are rent-seeking they charge money to view this specification so it is effectively undocumented. this site details some of the algorithm https://web.archive.org/web/20180311160224/esec-lab.sogeti.com/posts/2011/09/14/the-undocumented-password-validation-algorithm-of-adobe-reader-x.html. the code in this change ports the pdfbox logic line by line. it doesn't implement the correct behaviour for owner password yet.
2019-06-24 20:37:25 +01:00
Eliot Jones
cc98bf1089 remove byte order marks from unicode strings #32 2019-06-23 15:22:37 +01:00
Eliot Jones
f86c2545bd treat encryption entries as optional for revisions 5+ #34
the revision 5 and 6 encryption algorithms specify the presence of additional encryption material named 'oe' and 'ue'. it turns out this is not always required so will now default to null if not present. this also adds support for those values being in hex rather than normal string format.

tidies up some commenting on the xynode class, moves public methods below constructors and adds xy to the resharper list of abbreviations for the solution.
2019-06-23 13:52:12 +01:00
Eliot Jones
ff9e2ad83f handle hex registry and ordering. decrypt hex tokens #34
cid fonts may contain a registry, ordering and supplement to identify the font. we were checking for string registry and ordering tokens but failing on hex tokens.

for encrypted documents we now decrypt hex data.
2019-06-23 13:27:32 +01:00
Eliot Jones
0f103554fb handle non-standard crypt dictionary type and use hex bytes for password #34
using an online tool to encrypt a simple document with aes-128 seems to add the dictionary type cryptalgorithm rather than cryptfilter. i couldn't find any references to cryptalgorithm in the spec or pdfbox but it seems to work ok when treated as equivalent to cryptfilter.

there are situations where the string derived from a hex token has a different length to the underlying bytes, for example if the hex token contains the '\0' byte. in those cases the encryption algorithm needs to use the raw bytes rather than the 'stringified' bytes. this change passes raw bytes for hex tokens for both the user and owner password keys.
2019-06-23 13:12:47 +01:00
Eliot Jones
d259f89bd9 Merge pull request #40 from Numpsy/rw/unicode_hex_strings
add utf-16 parsing support to hextoken
2019-06-23 12:38:44 +01:00
Eliot Jones
41eddca0bf handle incorrect xref offsets #34
previously if the cross reference did not exist at exactly the provided offset we'd immediately throw, now we assume we can read a few more tokens to find the xref table or stream start. this won't work in the case where the provided offset is past the start of the table or nowhere near the table but in those cases there's not much we can do. there's some more work to do to provide a fallback xref parser which finds the xref tables and streams using a brute-force scan of the whole document.
2019-06-23 12:05:21 +01:00
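The tolerance described above can be sketched in Python as follows: instead of failing when the keyword is not exactly at the declared offset, read a handful of tokens forward and stop at the first `xref`. The function name `find_xref_start` and the `max_tokens` limit are assumptions for the example. The sketch only looks for the `xref` table keyword and does not cover cross reference streams.

```python
import re

_TOKEN = re.compile(rb"\S+")


def find_xref_start(data: bytes, offset: int, max_tokens: int = 5):
    """Scan at most `max_tokens` whitespace-separated tokens starting at
    `offset` and return the index of the 'xref' keyword, or None."""
    for count, match in enumerate(_TOKEN.finditer(data, offset)):
        if count >= max_tokens:
            break
        if match.group() == b"xref":
            return match.start()
    return None
```

As the commit message notes, this cannot help when the declared offset is already past the start of the table or nowhere near it; those cases return `None` here.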
Eliot Jones
0c1b50fcc4 Merge pull request #36 from BobLd/master
Document Layout Analysis Tools
2019-06-23 11:32:50 +01:00
Richard Webb
b5b862e63f unit tests for tokenizing UTF16 encoded hex strings. 2019-06-23 01:19:43 +01:00
Richard Webb
0432f703c4 extend HexToken to support UTF-16BE encoded hex strings 2019-06-23 01:18:48 +01:00
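For context, a hex string whose bytes begin with the byte order mark FE FF is UTF-16BE text. A rough Python sketch of decoding such a token is shown below. The function name `decode_hex_string` is hypothetical, and the latin-1 fallback is only an approximation of PDFDocEncoding.

```python
def decode_hex_string(hex_text: str) -> str:
    """Decode the contents of a PDF hex string (the text between < and >)."""
    # Whitespace inside a hex string is ignored.
    digits = "".join(hex_text.split())
    # An odd number of digits means the final digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE text: drop the byte order mark before decoding.
        return raw[2:].decode("utf-16-be")
    # Assumption: approximate PDFDocEncoding with latin-1.
    return raw.decode("latin-1")
```

For example, `decode_hex_string("FEFF00480069")` and `decode_hex_string("48 69")` both yield "Hi".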
BobLd
00233fa5d0 Update with corrections - 2 2019-06-20 22:10:05 +01:00
Eliot Jones
7b96483664 include raw dictionary token in the document information class #38 2019-06-19 21:23:06 +01:00
Eliot Jones
b7b08fa881 add gitter badge 2019-06-19 18:50:48 +01:00
Eliot Jones
35b6c4f0eb handle case where font metrics do not declare width or height #35 2019-06-19 18:47:50 +01:00
BobLd
080354dc54 Corrected PublicApiScannerTests 2019-06-18 21:32:14 +01:00
BobLd
f8d0883da5 Update with corrections 2019-06-18 20:48:49 +01:00
Eliot Jones
caf1a0c233 use invariant culture for parsing all numbers #37 2019-06-18 19:12:51 +01:00
BobLd
4416793f6d Corrected PublicApiScannerTests 2019-06-16 19:19:44 +01:00
BobLd
2525cd243f Typo correction 2019-06-16 14:03:12 +01:00
BobLd
a0c864e8af Adding Document Layout Analysis:
- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm, useful for multi-column pdf documents
2019-06-16 13:57:30 +01:00
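As a rough illustration of the recursive X-Y cut idea mentioned above, the Python sketch below repeatedly splits a set of bounding boxes at a whitespace gap, first looking for a gap along x (separating columns) and then along y. It is a simplified sketch rather than the committed implementation: the `Box` type, the `xy_cut` function and the `min_gap` threshold are assumptions, and it cuts at the first sufficiently wide gap instead of choosing the best one.

```python
from typing import NamedTuple


class Box(NamedTuple):
    # Hypothetical bounding box in page coordinates.
    left: float
    bottom: float
    right: float
    top: float


def _split(boxes, lo, hi, min_gap):
    """Split boxes at the first empty gap wider than `min_gap` along one axis.

    `lo` and `hi` return the start and end of a box on that axis.
    Returns (before, after) or None when no such gap exists.
    """
    ordered = sorted(boxes, key=lo)
    reach = hi(ordered[0])  # furthest extent covered so far
    for i in range(1, len(ordered)):
        if lo(ordered[i]) - reach > min_gap:
            return ordered[:i], ordered[i:]
        reach = max(reach, hi(ordered[i]))
    return None


def xy_cut(boxes, min_gap=10.0):
    """Return the leaf blocks (lists of boxes) found by recursive cutting."""
    if not boxes:
        return []
    if len(boxes) == 1:
        return [list(boxes)]
    axes = (
        (lambda b: b.left, lambda b: b.right),   # gap along x: column split
        (lambda b: b.bottom, lambda b: b.top),   # gap along y: row split
    )
    for lo, hi in axes:
        parts = _split(boxes, lo, hi, min_gap)
        if parts:
            first, second = parts
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No gap on either axis: this group is a single block.
    return [list(boxes)]
```

On a two-column page where the gutter is wider than `min_gap` and the line spacing is narrower, this returns one block per column.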
Eliot Jones
2c9a3d6e96 add test coverage for direct object finder 2019-06-14 20:57:46 +01:00
Eliot Jones
98424b32aa special case handling for faulty offsets in xref with missing whitespace between eof and object number 2019-06-14 20:40:24 +01:00
Eliot Jones
4c716fcbd6 finish support for revision 5 encryption using aes 256 #34 2019-06-13 19:46:08 +01:00
Eliot Jones
d0a3cd398f start adding support for revision 5 aes-256 encrypted documents #34 2019-06-09 13:27:03 +01:00