Eliot Jones
76f8222f74
start adding support for undocumented revision 6 encryption
...
revision 6 was added in the pdf 2.0 specification, iso 32000-2:2017. because iso are rent-seeking they charge money to view this specification, so it is effectively undocumented. this site details some of the algorithm: https://web.archive.org/web/20180311160224/esec-lab.sogeti.com/posts/2011/09/14/the-undocumented-password-validation-algorithm-of-adobe-reader-x.html the code in this change ports the pdfbox logic line by line. it doesn't implement the correct behaviour for the owner password yet.
2019-06-24 20:37:25 +01:00
Eliot Jones
cc98bf1089
remove byte order marks from unicode strings #32
2019-06-23 15:22:37 +01:00
Eliot Jones
f86c2545bd
treat encryption entries as optional for revisions 5+ #34
...
the revision 5 and 6 encryption algorithms specify additional encryption material named 'oe' and 'ue'. it turns out these entries are not always present, so they now default to null when missing. this also adds support for those values being in hex rather than normal string format.
tidies up some commenting on the xynode class, moves public methods below constructors and adds xy to the resharper list of abbreviations for the solution.
2019-06-23 13:52:12 +01:00
Eliot Jones
ff9e2ad83f
handle hex registry and ordering. decrypt hex tokens #34
...
cid fonts may contain a registry, ordering and supplement to identify the font. we were checking for string registry and ordering tokens but failing on hex tokens.
for encrypted documents we now decrypt hex data.
2019-06-23 13:27:32 +01:00
Eliot Jones
0f103554fb
handle non-standard crypt dictionary type and use hex bytes for password #34
...
using an online tool to encrypt a simple document with aes-128 seems to add the dictionary type cryptalgorithm rather than cryptfilter. i couldn't find any references to cryptalgorithm in the spec or pdfbox but it seems to work ok when treated as equivalent to cryptfilter.
there are situations where the string derived from a hex token has a different length to the underlying bytes, for example if the hex token contains the '\0' byte. in those cases the encryption algorithm needs to use the raw bytes rather than the 'stringified' bytes. this change passes raw bytes for hex tokens for both the user and owner password keys.
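a minimal python sketch of why the raw bytes matter. this is not the library's code: `hex_token_bytes` and `lossy_stringify` are hypothetical helpers, and dropping the nul byte is only an assumed example of how a string conversion could change the length.

```python
import hashlib


def hex_token_bytes(hex_text: str) -> bytes:
    """Decode the characters between '<' and '>' of a PDF hex string into raw bytes."""
    digits = "".join(hex_text.split())  # whitespace inside a hex string is ignored
    if len(digits) % 2:
        digits += "0"  # an odd final digit is treated as if followed by 0
    return bytes.fromhex(digits)


def lossy_stringify(raw: bytes) -> str:
    """Hypothetical lossy conversion: drops NUL bytes (assumption for illustration)."""
    return "".join(chr(b) for b in raw if b != 0)


raw = hex_token_bytes("41 00 42 00")
text = lossy_stringify(raw)

# The stringified form is shorter than the underlying bytes, so any hash
# computed over it differs from the hash computed over the raw bytes.
raw_digest = hashlib.md5(raw).hexdigest()
text_digest = hashlib.md5(text.encode("latin-1")).hexdigest()
print(len(raw), len(text), raw_digest == text_digest)
```

any key derivation that hashes the password material would therefore produce a different key from the stringified form, which is why the raw bytes are passed through.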
2019-06-23 13:12:47 +01:00
Eliot Jones
d259f89bd9
Merge pull request #40 from Numpsy/rw/unicode_hex_strings
...
add utf-16 parsing support to hextoken
2019-06-23 12:38:44 +01:00
Eliot Jones
41eddca0bf
handle incorrect xref offsets #34
...
previously if the cross reference did not exist at exactly the provided offset we'd immediately throw. now we assume we can read a few more tokens to find the start of the xref table or stream. this won't work where the provided offset is past the start of the table or nowhere near it, but in those cases there's not much we can do. there's some more work to do to provide a fallback xref parser which finds the xref tables and streams using a brute-force scan of the whole document.
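the idea can be sketched in python as below. this is an illustration only, not the actual implementation: `find_xref_start` and the `max_tokens` limit are hypothetical, and tokens are simply split on whitespace.

```python
import re


def find_xref_start(data: bytes, offset: int, max_tokens: int = 8):
    """Look at up to max_tokens whitespace-separated tokens from offset.

    Returns the position of an 'xref' keyword, or of the object number in an
    'N G obj' header (a possible xref stream), or None if neither is found.
    """
    token = re.compile(rb"\S+")
    seen = []  # (start position, token bytes) for each token read so far
    pos = offset
    for _ in range(max_tokens):
        match = token.search(data, pos)
        if match is None:
            return None
        seen.append((match.start(), match.group()))
        if match.group() == b"xref":
            return match.start()
        if (match.group() == b"obj" and len(seen) >= 3
                and seen[-2][1].isdigit() and seen[-3][1].isdigit()):
            return seen[-3][0]
        pos = match.end()
    return None


table = b"junk junk xref\n0 1\n0000000000 65535 f \n"
stream = b"%% 12 0 obj\n<< /Type /XRef >>"

print(find_xref_start(table, 5))    # offset slightly before the table
print(find_xref_start(stream, 0))   # offset slightly before an xref stream object
print(find_xref_start(table, 11))   # offset past the start of 'xref'
```

the last call shows the limitation described above: once the offset lands past the start of the table, reading forward cannot recover it.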
2019-06-23 12:05:21 +01:00
Eliot Jones
0c1b50fcc4
Merge pull request #36 from BobLd/master
...
Document Layout Analysis Tools
2019-06-23 11:32:50 +01:00
Richard Webb
b5b862e63f
unit tests for tokenizing UTF16 encoded hex strings.
2019-06-23 01:19:43 +01:00
Richard Webb
0432f703c4
extend HexToken to support UTF-16BE encoded hex strings
2019-06-23 01:18:48 +01:00
BobLd
00233fa5d0
Update with corrections - 2
2019-06-20 22:10:05 +01:00
Eliot Jones
7b96483664
include raw dictionary token in the document information class #38
2019-06-19 21:23:06 +01:00
Eliot Jones
b7b08fa881
add gitter badge
2019-06-19 18:50:48 +01:00
Eliot Jones
35b6c4f0eb
handle case where font metrics do not declare width or height #35
2019-06-19 18:47:50 +01:00
BobLd
080354dc54
Corrected PublicApiScannerTests
2019-06-18 21:32:14 +01:00
BobLd
f8d0883da5
Update with corrections
2019-06-18 20:48:49 +01:00
Eliot Jones
caf1a0c233
use invariant culture for parsing all numbers #37
2019-06-18 19:12:51 +01:00
BobLd
4416793f6d
Corrected PublicApiScannerTests
2019-06-16 19:19:44 +01:00
BobLd
2525cd243f
Typo correction
2019-06-16 14:03:12 +01:00
BobLd
a0c864e8af
Adding Document Layout Analysis:
...
- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm, useful for multi-column pdf documents
2019-06-16 13:57:30 +01:00
Eliot Jones
2c9a3d6e96
add test coverage for direct object finder
2019-06-14 20:57:46 +01:00
Eliot Jones
98424b32aa
special case handling for faulty offsets in xref with missing whitespace between eof and object number
2019-06-14 20:40:24 +01:00
Eliot Jones
4c716fcbd6
finish support for revision 5 encryption using aes 256 #34
2019-06-13 19:46:08 +01:00
Eliot Jones
d0a3cd398f
start adding support for revision 5 aes-256 encrypted documents #34
2019-06-09 13:27:03 +01:00
Eliot Jones
f3c8220ec4
add test coverage for invalid document from #33
2019-06-08 16:58:20 +01:00
Eliot Jones
2b486dccab
prevent infinite loops where a stream token's length entry references itself. perform brute force scans in case of a faulty xref table #33
2019-06-08 16:45:02 +01:00
Eliot Jones
21a4ba597e
add support for aes-128 decryption #34
2019-06-08 15:23:21 +01:00
Eliot Jones
a19122478d
begin adding support for in-document security handlers to support aes 128/256 encryption #34
2019-06-08 14:14:51 +01:00
Eliot Jones
39d05e6a47
support big endian and little endian utf 16 in string tokens #32
2019-06-05 18:03:20 +01:00
Eliot Jones
f375cb6f04
keep letters in word when using default word extractor
2019-05-30 20:07:52 +01:00
Eliot Jones
ef822b484d
0.0.6 - update version and sourcelink nuget version
v0.0.6
2019-05-19 13:39:06 +01:00
Eliot Jones
31d12eb731
handle extraneous def token in some dictionaries and skip returning glyph bounds if not in font
2019-05-19 13:27:38 +01:00
Eliot Jones
e9e376c52a
update readme and make page dictionary public
2019-05-19 13:14:38 +01:00
Eliot Jones
872e338ecb
skip invalid commands in type 1 command definitions
2019-05-19 12:58:49 +01:00
Eliot Jones
7e8f3623a4
handle type 1 parser already being at def token when reading till next def token
2019-05-19 12:28:26 +01:00
Eliot Jones
ffa7b3bcc7
generate synthetic encoding where not present and use direct object finder to lookup cropbox and mediabox
2019-05-18 15:20:07 +01:00
Eliot Jones
8a74d5b2f3
use missing width for type 1 fonts when not in pdf array
2019-05-18 14:43:22 +01:00
Eliot Jones
7a3b89ece1
tidy up some doc comments
2019-05-18 12:28:42 +01:00
Eliot Jones
f884674807
Merge branch 'master' of https://github.com/UglyToad/Pdf
2019-05-18 12:25:57 +01:00
Eliot Jones
f3bc3a37b9
add lzw filter support
2019-05-18 12:25:47 +01:00
Eliot Jones
86c5478ddb
Merge pull request #31 from BobLd/master
...
add textline and pdfline
2019-05-15 22:34:01 +01:00
Eliot Jones
9a8becde3e
update the readme to reflect expanded capabilities
2019-05-15 20:05:05 +01:00
Eliot Jones
69b6958c9d
only declare a cff font to be a cid font if the registry ordering supplement (ros) is provided
2019-05-15 20:00:24 +01:00
BobLd
f4ec425bf0
- Correction of the PdfLine's length formula;
...
- Moving Line to TextLine
2019-05-15 19:44:47 +01:00
BobLd
97f0f6fe75
Minor modifications and updates
2019-05-14 20:56:34 +01:00
Eliot Jones
5cf62eaa11
fix counting hintmask bytes where cntrmask is present in type 2 charstrings for cff fonts
2019-05-14 20:08:44 +01:00
BobLd
de421d65a1
Adding Line, PdfLine
2019-05-12 19:39:58 +01:00
BobLd
2011d504a7
In Content:
...
- Adding a 'Line' of text object
- Adding a 'TextDirection' property in the 'Word' object
In Geometry:
- Adding a 'PdfLine' object
- Making the 'PdfRectangle' constructor public
2019-05-12 19:34:00 +01:00
Eliot Jones
55d34e3998
use standardencoding name for seac command in type 1 charstrings
2019-05-11 15:57:19 +01:00
Eliot Jones
5b5a0b7f55
fix null reference bug and handle escaped escape characters in string tokenization
2019-05-11 15:35:56 +01:00