PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-09-23 04:36:44 +08:00

Author	SHA1	Message	Date
BobLd	c14d77e414	PublicApiScannerTests updated	2019-08-10 16:36:50 +01:00
BobLd	eb9a9fd00e	Document Layout Analysis - IPageSegmenter, Docstrum - Create a TextBlock class - Creates IPageSegmenter - Add other useful distances: angle, etc. - Update RecursiveXYCut - With IPageSegmenter and TextBlock - Make XYNode and XYLeaf internal - Optimise (faster) NearestNeighbourWordExtractor and isolate the clustering algorithms for use outside of this class - Implement a Docstrum inspired page segmentation algorithm	2019-08-10 16:01:27 +01:00
Eliot Jones	2d6e49426a	Merge pull request #54 from BobLd/master Improving PdfPoint	2019-08-10 14:37:16 +01:00
BobLd	9b24223190	Removing ToDouble()	2019-08-10 13:52:01 +01:00
BobLd	bd58879e32	Update from comments	2019-08-10 13:05:25 +01:00
BobLd	474ce9a442	Improving PdfPoint	2019-08-09 19:58:48 +01:00
Eliot Jones	f243117cfa	Merge pull request #53 from BobLd/master Adding Centroid to PdfRectangle.	2019-08-09 18:59:29 +01:00
BobLd	5399456919	Making the RecursiveXYCut class static.	2019-08-09 18:50:20 +01:00
BobLd	ac065e988a	Adding Centroid to PdfRectangle.	2019-08-09 17:22:16 +01:00
Eliot Jones	d6757e69cb	Merge pull request #52 from Numpsy/rw/streaminputbytes Change StreamInputBytes.Seek to reset isAtEnd to false	2019-08-09 09:40:19 +01:00
Richard Webb	54cd0ae516	Extend the ArrayAndStreamBehaveTheSame test to test seeking back to the start	2019-08-08 23:14:59 +01:00
Richard Webb	f70b7c69a0	Change StreamInputBytes.Seek to reset isAtEnd to false	2019-08-08 23:14:16 +01:00
Eliot Jones	31e15ea097	remove unused docstrum class	2019-08-08 21:21:27 +01:00
Eliot Jones	fe270aa9bd	merge pull request #50 from BobLd/master document layout analysis - text edges extractor	2019-08-08 21:16:42 +01:00
BobLd	801ea3ba7f	Modified PublicApiScannerTests	2019-08-07 14:22:39 +01:00
BobLd	7de6de3780	Updating with comments	2019-08-07 13:50:07 +01:00
BobLd	e19b03035e	Updating with comments	2019-08-07 13:49:05 +01:00
BobLd	85d5bb7c7e	Adding enum EdgeType	2019-08-07 13:45:57 +01:00
Eliot Jones	709294975b	Merge pull request #51 from BenyErnest/patch-1 Update README.md	2019-08-06 19:48:48 +01:00
Benito E. Gómez	d6c4d62dac	Update README.md I think you need to pass the byte array to the File.WriteAllBytes method.	2019-08-06 11:34:54 -04:00
BobLd	9694b1f8e8	Update TextEdgesExtractor.cs	2019-08-06 15:27:16 +01:00
BobLd	83889cfb52	Document Layout Analysis - Text edges extractor Text edges are where words have either their BoundingBox's left, right or mid coordinate aligned on the same vertical line. Useful to detect tables, justified text, lists, etc.	2019-08-06 15:24:16 +01:00
Eliot Jones	f07ab7d2c3	version 0.0.7 v0.0.7	2019-08-03 16:14:58 +01:00
Eliot Jones	364bd25fa8	#48 add handling of inline image data to pdf content parsing an inline image in a pdf content stream starts with the bi tag, then id declares the start of image data and ei the end. attempting to parse the bytes after the id tag as usual resulted in errors. this change adds special case handling for inline images.	2019-08-03 15:42:19 +01:00
Eliot Jones	5ee9c49f8a	merge pull request #49 from BobLd/master check if 'fontProgram' is null in Type2CidFont.GetWidthFromFont()	2019-07-30 19:16:09 +01:00
BobLd	2b43867e19	check if 'fontProgram' is null in Type2CidFont.GetWidthFromFont()	2019-07-30 14:53:29 +01:00
Eliot Jones	413ebe35f9	merge pull request #46 from vadik299/master Adding Paths collection	2019-07-29 18:07:12 +01:00
vadimy	7d3a0929b6	Refactoring and fixing according to Eliot's comments	2019-07-24 00:00:00 -04:00
vadik299	7c50733cbc	Update src/UglyToad.PdfPig/Content/PageContent.cs Co-Authored-By: Eliot Jones <elioty@hotmail.co.uk>	2019-07-23 21:05:00 -04:00
vadik299	f2a64d9362	Update src/UglyToad.PdfPig/Content/Page.cs Co-Authored-By: Eliot Jones <elioty@hotmail.co.uk>	2019-07-23 21:04:28 -04:00
vadimy	6ded47da92	Updated tests	2019-07-16 00:36:57 -04:00
vadimy	b9d0cca2a6	Added "Paths" collection to Page object. Added matrix transformation to path operators.	2019-07-16 00:35:29 -04:00
Eliot Jones	453faf50af	start adding colorspace path operations to the operation context	2019-07-10 21:31:23 +01:00
Eliot Jones	283e1d38fa	use azure pipelines instead of appveyor for builds * trial azure pipelines [skip ci] * use vs2017 * build pr commits * include codecov and update test nuget * add codecov call * add publish test results step * include coverlet package for test coverage and allow coverlet dynamic public types * add azure pipelines badge and remove appveyor badge * add nuget pack step * use build configuration variable for nuget pack and move after build * fix path to package to pack * change nuget to dotnet pack * remove old codecov related tools	2019-07-09 21:21:11 +01:00
Eliot Jones	3c49371c68	test hex to string implementation and remove unused method	2019-07-07 17:30:54 +01:00
Eliot Jones	50bf1784bd	merge pull request #43 from Numpsy/test_document_information Add a unit test for reading document information	2019-07-07 17:01:42 +01:00
Eliot Jones	869bb1828b	add contributing guide with set-up help	2019-07-07 14:41:48 +01:00
Eliot Jones	557d8bc948	map missing character codes directly #44 previously if no matching unicode was found for a character code we would return a null letter. instead we now map from the character code directly to a character. this seems to work for most documents, except where there are ligatures, e.g. fi or ff, but is still better than not returning anything.	2019-07-07 13:53:25 +01:00
Eliot Jones	198cca1336	change point size calculation to use rotation #41 point size was previously only calculated based on the transformation matrix but now uses the transformation matrix, the rotation matrix and the font matrix values. the calculated value still seems unlikely to be correct so it is exposed using the page's experimental access for now, rather than as a public getter.	2019-07-07 12:12:09 +01:00
Richard Webb	10dcbd0cc4	Add a unit test for reading document information	2019-07-06 22:18:18 +01:00
Eliot Jones	0dfe742770	continue searching for xref tokens even if an %%eof is encountered #38	2019-07-06 14:26:38 +01:00
Eliot Jones	c495065178	support gs operator, fix systemfonts, apply rotation to glyphs - begin adding support for extended graphics state (the 'gs' operator) including setting the font #39. - apply page level rotation to the glyph bounding box and width to get correct glyph sizes #41. - wrap page rotation in a value type to ensure the value is restricted to right angle rotations and provide convenience members #42. - fix bug where system font finder never worked for truetype fonts because it began reading the file from the wrong offset.	2019-07-06 14:03:23 +01:00
Eliot Jones	88e02cabab	include rotation in page object #42 we need to apply rotation to the crop and media box and therefore find the correct width and height. but for now correctly deriving the rotation from the page tree should help consumers.	2019-07-05 19:18:14 +01:00
Eliot Jones	52d2a90dfc	finish revision 5 and 6 owner password handling #34 moves owner password check first. correctly calculates encryption key for owner password for revision 5 and 6.	2019-06-25 19:44:26 +01:00
Eliot Jones	76f8222f74	start adding support for undocumented revision 6 encryption revision 6 was added in the pdf 2.0 specification which is document iso 32000-2:2017. because iso are rent-seeking they charge money to view this specification so it is effectively undocumented. this site details some of the algorithm https://web.archive.org/web/20180311160224/esec-lab.sogeti.com/posts/2011/09/14/the-undocumented-password-validation-algorithm-of-adobe-reader-x.html. the code in this change ports the pdfbox logic line by line. it doesn't implement the correct behaviour for owner password yet.	2019-06-24 20:37:25 +01:00
Eliot Jones	cc98bf1089	remove byte order marks from unicode strings #32	2019-06-23 15:22:37 +01:00
Eliot Jones	f86c2545bd	treat encryption entries as optional for revisions 5+ #34 the revision 5 and 6 encryption algorithms specify the presence of additional encryption material named 'oe' and 'ue'. it turns out this is not always required so will now default to null if not present. this also adds support for those values being in hex rather than normal string format. tidies up some commenting on the xynode class, moves public methods below constructors and adds xy to the resharper list of abbreviations for the solution.	2019-06-23 13:52:12 +01:00
Eliot Jones	ff9e2ad83f	handle hex registry and ordering. decrypt hex tokens #34 cid fonts may contain a registry, ordering and supplement to identify the font. we were checking for string registry and ordering tokens but failing on hex tokens. for encrypted documents we now decrypt hex data.	2019-06-23 13:27:32 +01:00
Eliot Jones	0f103554fb	handle non-standard crypt dictionary type and use hex bytes for password #34 using an online tool to encrypt a simple document with aes-128 seems to add the dictionary type cryptalgorithm rather than cryptfilter. i couldn't find any references to cryptalgorithm in the spec or pdfbox but it seems to work ok when treated as equivalent to cryptfilter. there are situations where the string derived from a hex token has a different length to the underlying bytes, for example if the hex token contains the '\0' byte, the encryption algorithm needs to use the raw bytes rather than the 'stringified' bytes. this change passes raw bytes for hex tokens for both the user and owner password keys.	2019-06-23 13:12:47 +01:00
Eliot Jones	d259f89bd9	Merge pull request #40 from Numpsy/rw/unicode_hex_strings add utf-16 parsing support to hextoken	2019-06-23 12:38:44 +01:00
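
Commit 83889cfb52 describes text edges as groups of words whose bounding box left, right or mid coordinate lies on the same vertical line. The idea can be sketched in a few lines of Python. This is an illustration only, not PdfPig's TextEdgesExtractor: the Word type, the find_text_edges function, the tolerance and min_words parameters, and the rounding-based bucketing are all assumptions made for the example. It also ignores vertical position, which a real extractor would likely need.

```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word with the horizontal extent of its bounding box."""
    text: str
    left: float
    right: float


def find_text_edges(words, tolerance=1.0, min_words=3):
    """Group words whose left, mid or right x-coordinate falls in the same bucket.

    Assumption: coordinates are bucketed by rounding to a multiple of
    `tolerance`, so two close values on either side of a bucket boundary
    are not grouped together. A real implementation might cluster instead.
    """
    coordinate_of = {
        "left": lambda w: w.left,
        "mid": lambda w: (w.left + w.right) / 2,
        "right": lambda w: w.right,
    }
    edges = {}
    for edge_type, coordinate in coordinate_of.items():
        buckets = defaultdict(list)
        for word in words:
            buckets[round(coordinate(word) / tolerance)].append(word.text)
        # Keep only the buckets with enough aligned words to count as an edge.
        edges[edge_type] = [b for b in buckets.values() if len(b) >= min_words]
    return edges


words = [
    Word("Name", 50.0, 80.0),
    Word("Alice", 50.2, 85.0),
    Word("Bob", 49.9, 70.0),
    Word("Age", 120.0, 140.0),
]
edges = find_text_edges(words)
print(edges["left"])  # the three words that share a left edge near x = 50
```

In this sample only the left coordinates line up, so the "mid" and "right" lists come back empty. A left-aligned column of this kind is the sort of signal the commit message suggests for detecting tables and lists.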

417 Commits