PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-09-19 10:47:56 +08:00

Author	SHA1	Message	Date
BobLd	bff18d81ca	Improve minimum bounding box orientation	2020-01-31 16:24:59 +00:00
BobLd	483b30f44d	Remove rounding	2020-01-31 16:24:59 +00:00
BobLd	253ae32193	Remove ordering from minimal bounding rectangle	2020-01-31 16:24:59 +00:00
BobLd	0dad611cb1	Implement minimum bounding box algorithm	2020-01-31 16:24:59 +00:00
BobLd	36c03459a7	first and last letter	2020-01-31 16:24:59 +00:00
BobLd	f221b58936	Remove useless code	2020-01-31 16:24:59 +00:00
BobLd	ea27820ca4	Improve Word bounding box TextDirection.Other case	2020-01-31 16:24:59 +00:00
BobLd	2e5fdb5867	Fix PdfRectangle's Centroid and Translate()	2020-01-31 16:24:59 +00:00
BobLd	adaccf97b3	Add files via upload	2020-01-31 16:24:59 +00:00
BobLd	380c36918b	Remove unnecessary code	2020-01-31 16:24:59 +00:00
BobLd	0cbf3434bc	Remove 'orderFunc' from 'NearestNeighbourWordExtractor' to use the order found by clustering algo	2020-01-31 16:24:59 +00:00
BobLd	3b90370f28	Using Math.Min(letter.Width, letter.GlyphRectangle.Width) for rotated 180 word bounding box	2020-01-31 16:24:59 +00:00
BobLd	c4b6bbc8e5	Using Math.Max(letter.Width, letter.GlyphRectangle.Width) for word bounding box	2020-01-31 16:24:59 +00:00
BobLd	6d8744e722	More decimals to Width and Height + handle the case where both bottom points are identical	2020-01-31 16:24:59 +00:00
BobLd	bc69376743	Increase max distance for TextDirection.Other in NearestNeighbourWordExtractor	2020-01-31 16:24:59 +00:00
BobLd	a326d7e9d9	TextDirection.Unknown -> TextDirection.Other; improve NearestNeighbourWordExtractor for TextDirection.Other	2020-01-31 16:24:59 +00:00
BobLd	9bcafdaa98	Update word bounding box computation	2020-01-31 16:24:59 +00:00
BobLd	27edf6cf77	Handle Width and Height for rotated rectangles	2020-01-31 16:24:59 +00:00
BobLd	75bd94e538	Better handling of TextDirection.Unknown word bounding box	2020-01-31 16:24:59 +00:00
BobLd	75821919a7	Fix NearestNeighbourWordExtractor for rotated text	2020-01-31 16:24:59 +00:00
Eliot Jones	8ab2838063	recover from invalid cross reference position: if we are reading a cross reference offset which contains a number we assumed it was a stream object. if it's not we now brute-force the entire file looking for an 'xref' token. this should be combined with a search for cross-reference streams and should run when we read neither the numeric token nor an 'xref' token but for now this fixes the observed issue. also adds number of images to the page api to prevent consumers needing to enumerate.	2020-01-28 18:07:05 +00:00
Eliot Jones	29061b1fd2	handle unexpected adobe type 1 format: an encoding array in an adobe type 1 font may be missing its declaration ending in 'for', if we encounter 'dup' while looking for the 'for' token we have a special case to go straight into reading the encoding. also handles a case where the page content stream contains a path-closing operator without any path being active.	2020-01-28 16:05:53 +00:00
Eliot Jones	6292fc256d	handle direct font objects in the resource dictionary: fonts can appear as dictionary objects rather than indirect references in the resource dictionary for a page. if we encounter this we parse and store the font by name for retrieval during content parsing.	2020-01-27 18:07:51 +00:00
Eliot Jones	6cf257a331	strings record encoding used to create them. in order to recreate the valid bytes for use in decryption it is necessary to know which encoding was used to read a string token. this is because utf16-be encoding has a byte-order marker which should be included in the resulting bytes.	2020-01-26 17:07:58 +00:00
Eliot Jones	693a3d5958	use offset to file header to correct cross references: if the %pdf version header comment is offset from the start of the file the cross reference offsets will also be wrong by that amount. this change updates the cross reference location logic to use the offset from the located version header.	2020-01-26 15:30:20 +00:00
Eliot Jones	a561c8954e	handle the format header being preceded by nonsense: some files seem to have the format header preceded by large amounts of junk but this appears to be valid for chrome and acrobat reader. this change ups the amount of nonsense to be read prior to the version header. also makes parsing of the version header culture invariant which may be related to #85.	2020-01-25 16:53:41 +00:00
Eliot Jones	d9492ab2f8	handle empty encrypted portion in adobe type 1 font: the encrypted portion of an adobe type 1 font can be empty in which case we default to a blank private dictionary and charstrings set.	2020-01-25 16:41:54 +00:00
Eliot Jones	736f83e227	handle null charstring names: it appears charstring definitions in adobe type 1 fonts can omit the charstring name. in this case we set the name to the string value of the charstring index.	2020-01-25 16:35:08 +00:00
Eliot Jones	ba09a13d08	more end image recovery logic: since inline image data may contain the end image "ei" token inside the data stream there's no reliable way to actually determine if we've read all the data. for this reason if we end up with an invalid state parsing operations after we've read the end image token we try to recover by reading from the previous token to the next end image token if any. we supply log information to let the consumer know this is what we're doing. it's still not bullet-proof but it should be good enough. also support negative page rotation values by adding them to a 360 degree rotation so -90 degrees clockwise is 270 degrees clockwise.	2020-01-25 15:53:08 +00:00
Eliot Jones	3ac8d7ed91	update the github pages site: updates the information on the github pages site for the new api changes. includes some more seo friendly terms to improve discoverability, more engaging images as well as comprehensive code examples to improve onboarding.	2020-01-25 14:36:07 +00:00
Eliot Jones	3243be3ec5	change rectangle drawing logic for tests: support rotated output rectangles in the visual verification tests.	2020-01-22 13:45:52 +00:00
Eliot Jones	0ed4e58556	add test cases for rectangle transforms: our bounding rectangle values still seem to be wrong for rotated letters. this change adds some test cases for common transformation matrix operations on a rectangle, scale, translate and rotate.	2020-01-22 13:28:47 +00:00
Eliot Jones	f29170fef8	use default width if present: if no widths array entry exists for the character and no font program is present for a true type simple font then use the 0 index glyph width if present in the widths array.	2020-01-14 15:18:07 +00:00
Eliot Jones	b50f476c31	update local tests: we set the file type filter to only pick up pdfs.	2020-01-14 14:59:14 +00:00
Eliot Jones	f6e12f40d8	support named tounicode cmaps rather than streams: type 0 fonts tounicode cmap may refer to a known cmap name rather than an embedded cmap stream.	2020-01-14 14:58:20 +00:00
Eliot Jones	a36f5a3af3	handle missing embedded cid font for type 0 fonts: all font file entries in the font descriptor for type 0 fonts are optional. if the font is missing we default to returning the default bounding box.	2020-01-14 14:52:51 +00:00
Eliot Jones	e8401b87cf	version 0.1.0	2020-01-13 10:46:47 +00:00
Eliot Jones	efc258b0f0	use tokenscanner when converting array to rectangle: an array of 4 items representing a rectangle may define its values as indirect references. when converting to a rectangle we pass a pdf token scanner to resolve any indirect references.	2020-01-13 10:20:08 +00:00
BobLd	47672d3f90	Make TextBlock.SetReadingOrder(int) public	2020-01-13 09:25:57 +00:00
BobLd	fd014cfaa7	Add files via upload	2020-01-12 11:15:58 +00:00
BobLd	e8216b29c5	Add reading order in PageXml export	2020-01-12 11:15:58 +00:00
BobLd	e7417be75a	ReadingOrderDetector and tidying DLA project	2020-01-11 11:18:11 +00:00
Eliot Jones	b4d917dcdc	merge pull request #122 from uglytoad/marked-content: marked content	2020-01-10 17:07:21 +00:00
Eliot Jones	41cc7abd1b	prevent negative point size for fonts	2020-01-10 14:40:28 +00:00
Eliot Jones	17b7cf2f61	load images eagerly for marked content: when a marked content region contains an image we load it eagerly since we won't have access to the necessary classes at evaluation time. we also default image colorspace to the active graphics state colorspace if the dictionary doesn't contain a valid entry.	2020-01-10 13:52:21 +00:00
Eliot Jones	2a579afd4d	add missing doc comments for operation context marked content	2020-01-09 15:35:55 +00:00
Eliot Jones	d011f37316	merge master	2020-01-09 15:32:10 +00:00
Eliot Jones	43574097f1	rename marked content elements and use factory: since the properties in marked content may be indirect references or belong to the page resources array, the value should be calculated during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.	2020-01-09 15:30:16 +00:00
BobLd	097692f1cb	Move ArtifactType inside PdfArtifactMarkedContent	2020-01-09 11:24:32 +00:00
Eliot Jones	6c1e3c76a8	version 0.1.0-beta002	2020-01-08 14:26:45 +00:00
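
Two of the fixes listed above (ba09a13d08 and 693a3d5958) describe small pieces of arithmetic that are easy to illustrate. The following is only a rough Python sketch of the ideas as the commit messages state them, not PdfPig's actual C# code. The function names are hypothetical, and the use of a modulo (which also covers values below -360) is an assumption that goes beyond the "add to 360" wording in the message.

```python
def normalize_rotation(degrees: int) -> int:
    """Map a page rotation onto [0, 360); e.g. -90 becomes 270.

    Assumption: modulo arithmetic is used here, which matches
    "add the negative value to 360" for values in (-360, 0).
    """
    return degrees % 360


def header_offset(data: bytes) -> int:
    """Return the byte position of the '%PDF-' version header.

    A non-zero result means junk precedes the header.
    """
    index = data.find(b"%PDF-")
    if index < 0:
        raise ValueError("no %PDF- version header found")
    return index


def corrected_xref_offset(declared_offset: int, data: bytes) -> int:
    """Shift a declared cross-reference offset by the header offset.

    If the header starts N bytes into the file, offsets recorded
    relative to the header are off by N relative to the file start.
    """
    return declared_offset + header_offset(data)
```

For example, a file with four junk bytes before the header would have a declared cross-reference offset of 100 corrected to 104 under these assumptions.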

697 Commits