PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-09-19 02:37:56 +08:00

Author	SHA1	Message	Date
Eliot Jones	f29170fef8	use default width if present if no widths array entry exists for the character and no font program is present for a true type simple font then use the 0 index glyph width if present in the widths array.	2020-01-14 15:18:07 +00:00
Eliot Jones	b50f476c31	update local tests we set the file type filter to only pick up pdfs.	2020-01-14 14:59:14 +00:00
Eliot Jones	f6e12f40d8	support named tounicode cmaps rather than streams type 0 fonts tounicode cmap may refer to a known cmap name rather than an embedded cmap stream.	2020-01-14 14:58:20 +00:00
Eliot Jones	a36f5a3af3	handle missing embedded cid font for type 0 fonts all font file entries in the font descriptor for type 0 fonts are optional. if the font is missing we default to returning the default bounding box.	2020-01-14 14:52:51 +00:00
Eliot Jones	e8401b87cf	version 0.1.0 0.1.0	2020-01-13 10:46:47 +00:00
Eliot Jones	efc258b0f0	use tokenscanner when converting array to rectangle an array of 4 items representing a rectangle may define its values as indirect references. when converting to a rectangle we pass a pdf token scanner to resolve any indirect references.	2020-01-13 10:20:08 +00:00
BobLd	47672d3f90	Make TextBlock.SetReadingOrder(int) public	2020-01-13 09:25:57 +00:00
BobLd	fd014cfaa7	Add files via upload	2020-01-12 11:15:58 +00:00
BobLd	e8216b29c5	Add reading order in PageXml export	2020-01-12 11:15:58 +00:00
BobLd	e7417be75a	ReadingOrderDetector and tidying DLA project	2020-01-11 11:18:11 +00:00
Eliot Jones	b4d917dcdc	merge pull request #122 from uglytoad/marked-content marked content	2020-01-10 17:07:21 +00:00
Eliot Jones	41cc7abd1b	prevent negative point size for fonts	2020-01-10 14:40:28 +00:00
Eliot Jones	17b7cf2f61	load images eagerly for marked content when a marked content region contains an image we load it eagerly since we won't have access to the necessary classes at evaluation time. we also default image colorspace to the active graphics state colorspace if the dictionary doesn't contain a valid entry.	2020-01-10 13:52:21 +00:00
Eliot Jones	2a579afd4d	add missing doc comments for operation context marked content	2020-01-09 15:35:55 +00:00
Eliot Jones	d011f37316	merge master	2020-01-09 15:32:10 +00:00
Eliot Jones	43574097f1	rename marked content elements and use factory since the properties in marked content may be indirect references or belong to the page resources array, the value should be calculated during content processing. this change tidies up the marked content classes so they do not expose mutable data and uses the pdf token scanner overloads to load dictionary data.	2020-01-09 15:30:16 +00:00
BobLd	097692f1cb	Move ArtifactType inside PdfArtifactMarkedContent	2020-01-09 11:24:32 +00:00
Eliot Jones	6c1e3c76a8	version 0.1.0-beta002 0.1.0-beta002	2020-01-08 14:26:45 +00:00
Eliot Jones	005cbe5754	update readme to show builder output removes outdated information from the readme and add an image to show pdfdocumentbuilder output	2020-01-08 14:13:04 +00:00
Eliot Jones	66fc244083	included embedded files api in readme adds section describing use of new embedded files api. resizes documentation image to be smaller.	2020-01-08 14:04:07 +00:00
Eliot Jones	16c3322cce	add example image to readme images make the readme more engaging and gives users an idea of what the output looks like for word and text extraction.	2020-01-08 13:56:48 +00:00
Eliot Jones	a496daf0ce	ignore hflex when calculating hint bytes hflex and hflex1 should not count towards the hint byte count for a hintmask operator in type 2 charstrings.	2020-01-08 13:27:33 +00:00
Eliot Jones	4976fa1027	handle incorrect end image detected since an inline image's data stream may contain the characters 'ei' as a result of compression it's possible to read an end image operator mid-data, this results in the next operator also being end image and the content stream being in an invalid state. to recover from this when we detect this situation we remove the previous operator, read to the current operator and replace the operator and data bytes in the list of operations.	2020-01-08 12:17:30 +00:00
Eliot Jones	a083214da2	handle missing mediabox irrespective of parsing type since pdfbox defaults to us letter if the mediabox is missing rather than throwing we remove the behaviour where uselenientparsing is false which used to throw, now we log an error. throwing didn't provide any benefit to consumers.	2020-01-08 11:34:35 +00:00
BobLd	7be36fdc58	Update PublicApiScannerTests 2	2020-01-08 11:07:27 +00:00
BobLd	4b929482cc	Update PublicApiScannerTests	2020-01-08 10:46:49 +00:00
BobLd	49d836c5cb	Add description to GetMarkedContents()	2020-01-08 10:36:58 +00:00
BobLd	84bab1b627	Add basic marked content extraction capabilities	2020-01-08 10:34:01 +00:00
Eliot Jones	63b118b141	handle type1 fonts disguised as truetype if the font descriptor uses the fromsubtype flag the actual type of the font can differ from that specified in the font dictionary. in this case a truetype font actually contains a type1c, compact font format, font. in this case we fall back to using the type1 parser. also handles a closesubpath command appearing without any path construction operators.	2020-01-07 16:49:21 +00:00
Eliot Jones	d267d7501a	use encoding specified in base font if present if the font uses a named encoding which is not recognised, use the corresponding encoding based on the base font name, or fall back to windows ansi encoding.	2020-01-07 16:01:45 +00:00
Eliot Jones	e588b2bc50	support documents without endobj for stream some documents declare stream objects without an endobj marker at the end of the stream. if a new obj token is encountered after reading a stream we reset the scanner to the object number token and return the stream.	2020-01-07 15:27:01 +00:00
Eliot Jones	10dc5a8eed	don't cache invalid offsets unless brute forced don't cache objects parsed if their offset doesn't match the cross-reference offset, unless the object was parsed by a brute-force search operation. this is because 1 object may lie in 2 streams, 1 valid and 1 invalid. If the invalid stream is parsed first for another object then the valid stream will never be read.	2020-01-07 14:54:12 +00:00
Eliot Jones	903d71a93d	skip cross references outside file if the previous cross-reference location points to an offset outside the file size we skip it. also makes cid font factory more resilient by skipping missing descriptors.	2020-01-07 12:37:41 +00:00
Eliot Jones	5114b2da2c	avoid overwriting cache for valid objects some objects may be defined in more than one stream. parsing both streams would overwrite the object in the cache. to prevent this we avoid overwriting the existing object in the cache if it has the expected offset from the cross reference table.	2020-01-07 11:48:09 +00:00
Eliot Jones	0b048fde57	handle eof further back in file an %%eof for a pdf file may appear further back than the last 1024 bytes. this change doubles the search range. it also handles an empty differences array being defined for a font encoding. we also remove the old approach to dependency injection from the code since we are now favouring static classes where possible.	2020-01-07 11:48:09 +00:00
Eliot Jones	3c19b988e2	merge pull request #120 from vadik299/master Fix for rectangle width/height incorrectly parsed	2020-01-07 08:44:47 +00:00
vadik299	f00eb5efa2	Update AppendRectangle.cs (fix) Rectangle width and height should be also transformed by CurrentTransformationMatrix	2020-01-07 00:23:10 -05:00
vadik299	6ca2190f67	Merge pull request #2 from UglyToad/master update	2020-01-07 00:20:12 -05:00
Eliot Jones	fc9c1b6ff5	add method to retrieve single glyph bounds from truetype this improves performance since we only need to load a single rectangle rather than the entire glyphs array including all points.	2020-01-06 14:43:51 +00:00
Eliot Jones	09c72a2fb2	handle 0 length glyph in true type font	2020-01-06 14:12:46 +00:00
Eliot Jones	80845863a8	version 0.1.0-beta001 0.1.0-beta001	2020-01-06 12:31:18 +00:00
Eliot Jones	e2c3b6dc8b	update package icon #96 and readme updates nuget package definition to use new format of package icon as required by #96. add readme information for hyperlinks and truetype fonts #8.	2020-01-06 12:28:54 +00:00
Eliot Jones	0183c0af5f	add project for nuget package #119 in order to include all projects from the solution we create a new solution with an entry-point assembly which references all projects. calling dotnet pack on this single project then packages all assemblies into the produced nuget package. also remove old glyph list references from the main project since they have moved to the fonts project.	2020-01-06 11:31:41 +00:00
Eliot Jones	00bd285262	add support for quadpoints to annotations highlight, link, strikeout, squiggly and underline annotation types may define a set of quadrilaterals using the quadpoints entry. this defines the regions to show/activate the annotation. the order of points in the quadpoints array does not match the specification so we provide a convenience class to access the point data rather than interpreting it as a rectangle: https://stackoverflow.com/questions/9855814/pdf-spec-vs-acrobat-creation-quadpoints.	2020-01-05 16:23:07 +00:00
Eliot Jones	e064d39671	remove unused project references from document layout analysis	2020-01-05 15:44:02 +00:00
Eliot Jones	02f9166c00	use lazy loading for glyph data glyph data in TrueType fonts can be very large and slow to parse. to avoid this we store the raw table data at parsing time and enable lazy loading of glyph descriptions.	2020-01-05 15:42:23 +00:00
Eliot Jones	1948b4ad9f	merge pull request #117 from uglytoad/refactor-font-project refactor font project	2020-01-05 14:31:22 +00:00
Eliot Jones	e0a45e3774	include dependencies as dlls in the published nuget by default nuget pack does not include project dependencies. this is suboptimal since it would require managing at least 5 nuget packages. this uses a workaround detailed here https://github.com/nuget/home/issues/3891 to copy the dependent dlls to the generated nuget package. this doesn't resolve the issue of how we publish the documentlayoutanalysis project, since it is the top of the dependency tree and we publish its parent, rather than it.	2020-01-05 13:56:14 +00:00
Eliot Jones	e1b39983d0	handle missing encodings in cff fonts	2020-01-05 13:16:31 +00:00
Eliot Jones	b29354e3e6	move compact font format fonts to fonts project	2020-01-05 12:08:01 +00:00

665 Commits
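
Commit 0b048fde57 above notes that the %%EOF marker may sit further back than the last 1024 bytes of a file, and that the search range was doubled. The idea can be sketched as follows. This is a minimal illustration in Python, not PdfPig's actual C# code; the function name `find_eof_marker` and its `search_range` parameter are hypothetical.

```python
def find_eof_marker(data: bytes, search_range: int = 2048) -> int:
    """Return the offset of the last %%EOF marker found within the final
    `search_range` bytes of `data`, or -1 if none is found.

    Hypothetical helper for illustration only; 2048 reflects the doubled
    range (2 * 1024) described in the commit message.
    """
    # Only look at the tail of the file; clamp so short files do not
    # produce a negative start offset.
    start = max(0, len(data) - search_range)
    # rfind returns the highest index of the marker at or after `start`.
    return data.rfind(b"%%EOF", start)


if __name__ == "__main__":
    # A toy "file" whose %%EOF is followed by 1500 bytes of trailing padding.
    sample = b"%PDF-1.4\n" + b"x" * 100 + b"%%EOF" + b"\n" * 1500
    print(find_eof_marker(sample, search_range=1024))
    print(find_eof_marker(sample, search_range=2048))
```

With the toy input, the 1024-byte window misses the marker and the 2048-byte window finds it, which is the situation the commit describes.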