PdfPig

lsm/PdfPig

mirror of https://github.com/UglyToad/PdfPig.git synced 2025-10-14 10:55:04 +08:00

Author	SHA1	Message	Date
Eliot Jones	f319e7f4b5	adds per character byte mapping to truetype #98 this starts to add logic for per-character mapping of unicode characters to byte values for truetype fonts in the pdf document builder. in order to support unicode characters outside the 0-255 range when creating new pdf documents without using composite fonts, we need to map values outside this range into it. to do this we start at 1 and map each character we encounter to the next code, up to a maximum of 255. we provide a custom tounicode cmap in the font dictionary which maps these byte values, 0-255, back to unicode code points (short). we also provide a custom firstchar, lastchar and widths array for the font mapping just the values we use. since fonts no longer contain just the latin character set the font descriptor enum is set to have the symbolic flag set. this means values will be looked up in either the mac-roman (1, 0) or windows-symbol (3, 0) cmap tables (these cmap tables are distinct from cmap tables in the pdf file) inside the actual truetype font bytes. this means the currently generated font file is invalid, because while the widths array and tounicode cmap return the correct values the actual font itself returns whatever values were in those positions before the remapping occurred. in order to fix this we will need to override the windows-symbol cmap contained in the underlying truetype font to match our mapping. this will be a lot of work and involve significant rewriting of the font file itself, in order to preserve checksum integrity.	2020-01-04 10:27:07 +00:00
BobLd	f67cce31b5	Adding a 'minimumEditDistanceNormalised' parameter to allow for other edit distance implementations.	2020-01-03 12:31:23 +00:00
BobLd	e46df38f4d	Make numbersPattern private	2020-01-03 12:31:23 +00:00
BobLd	39f275aaeb	Improve numbers pattern matching and include roman numerals	2020-01-03 12:31:23 +00:00
BobLd	07f51712c6	Update PublicApiScannerTests	2020-01-03 12:31:23 +00:00
BobLd	a233cc627c	Add headers/footers (decoration) classifier for TextBlock	2020-01-03 12:31:23 +00:00
BobLd	d246bf5c74	- remove unnecessary casts - make PageXmlTextExporter.Deserialize() public	2019-12-31 10:43:07 +00:00
BobLd	c421f3f85e	Correct NearestNeighbourWordExtractor file name (remove whitespace)	2019-12-29 15:25:19 +00:00
BobLd	3a060d9769	Update PublicApiScannerTests	2019-12-28 14:43:09 +00:00
BobLd	c810ca74a6	Format and tidy up PAGE xml export autogenerated code.	2019-12-28 14:43:09 +00:00
BobLd	ba3856b68f	Add files via upload	2019-12-28 14:43:09 +00:00
BobLd	968bf10453	Add files via upload	2019-12-28 14:43:09 +00:00
BobLd	1c078b0f99	Add files via upload	2019-12-28 14:43:09 +00:00
BobLd	b922cf70db	Add files via upload	2019-12-28 14:43:09 +00:00
BobLd	9a38229df8	Delete docstrum bounding boxes example.png	2019-12-28 14:43:09 +00:00
BobLd	f159b95604	Delete recursive xy cut example.png	2019-12-28 14:43:09 +00:00
BobLd	12a64bb141	Add files via upload	2019-12-28 14:43:09 +00:00
BobLd	db55d11c88	Delete docstrum bounding boxes example.png	2019-12-28 14:43:09 +00:00
BobLd	9fc5241cca	Delete recursive xy cut example.png	2019-12-28 14:43:09 +00:00
BobLd	0dadd25f9f	Add files via upload - recursive xy cut example - docstrum bounding boxes example	2019-12-28 14:43:09 +00:00
BobLd	d3dfdaf6fc	Delete empty	2019-12-28 14:43:09 +00:00
BobLd	9684a93095	Adding whitespace coverage example	2019-12-28 14:43:09 +00:00
BobLd	319ff1a44d	Create empty	2019-12-28 14:43:09 +00:00
Eliot Jones	87528199c6	use byte values when showing text for document builder #98 when writing text content the current show text operator was just writing the unicode string value and hoping it produced the correct value in the resulting document despite the values being consumed in a different encoding. this change adds a method to retrieve the corresponding byte value for a unicode character and uses that to write a hex show text operator to the page content. this is only implemented for standard14 fonts in this change. for standard14 fonts we look up the corresponding name for the unicode value from the adobe glyph list. once we find the corresponding glyph name we look up the code value in the encoding we have chosen when writing standard14 fonts (macromanencoding). this value is then the byte value written to the show text operator. if the value does not appear in any of the lookups we throw a not supported exception. this also adds a test case which will still fail for czech characters in a truetype font, the issue reported in #98.	2019-12-28 14:42:27 +00:00
BobLd	5e3f5651b8	Update NearestNeighbourWordExtractor .cs Removing the font name check (`string.Equals(l1.FontName, l2.FontName, StringComparison.OrdinalIgnoreCase)`) because some special characters or ligatures may belong to different subsets.	2019-12-27 13:08:44 +00:00
BobLd	3b79ebc5d5	Update PageXmlTextExporter.cs Set the `PageXmlTextRegion` type to default `PageXmlTextSimpleType.Paragraph` to avoid a crash in LayoutEvalGUI 1.9	2019-12-27 13:08:04 +00:00
BobLd	b5bab67889	Update MathExtensions.cs Handling null or empty double array.	2019-12-27 10:51:32 +00:00
Eliot Jones	a4805ce97d	improve performance of the truetype name table parsing	2019-12-25 10:52:00 +00:00
Eliot Jones	815705494a	cache last loaded font from the resource store during content parsing	2019-12-24 23:24:52 +00:00
Eliot Jones	ec060ae81b	add hardcoded switch branches for more content operations also adds a gitignore entry for the 'benchmark' subfolder in tools where custom benchmarking applications can be built and run without being added to source control.	2019-12-24 23:12:04 +00:00
Eliot Jones	ce38238f2c	ignore invalid minfeature values for type 1 fonts	2019-12-24 16:56:46 +00:00
Eliot Jones	23c7e44fc8	handle stream length being an object stream value	2019-12-24 15:22:47 +00:00
Eliot Jones	9c9a08c6a7	make numeric tokenizer threadsafe by removing cache	2019-12-24 12:24:40 +00:00
Eliot Jones	3bef786d5c	use performant hasflag method for truetype simple glyphs	2019-12-24 12:22:17 +00:00
Eliot Jones	649abdf966	use named constants for relevant type2 charstring command bytes	2019-12-24 12:22:17 +00:00
Eliot Jones	526af82e1a	fix naming and tostring for type2 charstring sequence	2019-12-24 12:22:17 +00:00
Eliot Jones	be00c3b1b7	remove union types from charstring parser to prevent allocations	2019-12-24 12:22:17 +00:00
Eliot Jones	4f9eb1a25a	use short to save space when storing the set of glyph points	2019-12-24 12:22:17 +00:00
Eliot Jones	ba9fe40bc1	cache some more common values and improve performance of tokenizers	2019-12-24 12:22:17 +00:00
Eliot Jones	e048bb8c2c	performance tuning for numeric tokens and parsing	2019-12-24 12:22:17 +00:00
Eliot Jones	1e29c298cf	use correct numeric types when parsing truetype fonts	2019-12-24 12:22:17 +00:00
Eliot Jones	935d182888	use doubles where calculations are being run	2019-12-24 12:22:17 +00:00
Eliot Jones	e984180b3d	add method to retrieve any embedded files	2019-12-21 16:16:36 +00:00
Eliot Jones	4d697e3669	allow the user to supply multiple passwords for decryption previously the only way to test if a password was correct was to supply a single password and throw if the value was incorrect. this was slow. now parsing options supports a list of passwords as well as a single password option (which is equivalent to a list with a single item). these passwords are all tested at the same time and an exception is only thrown once all passwords are tested.	2019-12-20 15:11:05 +00:00
Eliot Jones	5e68720495	add support for type1c cid fonts	2019-12-20 14:46:25 +00:00
Eliot Jones	f401ab3ba0	handle case insensitive truetype table tags and missing tables for postscript fonts	2019-12-20 14:40:25 +00:00
Eliot Jones	3084a9aab6	support streams containing only carriage returns. handle comments in arrays and dictionaries * while the pdf specification says stream data should follow a newline following a stream operator some files have only a carriage return following the stream operator. * since comment tokens may appear inside an array or dictionary we ignore them if they occur here since they will break interpretation of the dictionary or array contents.	2019-12-20 14:04:58 +00:00
Eliot Jones	3e6fa4b694	correctly map character code to glyph id when retrieving bounding boxes for truetype fonts previously we just treated character codes as glyph ids when getting the bounding box from the truetype font program itself. this change uses the code for character code to glyph id mapping from pdfbox, with some changes, to retrieve the correct bounding box where possible. since this relies in some places on using the unicode value or name, rather than character code, we add a cache to the individual truetype fonts to store the character code to unicode mapping which should have the benefit of improving performance.	2019-12-20 12:48:00 +00:00
Eliot Jones	7296c3c125	merge pull request #105 from BobLd/master whitespace covering algorithm and #104	2019-12-20 11:57:31 +00:00
Eliot Jones	e37e4c37b3	require end image token to be followed by at least 1 whitespace	2019-12-19 17:34:40 +00:00
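
The per-character byte mapping described in commit f319e7f4b5 (start at 1, give each newly seen character the next code, stop at 255, and keep a reverse table for the ToUnicode CMap) can be sketched roughly as follows. This is an illustrative Python sketch, not PdfPig's C# implementation: the class name `ByteCodeMapper` and its methods are hypothetical, and it assumes one code per Python character (code point), whereas the commit speaks of 16-bit values.

```python
class ByteCodeMapper:
    """Hypothetical helper: assigns single-byte codes 1..255 in first-seen order."""

    MAX_CODE = 255  # upper bound stated in the commit message

    def __init__(self):
        # Maps each character to the byte code it was assigned.
        self._code_by_char = {}

    def code_for(self, ch):
        """Return the byte code for ch, assigning the next free code if it is new."""
        code = self._code_by_char.get(ch)
        if code is None:
            code = len(self._code_by_char) + 1  # codes start at 1, not 0
            if code > self.MAX_CODE:
                raise ValueError("more than 255 distinct characters in one font")
            self._code_by_char[ch] = code
        return code

    def encode(self, text):
        """Bytes that a show-text operator would carry for this text."""
        return bytes(self.code_for(ch) for ch in text)

    def to_unicode(self):
        """Reverse table (byte code -> Unicode code point), as a ToUnicode CMap would need."""
        return {code: ord(ch) for ch, code in self._code_by_char.items()}


mapper = ByteCodeMapper()
encoded = mapper.encode("ěšě")   # the repeated character reuses its code
reverse = mapper.to_unicode()
```

Here `encoded` holds the bytes 1, 2, 1, and `reverse` maps those codes back to the code points of "ě" and "š". As the commit message notes, such a table only fixes the widths and ToUnicode side; the cmap inside the embedded TrueType font would still need to be rewritten to match.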


643 Commits