this file contains corrupt content following an inline image, but other parsers
just treat this content as part of the image and parse the rest of the file
successfully
* remove alpha postfix, releases will increment version
* update the master build job to draft a release
* add publish action to publish full release
* enable setting assembly and file version
* bump assembly and file version for package project
---------
Co-authored-by: BobLd <38405645+BobLd@users.noreply.github.com>
* Refactor letter handling by orientation for efficiency
Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
* Update target frameworks to include net9.0
Expanded compatibility in `UglyToad.PdfPig.csproj` by adding
`net9.0` to the list of target frameworks, alongside existing
versions.
* Add .NET 9.0 support and refactor key components
Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.
Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance.
Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.
Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support.
* Improve null checks and optimize list handling
- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.
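
The single-pass bucketing described above can be sketched roughly as follows. This is an illustration in Python, not the C# code from `NearestNeighbourWordExtractor.cs`; the `Letter` tuple, the enum values and `bucket_by_orientation` are hypothetical names chosen for the example.

```python
from enum import IntEnum
from typing import NamedTuple


class TextOrientation(IntEnum):
    # Hypothetical numbering; only used as bucket indexes in this sketch.
    OTHER = 0
    HORIZONTAL = 1
    ROTATE180 = 2
    ROTATE90 = 3
    ROTATE270 = 4


class Letter(NamedTuple):
    # Hypothetical stand-in for a letter with a text orientation.
    text: str
    orientation: TextOrientation


def bucket_by_orientation(letters):
    """Group letters by orientation in one pass over the input."""
    # One preallocated bucket per orientation, indexed by the enum value.
    buckets = [[] for _ in TextOrientation]
    for letter in letters:
        buckets[letter.orientation].append(letter)
    return buckets
```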
---------
Co-authored-by: Chuck Beasley <CBeasley@kilpatricktownsend.com>
the existing numeric tokenizer involved allocations and string parsing. since
the number formats in pdf files are fairly predictable, we can improve this
substantially
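
a rough sketch of the idea, in python rather than the library's c#: the value is accumulated digit by digit straight from the bytes, so no intermediate string is built. the function name `parse_pdf_number` and its return shape are assumptions made for this example, and only the plain `[sign] digits [. digits]` form is handled.

```python
def parse_pdf_number(data: bytes, start: int = 0):
    """Parse a number at data[start:], returning (value, next_index).

    Illustrative only: handles an optional sign, integer digits and an
    optional fractional part, without building an intermediate string.
    """
    i = start
    n = len(data)
    negative = False
    if i < n and data[i] in b"+-":
        negative = data[i] == ord("-")
        i += 1

    int_part = 0
    frac_part = 0
    frac_digits = 0
    seen_digit = False
    seen_dot = False
    while i < n:
        b = data[i]
        if 48 <= b <= 57:  # ASCII '0'..'9'
            seen_digit = True
            if seen_dot:
                frac_part = frac_part * 10 + (b - 48)
                frac_digits += 1
            else:
                int_part = int_part * 10 + (b - 48)
        elif b == 46 and not seen_dot:  # ASCII '.'
            seen_dot = True
        else:
            break  # first byte that is not part of the number
        i += 1

    if not seen_digit:
        raise ValueError("no digits found")

    # Integers stay ints; anything with a '.' becomes a float.
    value = int_part + frac_part / 10 ** frac_digits if seen_dot else int_part
    return (-value if negative else value), i
```

for example, `parse_pdf_number(b"750 0 obj")` stops at the first space and reports index 3 as the next position to read.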
the file provided in issue #926 contains the following syntax
in pdf object streams:
```
% 750 0 obj
<< >>
```
currently we read the comment token and skip the rest. however, this
producer is writing nonsense to the stream: comment tokens are only
valid outside streams in pdf files. so we align to the behavior of
pdfbox here by skipping the entire line containing a comment inside
a stream, which fixes parsing this file.
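
a minimal sketch of the line-skipping behavior, assuming the scanner is positioned on the `%` byte. it is written in python for illustration and is not the library's c# tokenizer; `skip_comment_line` is a hypothetical helper name.

```python
LINE_ENDINGS = (0x0A, 0x0D)  # LF and CR


def skip_comment_line(data: bytes, pos: int) -> int:
    """Return the index of the first byte after the line that starts at pos."""
    n = len(data)
    # Skip everything up to the end of the current line.
    while pos < n and data[pos] not in LINE_ENDINGS:
        pos += 1
    # Skip the line ending itself (LF, CR or CRLF) and any blank lines.
    while pos < n and data[pos] in LINE_ENDINGS:
        pos += 1
    return pos
```

with the syntax from the issue, calling it at the `%` leaves the scanner at the `<< >>` on the next line.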
* Avoid encoding ASCII in more cases
* Make Space a const
* Use WriteWhiteSpace extension to eliminate possible virtual call
* Use ASCII when encoding constrained character subset
* Simplify pragmas
* Revert Whitespace rename
* Fix using statement order
* Remove obsolete serialization support on .NET
* Remove obsolete serialization support on .NET (part 2)
* Make AdobeFontMetricsLigature a struct
* Make AdobeFontMetricsCharacterSize a struct
* Eliminate allocation in CompactFontFormatData
* Pass TransformationMatrix by reference
* Seal Encoding classes
* Make SubTableHeaderEntry a readonly struct
* Introduce StringSplitter and eliminate various allocations in GlyphListFactory
* Eliminate a few substring allocations
* Use char overload on StringBuilder
* Eliminate virtual calls on stringIndex
* Optimize ReadHelper ReadLong and ReadInt methods
* Add additional readonly annotations to PdfRectangle
* Optimize NameTokenizer
* Eliminate allocation in TrueTypeGlyphTableSubsetter
* Use empty arrays
* Eliminate allocations in OperationWriteHelper.WriteHex
* Use simplified DecryptCbc method on .NET 6+
* Fix windows-1252 encoding not working on net6.0 and net8.0
* Update int buffers to exact unsigned max length and eliminate additional byte allocation
* Fix typo
* Remove unused constant
if we're parsing a known dictionary (e.g. all keys are required
and there are no additional optional keys) and looking for the
dictionary end token fails with a format exception, we provide
the possibility to recover by assuming a dictionary end token
after all required tokens are consumed
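
the recovery can be sketched like this, in python rather than the library's c# and over a pre-split token list instead of a byte stream. the function names, the `ValueError` standing in for a format exception, and the `/A` and `/B` keys are all assumptions made for the example.

```python
def parse_strict(tokens, pos):
    """Read name/value pairs until '>>'; raise ValueError on bad input."""
    entries = {}
    while pos < len(tokens):
        if tokens[pos] == ">>":
            return entries, pos + 1
        key = tokens[pos]
        if not key.startswith("/") or pos + 1 >= len(tokens):
            raise ValueError(f"expected name/value pair at token {pos}")
        entries[key] = tokens[pos + 1]
        pos += 2
    raise ValueError("dictionary end token not found")


def parse_known(tokens, pos, required_keys):
    """Try the strict parse first, then fall back to the required keys."""
    try:
        return parse_strict(tokens, pos)
    except ValueError:
        # Recovery: consume pairs until every required key has been seen,
        # then behave as if a dictionary end token followed.
        entries = {}
        remaining = set(required_keys)
        i = pos
        while remaining and i + 1 < len(tokens):
            key = tokens[i]
            if key not in remaining:
                raise  # not a key we know about: give up with the original error
            entries[key] = tokens[i + 1]
            remaining.discard(key)
            i += 2
        if remaining:
            raise  # ran out of tokens before all required keys were found
        return entries, i
```

the fallback only makes sense when the full key set is known up front; with optional keys there would be no way to tell where the dictionary was meant to end.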