Mirror of https://github.com/UglyToad/PdfPig.git (synced 2025-08-20 06:18:27 +08:00)
update readme to avoid people using `page.Text` or asking about editing docs (#1109)

* update readme to avoid people using `page.Text` or asking about editing docs

  we need to be more clear because beloved chat gpt falls into the trap of recommending `page.Text` when asked about the library even though this text is usually the wrong field to use
* tabs to spaces
* rogue tab
parent 27df4af5f9
commit 9cb3b71e62
66
README.md
@@ -2,19 +2,11 @@
|
|||||||
|
|
||||||
# PdfPig
|
# PdfPig
|
||||||
|
|
||||||
[](https://gitter.im/pdfpig/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
|
|
||||||
[](https://www.nuget.org/packages/PdfPig/)
|
[](https://www.nuget.org/packages/PdfPig/)
|
||||||
|
|
||||||
[](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test.yml)
|
[](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test.yml)
|
||||||
[![Build and test [MacOS]](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test_macos.yml/badge.svg)](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test_macos.yml)
|
[![Build and test [MacOS]](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test_macos.yml/badge.svg)](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test_macos.yml)
|
||||||
|
|
||||||
This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents
|
PdfPig supports reading text and content from PDF files. It also supports basic PDF file creation.
|
||||||
containing text and geometrical shapes.
|
|
||||||
|
|
||||||
This project aims to port [PDFBox](https://github.com/apache/pdfbox) to C#.
|
|
||||||
|
|
||||||
## Wiki
|
|
||||||
Check out our [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples and detailed guides on the API.
|
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
@@ -32,29 +24,26 @@ While the version is below 1.0.0 minor versions will change the public API witho
|
|||||||
|
|
||||||
See the [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples
|
See the [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples
|
||||||
|
|
||||||
### Read words in a page
|
### Reading text from a PDF
|
||||||
|
|
||||||
The simplest usage at this stage is to open a document, reading the words from every page:
|
The simplest usage at this stage is to open a document, reading the words from every page:
|
||||||
|
|
||||||
```cs
|
```cs
|
||||||
|
// using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
|
||||||
|
// using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
|
||||||
using (PdfDocument document = PdfDocument.Open(@"C:\Documents\document.pdf"))
|
using (PdfDocument document = PdfDocument.Open(@"C:\Documents\document.pdf"))
|
||||||
{
|
{
|
||||||
foreach (Page page in document.GetPages())
|
foreach (Page page in document.GetPages())
|
||||||
{
|
{
|
||||||
string pageText = page.Text;
|
string text = ContentOrderTextExtractor.GetText(page);
|
||||||
|
IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
|
||||||
foreach (Word word in page.GetWords())
|
}
|
||||||
{
|
|
||||||
Console.WriteLine(word.Text);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
An example of the output of this is shown below:
|
You **should not** use `page.Text` directly, unless you know what you're doing. The `Text` property preserves the internal content order which is rarely ever the text in the order you want.
|
||||||
|
|
||||||

|
These layout analysis tools should get you the text you want in most cases.
|
||||||
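To illustrate why content order and reading order can differ, here is a minimal sketch that does not use PdfPig at all. `Fragment`, `ReadingOrderSketch` and the sample coordinates are hypothetical names and values invented for this example, and the sort is far cruder than what real layout analysis does. It assumes the usual PDF convention that Y increases towards the top of the page.

```cs
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; this is NOT a PdfPig type.
public record Fragment(string Text, double X, double Y);

public static class ReadingOrderSketch
{
    // Joins fragments in the order they were stored, i.e. "content order".
    public static string ContentOrder(IEnumerable<Fragment> fragments) =>
        string.Join(" ", fragments.Select(f => f.Text));

    // Naive reading order: highest Y first (top of the page), then left to right.
    public static string NaiveReadingOrder(IEnumerable<Fragment> fragments) =>
        string.Join(" ", fragments
            .OrderByDescending(f => f.Y)
            .ThenBy(f => f.X)
            .Select(f => f.Text));

    // Made-up page where the producer wrote the footer first and the title last.
    public static IReadOnlyList<Fragment> Sample() => new[]
    {
        new Fragment("Page 1", 250, 20),
        new Fragment("world", 120, 700),
        new Fragment("Hello", 72, 700),
        new Fragment("Title", 72, 760),
    };
}
```

For the sample data, `ContentOrder` gives `Page 1 world Hello Title` while `NaiveReadingOrder` gives `Title Hello world Page 1`. Real documents add columns, rotated text and overlapping lines, which is why a dedicated extractor is the safer choice.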
|
|
||||||
Where for the PDF text ("Write something in") shown at the top the 3 words (in pink) are detected and each word contains the individual letters with glyph bounding boxes.
|
|
||||||
|
|
||||||
### Create PDF Document
|
### Create PDF Document
|
||||||
To create documents use the class `PdfDocumentBuilder`. The Standard 14 fonts provide a quick way to get started:
|
To create documents use the class `PdfDocumentBuilder`. The Standard 14 fonts provide a quick way to get started:
|
||||||
@@ -80,6 +69,12 @@ The output is a 1 page PDF document with the text "Hello World!" in Helvetica ne
|
|||||||
|
|
||||||
Each font must be registered with the `PdfDocumentBuilder` prior to use to enable pages to share the font resources. Only Standard 14 fonts and TrueType fonts (.ttf) are supported.
|
Each font must be registered with the `PdfDocumentBuilder` prior to use to enable pages to share the font resources. Only Standard 14 fonts and TrueType fonts (.ttf) are supported.
|
||||||
|
|
||||||
|
Document creation supports very limited changes to existing PDF documents. However it does not support any of the following:
|
||||||
|
|
||||||
|
- Editing forms
|
||||||
|
- Copying or changing annotations, metadata or document structure data
|
||||||
|
- Adding or removing text with existing fonts
|
||||||
|
|
||||||
### Advanced Document Extraction
|
### Advanced Document Extraction
|
||||||
In this example a more advanced document extraction is performed. `PdfDocumentBuilder` is used to create a copy of the pdf with debug information (bounding boxes and reading order) added.
|
In this example a more advanced document extraction is performed. `PdfDocumentBuilder` is used to create a copy of the pdf with debug information (bounding boxes and reading order) added.
|
||||||
|
|
||||||
@@ -259,7 +254,7 @@ string title = document.Information.Title;
|
|||||||
|
|
||||||
### Document Structure
|
### Document Structure
|
||||||
|
|
||||||
The document now has a Structure member:
|
The `PdfDocument` has a Structure member:
|
||||||
|
|
||||||
UglyToad.PdfPig.Structure structure = document.Structure;
|
UglyToad.PdfPig.Structure structure = document.Structure;
|
||||||
|
|
||||||
@@ -283,7 +278,7 @@ PageSize size = Page.Size;
|
|||||||
bool isA4 = size == PageSize.A4;
|
bool isA4 = size == PageSize.A4;
|
||||||
```
|
```
|
||||||
|
|
||||||
`Page` provides access to the text of the page:
|
`Page` provides access to the text of the page but you should use `ContentOrderTextExtractor` or alternatives if indexing the text, e.g. for RAG/LLMs:
|
||||||
|
|
||||||
string text = page.Text;
|
string text = page.Text;
|
||||||
|
|
||||||
@@ -329,7 +324,7 @@ Retrieving annotations on each page is provided using the method:
|
|||||||
|
|
||||||
page.GetAnnotations()
|
page.GetAnnotations()
|
||||||
|
|
||||||
This call is not cached and the document must not have been disposed prior to use.
|
This call is not cached and the document must not have been disposed prior to use. Annotations cannot be edited.
|
||||||
|
|
||||||
### Bookmarks
|
### Bookmarks
|
||||||
|
|
||||||
@@ -357,6 +352,8 @@ A page has a method to extract hyperlinks (annotations of link type):
|
|||||||
|
|
||||||
IReadOnlyList<UglyToad.PdfPig.Content.Hyperlink> hyperlinks = page.GetHyperlinks();
|
IReadOnlyList<UglyToad.PdfPig.Content.Hyperlink> hyperlinks = page.GetHyperlinks();
|
||||||
|
|
||||||
|
Hyperlinks cannot be added or edited when building documents.
|
||||||
|
|
||||||
### TrueType
|
### TrueType
|
||||||
|
|
||||||
The classes used to work with TrueType fonts in the PDF file are available for public consumption. Given an input file:
|
The classes used to work with TrueType fonts in the PDF file are available for public consumption. Given an input file:
|
||||||
@@ -396,18 +393,17 @@ var resultFileBytes = PdfMerger.Merge(filePath1, filePath2);
|
|||||||
File.WriteAllBytes(@"C:\pdfs\outputfilename.pdf", resultFileBytes);
|
File.WriteAllBytes(@"C:\pdfs\outputfilename.pdf", resultFileBytes);
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Wiki
|
||||||
|
Check out our [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples and detailed guides on the API.
|
||||||
|
|
||||||
|
## Issues
|
||||||
|
|
||||||
|
Please do file an issue if you encounter a bug. See our [issue policy](https://github.com/UglyToad/PdfPig/issues/1095) and [contributing guide](https://github.com/UglyToad/PdfPig/blob/master/CONTRIBUTING.md) for details.
|
||||||
|
|
||||||
## API Reference
|
## API Reference
|
||||||
|
|
||||||
If you wish to generate doxygen documentation, run `doxygen doxygen-docs` and open `docs/doxygen/html/index.html`.
|
If you wish to generate doxygen documentation, run `doxygen doxygen-docs` and open `docs/doxygen/html/index.html`.
|
||||||
|
|
||||||
See also the [wiki](https://github.com/UglyToad/PdfPig/wiki) for a detailed documentation on parts of the API
|
|
||||||
|
|
||||||
## Issues
|
|
||||||
|
|
||||||
Please do file an issue if you encounter a bug.
|
|
||||||
|
|
||||||
However in order for us to assist you, you **must** provide the file which causes your issue. Please host this in a publically available place.
|
|
||||||
|
|
||||||
## Credit
|
## Credit
|
||||||
|
|
||||||
This project wouldn't be possible without the work done by the [PDFBox](https://pdfbox.apache.org/) team and the Apache Foundation.
|
This project started as an effort to port [PDFBox](https://github.com/apache/pdfbox) to C#. This project wouldn't be possible without the work done by the [PDFBox](https://pdfbox.apache.org/) team and the Apache Foundation.
|
||||||
|