update readme to avoid people using page.Text or asking about editing docs (#1109)

* update readme to avoid people using `page.Text` or asking about editing docs

  we need to be more clear because beloved chat gpt falls into the trap of recommending `page.Text` when asked about the library even though this text is usually the wrong field to use

* tabs to spaces

* rogue tab
2026-01-09 14:04:35 +08:00 · 2025-07-26 12:58:35 -05:00
parent 27df4af5f9
commit 9cb3b71e62
1 changed file with 31 additions and 35 deletions
--- a/README.md
+++ b/README.md
@@ -2,19 +2,11 @@

 # PdfPig

-[![Gitter](https://badges.gitter.im/pdfpig/community.svg)](https://gitter.im/pdfpig/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
 [![nuget](https://img.shields.io/nuget/dt/PdfPig)](https://www.nuget.org/packages/PdfPig/)
-
 [![Build and test](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test.yml/badge.svg)](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test.yml)
 [![Build and test [MacOS]](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test_macos.yml/badge.svg)](https://github.com/UglyToad/PdfPig/actions/workflows/build_and_test_macos.yml)

-This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents
-containing text and geometrical shapes.
-
-This project aims to port [PDFBox](https://github.com/apache/pdfbox) to C#.
-
-## Wiki
-Check out our [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples and detailed guides on the API.
+PdfPig supports reading text and content from PDF files. It also supports basic PDF file creation.

 ## Installation

@@ -32,29 +24,26 @@ While the version is below 1.0.0 minor versions will change the public API witho

 See the [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples 

-### Read words in a page
+### Reading text from a PDF
+
 The simplest usage at this stage is to open a document, reading the words from every page:

 ```cs
+// using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
+// using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
 using (PdfDocument document = PdfDocument.Open(@"C:\Documents\document.pdf"))
 {
-	foreach (Page page in document.GetPages())
-	{
-		string pageText = page.Text;
-
-		foreach (Word word in page.GetWords())
-		{
-			Console.WriteLine(word.Text);
-		}
-	}
+    foreach (Page page in document.GetPages())
+    {
+        string text = ContentOrderTextExtractor.GetText(page);
+        IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
+    }
 }
 ```

-An example of the output of this is shown below:
+You **should not** use `page.Text` directly unless you know what you're doing. The `Text` property preserves the PDF's internal content order, which is rarely the order in which you want to read the text.

-![Image shows three words 'Write something in' in 2 sections, the top section is the normal PDF output, the bottom section is the same text with 3 word bounding boxes in pink and letter bounding boxes in blue-green](https://raw.githubusercontent.com/UglyToad/Pdf/master/documentation/Letters/example-text-extraction.png)
-
-Where for the PDF text ("Write something in") shown at the top the 3 words (in pink) are detected and each word contains the individual letters with glyph bounding boxes.
+These layout analysis tools should get you the text you want in most cases.

 ### Create PDF Document
 To create documents use the class `PdfDocumentBuilder`. The Standard 14 fonts provide a quick way to get started:
@@ -80,6 +69,12 @@ The output is a 1 page PDF document with the text "Hello World!" in Helvetica ne

 Each font must be registered with the `PdfDocumentBuilder` prior to use, to enable pages to share the font resources. Only Standard 14 fonts and TrueType fonts (.ttf) are supported.

+Document creation supports very limited changes to existing PDF documents. However, it does not support any of the following:
+
+- Editing forms
+- Copying or changing annotations, metadata or document structure data
+- Adding or removing text with existing fonts
+
 ### Advanced Document Extraction
 In this example a more advanced document extraction is performed. `PdfDocumentBuilder` is used to create a copy of the pdf with debug information (bounding boxes and reading order) added.

@@ -259,7 +254,7 @@ string title = document.Information.Title;

 ### Document Structure

-The document now has a Structure member:
+The `PdfDocument` has a Structure member:

    UglyToad.PdfPig.Structure structure = document.Structure;

@@ -283,7 +278,7 @@ PageSize size = Page.Size;
 bool isA4 = size == PageSize.A4;
 ```

-`Page` provides access to the text of the page:
+`Page` provides access to the text of the page, but if you are indexing the text (e.g. for RAG/LLMs) you should use `ContentOrderTextExtractor` or an alternative rather than this property:

    string text = page.Text;

@@ -329,7 +324,7 @@ Retrieving annotations on each page is provided using the method:

    page.GetAnnotations()

-This call is not cached and the document must not have been disposed prior to use.
+This call is not cached and the document must not have been disposed prior to use. Annotations cannot be edited.

 ### Bookmarks

@@ -357,6 +352,8 @@ A page has a method to extract hyperlinks (annotations of link type):

    IReadOnlyList<UglyToad.PdfPig.Content.Hyperlink> hyperlinks = page.GetHyperlinks();

+Hyperlinks cannot be added or edited when building documents.
+
 ### TrueType

 The classes used to work with TrueType fonts in the PDF file are available for public consumption. Given an input file:
@@ -396,18 +393,17 @@ var resultFileBytes = PdfMerger.Merge(filePath1, filePath2);
 File.WriteAllBytes(@"C:\pdfs\outputfilename.pdf", resultFileBytes);
 ```

+## Wiki
+Check out our [wiki](https://github.com/UglyToad/PdfPig/wiki) for more examples and detailed guides on the API.
+
+## Issues
+
+Please do file an issue if you encounter a bug. See our [issue policy](https://github.com/UglyToad/PdfPig/issues/1095) and [contributing guide](https://github.com/UglyToad/PdfPig/blob/master/CONTRIBUTING.md) for details.
+
 ## API Reference

 If you wish to generate doxygen documentation, run `doxygen doxygen-docs` and open `docs/doxygen/html/index.html`.

-See also the [wiki](https://github.com/UglyToad/PdfPig/wiki) for a detailed documentation on parts of the API
-
-## Issues
-
-Please do file an issue if you encounter a bug.
-
-However in order for us to assist you, you **must** provide the file which causes your issue. Please host this in a publically available place.
-
 ## Credit

-This project wouldn't be possible without the work done by the [PDFBox](https://pdfbox.apache.org/) team and the Apache Foundation.
+This project started as an effort to port [PDFBox](https://github.com/apache/pdfbox) to C#. It wouldn't be possible without the work done by the [PDFBox](https://pdfbox.apache.org/) team and the Apache Foundation.