com.aspose.pdf

Class TextAbsorber

  • Direct Known Subclasses:
    TextFragmentAbsorber, TextParagraphAbsorber


    public class TextAbsorber
    extends Object

    Represents a text absorber object. Performs text extraction and provides access to the result via the getText() method.


     The example demonstrates how to extract text from the first page of a PDF document.
     
     // open document
     Document doc = new Document(inFile);
     // create TextAbsorber object to extract text
     TextAbsorber absorber = new TextAbsorber();
     // accept the absorber for first page
     doc.getPages().get(1).accept(absorber);
     // get the extracted text
     String extractedText = absorber.getText();
     

    The TextAbsorber object is used to extract text from a whole PDF document or from a single page of the document.
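
    The examples in this section call either accept(absorber) on a page or page collection, or visit(...) on the absorber itself. The following is a minimal, self-contained sketch of that hand-off. All types in it (SketchPage, SketchPages, SketchAbsorber) are hypothetical stand-ins and not Aspose.PDF classes; in particular, appending text on each visit is an assumption of the model, not a documented guarantee of the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT the Aspose.PDF classes. They only model the
// accept/visit hand-off between a page (or page collection) and an absorber.
class VisitorSketch {

    /** Stand-in for a page that carries some text. */
    static class SketchPage {
        final String text;

        SketchPage(String text) {
            this.text = text;
        }

        // accept() hands this page to the absorber, which does the extraction
        void accept(SketchAbsorber absorber) {
            absorber.visit(this);
        }
    }

    /** Stand-in for a page collection with 1-based access. */
    static class SketchPages {
        private final List<SketchPage> pages = new ArrayList<SketchPage>();

        void add(SketchPage page) {
            pages.add(page);
        }

        SketchPage get(int number) {
            return pages.get(number - 1);
        }

        // accepting on the collection visits every page in order
        void accept(SketchAbsorber absorber) {
            for (SketchPage page : pages) {
                page.accept(absorber);
            }
        }
    }

    /** Stand-in for an absorber; appending per visit is an assumption of this model. */
    static class SketchAbsorber {
        private final StringBuilder text = new StringBuilder();

        void visit(SketchPage page) {
            text.append(page.text);
        }

        String getText() {
            return text.toString();
        }
    }

    // Builds a two-page sample collection
    static SketchPages samplePages() {
        SketchPages pages = new SketchPages();
        pages.add(new SketchPage("first "));
        pages.add(new SketchPage("second"));
        return pages;
    }

    // Single page: only that page is handed to the absorber
    static String extractFirstPage() {
        SketchAbsorber absorber = new SketchAbsorber();
        samplePages().get(1).accept(absorber);
        return absorber.getText();
    }

    // Whole collection: every page is handed to the same absorber
    static String extractAllPages() {
        SketchAbsorber absorber = new SketchAbsorber();
        samplePages().accept(absorber);
        return absorber.getText();
    }

    public static void main(String[] args) {
        System.out.println("[" + extractFirstPage() + "]");
        System.out.println("[" + extractAllPages() + "]");
    }
}
```

    In this model, calling accept on a page and calling visit on the absorber with that page are two spellings of the same operation, which is why the class documents both styles.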

    • Constructor Detail

      • TextAbsorber

        public TextAbsorber()

        Initializes a new instance of the TextAbsorber.


         The example demonstrates how to extract text from all pages of the PDF document.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text
         TextAbsorber absorber = new TextAbsorber();
         // accept the absorber for all document's pages
         doc.getPages().accept(absorber);
         // get the extracted text
         String extractedText = absorber.getText();
         

        Performs text extraction and provides access to the extracted text via the getText() method.

      • TextAbsorber

        public TextAbsorber(TextExtractionOptions extractionOptions)

        Initializes a new instance of the TextAbsorber with extraction options.


         The example demonstrates how to extract text from all pages of the PDF document.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text with formatting
         TextAbsorber absorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
         // accept the absorber for all document's pages
         doc.getPages().accept(absorber);
         // get the extracted text
         String extractedText = absorber.getText();
         

        Performs text extraction and provides access to the extracted text via the getText() method.

        Parameters:
        extractionOptions - Text extraction options
      • TextAbsorber

        public TextAbsorber(TextExtractionOptions extractionOptions,
                            TextSearchOptions textSearchOptions)

        Initializes a new instance of the TextAbsorber with extraction and text search options.

        Parameters:
        extractionOptions - Text extraction options
        textSearchOptions - Text search options

        Performs text extraction and provides access to the extracted text via the getText() method.

      • TextAbsorber

        public TextAbsorber(TextSearchOptions textSearchOptions)

        Initializes a new instance of the TextAbsorber with text search options.

        Parameters:
        textSearchOptions - Text search options

        Performs text extraction and provides access to the extracted text via the getText() method.

    • Method Detail

      • getText

        public String getText()

        Gets the text that the TextAbsorber extracted from the PDF document or page.

        Returns:
        String value
         The example demonstrates how to extract text from all pages of the PDF document.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text
         TextAbsorber absorber = new TextAbsorber();
         // accept the absorber for all document's pages
         doc.getPages().accept(absorber);
         // get the extracted text
         String extractedText = absorber.getText();
         
      • hasErrors

        public boolean hasErrors()

        Indicates whether errors were found during text extraction. Errors are searched for only if TextSearchOptions.LogTextExtractionErrors is true, which may decrease performance.

        Returns:
        boolean value
      • getErrors

        public List<TextExtractionError> getErrors()

        Gets the list of TextExtractionError objects. The list contains information about errors that were found during text extraction. Errors are searched for only if TextSearchOptions.LogTextExtractionErrors is true, which may decrease performance.

        Returns:
        List of TextExtractionError objects
      • visit

        public void visit(Page page)

        Extracts text from the specified page.


         The example demonstrates how to extract text from the first page of a PDF document.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text
         TextAbsorber absorber = new TextAbsorber();
         // visit the first page of the document
         absorber.visit(doc.getPages().get(1));
         // get the extracted text
         String extractedText = absorber.getText();
         
        Parameters:
        page - PDF document page object.
      • visit

        public void visit(XForm form)

        Extracts text on the specified XForm.


         The example demonstrates how to extract text from an XForm on the first page of a PDF document.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text
         TextAbsorber absorber = new TextAbsorber();
         // visit the XForm named "Xform1" in the resources of the first page
         absorber.visit(doc.getPages().get(1).getResources().getForms().get("Xform1"));
         // get the extracted text
         String extractedText = absorber.getText();
         
        Parameters:
        form - PDF XForm object.
      • visit

        public void visit(IDocument pdf)

        Extracts text from the specified document.


         The example demonstrates how to extract text from a PDF document.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text
         TextAbsorber absorber = new TextAbsorber();
         // visit the whole document
         absorber.visit(doc);
         // get the extracted text
         String extractedText = absorber.getText();
         
        Parameters:
        pdf - PDF document object.
      • getExtractionOptions

        public TextExtractionOptions getExtractionOptions()

        Gets text extraction options.


         The example demonstrates how to set Pure text formatting mode and perform text extraction.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text with formatting
         TextAbsorber absorber = new TextAbsorber();
         // set pure text formatting mode
         absorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
         // accept the absorber for all document's pages
         doc.getPages().accept(absorber);
         // get the extracted text
         String extractedText = absorber.getText();
         

        Allows you to define the text formatting mode, via TextExtractionOptions, that is used during extraction. The default mode is TextExtractionOptions.TextFormattingMode.Pure.

        Returns:
        TextExtractionOptions value
      • setExtractionOptions

        public void setExtractionOptions(TextExtractionOptions value)

        Sets text extraction options.


         The example demonstrates how to set Pure text formatting mode and perform text extraction.
         
         // open document
         Document doc = new Document(inFile);
         // create TextAbsorber object to extract text with formatting
         TextAbsorber absorber = new TextAbsorber();
         // set pure text formatting mode
         absorber.setExtractionOptions(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure));
         // accept the absorber for all document's pages
         doc.getPages().accept(absorber);
         // get the extracted text
         String extractedText = absorber.getText();
         

        Allows you to define the text formatting mode, via TextExtractionOptions, that is used during extraction. The default mode is TextExtractionOptions.TextFormattingMode.Pure.

        Parameters:
        value - TextExtractionOptions value
      • getTextSearchOptions

        public TextSearchOptions getTextSearchOptions()
        Gets text search options.

        Allows you to define a rectangle that delimits the extracted text. By default the rectangle is empty, which means that only the page boundaries define the text extraction region.

        Returns:
        TextSearchOptions value
      • setTextSearchOptions

        public void setTextSearchOptions(TextSearchOptions value)
        Sets text search options.

        Allows you to define a rectangle that delimits the extracted text. By default the rectangle is empty, which means that only the page boundaries define the text extraction region.

        Parameters:
        value - TextSearchOptions value
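
        The region rule described for the text search options (an empty rectangle leaves only the page boundaries in effect, a non-empty one narrows the region) can be illustrated with a small self-contained model. Everything in it is a hypothetical stand-in rather than the Aspose.PDF API: Rect, Chunk and extract are invented for the illustration, and text is simplified to chunks anchored at a single point instead of real glyph boxes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT the Aspose.PDF classes. Text is reduced to
// chunks anchored at a single point, which is a deliberate simplification.
class RegionSketch {

    /** Axis-aligned rectangle given by lower-left and upper-right corners. */
    static class Rect {
        final double llx, lly, urx, ury;

        Rect(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        // a rectangle with no area counts as "empty" in this model
        boolean isEmpty() {
            return urx <= llx || ury <= lly;
        }

        boolean contains(double x, double y) {
            return x >= llx && x <= urx && y >= lly && y <= ury;
        }
    }

    /** A piece of text positioned at one point. */
    static class Chunk {
        final String text;
        final double x, y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps chunks that lie on the page and, if a region is set, inside it
    static String extract(List<Chunk> chunks, Rect page, Rect region) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            boolean onPage = page.contains(c.x, c.y);
            boolean inRegion = region.isEmpty() || region.contains(c.x, c.y);
            if (onPage && inRegion) {
                out.append(c.text);
            }
        }
        return out.toString();
    }

    static List<Chunk> sampleChunks() {
        List<Chunk> chunks = new ArrayList<Chunk>();
        chunks.add(new Chunk("Header ", 50, 800));
        chunks.add(new Chunk("Body", 50, 400));
        chunks.add(new Chunk("Off", -10, 400)); // lies outside the page
        return chunks;
    }

    public static void main(String[] args) {
        Rect page = new Rect(0, 0, 595, 842);
        // empty region: only the page boundaries apply
        System.out.println(extract(sampleChunks(), page, new Rect(0, 0, 0, 0)));
        // non-empty region: only chunks inside the rectangle are kept
        System.out.println(extract(sampleChunks(), page, new Rect(0, 300, 595, 500)));
    }
}
```

        With the empty rectangle the model keeps every chunk that lies on the page; with the narrower rectangle it keeps only the chunk inside it.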