Constructor and Description |
---|
PdfExtractor() - Initializes a new PdfExtractor object. |
PdfExtractor(IDocument document) - Initializes a new PdfExtractor object based on the given document. |
Modifier and Type | Method and Description |
---|---|
void | bindPdf(InputStream inputStream) - Binds a PDF document from a stream. |
void | bindPdf(String inputFile) - Binds the input PDF file. |
void | extractAttachment() - Extracts attachments from a PDF document. |
void | extractAttachment(String attachmentFileName) - Extracts an attachment from the PDF file by attachment name. |
void | extractImage() - Extracts images from the PDF file. |
void | extractMarkedContentAsImages(Page page, String path) - Gets all the Marked Content containers as separate images. |
void | extractText() - Extracts text from a PDF document. |
void | extractText(Charset encoding) - Extracts text from a PDF document using the specified encoding. |
void | extractTextInternal(TextEncodingInternal encoding) - For internal use only. |
ByteArrayOutputStream[] | getAttachment() - Saves all attachment files to streams. |
void | getAttachment(String outputPath) - Stores attachments in files. |
List<FileSpecification> | getAttachmentInfo() - Gets the list of attachments. |
List<String> | getAttachNames() - Returns the list of attachment names in the PDF file. |
int | getEndPage() - Gets the end page of the page range in which the extraction is performed. |
int | getExtractImageMode() - Gets the mode for the image extraction process. |
int | getExtractTextMode() - Gets the mode for the text extraction result. |
boolean | getNextImage(OutputStream outputStream) - Retrieves the next image from the PDF file and stores it in the stream. |
boolean | getNextImage(OutputStream outputStream, ImageType format) - Retrieves the next image from the PDF file and stores it in the stream in the given image format. |
boolean | getNextImage(String outputFile) - Retrieves the next image from the PDF document. |
boolean | getNextImage(String outputFile, ImageType format) - Retrieves the next image from the PDF document in the given image format. |
void | getNextPageText(OutputStream outputStream) - Saves one page's text to a stream. |
void | getNextPageText(String outputFile) - Saves one page's text to a file. |
String | getPassword() - Gets the input file's password. |
int | getResolution() - Gets the resolution for extracted images. |
int | getStartPage() - Gets the start page of the page range in which the extraction is performed. |
void | getText(OutputStream outputStream) - Saves text to a stream. See also: extractText() |
void | getText(OutputStream outputStream, boolean filterNotAscii) - Saves text to a stream. See also: extractText() |
void | getText(String outputFile) - Saves text to a file. See also: extractText() |
TextSearchOptions | getTextSearchOptions() - Gets text search options. |
boolean | hasNextImage() - Checks whether more images are available in the PDF document. |
boolean | hasNextPageText() - Indicates whether more page text is available. |
boolean | isBidi() - Is true when the text contains Hebrew or Arabic symbols. |
void | setEndPage(int value) - Sets the end page of the page range in which the extraction is performed. |
void | setExtractImageMode(int value) - Sets the mode for the image extraction process. |
void | setExtractTextMode(int value) - Sets the mode for the text extraction result. |
void | setPassword(String value) - Sets the input file's password. |
void | setResolution(int value) - Sets the resolution for extracted images. |
void | setStartPage(int value) - Sets the start page of the page range in which the extraction is performed. |
void | setTextSearchOptions(TextSearchOptions value) - Sets text search options. |
public PdfExtractor()
Initializes a new PdfExtractor object.
public PdfExtractor(IDocument document)
Initializes a new PdfExtractor object based on the given document.
document
- PDF document.
public int getStartPage()
Gets the start page of the page range in which the extraction is performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(5);
ext.extractText();
public void setStartPage(int value)
Sets the start page of the page range in which the extraction is performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(5);
ext.extractText();
value
- Start page of the page range.
public int getEndPage()
Gets the end page of the page range in which the extraction is performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(3);
ext.extractText();
public void setEndPage(int value)
Sets the end page of the page range in which the extraction is performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(3);
ext.extractText();
value
- End page of the page range.
public int getExtractTextMode()
Gets the mode for the text extraction result.
The example demonstrates the use of the ExtractTextMode property in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
extractor.setExtractTextMode(1); // 1 = raw ordering mode
extractor.extractText();
extractor.getText("D:\\Text\\text.txt");
Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.
public void setExtractTextMode(int value)
Sets the mode for the text extraction result.
The example demonstrates the use of the ExtractTextMode property in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
extractor.setExtractTextMode(1); // 1 = raw ordering mode
extractor.extractText();
extractor.getText("D:\\Text\\text.txt");
Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.
value
- The text extraction mode.
public TextSearchOptions getTextSearchOptions()
Gets text search options.
public void setTextSearchOptions(TextSearchOptions value)
Sets text search options.
value
- Text search options.
public int getExtractImageMode()
Gets the mode for the image extraction process.
See also: ExtractImageMode
public void setExtractImageMode(int value)
Sets the mode for the image extraction process.
value
- ExtractImageMode value.
See also: ExtractImageMode
public boolean isBidi()
Is true when the text contains Hebrew or Arabic symbols. This case must be handled specially because string functions change their behaviour and start processing text from right to left (except for numbers and other non-text characters).
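As a rough illustration of what such a check involves, the sketch below scans a string for characters whose Unicode directionality is right-to-left, using only the JDK. It is not the library's implementation of isBidi(); the class BidiCheck and the method containsRtl are hypothetical names introduced here.

```java
// Hypothetical helper for illustration; not part of the PdfExtractor API.
final class BidiCheck {
    private BidiCheck() {}

    /** Returns true if the text contains at least one right-to-left letter. */
    static boolean containsRtl(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            // Hebrew letters report RIGHT_TO_LEFT, Arabic letters RIGHT_TO_LEFT_ARABIC.
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
            // Advance by one code point, which may span two chars.
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

Digits and punctuation have other directionality classes, so a string made only of numbers is not reported as right-to-left, which matches the exception mentioned in the description.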
public void extractText()
Extracts text from a PDF document.
The first example demonstrates how to extract all the text from a PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
extractor.extractText();
extractor.getText("D:\\Text\\text.txt");
The second example demonstrates how to extract each page's text into a separate text file.
// TestPath is a placeholder for the directory that holds the input file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText();
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText()) {
    extractor.getNextPageText(prefix + pageCount + suffix);
    pageCount++;
}
public void extractText(Charset encoding)
Extracts text from a Pdf document using specified encoding.
The first example demonstrates how to extract all the text from a PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
// UTF_16LE is the Java counterpart of the .NET Encoding.Unicode used in the original sample.
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
extractor.getText("D:\\Text\\text.txt");
The second example demonstrates how to extract each page's text into a separate text file.
// TestPath is a placeholder for the directory that holds the input file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText(java.nio.charset.Charset.forName("UTF-8"));
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText()) {
    extractor.getNextPageText(prefix + pageCount + suffix);
    pageCount++;
}
encoding
- The encoding of the extracted text.
public void extractTextInternal(TextEncodingInternal encoding)
For internal use only.
encoding
- The encoding of the extracted text.
public void getText(String outputFile)
Saves text to a file. See also: extractText()
outputFile
- The file path and name to save the text.
public void getText(OutputStream outputStream)
Saves text to a stream. See also: extractText()
outputStream
- The stream to save the text.
public void bindPdf(String inputFile)
Binds the input PDF file.
PdfExtractor ext = new PdfExtractor(); ext.bindPdf("sample.pdf");
public void bindPdf(InputStream inputStream)
Binds PDF document from stream.
PdfExtractor ext = new PdfExtractor(); InputStream stream = new FileInputStream("sample.pdf"); ext.bindPdf(stream);
public void extractImage()
Extracts images from the PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractImage();
int i = 1;
while (extractor.hasNextImage()) {
    extractor.getNextImage("image-" + i + ".pdf");
    i++; // advance the counter so each image gets its own file
}
public boolean hasNextImage()
Checks whether more images are available in the PDF document. Note: extractImage() must be called before this method is used.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractImage();
int i = 1;
while (extractor.hasNextImage()) {
    extractor.getNextImage("image-" + i + ".pdf");
    i++; // advance the counter so each image gets its own file
}
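The hasNextImage() and getNextImage(OutputStream) pair follows a has-next/get-next protocol, and the stream overload makes it possible to keep the images in memory instead of writing files. The sketch below shows one way to drain such a source. ImageSource is a hypothetical stand-in interface that mirrors the two signatures listed on this page so the code runs without the library; ImageDrain is likewise a hypothetical name, and treating a false return value as "nothing was written" is an assumption, since the page does not document the return value.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring hasNextImage() and getNextImage(OutputStream).
interface ImageSource {
    boolean hasNextImage();
    boolean getNextImage(OutputStream outputStream);
}

// Hypothetical helper for illustration; not part of the PdfExtractor API.
final class ImageDrain {
    private ImageDrain() {}

    /** Reads every remaining image from the source into its own byte array. */
    static List<byte[]> drain(ImageSource source) {
        List<byte[]> images = new ArrayList<>();
        while (source.hasNextImage()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // Assumption: false means no image was written, so stop the loop.
            if (!source.getNextImage(buffer)) {
                break;
            }
            images.add(buffer.toByteArray());
        }
        return images;
    }
}
```

With the real class, a thin adapter that forwards both calls to a PdfExtractor (after extractImage() has been called) would play the role of ImageSource.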
public boolean getNextImage(String outputFile)
Retrieves the next image from the PDF document. Note: extractImage() must be called before this method is used.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractImage();
int i = 1;
while (extractor.hasNextImage()) {
    extractor.getNextImage("image-" + i + ".pdf");
    i++; // advance the counter so each image gets its own file
}
outputFile
- File where the image will be stored.
public boolean getNextImage(String outputFile, ImageType format)
Retrieves the next image from the PDF document in the given image format. Note: extractImage() must be called before this method is used.
outputFile
- File where the image will be stored.
format
- The format of the image (an ImageType element).
public boolean getNextImage(OutputStream outputStream, ImageType format)
Retrieves the next image from the PDF file and stores it in the stream in the given image format.
outputStream
- Stream where the image data will be saved.
format
- The format of the image.
public boolean getNextImage(OutputStream outputStream)
Retrieves the next image from the PDF file and stores it in the stream.
outputStream
- Stream where the image data will be saved.
public List<String> getAttachNames()
Returns the list of attachment names in the PDF file. Note: extractAttachment() must be called before this method is used.
The example demonstrates how to extract attachment names from a PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestSettings.GetInputFile("sample.pdf"));
extractor.extractAttachment();
List<String> attachments = extractor.getAttachNames();
for (String name : attachments)
    System.out.println(name);
public void extractAttachment()
Extracts attachments from a PDF document.
public void extractAttachment(String attachmentFileName)
Extracts an attachment from the PDF file by attachment name.
attachmentFileName
- Name of the attachment to extract.
public void getAttachment(String outputPath)
Stores attachments in files.
outputPath
- Directory path where the attachment(s) will be stored. A null or empty string means the attachment(s) will be placed in the application directory.
public boolean hasNextPageText()
Indicates whether more page text is available.
The example demonstrates the use of hasNextPageText in a text extraction scenario.
// TestPath is a placeholder for the directory that holds the input file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
// UTF_16LE is the Java counterpart of the .NET Encoding.Unicode used in the original sample.
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
    extractor.getNextPageText(prefix + pageCount + suffix);
    pageCount++;
}
public void getNextPageText(String outputFile)
Saves one page's text to file.
The example demonstrates the use of the getNextPageText method in a text extraction scenario.
// TestPath is a placeholder for the directory that holds the input file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
// UTF_16LE is the Java counterpart of the .NET Encoding.Unicode used in the original sample.
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText()) {
    extractor.getNextPageText(prefix + pageCount + suffix);
    pageCount++;
}
outputFile
- The file path and name to save the text.
public void getNextPageText(OutputStream outputStream)
Saves one page's text to stream.
The example demonstrates the use of the getNextPageText method in a text extraction scenario.
// TestPath is a placeholder for the directory that holds the input file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
// UTF_16LE is the Java counterpart of the .NET Encoding.Unicode used in the original sample.
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
    // Open an output stream for this page; try-with-resources closes it.
    try (OutputStream fs = new FileOutputStream(prefix + pageCount + suffix)) {
        extractor.getNextPageText(fs);
    }
    pageCount++;
}
outputStream
- The stream to save the text.
public void getText(OutputStream outputStream, boolean filterNotAscii)
Saves text to a stream. See also: extractText()
outputStream
- The stream to save the text.
filterNotAscii
- If this parameter is true, all non-ASCII symbols are removed.
public ByteArrayOutputStream[] getAttachment()
Saves all attachment files to streams.
// path is a placeholder for the working directory.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(path + "Attach.pdf");
extractor.extractAttachment();
List<String> names = extractor.getAttachNames();
ByteArrayOutputStream[] tempStreams = extractor.getAttachment();
for (int i = 0; i < tempStreams.length; i++) {
    String name = names.get(i);
    OutputStream fs = new FileOutputStream(path + name);
    fs.write(tempStreams[i].toByteArray());
    fs.close();
}
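The example pairs the result of getAttachNames() with the array from getAttachment() by index. A small helper, sketched below with plain JDK types, can do that pairing and fail early when the two do not line up. AttachmentWriter and writeAll are hypothetical names; that both collections share the same order is an assumption taken from the example, not a documented guarantee.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the PdfExtractor API.
final class AttachmentWriter {
    private AttachmentWriter() {}

    /** Writes streams[i] to dir under the file name of names.get(i); returns the written paths. */
    static List<Path> writeAll(List<String> names, ByteArrayOutputStream[] streams, Path dir)
            throws IOException {
        if (names.size() != streams.length) {
            throw new IllegalArgumentException("names and streams differ in length");
        }
        Files.createDirectories(dir);
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < streams.length; i++) {
            // Keep only the file-name part so an attachment name cannot point outside dir.
            Path target = dir.resolve(Paths.get(names.get(i)).getFileName());
            Files.write(target, streams[i].toByteArray());
            written.add(target);
        }
        return written;
    }
}
```

A call would look like AttachmentWriter.writeAll(extractor.getAttachNames(), extractor.getAttachment(), Paths.get("out")), where "out" is an arbitrary output directory.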
public List<FileSpecification> getAttachmentInfo()
Gets the list of attachments.
public int getResolution()
Gets the resolution for extracted images. The default value is 150. Images with a higher resolution are sharper, but increasing the resolution also increases the time and memory needed to extract them. A resolution of 150 or 300 is usually enough to get a clear image.
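To make the memory trade-off concrete, the sketch below estimates the raster size of an area at a given resolution. It assumes the resolution is expressed in dots per inch and applied to content measured in PDF points (1/72 inch), and that the raster is uncompressed RGB at 3 bytes per pixel; neither assumption is stated on this page, and ResolutionMath is a hypothetical name.

```java
// Hypothetical helper for illustration; not part of the PdfExtractor API.
final class ResolutionMath {
    private ResolutionMath() {}

    /** Converts a length in PDF points (1/72 inch) to pixels at the given DPI. */
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points / 72.0 * dpi);
    }

    /** Estimates the size in bytes of an uncompressed RGB raster (3 bytes per pixel). */
    static long rgbBytes(double widthPoints, double heightPoints, int dpi) {
        return 3L * pointsToPixels(widthPoints, dpi) * pointsToPixels(heightPoints, dpi);
    }

    public static void main(String[] args) {
        // A 4 x 3 inch area (288 x 216 points) at the two commonly suggested resolutions.
        System.out.println(rgbBytes(288, 216, 150)); // 600 x 450 pixels
        System.out.println(rgbBytes(288, 216, 300)); // 1200 x 900 pixels
    }
}
```

Under these assumptions, doubling the resolution quadruples the pixel count, which is why values far above 300 quickly become expensive.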
public void setResolution(int value)
Sets the resolution for extracted images. The default value is 150. Images with a higher resolution are sharper, but increasing the resolution also increases the time and memory needed to extract them. A resolution of 150 or 300 is usually enough to get a clear image.
value
- int value.
public String getPassword()
Gets input file's password.
public void setPassword(String value)
Sets input file's password.
value
- String value.
public void extractMarkedContentAsImages(Page page, String path)
Gets all the Marked Content containers as separate images.
Every Marked Content is saved as a PNG image named MCID_<ID number of block for the page>.png
page
- The page to process.
path
- The path where the images will be saved.
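Once the images have been written, the block IDs can be recovered from the file names. The sketch below lists a directory and collects the IDs of files that match the MCID_<ID>.png naming described above. It uses only the JDK; McidFiles and listIds are hypothetical names, and reading the ID as a plain decimal integer is an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the PdfExtractor API.
final class McidFiles {
    // Assumption: the ID part of the file name is a decimal number.
    private static final Pattern NAME = Pattern.compile("MCID_(\\d+)\\.png");

    private McidFiles() {}

    /** Returns the sorted marked-content IDs found in the file names of dir. */
    static List<Integer> listIds(Path dir) throws IOException {
        List<Integer> ids = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                Matcher m = NAME.matcher(file.getFileName().toString());
                if (m.matches()) {
                    ids.add(Integer.parseInt(m.group(1)));
                }
            }
        }
        Collections.sort(ids);
        return ids;
    }
}
```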