I am trying to extract text only from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can autodetect the document type and extract text accordingly. I am only interested in the text, not the formatting etc.
My application ends up with a big memory leak, and on investigating it, it turns out to come from caching in the PDFont class from the PDFBox dependency. I am not interested in caching font metrics and other font formatting data from PDFs, as I only want to extract the text.
I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using AutoDetectParser:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the default write limit
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try (FileInputStream inputstream = new FileInputStream(new File(child.getPath()))) {
    parser.parse(inputstream, handler, metadata, context);
}
String s = handler.toString();
// attempt to release the static font cache held by PDFBox
PDFont.clearResources();
So I fudged a workaround and just called System.gc() every time a file had finished being processed, which works a treat but doesn't really answer the question.
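For reference, the workaround amounts to something like this per-file loop; the directory traversal and the extractText helper (wrapping the parsing code above) are stand-ins of mine, not the original code:

for (File child : new File(inputDir).listFiles()) {
    String text = extractText(child); // hypothetical helper running the Tika code above
    PDFont.clearResources();          // drop PDFBox's static font caches
    System.gc();                      // the fudge: nudge the collector after each file
}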
I have to rephrase my question. Basically my request is very straightforward: I want to display Asian characters in a PDF file generated with iText 7.
So far I have downloaded the NotoSansCJKsc-Regular.otf file and assigned a variable to hold its path; below is my code:
public static string FONT = @"D:\Projects\Resources\NotoSansCJKsc-Regular.otf";

PdfWriter writer = new PdfWriter(@"C:\temp\test.pdf");
PdfDocument pdfDoc = new PdfDocument(writer);
Document doc = new Document(pdfDoc, PageSize.A4);
PdfFont fontChinese = PdfFontFactory.CreateFont(FONT, PdfEncodings.IDENTITY_H);
doc.SetFont(fontChinese);
But the issue I am facing now is that whenever the code runs to this line:
PdfFont fontChinese = PdfFontFactory.CreateFont(FONT, PdfEncodings.IDENTITY_H);
I always get this error: "The request could not be performed because of an I/O device error." This error doesn't make sense to me, and I am struggling to find a solution. Could someone here who has had a similar issue help, please? The code is in C#.
Many thanks.
I can confirm that the above code works as expected; the .otf file that I originally downloaded was corrupted, hence I got the above error.
I created a Java app that produces some documents as output. The documents are created with the Apache POI API and are made of text and tables.
My boss has now decided they also want them in PDF format for storage. They of course have a $0 budget. I tried using iText 4.2 (which comes under the LGPL license), but I lost all the tables (I get only naked text).
This is my code:
try {
    // extract the plain text from the Word document with POI
    XWPFDocument doc = new XWPFDocument(POIXMLDocument.openPackage(s + ".doc"));
    XWPFWordExtractor wx = new XWPFWordExtractor(doc);

    // write that text into a new PDF with iText
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(s + ".pdf"));
    document.open();
    writer.setPageEmpty(true);
    document.newPage();
    writer.setPageEmpty(true);

    String text = wx.getText();
    text = text.replaceAll("\\cM?\r?\n", "");
    document.add(new Paragraph(text));
    document.close(); // without closing, the PDF is left incomplete
} catch (Exception e) {
    System.out.println("Exception during test");
    e.printStackTrace();
}
Any help? Even a change of direction would be great. I was wondering if I could simply write a macro that opens the doc, does Save As, and saves it as a PDF with the same name, launching it eventually from inside the Java app.
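A rough sketch of that last idea, replacing the macro with a headless LibreOffice call from Java; it assumes soffice is installed and on the PATH, the output directory is a placeholder, and exception handling is omitted:

Process p = new ProcessBuilder(
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", "out", s + ".doc")
        .inheritIO()
        .start();
p.waitFor(); // a zero exit code means the PDF was written into the --outdir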
Thank you
You might want to take a look at this pretty similar question. I wrote an answer there which is not tested yet, since I've had no time to do so, but it might still solve your problem or at least give you some hints for further research.
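In the meantime, one direction worth trying: instead of adding only the extracted plain text, walk the POI tables and rebuild each one as an iText PdfPTable. A rough sketch, assuming simple rectangular tables, and untested as well:

// rebuild each Word table as an iText table instead of flattening it to text
for (XWPFTable table : doc.getTables()) {
    int columns = table.getRow(0).getTableCells().size();
    PdfPTable pdfTable = new PdfPTable(columns);
    for (XWPFTableRow row : table.getRows()) {
        for (XWPFTableCell cell : row.getTableCells()) {
            pdfTable.addCell(new Phrase(cell.getText()));
        }
    }
    document.add(pdfTable);
}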
I am receiving the "EI not found" error with this specific PDF, found under https://bfs.ever-team.com/files/6fce4cef9769e40d1994e684a881d4bf/facture3_1.pdf.
I am using the itextpdf-5.4.3 jar, and below is the code:
// restrict extraction to a rectangular region of the page
com.itextpdf.awt.geom.Rectangle rec = new com.itextpdf.awt.geom.Rectangle(307, 728, 742, 400);
RenderFilter filter = new RegionTextRenderFilter(rec);
TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
String currentText = PdfTextExtractor.getTextFromPage(reader, i, strategy);
The getTextFromPage method is throwing the error. I checked other threads and it was mentioned that this error should be fixed in the latest jar, but it seems it is still not working for my file (facture3_1.pdf).
Can anyone advise, please?
A crosspost of this question has been answered on the iText mailing list. To close the question here, too, that answer is copied here:
The issue can be reproduced with iText 5.4.3 but not with the current development snapshot. The OP, therefore, should update his iText version.
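For example, a newer 5.x release can be pulled in via Maven; the version below is just an illustration, check Maven Central for the current one:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>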
The full exception is: InlineImageParseException: EI not found after end of image data.
EI denotes the end of an inline image. The handling of inline images is tricky and not strictly well-defined. iText recently improved its handling of inline images to correctly parse more PDFs with such inline images.
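Schematically, an inline image in a content stream looks like this (a hand-written illustration, not taken from the PDF in question):

BI                  % begin inline image
  /W 4 /H 4         % width and height in samples
  /CS /RGB /BPC 8   % color space and bits per component
ID                  % raw sample data follows
...binary data...
EI                  % end of inline image

Because the length of the raw data is not always declared, a parser has to scan for the EI keyword, and distinguishing a real EI from the same bytes occurring inside the image data is exactly the tricky part.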
I've created a program which needs to convert PDF files into image files, and for this Ghostscript is the best choice. But once in a while, the library stalls completely on a page and doesn't continue; it just keeps consuming CPU, as though it were caught in an infinite loop. The error is easily reproducible, as it happens every time on the specific PDF files it occurs on, yet Ghostscript gives no error of any kind, and nothing is out of the ordinary in the PDF files themselves as far as I can see.
I have, however, been able to find out that the stalling is due to a specific element or elements in the PDF files: by deleting those elements, the PDF renders easily in Ghostscript. But this is not a solution, nor an answer I can use.
PDF link* - http://www.filedropper.com/usjunis1-32webtest
*saved with the free version of PDF-XChange Editor, so it has watermarks at the top, but it is the square that causes the stalling. I've also seen it happen on vector graphics objects, so it is not limited to squares.
Code -
private void startImageProcessing(String pdfFile)
{
    GhostscriptVersionInfo gvi = new GhostscriptVersionInfo(new Version(0, 0, 0),
        Directory.GetCurrentDirectory() + @"\gsdll32.dll", string.Empty, GhostscriptLicense.GPL);
    Ghostscript.NET.Processor.GhostscriptProcessor processor = new Ghostscript.NET.Processor.GhostscriptProcessor(gvi, true);
    processor.StartProcessing(
        CreateTestArgs(pdfFile, pdfFile.Substring(0, pdfFile.Length - 4) + "\\" + prefix + "-%03d.jpg", 72 * scale),
        new ConsoleStdIO(true));
}

private static string[] CreateTestArgs(string inputPath, string outputPath, int dpi)
{
    List<string> gsArgs = new List<string>();
    gsArgs.Add("-dSAFER");
    gsArgs.Add("-dBATCH");
    gsArgs.Add("-dNOPAUSE");
    gsArgs.Add("-sDEVICE=jpeg");
    gsArgs.Add("-r" + dpi);
    gsArgs.Add("-dJPEGQ=100");
    gsArgs.Add("-dNumRenderingThreads=" + Environment.ProcessorCount.ToString());
    gsArgs.Add("-dTextAlphaBits=4");
    gsArgs.Add("-dGraphicsAlphaBits=4");
    gsArgs.Add(@"-sOutputFile=" + outputPath);
    gsArgs.Add(@"-f" + inputPath);
    return gsArgs.ToArray();
}
I've also created a PDF file containing only one of the problem elements for testing, and it triggers the error both when saved by Adobe Acrobat and by PDF-XChange Editor, so the error is not due to a specific program that I've used to save the PDF either.
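For reference, the generated arguments correspond to a standalone command line like the one below (paths, resolution, and thread count are placeholders); running it through the Ghostscript console binary directly is a way to check whether the stall also happens outside Ghostscript.NET:

gswin32c -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r144 -dJPEGQ=100 ^
    -dNumRenderingThreads=4 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 ^
    -sOutputFile=page-%03d.jpg -f input.pdf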
In Nutch I'm implementing a plug-in that will get the content of webpages and process them in a special way.
My main problem is that I want to convert the webpages to plain text so they can be processed. I read that the Tika toolkit can do that,
so I found this code that uses Tika to parse URLs, and I wrote it inside the filter method:
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
{
    byte[] raw = content.getContent();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    try {
        parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
    } catch (Exception e) {
        LOG.error("Tika parsing failed", e); // parse() throws checked exceptions
    }
    String plainText = handler.toString();
    LOG.info("Mime: " + metadata.get(Metadata.CONTENT_TYPE));
    LOG.info("content: " + plainText);
    return parseResult; // note: nothing from Tika is stored back into the ParseResult
}
The result of metadata.get(Metadata.CONTENT_TYPE) is text/html
but handler.toString() is empty!
Update:
I also tried to add this line after the parse call:
LOG.info ("Status : "+ new ParseStatus().toString());
and I get this result:
Status : notparsed(0,0)
Since version 1.1, Nutch includes a Tika plugin (see also NUTCH-766) that should cover your need. I don't know if there's more comprehensive documentation available; you might want to ask the Nutch users mailing list for more details (or someone here on SO can fill in).
As Jukka Zitting said, Tika is already leveraged in Nutch. In the code that you pasted, there is no place where you set the extracted text, metadata, or ParseStatus on any Nutch-specific data structure, so you don't see the ParseStatus filled in accordingly.
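A rough sketch of the missing step, based on the Nutch 1.x parse API; the exact constructor and method signatures should be double-checked against your Nutch version, and title and outlinks stand for whatever the HTML parse produced earlier:

// store the Tika output back into Nutch's structures so the
// ParseStatus reflects a successful parse (Nutch 1.x API, verify signatures)
ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
ParseData parseData = new ParseData(status, title, outlinks, content.getMetadata());
parseResult.put(content.getUrl(), new ParseText(plainText), parseData);
return parseResult;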