Apache Tika - read chunk at a time from a file?

Is there any way to read a chunk at a time (instead of reading the entire file) from a file using the Tika API?
The following is my code. As you can see, I am reading the entire file at once. I would like to read a chunk at a time and write the content to a text file.
InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
....
p.parse(stream, handler, meta, context);
...
String content = handler.toString();

There's now an Apache Tika example that shows how to capture the plain text output and return it in chunks, based on the maximum allowed size of a chunk. You can find it in ContentHandlerExample; the method is parseToPlainTextChunks.
Based on that, if you wanted to output to a file instead, on a per-chunk basis, you'd tweak it to be something like:
final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024 * 1024;
final File outputDir = new File("/tmp/");

private class ChunkHandler extends ContentHandlerDecorator {
    private int size = 0;
    private int fileNumber = -1;
    private OutputStreamWriter out = null;

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            if (out == null || size + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                // Start a new output file once the current chunk would overflow
                if (out != null) out.close();
                fileNumber++;
                size = 0;
                File f = new File(outputDir, "output-" + fileNumber + ".txt");
                out = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
            }
            out.write(ch, start, length);
            size += length;
        } catch (IOException e) {
            // ContentHandler methods may only throw SAXException
            throw new SAXException(e);
        }
    }

    public void close() throws IOException {
        if (out != null) out.close();
    }
}

public void parse(File file) throws IOException, SAXException, TikaException {
    try (InputStream stream = new FileInputStream(file)) {
        Parser p = new AutoDetectParser();
        Metadata meta = new Metadata();
        ChunkHandler handler = new ChunkHandler();
        ParseContext context = new ParseContext();
        try {
            p.parse(stream, handler, meta, context);
        } finally {
            // Close the last chunk file even if parsing fails
            handler.close();
        }
    }
}
That will give you plain text files in the given directory, each no larger than the maximum size (counted in characters, not bytes). All HTML tags are ignored; you only get the plain textual content.
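The rollover logic in ChunkHandler does not depend on Tika, so it can be tried in isolation. Below is a minimal sketch using only the JDK; the class name RollingTextWriter, its methods, and the file-name pattern are assumptions made for illustration and are not part of the Tika API. Unlike the handler above, which rolls over before writing a whole buffer, this variant splits a single buffer across files, so no file exceeds the limit even when one call delivers more characters than the limit.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes text into numbered files of at most maxChars characters each. */
class RollingTextWriter implements AutoCloseable {
    private final Path dir;
    private final int maxChars;
    private int size = 0;
    private int fileNumber = -1;
    private Writer out = null;

    RollingTextWriter(Path dir, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.dir = dir;
        this.maxChars = maxChars;
    }

    /** Takes the same arguments as ContentHandler.characters, but throws IOException. */
    void write(char[] ch, int start, int length) throws IOException {
        while (length > 0) {
            if (out == null || size == maxChars) {
                roll();
            }
            // Write only as much as still fits into the current file
            int n = Math.min(length, maxChars - size);
            out.write(ch, start, n);
            size += n;
            start += n;
            length -= n;
        }
    }

    /** Closes the current file (if any) and opens the next numbered one. */
    private void roll() throws IOException {
        if (out != null) out.close();
        fileNumber++;
        size = 0;
        Path next = dir.resolve("output-" + fileNumber + ".txt");
        out = Files.newBufferedWriter(next, StandardCharsets.UTF_8);
    }

    int filesWritten() {
        return fileNumber + 1;
    }

    @Override
    public void close() throws IOException {
        if (out != null) out.close();
    }
}
```

A content handler could delegate its characters() call to such a writer and wrap any IOException in a SAXException, as in the handler above.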

Related

Uploading byte array to MySQL shows empty field (nothing uploaded?)

I've been trying to write a script which opens a file with UniFileViewer, sets a path to the file and passes it to various methods which serialize it and upload it to my MySQL server. When I run the script, it executes without problems, but the result is 0 bytes uploaded to the database. I must have missed something here...
public class FileOpen : MonoBehaviour
{
public UITexture ProfilePic;
public static Texture2D tex = null;
public static String selectedFilePath;
void openFile()
{
UniFileBrowser.use.OpenFileWindow(OpenFile);
}
public static Texture2D LoadPNG(string selectedFilePath)
{
byte[] fileData;
fileData = File.ReadAllBytes(selectedFilePath);
tex = new Texture2D(2, 2);
tex.LoadImage(fileData); //..this will auto-resize the texture dimensions.
return tex;
}
//OPENS THE FILE AND SENDS IT TO THE SERVER
void OpenFile(string filePath)
{
selectedFilePath = filePath;
LoadPNG(selectedFilePath);
Texture2D uploadFile = tex;
byte[] bytes = uploadFile.EncodeToPNG();
string fileToSend = Convert.ToBase64String(bytes);
string[] datas = new string[1];
datas[0] = fileToSend;
LoginPro.Manager.ExecuteOnServer("SaveFile", SendToServer_Success, SendToServer_Error, datas);
}

iText FileStream overwrites

I am in the process of putting together some code that will merge PDFs based on the file name prefix. I currently have the code below, which grabs the filename but overwrites instead of merging. I believe my problem is the FileStream placement, but if I move it out of the current location, I can't get the filename. Any suggestions? Thanks.
static void CreateMergedPDFs()
{
string srcDir = "C:/PDFin/";
string resultPDF = "C:/PDFout/";
{
var files = System.IO.Directory.GetFiles(srcDir);
string prevFileName = null;
int i = 1;
foreach (string file in files)
{
string filename = Left(Path.GetFileName(file), 8);
using (FileStream stream = new FileStream(resultPDF + filename + ".pdf", FileMode.Create))
{
if (prevFileName == null || filename == prevFileName)
{
Document pdfDoc = new Document(PageSize.A4);
PdfCopy pdf = new PdfCopy(pdfDoc, stream);
pdfDoc.Open();
{
pdf.AddDocument(new PdfReader(file));
i++;
}
if (pdfDoc != null)
pdfDoc.Close();
Console.WriteLine("Merges done!");
}
}
}
}
}
}
}
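The behavior you describe is consistent with your code: the FileStream is created with FileMode.Create inside the foreach loop, so every input file opens (and truncates) its output file again, and a new Document and PdfCopy are created each time. Files that share a prefix therefore overwrite each other instead of being merged. To merge, open the stream and the PdfCopy once, outside the loop.
Try this: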
static void CreateMergedPDFs()
{
    string srcDir = "C:/PDFin/";
    string resultPDF = "C:/PDFout/merged.pdf";
    // Open the output stream and the PdfCopy once, outside the loop
    using (FileStream stream = new FileStream(resultPDF, FileMode.Create))
    {
        Document pdfDoc = new Document(PageSize.A4);
        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();
        var files = System.IO.Directory.GetFiles(srcDir);
        foreach (string file in files)
        {
            pdf.AddDocument(new PdfReader(file));
        }
        pdfDoc.Close();
    }
    Console.WriteLine("Merges done!");
}
That makes more sense, doesn't it?
If you want to group files based on their prefix, you should read the answer to the question Group files in a directory based on their prefix
In the answer to this question, it is assumed that the prefix and the rest of the filename are separated by a - character. For instance 1-abc.pdf and 1-xyz.pdf have the prefix 1 whereas 2-abc.pdf and 2-xyz.pdf have the prefix 2. In your case, it's not clear how you'd determine the prefix, but it's easy to get a list of all the files, sort them and make groups of files based on whatever algorithm you want to determine the prefix.
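The "sort them and make groups" step can be sketched as follows. The sketch is in Java rather than C#, but the logic carries over directly. It assumes the prefix is simply the first 8 characters of the file name (as in the question's Left(..., 8) call); the class name PrefixGrouper, its methods, and the sample file names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PrefixGrouper {
    /** Returns the first prefixLength characters of name, or the whole name if it is shorter. */
    static String prefixOf(String name, int prefixLength) {
        return name.length() <= prefixLength ? name : name.substring(0, prefixLength);
    }

    /** Groups file names by prefix; the prefixes and the names within each group are sorted. */
    static Map<String, List<String>> groupByPrefix(List<String> fileNames, int prefixLength) {
        return fileNames.stream()
                .sorted()
                .collect(Collectors.groupingBy(
                        name -> prefixOf(name, prefixLength),
                        TreeMap::new,
                        Collectors.toList()));
    }

    public static void main(String[] args) {
        // Hypothetical file names; in practice these would come from a directory listing
        List<String> names = Arrays.asList("INV00001-b.pdf", "INV00002-a.pdf", "INV00001-a.pdf");
        groupByPrefix(names, 8).forEach((prefix, group) ->
                System.out.println(prefix + " -> " + group));
    }
}
```

Each group would then be merged into its own output file, with one output stream and one PdfCopy opened per group rather than per input file.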

Pdfbox - adding pdf embedded File and save the PDDocument to OutputStream does not keep the embedded Files

I'm using PDFBox (1.8.8) to add attachments to a PDF. My problem is that when one of the attachments is of type .pdf and I save the PDDocument to an OutputStream, the final PDF document does not include the attachments. If I save the PDDocument to a file instead of an OutputStream, everything works fine, and if the attachments do not include any PDF, saving to either a file or an OutputStream works fine.
I would like to know if there is any way to add embedded PDF files and save the PDDocument to an OutputStream while keeping the attached files in the final PDF that is generated.
The code I'm using is:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
final PDDocument doc;
Boolean hasPdfAttach = false;
try {
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
// final PDFTextStripper pdfStripper = new PDFTextStripper();
// final String text = pdfStripper.getText(doc);
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
final Map embeddedFileMap = new HashMap();
PDEmbeddedFile embeddedFile;
File file = null;
for (Attachment attach : attachmentsResources) {
// first create the file specification, which holds the embedded file
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile(attach.getFilename());
file = AttachmentUtils.getAttachmentFile(attach);
final InputStream is = new FileInputStream(file.getAbsolutePath());
embeddedFile = new PDEmbeddedFile(doc, is);
// set some of the attributes of the embedded file
if ("application/pdf".equals(attach.getMimetype())) {
hasPdfAttach = true;
}
embeddedFile.setSubtype(attach.getMimetype());
embeddedFile.setSize((int) (long) attach.getFilesize());
fileSpecification.setEmbeddedFile(embeddedFile);
// now add the entry to the embedded file tree and set in the document.
embeddedFileMap.put(attach.getFilename(), fileSpecification);
// final String text2 = pdfStripper.getText(doc);
}
// final String text3 = pdfStripper.getText(doc);
efTree.setNames(embeddedFileMap);
// ((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS); (this not work for me)
// attachments are stored as part of the "names" dictionary in the document catalog
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
// final ByteArrayOutputStream pdfboxToDocumentStream = new ByteArrayOutputStream();
final String tmpfile = "temporary.pdf";
if (hasPdfAttach) {
final File f = new File(tmpfile);
doc.save(f);
doc.close();
//i have try with parser but without success too
// PDFParser parser = new PDFParser(new FileInputStream(tmpfile));
// parser.parse();
// PDDocument doc2 = parser.getPDDocument();
final PDDocument doc2 = PDDocument.loadNonSeq(f, new RandomAccessFile(new File(getHomeTMP()
+ "tempppp.pdf"), "r"));
doc2.save(out);
doc2.close();
} else {
doc.save(out);
doc.close();
}
//that does not work too
// final InputStream in = new FileInputStream(tmpfile);
// IOUtils.copy(in, out);
// out = new FileOutputStream(tmpFile);
// doc.save (out);
} catch (IOException e1) {
e1.printStackTrace();
} catch (Exception e2) {
e2.printStackTrace();
}
}
Best regards
Solution:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
final PDDocument doc;
try {
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
((ByteArrayOutputStream) out).reset();
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
final Map embeddedFileMap = new HashMap();
PDEmbeddedFile embeddedFile;
File file = null;
for (Attachment attach : attachmentsResources) {
// first create the file specification, which holds the embedded file
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile(attach.getFilename());
file = AttachmentUtils.getAttachmentFile(attach);
final InputStream is = new FileInputStream(file.getAbsolutePath());
embeddedFile = new PDEmbeddedFile(doc, is);
// set some of the attributes of the embedded file
embeddedFile.setSubtype(attach.getMimetype());
embeddedFile.setSize((int) (long) attach.getFilesize());
fileSpecification.setEmbeddedFile(embeddedFile);
// now add the entry to the embedded file tree and set in the document.
embeddedFileMap.put(attach.getFilename(), fileSpecification);
}
efTree.setNames(embeddedFileMap);
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
// attachments are stored as part of the "names" dictionary in the document catalog
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
doc.save(out);
doc.close();
} catch (IOException e1) {
e1.printStackTrace();
} catch (Exception e2) {
e2.printStackTrace();
}
}
You store the new PDF after the original PDF in out:
Look at all the uses of out in your method:
private void insertAttachments(OutputStream out, ArrayList<Attachment> attachmentsResources) {
...
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
...
doc2.save(out);
...
doc.save(out);
So you get as input a ByteArrayOutputStream and take its current content as input (i.e. the ByteArrayOutputStream is not empty but already contains a PDF) and after some processing you append the modified PDF to the ByteArrayOutputStream. Depending on the PDF viewer you present this to, you will be shown either the original or the manipulated PDF or a (very correct) error message that the file is garbage.
If you want the ByteArrayOutputStream to contain only the manipulated PDF, simply add
((ByteArrayOutputStream) out).reset();
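or (if you are not sure about the state of the stream; note that reassigning the parameter only changes the local reference, so the caller does not see the new stream unless you return it)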
out = new ByteArrayOutputStream();
right after
doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) out).toByteArray()));
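The append-versus-reset effect can be reproduced with the JDK alone. In the following minimal sketch, transform is only a stand-in for "load the PDF, add attachments, save" (it just upper-cases ASCII text), and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ResetDemo {
    /** Reads the current content of out and writes the modified copy back WITHOUT resetting. */
    static byte[] rewriteWithoutReset(ByteArrayOutputStream out) {
        byte[] modified = transform(out.toByteArray());
        out.write(modified, 0, modified.length); // appended after the original bytes
        return out.toByteArray();
    }

    /** Same as above, but discards the old content before writing the modified copy. */
    static byte[] rewriteWithReset(ByteArrayOutputStream out) {
        byte[] modified = transform(out.toByteArray());
        out.reset(); // the buffer is now empty
        out.write(modified, 0, modified.length);
        return out.toByteArray();
    }

    /** Stand-in for the real PDF processing: upper-cases ASCII text. */
    private static byte[] transform(byte[] in) {
        String text = new String(in, StandardCharsets.US_ASCII);
        return text.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream a = new ByteArrayOutputStream();
        a.write('p');
        ByteArrayOutputStream b = new ByteArrayOutputStream();
        b.write('p');
        // Without reset the stream holds old + new content; with reset only the new content
        System.out.println(rewriteWithoutReset(a).length + " vs " + rewriteWithReset(b).length);
    }
}
```

Without the reset, the stream ends up holding the original bytes followed by the modified ones, which is exactly the "two PDFs in one stream" situation described above.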
PS: According to the comments the OP tried the above proposed changes to his code without success.
I cannot run the code as presented in the question because it is not self-contained. Thus, I reduced it to the essentials to get a self-contained test:
@Test
public void test() throws IOException, COSVisitorException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (
InputStream sourceStream = getClass().getResourceAsStream("test.pdf");
InputStream attachStream = getClass().getResourceAsStream("artificial text.pdf"))
{
final PDDocument document = PDDocument.load(sourceStream);
final PDEmbeddedFile embeddedFile = new PDEmbeddedFile(document, attachStream);
embeddedFile.setSubtype("application/pdf");
embeddedFile.setSize(10993);
final PDComplexFileSpecification fileSpecification = new PDComplexFileSpecification();
fileSpecification.setFile("artificial text.pdf");
fileSpecification.setEmbeddedFile(embeddedFile);
final Map<String, PDComplexFileSpecification> embeddedFileMap = new HashMap<String, PDComplexFileSpecification>();
embeddedFileMap.put("artificial text.pdf", fileSpecification);
final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
efTree.setNames(embeddedFileMap);
final PDDocumentNameDictionary names = new PDDocumentNameDictionary(document.getDocumentCatalog());
names.setEmbeddedFiles(efTree);
document.getDocumentCatalog().setNames(names);
document.save(baos);
document.close();
}
Files.write(Paths.get("attachment.pdf"), baos.toByteArray());
}
As you can see, PDFBox uses only streams here. The result:
PDFBox has no problem storing a PDF into which it has embedded a PDF file attachment.
The problem, therefore, most likely has nothing to do with this workflow as such.

How to extract text from PDFs using a PIG UDF and Apache Tika?

I'm attempting to write a PIG eval function (UDF) to extract text from pdf files using Apache Tika. However, my function only writes 0 or 1 bytes to output whenever I try to run the function. How could I fix my code?
public class ExtractTextFromPDFs extends EvalFunc<String> {
@Override
public String exec(Tuple input) throws IOException {
String pdfText;
if (input == null || input.size() == 0 || input.get(0) == null) {
return "N/A";
}
DataByteArray dba = (DataByteArray)input.get(0);
InputStream is = new ByteArrayInputStream(dba.get());
ContentHandler contenthandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser pdfparser = new AutoDetectParser();
try {
pdfparser.parse(is, contenthandler, metadata, new ParseContext());
} catch (SAXException | TikaException e) {
e.printStackTrace();
}
pdfText = contenthandler.toString();
//close the input stream
if(is != null){
is.close();
}
return pdfText;
}
}
I run the code using 'c = foreach b generate ExtractTextFromPDFs(content);' where b is a pdf and content is a bytearray.

com.sun.jersey.api.client.UniformInterfaceException (returned a response status of 400)

I am trying to set up a file upload example using JAX-RS. I was able to set up the project and successfully upload a file to a server location, but I get the following error when the file size is more than 10 KB (weird!):
com.sun.jersey.api.client.UniformInterfaceException: POST http://localhost:9090/DOAFileUploader/rest/file/upload returned a response status of 400
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:607)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:507)
at com.sony.doa.rest.client.DOAClient.upload(DOAClient.java:75)
at com.sony.doa.rest.client.DOAMain.main(DOAMain.java:34)
I am new to JAX-RS and I'm not sure what exactly the issue is. Do I need to set some parameters on the client side or the server side (like size, timeout, etc.)?
This is the client-side code calling the web service:
public void upload() {
File file = new File(inputFilePath);
FormDataMultiPart part = new FormDataMultiPart();
part.bodyPart(new FileDataBodyPart("file", file, MediaType.APPLICATION_OCTET_STREAM_TYPE));
WebResource resource = Client.create().resource(url);
String response = resource.type(MediaType.MULTIPART_FORM_DATA_TYPE).post(String.class, part);
System.out.println(response);
}
This is the server-side code:
@Path("/file")
public class UploadFileService {
    @POST
    @Path("/upload")
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    public Response uploadFile(
            @FormDataParam("file") InputStream uploadedInputStream,
            @FormDataParam("file") FormDataContentDisposition fileDetail) {
        String uploadedFileLocation = "e://uploaded/"
                + fileDetail.getFileName();
        writeToFile(uploadedInputStream, uploadedFileLocation);
        String output = "File uploaded to : " + uploadedFileLocation;
        return Response.status(200).entity(output).build();
    }

    private void writeToFile(InputStream uploadedInputStream,
            String uploadedFileLocation) {
        // Open the output file once; try-with-resources closes it even on failure
        try (OutputStream out = new FileOutputStream(new File(uploadedFileLocation))) {
            int read;
            byte[] bytes = new byte[16000];
            while ((read = uploadedInputStream.read(bytes)) != -1) {
                out.write(bytes, 0, read);
            }
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Please let me know what settings I have to change for file sizes greater than 10 KB.
Thanks!
I use org.apache.commons.fileupload.servlet.ServletFileUpload in a Jersey context, and it works fine. And yes, it sets the max file size; sorry I missed this before.
Here is a snippet of the code I use (this is a multipart form, so there are other fields along with the file):
private LibraryUpload parseLibraryUpload(HttpServletRequest request) {
LibraryUpload libraryUpload;
File libraryZip = null;
String name = null;
String version = null;
ServletFileUpload upload = new ServletFileUpload();
upload.setFileSizeMax(MAX_FILE_SIZE);
FileItemIterator iter;
try {
iter = upload.getItemIterator(request);
while (iter.hasNext()) {
....
if (item.isFormField()) {
....
}else{
BufferedInputStream buffer = new BufferedInputStream(stream);
buffer.mark(MAX_FILE_SIZE);
libraryZip = File.createTempFile("fromUpload", null);
IOUtils.copy(buffer, new FileOutputStream(libraryZip));
...
}
I encountered the same problem with Jersey. I activated Jersey tracing, but it did not help.
I swapped the library for an Apache library and saw that the problem was linked to Tomcat's repository (directory) for temporary files: the repository did not exist. For files under 10 KB, the repository was not used.
So, after creating the repository, I went back to the Jersey library and everything works fine.
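A check like the following could rule out a missing temporary directory at startup. It is only a sketch under assumptions: it presumes the upload code spills larger files to the JVM's default temp location (the java.io.tmpdir system property, which Tomcat normally points at its own temp directory), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TempDirCheck {
    /** Creates the directory (and any missing parents) and verifies that it is writable. */
    static Path ensureWritableDirectory(String dir) throws IOException {
        Path path = Paths.get(dir);
        Files.createDirectories(path); // does nothing if the directory already exists
        if (!Files.isWritable(path)) {
            throw new IOException("Temp directory is not writable: " + path);
        }
        return path;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the upload library uses the JVM default temp directory
        Path tmp = ensureWritableDirectory(System.getProperty("java.io.tmpdir"));
        System.out.println("Temp directory OK: " + tmp);
    }
}
```

If the library is configured with a different repository directory, the same check applies to that path instead.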