Extract XML data from a gzip file using Apache Tika?

I am working on a project in which I need to extract XML (sitemap) data from a .gz file using Apache Tika (I am new to Tika).
The file name is something like sitemap01.xml.gz.
I can extract data from a plain text file or from HTML, but I don't know how to extract the XML from the .gz file and then get the metadata and content out of that XML.
I have searched Google for the past two days.
Do I need to use delegateParser in Tika to extract data from the XML?
Please point me to a sample or some articles.
Here is my attempt:
public void parseXml() throws IOException{
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
InputStream stream =this.getClass().getResourceAsStream("sitemap.xml.gz");
try {
parser.parse(stream,handler,metadata,context);
for(int i = 0; i <metadata.names().length; i++) {
String name = metadata.names()[i];
System.out.println(name + " : " + metadata.get(name));
}
System.out.println(handler.toString());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (TikaException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally{
if(stream!=null) {
stream.close();
}
}
}

The thing you're missing is setting a recursing parser on your ParseContext. You probably want something like:
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(....)
By setting a Parser on the ParseContext, you tell Tika which parser to call when it encounters embedded documents (such as the XML inside your GZip file).
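If the gzip layer is the only wrapper around the sitemap, it can also be stripped with the JDK alone before any parser sees the data. The following is a minimal sketch of that alternative, not the Tika approach described above: it uses java.util.zip.GZIPInputStream and the JDK's DOM parser, and it assumes the sitemap lists its URLs in <loc> elements. The class name SitemapGzipReader, the gzip helper, and the example.com URLs are hypothetical and exist only for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper class: JDK-only alternative, no Tika involved.
class SitemapGzipReader {

    /** Decompresses a gzip stream and returns the text of every <loc> element. */
    static List<String> extractLocs(InputStream gzipped) throws Exception {
        try (InputStream xml = new GZIPInputStream(gzipped)) {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted sitemaps cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(xml);

            // "*" matches <loc> in any namespace, including the sitemap namespace.
            NodeList nodes = doc.getElementsByTagNameNS("*", "loc");
            List<String> locs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                locs.add(nodes.item(i).getTextContent().trim());
            }
            return locs;
        }
    }

    /** Demo helper only: gzips a string in memory so the example needs no file. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String sitemap = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.com/a</loc></url>"
                + "<url><loc>https://example.com/b</loc></url>"
                + "</urlset>";
        List<String> locs = extractLocs(new ByteArrayInputStream(gzip(sitemap)));
        locs.forEach(System.out::println);
    }
}
```

For a real file, a FileInputStream opened on sitemap01.xml.gz could be passed to extractLocs in place of the in-memory stream. Tika remains the better fit when the metadata is needed or when the container format is not known in advance.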

Here is how you can use the XML parser from Apache Tika for your case. XMLParser does not decompress its input, so the file is wrapped in a java.util.zip.GZIPInputStream first:
// open the file and strip the gzip layer
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
File inFile = new File("sitemap.xml.gz");
System.out.println(inFile.isFile());
ParseContext pcontext = new ParseContext();
// XML parser
XMLParser xmlparser = new XMLParser();
try (InputStream inputstream = new GZIPInputStream(new FileInputStream(inFile))) {
xmlparser.parse(inputstream, handler, metadata, pcontext);
}
System.out.println(pcontext.toString());
System.out.println("Contents of the document:" + handler.toString()); // all text content of the XML, with the tags removed
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}

Related

Apache Tika: Parsed PDF contains rectangle box , how to remove/parse these chars?

I am parsing PDFs with Apache Tika, but the result string contains rectangle-box characters.
I think the character is an enumeration dash. How can I remove it from the result string, or parse it correctly?
Other characters are working fine e.g. $ % & and so on.
Here is the code to parse from an InputStream:
private SolrInputDocument getDocument(InputStream stream, String docPath) throws SolrServerException {
SolrInputDocument solrDocument = new SolrInputDocument();
Parser parser = new AutoDetectParser(); // Should auto-detect!
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(false);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser);
Metadata metadata = new Metadata();
metadata.add("encoding", "unicode");
try {
parser.parse(stream, handler, metadata, parseContext);
String body = handler.toString();
body = CharMatcher.JAVA_ISO_CONTROL.removeFrom(body);
solrDocument.addField("id", docPath);
solrDocument.addField("body", body);
solrDocument.addField("url", docPath);
solrDocument.addField("extension", PDF_EXTENSION);
} catch (IOException | SAXException | TikaException e) {
throw new SolrServerException(e);
}
return solrDocument;
}

throwing exception inside the java 8 stream foreach

I am using a Java 8 stream and I cannot throw exceptions from inside its forEach.
stream.forEach(m -> {
try {
if (isInitial) {
isInitial = false;
String outputName = new SimpleDateFormat(Constants.HMDBConstants.HMDB_SDF_FILE_NAME).format(new Date());
if (location.endsWith(Constants.LOCATION_SEPARATOR)) {
savedPath = location + outputName;
} else {
savedPath = location + Constants.LOCATION_SEPARATOR + outputName;
}
File output = new File(savedPath);
FileWriter fileWriter = null;
fileWriter = new FileWriter(output);
writer = new SDFWriter(fileWriter);
}
writer.write(m);
} catch (IOException e) {
throw new ChemIDException(e.getMessage(),e);
}
});
This is my exception class:
public class ChemIDException extends Exception {
public ChemIDException(String message, Exception e) {
super(message, e);
}
}
I log errors with loggers at a higher level, so I want the exception to propagate to the top. Thanks
Try extending RuntimeException instead. forEach takes a Consumer, and Consumer.accept does not declare any checked exceptions, so the lambda you pass in can only throw unchecked (runtime) exceptions.
WARNING: THIS IS PROBABLY NOT A VERY GOOD IDEA
But it will probably work.
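A minimal sketch of that idea, under some assumptions: instead of a custom subclass it uses the JDK's own java.io.UncheckedIOException (which extends RuntimeException) to carry the IOException out of the lambda, and it unwraps it again outside the stream so the caller still sees a checked exception. The class ForEachExceptionDemo and its write method are hypothetical stand-ins for the question's SDFWriter code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.stream.Stream;

// Hypothetical demo class; write() stands in for the real writer in the question.
class ForEachExceptionDemo {

    /** Pretend writer: fails with a checked exception on an empty record. */
    static void write(String record) throws IOException {
        if (record.isEmpty()) {
            throw new IOException("cannot write an empty record");
        }
    }

    /** Writes every record; a failure inside forEach reaches the caller as IOException. */
    static void writeAll(Stream<String> records) throws IOException {
        try {
            records.forEach(record -> {
                try {
                    write(record);
                } catch (IOException e) {
                    // Consumer.accept cannot throw checked exceptions, so wrap it.
                    throw new UncheckedIOException(e);
                }
            });
        } catch (UncheckedIOException e) {
            // Unwrap outside the stream; getCause() is typed as IOException here.
            throw e.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            writeAll(Stream.of("first", "", "third"));
        } catch (IOException e) {
            System.err.println("Logged at the top level: " + e.getMessage());
        }
    }
}
```

The same pattern works with a custom unchecked exception such as a ChemIDException that extends RuntimeException; the wrap-and-unwrap step is only needed if callers should keep seeing a checked type.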
Why are you using forEach, a method designed to process every element, when all you want to do, is to process the first element? Instead of realizing that forEach is the wrong method for the job (or that there are more methods in the Stream API than forEach), you are kludging this with an isInitial flag.
Just consider:
Optional<String> o = stream.findFirst();
if(o.isPresent()) try {
String outputName = new SimpleDateFormat(Constants.HMDBConstants.HMDB_SDF_FILE_NAME)
.format(new Date());
if (location.endsWith(Constants.LOCATION_SEPARATOR)) {
savedPath = location + outputName;
} else {
savedPath = location + Constants.LOCATION_SEPARATOR + outputName;
}
File output = new File(savedPath);
FileWriter fileWriter = null;
fileWriter = new FileWriter(output);
writer = new SDFWriter(fileWriter);
writer.write(o.get());
} catch (IOException e) {
throw new ChemIDException(e.getMessage(),e);
}
which has no issues with exception handling. This example assumes that the Stream’s element type is String. Otherwise, you have to adapt the Optional<String> type.
If, however, your isInitial flag is supposed to change more than once during the stream processing, you are definitely using the wrong tool for your job. You should have read and understood the “Stateless behaviors” and “Side-effects” sections of the Stream API documentation, as well as the “Non-interference” section, before using Streams. Just converting loops to forEach invocations on a Stream doesn’t improve the code.

Read file contents from VirtualFile - intellij plugin development

How can I read the contents of a VirtualFile? I am currently doing it this way:
try {
BufferedReader br = new BufferedReader(new InputStreamReader(virtualFile.getInputStream()));
String currentLine;
StringBuilder stringBuilder = new StringBuilder();
while ((currentLine = br.readLine()) != null) {
stringBuilder.append(currentLine);
stringBuilder.append("\n");
}
} catch (IOException e1) {
e1.printStackTrace();
return 0;
}
However, I am getting a garbled string appended when I print the StringBuilder.
Some common ways of reading VirtualFile contents are:
file.contentsToByteArray()
LoadTextUtil.loadText(file)
FileDocumentManager.getInstance().getDocument(file).get*CharSequence()
You can use VfsUtil.loadText(virtualFile);
Also, to make sure that the file is updated you can use virtualFile.refresh(false, false);
here you can find some more useful information.

Extract text data line by line from pdf using pdfbox API in java

I am using the Apache PDFBox API to extract text from a PDF, but the code below does not return the text sequentially (line by line).
Code:
try {
RandomAccess scratchFile = null;
pdDoc = PDDocument.loadNonSeq(new File(fileName), scratchFile);
pdfStripper = new PDFTextStripper();
parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
}

Reading content of a JAR file (at runtime)? [duplicate]

I have read the posts:
Viewing contents of a .jar file
and
How do I list the files inside a JAR file?
But I, sadly, couldn't find a good solution to actually read a JAR's content (file by file).
Furthermore, could someone give me a hint, or point to a resource, where my problem is discussed?
I could only think of a not-so-straightforward way to do this:
I could somehow convert the list of a JAR's resources to a list of inner-JAR URLs, which I could then open using openConnection().
You use JarFile to open a JAR file. With it you can get a ZipEntry or JarEntry (they can be treated as the same thing) by calling getEntry(String name) or entries(). Once you have an entry, pass it to JarFile.getInputStream(ZipEntry ze) to get an InputStream, and then read the data from that stream.
Here is the complete code which reads all the file contents inside the jar file.
public class ListJar {
private static void process(InputStream input) throws IOException {
InputStreamReader isr = new InputStreamReader(input);
BufferedReader reader = new BufferedReader(isr);
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
reader.close();
}
public static void main(String arg[]) throws IOException {
JarFile jarFile = new JarFile("/home/bathakarai/gold/click-0.15.jar");
final Enumeration<JarEntry> entries = jarFile.entries();
while (entries.hasMoreElements()) {
final JarEntry entry = entries.nextElement();
if (entry.getName().contains(".")) {
System.out.println("File : " + entry.getName());
JarEntry fileEntry = jarFile.getJarEntry(entry.getName());
InputStream input = jarFile.getInputStream(fileEntry);
process(input);
}
}
}
}
Here is how I read it as a ZIP file:
try {
ZipInputStream is = new ZipInputStream(new FileInputStream("file.jar"));
ZipEntry ze;
byte[] buf = new byte[4096];
long len; // ZipEntry.getSize() returns a long, or -1 if the size is unknown
while ((ze = is.getNextEntry()) != null) {
System.out.println("----------- " + ze);
len = ze.getSize();
// Dump len bytes to the file
...
}
is.close();
} catch (Exception e) {
e.printStackTrace();
}
This is more efficient than the JarFile approach if you want to decompress the whole file.
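As a hedged illustration of the streaming approach, the sketch below reads every entry of an archive into memory with ZipInputStream. It does not rely on getSize(), which can return -1 when the size is not known up front, and instead reads each entry until read() returns -1. The class ZipStreamReader, its readAll method, and the in-memory demo archive are hypothetical and only serve the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class: reads a JAR/ZIP sequentially, entry by entry.
class ZipStreamReader {

    /** Returns the bytes of every non-directory entry, keyed by entry name, in archive order. */
    static Map<String, byte[]> readAll(InputStream in) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buf = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                contents.put(entry.getName(), out.toByteArray());
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny archive in memory so the demo needs no file on disk.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
            zos.write("Manifest-Version: 1.0\n".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        Map<String, byte[]> contents = readAll(new ByteArrayInputStream(bytes.toByteArray()));
        contents.forEach((name, data) ->
                System.out.println(name + " (" + data.length + " bytes)"));
    }
}
```

For a JAR on disk, a FileInputStream on the file could be passed to readAll. Holding every entry in memory is only reasonable for small archives; for large ones, each entry would be processed or written out inside the loop instead.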