Tika with Grobid throwing an error when parsing a PDF document - tika-server

I am trying to extract both document metadata and journal header metadata from a PDF document. I verified that Tika Server (v1.21 / v1.24) and Grobid (v0.6.0) are each able to extract metadata from the PDF document independently. However, when I run Grobid within Tika Server (following the instructions at https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser ), I get the error below (snippet) for the same PDF document:
org.xml.sax.SAXParseException; Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.apache.tika.utils.XMLReaderUtils.buildDOM(XMLReaderUtils.java:407)
at org.apache.tika.parser.journal.TEIDOMParser.parse(TEIDOMParser.java:44)
at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:85)
at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
....
I ran the below command to start Tika Server with Grobid:
java -classpath /home/avlurs/grobid-0.6.0/grobidparser-resources/:tika-server-1.21.jar org.apache.tika.server.TikaServerCli --config /home/avlurs/grobid-0.6.0/grobidparser-resources/tika-config.xml &
I ran the below command to test the metadata extraction:
curl -T /home/avlurs/temp/in/JournalTest.pdf -H "Content-Disposition: attachment;filename=JournalTest.pdf" http://localhost:9998/rmeta
In addition to the error above, I am getting the document metadata from Tika in the output; however, the Grobid metadata is not being extracted.
I would appreciate any inputs / suggestions to address this issue. Thanks.

The Grobid service moved its API endpoints under /api in July 2017, but the GrobidParser wasn't updated to use the new location.
I've just committed a fix for this as part of TIKA-3191, which will be released in Tika 1.25. We're hoping to get that out in the next few weeks, but until then you can use a source build or a snapshot build.
I also plan to update the Tika GrobidParser wiki page with more up-to-date instructions covering the current Gradle build and Docker image options Grobid offers these days.
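In the meantime, you can confirm that your Grobid instance is answering on the relocated endpoint by calling it directly. A quick check, assuming Grobid's default port 8070 and your sample file:
curl --form input=@/home/avlurs/temp/in/JournalTest.pdf http://localhost:8070/api/processHeaderDocument
If that returns a TEI XML header, Grobid itself is fine and the failure is just the old endpoint path used by the parser.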

Related

Converting docx to PDF/A with LibreOffice Writer

I am happily converting docx files to PDF via the command line (controlled via C# process calls) out of my service.
Unfortunately, I could not find anything online on how to set the options for the output PDF that the GUI offers me. I am specifically looking to generate PDF/A and tagged PDF via the command line.
Has anyone ever done this and knows how to do it?
EDIT:
Obviously, getting a PDF/A can be done by using unoconv instead.
On Windows, one would use the following command line in a checked-out unoconv repository:
python.exe .\unoconv -f pdf -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
I did not find further information on how to select other things (tagged PDF etc.) or where to get a complete list of the available options.
EDIT: It seems one can try the different options in the GUI. The settings get saved to C:\Users\<userName>\AppData\Roaming\LibreOffice\4\user\registrymodifications.xcu. One can then look up the changed setting there and provide it to unoconv like this:
python.exe .\unoconv -f pdf -eUseTaggedPDF=1 -eSelectPdfVersion=1 C:\temp\libre\renderingtest.docx
Still not sure if I am doing this correctly though.
The gotenberg project shows how that can be done using unoconv.
$ curl --request POST 'http://localhost:3000/forms/libreoffice/convert' --form 'files=@"doc.docx"' --form 'nativePdfFormat="PDF/A-1a"' -o pdfA.pdf
Example PDF
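If a recent LibreOffice is available, the export options can reportedly also be passed straight to soffice without unoconv; the JSON filter-options syntax below was added in LibreOffice 7.4, if I recall correctly, and SelectPdfVersion / UseTaggedPDF are the same setting names mentioned above (shell quoting will differ on Windows cmd):
soffice --headless --convert-to 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"1"},"UseTaggedPDF":{"type":"boolean","value":"true"}}' renderingtest.docx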

How do you enable the TesseractOCRParser using TikaConfig and the Tika command line utility?

I have installed Apache Tika 1.8 and it is running perfectly, except that the OCR part is not working. I have Tesseract installed and it is also working properly.
When I try to send a PDF with an image in it, I get the following:
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Can I configure the TikaConfig using the command line utility? Or do I have to clone the project, update the POMs and rebuild? I really do not want to have to do that.
There is some info here on how to use the command line utility and TikaConfig, but I cannot figure out how to enable the TesseractOCRParser with it.
Any help greatly appreciated.
OK, so with the help of this post on the Apache Tika forum (thank you, guys) I managed to get it working.
It's a hack, but it works. What I did was extract the Tika-app jar file, then locate PDFParser.properties and change the following properties like this:
extractInlineImages true
extractUniqueInlineImagesOnly false
ocrStrategy ocr_and_text_extraction
Then locate TesseractOCRConfig.properties and change this one property to 1:
enableImageProcessing=1
Save the properties files, zip it all up again, and use your new repackaged jar file; it will now extract text, including text from images, from a PDF file.
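For reference, a hedged sketch of the unpack/repack step (the jar name and working directory are just examples; the two properties files live under org/apache/tika/parser inside the jar):
mkdir tika-exploded && cd tika-exploded
jar xf ../tika-app-1.8.jar
# edit org/apache/tika/parser/pdf/PDFParser.properties
# edit org/apache/tika/parser/ocr/TesseractOCRConfig.properties
jar cfm ../tika-app-patched.jar META-INF/MANIFEST.MF .
Repacking with the original META-INF/MANIFEST.MF keeps the jar runnable with java -jar, which is the part that usually breaks when re-zipping by hand.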
I tried user3250052's approach but I was unable to recompress the jar file in a way that was executable. That's owing to my own inexperience with Java, but regardless, the less hacky way is to pass a custom Tika config file when calling Tika:
java -jar tika-app.jar --config=tika-config.xml image.pdf
This is what my tika-config.xml looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
To build that config file, first I ran:
java -jar tika-app.jar --dump-current-config
That dumps the default config for you. I took that output, put it into tika-config.xml, and added:
<parser class="org.apache.tika.parser.pdf.PDFParser">
  <params>
    <param name="extractInlineImages" type="bool">true</param>
  </params>
</parser>
which I gleaned from https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox) (option 1).
Even though tesseract is enabled by default (so OCR will work out of the box on image files), PDFs do not get OCRed without that option set because, as noted in the above link, "by default, extracting inline images is turned off because some rare PDFs contain thousands of inline images per page, and it has a big hit on performance, both memory usage and time".
Now everything (OCR of image files, OCR of images embedded in PDFs or of image-based PDFs, and naturally plain text extraction from text-based PDFs) works with the Tika app. I found plenty of documentation on getting this to work with the Tika server but very little for the Tika app, so I'm hoping this saves someone the few hours it took me to figure that out (let me know).
I would recommend using ocrStrategy auto.
This tries plain text extraction first and then falls back to OCR.
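If you are already driving Tika with a tika-config.xml like the one above, ocrStrategy can be set as an extra PDFParser parameter; a sketch, assuming a Tika version recent enough to know the auto value (older releases only accept values such as no_ocr, ocr_only and ocr_and_text_extraction):
<parser class="org.apache.tika.parser.pdf.PDFParser">
  <params>
    <param name="extractInlineImages" type="bool">true</param>
    <param name="ocrStrategy" type="string">auto</param>
  </params>
</parser>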

Aerospike docker - 100L, 'UDF: Execution Error 1'

I deployed an Aerospike container using the official docker hub image. When I try to execute test_list = client.llist(key, 'test_list'), my Python client script returns the following error:
exception.UDFError: (100L, 'UDF: Execution Error 1', 'src/main/llist/llist_operations.c', 93)
I looked at the Aerospike logs and found that each time this code is executed, the error below gets printed:
: WARNING (udf): (src/main/mod_lua.c:599) Lua Create Error: module 'llist' not found:
no field package.preload['llist']
no file './llist.lua'
no file '/usr/local/share/luajit-2.0.3/llist.lua'
no file '/usr/local/share/lua/5.1/llist.lua'
no file '/usr/local/share/lua/5.1/llist/init.lua'
no file '/opt/aerospike/sys/udf/lua/llist.lua'
no file '/opt/aerospike/sys/udf/lua/external/llist.lua'
no file '/opt/aerospike/usr/udf/lua/llist.lua'
no file './llist.so'
no file '/usr/local/lib/lua/5.1/llist.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
no file '/opt/aerospike/sys/udf/lua/llist.so'
no file '/opt/aerospike/sys/udf/lua/external/llist.so'
no file '/opt/aerospike/usr/udf/lua/llist.so'
: INFO (udf): (udf.c:954) lua error, ret:1
I could not find the relevant Lua files or a Lua installation in the container. My code works fine when I run it directly on the host. Is there some extra configuration that needs to be done for the container?
LDTs were dropped in 3.15.
https://www.aerospike.com/docs/guide/ldt_guide.html
Excerpt:
Aerospike has removed the Large Data Type feature as of server version 3.15 after deprecating this functionality 12 months earlier. Please see the removal notice and deprecation notice. The features listed below are no longer in Aerospike servers.
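Since the llist API is gone, the usual replacement is a plain list bin manipulated through the client's list operations. A minimal sketch, assuming a recent aerospike Python client and a hypothetical namespace/set/key:
import aerospike

# connect to the containerized server (host and port are assumptions)
client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

key = ('test', 'demo', 'user1')           # hypothetical namespace/set/key
client.list_append(key, 'test_list', 42)  # append values to an ordinary list bin
client.list_append(key, 'test_list', 99)

_, _, bins = client.get(key)
print(bins['test_list'])                  # e.g. [42, 99]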

Blazemeter (JMeter test) zip file upload fail

I get Non HTTP response message: /home/jmeter/my_file-to-upload.zip (No such file or directory) when uploading a zip file in BlazeMeter.
But the logs state that the file is stored as expected: INFO o.a.j.s.FileServer: Stored: /home/jmeter/my_file-to-upload.zip
I added the test file along with the JMX file when creating the test, as instructed here. I have also gone through the BlazeMeter blogs and tutorials; nothing helped.
This test works perfectly fine when executed locally or in Team Service, but I need it to run in BlazeMeter.
The BlazeMeter platform automatically extracts any zip file that is uploaded, and that's the reason your test is unable to find the required file.
As a workaround, you can upload the file in a different format and change the upload path to match the new file format.
For example: change the zip format to gzip, and change the upload path in your script to /home/jmeter/my_file-to-upload.gz instead of /home/jmeter/my_file-to-upload.zip.
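A hedged sketch of that repackaging (the payload file name inside the zip is hypothetical; adjust it to whatever your test actually uploads):
unzip my_file-to-upload.zip -d payload/
gzip -c payload/my_file > my_file-to-upload.gz
Then point the HTTP upload sampler at /home/jmeter/my_file-to-upload.gz and attach the .gz alongside the JMX when creating the test.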
If there are other questions we can help with, feel free to contact us at support@blazemeter.com.
BlazeMeter Support Team

Unable to open a saved Gephi project file

Recently I worked on a project in the network visualization and analysis software Gephi, and I saved it with the ".gephi" extension. However, when I try to reopen the file, it gives the following error message:
"The project file couldn't be opened. Please check the file has .gephi extension.
XMLStreamException - ParseError at [row,col]:[1,1]
Message: Premature end of file."
I'm a beginner in Gephi and only an amateur programmer. I do not understand this error message, and thus have no idea how to resolve it. I tried updating Gephi to the latest version. I also tried to open the file from within Gephi. Neither of those steps has resolved the problem. Can anyone help me out with this, please?
The error message "premature end of file" means that the XML file was not complete. I suppose that the whole file is empty, or just the XML part of it, so maybe the file got corrupted while saving.
Can you try to open the file with Notepad or a hex editor to verify that it has some content?
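If you'd rather check from a terminal, looking at the file size and the first bytes tells you whether anything was written at all (the file name here is just an example):
ls -l myproject.gephi
xxd myproject.gephi | head
A zero-byte file means the save never wrote any content.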
There must be some bug in the Gephi file writing or reading process.
To identify the problem, it would help if you could post the Gephi log file from when the error happens.
You can find the log file in the Gephi user directory (check http://wiki.gephi.org/index.php/Troubleshooting).
For example, in Windows 7 the path is C:\Users\Your_User\AppData\Roaming\.gephi\dev\var\log\messages.log
Also, if you can share the files, it will be easier to fix.
This could be related to an open bug where Java 6 is used to save the .gephi file and then Java 7 is used to load it, say on a different machine.
The JDK used by Gephi can be specified in etc/gephi.conf (inside the Gephi installation directory), or alternatively it can be passed as the --jdkhome parameter when launching Gephi.
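A hedged sketch of both options (the JDK path is an example and will differ per machine):
# in <gephi install dir>/etc/gephi.conf
jdkhome="/usr/lib/jvm/java-6-openjdk-amd64"

# or per launch, without editing the conf file
./bin/gephi --jdkhome /usr/lib/jvm/java-6-openjdk-amd64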
The problem is with java and javac:
If you created your .gephi file with java-6-openjdk (for example) and then switch your Java to java-7-openjdk, this problem shows up.
I fixed my Gephi by switching back to the same java and javac executables on Linux:
(In terminal)
sudo update-alternatives --config java
and then
(In terminal)
sudo update-alternatives --config javac
Hope this can help!