I'd like to have a simple Solr setup where I can index and search large folders of PDF/DOCX files. I mostly need full-text search; there is no need for separate fields, and the original documents do not have a well-defined structure anyway. I am following https://lucene.apache.org/solr/quickstart.html, which is straightforward; however, when I try to index my own folder of PDF files, some files return an error like:
POSTing file G1504225.pdf (application/pdf) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for
url: http://localhost:8983/solr/gettingstarted/update/extract?
resource.name=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf&literal.id=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int
name="QTime">263</int></lst><lst name="error"><lst name="metadata"><str
name="error-class">org.apache.solr.common.SolrException</str><str
name="root-error-class">java.lang.NumberFormatException</str><str
name="error-class">org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException</str><str name="root-error-class">org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException</str></lst><str name="msg">Async exception during distributed update: Error from server at http://127.0.1.1:8983/solr/gettingstarted_shard2_replica1: Bad Request
request:
http://127.0.1.1:8983/solr/gettingstarted_shard2_replica1/update?update.chain=add-unknown-fields-to-the-schema&update.distrib=TOLEADER&distrib.from=http%3A%2F%2F127.0.1.1%3A8983%2Fsolr%2Fgettingstarted_shard1_replica1%2F&wt=javabin&version=2
Remote error message: ERROR: [doc=/home/solr/solr-6.5.1/../train_data/G1504225.pdf] Error adding field 'title'='United Nations' msg=For input string: "United Nations"</str><int name="code">400</int></lst>
</response>
SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 400 for URL:
http://localhost:8983/solr/gettingstarted/update/extract?
resource.name=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf&literal.id=%2Fhome%2Fsolr%2Fsolr-6.5.1%2F..%2Ftrain_data%2FG1504225.pdf
Most of the files are fine and I can search them. Any ideas?
Solr uses Tika to extract the text from those files. Some types of files, PDFs especially, are hard to parse: the format is complex, and Tika is always trying to catch up with edge cases. So it is normal that some files will throw errors; you have to expect that.
Search for how many instances of NumberFormatException/PDFBox are reported out there (PDFBox is the library Tika uses for PDF files).
If you really want to get all the text from every PDF, even the ones that error out, you can put them in a special folder and process them again, extracting the text yourself with another library. Different libraries will give different results for the same PDF, so you can use the superset of the text that several libraries produce. You will have to write some glue code for this, unless Tika lets you plug in specific libraries for specific file types (I'm not sure whether it does now; it didn't before).
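A minimal sketch of that re-extraction pass, calling Tika programmatically (note that Tika still delegates PDFs to PDFBox, so for files that keep failing you would swap in a different extraction library at the marked spot; the failed_pdfs folder name is a placeholder):

import java.io.InputStream;
import java.nio.file.*;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ReExtractFailedPdfs {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // "failed_pdfs" is a placeholder for the folder holding the rejected files
        try (DirectoryStream<Path> pdfs =
                 Files.newDirectoryStream(Paths.get("failed_pdfs"), "*.pdf")) {
            for (Path pdf : pdfs) {
                // -1 removes the default 100,000-character write limit
                BodyContentHandler handler = new BodyContentHandler(-1);
                try (InputStream in = Files.newInputStream(pdf)) {
                    parser.parse(in, handler, new Metadata(), new ParseContext());
                    System.out.println(pdf + ": extracted " + handler.toString().length() + " chars");
                } catch (Exception e) {
                    // Swap in another extraction library here for the stubborn files
                    System.err.println(pdf + " failed again: " + e.getMessage());
                }
            }
        }
    }
}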
Good Morning,
I'm using the Forge Autodesk Data Visualization API. I'm trying to upload CSV data that is exactly in the format of https://github.com/Autodesk-Forge/forge-dataviz-iot-reference-app/blob/main/server/gateways/csv/Hyperion-1.csv, but what I get is an internal error 500:
Failed to load resource: the server responded with a status of 500 (Internal Server Error)
Hyperion.Data.Adapter.js?73b8:543
SyntaxError: Unexpected token s in JSON at position 0 Hyperion.Data.Adapter.js?73b8:543
eval # Hyperion.Data.Adapter.js?73b8:543
Could the problem be the format of the CSV file? These are the environment variables I have set:
ADAPTER_TYPE= csv
CSV_MODEL_JSON=server\gateways\synthetic-data\device-models.json
CSV_DEVICE_JSON=server\gateways\synthetic-data\devices.json
CSV_FOLDER=server\gateways\csv
CSV_DATA_START= #Format: YYYY-MM-DDTHH:MM:SS.000Z
CSV_DATA_END= #Format: YYYY-MM-DDTHH:MM:SS.000Z
CSV_DELIMITER="\t"
CSV_LINE_BREAK="\n"
CSV_TIMESTAMP_COLUMN="time"
CSV_FILE_EXTENSION=".csv"
This is the code I'm using: https://github.com/Autodesk-Forge/forge-dataviz-iot-reference-app
I have just answered a similar question here: Setting up a CSV Data Adapter locally. When you follow the steps listed in that answer, the sample app should read the CSV data without problems.
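One detail worth double-checking while you follow those steps: CSV_DATA_START and CSV_DATA_END are empty in the variables above, and the adapter presumably needs a time range that overlaps the timestamps in the CSV files. A hedged example of a filled-in configuration (the two dates are placeholders; set them to bracket the times that actually appear in your data):

ADAPTER_TYPE=csv
CSV_MODEL_JSON=server\gateways\synthetic-data\device-models.json
CSV_DEVICE_JSON=server\gateways\synthetic-data\devices.json
CSV_FOLDER=server\gateways\csv
CSV_DATA_START=2011-02-01T00:00:00.000Z
CSV_DATA_END=2011-03-01T00:00:00.000Z
CSV_DELIMITER="\t"
CSV_LINE_BREAK="\n"
CSV_TIMESTAMP_COLUMN="time"
CSV_FILE_EXTENSION=".csv"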
MarkLogic version - 9.0-6.2 (on Windows)
I am following the guide (https://docs.marklogic.com/guide/cpf/quickStart) to perform the sample exercise it provides. After installing CPF on data-hub-FINAL (with data-hub-TRIGGERS as the triggers database), I created a pipeline XML document (as given in the example) on my C drive in the directory C:\copyright. Then, on the admin console, I navigated to Databases --> data-hub-FINAL --> Content Processing --> Pipelines --> Load, and provided the values below.
directory : C:\copyright
filter : *.xml
source : (file system)
However, when I click 'Ok', I get the error message 'Invalid input: No readable XML files found:'.
I verified that the pipeline XML is present and valid in the directory C:\copyright.
Any inputs appreciated!
MarkLogic could not read the XML document because of non-UTF-8 content in it, shown below (note the curly quotes):
<state-transition>
<annotation>
When a document containing ‘book' as a root element is created,
add a ‘copyright' statement.
</annotation>
For now, I removed the annotation from the XML document and successfully loaded the pipeline.
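Replacing the curly quotes with plain ASCII quotes (or re-saving the file with UTF-8 encoding) should also work, without dropping the annotation. To check a file for bytes that are not valid UTF-8 before loading it, a small sketch (the file name is a placeholder):

import java.nio.ByteBuffer;
import java.nio.charset.*;
import java.nio.file.*;

public class Utf8Check {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at your pipeline document
        byte[] bytes = Files.readAllBytes(Paths.get("C:\\copyright\\pipeline.xml"));
        // Report (rather than silently replace) any byte sequence that is not valid UTF-8
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            System.out.println("Valid UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}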
I am trying to load DICOMs from a DICOM server. Loading a single file by its URL works fine.
Now I want to load a whole series of DICOM data. I get the data from the server with an HTTP request, as a zip archive.
I have tried to unzip the response with the zip.js library and pass the unzipped data to the loader.parse function, to load the DICOMs as in the "viewers_upload" example. But I get an error saying the file could not be parsed.
Is there a way to load the data without a URL? Or how do I have to modify the example so that it works for a zip archive?
This is the code that unzips the file and passes one entry to the parser:
reader.getEntries(function (entries) {
  if (entries.length) {
    // get the first entry from the zip file as an ArrayBuffer
    entries[0].getData(new zip.ArrayBufferWriter(), function (dicom) {
      loader.parse({url: "dicomName", dicom});
    }, function (current, total) {
      // progress callback (unused)
    });
  }
});
The error message is:
"dicomParser.readFixedString: attempt to read past end of buffer"
"Uncaught (in promise) parsers.dicom could not parse the file"
I think the problem might be the data type returned for the zip entry. Which type do I have to pass to the parse function, and what structure does the parser expect the data to have? What buffer length does the parser expect?
I have a few ISO8583 logs in a text file. I want to parse these logs from the text file and write them to a database, together with some descriptive information such as the class of the message, the message function, the message origin, the processing code, the response code, etc.
I am new to BASE24/ISO8583 and was trying to find a ready-made parser for this. Is there any such parser available? Does jPOS provide this functionality?
EDIT
I have the logs in ISO8583 format in a ".log" file, as given below:
MTI : 0200
Field-3 : 201234
Field-4 : 000000010000
Field-7 : 0110722180
Field-11 : 123456
Field-44 : A5DFGR
Field-105 : ABCDEFGHIJ 1234567890
This is the same format as given in the link you shared.
The file also contains a hex dump, but I don't want to parse that.
The code given in the link does the packing and unpacking of a message, whereas what I am trying to do is read these logs (already in unpacked form) and write them into a database table.
I think I need to write my own code for this and use the jPOS packagers in it.
It really depends on the format of the log file: are the ISO8583 messages hex strings, is the hex dump an XML representation of ISO8583, or is it some other application's trace format?
Once you know the format (it might require some massaging), you will want to research the ISOMsg.unpack() methods using the appropriate jPOS packager. The packager defines the structure of the various ISO8583 fields and how each field is constructed (lengths, character set, etc.).
A good example is in the "Parse (unpack) ISO Message" section of the following blog post: http://jimmod.com/blog/2011/07/26/jimmys-blog-iso-8583-tutorial-build-and-parse-iso-message-using-jpos-library/
You mention BASE24 - jPOS does have a few packagers that might be a close starting point:
https://github.com/jpos/jPOS/blob/master/jpos/src/dist/cfg/packager/base24.xml
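As a hedged sketch of that unpack approach (it assumes you have the packed message as raw bytes, e.g. decoded from a hex dump, and the base24.xml definition linked above saved locally; nothing here is specific to your logs):

import org.jpos.iso.ISOMsg;
import org.jpos.iso.packager.GenericPackager;

public class UnpackBase24 {
    public static void main(String[] args) throws Exception {
        // Packager definition: the base24.xml linked above, saved locally
        GenericPackager packager = new GenericPackager("cfg/packager/base24.xml");

        byte[] raw = readRawMessage(); // supply the packed message bytes yourself
        ISOMsg msg = new ISOMsg();
        msg.setPackager(packager);
        msg.unpack(raw);

        System.out.println("MTI : " + msg.getMTI());
        System.out.println("F3  : " + msg.getString(3));  // processing code
        System.out.println("F39 : " + msg.getString(39)); // response code
    }

    private static byte[] readRawMessage() {
        // Placeholder: decode the bytes from your hex dump or capture
        throw new UnsupportedOperationException("provide real message bytes");
    }
}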
Those human-readable log formats are usually difficult to parse without losing information. Moreover, the logs are probably PCI compliant, so there's a lot of masked information in them. You want to ask for a hex dump of the messages.
What is displayed in the log file is already parsed ISO, hence you need not use jPOS; jPOS is only for packing and unpacking when you transmit the message.
Assign each field to a variable and write it to the database.
For example, Field 39 is the response code.
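A minimal sketch of that approach for a single message in the "Field-n : value" layout shown above (the table, its columns, and the JDBC URL are all hypothetical):

import java.nio.file.*;
import java.sql.*;
import java.util.*;

public class IsoLogToDb {
    public static void main(String[] args) throws Exception {
        // Parse "MTI : 0200" / "Field-3 : 201234" lines into a map
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Paths.get("iso.log"))) {
            String[] parts = line.split(":", 2);
            if (parts.length == 2) {
                fields.put(parts[0].trim(), parts[1].trim());
            }
        }

        // Hypothetical table: iso_messages(mti, processing_code, response_code)
        try (Connection con = DriverManager.getConnection("jdbc:h2:./isodb");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO iso_messages (mti, processing_code, response_code) VALUES (?, ?, ?)")) {
            ps.setString(1, fields.get("MTI"));
            ps.setString(2, fields.get("Field-3"));  // processing code
            ps.setString(3, fields.get("Field-39")); // response code; null if absent
            ps.executeUpdate();
        }
    }
}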
Using jPOS is a good idea. You should go for your own custom packager class design.
As the title says: my license file contains UTF-8 characters, and by default IzPack's LicensePanel seems to expect ASCII text files.
Is there a solution to this?
UPDATE:
I tried using the "encoding" attribute on my resource line:
<res id="LicencePanel.licence" src="Licence.txt" encoding="utf-8"/>
It didn't work.
I have had a similar problem with my LicencePanel.licence resource. I also have an InfoPanel.Info resource in my installation. Both my info file (readme.txt) and my licence file (licence.txt) are in plain text format. The compiler accepts the readme file but not the licence file when I run the installation.
Perhaps it isn't an encoding problem, since both files are in the same format, yet the info file was accepted and the licence file was not.
Looks like this isn't going to work. I looked at the source for 4.3.5, and it appears to be a bug (maybe it is fixed in a later version). This is the issue, inside LicencePanel.java:
String resNamePrifix = "LicencePanel.licence";
licence = ResourceManager.getInstance().getTextResource(resNamePrifix);
ResourceManager has two methods:
public String getTextResource(String resource, String encoding) throws ResourceNotFoundException, IOException
public String getTextResource(String resource) throws ResourceNotFoundException, IOException
The second one is being used, while the first one (the overload that takes an encoding) should be used.
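For illustration, a sketch of what that fix might look like (untested; it assumes the two-argument overload decodes the resource with the given charset):

// LicencePanel.java - pass an explicit encoding instead of relying on the platform default
String resNamePrifix = "LicencePanel.licence";
licence = ResourceManager.getInstance().getTextResource(resNamePrifix, "UTF-8");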
Edit: I just checked 5.0.0-rc1 and I think the issue occurs there too (I didn't test, just glanced at the code).