Solr File Indexing map content by pages

I would like to index files in Solr.
I have already made an "output script" with PHP, but my project leader has given me the task of displaying the page number of the found text.
So:
- I search for the word "Foo".
- Solr returns the results and also the highlighted text.
- Now I would like to know which page this highlighted text is on, so that I can find it.
The files are *.pdf files.
One solution I have thought of would be to import the text of each PDF page into a separate field, or maybe into a single multivalued field named "content".
Maybe like this:
Json:
content:
1: "page one text",
2: "page two text"
and so on?
Is this possible? Or is there a better way to find this information out? Thanks for your help! :-)

You need to create a separate Solr document for every page of every PDF file, so that each hit carries its own page number. If you want to return only one result per file, you can use field collapsing (result grouping) to group all the results from the same PDF file.
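A minimal sketch of the one-document-per-page idea, assuming the page texts have already been extracted from the PDF. The field names `file_id` and `page` and the file name `manual.pdf` are hypothetical and would have to exist in your schema; `group` and `group.field` are the standard Solr result grouping parameters.

```python
import json
from urllib.parse import urlencode


def page_documents(file_id, pages):
    """Build one Solr document per page; `pages` is a list of page texts."""
    return [
        {
            "id": f"{file_id}_p{number}",  # unique id per page
            "file_id": file_id,            # hypothetical field shared by all pages of one PDF
            "page": number,                # hypothetical field: 1-based page number to display
            "content": text,
        }
        for number, text in enumerate(pages, start=1)
    ]


def grouped_query(term):
    """Query string that groups the page hits by the PDF they belong to."""
    return urlencode({
        "q": f"content:{term}",
        "hl": "true",
        "hl.fl": "content",
        "group": "true",
        "group.field": "file_id",
    })


docs = page_documents("manual.pdf", ["page one text", "page two Foo text"])
print(json.dumps(docs, indent=2))  # body for a JSON update request
print(grouped_query("Foo"))
```

Each hit then carries its own `page` value, which the PHP output script can print next to the highlighted snippet.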

Related

Edit a Mainframe file in the RecordEditor without a copybook

How do you edit a (binary EBCDIC) mainframe file in the RecordEditor without a Cobol copybook?
How do you generate Java code to read the file using the RecordEditor?
Note: This is an attempt to split a question that is far too broad to answer meaningfully
into a series of simpler questions and answers.

Try to avoid editing a binary file without a Cobol copybook if at all possible. This should only be attempted as a last resort!
Try to get that Cobol copybook (or some field layout document) for the file!
Some general advice:
It is feasible when dealing with 10 to 20 fields in a record, but not if there are thousands of fields in a record.
Take your time and do not rush the process. Try to get each step correct before moving on.
Finally, upgrade to the latest version of the RecordEditor (currently 0.98.4).
This process will also work for normal text files.
RecordEditor Layout Wizard
To start the wizard select option Record Layouts >>> Layout Wizard.
File Structure screen
The file structure screen has 3 purposes:
- Get the file structure: it could be Fixed Width, VB, or a Windows/Unix text file.
- Get the record length (if it is a fixed-width file).
- Get the font (character set / encoding).
The RecordEditor will try to work this out for you.
Field Selection Screen
The RecordEditor will try and work out where fields start and end but
it is not perfect. You need to carefully check and correct its choices
On this screen, the fields are displayed in alternating colors
you create/delete a field by clicking on
use the Clear Fields button to clear all the fields
you can change which field types to search for using the various check boxes (e.g. Mainframe Zoned Decimal)
the Add Fields button will do another field search
Field Definition screen
On this screen you define the field names and types. You may need to go back to the Field Selection screen to adjust the fields.
Editing the file
Once the Record Layout has been defined, it can be used on the open file screen
Generating Java code
When editing your file, you can generate Java/JRecord code to read the file
by selecting Generate >>> Java >>> ....
You can then enter a package-id and the generate options,
and finally your sample Java code is generated to read / write the
file.

Netsuite PDF Templating: get number of pages as attribute

I am templating PDFs in Netsuite using Freemarker and I want to display the footer only on the last page. I have been doing some research but couldn't find a solution (it looks like the environment does not allow me to include or import libs), so I thought that simply comparing the current page number with the total number of pages in an if tag would be a nice and easy workaround. I already know how to display the numbers using the <pagenumber/> and <totalpages/> tags, but I still cannot get them as values so that I can use them like this:
<#if (pagenumber == totalpages) >
... footer html...
</#if>
Any ideas on how or where I can get those values from?

The approach you are trying won't work, because you are mixing BFO and Freemarker syntax. Netsuite uses two different "engines" to process PDF Templates. The first step is Freemarker, which merges the record fields with your template and produces an XML file, which is then converted by BFO into a PDF file. The <totalpages/> element is meaningless to Freemarker, as it is only converted into a number by BFO later.
Unfortunately, adding a footer to only the last page of a document is currently not possible in BFO, as per the BFO FAQ:
At the moment we do not have a facility for explicitly assigning a
footer or header to the last page in a document when the number of
pages is unknown.

You can add it after a page break, by putting the page break at the end of the body:
<pbr footer="nlfooter" footer-height="25%"></pbr>
</body>
The issue here is that a one-page output will always become at least 2 pages, because a page is always added for the disclaimer/footer.

How to use one template for multiple pages in an XWPFDocument with Java

I would like to know how I can reuse one template (containing one page and some variables) multiple times in an XWPFDocument object.
My idea is:
- load the template once into an XWPFDocument as a template object
- clone/create/copy the template object with all its styles, headers, etc.
- fill the clone with content
- add this clone to the destination XWPFDocument
I got this to work for one single page only.
When I try to clone/create/copy the template object, it loses all its style information.
How to copy a paragraph of .docx to another .docx with Java and retain the style
How to copy some content in one .docx to another .docx , using POI without losing format?

POI probably does not support this out of the box, but I have done a similar thing in my project poi-mail-merge. It works with the underlying XML to repeatedly replace markers in a template Microsoft Word document and combine the results into one resulting document.
So it basically duplicates the template document multiple times into the resulting document.
See here for how I do it there: basically I work on the XML body text, do the replacements/changes there, and then append it to the result document.

POI Mail Merge probably helps in other cases, but in my case it doesn't work.
My workaround is to first update my template XWPFDocument to the needed structure, save it temporarily, and read it back into an XWPFDocument object.
Here are the steps:
1. Read the template file into an XWPFDocument.
2. Read the records from the data file, e.g. CSV.
3. Calculate the number of pages needed for the data records.
4. Get the body element objects from the template XWPFDocument.
5. Create new body elements (depending on the number of pages) in the template XWPFDocument and replace them with the same objects obtained before.
6. Save the updated template XWPFDocument temporarily.
7. Read the temporarily saved template into an XWPFDocument.
8. Replace all placeholders and fill them with your CSV data.
Hope this helps somebody.

Indexing Multiple documents and mapping to unique solr id

My use case is to index 2 files, a metadata file and a binary PDF file, under one unique Solr id. The metadata file is an XML file, and some schema fields are mapped to elements in that XML file.
What I do: extract the content from the PDF files (using pdftotext), process that content, and retrieve specific information (example: the PDF's first page/line has information about the medicine and research stage). The retrieved information (medicine/research stage) needs to be indexed, and one should be able to search/sort/facet on it.
I can create an XML file with the retrieved information (let's call this the metadata file). Now assuming my schema would be
<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage" ... />
Is there a way to put this metadata file and the PDF file in Solr?
What I have tried:
Based on a suggestion in the archives, I zipped these files and gave them to the ExtractingRequestHandler. I was able to put all the content into Solr and make it searchable, but it appears as the content of the zip file. (I had to apply some patches to the Solr code base to make this work.) This is not sufficient, as the content of the metadata file is not mapped to field names.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@file.zip"
I tried to work with the DataImportHandler (BinURLDataSource), but I don't think I understand how it works, so I could not get far.
I thought of adding metadata tags to the PDF itself. For this to work, the ExtractingRequestHandler would have to process this metadata. I am not sure of that either.
So I tried "pdftk" to add metadata. I was not able to add custom tags with it; it only updates/adds title/author/keywords etc. Does anyone know of a similar Unix tool?
If someone has tips, please share.
I want to avoid creating 1 file (by merging the PDF text + metadata file).

Given a file record1234.pdf and metadata like:
<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>
Do the programmatic equivalent of
curl "http://localhost:8983/solr/update/extract?literal.id=record1234.pdf&literal.field1=value1&literal.field2=value2&literal.field3=value3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" -F "myfile=@record1234.pdf"
Adapted from http://wiki.apache.org/solr/ExtractingRequestHandler#Literals .
This will create a new entry in the index containing the text output from Tika/Solr Cell as well as the fields you specify.
You should be able to perform these operations in your favorite language.
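As a rough illustration of doing this programmatically, the sketch below turns each child element of the metadata XML into a `literal.<name>` parameter and builds the request URL. It only constructs the URL; the PDF itself still has to be sent as the multipart body of the POST, as the curl `-F` option does. The function name `extract_url` is made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

METADATA_XML = """<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>"""


def extract_url(base_url, doc_id, metadata_xml):
    """Turn each child of <metadata> into a literal.<name>=<value> parameter."""
    params = [("literal.id", doc_id)]
    for element in ET.fromstring(metadata_xml):
        params.append((f"literal.{element.tag}", (element.text or "").strip()))
    params.append(("commit", "true"))
    # urlencode takes care of escaping values that contain spaces or '&'
    return f"{base_url}?{urlencode(params)}"


url = extract_url("http://localhost:8983/solr/update/extract",
                  "record1234.pdf", METADATA_XML)
print(url)
```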
"the content in metadata file is not mapped to field names"
If they don't map to a predefined field, then use dynamic fields. For example, you can define *_i as an integer field.
"I want to avoid creating 1 file(by merging PDF text + metadata file)."
That looks like programmer fatigue :-) But do you have a good reason?

Search only particular index

I am using Apache Solr for my search, and I am indexing a variety of resources with it (such as PDF and MS Word documents).
If, say, the user enters a query like "PDF: java", then I want to search only the PDF files.
Any ideas?
Thanks
Dilip.

As I said in my comment: set up a filetype field (type string) in your schema and set it when you upload the file.
http://localhost:8983/solr/update/extract?literal.id=123&literal.filetype=pdf
and when you search
http://localhost:8983/solr/select?q=text:electrical design AND filetype:pdf
Quick hack: if your documents are identified by filename, you can tell Solr to limit results to those ending in *.pdf by
http://localhost:8983/solr/select?q=text:electrical design AND id:*.pdf
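A small sketch of how a user query such as "PDF: java" could be translated into request parameters, assuming the `filetype` field described above. The `FILETYPES` mapping and the function name `build_params` are hypothetical; `fq` is Solr's standard filter query parameter, used here as an alternative to appending `AND filetype:pdf` to `q`. Special query characters in the user input are not escaped in this sketch.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a user-typed prefix to the stored filetype value.
FILETYPES = {"pdf": "pdf", "doc": "doc"}


def build_params(user_query):
    """Split 'PDF: java' into a main query and a filetype filter."""
    prefix, sep, rest = user_query.partition(":")
    filetype = FILETYPES.get(prefix.strip().lower()) if sep else None
    if filetype is None:
        # No known prefix: search all file types.
        return {"q": f"text:({user_query.strip()})"}
    return {"q": f"text:({rest.strip()})", "fq": f"filetype:{filetype}"}


params = build_params("PDF: java")
print("http://localhost:8983/solr/select?" + urlencode(params))
```

Using `urlencode` also takes care of the spaces in multi-word queries, which a raw URL cannot contain.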
