Indexing multiple documents and mapping to a unique Solr id - PDF

My use case is to index two files, a metadata file and a binary PDF file, under one unique Solr id. The metadata file is an XML file, and some schema fields are mapped to elements in that XML.
What I do: extract content from the PDF files (using pdftotext), process that content, and retrieve specific information (for example, the PDF's first page/line has information about the medicine and research stage). The retrieved information (medicine/research stage) needs to be indexed, and one should be able to search/sort/facet on it.
I can create an XML file with the retrieved information (let's call this the metadata file). Now, assuming my schema would be
<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage". ../>
Is there a way to put this metadata file and the PDF file in Solr?
What I have tried:
Based on a suggestion in the archives, I zipped these files and gave them to the ExtractingRequestHandler. I was able to put all the content in Solr and make it searchable, but it appears as the content of the zip file (I had to apply some patches to the Solr code base to make this work). This is not sufficient, as the content of the metadata file is not mapped to field names.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=#file.zip"
I tried to work with the DataImportHandler (BinURLDataSource), but I don't think I understand how it works, so I could not get far.
I thought of adding metadata tags to the PDF itself. For this to work, the ExtractingRequestHandler would have to process that metadata; I am not sure it does.
So I tried pdftk to add the metadata, but I was not able to add custom tags with it; it only updates/adds title/author/keywords etc. Does anyone know a similar Unix tool?
If someone has tips, please share.
I want to avoid creating one file (by merging the PDF text and the metadata file).

Given a file record1234.pdf and metadata like:
<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>
Do the programmatic equivalent of
curl "http://localhost:8983/solr/update/extract?literal.id=record1234.pdf\
&literal.field1=value1\
&literal.field2=value2\
&literal.field3=value3\
&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" \
  -F "tutorial=@tutorial.pdf"
Adapted from http://wiki.apache.org/solr/ExtractingRequestHandler#Literals.
This will create a new entry in the index containing the text output from Tika/Solr Cell as well as the fields you specify.
You should be able to perform these operations in your favorite language.
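For instance, a minimal SolrJ sketch of the same request (the core URL is an assumption, and with an older SolrJ the client class will differ):
import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class IndexPdfWithLiterals {
    public static void main(String[] args) throws Exception {
        // Core URL is an assumption; point it at your own core.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("record1234.pdf"), "application/pdf");

        // Each literal.* parameter maps one metadata value onto a schema field,
        // exactly like the curl command above.
        req.setParam("literal.id", "record1234.pdf");
        req.setParam("literal.field1", "value1");
        req.setParam("literal.field2", "value2");
        req.setParam("literal.field3", "value3");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}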
the content in metadata file is not mapped to field names
If they don't map to a predefined field, then use dynamic fields. For example, you can set *_i to be an integer field.
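A minimal schema.xml sketch of such dynamic fields (the *_txt suffix is illustrative):
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text" indexed="true" stored="true"/>
You could then pass literal.researchStage_txt=... on the extract request without declaring the field in advance.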
I want to avoid creating one file (by merging the PDF text and the metadata file).
That looks like programmer fatigue :-) But do you have a good reason?

Related

How to write WIC XMP people tags to jpg?

I have images with people-tagging information in XML format. I wish to edit this information and also add it to pictures that do not yet have it. By looking at the XML, I assume it is based on the people tagging used in the Microsoft Windows Imaging Component.
I haven't quite understood the format, but I understand it so far that I can alter or generate the XML; I just do not know where to write it in the image. I am probably making some stupid mistake, because I am not experienced with this kind of image metadata. So if you think I'm on the wrong track and this can be done much more simply, please tell me.
In those images that already contain this XML, I can use search and replace to update it. However, I have a lot of pictures that do not yet contain that information, and I do not know where inside the image I should write it.
Images that already contain this information can be read with exiftool as follows:
exiftool -xmp -b existingTags.JPG
The result is the following xml:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:MP="http://ns.microsoft.com/photo/1.2/"
        xmlns:MPRI="http://ns.microsoft.com/photo/1.2/t/RegionInfo#"
        xmlns:MPReg="http://ns.microsoft.com/photo/1.2/t/Region#"
        xmp:Rating="0">
      <dc:subject>
        <rdf:Bag>
          <rdf:li>Valeriya</rdf:li>
        </rdf:Bag>
      </dc:subject>
      <MP:RegionInfo rdf:parseType="Resource">
        <MPRI:Regions>
          <rdf:Bag>
            <rdf:li MPReg:Rectangle="0.48, 0.418, 0.059333, 0.089" MPReg:PersonDisplayName="findus_l"/>
          </rdf:Bag>
        </MPRI:Regions>
      </MP:RegionInfo>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
However, I cannot write the information using exiftool. When I run this command, it simply reads the information again instead of writing the contents of the file to the image:
exiftool -xmp<=alteredXMP.txt existingTags.JPG
A bit of research has shown me that exiftool can only write specific XMP tags, and the people-tagging tags from the Windows Imaging Component do not seem to be among them.
Where in the image file should I write the information? Can I somehow find this spot programmatically and then just insert the xml there?
I am using Kotlin as programming language but I don't mind having to call command line functions or other programs.
Background: I have a Synology DiskStation and use the included software called Photo Station. Photo Station supports tagging people on images and uses this format. I like Photo Station in many ways, but its face recognition is bad, so I want to use my own face recognition while keeping Photo Station able to read the results.
The data you are trying to write is part of the Microsoft Region Structure. XMP structured data is a complex subject, but you should be able to add the data with exiftool by writing region names to the RegionPersonDisplayName tag and the region dimensions to RegionRectangle. Using the data in your example, the command would be:
exiftool -RegionPersonDisplayName=findus_l -RegionRectangle="0.48, 0.418, 0.059333, 0.089" /path/to/files
If you have to write multiple regions, you can just add them on, but you must keep the names and the matching dimensions in the same order. For example:
exiftool -RegionPersonDisplayName=findus_l -RegionRectangle="0.48, 0.418, 0.059333, 0.089" -RegionPersonDisplayName="John Smith" -RegionRectangle="0.37645533, 0.04499886, 0.35111009, 0.26633097" /path/to/files
These commands would overwrite any existing region data. If you are adding new names without overwriting, change the equal signs to plus-equal (+=).
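For example, appending a second region to files that already have one (the name and rectangle here are made up):
exiftool -RegionPersonDisplayName+="Jane Doe" -RegionRectangle+="0.10, 0.20, 0.050, 0.080" /path/to/files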

Xpages File Download collect and display extra information

I'm using the file upload and download controls. I understand how to use the provided display columns, but how would I go about collecting other information about each uploaded file and then displaying it (e.g. a display name and notes that the user would enter)?
<xp:fileUpload id="fileUpload1"
value="#{document1.files}" style="width:80%"
useUploadname="false">
<xp:eventHandler event="onchange"
submit="true" refreshMode="complete"
disableValidators="true">
</xp:eventHandler>
</xp:fileUpload>
<xp:br></xp:br>
<xp:fileDownload rows="30" id="FD1"
displayLastModified="false" value="#{document1.files}"
style="width:98%" hideWhen="true" displayType="false"
displayCreated="true" rules="all"
lastModifiedTitle="Last Modified">
<xp:this.allowDelete><![CDATA[${javascript:database.queryAccessRoles(session.getEffectiveUserName()).contains('[Admin]')}]]></xp:this.allowDelete>
</xp:fileDownload>
If I understand your question correctly: you want to add additional information columns into the file download control that are derived from information stored or computed elsewhere, e.g. from a NotesItem (a field in the Notes document)?
In this case you need to construct your own output using a repeat control. You can render a table or a list - whatever you deem fit for display.
The “trick” is how to construct the URL for download - which is simply:
/yourdatabase.nsf/0/unid/AttachmentName?OpenAttachment
(typed from memory; you might need to double-check the syntax).
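A rough sketch of such a repeat (getAttachmentList and the displayName/notes items are assumptions; as said, double-check the URL syntax):
<xp:repeat id="repeatFiles" var="att"
    value="#{javascript:document1.getAttachmentList('files')}">
    <xp:div>
        <!-- Download link built from the URL pattern above -->
        <xp:link text="#{javascript:att.getName()}">
            <xp:this.value><![CDATA[#{javascript:
                "/" + database.getFilePath() + "/0/"
                + document1.getDocument().getUniversalID()
                + "/" + att.getName() + "?OpenAttachment"}]]></xp:this.value>
        </xp:link>
        <!-- Extra columns the user filled in, stored as ordinary items -->
        <xp:text value="#{document1.displayName}"></xp:text>
        <xp:text value="#{document1.notes}"></xp:text>
    </xp:div>
</xp:repeat>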
Word of caution: if you have lots of attachments, you might consider having separate documents for them and using a view - the above URL works in views too. That saves you a versioning headache (in case multiple users can upload to the same document).
Let us know how it goes

Sensenet: sort by file type on List View

Is it possible to sort the content on a list view by file type?
For instance I want to see all the doc files first, then the folders, then the pdf files, etc.
In Sense/Net, to sort by a value you have to have a dedicated field for that value, because our search engine can sort only by fields.
Unfortunately the value of the file extension is not stored in a dedicated field like that; however, it is possible to display it. You need to add a new "Extension" field to the CTD of the files:
<Field name="Extension" type="ShortText">
<DisplayName>Extension of the file.</DisplayName>
</Field>
and define a ContentHandler, inherited from File, that has access to the file name and can retrieve the extension. For further details, visit the Sense/Net wiki: http://wiki.sensenet.com/How_to_create_a_ContentHandler

Solr File Indexing map content by pages

I would like to index files in Solr.
I have already made an "output script" with PHP, but my project leader has given me the task of displaying the page number of the found text.
So:
- I search for the word "Foo".
- Solr returns the results and also the highlighted text.
- Now I would like to know which page this highlighted text is on, in order to find it.
The files are *.pdf files.
One solution I have thought of would be to import the text of the PDF files into different fields, or maybe into one multivalued field named "content".
Maybe like this (JSON):
{
  "content": {
    "1": "page one text",
    "2": "page two text"
  }
}
and so on?
Is this possible? Or is there a better way to find this information out? Thanks for your help! :-)
You need to create a separate Solr document for every page of every PDF file. If you want to return only one result per file, then you can use FieldCollapsing to group all the results from the same PDF file.
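For instance (the field names file_s and page_i are made up, and the JSON update format assumes Solr 4 or later):
curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id": "book1_p1", "file_s": "book1.pdf", "page_i": 1, "content": "page one text"},
       {"id": "book1_p2", "file_s": "book1.pdf", "page_i": 2, "content": "page two text"}]'
curl "http://localhost:8983/solr/select?q=content:Foo&hl=true&hl.fl=content&group=true&group.field=file_s"
The second query collapses the per-page hits to one group per file, and the top document in each group carries the page_i of the best-matching page.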

Lucene query that eliminates xml tags in full text search

In Alfresco I need to write a Lucene query in such a way that it eliminates/excludes the XML tags from the content while searching.
For example, if a file try.xml is searched against its content, the search should not match the XML tags.
try.xml
<sample>This is an example</sample>
If I give the search text as "sample", it should not return the file "try.xml", because "sample" appears only as a tag name.
So how could I achieve this?
Edit
I have tried the query below, with no change.
#cm\:name:"try*" -TEXT:"<*>" +TEXT:"sample"
What's wrong with the above query? I am trying to get files whose name starts with "try", eliminate the text inside tags, and search for the text "sample".
By default Alfresco treats XML files as plain text and indexes the XML tags as words; that is why they can be found via full-text search. XML content is handled by the StringExtractingContentTransformer in Alfresco, which converts text/xml to text/plain before indexing it.
To check which transformers are registered in your Alfresco installation you can check
http://localhost:8080/alfresco/service/mimetypes?mimetype=text/xml#text/xml
To prevent the indexing of the XML tags, you have to write a special transformer which strips them out. See http://wiki.alfresco.com/wiki/Content_Transformations for an introduction to content transformation with Alfresco. The easiest way would be to integrate a command-line utility that converts the XML file into text, or you could implement a Java class which does the transformation.
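As a rough sketch of the transformation step itself (the Alfresco wiring is omitted; this only shows how a Java class could keep the text nodes and drop the tags):
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlTextExtractor {

    // Parse the XML and return only the concatenated text nodes,
    // so tag names like "sample" never reach the index.
    public static String extractText(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // For try.xml containing <sample>This is an example</sample>
        // this prints "This is an example" - without the word "sample".
        System.out.println(extractText(new File(args[0])));
    }
}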
There's no standard way to do what you need, here's an excerpt of the official documentation:
Wildcard queries using * and ? are supported as terms and phrases. For tokenized fields the pattern match cannot be exact, as all the non-token characters (whitespace, punctuation, etc.) will have been lost and treated as equal.
Basically, angle brackets are stripped out by default. You would need to hack the indexing and query-parsing processes in order to enable the behavior you want.
Could you not just exclude the XML mimetype? (See http://wiki.alfresco.com/wiki/Search#Finding_nodes_by_content_mimetype for the syntax.)
I guess you might want to exclude HTML too (so you'd exclude both text/html and text/xml); that would prevent you from getting any nodes in your results that contain XML tags.
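Following the syntax on that wiki page, the combined query might look like this (an unverified sketch):
+TEXT:"sample" -@cm\:content.mimetype:"text/xml" -@cm\:content.mimetype:"text/html"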