How to index a blob field in Apache Solr?

I am using Apache Solr to index my data. I have a blob field that I want indexed too, but I don't know which fieldType to declare for it in schema.xml.
I tried the following:
<field name="abstract" type="text" indexed="true" stored="true" required="true" />
but when I search, the field is shown like this:
id, abstract, title, price, publishedDate
1, [B#1e9b7b2, Spain Consumer, 3795.0, 2009-01-19T18:30:00Z
'abstract' is my blob field, which is nothing but a big string. I want text search on that field, but after indexing it shows up as above.
What can I do?
Thanks in advance.

The Solr FAQ covers this case for blobs: http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5
You can also look at searching-rich-format-documents-stored-dbms.
There was a JIRA issue for contributing a BlobTransformer, but it does not seem to have made it into the code base. You can probably take the transformer from the patch and use it yourself.
I am not sure whether it has been renamed or refactored in the current versions.
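The idea behind such a transformer can be sketched outside Solr: convert the raw bytes to text before the value is put into the document, so the index never sees the object string. This is a minimal Python sketch, not the BlobTransformer patch itself; the function name, the sample row, and the assumption that the blob holds UTF-8 text are all hypothetical.

```python
def blob_to_text(value, encoding="utf-8"):
    """Decode a BLOB column value to text; pass other values through unchanged."""
    if isinstance(value, (bytes, bytearray, memoryview)):
        # Assumption: the blob really contains text in the given encoding.
        return bytes(value).decode(encoding, errors="replace")
    return value


# Hypothetical row, shaped like what a database driver might return.
row = {"id": 1, "abstract": b"A long abstract stored as a blob", "title": "Spain Consumer"}

# Build the document to index with every blob already decoded.
doc = {name: blob_to_text(value) for name, value in row.items()}
```

With the value decoded up front, the 'abstract' field can be declared as an ordinary text field.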

Related

Extract data from XML string in Hive Table without using XPath

I am trying to use a view to extract a string value from a large XML string that sits in a single column of a Hive table. I need to get the associated FOO_STRING_VALUE for COMPANY_ID, SALE_IND, and CLOSING_IND.
<Message>
<Header>
<FOO_STRING>
<FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
<FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
</Header>
</Message>
The XML can have up to 50 FOO_STRING elements, and there is no guarantee what order they will be in, so I cannot use XPath unless I make 50 xpath_string calls, one per name/value pair, and match them up later. I am currently using XPath like this:
xpath_string(xml_txt, '/Message/Header/FOO_STRING[1]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[2]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[3]/FOO_STRING_VALUE') AS String_Val_3
However, if the order changes then it doesn't work. Is there a quick way to find the FOO_STRING_NAME I need and get the corresponding value, using regexp_extract() or some other way? I am not familiar with regex, so any help or suggestions would be appreciated. Thanks a ton.
"if the order changes then it doesn't work"
Don't use position, then.
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="COMPANY_ID"]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="SALE_IND"]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="CLOSING_IND"]/FOO_STRING_VALUE') AS String_Val_3
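To check that selecting by name is independent of element order, the same predicate can be tried outside Hive. The sketch below uses Python's xml.etree.ElementTree, which supports this child-text predicate in its limited XPath subset; the helper name value_for and the reordered sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same structure as in the question, with the elements deliberately reordered.
XML_TXT = """<Message>
<Header>
<FOO_STRING>
<FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
<FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
</Header>
</Message>"""


def value_for(xml_txt, name):
    """Return the FOO_STRING_VALUE whose sibling FOO_STRING_NAME equals name."""
    root = ET.fromstring(xml_txt)  # root is the <Message> element
    # Assumption: name contains no quote characters.
    path = "./Header/FOO_STRING[FOO_STRING_NAME='{}']/FOO_STRING_VALUE".format(name)
    return root.findtext(path)  # None when no element matches


company_id = value_for(XML_TXT, "COMPANY_ID")
```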

How do I get empty fields in SOLR indexed for a schemaless collection?

How do I get empty fields in SOLR indexed? I am using solr 7.2.0
I am using schemaless SOLR to try to index everything as string, but for files with empty fields, those fields do not get indexed. Is there a way to get them to show up?
col1,col2,col3
a,,1
d,e,
g,h,3
For example, the first row shows up as
{
"col1":"a",
"col3":"1",
}
I'm trying to also get col2 to show up.
In my solrconfig.xml I have this:
<dynamicField name="*" type="text_general" indexed="true" stored="true" required="true" default="" />
and I have removed every trace of the remove-blank processor from my config. I've reloaded, and deleted and recreated my collection, multiple times. Is there a solution for this?
The CSV import module has its own option to keep empty fields: f.<field name>.keepEmpty=true.
If you don't give that option, the CSV handler never passes the empty field value on to the next step of your indexing process.
Giving f.col2.keepEmpty=true as a URL argument should at least give you a better starting point.
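Since the option is set per field, the URL arguments can be generated for every column rather than typed by hand. A minimal Python sketch follows; the host, the collection name mycollection, and the helper name are assumptions, and the code only builds the URL without sending anything.

```python
from urllib.parse import urlencode

# Hypothetical update endpoint; adjust host and collection to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def csv_update_url(fields, base_url=SOLR_UPDATE_URL):
    """Build an update URL that asks the CSV handler to keep empty values."""
    params = [("f.{}.keepEmpty".format(name), "true") for name in fields]
    return base_url + "?" + urlencode(params)


url = csv_update_url(["col1", "col2", "col3"])
```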
Maybe preprocess your CSV file like this:
s/,,/, ,/g
That is, add a space between the two commas (you will have to deal with the last value on each line separately, though; there is a regex for that).
Then try again. Right now Solr reads the value as nonexistent. Making it a space gives it a better chance of making it through, and should not change search results (unless you have some unusual analysis chains).
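The substitution can also be done with a CSV parser instead of a regex, which covers the last column and runs of several empty values in a row (a single pass of s/,,/, ,/g turns ",,," into ", ,," and leaves one field empty). This is a hedged Python sketch of that preprocessing step; the function name and the choice of a single space as filler are assumptions.

```python
import csv
import io


def fill_empty_fields(csv_text, filler=" "):
    """Return csv_text with every empty field replaced by filler."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([cell if cell != "" else filler for cell in row])
    return out.getvalue()


sample = "col1,col2,col3\na,,1\nd,e,\ng,h,3\n"
filled = fill_empty_fields(sample)
```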

Solr 7.1: Querying Double field for any value not possible with * anymore

I recently upgraded from Solr 6.6 to 7.1 and cannot query Double fields for any value anymore using
q: test_d:*
(zero results although the field is set). However,
q: test_d:[* TO *]
works. This seems to affect all numeric field types (tested for Integers, Floats and Doubles). For String, Text, Boolean fields the single asterisk works just fine like before.
Is it possible to reconfigure Solr to get the old behavior back, or do I have to rewrite all queries and introduce a switch for numeric field types? Until now, no differentiation by field value type was needed (which is good!).
Minimal Working Example
Use the example-DIH-solr core supplied with the Solr distributable, push the document
{"id":"foo","test_b":true,"test_i":42,"test_f":42.0,"test_d":42.0}
and use
q: test_b:*
q: test_d:*
q: test_i:*
q: test_f:*
Only the query for the Boolean field will yield a result.
The definition of the double field type changed. To restore the old behaviour, you can use or change this:
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
and add back the double field type definition to the schema:
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
This worked in the past, but most likely by accident. See https://issues.apache.org/jira/browse/SOLR-11746 for the Solr issue that tracks this.
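If changing the schema is not an option, the other route from the question is to build every "has any value" clause with the range form, which the question reports as working for the numeric fields. The Python sketch below assumes the range form is acceptable for your other field types as well, so no per-type switch is needed; the helper name is hypothetical.

```python
def has_value_clause(field):
    """Return a query clause that matches documents with any value in field."""
    return "{}:[* TO *]".format(field)


# Clauses for the fields of the minimal working example.
clauses = [has_value_clause(name) for name in ("test_b", "test_i", "test_f", "test_d")]
```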

Sitecore 7 Lucene: strip HTML from computed field

I am pasting together all "paragraph" child nodes from an "article" node in a computed field. This is to achieve that an article can be searched & found by its paragraph contents.
To achieve this, I did the following, under the <fields hint="raw:AddComputedIndexField"> node:
<field fieldName="Paragraphs" storageType="YES" indexType="TOKENIZED">
MyWebsite.ComputedFields.Paragraphs,MyWebsite
</field>
In this computed field, I concat the paragraph HTML bodies together.
I was assuming Sitecore would strip the HTML for me (like it does for rich text fields), but it does not.
For "rich text" fields, it is probably the RichTextFieldReader that strips the HTML tags out. Decompiling the code confirms this.
The RichTextFieldReader is configured in the FieldReaders section. Adding a raw:AddFieldReaderByFieldName section below it does not seem to do anything.
The full section looks as follows, but does not work in this setup:
<FieldReaders type="Sitecore.ContentSearch.FieldReaders.FieldReaderMap, Sitecore.ContentSearch">
<mapFieldByTypeName hint="raw:AddFieldReaderByFieldTypeName">
....default stuff here...
</mapFieldByTypeName>
<mapFieldByFieldName hint="raw:AddFieldReaderByFieldName">
<fieldReader fieldName="Paragraphs" fieldReaderType="Sitecore.ContentSearch.FieldReaders.RichTextFieldReader, Sitecore.ContentSearch"></fieldReader>
</mapFieldByFieldName>
</FieldReaders>
Any other clues on how to achieve this (by config, not by using HTML Agility Pack etc.)?
The problem is that mapFieldByFieldName expects to match a field with that name on the Sitecore item, not a custom computed field in your index, so the field reader is never called.
I don't know how to achieve this from config. If you do not want to use HAP directly but are willing to write some code, then after you concatenate the fields in your computed field class, do what Sitecore does in its GetPlainText() method:
string input = "concatenated string";
return HttpUtility.HtmlDecode(Regex.Replace(input, "<[^>]*>", string.Empty));
Alternatively, use the utility method Sitecore.StringUtil.RemoveTags(text).
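To see what the regex-plus-decode step does to concatenated paragraph markup, here is the same two-step idea as a standalone Python sketch. It is an analogue for illustration, not Sitecore code; strip_html is a made-up name.

```python
import html
import re

# Same pattern as in the snippet above: anything between < and >.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_html(text):
    """Remove tags, then decode HTML entities such as &amp;."""
    return html.unescape(TAG_PATTERN.sub("", text))


plain = strip_html("<p>Fish &amp; chips</p><p>Second paragraph</p>")
```

Note that the text of adjacent paragraphs ends up joined without a separator, so it may be worth adding a space between bodies when concatenating them.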

Location aware search

I am trying location aware search with spatial example found in
http://www.ibm.com/developerworks/java/library/j-spatial/#indexing.approaches.
The schema.xml has a geohash field, but this field is not present in any of the .osm files (in the data folder) that are used for indexing. I do not understand how a value gets assigned to it, such that when I run this query
http://localhost:8983/solr/select/?q=_val_:"recip (ghhsin(geohash(44.79, -93), geohash, 3963.205), 1, 1, 0)"^100
the result set contains a geohash value. How does this happen? Please help.
The Solr wiki has a pretty good page on how spatial search can be done with Solr 1.5+.
To summarize, your schema defines 'geohash' typed fields:
<fieldtype name="geohash" class="solr.GeoHashField"/>
<field name="destination" type="geohash" indexed="true" stored="true"/>
Data feeders pass in geohashed coordinates:
<field name="destination">cbj1pb56p4b</field> <!-- 45.17614 -93.87341 -->
You probably should go back to using simple latitude and longitude coordinates to start off with. There are better docs for it.
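For reference, a geohashed value like the one in the field example is derived from the latitude and longitude alone, which is why it does not have to be present in the source data. The following Python sketch implements the standard geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character). It is an illustration of the encoding, not the code the article or Solr uses, and the function name is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=11):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every five bits become one character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


prefix = geohash_encode(45.17614, -93.87341, precision=3)
```

The three-character prefix for the coordinates in the comment above agrees with the start of the value shown in the field example.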