nested field in Solr 5.2 - indexing

I'm new to Solr and I have a very specific problem that I need to solve:
I have a CSV file that contains my Solr documents. One column (field) is not only multiValued but also contains 'subfields'.
For example:
"id":"0101",
"addMaterials":[{"name":"Mat1", "property":"prop1"},
{"name":"Mat2","property":"prop2"},
{"name":"Mat3","property":"prop3"}],
"mainProperty":"mainproperty1",
"URL":"http://www.mySite..."
where id, addMaterials, mainProperty, and URL are my main fields while 'name' and 'property' are my subfields. I know that Solr is designed to handle denormalized documents but denormalizing is not a possible solution for my application.
What I'm thinking is to separate my data set: move the fields that have subfields to another document and add a new field to link it to the original document (e.g. fromIdField).
Is there any other solution to do this? My minimum goal is to index the values of addMaterials field (even without indexing the subfields)
from:
"addMaterials":[{"name":"Mat1", "property":"prop1"},
{"name":"Mat2","property":"prop2"},
{"name":"Mat3","property":"prop3"}],
to
"addMaterials":{"name":"Mat1", "property":"prop1"}
"addMaterials":{"name":"Mat2", "property":"prop2"}
"addMaterials":{"name":"Mat3", "property":"prop3"}
Thanks in advance.

I have found a solution to my problem. Instead of separating my data set, I kept the addMaterials field as a multiValued field and ignored the subfields, so I only have one multiValued field to index. I used Solr's /update/csv request handler to index my CSV file and set },{ as the separator for the addMaterials multiValued field. The indexed document looks like this:
"addMaterials": ["[{\"name\":\"Mat1\", \"property\":\"prop1\"",
"\"name\":\"Mat2\", \"property\":\"prop2\"",
"\"name\":\"Mat3\", \"property\":\"prop3\"}]"]
I indexed my document using this:
curl "http://localhost:8983/solr/<coreName>/update/csv?stream.file=C:/userName/Solr/solr-5.2.0/documentFolder/myFile.csv&f.addMaterials.split=true&f.addMaterials.separator=\},\{&stream.contentType=text/plain;charset=utf-8"
This assumes that addMaterials is defined as a multiValued field, so modify your schema before indexing your document with the procedure above. Otherwise, Solr will report an error saying that the field is not multiValued.
Of course, if you need to query against the subfields, then I guess you can use Solr's {!join} query parser.
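To see what the },{ separator used above does to the cell value, here is a minimal Python sketch, independent of Solr, that mimics the split and then repairs each fragment into a JSON object on the client side. The function names and the sample cell are illustrative assumptions, not part of Solr.

```python
import json


def split_like_solr(cell: str, separator: str = "},{") -> list[str]:
    """Mimic splitting one CSV cell on the literal separator."""
    return cell.split(separator)


def repair_fragments(fragments: list[str]) -> list[dict]:
    """Turn the bracket-stripped fragments back into dictionaries.

    Assumes flat objects whose values do not themselves start or end
    with brackets or braces.
    """
    repaired = []
    for fragment in fragments:
        # Drop the leftover "[{" / "}]" and re-wrap in a single pair of braces.
        core = fragment.strip().lstrip("[{").rstrip("}]")
        repaired.append(json.loads("{" + core + "}"))
    return repaired


cell = ('[{"name":"Mat1", "property":"prop1"},'
        '{"name":"Mat2","property":"prop2"},'
        '{"name":"Mat3","property":"prop3"}]')
fragments = split_like_solr(cell)      # three raw strings, as indexed
materials = repair_fragments(fragments)  # three dicts with name/property
```

The repair step would run in the application that reads the search results, since the indexed values themselves are no longer valid JSON.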

Is there a way to do string replacement/substitution in sql?

I have some records in a CMS that include HTML fragments with custom tags for a widget tool. The maker of the CMS has apparently updated their CMS without providing proper data conversion. Their widgets use keys for layout based on screen width such as block_lg, block_md, block_sm. The problem kicks in with the fact they used to have a block_xs and they have now shifted them all -- dropping the block_xs and instead placing a block_xl on the other end.
We don't really use these things, but their widget configurations do. What this means for us is the values for each key are identical. The problem occurs when the updated CMS code is looking for the 'block_xl' in any widget definition tags, it can't find it and errors out.
What I'm thinking then is that the new code will appear to 'ignore' the block_xs due to how it reads the tags. (and similarly, the old code will ignore block_xl) Since the values for each are identical, I need to basically read any widget definition and add a block_xl value to it matching the value of [any one of] the other width parameters.
Since the best place order-wise would be 'before' the block_lg value, it's probably easiest to do it as follows:
Replace anything matching a regex along the lines of /block_lg="([\d,]+)"/ with: block_xl="$1" block_lg="$1"
Or whatever the equivalent of that would be.
Example of an existing CMS block with multiple widget definitions:
<div>{{widget type="CleverSoft\CleverBlock\Block\Widget"
widget_title="The Album" classes="highlight-bottom modish greenfont font52 fontlight"
enable_fullwidth="0" block_ids="127" lazyload="0"
block_lg="127,12," block_md="127,12," block_sm="127,12," block_xs="127,12,"
template="widget/block.phtml" scroll="0" background_overlay_o="0"}}</div>
<!-- Image Block -->
<div>{{widget type="CleverSoft\CleverBlock\Block\Widget"
widget_title="What’s Your Favorite Cover Style?"
classes="zoo-widget-style2 modish grey font26 fontlight"
enable_fullwidth="0" block_ids="126" lazyload="0"
block_lg="126,12," block_md="126,12," block_sm="126,12," block_xs="126,12,"
template="widget/block.phtml" scroll="0" background_overlay_o="0"}}</div>
What I would prefer to end up with from the above (adding block_xl):
<div>{{widget type="CleverSoft\CleverBlock\Block\Widget"
widget_title="The Album" classes="highlight-bottom modish greenfont font52 fontlight"
enable_fullwidth="0" block_ids="127" lazyload="0"
block_xl="127,12," block_lg="127,12," block_md="127,12," block_sm="127,12," block_xs="127,12,"
template="widget/block.phtml" scroll="0" background_overlay_o="0"}}</div>
<!-- Image Block -->
<div>{{widget type="CleverSoft\CleverBlock\Block\Widget"
widget_title="What’s Your Favorite Cover Style?"
classes="zoo-widget-style2 modish grey font26 fontlight"
enable_fullwidth="0" block_ids="126" lazyload="0"
block_xl="126,12," block_lg="126,12," block_md="126,12," block_sm="126,12," block_xs="126,12,"
template="widget/block.phtml" scroll="0" background_overlay_o="0"}}</div>
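To make the intended substitution concrete, here is a minimal Python sketch of the regex replacement described above (not SQL). The pattern assumes the values consist only of digits and commas, as in the examples; the function name is hypothetical.

```python
import re

# Capture only the value inside the quotes so it can be reused twice.
BLOCK_LG = re.compile(r'block_lg="([\d,]+)"')


def add_block_xl(content: str) -> str:
    """Insert block_xl before every block_lg, copying its value."""
    return BLOCK_LG.sub(r'block_xl="\1" block_lg="\1"', content)


sample = (
    'block_ids="127" block_lg="127,12," block_md="127,12,"\n'
    'block_ids="126" block_lg="126,12," block_md="126,12,"'
)
updated = add_block_xl(sample)
```

Note that this sketch is not idempotent: running it twice on the same content would add a second block_xl, so it should only be applied to records that do not contain block_xl yet.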
I know how to do it in php and if necessary, I will just replace it on my local DB and write an sql script to update the modified records, but the html blocks can be kind of big in some cases. It would be preferable, if it is possible, to make the substitutions right in the SQL but I'm not sure how to do it or if it's even possible to do.
And yes, there can be more than one instance of a widget in any given cms page or block. (i.e. there may be a need for more than one such substitutions with different local 'values' assigned to the block_lg)
If anyone can help me do it in SQL, it would be greatly appreciated.
For reference, the tables affected are called cms_page and cms_block; the column in both cases is named content.
SW

@DbLookup and formatting on web

I have been developing a web application using Domino, in which I look up a field with @DbLookup. The lookup works fine, but the formatting of the value is lost when it is displayed on the web.
For example, in the Lotus Notes client the field value is displayed as follows:
I am one, I am two, I am one , I am two, labbblallalalalalalalalalalalalalalalalalalaallllal
Labbbaalalalallalalalalalaalallaal
Hello there, labblalalallalalalllaalalalalalalalalalalalalalalalalalalalalalalala
When I retrieve the field value on the web, the second value follows immediately after the first, and so on. I expected a line feed between the values, but none appears.
The field above is a multi-valued field. On the web I use computed text that performs the @DbLookup.
Please suggest what else I could try, or an alternative solution for this case.
Thanks
HD
Your multi-valued field has display options associated with it and the Notes client honors those. Obviously, your options are set up to display entries separated by newlines.
The computed text that you are using for the web does not have options like that, and the field options are irrelevant because you aren't displaying the field. Your code has to insert the @NewLine values itself. That's pretty easy because @DbLookup returns a list, and if you concatenate a list and a scalar, the scalar will be appended to each element of the list. (Look at the third example under "concatenation, pairwise" here to see what I mean.)
The way you've worded your question is a little unclear to me, but what you need in your computed text formula is either something like this:
list := @DbLookup(etc., etc.);
list + @NewLine;
Or something like this:
multiValueFieldContainingListWithDbLookupResult + @NewLine;
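For readers unfamiliar with formula language, the list-plus-scalar behavior can be pictured with a rough Python analogue. This is only an illustration of the semantics, not Notes code; the function name and sample values are made up.

```python
def append_to_each(values: list[str], suffix: str) -> list[str]:
    """Analogue of 'list + scalar': the scalar is appended to every element."""
    return [value + suffix for value in values]


lookup_result = ["I am one", "I am two"]  # stand-in for the lookup result

# Each entry gets its own trailing newline.
with_newlines = append_to_each(lookup_result, "\n")

# Alternative: join the entries into one string with a separator between them.
joined = "\n".join(lookup_result)
```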
I used @Implode(Dblookupreturnedvalue;"");
thanks All :)

How to display custom fields in rally?

I have a Ruby script that tries to pull some custom fields from Rally. The filter works just fine (it contains one of the custom fields I want to pull), but when I try to display the custom fields they don't show up in the list: the value returned for every custom field is blank, while appropriate values are returned for FormattedID, Name, and Description.
Here's the link: http://pastebin.ubuntu.com/6124958/
Please see this post.
Do you fetch the fields?
What version of the WS API are you using? If it is v2.0, is c_ prepended to the name of the field in the code?
How is the field spelled in your code, and how does that spelling compare to the Name and Display Name of the field in the UI?
There is another reason why custom fields might not appear (other than the versioning and 'c_' issues nickm mentioned). I just found this out after a ton of head banging. The Rally SDK's UI components filter out all fields that are hidden (such as _ref, or other 'hidden' custom fields), so you cannot view them in your apps in grids, charts, etc. For example, when constructing a Rally.ui.grid.Grid, a class called Rally.ui.grid.ColumnBuilder is constructed, and it runs a function on the columns of the grid that looks like this:
_removeHiddenColumns: function (columns) {
    // Keep a column only if it has no model field or its model field is not hidden.
    return _.filter(columns, function (column) {
        return !column.modelField || !column.modelField.hidden;
    });
},
As you can see, if you try to display any hidden field (like _ref, for example) in a grid, the column gets removed. So, although you may be fetching the custom fields, they will not show up as long as those fields are 'hidden'.

Solr File Indexing map content by pages

I would like to index files in Solr.
I have already made an "output script" with PHP, but my project leader has given me the task of displaying the page number of the found text.
So:
- I am searching for the Word "Foo".
- Solr returns the results and also the highlighted text.
- Now I would like to know on which page this highlighted text is, to find it.
The files are *.pdf files.
One solution I have thought of would be to import the text of the PDF files into different fields, or maybe into one multivalued field named "content".
Maybe like this:
Json:
content:
1: "page one text",
2: "page two text"
and so on?
Is this possible? Or is there a better way to find this information out? Thanks for your help! :-)
You need to create a separate Solr document for every page of every PDF file. If you want to return only one result per file, then you can use FieldCollapsing to group all the results from the same PDF file.
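A minimal Python sketch of that idea, assuming the page texts have already been extracted from the PDF by some other tool. The field names (file_id, page, content) and both function names are hypothetical; group, group.field, hl, hl.fl and fl are standard Solr request parameters.

```python
def build_page_documents(file_id: str, page_texts: list[str]) -> list[dict]:
    """Create one Solr document per page, numbering pages from 1."""
    return [
        {
            "id": f"{file_id}_page_{number}",  # unique per page
            "file_id": file_id,                # shared by all pages of a file
            "page": number,
            "content": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def grouped_query_params(term: str) -> dict:
    """Query parameters that collapse page hits into one group per file."""
    return {
        "q": f"content:{term}",
        "group": "true",
        "group.field": "file_id",
        "hl": "true",
        "hl.fl": "content",
        "fl": "id,file_id,page",
    }


docs = build_page_documents("report.pdf", ["page one text", "page two text"])
params = grouped_query_params("Foo")
```

Each hit then carries its own page value, so the page number of a highlighted snippet can be read directly from the matching document.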

Lucene query that eliminates xml tags in full text search

In Alfresco I need to write a Lucene query that excludes the XML tags from the content while searching.
For example, if the file try.xml is searched by content, the search should not match the XML tags.
try.xml
<sample>This is an example</sample>
If I search for the text "sample", it should not return the file "try.xml".
How can I achieve this?
Edit
I have tried the query below, with no change.
@cm\:name:"try*" -TEXT:"<*>" +TEXT:"sample"
What's wrong with the above query? I tried to match file names starting with "try", exclude the text inside tags, and search for the text "sample".
By default Alfresco treats XML files as plain text and indexes the XML tags as words, which is why they can be found via full-text search. XML content is handled by the StringExtractingContentTransformer in Alfresco, which converts text/xml to text/plain before indexing it.
To check which transformers are registered in your Alfresco installation, you can open
http://localhost:8080/alfresco/service/mimetypes?mimetype=text/xml#text/xml
To prevent the indexing of XML tags you have to write a special transformer which strips them out. See http://wiki.alfresco.com/wiki/Content_Transformations for an introduction to content transformation with Alfresco. The easiest way would be to integrate a command-line utility that converts the XML file into text, or you could implement a Java class which does the transformation.
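The core of such a tag-stripping step can be sketched in a few lines of Python. This only illustrates the text extraction itself, not an Alfresco transformer (which would be written in Java or wrapped as a command-line tool); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_source: str) -> str:
    """Return only the text nodes of an XML document.

    Tag names and attributes are dropped; whitespace is normalized.
    """
    root = ET.fromstring(xml_source)
    # itertext() yields the text content of the element and all descendants.
    return " ".join(" ".join(root.itertext()).split())


text = xml_to_plain_text("<sample>This is an example</sample>")
```

With the tags removed before indexing, a search for "sample" would no longer match this file, while a search for "example" still would.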
There's no standard way to do what you need, here's an excerpt of the official documentation:
Wildcard queries: Wildcard queries using * and ? are supported as terms and phrases. For tokenized fields the pattern match cannot be exact, as all the non-token characters (whitespace, punctuation, etc.) will have been lost and treated as equal.
Basically, angle brackets are stripped out by default. You would need to hack the indexing and query parsing processes to get the behavior you want.
Could you not just exclude the xml mimetype? (See http://wiki.alfresco.com/wiki/Search#Finding_nodes_by_content_mimetype for the syntax)
I guess you might want to exclude html too (so you'd exclude text/html and text/xml), that'd prevent you getting any nodes in your results that contain xml tags.