SOLR indexing SQL data per language /multi language


I know that there are similar questions about SOLR; some give insights, but none offers a solution for exactly what I am trying to do.
I would like to create one core holding multi-language data.
For example, it is possible to have fields like description_fr and description_en. I would like to return description_fr when the request wants the data in French, and not return description_en.
Some of my questions:
How do I define the data to be indexed?
How do I tell the application to run the search against the English or French version of the fields?
Thanks a lot

I would recommend a talk from a friend of mine at the latest Berlin Buzzwords [1].
This could be interesting to you for the future.
Sticking to your current question, I would proceed by identifying the language of the query (which is a hard task, as a query is usually composed of only a few terms).
Then, depending on the language, I would send Solr a request to return only one stored field for the content.
e.g.
In the index I have :
description_it, description_en
q = "prodotto scalare"
Language Identified : it
Request : http://localhost:8983/solr/select?q=prodotto+scalare&fl=description_it
You just need a library to detect the language [2] and a mapping between the language ISO code and your Solr fields.
You can build this at API time or directly in Solr as a plugin.
[1] https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/embracing_diversity_searching_over_multiple_languages.pdf
[2]
A couple of popular examples :
Tika - https://www.programcreek.com/java-api-examples/index.php?api=org.apache.tika.language.LanguageIdentifier
Google - https://github.com/shuyo/language-detection
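As a rough sketch of the API-side mapping described above, assuming the language code has already been produced by one of the detection libraries: the LANGUAGE_FIELDS table, the build_select_url helper and the English fallback are hypothetical names and choices for illustration. The df parameter is an addition to the request shown in the answer, so that the search itself also runs against the same language field.

```python
from urllib.parse import urlencode

# Hypothetical mapping from ISO 639-1 language codes to Solr fields.
LANGUAGE_FIELDS = {
    "it": "description_it",
    "en": "description_en",
}


def build_select_url(query, lang, base="http://localhost:8983/solr/select",
                     default_lang="en"):
    """Build a select URL that searches and returns only one language field."""
    # Fall back to a default language when detection gives an unmapped code.
    field = LANGUAGE_FIELDS.get(lang, LANGUAGE_FIELDS[default_lang])
    params = {
        "q": query,
        "fl": field,  # return only the stored field for this language
        "df": field,  # assumption: also use it as the default search field
    }
    return base + "?" + urlencode(params)


print(build_select_url("prodotto scalare", "it"))
```

Keeping the mapping in one table means that adding a language is a one-line change, whether the code lives in the API layer or is ported into a Solr plugin.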

Related

Is there a way to properly experiment with Solr field-types?

I'm working with Solr for a basic search engine, and I've created a couple different fieldTypes that include various filters and tokenizers in their analyzer chains.
However, I'm finding it very difficult to assess how these components of the chain interact and when I query in the Solr Admin, I consistently get different results than I expect-- with no clue as to why.
Is there a way to see what a phrase like education:"x university" is being transformed into when I type it in the q section of the Admin?
Also, when the phrase goes through the chain can it be transformed into multiple things that are all searched or is it just a single modified phrase?
Thanks for any help!

Use the Analysis screen in the Solr Admin to check how each field and its type process the tokens, both while querying and while indexing.
Analyse Fieldname / FieldType:
From the drop-down, select the field or type that you want to analyse and click on Analyse Values.
For example, it shows which tokenizer is used, which filter classes are applied to the tokens, and how each token is transformed after passing through each filter class.
If Verbose Output is checked, it shows more details about each filter class used for the selected field/type.
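The same information can usually be fetched over HTTP, which helps when comparing several field types in a script. The sketch below only builds the request URLs. It assumes a core named mycore (hypothetical), that the field analysis request handler is reachable at /analysis/field on your installation, and it uses the standard debugQuery=true parameter to get the parsed form of a query such as education:"x university".

```python
from urllib.parse import urlencode

# Hypothetical core URL; adjust to your own installation.
CORE_URL = "http://localhost:8983/solr/mycore"


def build_field_analysis_url(field_name, index_value, query_value):
    """URL asking Solr to run both analyzer chains of one field on sample text."""
    params = {
        "analysis.fieldname": field_name,
        "analysis.fieldvalue": index_value,  # text as it would be indexed
        "analysis.query": query_value,       # text as it would be queried
        "analysis.showmatch": "true",
        "wt": "json",
    }
    return f"{CORE_URL}/analysis/field?{urlencode(params)}"


def build_debug_query_url(query):
    """URL for a normal search that also returns the parsed query."""
    params = {"q": query, "debugQuery": "true", "wt": "json"}
    return f"{CORE_URL}/select?{urlencode(params)}"


print(build_field_analysis_url("education", "X University", "x university"))
print(build_debug_query_url('education:"x university"'))
```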

How to design RESTful advanced search/filter

First of all, I have read RESTful URL design for search and How to design RESTful search/filtering? questions. I am trying to design more advanced options for searching in a simple and RESTful way.
The answers to those questions have given me some insight and clues on how to design my previous application url pattern for search/filter functionality.
First, I came up with a quite nice and simple solution for basic filtering options, using the pattern:
Equality search: key = val
IN search: key = val1 & key = val2
But as the application grew, so did the search requirements. I ended up with a rather unpleasant and complex URL pattern for advanced search options, which included:
Negation search: key-N = val
Like search: key-L = val
OR search: key1-O = val1 & key2 = val2
Range search: key1-RS = val1 & key1-RE = val2
What's more, besides filters, the query has to carry information about pagination and ordering, so the filter parameter has the F- suffix, order-by fields have the O- suffix, and pagination has the P- suffix.
I hope that at this point I do not have to add that parsing such a request is a rather nasty task, with the possibility of ambiguity if a key contains '-'. I have created some regexps to parse it, and it works quite well for now, but...
Now I am starting to write a new web app and I have the chance to redesign this piece from scratch.
I am wondering about creating an object in the browser that contains all the information in a structured and self-explanatory way, and sending it to the server as a JSON string, like:
filter = [{'type':'like','field':key,'value':val1,'operator':'and','negation':false},..]
But I get a strange feeling that this is not a good idea, and I really don't know why.
So, this would be the definition of my context. Now the question:
I am searching for a simpler and safer pattern for implementing advanced search, including the options I mentioned above, as RESTful GET parameters - can you share some ideas?
Or maybe some insights on not doing this in a RESTful way?
Also, if you see some pitfalls in the JSON approach, please share them.
EDIT:
I now know what makes sending JSON as a GET parameter not such a good idea: encoding it makes it ugly and hard to read.
The links sent by Thierry Templier gave me something to think about, and I managed to design more consistent and safe filter handling in GET parameters. Below is the definition of the syntax.
For filters - multiple F parameters (one for each search criterion):
F = OPERATOR:NEGATION:TYPE:FIELD:VAL[:VAL1,:VAL2...]
allowed values:
[AND|OR]:[T|F]:[EQ|LI|IN|RA]:FIELD_NAME:VALUE1:VALUE2...
For order by - multiple O parameters (one for each ordered field):
O = ORDINAL_NO:DIRECTION:FIELD
allowed values:
[0-9]+:[ASC|DESC]:FIELD_NAME
Pagination - single P parameter:
P = ITEMS_PER_PAGE:FROM_PAGE:TO_PAGE
I think this will be a good solution - it meets all my requirements, it is easy to parse and write, it is readable, and I do not see how that syntax can become ambiguous.
I would appreciate any thoughts on that idea - do you see any pitfalls?
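To make the F / O / P syntax above concrete, here is a minimal parsing sketch. The Filter class and the parse_* helper names are made up for illustration, and the validation is only what the "allowed values" lines state; it is not a definitive implementation.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import parse_qs

OPERATORS = {"AND", "OR"}
NEGATIONS = {"T", "F"}
TYPES = {"EQ", "LI", "IN", "RA"}
DIRECTIONS = {"ASC", "DESC"}


@dataclass
class Filter:
    operator: str
    negated: bool
    type: str
    field: str
    values: List[str]


def parse_filter(raw: str) -> Filter:
    """Parse OPERATOR:NEGATION:TYPE:FIELD:VAL[:VAL...]."""
    parts = raw.split(":")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 ':'-separated parts: {raw!r}")
    operator, negation, ftype, field, *values = parts
    if operator not in OPERATORS or negation not in NEGATIONS or ftype not in TYPES:
        raise ValueError(f"invalid filter: {raw!r}")
    return Filter(operator, negation == "T", ftype, field, values)


def parse_order(raw: str):
    """Parse ORDINAL_NO:DIRECTION:FIELD into a sortable tuple."""
    ordinal, direction, field = raw.split(":")
    if direction not in DIRECTIONS:
        raise ValueError(f"invalid direction: {raw!r}")
    return int(ordinal), direction, field


def parse_search(query_string: str):
    """Parse a whole query string using repeated F and O and a single P."""
    params = parse_qs(query_string)
    filters = [parse_filter(f) for f in params.get("F", [])]
    orders = sorted(parse_order(o) for o in params.get("O", []))
    page = [int(n) for n in params["P"][0].split(":")] if "P" in params else None
    return filters, orders, page


print(parse_search("F=AND:F:RA:price:10:20&O=0:ASC:name&P=20:1:3"))
```

One thing the sketch makes visible: a value that itself contains ':' (a time such as 10:30, or a URL) is split into several values, so the syntax needs an escaping rule for ':' before it is truly unambiguous.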

There are several options here. But it's clear that if your queries tend to be complex, with operators and so on, you can't use a plain set of query parameters. I see two approaches:
Provide the query as JSON content to a POST method
Provide the query in a query parameter with a specific format / grammar to a GET method
I think that you could have a look at what Elasticsearch does for its queries. They are able to describe very complex queries with JSON content (using several levels). Here is a link to their query DSL: http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html.
You can also have a look at what OData does for queries. They chose another approach with a single query parameter, $filter. Here are some links that can give you some examples: https://msdn.microsoft.com/en-us/library/hh169248(v=nav.70) and http://www.odata.org/documentation/odata-version-3-0/url-conventions/. This option requires a grammar on the server side to parse your query.
In general, this link could also give you some hints at this level in its section "Filtering data": https://templth.wordpress.com/2014/12/15/designing-a-web-api/.
I hope this gives you some helpful hints for designing your queries within your RESTful services ;-)
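To compare the two approaches side by side, here is a small sketch that expresses the same range filter both ways. The function names and the https://api.example.com/products endpoint are hypothetical; the JSON shape follows the Elasticsearch range query and the expression follows the OData $filter operators ge / le.

```python
import json
from urllib.parse import urlencode


def as_post_body(field, low, high):
    """Approach 1: structured JSON sent as the body of a POST."""
    return json.dumps({"query": {"range": {field: {"gte": low, "lte": high}}}})


def as_get_url(base, field, low, high):
    """Approach 2: one $filter query parameter with its own small grammar."""
    expression = f"{field} ge {low} and {field} le {high}"
    return f"{base}?{urlencode({'$filter': expression})}"


print(as_post_body("price", 10, 20))
print(as_get_url("https://api.example.com/products", "price", 10, 20))
```

The JSON form needs no custom parser but is awkward in a URL; the $filter form stays a plain GET but, as noted above, the server must parse the expression.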
Thierry

Fulltext Solr statistical search

Suppose I have a couple of documents indexed with Solr 4.0. Each has 2 fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Could anyone advise me what kind of analyzers/parsers I should use, and how to build a statistical query that returns a sorted list of the most frequently used words across the DATA fields of all documents?

For the most frequent terms, look into the TermsComponent and the StatsComponent.
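As an illustration of the TermsComponent route, the sketch below builds a request for the top terms of the DATA field. It assumes a /terms request handler is configured in solrconfig.xml and uses a hypothetical base URL; note that the counts returned this way are document frequencies (in how many documents a term occurs), not total occurrences.

```python
from urllib.parse import urlencode


def build_terms_url(field, limit=20, base="http://localhost:8983/solr/terms"):
    """URL returning the most frequent indexed terms of one field."""
    params = {
        "terms": "true",
        "terms.fl": field,      # field whose terms are listed
        "terms.limit": limit,   # how many terms to return
        "terms.sort": "count",  # most frequent first
        "wt": "json",
    }
    return f"{base}?{urlencode(params)}"


print(build_terms_url("DATA"))
```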

Besides the answers mentioned here, you can use the "HighFreqTerms" class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command-line application which lets you see the top terms for any field, either by document frequency or by total term frequency (the -t option).
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms <index dir> [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393

For the ID field, use an analyzer based on the keyword tokenizer - it will take the whole content of the field as a single token.
For the DATA field, use a language-specific analyzer. Note that it is possible to auto-detect the language of the text (patch).
I'm not sure whether it's possible to find the most frequent words with Solr, but if you can use Lucene itself, pay attention to this question. My own suggestion is to use the HighFreqTerms class from the Luke project.

Lucene/Solr Searching problem?

My problem is that I want to search in specific locations within the indexed text. Say we have a Lucene document that contains the following text:
<Cover>
This document contains following items
1. Business overview.
2. Risk Factors.
3. Management
</Cover>
<BusinessOverview>
our business is xyz
</BusinessOverview>
<RiskFactors>
we have xyz risk factors
</RiskFactors>
<Management>
we have xyz type of management
</Management>
In the text above, HTML-like tags (they could be anything) divide the main document into sections. I want the following functionality: if the user gives some text to search for and does not mention a specific section, the text should be searched in the whole document; but if the user specifies a section along with the text, the search should be done only in that particular section. I want to know whether this type of search is possible with Solr/Lucene.
Regards
Ahsan

You can use the <copyField> option to have a "field of fields".
See here:
http://wiki.apache.org/solr/FAQ#How_do_I_use_copyField_with_wildcards.3F
http://www.ibm.com/developerworks/java/library/j-solr1/

Your schema should reflect that need; the data sent to the indexer would then have to match this schema properly. Once done, you'll be able to query against specific fields.
You could also use an xml importer.
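A rough sketch of what this could look like on the client side: split the text into one field per section before indexing, then query either a single section field or a catch-all field. The function names are made up, and the catch-all field name text is an assumption (for example, a field filled from all section fields via copyField in the schema).

```python
import re

# Matches <Name>...</Name> pairs; assumes well-formed, non-nested section tags.
SECTION_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def split_sections(text):
    """Return a dict mapping each section name to its text, one field per section."""
    return {name: body.strip() for name, body in SECTION_RE.findall(text)}


def build_query(terms, section=None, catch_all="text"):
    """Search one section field if given, otherwise the catch-all field."""
    field = section if section else catch_all
    return f"{field}:({terms})"


doc = """<BusinessOverview>
our business is xyz
</BusinessOverview>
<RiskFactors>
we have xyz risk factors
</RiskFactors>"""

print(split_sections(doc))
print(build_query("risk factors", section="RiskFactors"))
print(build_query("xyz"))
```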

I have never worked with Solr, but Lucene itself has a very flexible query language, see this link. So the answer is yes, it is possible.

Retrieving per keyword/field match position in Lucene Solr -- possible?

Is there any way to retrieve the match field/position for each keyword for each matching document from solr?
For example, if the document has title "Retrieving per keyword/field match position in Lucene Solr -- possible?" and the query is "solr keyword", I'd like to get, in addition to the doc-id (I normally only want the doc-id, not the full document), something that can tell me the matches are at:
solr:
title: 9
keyword:
title: 3
I'm pretty sure such info is computed during query execution (for phrase queries), but is it possible to return it to the application?
Thanks!

Debugging Relevance Issues in Search suggests using Solr analysis, which you can get to from the admin URL, using something like http://localhost:8983/solr/admin/analysis.jsp?highlight=on .
This highlights matching terms and gives their position.

AFAIK there is no way to do that directly, but you can use hit highlighting to implement it.
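A minimal sketch of the highlighting workaround, under a few assumptions: the request uses hl=true, hl.fl=title and hl.fragsize=0 so that the whole field comes back as one snippet, matches are wrapped in the default <em> tags, each highlighted span is a single word, and positions are counted as simple 1-based word positions, which may differ from the positions your analyzer assigns. The function name is hypothetical.

```python
import re

# A token is either a highlighted word or a plain word.
TOKEN_RE = re.compile(r"<em>(\w+)</em>|(\w+)")


def match_positions(snippet):
    """Map each highlighted term to the 1-based word positions where it matched."""
    positions = {}
    for pos, match in enumerate(TOKEN_RE.finditer(snippet), start=1):
        term = match.group(1)  # set only when the token was inside <em>...</em>
        if term:
            positions.setdefault(term.lower(), []).append(pos)
    return positions


snippet = ("Retrieving per <em>keyword</em>/field match position "
           "in Lucene <em>Solr</em> -- possible?")
print(match_positions(snippet))
```

For the example title from the question, this gives position 3 for keyword and position 9 for solr.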
