Highlighting matched results on the _all field - Lucene

I want the matched results to be highlighted. This works if I name the field explicitly and it returns the highlighted text; however, if I give the field as "_all", no value is returned.
This works for me:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
  "highlight": {
    "fields": {
      "my_field": {}
    }
  }
}'
This returns the expected value as follows:
[highlight] => stdClass Object ( [my_field] => Array ( [0] => stackoverflow is the best website for techies ) )
But when I give this:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
  "highlight": {
    "fields": {
      "_all": {}
    }
  }
}'
I get null value/no result.
[highlight] => stdClass Object ( [_all] => Array () )
How do I get it to work on any field so that I don't have to mention the field name?

To avoid the need to map _all as a stored field in your index, a quick alternative fix is to use * instead of _all:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
  "highlight": {
    "fields": {
      "*": {}
    }
  }
}'

If you are using ES 2.x then you also need to set the require_field_match option to false, because of a behavior change. From the docs:
The default value for the require_field_match option has changed from false to true, meaning that the highlighters will, by default, only take the fields that were queried into account.
This means that, when querying the _all field, trying to highlight on
any field other than _all will produce no highlighted snippets.
"highlight": {
"fields": {
"*": {}
},
"require_field_match": false
}
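Putting the two together, a complete request for this case might look like the following (index, type, and query string reused from the question; a sketch rather than a verified command):
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
  "highlight": {
    "fields": {
      "*": {}
    },
    "require_field_match": false
  }
}'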

You need to map the _all field as stored. The mapping below should do the trick. Note though that this will add to the index size.
{
  "my_type": {
    "_all": {
      "enabled": true,
      "store": "yes"
    }
  }
}
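Note that _all storage must be in place before documents are indexed (existing documents need reindexing). A minimal sketch of applying it when creating the index, reusing the index and type names from the question:
curl -XPUT "http://localhost:9200/my_index" -d '{
  "mappings": {
    "my_type": {
      "_all": {
        "enabled": true,
        "store": "yes"
      }
    }
  }
}'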

This library has functions for query highlighting, including highlighting across all fields. The README explains how to create an Elasticsearch index with the _all field stored:
https://github.com/niranjan-uma-shankar/Elasticsearch-PHP-class

Related

Oracle SQL JSON_QUERY ignore key field

I have a JSON document in which several keys are numbers rather than fixed strings. Is there any way I can bypass them in order to access the nested values?
{
  "55568509": {
    "registers": {
      "001": {
        "isPlausible": false,
        "deviceNumber": "55501223",
        "register": "001",
        "readingValue": "5295",
        "readingDate": "2021-02-25T00:00:00.000Z"
      }
    }
  }
}
My expected output here would be 5295, but since "55568509" can vary from JSON to JSON, JSON_QUERY(data, '$."55568509".registers."001".readingValue') would not be an option. I'm not able to use a regexp here because this is only part of the original JSON, which contains more than this.
UPDATE: full JSON with multiple occurrences:
This is what my whole JSON looks like. I would like all the readingValue entries in brackets; in the example below, my expected output would be [32641, 00964].
WITH test_table ( data ) AS (
  SELECT '{
    "session": {
      "sessionStartDate": "2021-02-26T12:03:34+0000",
      "interactionDate": "2021-02-26T12:04:19+0000",
      "sapGuid": "369F01DFXXXXXXXXXX8553F40CE282B3",
      "agentId": "USER001",
      "channel": "XXX",
      "bpNumber": "5551231234",
      "contractAccountNumber": "55512312345",
      "contactDirection": "",
      "contactMethod": "Z08",
      "interactionId": "5550848784",
      "isResponsibleForPayingBill": "Yes"
    },
    "payload": {
      "agentId": "USER001",
      "contractAccountNumber": "55512312345",
      "error": {
        "55549271": {
          "registers": {
            "001": {
              "isPlausible": false,
              "deviceNumber": "55501223",
              "register": "001",
              "readingValue": "32641",
              "readingDate": "2021-02-26T00:00:00.000Z"
            }
          },
          "errors": [
            {
              "contractNumber": "55501231",
              "language": "EN",
              "errorCode": "62",
              "errorText": "Error Text1",
              "isHardError": false
            },
            {
              "contractNumber": "55501232",
              "language": "EN",
              "errorCode": "62",
              "errorText": "Error Text2",
              "isHardError": false
            }
          ],
          "bpNumber": "5557273667"
        },
        "55583693": {
          "registers": {
            "001": {
              "isPlausible": false,
              "deviceNumber": "555121212",
              "register": "001",
              "readingValue": "00964",
              "readingDate": "2021-02-26T00:00:00.000Z"
            }
          },
          "errors": [],
          "bpNumber": "555123123"
        }
      }
    }
  }'
  FROM dual
)
SELECT
  JSON_QUERY(data, '$.payload.error.*.registers.*[*].readingValue') AS reading_value
FROM
  test_table;
UPDATE 2:
Solved. The following does the trick (upvoting the first comment):
JSON_QUERY(data, '$.payload.error.*.registers.*.readingValue' WITH WRAPPER) AS read_value
As I explained in the comment to your question, if you are getting that result from the JSON you posted, you are not using JSON_QUERY(); you must be using JSON_VALUE(). Either that, or there's something else you didn't share with us.
In any case, let's say you are using JSON_VALUE() with the arguments you showed. You are asking, how can you modify the path so that the top-level attribute name is not hard-coded. That is trivial: use asterisk (*) instead of the hard-coded name. (This would work the same with JSON_QUERY() - it's about JSON paths, not the specific function that uses them.)
with test_table (data) as (
  select '{
    "59668509": {
      "registers": {
        "001": {
          "isPlausible": false,
          "deviceNumber": "40157471",
          "register": "001",
          "readingValue": "5295",
          "readingDate": "2021-02-25T00:00:00.000Z"
        }
      }
    }
  }' from dual
)
select json_value(data, '$.*."registers"."001"."readingValue"' returning number) as reading_value
from test_table;
READING_VALUE
-------------
5295
As an aside that is not related to your question in any way: In your JSON you have an object with a single attribute named "registers", whose value is another object with a single attribute "001", and in turn, this object has an attribute named "register" with value "001". Does that make sense to you? It doesn't to me.

What is the best way to create a subset of my data in Elasticsearch?

I have an index in elasticsearch containing apache log data. Here is what I want to do:
1. Identify all visitors (by IP number) that accessed a certain file (e.g. /signup.php).
2. Do a search/query/aggregation on my data, but limit the documents that are examined to those containing an IP number found in step 1.
In the SQL world, I would just create a temporary table and insert all the matching IP numbers from step 1. Next, I would query my main table and limit the result set by joining in my temporary table on IP number.
I understand joins are not possible in elasticsearch. The elasticsearch documentation suggests a few ways to handle situations like this:
Application side joins
This does not seem practical, because the list of IP numbers may be very large and it seems inefficient to send the results to the client and then pass it back to elasticsearch in one huge terms filter.
Denormalizing the data
This would involve iterating over the matching IP numbers and updating every document in the index for any given IP number with something like "in_group": true, so I can use that in my query later on. This also seems very impractical and inefficient, especially since the source query (step 1) is dynamic.
Nested objects and/or parent-child relationships
I'm not sure if dynamically creating new documents with nested objects is practical in this case. It seems to me that I would end up copying huge parts of my data.
I'm new to Elasticsearch and NoSQL in general, so perhaps I'm just looking at the problem the wrong way and I shouldn't be trying to emulate a JOIN in the first place.
But this seems like such a common case for segmenting a dataset, it makes me wonder if I am overlooking some other obvious way of doing this?
Any help would be appreciated!
If I understood your question correctly, you are trying to get a subset of your documents based on a certain condition and then query/search/aggregate that subset further.
If so, why would you want to store it in a separate view (SQL-style)? A major strength of Elasticsearch is its caching of filters, which greatly reduces query time. Using this feature, every query/search/aggregation you run can include a term filter expressing the condition from step 1, and whatever other operations you need can then be done in the same query against the already-shrunk dataset.
If you have other, different use cases, then the document mapping might be worth changing for easier and faster retrieval.
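As an illustration of that single-query shape, here is a minimal sketch for the Apache log case; the index name my_logs and the field names url and ip are hypothetical placeholders:
curl -XGET "http://localhost:9200/my_logs/_search" -d '{
  "query": {
    "bool": {
      "filter": {
        "term": { "url": "/signup.php" }
      }
    }
  },
  "aggs": {
    "visitors": {
      "terms": { "field": "ip", "size": 1000 }
    }
  },
  "size": 0
}'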
This is the workaround that I currently use:
Run this bash script to save the IP list from the first query to a temp index, then use a terms-lookup filter (in Kibana) to query using the IP list from step 1.
#!/usr/bin/env bash
es_host='https://************'
elk_user='************'
cred=($(pass ELK/************ | tr "\n" " "))  ## password
index_name='iis-************'
index_hostname='"************"'
temp_index_path='temp1/_doc/1'
results_limit=1000
timestamp_gte='"2018-03-20T13:00:00"'  # UTC
timestamp_lte='"now"'                  # UTC

# Step 1: aggregate the matching remote IPs.
resp_data="$(curl -X POST $es_host/$index_name/_search -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "index_hostname": {
              "query": $index_hostname
            }
          }
        },
        {
          "regexp": {
            "iis.access.url": {
              "value": ".*((jpg)|(jpeg)|(png))"
            }
          }
        }
      ],
      "must_not": {
        "match": {
          "iis.access.agent": {
            "query": "Amazon+CloudFront"
          }
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "gte": $timestamp_gte,
            "lte": $timestamp_lte
          }
        }
      }
    }
  },
  "aggs": {
    "whatever": {
      "terms": { "field": "iis.access.remote_ip", "size": $results_limit }
    }
  },
  "size": 0
}
EOF
)"

# Extract the bucket keys as a comma-separated list (head strips the trailing comma).
ip_list="$(echo "$resp_data" | jq '.aggregations.whatever.buckets[].key' | tr '\n' ',' | head -c -1)"

# Step 2: store the IP list in a temp document for the terms lookup.
resp_data2="$(curl -X PUT $es_host/$temp_index_path -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "ips": [$ip_list]
}
EOF
)"
echo "$resp_data2"
Query DSL - "terms-query" filter:
{
  "query": {
    "terms": {
      "iis.access.remote_ip": {
        "id": "1",
        "index": "temp1",
        "path": "ips",
        "type": "_doc"
      }
    }
  }
}
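To run the same lookup outside Kibana, the filter can be posted straight to the search endpoint; this sketch reuses the variables defined in the script above:
curl -X GET "$es_host/$index_name/_search" -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "query": {
    "terms": {
      "iis.access.remote_ip": {
        "id": "1",
        "index": "temp1",
        "path": "ips",
        "type": "_doc"
      }
    }
  }
}
EOF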

How to create an alias on two indexes with logstash?

In the cluster that I am working on there are two main indexes, let's say indexA and indexB, but these two indexes are created each day, so normally I have indexA-{+YYYY.MM.dd} and indexB-{+YYYY.MM.dd}.
What I want is one alias that gathers indexA-{+YYYY.MM.dd} and indexB-{+YYYY.MM.dd} together, named alias-{+YYYY.MM.dd}.
Does anyone know how to gather two indexes into one alias with logstash?
Thank you in advance
As far as I know, there's no way to do it with logstash directly. You can do it from an external program using the elasticsearch API: http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
For example:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions": [
    { "add": { "index": "indexA-2015.01.01", "alias": "alias-2015.01.01" } },
    { "add": { "index": "indexB-2015.01.01", "alias": "alias-2015.01.01" } }
  ]
}'
The other option (which doesn't meet your requirement of having it named alias-yyyy.mm.dd) is to use an index template that automatically adds an alias when the index is created.
See http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html:
curl -XPUT localhost:9200/_template/add_alias_template -d '{
  "template": "index*",
  "aliases": {
    "alias": {}
  }
}'
This will add the alias alias to every index whose name matches index*.
You can then run all of your queries against alias. You can set up that alias in Kibana as an index and things will just work.
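For example, once matching indexes exist, searches can target the alias directly (the query string here is just a placeholder):
curl -XGET 'http://localhost:9200/alias/_search?q=something'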

ElasticSearch analyzed fields

I'm building my search but need to analyze one field with different analyzers. For this field I need an analyzer for stemming (snowball) and also one that keeps the full value as a single token (keyword). I can get this to work with the following index settings:
curl -X PUT "http://localhost:9200/$IndexName/" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzer1": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "standard", "lowercase", "stop", "snowball", "my_synonyms" ]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "type": "string",
          "search_analyzer": "analyzer1",
          "index_analyzer": "analyzer1"
        }
      }
    }
  }
}';
The problem comes when searching on a single word in the title field. If it's populated with The Cat in the Hat, it will be stored as "The Cat in the Hat", but if I search for cats I get nothing returned.
Is this even possible to accomplish, or do I need to have two separate fields and analyze one with keyword and the other with snowball?
I'm using NEST in VB code to index the data, if that matters.
Thanks
Robert
You can apply two different analyzers to the same field using the fields property (previously known as multi fields).
My VB.NET is a bit rusty, so I hope you don't mind the C# examples. If you're using the latest code from the dev branch, Fields was just added to each core mapping descriptor so you can now do this:
client.Map<Foo>(m => m
    .Properties(props => props
        .String(s => s
            .Name(o => o.Bar)
            .Analyzer("keyword")
            .Fields(fs => fs
                .String(f => f
                    .Name(o => o.Bar.Suffix("stemmed"))
                    .Analyzer("snowball")
                )
            )
        )
    )
);
Otherwise, if you're using NEST 1.0.2 or earlier (which you likely are), you have to accomplish this via the older multi field type way:
client.Map<Foo>(m => m
    .Properties(props => props
        .MultiField(mf => mf
            .Name(o => o.Bar)
            .Fields(fs => fs
                .String(s => s
                    .Name(o => o.Bar)
                    .Analyzer("keyword"))
                .String(s => s
                    .Name(o => o.Bar.Suffix("stemmed"))
                    .Analyzer("snowball"))
            )
        )
    )
);
Both ways are supported by Elasticsearch and do exactly the same thing: they apply the keyword analyzer to the primary bar field and the snowball analyzer to the bar.stemmed field. stemmed is of course just the suffix I chose in these examples; you can use whatever suffix name you desire. In fact, you don't need a suffix at all: you can name the multi field something completely different from the primary field.
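For reference, both snippets should produce a multi-field mapping along these lines; this is a sketch of the resulting JSON (assuming the type is named foo), not verbatim NEST output:
{
  "foo": {
    "properties": {
      "bar": {
        "type": "string",
        "analyzer": "keyword",
        "fields": {
          "stemmed": {
            "type": "string",
            "analyzer": "snowball"
          }
        }
      }
    }
  }
}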

Join / split search words in elasticsearch (using tire)

I have the following analyzer (a slight tweak to the way snowball would be set up):
string_analyzer: {
  filter: [ "standard", "stop", "snowball" ],
  tokenizer: "lowercase"
}
Here is the field it is applied to:
indexes :title, type: 'string', analyzer: 'string_analyzer'
query do
  match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
end
I have a record in my index with title foo bar.
If I search for foo bar it appears in the results.
However, if I search for foobar it doesn't.
Can someone explain why, and if possible how I could get it to work?
Can someone also explain how I could get the reverse to work, so that if I had a record with title foobar, a user could search for foo bar and see it as a result?
Thanks
You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase tokenizer (which tokenizes a string on non-letter characters and lowercases the tokens), then applying the standard filter (redundant, because you are not using the standard tokenizer) plus the stop and snowball filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "string_analyzer": {
          "filter": [
            "standard",
            "stop",
            "snowball"
          ],
          "tokenizer": "lowercase"
        }
      }
    }
  }
}
'
and use the analyze API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
If you want to be able to search "inside" words, then you need to break words up into smaller tokens. To do this, we use ngrams.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "ngrams": {
          "max_gram": 5,
          "min_gram": 1,
          "type": "ngram"
        }
      },
      "analyzer": {
        "ngrams": {
          "filter": [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar" and we search for "foobar" using the match query, then the query becomes a query looking for any of those tokens, some of which exist in the index.
Unfortunately, it'll also overlap with "wear the fox hat" (f,o,a). While foobar will appear higher up the list of results because it has more tokens in common, you will still get apparently unrelated results.
This can be controlled by using the minimum_should_match parameter, eg:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query": {
    "match": {
      "my_field": {
        "minimum_should_match": "60%",
        "query": "foobar"
      }
    }
  }
}
'
The exact value for minimum_should_match depends on your data; experiment with it.