What is the best way to create a subset of my data in Elasticsearch? - sql

I have an index in Elasticsearch containing Apache log data. Here is what I want to do:
1. Identify all visitors (by IP address) that accessed a certain file (e.g. /signup.php).
2. Run a search/query/aggregation on my data, but limit the documents that are examined to those containing an IP address found in step 1.
In the SQL world, I would just create a temporary table and insert all the matching IP addresses from step 1. Next I would query my main table, limiting the result set by joining in my temporary table on IP address.
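For illustration, the SQL approach would look something like this (a sketch with hypothetical table and column names):
-- step 1: collect the matching visitors
CREATE TEMPORARY TABLE matching_ips AS
    SELECT DISTINCT ip FROM access_log WHERE url = '/signup.php';
-- step 2: restrict the main query to those visitors
SELECT l.*
FROM access_log l
JOIN matching_ips m ON l.ip = m.ip;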
I understand joins are not possible in Elasticsearch. The Elasticsearch documentation suggests a few ways to handle situations like this:
Application-side joins
This does not seem practical, because the list of IP addresses may be very large and it seems inefficient to send the results to the client and then pass them back to Elasticsearch in one huge terms filter.
Denormalizing the data
This would involve iterating over the matching IP addresses and updating every document in the index for any given IP address with something like "in_group": true, so I can use that in my query later on. This also seems very impractical and inefficient, especially since the source query (step 1) is dynamic.
Nested objects and/or parent-child relationships
I'm not sure if dynamically creating new documents with nested objects is practical in this case. It seems to me that I would end up copying huge parts of my data.
I'm new to Elasticsearch and NoSQL in general, so perhaps I'm just looking at the problem the wrong way and I shouldn't be trying to emulate a JOIN in the first place.
But this seems like such a common case for segmenting a dataset, it makes me wonder if I am overlooking some other obvious way of doing this?
Any help would be appreciated!

If I understood your question correctly, you are trying to get a subset of your documents based on a certain condition and then query/search/aggregate that subset further.
If so, why would you want to store it in another view (as in SQL)? A key strength of Elasticsearch is its ability to cache filters, which greatly reduces query time. Using this feature, every query/search/aggregation you need to run would include a term filter expressing the condition from step 1. Whatever other operations you want to perform can then be done in the same query, on the already-narrowed dataset.
If you have other, different use cases, then the document mapping might be worth changing for easier and faster retrieval.
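A minimal sketch of that single-request approach (the index and field names apache-logs, url, and ip are assumptions; the filter expresses the step 1 condition, and the aggregation then runs only on the narrowed document set):
curl -X POST 'http://localhost:9200/apache-logs/_search' -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": { "url": "/signup.php" }
      }
    }
  },
  "aggs": {
    "visitors": {
      "terms": { "field": "ip", "size": 1000 }
    }
  }
}'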

This is the workaround I currently use:
Run this bash script to save the IP list from the first query to a temp index, then use a terms-lookup filter (in Kibana) to query using the IP list from step 1.
#!/usr/bin/env bash
es_host='https://************'
elk_user='************'
cred=($(pass ELK/************ | tr "\n" " ")) ## password
index_name='iis-************'
index_hostname='"************"'
temp_index_path='temp1/_doc/1'
results_limit=1000
timestamp_gte='"2018-03-20T13:00:00"' # UTC
timestamp_lte='"now"' # UTC

resp_data="$(curl -X POST $es_host/$index_name/_search -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "index_hostname": {
              "query": $index_hostname
            }
          }
        },
        {
          "regexp": {
            "iis.access.url": {
              "value": ".*((jpg)|(jpeg)|(png))"
            }
          }
        }
      ],
      "must_not": {
        "match": {
          "iis.access.agent": {
            "query": "Amazon+CloudFront"
          }
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "gte": $timestamp_gte,
            "lte": $timestamp_lte
          }
        }
      }
    }
  },
  "aggs": {
    "whatever": {
      "terms": { "field": "iis.access.remote_ip", "size": $results_limit }
    }
  },
  "size": 0
}
EOF
)"

# Join the bucket keys into a comma-separated list (strip the trailing comma)
ip_list="$(echo "$resp_data" | jq '.aggregations.whatever.buckets[].key' | tr "\n" "," | head -c -1)"

resp_data2="$(curl -X PUT $es_host/$temp_index_path -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
  "ips": [$ip_list]
}
EOF
)"
echo "$resp_data2"
Query DSL - terms-lookup filter:
{
  "query": {
    "terms": {
      "iis.access.remote_ip": {
        "id": "1",
        "index": "temp1",
        "path": "ips",
        "type": "_doc"
      }
    }
  }
}
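This works because a terms lookup fetches the document temp1/_doc/1 at query time and uses its ips array as the terms list, so the potentially huge IP list never has to round-trip through the client a second time.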


Pushing array to Firebase via REST API

Question
How do I (with a single HTTP request to the REST API) write an array to Firebase and give each array element a (non-integer) unique ID?
As described here.
Data
The data I have to write looks like the following.
data-to-write.js
myArray = [ {"user_id": "jack", "text": "Ahoy!"},
{"user_id": "jill", "text": "Ohai!"} ];
Goal
When finished, I want my Firebase to look like this following.
my-firebase.firebaseio.com
{
"posts": {
"-JRHTHaIs-jNPLXOQivY": { // <- unique ID (non-integer)
"user_id": "jack",
"text": "Ahoy!"
},
"-JRHTHaKuITFIhnj02kE": { // <- unique ID (non-integer)
"user_id": "jill",
"text": "Ohai!"
}
}
}
I do not want it to look like this following...
my-anti-firebase.firebaseio.com
// NOT RECOMMENDED - use push() instead!
{
"posts": {
"0": { // <- ordered array index (integer)
"user_id": "jack",
"text": "Ahoy!"
},
"1": { // <- ordered array index (integer)
"user_id": "jill",
"text": "Ohai!"
}
}
}
I note this page where it says:
[...] if all of the keys are integers, and more than half of the keys between 0 and the maximum key in the object have non-empty values, then Firebase will render it as an array.
Code
Because I want to do this in a single HTTP request, I want to avoid iterating over each element in the array and, instead, I want to push a batch in a single request.
In other words, I want to do something like this:
pseudocode.js
curl -X POST -d '[{"user_id": "jack", "text": "Ahoy!"},
{"user_id": "jill", "text": "Ohai!"}]' \
// I want some type of batch operation here
'https://my-firebase.firebaseio.com/posts.json'
However, when I do this, I get exactly what I describe above that I don't want (i.e., sequential integer keys).
I want to avoid doing something like this:
anti-pseudocode.js
for (i = 0; i < myArray.length; i++) { // I want to avoid iterating over myArray
    curl -X POST -d '{"user_id": myArray[i]["user_id"],
                      "text": myArray[i]["text"]}' \
    'https://my-firebase.firebaseio.com/posts.json'
}
Is it possible to accomplish what I have described? If so, how?
I don't think there is a way to use the Firebase API to do this as described in the OP.
However, it can be done with a server script as follows:
1. Iterate through each array element.
2. Assign each element a unique ID (generated by the server script).
3. Create a return object whose keys are the unique IDs and whose values are the corresponding array elements.
4. Write the object to Firebase in a single HTTP request using the PATCH method: POST creates a new Firebase-generated ID for the entire object itself, whereas PATCH does not; it writes directly to the parent node.
script.js
var myObject = {},
    i = myArray.length;
while (i--) {
    var key = function(){ /* return unique ID */ }();
    myObject[key] = myArray[i];
}
curl -X PATCH -d JSON.stringify(myObject) \
'https://my-firebase.firebaseio.com/posts.json'
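For illustration, if the generated keys were the push-style IDs from the goal above (placeholders, not real IDs), the single request would be:
curl -X PATCH -d '{
  "-JRHTHaIs-jNPLXOQivY": {"user_id": "jack", "text": "Ahoy!"},
  "-JRHTHaKuITFIhnj02kE": {"user_id": "jill", "text": "Ohai!"}
}' 'https://my-firebase.firebaseio.com/posts.json'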
Your decision to use POST is correct. What causes numeric indexes as keys is that your payload is an array: whenever you POST/PUT an array, the keys will always be the array indexes. POST your objects one by one if you want the server to generate keys for you.
Firebase will generate a unique ID only if you use POST; if you use PATCH, no unique ID is generated.
Hence, for the given case, you will need to iterate through the array with some server- or client-side code to save the data in Firebase.
Pseudocode:
For each element in the array:
    curl -X POST -d '{"user_id": "jack", "text": "Ahoy!"}' \
    'https://my-firebase.firebaseio.com/posts.json'
Next

Query match without score in elasticsearch

I would like to simply match the value of a field, and I don't care about the score (it will always return one match). I don't want Elasticsearch to compute a score for me, which may result in worse performance... or am I wrong and I shouldn't care?
A simple query like this:
GET /testing/test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "My name here",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
I expect one result with no score (and I don't want to use filtered).
You could override the default similarity with a custom one that just emits a constant score for all matches. See the Elasticsearch documentation on how to set the similarity module.
However, for a query involving just a simple exact match on a term or phrase, the performance impact is unlikely to be significant. Profiling might help determine whether this is really worth pursuing.
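A different technique that avoids scoring altogether is wrapping the match in a constant_score query, so the inner query runs in filter context (a sketch based on the query above):
GET /testing/test/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "match": {
          "name": {
            "query": "My name here",
            "operator": "and"
          }
        }
      }
    }
  }
}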

Unable to filter out n-shingle (n-gram) facets using the "exclude" words option provided in the "facets" query

I am trying to make a tagcloud of words and phrases using the facets feature of elasticsearch.
My mapping:
curl -XPOST http://localhost:9200/myIndex/ -d '{
  ...
  "analysis": {
    "filter": {
      "myCustomShingle": {
        "type": "shingle",
        "max_shingle_size": 3,
        "output_unigrams": true
      }
    },
    "analyzer": { // making a custom analyzer
      "myAnalyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "myCustomShingle",
          "stop"
        ]
      }
    }
  },
  ...
  "mappings": {
    ...
    "description": { // the field to be analyzed for making the tag cloud
      "type": "string",
      "analyzer": "myAnalyzer",
      "null_value": "null"
    },
    ...
  }
}'
Query for generating facets:
curl -X POST "http://localhost:9200/myIndex/myType/_search?&pretty=true" -d '
{
"size":"0",
"query": {
match_all:{}
},
"facets": {
"blah": {
"terms": {
"fields" : ["description"],
"exclude" : [ 'evil' ], //remove facets that contain these words
"size": "50"
}
}
}
}
My problem is, when I insert a word, say 'evil', in the "exclude" option of "facets", it successfully removes the facets containing the words (or single shingles) that match 'evil'. But it doesn't remove the 2- and 3-word shingles: "resident evil", "evil computer", "my evil cat". How do I remove the facets of phrases containing the excluded words?
It isn't completely clear what you want to achieve. You usually wouldn't make facets on analyzed fields. Maybe you could explain why you're making shingles so that we can help achieving what you want in a better way.
With the exclude facet parameter you can exclude some specific entry, but evil is not the same as resident evil. If you want to exclude it you need to specify it. Facets are made based on indexed terms, and resident evil is in fact a single term in the index, which is not the same as the term evil.
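For example, to drop the phrases from the question you would have to list each shingle explicitly in the exclude parameter:
"exclude": ["evil", "resident evil", "evil computer", "my evil cat"]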
Given the choice that you already made for indexing and faceting, there is a way to achieve what you want. Elasticsearch has a really powerful scripting module. You can use a script to decide whether each entry should be included in the facet or not like this:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "tags": {
      "terms": {
        "field": "tags",
        "script": "term.contains('evil') ? false : true"
      }
    }
  }
}

elasticsearch splits by space in facets

I am trying to do a simple facet request over a field containing more than one word (e.g. 'Name1 Name2', sometimes with dots and commas inside) but what I get is...
"terms" : [{
"term" : "Name1",
"count" : 15
},
{
"term" : "Name2",
"count" : 15
}]
so my field value is split on spaces before the facet request runs...
Query example:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
  "query": {
    "query_string": {
      "fields": ["dataset"],
      "query": "2",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "field": ["speciesName"],
        "size": 50000
      }
    }
  }
}'
Your field shouldn't be analyzed, or at least not tokenized. You need to update your mapping and then reindex if you want to index the field without tokenizing it.
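A minimal sketch of such a mapping (pre-5.x string syntax, matching the question; note that changing the mapping of an existing field requires reindexing):
curl -XPUT http://my_server:9200/idx_occurrence/Occurrence/_mapping -d '{
  "Occurrence": {
    "properties": {
      "speciesName": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'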
First of all, javanna provided a very good answer from a practical perspective. However, for the sake of completeness, I want to mention that in some cases there is a way to do it without reindexing the data.
If the speciesName field is stored and your queries produce relatively small number of results, you can use script_field to retrieve stored field values:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
  "query": {
    "query_string": {
      "fields": ["dataset"],
      "query": "2",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "script_field": "_fields['\''speciesName'\''].value",
        "size": 50000
      }
    }
  }
}'
As a result of this query, elasticsearch will retrieve the speciesName field for every record in your result set and it will construct facets from these values. Needless to say, if your result set contains millions of records, performance of this query might be sluggish.
Similarly, if the field is not stored, but record source is stored, you can use script_field facet to retrieve field values from the source:
......
"script_field": "_source['\''speciesName'\'']",
......
Again, source for each record in the result list will be retrieved and parsed, so you might need some patience to run this query on a large set of records.

Highlighting matched results on _all fields

I want the matched results to be highlighted. This works for me if I mention the field name, and it returns the highlighted text; however, if I give the field as "_all", it does not return any value.
This works for me:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
"highlight":{
"fields":{
"my_field":{}
}
}
}'
This returns the expected value as follows:
[highlight] => stdClass Object ( [my_field] => Array ( [0] => stackoverflow is the best website for techies ) )
But when I give this:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
"highlight":{
"fields":{
"_all":{}
}
}
}'
I get null value/no result.
[highlight] => stdClass Object ( [_all] => Array () )
How do I get it to work on any field so that I don't have to mention the field name?
An alternative quick fix that avoids the need to add _all as a stored field in your index: use * instead of _all:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
"highlight":{
"fields":{
"*":{}
}
}
}'
If you are using ES 2.x, then you need to set the require_field_match option to false due to changes made in that release. From the docs:
The default value for the require_field_match option has changed from false to true, meaning that the highlighters will, by default, only take the fields that were queried into account.
This means that, when querying the _all field, trying to highlight on
any field other than _all will produce no highlighted snippets.
"highlight": {
"fields": {
"*": {}
},
"require_field_match": false
}
You need to map the _all field as stored. The mapping below should do the trick. Note though that this will add to the index size.
{
  "my_type": {
    "_all": {
      "enabled": true,
      "store": "yes"
    }
  }
}
This library has functions for query highlighting, including highlighting across all fields. The README explains how to create an Elasticsearch index with the _all field stored, etc.:
https://github.com/niranjan-uma-shankar/Elasticsearch-PHP-class