How to create an alias on two indexes with logstash? - alias

In the cluster that I am working on there are two main indexes, let's say indexA and indexB, but these two indexes are created each day, so normally I have indexA-{+YYYY.MM.dd} and indexB-{+YYYY.MM.dd}.
What I want is to have one alias that gathers indexA-{+YYYY.MM.dd} and indexB-{+YYYY.MM.dd} together and is named alias-{+YYYY.MM.dd}.
Does anyone know how to gather two indexes under one alias with logstash?
Thank you in advance

As far as I know, there's no way to do it with logstash directly. You can do it from an external program using the elasticsearch API: http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
For example:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions" : [
    { "add" : { "index" : "indexA-2015.01.01", "alias" : "alias-2015.01.01" } },
    { "add" : { "index" : "indexB-2015.01.01", "alias" : "alias-2015.01.01" } }
  ]
}'
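If you want that alias created automatically every day, one option is a small script run from cron instead of Logstash. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 and that both daily indexes already exist:
#!/usr/bin/env bash
# Sketch: create alias-YYYY.MM.dd pointing at both daily indexes.
# Assumes localhost:9200 and that today's indexA-/indexB- indexes already exist.
today=$(date -u +%Y.%m.%d)
curl -XPOST "http://localhost:9200/_aliases" -d '
{
  "actions" : [
    { "add" : { "index" : "indexA-'"$today"'", "alias" : "alias-'"$today"'" } },
    { "add" : { "index" : "indexB-'"$today"'", "alias" : "alias-'"$today"'" } }
  ]
}'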
The other option (which doesn't meet your requirement of having it named alias-{+YYYY.MM.dd}) is to use an index template that automatically adds an alias when the index is created.
See http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html:
curl -XPUT localhost:9200/_template/add_alias_template -d '{
  "template" : "index*",
  "aliases" : {
    "alias" : {}
  }
}'
This will add an alias named alias to every index whose name matches index*.
You can then run all of your queries against alias. You can also set that alias up in Kibana as an index and things will just work.

Related

What is the best way to create a subset of my data in Elasticsearch?

I have an index in elasticsearch containing apache log data. Here is what I want to do:
Identify all visitors (by ip number) that accessed a certain file (e.g. /signup.php).
Do a search/query/aggregation on my data, but limit the documents that are examined to those containing an ip number found in step 1.
In the sql world, I would just create a temporary table and insert all the matching IP numbers from step one. Next I would query my main table and limit the result set by joining in my temporary table on IP number.
I understand joins are not possible in elasticsearch. The elasticsearch documentation suggests a few ways to handle situations like this:
Application side joins
This does not seem practical, because the list of IP numbers may be very large, and it seems inefficient to send the results to the client and then pass them back to elasticsearch in one huge terms filter.
Denormalizing the data
This would involve iterating over the matching IP numbers and updating every document in the index for any given IP number with something like "in_group": true, so I can use that in my query later on. This also seems very impractical and inefficient, especially since the source query (step 1) is dynamic.
Nested Object and/or parent-Child relationship
I'm not sure if dynamically creating new documents with nested objects is practical in this case. It seems to me that I would end up copying huge parts of my data.
I'm new to elasticsearch and noSQL in general, so perhaps I'm just looking at the problem the wrong way and I shouldn't be trying to emulate a JOIN in the first place.
But this seems like such a common case for segmenting a dataset, it makes me wonder if I am overlooking some other obvious way of doing this?
Any help would be appreciated!
If I understood your question correctly, you are trying to get a subset of your documents based on a certain condition and then query/search/aggregate within that subset.
If so, why store that subset in another view (as you would in the SQL world)? One of the main strengths of Elasticsearch is its filter caching, which greatly reduces query time. Using that feature, every query/search/aggregation you run would simply include a term filter expressing the condition from step 1, and whatever other operations you want can be done in the same request against the already-reduced dataset.
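A rough sketch of that idea (the index name logs and the field names request and clientip are assumptions, not taken from your mapping; very old versions of Elasticsearch would use a filtered query instead of bool/filter):
# Sketch: apply the step-1 condition as a filter and aggregate the client IPs in one request.
# "logs", "request" and "clientip" are placeholder names.
curl -XGET 'http://localhost:9200/logs/_search' -d '
{
  "size" : 0,
  "query" : {
    "bool" : {
      "filter" : {
        "term" : { "request" : "/signup.php" }
      }
    }
  },
  "aggs" : {
    "visitor_ips" : {
      "terms" : { "field" : "clientip", "size" : 1000 }
    }
  }
}'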
If you have other, different use cases, then the document mapping might be worth changing for easier and faster retrieval.
This is a current workaround that I use:
Run this bash script to save the IP list from the first query to a temporary index, then use a terms lookup query (e.g. in Kibana) to filter using the IP list from step 1.
#!/usr/bin/env bash
es_host='https://************'
elk_user='************'
cred=($(pass ELK/************ | tr "\n" " ")) ##password
index_name='iis-************'
index_hostname='"************"'
temp_index_path='temp1/_doc/1'
results_limit=1000
timestamp_gte='"2018-03-20T13:00:00"' #UTC
timestamp_lte='"now"' #UTC
resp_data="$(curl -X POST $es_host/$index_name/_search -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d #- << EOF
{
"query": {
"bool": {
"must": [{
"match": {
"index_hostname": {
"query": $index_hostname
}
}
},
{
"regexp": {
"iis.access.url":{
"value": ".*((jpg)|(jpeg)|(png))"
}
}
}],
"must_not": {
"match": {
"iis.access.agent": {
"query": "Amazon+CloudFront"
}
}
},
"filter": {
"range": {
"#timestamp": {
"gte": $timestamp_gte,
"lte": $timestamp_lte
}
}
}
}
},
"aggs" : {
"whatever" : {
"terms" : { "field" : "iis.access.remote_ip", "size":$results_limit }
}
},
"size" : 0
}
EOF
)"
ip_list="$(echo "$resp_data" | jq '.aggregations.whatever.buckets[].key' | tr "\n" ",\ " | head -c -1)"
resp_data2="$(curl -X PUT $es_host/$temp_index_path -u $elk_user:${cred[0]} -H 'Content-Type: application/json; charset=utf-8' -d #- << EOF
{
"ips" : [$ip_list]
}
EOF
)"
echo "$resp_data2"
Query DSL - terms lookup query (reads the IP list from the temporary document):
{
  "query": {
    "terms": {
      "iis.access.remote_ip": {
        "index": "temp1",
        "type": "_doc",
        "id": "1",
        "path": "ips"
      }
    }
  }
}

Join / split search words in elasticsearch (using tire)

I have the following analyzer (a slight tweak to the way snowball would be setup):
string_analyzer: {
filter: [ "standard", "stop", "snowball" ],
tokenizer: "lowercase"
}
Here is the field it is applied to:
indexes :title, type: 'string', analyzer: 'string_analyzer'
query do
  match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
end
I have a record in my index with title foo bar.
If I search for foo bar it appears in the results.
However, if I search for foobar it doesn't.
Can someone explain why and if possible how I could get it to?
Can someone explain how I could get the reverse of this to work as well so that if I had a record with title foobar a user could search for foo bar and see it as a result?
Thanks
You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase tokenizer (which tokenizes a string on non-letter characters and lowercases the resulting tokens), then applying the standard filter (redundant, because you are not using the standard tokenizer) plus the stop and snowball filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "string_analyzer" : {
          "filter" : [
            "standard",
            "stop",
            "snowball"
          ],
          "tokenizer" : "lowercase"
        }
      }
    }
  }
}
'
and use the analyze API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
If you want to be able to search "inside" words, then you need to break the words up into smaller tokens. To do this, we use an ngram token filter.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "ngrams" : {
          "max_gram" : 5,
          "min_gram" : 1,
          "type" : "ngram"
        }
      },
      "analyzer" : {
        "ngrams" : {
          "filter" : [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar" and we search for "foobar" using the match query, then the query becomes a query looking for any of those tokens, some of which exist in the index.
Unfortunately, it'll also overlap with "wear the fox hat" (f,o,a). While foobar will appear higher up the list of results because it has more tokens in common, you will still get apparently unrelated results.
This can be controlled by using the minimum_should_match parameter, eg:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "match" : {
      "my_field" : {
        "minimum_should_match" : "60%",
        "query" : "foobar"
      }
    }
  }
}
'
The exact value for minimum_should_match depends upon your data - experiment with it.

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their count for a field. And I am getting wrong results.
term: web
Count: 1191979
term: misc
Count: 1191979
term: passwd
Count: 1191979
term: etc
Count: 1191979
While the actual result should be:
term: WEB-MISC /etc/passwd
Count: 1191979
Here is my sample query:
{
  "facets": {
    "terms1": {
      "terms": {
        "field": "message"
      }
    }
  }
}
If reindexing is an option, the best fix is to change the mapping and mark this field as not_analyzed:
"your_field" : { "type": "string", "index" : "not_analyzed" }
You can use multi field type if keeping an analyzed version of the field is desired:
"your_field" : {
"type" : "multi_field",
"fields" : {
"your_field" : {"type" : "string", "index" : "analyzed"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
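For example, a terms facet pointed at the untouched sub-field would then return the full, untokenized values. A sketch using the same facet request as in the question:
{
  "facets": {
    "terms1": {
      "terms": {
        "field": "your_field.untouched"
      }
    }
  }
}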
Alternatively, if this field is stored, you can use a script field facet instead:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_fields.your_field.value"
}
}
}
As a last resort, if the field is not stored but the record source is stored in the index, you can try this:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_source.your_field"
}
}
}
The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.
Wow, I also ran into this same issue today while running a terms aggregation on a recent version of Elasticsearch. After some googling and reading, I found out how this indexing works (and it is quite simple).
Queries can find only terms that actually exist in the inverted index
When you index the following string
"WEB-MISC /etc/passwd"
it will be passed to an analyzer. The analyzer might tokenize it into
"WEB", "MISC", "etc" and "passwd"
along with their position details, and these tokens might then be lowercased by a token filter to
"web", "misc", "etc" and "passwd"
So, after indexing, a search query can only see those four tokens, not the complete string "WEB-MISC /etc/passwd". For your requirement, these are the options you can use:
1. Change the default analyzer used by Elasticsearch
2. If analysis is not needed, turn it off by setting the fields you need to not_analyzed
3. To make already-indexed data searchable this way, re-indexing is the only option
I have briefly explained this problem and proposed two solutions here.
I have talked about multiple approaches here.
One is the use of not_analyzed to preserve the string as it is. But since that has the drawback of being case sensitive, a better approach is to use the keyword tokenizer + lowercase filter.
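A minimal sketch of that kind of analyzer (my_index, my_type and message are placeholder names): the keyword tokenizer keeps the whole string as a single token, and the lowercase filter makes matching case insensitive:
# Sketch: custom analyzer combining the keyword tokenizer with the lowercase filter.
# my_index, my_type and message are placeholders, not names from the question.
curl -XPUT 'http://localhost:9200/my_index' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "keyword_lowercase" : {
          "tokenizer" : "keyword",
          "filter" : [ "lowercase" ]
        }
      }
    }
  },
  "mappings" : {
    "my_type" : {
      "properties" : {
        "message" : { "type" : "string", "analyzer" : "keyword_lowercase" }
      }
    }
  }
}'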

Highlighting matched results on _all fields

I want the matched results to be highlighted. This works for me if I mention the field name, and it returns the highlighted text; however, if I give the field as "_all", it does not return any value.
This works for me:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
"highlight":{
"fields":{
"my_field":{}
}
}
}'
This returns the expected value as follows:
[highlight] => stdClass Object ( [my_field] => Array ( [0] => stackoverflow is the best website for techies ) )
But when I give this:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
"highlight":{
"fields":{
"_all":{}
}
}
}'
I get null value/no result.
[highlight] => stdClass Object ( [_all] => Array () )
How do I get it to work on any field so that I don't have to mention the field name?
To avoid the need to add _all as a stored field in your index, an alternative quick fix is to use * instead of _all:
curl -XGET "http://localhost:9200/my_index/my_type/_search?q=stackoverflow&size=999" -d '{
"highlight":{
"fields":{
"*":{}
}
}
}'
If you are using ES 2.x then you need to set the require_field_match option to false, due to a change in the default behaviour. From the docs:
The default value for the require_field_match option has changed from false to true, meaning that the highlighters will, by default, only take the fields that were queried into account.
This means that, when querying the _all field, trying to highlight on
any field other than _all will produce no highlighted snippets.
"highlight": {
"fields": {
"*": {}
},
"require_field_match": false
}
You need to map the _all field as stored. The mapping below should do the trick. Note though that this will add to the index size.
{
  "my_type": {
    "_all": {
      "enabled": true,
      "store": "yes"
    }
  }
}
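Applied at index creation it would look something like this (my_index and my_type are placeholders; changing _all on an existing index generally requires reindexing):
# Sketch: create the index with _all stored so it can be highlighted.
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "_all": {
        "enabled": true,
        "store": "yes"
      }
    }
  }
}'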
This library has functions for query highlighting including highlighting across all fields. The README explains how to create an elasticsearch index with _all field stored etc:
https://github.com/niranjan-uma-shankar/Elasticsearch-PHP-class

In MongoDB, is there anyway to tell what index is on a collection besides using coll.find({...}).explain()?

I think explain() only tells me which indexes that particular query can use. How about just showing all the indexes defined on the collection? (Or even for the whole db?)
>db.system.indexes.find();
>db.system.indexes.find( { ns: "tablename" } );
will give you something like
{
  "ns" : "test.fs.chunks",
  "key" : { "files_id" : 1, "n" : 1 },
  "name" : "files_id_1_n_1"
}
for every index (ns is the full namespace, i.e. database.collection).
Or use the collection name. I.e., if you have a users collection do:
db.users.getIndexes()
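And if you want the indexes for every collection in a database at once, one option (just a sketch; mydb is a placeholder database name) is to loop over the collections from the shell:
# Sketch: print the index definitions of every collection in the database.
mongo mydb --eval 'db.getCollectionNames().forEach(function (c) {
  printjson(db.getCollection(c).getIndexes());
})'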