Elastic Search Index not analyzed - indexing

I'm new to elastic search, and I'm having a hard time with the analyzers.
I am creating an index like this (to replicate my problem, you can copy and paste the following code into your console directly).
Please read comments in the script for my problem and questions.
#!/bin/bash
# fails if the index doesn't exist but that's OK
curl -XDELETE 'http://localhost:9200/movies/'
# creating the index that will allow type wrapper, and generate _id automatically from the path
curl -XPOST http://localhost:9200/movies -d '{
"settings" : {
"number_of_shards" : 1,
"mapping.allow_type_wrapper" : true,
"analysis": {
"analyzer": {
"en_std": {
"type":"standard",
"stopwords": "_english_"
}
}
}
},
"mappings" : {
"movie" : {
"_id" : {
"path" : "movie.id"
}
}
}
}'
# inserting some data
curl -XPOST http://localhost:9200/movies/movie -d '{
"movie" : {
"id" : 101,
"title" : "Bat Man",
"starring" : {
"firstname" : "Christian",
"lastname" : "Bale"
}
}
}'
# trying to get by ID ... \m/ works!!!
curl -XGET http://localhost:9200/movies/movie/101
# trying to search using query_string ... \m/ works
curl -XPOST http://localhost:9200/movies/movie/_search -d '{
"query" : {
"query_string" : {
"query" : "bat"
}
}
}'
# when I try to search in a particular field it fails and returns 0 hits
curl -XPOST http://localhost:9200/movies/_search -d '{
"query" : {
"query_string" : {
"query" : "bat",
"fields" : ["movie.title"]
}
}
}'
# I thought the analyzer was the problem, so I checked.
curl 'http://localhost:9200/movies/movie/_search?pretty=true' -d '{
"query" : {
"query_string" : {
"query" : "bat"
}
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "movie.title"
}
}
}
}'
# The field wasn't analyzed.
# the following is the result
#{
# "took" : 1,
# "timed_out" : false,
# "_shards" : {
# "total" : 1,
# "successful" : 1,
# "failed" : 0
# },
# "hits" : {
# "total" : 1,
# "max_score" : 0.13424811,
# "hits" : [ {
# "_index" : "movies",
# "_type" : "movie",
# "_id" : "101",
# "_score" : 0.13424811,
# "fields" : {
# "terms" : [ "Bat Man" ]
# }
# } ]
# }
#}
# So I even tried the term exactly as stored... Nope, didn't work :( 0 hits.
curl -XPOST http://localhost:9200/movies/_search -d '{
"query" : {
"query_string" : {
"query" : "Bat Man",
"fields" : ["movie.title"]
}
}
}'
Can anyone point out what I'm doing wrong?

You should insert a sleep 1 command right after indexing the doc and everything will work.
Elasticsearch provides search in near real-time (read this). When you index a document, the underlying Lucene index is not updated (refreshed, in Elasticsearch terms) immediately. How frequently your index is refreshed is configurable at the index level. You can also force a refresh by passing the query parameter refresh=true with every indexing request, which makes Elasticsearch update the index right away, but depending on your requirements this may hurt performance.
There is a Refresh API as well.
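For example, against the movies index from the question, either of the following should work (the second document is made up purely for illustration):
# refresh the whole index so previously indexed docs become searchable immediately
curl -XPOST 'http://localhost:9200/movies/_refresh'
# or refresh as part of the indexing request itself; handy for tests,
# but expensive if done on every request in production
curl -XPOST 'http://localhost:9200/movies/movie?refresh=true' -d '{
  "movie" : {
    "id" : 102,
    "title" : "Bat Man Returns"
  }
}'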

Related

How to query and list all types within an elasticsearch index?

Problem: What is the most correct way to simply query for and list all types within a specific index (and all indices) in elasticsearch?
I've been reading through the reference and API but can't seem to find anything obvious.
I can list indices with the command:
$ curl 'localhost:9200/_cat/indices?v'
I can get stats (which don't seem to include types) with the command:
$ curl localhost:9200/_stats
I'd expect that there'd be a straightforward command as simple as:
$ curl localhost:9200/_types
or
$ curl localhost:9200/index_name/_types
Thanks for any help you can offer.
What you call "type" is actually a "mapping type" and the way to get them is simply by using:
curl -XGET localhost:9200/_all/_mapping
Now since you only want the names of the mapping types, you don't need to install anything, as you can simply use Python to extract just what you want from that previous response:
curl -XGET localhost:9200/_all/_mapping | python -c 'import json,sys; resp = json.load(sys.stdin); types = [type for index in resp for type in resp[index]["mappings"]]; print(types)'
The Python script does something very simple, i.e. it iterates over all the indices and mapping types and only retrieves the latter's names:
import json,sys
resp = json.load(sys.stdin)
types = [type for index in resp for type in resp[index]["mappings"]]
print(types)
UPDATE
Since you're using Ruby, the same trick is available by using Ruby code:
curl -XGET localhost:9200/_all/_mapping | ruby -e "require 'rubygems'; require 'json'; resp = JSON.parse(STDIN.read); resp.each { |index, indexSpec | indexSpec['mappings'].each {|type, fields| puts type} }"
The Ruby script looks like this:
require 'rubygems';
require 'json';
resp = JSON.parse(STDIN.read);
resp.each { |index, indexSpec |
indexSpec['mappings'].each { |type, fields|
puts type
}
}
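If you happen to have jq installed, the same extraction is a one-liner (just a sketch, same _mapping response assumed):
curl -s -XGET localhost:9200/_all/_mapping | jq -r 'to_entries[] | .value.mappings | keys[]'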
You can also hit the _mapping API for a single index, so you will only see the "mappings" section of that index.
For example: curl -XGET http://localhost:9200/YourIndexName/_mapping?pretty
You will get something like that:
{
"YourIndexName" : {
"mappings" : {
"mapping_type_name_1" : {
"properties" : {
"dateTime" : {
"type" : "date"
},
"diskMaxUsedPct" : {
"type" : "integer"
},
"hostName" : {
"type" : "keyword"
},
"load" : {
"type" : "float"
},
"memUsedPct" : {
"type" : "float"
},
"netKb" : {
"type" : "long"
}
}
},
"mapping_type_name_2" : {
"properties" : {
"dateTime" : {
"type" : "date"
},
"diskMaxUsedPct" : {
"type" : "integer"
},
"hostName" : {
"type" : "keyword"
},
"load" : {
"type" : "float"
},
"memUsedPct" : {
"type" : "float"
}
}
}
}
}
}
mapping_type_name_1 and mapping_type_name_2 are the types in this index, and you can also see the structure of these types.
A good explanation of mapping types is here: https://logz.io/blog/elasticsearch-mapping/
The same can be done programmatically in Java (using Apache HttpClient and json-lib):
private Set<String> getTypes(String indexName) throws Exception {
    HttpClient client = HttpClients.createDefault();
    // fetch the mappings of the given index
    HttpGet mappingsRequest = new HttpGet(getServerUri() + "/" + indexName + "/_mappings");
    HttpResponse mappingsResponse = client.execute(mappingsRequest);
    String response = IOUtils.toString(mappingsResponse.getEntity().getContent(), Charset.defaultCharset());
    // the mapping type names are the keys of the index's "mappings" object
    JSONObject mappings = JSONObject.fromObject(response).getJSONObject(indexName).getJSONObject("mappings");
    Set<String> types = mappings.keySet();
    return types;
}

Elasticsearch - Extracting PDF content and encoding with base64

I want to be able to extract content from a PDF file and to be able to search within that content using ElasticSearch.
I did install elasticsearch/elasticsearch-mapper-attachments/2.6.0
I have created a new index named "docs".
I created a file named "tmp.json" with this content:
{"title": "file.pdf", "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="}
I then executed the following:
curl -X PUT "http://localhost:9200/docs/attachment/_mapping" -d '{
"attachment": {
"properties" : {
'file" : {
"type" : "attachment",
"fields" : {
"title" : {"store":"yes"},
"file":{
"type":"string",
"term_vector":"with_positions_offsets",
"store":"yes"}
}
}
}
}
}'
and the following:
curl -X POST "http://localhost:9200/docs/attachment" -d @tmp.json
The problem is that the content is stored exactly as it is in the file (still base64-encoded).
I was expecting the content to be decoded, like so:
base64.b64decode("IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==")
That gives:
b'"God Save the Queen" (alternatively "God Save the King"'
To encode in base64, here is what I do:
import json, base64
fname = 'file.pdf'
file64 = base64.b64encode(open(fname, "rb").read()).decode('ascii')
f = open('tmp.json', 'w')
data = {"file": file64, "title": fname}
json.dump(data, f)
f.close()
I would like to be able to see the content using Kibana (but for now I see only the base64 data ...)
This didn't work:
curl -X PUT "http://localhost:9200/docs/attachment/_mapping" -d '{
"attachment": {
"properties" : {
"content" : {
"type" : "attachment",
"fields" : {
"title" : {"store":"yes"},
"content":{
"type":"string",
"term_vector":"with_positions_offsets",
"store":"yes"}
}
}
}
}
}'
This worked, and I can see the content of the PDF through Kibana:
curl -X PUT "http://localhost:9200/docs" -d '{
"mappings" : {
"attachment" : {
"properties" : {
"content" : {
"type" : "attachment",
"fields" : {
"content" : { "store" : "yes" },
"author" : { "store" : "yes" },
"title" : { "store" : "yes"},
"date" : { "store" : "yes" },
"keywords" : { "store" : "yes", "analyzer" : "keyword" },
"name" : { "store" : "yes" },
"content_length" : { "store" : "yes" },
"content_type" : { "store" : "yes" }
}
}
}
}
}
}'
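With that mapping in place, a rough sketch of indexing the base64 payload into the content field and then searching the extracted text would be as follows (the query term "queen" comes from the decoded sample above; exact sub-field behaviour depends on the mapper-attachments plugin version):
curl -X POST "http://localhost:9200/docs/attachment?refresh=true" -d '{
  "content" : "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}'
curl -X POST "http://localhost:9200/docs/attachment/_search?pretty" -d '{
  "query" : { "match" : { "content" : "queen" } }
}'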

How to change tokenizer in elasticsearch in the existing index

I have the following problem:
I have an index of 30 million documents with the mapping as follows:
curl -XPUT localhost:8080/xxxxx/yyyyy/_mapping?pretty=true -d '{
  "xxxxx" : {
    "_id" : {"type" : "string", "index" : "not_analyzed"},
    "properties" : {
      "content" : {"type" : "string", "store" : "no"},
      "title" : {"type" : "string", "index" : "no"},
      "created_date" : {"type" : "integer", "index" : "not_analyzed"},
      "url" : {"type" : "string", "index" : "not_analyzed"},
      "author" : {"type" : "string", "index" : "no"},
      "author_url" : {"type" : "string", "index" : "no"},
      "domain" : {"type" : "string", "index" : "not_analyzed"},
      "lang" : {"type" : "string", "index" : "no"}
    }
  }
}'
No tokenizer is selected in the settings, so the standard one applies.
I would like to use a "facets" request to build a ranking of links (URLs) from the "content" field. Unfortunately I cannot do that, because the standard tokenizer splits the links (URLs) into pieces.
Question:
Can I change the tokenizer on an existing index without reindexing, so that new documents added to the index use the new tokenizer (uax_url_email) while old documents remain unchanged?
I tried that:
curl -XPUT localhost:8080/xxxxx -d '{
"settings" : {
"index": {
"analysis" :{
"analyzer": {
"default": {
"type" : "custom",
"tokenizer" : "uax_url_email",
"filter" : "lowercase"
}
}
}
}
}
}
'
but I get an error:
{"error": "IndexAlreadyExistsException [[xxxxx] Already exists]", "status": 400}
Is there another way, without reindexing, to build a ranking of links (URLs) with a "facets" query?
Thank you in advance for any help.
Try the following for the existing index "xxxxx":
curl -XPUT localhost:8080/xxxxx/_settings -d '{
  "analysis" : {
    "analyzer" : {
      "default" : {
        "type" : "custom",
        "tokenizer" : "uax_url_email",
        "filter" : "lowercase"
      }
    }
  }
}'
Make sure your Elasticsearch port really is 8080; by default it is 9200.
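Note that on many Elasticsearch versions you can only update analysis settings on a closed index, so the full sequence may need to look like this (a sketch, to be verified against your version):
curl -XPOST localhost:8080/xxxxx/_close
curl -XPUT localhost:8080/xxxxx/_settings -d '{
  "analysis" : {
    "analyzer" : {
      "default" : {
        "type" : "custom",
        "tokenizer" : "uax_url_email",
        "filter" : "lowercase"
      }
    }
  }
}'
curl -XPOST localhost:8080/xxxxx/_open
Either way, existing documents keep the tokens they were indexed with; only documents indexed (or reindexed) after the change go through the new analyzer, which matches what you asked for.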

Exact document matching with ElasticSearch

I need to query exactly against a set of "short documents". Example:
Documents:
{"name": "John Doe", "alt": "John W Doe"}
{"name": "My friend John Doe", "alt": "John A Doe"}
{"name": "John", "alt": "Susy"}
{"name": "Jack", "alt": "John Doe"}
Expected results:
If I search "John Doe", I want the score of 1 to be much bigger than the score of 2 and 4
If I search "John Doé", the same as above
If I search "John", i want to get 3 (exact match is better than repetition in name and alt)
Is it possible with ES? How can i achieve this? I tried boosting "name", but i can't find how to exactly match the document field, instead of searching inside of it.
What you are describing is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document, based on:
how common the term is in all documents (more common == less relevant)
how common is the term within the field of the document (more common == more relevant)
how long is the field of the document (longer == less relevant)
The reason you are not seeing clear results is that Elasticsearch is distributed, and you are testing with small amounts of data. An index by default has 5 primary shards, and your docs are indexed on different shards. Each shard has its own doc frequency counts, so the scores are being distorted.
When you add real-world amounts of data, the frequencies even themselves out over shards, but for testing small amounts of data, you need to do one of two things:
create an index with only one primary shard (see the sketch after this list), or
specify search_type=dfs_query_then_fetch which first fetches the frequencies from each shard before running the query using the global frequencies
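The first option is just a settings change when the index is created; a minimal sketch for the same test index used below would be:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '{
  "settings" : { "number_of_shards" : 1 }
}'
The demonstration that follows uses the second option instead, so the index keeps its default five shards.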
To demonstrate, first index your data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now, search for "john doe", remembering to specify dfs_query_then_fetch.
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
"query" : {
"match" : {
"name" : "john doe"
}
}
}
'
Doc 1 is the first in the results:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 8
# }
When you search for just "john":
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
"query" : {
"match" : {
"name" : "john"
}
}
}
'
Doc 3 appears first:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 1,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 0.625,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.5,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# }
# ],
# "max_score" : 1,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 5
# }
Ignoring accents
The second issue is that of matching "John Doé". This is an issue of analysis. In order to make full text more searchable, we analyse it into separate terms or tokens, which are what is stored in the index. In order to match eg john, John and JOHN when the user searches for john, each term/token is passed through a number of token filters, to put them into a standard form.
When we do a full text search, the search terms go through this exact same process. So if we have a document which contains John, this is indexed as john, and if the user searches for JOHN, we actually search for john.
In order to make Doé match doe, we need a token filter which removes accents, and we need to apply it both to the text being indexed, and to the search terms. The simplest way to do this is to use the ASCII folding token filter.
We can define a custom analyzer when we create an index, and we can specify in the mapping that a particular field should use that analyzer, both at index time and at search time.
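Before rebuilding the index, you can sanity-check the folding directly with the analyze API; roughly like this (the query parameter is named filters in older releases and filter in newer ones):
curl -XGET 'http://127.0.0.1:9200/_analyze?tokenizer=standard&filters=lowercase,asciifolding&pretty=1' -d 'John Doé'
which should return the tokens john and doe.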
First, delete the old index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
Then create the index, specifying the custom analyzer and the mapping:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"no_accents" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "no_accents"
}
}
}
}
}
'
Reindex the data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now, test the search:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
"query" : {
"match" : {
"name" : "john doé"
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 6
# }
I think you will achieve what you need if you map the field as a multi_field and boost the non-analyzed sub-field:
"name": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed",
"boost": "1.1"
},
"name": {
"include_in_all": true,
"type": "string",
"index": "analyzed",
"search_analyzer": "someanalyzer",
"index_analyzer": "someanalyzer"
}
}
}
You could also boost at query time instead of index time if you need flexibility, by using the '^' notation in query_string:
{
  "query_string" : {
    "fields" : ["name", "name.untouched^5"],
    "query" : "this AND that OR thus"
  }
}

Filename search with ElasticSearch

I want to use ElasticSearch to search filenames (not the file's content). Therefore I need to find a part of the filename (exact match, no fuzzy search).
Example:
I have files with the following names:
My_first_file_created_at_2012.01.13.doc
My_second_file_created_at_2012.01.13.pdf
Another file.txt
And_again_another_file.docx
foo.bar.txt
Now I want to search for 2012.01.13 to get the first two files.
A search for file or ile should return all filenames except the last one.
How can I accomplish that with ElasticSearch?
This is what I have tested, but it always returns zero results:
curl -X DELETE localhost:9200/files
curl -X PUT localhost:9200/files -d '
{
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"filename_analyzer" : {
"type" : "custom",
"tokenizer" : "lowercase",
"filter" : ["filename_stop", "filename_ngram"]
}
},
"filter" : {
"filename_stop" : {
"type" : "stop",
"stopwords" : ["doc", "pdf", "docx"]
},
"filename_ngram" : {
"type" : "nGram",
"min_gram" : 3,
"max_gram" : 255
}
}
}
}
},
"mappings": {
"files": {
"properties": {
"filename": {
"type": "string",
"analyzer": "filename_analyzer"
}
}
}
}
}
'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"
FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'
for file in ${FILES}
do
echo; echo; echo ">>> ${file}"
curl "${file}&pretty=true"
done
You have various problems with what you pasted:
1) Incorrect mapping
When creating the index, you specify:
"mappings": {
"files": {
But your type is actually file, not files. If you checked the mapping, you would see that immediately:
curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'
# {
# "files" : {
# "files" : {
# "properties" : {
# "filename" : {
# "type" : "string",
# "analyzer" : "filename_analyzer"
# }
# }
# },
# "file" : {
# "properties" : {
# "filename" : {
# "type" : "string"
# }
# }
# }
# }
# }
2) Incorrect analyzer definition
You have specified the lowercase tokenizer, but that removes anything that isn't a letter (see the docs), so your numbers are being completely removed.
You can check this with the analyze API:
curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase'
# {
# "tokens" : [
# {
# "end_offset" : 2,
# "position" : 1,
# "start_offset" : 0,
# "type" : "word",
# "token" : "my"
# },
# {
# "end_offset" : 7,
# "position" : 2,
# "start_offset" : 3,
# "type" : "word",
# "token" : "file"
# },
# {
# "end_offset" : 22,
# "position" : 3,
# "start_offset" : 19,
# "type" : "word",
# "token" : "doc"
# }
# ]
# }
3) Ngrams on search
You include your ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.
For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens:
a b c d ab bc cd abc bcd
But if you search on "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:
d c b a dc cb ba dbc cba
So a, b, c and d will match!
Solution
First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which will anchor the ngram to the start (or end) of each word.
Also, why exclude docx etc? Surely a user may well want to search on the file type?
So let's break up each filename into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):
My_first_file_2012.01.13.doc
=> my first file 2012 01 13 doc
Then for the index analyzer, we'll also use edge ngrams on each of those tokens:
my => m my
first => f fi fir firs first
file => f fi fil file
2012 => 2 20 201 2012
01 => 0 01
13 => 1 13
doc => d do doc
We create the index as follows:
curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"filename_search" : {
"tokenizer" : "filename",
"filter" : ["lowercase"]
},
"filename_index" : {
"tokenizer" : "filename",
"filter" : ["lowercase","edge_ngram"]
}
},
"tokenizer" : {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
"filter" : {
"edge_ngram" : {
"side" : "front",
"max_gram" : 20,
"min_gram" : 1,
"type" : "edgeNGram"
}
}
}
},
"mappings" : {
"file" : {
"properties" : {
"filename" : {
"type" : "string",
"search_analyzer" : "filename_search",
"index_analyzer" : "filename_index"
}
}
}
}
}
'
Now, test that our analyzers are working correctly:
filename_search:
curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search'
[results snipped]
"token" : "my"
"token" : "first"
"token" : "file"
"token" : "2012"
"token" : "01"
"token" : "13"
"token" : "doc"
filename_index:
curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index'
"token" : "m"
"token" : "my"
"token" : "f"
"token" : "fi"
"token" : "fir"
"token" : "firs"
"token" : "first"
"token" : "f"
"token" : "fi"
"token" : "fil"
"token" : "file"
"token" : "2"
"token" : "20"
"token" : "201"
"token" : "2012"
"token" : "0"
"token" : "01"
"token" : "1"
"token" : "13"
"token" : "d"
"token" : "do"
"token" : "doc"
OK - seems to be working correctly. So let's add some docs:
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"
And try a search:
curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
"query" : {
"text" : {
"filename" : "2012.01"
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "filename" : "My_second_file_created_at_2012.01.13.pdf"
# },
# "_score" : 0.06780553,
# "_index" : "files",
# "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
# "_type" : "file"
# },
# {
# "_source" : {
# "filename" : "My_first_file_created_at_2012.01.13.doc"
# },
# "_score" : 0.06780553,
# "_index" : "files",
# "_id" : "ER5RmyhATg-Eu92XNGRu-w",
# "_type" : "file"
# }
# ],
# "max_score" : 0.06780553,
# "total" : 2
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 4
# }
Success!
#### UPDATE ####
I realised that a search for 2012.01 would match both 2012.01.13 and 2012.12.01, so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as for the start of the word).
The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query which tries to match ANY token. However, for a text_phrase query, it tries to match ALL of the tokens, and in the correct order.
To demonstrate the issue, index another doc with a different date:
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
curl -X POST "http://localhost:9200/files/_refresh"
And do the same search as above:
curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
"query" : {
"text" : {
"filename" : {
"query" : "2012.01"
}
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "filename" : "My_third_file_created_at_2012.12.01.doc"
# },
# "_score" : 0.22097087,
# "_index" : "files",
# "_id" : "xmC51lIhTnWplOHADWJzaQ",
# "_type" : "file"
# },
# {
# "_source" : {
# "filename" : "My_first_file_created_at_2012.01.13.doc"
# },
# "_score" : 0.13137488,
# "_index" : "files",
# "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
# "_type" : "file"
# },
# {
# "_source" : {
# "filename" : "My_second_file_created_at_2012.01.13.pdf"
# },
# "_score" : 0.13137488,
# "_index" : "files",
# "_id" : "XwLNnSlwSeyYtA2y64WuVw",
# "_type" : "file"
# }
# ],
# "max_score" : 0.22097087,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 5
# }
The first result has a date 2012.12.01 which isn't the best match for 2012.01. So to match only that exact phrase, we can do:
curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
"query" : {
"text_phrase" : {
"filename" : {
"query" : "2012.01",
"analyzer" : "filename_index"
}
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "filename" : "My_first_file_created_at_2012.01.13.doc"
# },
# "_score" : 0.55737644,
# "_index" : "files",
# "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
# "_type" : "file"
# },
# {
# "_source" : {
# "filename" : "My_second_file_created_at_2012.01.13.pdf"
# },
# "_score" : 0.55737644,
# "_index" : "files",
# "_id" : "XwLNnSlwSeyYtA2y64WuVw",
# "_type" : "file"
# }
# ],
# "max_score" : 0.55737644,
# "total" : 2
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 7
# }
Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries but increase the importance of the filename which is in the correct order:
curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
"query" : {
"bool" : {
"should" : [
{
"text_phrase" : {
"filename" : {
"boost" : 2,
"query" : "2012.01",
"analyzer" : "filename_index"
}
}
},
{
"text" : {
"filename" : "2012.01"
}
}
]
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "filename" : "My_first_file_created_at_2012.01.13.doc"
# },
# "_score" : 0.56892186,
# "_index" : "files",
# "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
# "_type" : "file"
# },
# {
# "_source" : {
# "filename" : "My_second_file_created_at_2012.01.13.pdf"
# },
# "_score" : 0.56892186,
# "_index" : "files",
# "_id" : "XwLNnSlwSeyYtA2y64WuVw",
# "_type" : "file"
# },
# {
# "_source" : {
# "filename" : "My_third_file_created_at_2012.12.01.doc"
# },
# "_score" : 0.012931341,
# "_index" : "files",
# "_id" : "xmC51lIhTnWplOHADWJzaQ",
# "_type" : "file"
# }
# ],
# "max_score" : 0.56892186,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 4
# }
I believe this is because of the tokenizer being used.
http://www.elasticsearch.org/guide/reference/index-modules/analysis/lowercase-tokenizer.html
The lowercase tokenizer splits out on word boundaries so 2012.01.13 will be indexed as "2012","01" and "13". Searching for the string "2012.01.13" will obviously not match.
One option would be to add the tokenisation on search as well. Therefore, searching for "2012.01.13" will be tokenised down to the same tokens as in the index and it will match. This is also handy as you then don't need to always lowercase your searches in code.
The second option would be to use an n-gram tokenizer instead of the filter. This means it will ignore word boundaries (and you will get the "_"s as well); however, you may have issues with case mismatches, which is presumably the reason you added the lowercase tokenizer in the first place.
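For that second option, a rough sketch of the settings might look like this (the analyzer and tokenizer names are invented, and the gram sizes would need tuning for your filenames):
curl -X PUT localhost:9200/files -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "filename_ngram" : {
          "type" : "custom",
          "tokenizer" : "filename_ngram_tokenizer",
          "filter" : ["lowercase"]
        }
      },
      "tokenizer" : {
        "filename_ngram_tokenizer" : {
          "type" : "nGram",
          "min_gram" : 3,
          "max_gram" : 20
        }
      }
    }
  }
}'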
I have no experience with ES, but in Solr you would need to specify the field type as text.
Your field is of type string instead of text. String fields are not analyzed, but stored and indexed verbatim. Give that a shot and see if it works.
properties": {
"filename": {
"type": "string",
"analyzer": "filename_analyzer"
}