How to do automated index creation in Elasticsearch? - indexing

Just like WordPress? See: http://gibrown.wordpress.com/2014/02/06/scaling-elasticsearch-part-2-indexing/
In our case we create one index for every 10 million blogs, with 25 shards per index.
Can anyone shed some light on this?
Thanks!

You do it in whatever your favorite scripting language is: first run a query to get the count of documents in the current index, and if the count is beyond a certain threshold, create a new index, either via an Elasticsearch client API or with curl.
Here's the query to find the number of docs:
curl -XGET 'http://localhost:9200/youroldindex/_count'
Here's the index creation curl:
curl -XPUT 'http://localhost:9200/yournewindex/' -d '{
  "settings" : {
    "number_of_shards" : 25,
    "number_of_replicas" : 2
  }
}'
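Tying those two calls together, here is a minimal sketch in bash; the 10-million threshold and the index names are placeholders taken from the examples above, so adjust them to your own naming scheme:
#!/bin/sh
# Roll over to a new index once the old one passes a document threshold.
THRESHOLD=10000000
# Crude JSON parsing for brevity; use a proper JSON parser in real scripts.
COUNT=$(curl -s -XGET 'http://localhost:9200/youroldindex/_count' | grep -o '"count":[0-9]*' | cut -d: -f2)
if [ "$COUNT" -gt "$THRESHOLD" ]; then
  curl -XPUT 'http://localhost:9200/yournewindex/' -d '{
    "settings" : {
      "number_of_shards" : 25,
      "number_of_replicas" : 2
    }
  }'
fi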
You will also probably want to create aliases so that your code can always point to a single index alias and then change the alias as you change your hot index:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
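For example, switching the alias from the old index to the new one is a single atomic call to the _aliases endpoint (the alias name "hotindex" is just a placeholder):
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    { "remove" : { "index" : "youroldindex", "alias" : "hotindex" } },
    { "add" : { "index" : "yournewindex", "alias" : "hotindex" } }
  ]
}'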
You will probably want to predefine your mappings too:
curl -XPUT 'http://localhost:9200/yournewindex/yournewmapping/_mapping' -d '
{
  "yournewmapping" : {
    "properties" : {
      "message" : { "type" : "string", "store" : true }
    }
  }
}
'
Elasticsearch has fairly complete documentation, a few good places to look:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html

Related

Elasticsearch query context vs filter context

I am a little bit confused by the Elasticsearch Query DSL's query context and filter context. I have the two queries below. Both return the same result, but the first one calculates a score and the second one does not. Which one is more appropriate?
1st Query :-
curl -XGET 'localhost:9200/xxx/yyy/_search?pretty' -d'
{
  "query": {
    "bool": {
      "must": {
        "terms": { "mcc" : ["5045","5499"] }
      },
      "must_not": {
        "term": { "maximum_flag": false }
      },
      "filter": {
        "geo_distance": {
          "distance": "500",
          "location": "40.959334, 29.082142"
        }
      }
    }
  }
}'
2nd Query :-
curl -XGET 'localhost:9200/xxx/yyy/_search?pretty' -d'
{
  "query": {
    "bool" : {
      "filter": [
        { "term": { "maximum_flag": true } },
        { "terms": { "mcc" : ["5045","5499"] } },
        {
          "geo_distance": {
            "distance": "500",
            "location": "40.959334, 29.082142"
          }
        }
      ]
    }
  }
}'
Thanks,
In the official guide you have a good explanation:
Query context
A query clause used in query context answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a _score representing how well the document matches, relative to other documents.
Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter in the search API.
Filter context
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
Does this timestamp fall into the range 2015 to 2016?
Is the status field set to "published"?
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query, the filter parameter in the constant_score query, or the filter aggregation.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-filter-context.html
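As a quick illustration of filter context, the constant_score query mentioned above wraps a clause that is only filtered, never scored (the index and field names here are just placeholders):
curl -XGET 'http://localhost:9200/yourindex/_search' -d '{
  "query" : {
    "constant_score" : {
      "filter" : {
        "term" : { "status" : "published" }
      }
    }
  }
}'
Every matching document comes back with the same constant _score, since no relevance is computed.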
About your case, we would need more information, but taking into account you are looking for exact values, a filter would suit it better.
The first query calculates a score because the "terms" clause is placed directly under "must" rather than being wrapped in "filter", so it runs in query context, which results in score calculation.
In the second query the "term" clause sits inside "filter", which changes its context from query context to filter context. In filter context no score is calculated (by default a _score of 1 is assigned to all matching documents).
You can find more details about query behavior in this article:
https://towardsdatascience.com/deep-dive-into-querying-elasticsearch-filter-vs-query-full-text-search-b861b06bd4c0

elasticsearch upsert without document id

I query my index to find a document. If I find one, I know the _id value to update; otherwise I don't have an _id value.
Using the upsert below, I can update when I have the _id. If I don't have an _id, how can I have Elasticsearch provide one and insert a new document?
Purpose: I don't want to have two functions, one to create a new doc and another to update it...
curl -XPOST 'localhost:9200/test/type1/{value_of_id}/_update' -d '{
  "doc" : {
    "name" : "new_name"
  },
  "doc_as_upsert" : true
}'
Something like "update by query"?
See here:
https://github.com/elastic/elasticsearch/issues/2230
for the original issue/proposal, some experimental work toward implementation, discussion about the pros and cons of including, and a link to the plug-in that was developed to support the behavior.
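Separately from that proposal, if your search finds no existing document, you can let Elasticsearch generate the _id by POSTing to the type URL without an ID (a sketch using the same index and type as in your example):
curl -XPOST 'localhost:9200/test/type1/' -d '{
  "name" : "new_name"
}'
The response contains the auto-generated _id, which you can keep for subsequent _update calls.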

How to create elasticsearch index alias that excludes specific fields

I'm using Elasticsearch's index aliases to create restricted views on a more-complete index to support a legacy search application. This works well. But I'd also like to exclude certain sensitive fields from the returned result (they contain email addresses, and we want to preclude harvesting.)
Here's what I have:
PUT full-index/_alias/restricted-index-alias
{
  "_source": {
    "exclude": [ "field_with_email" ]
  },
  "filter": {
    "term": { "indexflag": "noindex" }
  }
}
This works for queries (I don't see field_with_email), and the filter term works (I get a restricted index) but I still see the field_with_email in query results from the index alias.
Is this supposed to work?
(I don't want to exclude from _source in the mapping, as I'm also using partial updates; these are easier if the entire document is available in _source.)
No, it is not supposed to work, and the documentation doesn't suggest that it should work.
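If changing the search requests is an option, a possible workaround (not part of the answer above) is source filtering at query time, which the search API does support; a sketch against the alias from the question:
curl -XGET 'http://localhost:9200/restricted-index-alias/_search' -d '{
  "_source" : {
    "exclude" : [ "field_with_email" ]
  },
  "query" : { "match_all" : {} }
}'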

Jenkins API: Get a list of jobs filtered by build parameter - What jobs have built this Git commit?

We are sending different parameters to our Jenkins jobs, among them the Git commit SHA1. We want to get a list of jobs that used that parameter value (the Git SHA1 - which jobs ran this commit?).
The following URL will give us all builds:
http://jenkins.example.com/api/json?tree=jobs[name,builds[number,actions[parameters[name,value]]]]&pretty=true
It takes some time to render (6 seconds) and contains too many builds (5 MB of builds).
Sample output from that URL:
{
  "jobs" : [
    {
      "name" : "Job name - Build",
      "builds" : [
        {
          "actions" : [
            {
              "parameters" : [
                {
                  "name" : "GIT_COMMIT_PARAM",
                  "value" : "5447e2f43ea44eb4168d6b32e1a7487a3fdf237f"
                }
              ]
            },
            (...)
How can we use the Jenkins JSON API to list all jobs with a certain build parameter value?
I've also been looking for this, and luckily I found an awesome gist:
https://gist.github.com/justlaputa/5634984
To answer your question:
jenkins_url + /api/json?tree=jobs[name,color]
Using your example from above
http://jenkins.example.com/api/json?tree=jobs[name,color]
So it seems like all you need to do is remove the builds parameter from your original URL, and you should be fine.
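If you still need to match on the build parameter value, the JSON API has no server-side filter for it, so one option (a sketch, not from the original answers, assuming jq is available and the parameter name from the question) is to filter client-side:
curl -s -g 'http://jenkins.example.com/api/json?tree=jobs[name,builds[actions[parameters[name,value]]]]' \
| jq -r --arg sha '5447e2f43ea44eb4168d6b32e1a7487a3fdf237f' \
  '.jobs[]
   | select(.builds[]?.actions[]?.parameters[]?
            | select(.name == "GIT_COMMIT_PARAM" and .value == $sha))
   | .name' \
| sort -u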
How can we use the Jenkins JSON API to list all jobs with a certain build parameter value?
Not sure about the JSON API, but you can use the XML API and combine the tree and xpath parameters:
http://jenkins_url/api/xml?tree=jobs[name,builds[actions[parameters[name,value]]]]&xpath=/hudson/job[build/action/parameter[name="GIT_COMMIT_PARAM"][value="5447e2f43ea44eb4168d6b32e1a7487a3fdf237f"]]/name&wrapper=job_names&pretty=true
Result sample:
<job_names>
<name>JOB1</name>
<name>JOB2</name>
<name>JOB3</name>
...
</job_names>
Note: a job falls into this list if at least one of its builds was built with the desired parameter.
It looks like this isn't supported in the JSON API; however, if you can use the XML API, it is possible to query via XPath. See the sample below:
http://jenkins.example.com/api/xml?tree=jobs[name,builds[number,actions[parameters[name,value]]]]&exclude=hudson/job/build/action/parameter[value!=%275447e2f43ea44eb4168d6b32e1a7487a3fdf237f%27]
You may tune the query string to fit your needs.
Credit to http://blog.dahanne.net/2014/04/02/using-jenkins-hudson-remote-api-to-check-jobs-status/
Here's the query for passing jobs only:
http://jenkinsURL/job/ProjectFolderName/api/xml?tree=jobs[name,color=blue]
Here's the query for failing jobs only:
http://jenkinsURL/job/ProjectFolderName/api/xml?tree=jobs[name,color=yellow]

Updating existing documents in elasticsearch

Is it possible to add more fields to existing documents in Elasticsearch?
I indexed for instance the following document:
{
"user":"xyz",
"message":"for increase in fields"
}
Now I want to add one more field to it, i.e. date:
{
"user":"xyz",
"message":"for increase in fields",
"date":"2013-06-12"
}
How can this be done?
For Elasticsearch, check the update API:
The update API also supports passing a partial document (since 0.20), which will be merged into the existing document (simple recursive merge, inner merging of objects, replacing core “keys/values” and arrays).
Solr 4.0 also supports partial updates; check the linked documentation.
This can be done with a partial update (assuming the document has an ID of 1):
curl -XPOST 'http://localhost:9200/myindex/mytype/1/_update' -d '
{
  "doc" : {
    "date" : "2013-06-12"
  }
}'
Then query the document:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?q=user:xyz'
You should see something like:
"_id":"1",
"_source:
{
{
"user":"xyz",
"message":"for increase in fields",
"date":"2013-06-12"
}
}