elasticsearch upsert without document id

I query my index to find a document. If I find one, I know its _id value to update; otherwise I don't have an _id value.
Using the upsert below, I can update when I have the _id. If I don't have an _id, how can I have Elasticsearch provide one and insert a new document?
Purpose: I don't want to have two functions, one to create a new doc and another to update it...
curl -XPOST 'localhost:9200/test/type1/{value_of_id}/_update' -d '{
  "doc" : {
    "name" : "new_name"
  },
  "doc_as_upsert" : true
}'

Something like "update by query"?
See here:
https://github.com/elastic/elasticsearch/issues/2230
for the original issue/proposal, some experimental work toward an implementation, discussion about the pros and cons of including it, and a link to the plug-in that was developed to support the behavior.
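If all you need is for Elasticsearch to generate the _id itself, you can also sidestep _update for the insert case: POSTing to the type endpoint without an ID makes Elasticsearch auto-generate one. A minimal sketch against the same index/type as above:
curl -XPOST 'localhost:9200/test/type1/' -d '{
  "name" : "new_name"
}'
The response contains the generated _id, which you can keep for subsequent updates.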

Related

ElasticSearch API for user entered boolean/advanced search queries

In Kibana I'm able to enter queries with AND / OR / NOT / "..." etc., but all the examples I can find for the API (Python, NodeJS, or .NET) use the Elasticsearch JSON query format to build queries in code. I would like users to be able to enter 'hot AND soup' or '"hot soup" AND cabbage' etc., which is possible in Lucene and Kibana, but I cannot find how to do that via the API. I read all the SO entries and Elasticsearch docs I could find about the subject but still missed it.
Is it at all possible and if it is, where can I find examples of that? As I cannot find it, it might not be possible at all; I just want to make sure.
Whatever language you're programming in, you simply need to use the query_string query and pass the user input in there.
GET /_search
{
  "query": {
    "query_string": {
      "query": "\"hot soup\" AND cabbage",
      "default_field": "content"
    }
  }
}
Beware, though, that the query_string query is very sensitive to syntax, so your users might not always enter correct queries. To alleviate that, a more permissive option is the simple_query_string query, shown below.
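A minimal sketch of the same search with simple_query_string (it uses + for AND, | for OR, and - for NOT, takes a fields array, and silently ignores invalid syntax instead of raising an error):
GET /_search
{
  "query": {
    "simple_query_string": {
      "query": "\"hot soup\" + cabbage",
      "fields": ["content"]
    }
  }
}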

How to do an automated index creation at ElasticSearch?

Just like WordPress? See http://gibrown.wordpress.com/2014/02/06/scaling-elasticsearch-part-2-indexing/ where they say: "In our case we create one index for every 10 million blogs, with 25 shards per index."
Any light?
Thanks!
You do it in whatever your favorite scripting language is. First run a query to count the number of documents in the index; if it's beyond a certain amount, create a new index, either via an Elasticsearch client API or curl.
Here's the query to find the number of docs:
curl -XGET 'http://localhost:9200/youroldindex/_count'
Here's the index creation curl:
curl -XPUT 'http://localhost:9200/yournewindex/' -d '{
  "settings" : {
    "number_of_shards" : 25,
    "number_of_replicas" : 2
  }
}'
You will also probably want to create aliases so that your code can always point to a single index alias and then change the alias as you change your hot index:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
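A sketch of the alias flip, assuming an alias named hotindex (a made-up name) that your code always writes to; both actions are applied atomically in one request:
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    { "remove" : { "index" : "youroldindex", "alias" : "hotindex" } },
    { "add" : { "index" : "yournewindex", "alias" : "hotindex" } }
  ]
}'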
You will probably want to predefine your mappings too:
curl -XPUT 'http://localhost:9200/yournewindex/document/_mapping' -d '
{
  "document" : {
    "properties" : {
      "message" : { "type" : "string", "store" : true }
    }
  }
}
'
Elasticsearch has fairly complete documentation, a few good places to look:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html

Updating existing documents in elasticsearch

Is it possible to add more fields to existing documents in Elasticsearch?
I indexed for instance the following document:
{
  "user" : "xyz",
  "message" : "for increase in fields"
}
Now I want to add one more field to it, i.e. date:
{
  "user" : "xyz",
  "message" : "for increase in fields",
  "date" : "2013-06-12"
}
How can this be done?
For Elasticsearch, check the update API:
"The update API also supports passing a partial document (since 0.20), which will be merged into the existing document (simple recursive merge, inner merging of objects, replacing core "keys/values" and arrays)."
Solr 4.0 also supports partial updates; check the link.
This can be done with a partial update (assuming the document has an ID of 1):
curl -XPOST 'http://localhost:9200/myindex/mytype/1/_update' -d '
{
  "doc" : {
    "date" : "2013-06-12"
  }
}'
Then query the document:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?q=user:xyz'
You should see something like:
"_id":"1",
"_source:
{
{
"user":"xyz",
"message":"for increase in fields",
"date":"2013-06-12"
}
}

Delete a field and its contents in all the records and recreate it with new mapping

I have a field field10 which got created by accident when I updated a particular record in my index. I want to remove this field and all its contents from the index, and recreate it with the mapping below:
"mytype":{
"properties":{
"field10":{
"type":"string",
"index":"not_analyzed",
"include_in_all":"false",
"null_value":"null"
}
}
}
When I try to create this mapping using the Put Mapping API, I get an error: {"error":"MergeMappingException[Merge failed with failures {[mapper [field10] has different index values, mapper [field10] has different index_analyzer, mapper [field10] has different search_analyzer]}]","status":400}.
How do I change the mapping of this field? I don't want to reindex millions of records just for this small accident.
Thanks
AFAIK, you can't remove a single field and recreate it.
Nor can you just modify a mapping and have everything reindexed automagically. Imagine that you don't store _source: how could Elasticsearch know what your data looked like before it was indexed?
But you can probably modify your mapping using a multifield, with field10.field10 keeping the old mapping and field10.new using the new analyzer.
If you don't reindex, only new documents will have content in field10.new.
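A sketch of that multifield mapping using the legacy multi_field type (myindex is a placeholder for your index name):
curl -XPUT 'http://localhost:9200/myindex/mytype/_mapping' -d '{
  "mytype" : {
    "properties" : {
      "field10" : {
        "type" : "multi_field",
        "fields" : {
          "field10" : { "type" : "string" },
          "new" : {
            "type" : "string",
            "index" : "not_analyzed",
            "include_in_all" : false,
            "null_value" : "null"
          }
        }
      }
    }
  }
}'
The field10 sub-field keeps the existing (analyzed) mapping so the merge is accepted, and field10.new gets the mapping you wanted.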
If you want to manage old documents, you have to:
Send all your docs again (it will update everything) - aka reindex (you can use the scan & scroll API to get your old documents; see the sketch below)
Try to update your docs with the Update API
You can probably try to run a query like:
curl -XPOST 'localhost:9200/crunchbase/person/1/_update' -d '{
  "script" : "ctx._source.field10 = ctx._source.field10"
}'
But, as you can see, you have to run it document by document and I think it will take more time than reindexing all with the Bulk API.
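For the full-reindex route, the legacy scan & scroll flow looks roughly like this (index/type names are placeholders):
curl -XGET 'localhost:9200/myindex/mytype/_search?search_type=scan&scroll=1m&size=100' -d '{
  "query" : { "match_all" : {} }
}'
The response contains a _scroll_id; pass it back to pull each batch, then bulk-index the batch under the new mapping:
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d '<scroll_id_from_previous_response>'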
Does it help?

What are the resources or tools used to manage temporal data in key-value stores?

I'm considering using MongoDB or CouchDB on a project that needs to maintain historical records. But I'm not sure how difficult it will be to store historical data in these databases.
For example, in his book "Developing Time-Oriented Database Applications in SQL," Richard Snodgrass points out tools for retrieving the state of data as of a particular instant, and he points out how to create schemas that allow for robust data manipulation (i.e. data manipulation that makes invalid data entry difficult).
Are there tools or libraries out there that make it easier to query, manipulate, or define temporal/historical structures for key-value stores?
edit:
Note that from what I hear, the 'version' data that CouchDB stores is erased during normal use, and since I would need to maintain historical data, I don't think that's a viable solution.
P.S. Here's a similar question that was never answered: key-value-store-for-time-series-data
There are a couple of options if you want to store the data in MongoDB. You could store each version as a separate document; then you can query to get the object at a certain time, the object at all times, objects over ranges of time, etc. Each document would look something like:
{
  object : whatever,
  date : new Date()
}
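With that layout, getting the state of an object as of a given instant is just a sort-and-limit, assuming you also store some object_id to tie the versions together (the collection and field names here are made up):
// latest version at or before a given date
db.versions.find({ object_id : objId, date : { $lte : asOf } }).sort({ date : -1 }).limit(1)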
You could store all the versions of a document in the document itself, as mikeal suggested, using updates to push the object itself into a history array. In Mongo, this would look like:
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
// make changes to obj
...
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
A cooler (I think) and more space-efficient way, although less time-efficient, might be to store in the object itself a history of what changed at each time. Then you could replay the history to build the object as of a certain time. For instance, you could have:
{
  object : startingObj,
  history : [
    { date : d1, addField : { x : 3 } },
    { date : d2, changeField : { z : 7 } },
    { date : d3, removeField : "x" },
    ...
  ]
}
Then, if you wanted to see what the object looked like between times d2 and d3, you could take startingObj, add the field x with the value 3, set the field z to 7, and that would be the object at that time.
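A sketch of that replay logic in JavaScript (the helper name is made up):
// rebuild the object as of a given date by applying history entries in order
function replayTo(doc, asOf) {
  var obj = {}, k;
  for (k in doc.object) obj[k] = doc.object[k]; // start from startingObj
  doc.history.forEach(function (entry) {
    if (entry.date > asOf) return; // skip changes made after the target date
    if (entry.addField) for (k in entry.addField) obj[k] = entry.addField[k];
    if (entry.changeField) for (k in entry.changeField) obj[k] = entry.changeField[k];
    if (entry.removeField) delete obj[entry.removeField];
  });
  return obj;
}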
Whenever the object changed, you could atomically push actions to the history array:
db.foo.update({object : startingObj}, {$push : {history : {date : new Date(), removeField : "x"}}})
Yes, in CouchDB the revisions of a document are there for replication and are usually lost during compaction. I think UbuntuOne did something to keep them around longer but I'm not sure exactly what they did.
I have a document that I need the historical data on and this is what I do.
In CouchDB I have an _update function. The document has a "history" attribute, which is an array. Each time I call the _update function to update the document, I append the current document (minus the history attribute) to the history array, then update the document with the changes in the request body. This way I have the entire revision history of the document.
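A rough sketch of such an _update handler, assuming the new version of the document is sent as the JSON request body:
function (doc, req) {
  var newDoc = JSON.parse(req.body);
  var snapshot = {}, k;
  for (k in doc) {
    // snapshot the current doc minus bookkeeping fields
    if (k !== '_id' && k !== '_rev' && k !== 'history') snapshot[k] = doc[k];
  }
  newDoc._id = doc._id;
  newDoc._rev = doc._rev;
  newDoc.history = (doc.history || []).concat([snapshot]);
  return [newDoc, 'updated'];
}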
This is a little heavy for large documents; there are some JavaScript diff tools I was investigating, and I was thinking about storing only the diff between documents, but I haven't done it yet.
http://wiki.apache.org/couchdb/How_to_intercept_document_updates_and_perform_additional_server-side_processing
Hope that helps.
I can't speak for MongoDB, but for CouchDB it all really hinges on how you write your views.
I don't know the specifics of what you need but if you have a unique id for a document throughout its lifetime and store a timestamp in that document then you have everything you need for robust querying of that document.
For instance:
document structure:
{ "docid" : "doc1", "ts" : <unix epoch> ...<set of key value pairs> }
map function:
function (doc) {
  if (doc.docid && doc.ts)
    emit([doc.docid, doc.ts], doc);
}
The view will now output each doc and its revisions in historical order, with keys like so:
["doc1", 1234567], ["doc1", 1234568], ["doc2", 1234567], ["doc2", 1234568]
You can use view collation and start_key or end_key to restrict the returned documents.
start_key=["doc1", 1] end_key=["doc1", 9999999999999]
will return all historical copies of doc1
start_key=["doc2", 1234567] end_key=["doc2", 123456715]
will return all historical copies of doc2 between 1234567 and 123456715 unix epoch times.
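Over HTTP that might look like the following (the design doc and view names are made up, and the JSON key arrays need URL-encoding):
curl 'http://localhost:5984/mydb/_design/history/_view/by_doc_ts?startkey=%5B%22doc1%22,1%5D&endkey=%5B%22doc1%22,9999999999999%5D'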
see ViewCollation for more details