Elasticsearch bulk/batch indexing with python requests module - indexing

I have a smallish (~50,000) array of JSON dictionaries that I want to store/index in ES. My preference is to use Python, since the data I want to index comes from a CSV file, loaded and converted to JSON via Python. Alternatively, I would like to skip the step of converting to JSON and simply use the array of Python dictionaries I have. Anyway, a quick search revealed the bulk indexing functionality of ES. I want to do something like this:
post_url = 'http://localhost:9202/_bulk'
requests.post(post_url, data=acc)  # acc is a Python list of dictionaries
or
post_url = 'http://localhost:9202/_bulk'
requests.post(post_url, params=acc)  # acc is a Python list of dictionaries
Both requests give an HTTP 500 error.

My understanding is that you have to have one "command" per line (index, create, delete...), and some of them (like index) take a row of data on the next line, like so:
{'index': ''}\n
{'your': 'data'}\n
{'index': ''}\n
{'other': 'data'}\n
NB the new-lines, even on the last row.
Empty index objects like the above work if you POST to ../index/type/_bulk; otherwise I think you need to specify the index and type in each action line (I have not tried that).
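To tie this back to the question, here is a rough sketch (untested; the index and type names are placeholders you would replace) of how the acc list of dictionaries could be turned into that newline-delimited body and POSTed with requests:

import json
import requests

def bulk_index(docs, url='http://localhost:9202/my-index/my-type/_bulk'):
    # One action line per document, followed by the document itself;
    # every line, including the last, must end with a newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps({'index': {}}))
        lines.append(json.dumps(doc))
    body = '\n'.join(lines) + '\n'
    # Older ES versions also accept 'application/json' for the bulk endpoint
    return requests.post(url, data=body,
                         headers={'Content-Type': 'application/x-ndjson'})

# response = bulk_index(acc)
# print(response.status_code, response.json())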

The following function will do it:
import requests

def post_request(data, endpoint='http://localhost:9200/_bulk'):
    # data must be the newline-delimited bulk body as a single string
    response = requests.post(endpoint, data=data,
                             headers={'content-type': 'application/json', 'charset': 'UTF-8'})
    return response
As data, you need to pass a string such as:
{ "index" : { "_index" : "test-index", "_type" : "_doc", "_id" : "1681", "routing" : 0 }}
{ "field1" : ... , ..., "fieldN" : ... }
{ "index" : { "_index" : "test-index", "_type" : "_doc", "_id" : "1684", "routing" : 1 }}
{ "field1" : ... , ..., "fieldN" : ... }
Make sure you add a "\n" at the end of each line.
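For example (a sketch; the index name, document id and field values are just illustrative), you can build that string with json.dumps and pass it to the function above:

import json

action = {"index": {"_index": "test-index", "_type": "_doc", "_id": "1681"}}
source = {"field1": "value1", "fieldN": "valueN"}

# One action line and one source line, each terminated with "\n"
data = json.dumps(action) + "\n" + json.dumps(source) + "\n"
response = post_request(data)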

I don't know much about Python, but did you look at Pyes?
Bulk is supported in Pyes.

Related

Django rest framework: Is there a way to clean data before validating it with a serializer?

I've got an API endpoint POST /data.
The received data is formatted in a certain way which is different from the way I store it in the db.
I'll use geometry type from postgis as an example.
class MyPostgisModel(models.Model):
    ...
    position = models.PointField(null=True)
    my_charfield = models.CharField(max_length=10)
    ...
    errors = JSONField()  # Used to save the cleaning and validation errors
class MyPostgisSerializer(serializers.ModelSerializer):
    class Meta:
        model = MyPostgisModel
        fields = [
            ...
            "position",
            ...
            "my_charfield",
            "errors",
        ]

    def to_internal_value(self, data):
        ...
        # Here the data is coming in the field geometry but in the db, it's called
        # position. Moreover I need to apply the `GEOSGeometry(json.dumps(...))`
        # method as well.
        data["position"] = GEOSGeometry(json.dumps(data["geometry"]))
        return data
The problem is that there is not only one field like position but many, and I would like (maybe wrongly) to follow the validate_*field_name* scheme, but for cleaning (clean_*field_name*).
There is another problem. In this scheme, I would like to still save the rest of the data in the database even if some fields raise a ValidationError (e.g. a CharField that is too long), as long as they are not part of the primary key or a unique_together constraint, and save the related errors into a JSONField like this:
{
    "cleaning_errors": {
        ...
        "position": 'Invalid format: {
            "type": "NotAValidType",  # Should be "Point"
            "coordinates": [
                4.22,
                50.67
            ]
        }'
        ...
    },
    "validating_errors": {
        ...
        "my_charfield": "data was too long: 'this data is way too long for 10 characters'",
        ...
    }
}
For the first problem, I thought of doing something like this:
class BaseSerializerCleanerMixin:
    """Abstract mixin that cleans fields."""

    def __init__(self, *args, **kwargs):
        """Initialize the cleaner strategy."""
        # This is the error dict to be filled by the `clean_*field_name*` methods
        self.cleaning_error_dict = {}
        super().__init__(*args, **kwargs)

    def clean_fields(self, data):
        """Clean the fields listed in self.Meta.fields before validating them."""
        cleaned_data = {}
        for field_name in getattr(self.Meta, "fields", []):
            cleaned_field = (
                getattr(self, "clean_" + field_name)(data)
                if hasattr(self, "clean_" + field_name)
                else data.get(field_name)
            )
            if cleaned_field is not None:
                cleaned_data[field_name] = cleaned_field
        return cleaned_data

    def to_internal_value(self, data):
        """Reformat data to put it in the database."""
        cleaned_data = self.clean_fields(data)
        return super().to_internal_value(cleaned_data)
I'm not sure that's a good idea and maybe there is an easy way to deal with such things.
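To make the intent concrete, a serializer using that mixin could hypothetically define per-field cleaners like this (the field names and the GEOSGeometry usage are taken from the example above; collecting the error instead of raising is just one possible choice):

import json
from django.contrib.gis.geos import GEOSGeometry, GEOSException
from rest_framework import serializers

# BaseSerializerCleanerMixin and MyPostgisModel are the classes defined above
class MyPostgisSerializer(BaseSerializerCleanerMixin, serializers.ModelSerializer):
    class Meta:
        model = MyPostgisModel
        fields = ["position", "my_charfield", "errors"]

    def clean_position(self, data):
        """Turn the incoming "geometry" key into a value the PointField accepts."""
        try:
            return GEOSGeometry(json.dumps(data["geometry"]))
        except (KeyError, ValueError, GEOSException) as exc:
            # Record the problem instead of failing the whole payload
            self.cleaning_error_dict["position"] = str(exc)
            return None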
For the second problem, catching the validation errors while still having is_valid() return True when no primary key field is wrongly formatted, I'm not sure how to proceed.

I am trying to read a json payload from a file and set the values from a CSV file using scenario outline and examples

I have a sample json file as below:
{
  book : {
    bookId : '<bookId>',
    bookName : '<bookName>'
  },
  staff : {
    sfattid : '<sfattid>',
    name : '<name>'
  },
  libraryMember : {
    libMembId : '<libMembId>',
    name : '<libraryMember>'
  }
}
I have a csv file with the below information.
I want to set the values for each variable from the csv file and send the REST request 3 times at run time.
Feature: scenario outline using a dynamic table
  from a csv file

  Scenario Outline: staffname name: <name>
    # When json payload = {book : {bookId : '<bookId>' , bookName : '<bookName>',},staff : {sfattid : '<sfattid>', name : '<name>'},libraryMember : { libMembId : '<libMembId>' ,name : '<libraryMember>'}}
    When json payload = read("request.json")
    Given url 'http://localhost:8089/'
    And path 'returnBook'
    And request payload
    When method post
    Then status 200
    Then match karate.jsonPath(response,"$.status") == '<status>'

    Examples:
      | read('bookreturn.csv') |
I have written the code below, which works perfectly, but in that case the same json payload is present in the feature file, which I want to keep in a separate text file. Please suggest some code.
Feature: scenario outline using a dynamic table
  from a csv file

  Scenario Outline: staffname name: <name>
    When json payload = {book : {bookId : '<bookId>' , bookName : '<bookName>'},staff : {sfattid : '<sfattid>', name : '<name>'},libraryMember : { libMembId : '<libMembId>' ,name : '<libraryMember>'}}
    Given url 'http://localhost:8089/'
    And path 'returnBook'
    And request payload
    When method post
    Then status 200
    Then match karate.jsonPath(response,"$.status") == '<status>'

    Examples:
      | read('bookreturn.csv') |
Sorry, you can't optimize this any further, because for <name> to work it has to be within the feature file itself. Personally I think you are unnecessarily trying to over-engineer your tests. There is nothing wrong with what you have already.
If you really insist - here is the alternative, refer: https://github.com/intuit/karate#data-driven-features
* def books = read('bookreturn.csv')
* def result = call read('called.feature') books
But you will need to use 2 feature files. Each book in the loop can be used in embedded expressions. So you can read from a JSON file, and any embedded expressions in the file will work.
Just stick to what you have, seriously!

Adding new key-value pair into json using karate

My payload looks like this:
{
  "override_source": "DS",
  "property_code": "0078099",
  "stay_date": "2018-11-26T00:00:00.000000",
  "sku_prices": [
  ],
  "persistent_override": false
}
There is an array dblist ["2","3"]; it can consist of numbers from 1 to 4. Based on the elements present in the list, I want to add key-values like {"sku_price":"1500","sku_code":"2"} to my payload. I am using the following code:
* eval if(contains("3",dblist)) karate.set('pushRatesFromDS.sku_prices[]','{ "sku_price": "1500","sku_code":"3" }')
When I execute my feature file, I do not get any errors, but the key-values are not added to my payload. However, if I move this code to a new feature file and call it, the key-value pairs do get added to my payload. The code in my new feature file looks like: * set pushRatesFromDS.sku_prices[] = { "sku_price": "1500", "sku_code": "2" }
Try this:
* def foo =
"""
{
  "override_source": "DS",
  "property_code": "0078099",
  "stay_date": "2018-11-26T00:00:00.000000",
  "sku_prices": [
  ],
  "persistent_override": false
}
"""
* eval karate.set('foo', '$.sku_prices[]', { foo: 'bar' })

Pentaho SQL to MongoDb - Array Issue

I need to update elements in an array. When I run the transformation for the first time, the PROD array receives the right number of elements. But if I run it again, the array receives the same elements a second time.
Example:
The first time, I get the document below, and it is correct:
{
    "_id" : ObjectId("58e2c81f781a75592f69f8a5"),
    "DDATA_ORC" : ISODate("2016-08-02T03:00:00.000Z"),
    "SNUMORC" : "113239",
    "PROD" : [
        {
            "SPRODUTO" : "TONER HP CE411A CIANO (305A)"
        }
    ]
}
But if I run the transformation again, the PROD array will be updated with the same SPRODUTO:
{
    "_id" : ObjectId("58e2c81f781a75592f69f8a5"),
    "DDATA_ORC" : ISODate("2016-08-02T03:00:00.000Z"),
    "SNUMORC" : "113239",
    "PROD" : [
        {
            "SPRODUTO" : "TONER HP CE411A CIANO (305A)"
        },
        {
            "SPRODUTO" : "TONER HP CE411A CIANO (305A)"
        }
    ]
}
This is a problem because I will get wrong results for queries.
These are my plugin configurations:
Options Tab and Document Path tab
I need to update the array only if it receives or loses an item.
Thanks in advance
I solved this issue.
If anyone has this problem, the solution is to create two "MongoDB Output" steps. In the first output, you need to set the array (the array will be recreated every time the update query runs successfully). I did it using a dummy field.
First Output Document Fields
In the second "MongoDB Output" step, you need to execute a push to populate the array.
Second Output Document Fields
In the "Output Options" tab, you have to set Update, Upsert and "Modifier Update".

updating a value in an array in mongodb from java

I have a couple of documents in mongodb as follows:
{
    "_id" : ObjectId("54901212f315dce7077204af"),
    "Date" : ISODate("2014-10-20T04:00:00.000Z"),
    "Type" : "Twitter",
    "Entities" : [
        {
            "ID" : 4,
            "Name" : "test1",
            "Sentiment" : {
                "Value" : 20,
                "Neutral" : 1
            }
        },
        {
            "ID" : 5,
            "Name" : "test5",
            "Sentiment" : {
                "Value" : 10,
                "Neutral" : 1
            }
        }
    ]
}
Now I want to update the document that has Entities.ID=4 by setting Sentiment.Value to (Sentiment.Value+4)/2; for example, in the document above we would have 12 after the update.
I wrote the following code but I am stuck in the if statement as you can see:
DBCollection collectionG;
collectionG = db.getCollection("GraphDataCollection");

int entityID = 4;
String entityName = "test";

BasicDBObject queryingObject = new BasicDBObject();
queryingObject.put("Entities.ID", entityID);
DBCursor cursor = collectionG.find(queryingObject);

if (cursor.hasNext())
{
    BasicDBObject existingDocument = new BasicDBObject("Entities.ID", entityID);
    // not sure how to update the Sentiment.Value for entityID = 4
}
First I thought I should unwind the Entities array to get the sentiment value, but if I do that, how can I wind it back up again and update the document in the same format it has now, just with the new sentiment value?
I also found this link:
MongoDB - Update objects in a document's array (nested updating)
but I could not understand it since it is not written as a Java query.
Can anyone explain how I can do this in Java?
You need to do this in two steps:
Get all the _id values of the records which contain an Entity with ID 4.
During the find, project only the Entity sub-document that matched the query, so that we can process it and consume only its Sentiment.Value. Use the positional operator ($) for this purpose.
Instead of hitting the database every time to update each matched record, use the Bulk API to queue up the updates and execute them at the end.
Create the Bulk operation Writer:
BulkWriteOperation bulk = col.initializeUnorderedBulkOperation();
Find all the records which contain the value 4 in their Entities.ID field. When you match documents against this query, you get the whole document returned. But we do not want the whole document; we would like to have only the document's _id, so that we can update the same document using it, and the Entity element in the document whose ID is 4. There may be n other Entity elements, but they do not matter. So to get only the Entity element that matches the query, we use the positional operator $.
DBObject find = new BasicDBObject("Entities.ID",4);
DBObject project = new BasicDBObject("Entities.$",1);
DBCursor cursor = col.find(find, project);
What the above code would return is, for example, the document below (since our example assumes only a single input document). If you notice, it contains only the one Entity element that matched our query.
{
    "_id" : ObjectId("54901212f315dce7077204af"),
    "Entities" : [
        {
            "ID" : 4,
            "Name" : "test1",
            "Sentiment" : {
                "Value" : 20,
                "Neutral" : 1
            }
        }
    ]
}
Iterate each record to queue up for update:
while (cursor.hasNext()) {
    BasicDBObject doc = (BasicDBObject) cursor.next();
    int curVal = ((BasicDBObject)
            ((BasicDBObject) ((BasicDBList) doc.get("Entities"))
                    .get(0)).get("Sentiment")).getInt("Value");
    int updatedValue = (curVal + 4) / 2;
    DBObject query = new BasicDBObject("_id", doc.get("_id"))
            .append("Entities.ID", 4);
    DBObject update = new BasicDBObject("$set",
            new BasicDBObject("Entities.$.Sentiment.Value",
                    updatedValue));
    bulk.find(query).update(update);
}
Finally Update:
bulk.execute();
You need to do a find() and an update(), and not simply an update, because currently MongoDB does not allow you to reference a document field in order to retrieve its value, modify it, and update it with a computed value in a single update query.
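If you are doing the same thing from Python rather than Java, here is a rough pymongo sketch of the same find-project-then-bulk-update approach (the collection name comes from the question; the connection and database name are made up):

from pymongo import MongoClient, UpdateOne

# Hypothetical connection; adjust to your environment
col = MongoClient("mongodb://localhost:27017")["mydb"]["GraphDataCollection"]

requests = []
# Project only the matching Entities element with the positional operator
for doc in col.find({"Entities.ID": 4}, {"Entities.$": 1}):
    cur_val = doc["Entities"][0]["Sentiment"]["Value"]
    new_val = (cur_val + 4) // 2
    requests.append(UpdateOne(
        {"_id": doc["_id"], "Entities.ID": 4},
        {"$set": {"Entities.$.Sentiment.Value": new_val}},
    ))

if requests:
    col.bulk_write(requests, ordered=False)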