Bulk insert Elasticsearch documents except one field if the document already exists

I need to bulk insert documents if they don't exist, but override the whole document except one field if the document already exists.
For example, if the database contains this document:
{
  "_id": 234,
  "text": "hello",
  "reach": 20
}
and I update using this document:
{
  "_id": 234,
  "text": "hawdy",
  "reach": 24
}
then the document in the database should contain:
{
  "_id": 234,
  "text": "hawdy",
  "reach": 20
}
If the document didn't exist in the database, the second document should be used.

You can use the upsert option of the bulk update API for this.
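For example, a minimal sketch of such a bulk request (the index name my-index is hypothetical). The script copies every field from the incoming document except reach, and the upsert body is only used when the document does not exist yet:
POST /_bulk
{ "update": { "_index": "my-index", "_id": "234" } }
{ "script": { "lang": "painless", "source": "for (def e : params.doc.entrySet()) { if (e.getKey() != 'reach') { ctx._source[e.getKey()] = e.getValue(); } }", "params": { "doc": { "text": "hawdy", "reach": 24 } } }, "upsert": { "text": "hawdy", "reach": 24 } }
If _id 234 already exists, only text is overwritten and the stored reach (20) is kept; if it doesn't exist, the upsert document is indexed as-is.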


How to convert JSON to CSV and send it to BigQuery or a Google Cloud bucket

I'm new to NiFi and I want to convert a big amount of JSON data to CSV format.
This is what I am doing at the moment, but it is not giving the expected result.
These are the steps:
Processes to create an access_token and send the request body using InvokeHTTP (this part works fine, I won't name the processes since it gives the expected result), getting the response body in JSON.
Example of the JSON response:
[
{
"results":[
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdaasdasdad",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdasda"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdasdasdasd",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdas"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
....etc....
]
}
]
Now I am using:
===>SplitJson ($[].results[])==>JoltTransformJSON with this spec:
[{
"operation": "shift",
"spec": {
"customer": {
"id": "customer_id"
},
"campaign": {
"id": "campaign_id",
"name": "campaign_name"
},
"adGroup": {
"id": "ad_group_id",
"name": "ad_group_name"
},
"metrics": {
"clicks": "clicks",
"costMicros": "cost",
"impressions": "impressions"
},
"segments": {
"device": "device",
"date": "date"
},
"incomeRangeView": {
"resourceName": "keywords_id"
}
}
}]
==>> MergeContent (here is the problem which I don't know how to fix)
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get a JSON file that contains only parts of the JSON data.
Example: I have 50k customer_ids in 1 JSON file, so I would like to send this data into a BigQuery table and have all the ids under the same field "customer_id".
MergeContent takes the split JSON files and combines them, but I still get 10k customer_ids per file, i.e. I end up with 5 files and not 1 file with 50k customer_ids.
After the MergeContent I use ==>>ConvertRecord with these settings:
Record Reader JsonTreeReader (Schema Access Strategy: InferSchema)
Record Writer CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set UTF-8
)
==>> UpdateAttribute (custom property filename: ${filename}.csv) ==>> PutGCSObject (and put the data into the Google bucket; this step works fine, I am able to put files there).
With this approach I am UNABLE to send data to BigQuery. After MergeContent I tried using PutBigQueryBatch and used this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed, and for Load file type I tried NEWLINE_DELIMITED_JSON, or CSV after converting to CSV. I am not getting errors, but no data is uploaded into the table.
What am I doing wrong? I basically want to map the data in such a way that each field's data ends up under the same field name.
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This avoids a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
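For illustration, a rough sketch of how the JoltTransformRecord processor could be configured, reusing the reader/writer settings and the Jolt spec you already have (exact property names may differ slightly between NiFi versions):
JoltTransformRecord
  Record Reader: JsonTreeReader (Schema Access Strategy: Infer Schema)
  Record Writer: CSVRecordWriter (Schema Write Strategy: Do Not Write Schema, Include Header Line: true, Character Set: UTF-8)
  Jolt Transformation DSL: Chain
  Jolt Specification: the same shift spec posted above
Each flowfile is then transformed record by record and written out as CSV in one step, with no splitting or merging.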

GCP Dataflow job REST response: add displayData object with { "key": "datasetName", ... }

Why doesn't this line of code generate a displayData object with { "key": "datasetName", ... }, and how can I generate it if it doesn't come by default when using the BigQuery source from Apache Beam?
bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(project=project,query=get_java_query))
[UPDATE] Adding the result that I am trying to produce:
"displayData": [
{
"key": "table",
"namespace": "....",
"strValue": "..."
},
{
"key": "datasetName",
"strValue": "..."
}
]
From reading the implementation of display_data() for a BigQuerySource in the most recent version of Beam, it does not extract the table and dataset from a query, which is what your example uses. More significantly, it does not create any field specifically named datasetName.
I would recommend writing a subclass of _BigQuerySource which adds the fields you need to the display data, while preserving all the other behavior.
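For example, a rough, untested sketch of what that subclass could look like (note that _BigQuerySource is a private class whose name and location can change between Beam versions, and BigQuerySourceWithDataset plus its dataset_name argument are made up for illustration):
from apache_beam.io.gcp.bigquery import _BigQuerySource
from apache_beam.transforms.display import DisplayDataItem

class BigQuerySourceWithDataset(_BigQuerySource):
    """BigQuery source that also reports 'datasetName' in its display data."""

    def __init__(self, dataset_name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._dataset_name = dataset_name

    def display_data(self):
        # Keep everything the parent already reports, then add the extra key.
        result = super().display_data()
        result['datasetName'] = DisplayDataItem(self._dataset_name,
                                                label='Dataset Name')
        return result

# Used the same way as the original read:
# bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(
#     BigQuerySourceWithDataset(dataset_name='my_dataset',
#                               project=project, query=get_java_query))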

Query for entire JSON document in nested JSON schema

Background:
I wish to locate the entire JSON document that satisfies the condition "state" = "new" and length(Features.id) > 4:
{
  "id": "123",
  "feedback": {
    "Features": [
      {
        "state": "new",
        "id": "12345"
      }
    ]
  }
}
This is what I have tried to do:
Since this is a nested document, my query looks like the one below. A Stack Overflow member has helped me to access the nested contents within the query, but is there a way to obtain the full document?
I have used:
SELECT VALUE t.id FROM t IN f.feedback.Features where t.state = 'new' and length(t.id)>4
This will give me the ids.
My desire is to have access to the full document when this condition is met:
{
  "id": "123",
  "feedback": {
    "Features": [
      {
        "state": "new",
        "id": "12345"
      }
    ]
  }
}
Any help is appreciated
Try this
SELECT *
FROM f
WHERE
f.feedback.Features[0].state = 'new'
AND length(f.feedback.Features[0].id)>4
Here is the SELECT spec for Cosmos DB, for more details:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-select
Also, check out "working with JSON" in the Cosmos DB docs:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-working-with-json
If the Features array has more than one element, you can use the EXISTS clause to search within them. See the EXISTS spec here, with examples:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-subquery#exists-expression
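For instance, a query along these lines (a sketch based on the docs above) returns the whole document whenever any element of Features matches:
SELECT VALUE f
FROM f
WHERE EXISTS (
    SELECT VALUE t
    FROM t IN f.feedback.Features
    WHERE t.state = 'new' AND LENGTH(t.id) > 4
)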

Mule Anypoint Studio: efficiently insert a large number of items from a JSON array into a DB

I have a JSON like the one below as the body of a POST request.
{
"summary": {
"transactionId": "5003k00000zSuNaAAK",
"transactionNumber": "T12345",
"overall": 100,
"date": "15/05/2020",
"details": [
{
"transactionDetailId": "CC12345",
"product_code": 223242234,
"price": 1500,
"amount": 1000
},
{
"transactionDetailId": "DD12345",
"product_code": 679685675,
"price": 1100,
"amount": 90
},
{
"transactionDetailId": "SS12345",
"product_code": 345346643,
"price": 2000,
"amount": 300
},
.......other 100 items
]
}
}
In my Anypoint Studio project, using a For Each module to loop over details[] and a Bulk Insert, I'm able to execute an INSERT and write all the items of the details array into my Postgres DB.
So, for each item an INSERT is executed.
Is there a more efficient way to perform this operation, considering arrays with more than 1000 items?
A better way would be to extract details[] as the payload and then do the bulk insert based on that array. No For Each is involved and it works much faster. Streaming is also used in this case, so the memory demand is much lower.
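For illustration, a rough Mule 4 sketch of that idea (the table and column names are made up, and the exact elements may vary with your Database connector version):
<!-- Make the details array the payload -->
<ee:transform doc:name="details as payload">
    <ee:message>
        <ee:set-payload><![CDATA[%dw 2.0
output application/java
---
payload.summary.details]]></ee:set-payload>
    </ee:message>
</ee:transform>

<!-- One bulk insert for the whole array instead of one INSERT per item -->
<db:bulk-insert doc:name="Bulk insert details" config-ref="Database_Config">
    <db:bulk-input-parameters><![CDATA[#[payload]]]></db:bulk-input-parameters>
    <db:sql>INSERT INTO transaction_details (transaction_detail_id, product_code, price, amount)
            VALUES (:transactionDetailId, :product_code, :price, :amount)</db:sql>
</db:bulk-insert>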

How should I index this schema in Elasticsearch

I am a bit lost on how to index these documents in Elasticsearch.
Document 1
{
text: ['chicken']
}
Document 2
{
text: ['chicken'], [['broth', 'stock']]
}
I need to be able to query these using either 'chicken flavored stock' or 'chicken flavored broth' and it should return both documents with the same score, since all of their terms have been matched in the input query. It also shouldn't return doc 2 with only 'chicken' as query.
Basically, I want to know that all the terms in 'text' field have been found somewhere in the query, and the internal array (ie: 'broth' and 'stock' acts like an OR clause).
Is this even possible?
Update:
I did find a (cumbersome) way of doing it. I index the documents by combining their fields into phrases (e.g. ['chicken broth', 'chicken stock'] for doc 2). Then I search using every combination of the input as a phrase (e.g. ['chicken', 'chicken flavored', 'chicken flavored broth', 'chicken broth', ...]).
This solution does give me the results I want, but I can't help feeling this is a common case that could be handled much more elegantly. It feels like ngrams are on the path to my answer, but I can't quite work it out.
When you index documents without adding a custom mapping, Elasticsearch uses the standard analyzer by default.
You could remove the arrays from the text fields and index your documents as:
Document 1
{
"text": "chicken"
}
Document 2
{
"text": "chicken broth stock"
}
The standard analyzer will create the following tokens in the Lucene index:
Document 1
"chicken"
Document 2
"chicken", "broth", "stock"
Your documents match the search terms as follows:
chicken: the term 'chicken' matches in both documents; because the text field is shorter in Document 1, it scores higher than Document 2.
chicken flavored: the term 'chicken' matches in both documents, but there is no match for the term 'flavored'. Again, as the text field is shorter in Document 1, it scores higher than Document 2.
chicken flavored broth: the term 'chicken' matches in both documents, and the term 'broth' also matches in Document 2. There is no match on the term 'flavored' in either of the documents. Document 2 scores higher than Document 1 as it matches two of the terms in the query.
I don't really see a use case for ngrams as the above does what you want.
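For reference, under this approach the search itself is just a plain full-text match query (the index name test-index is hypothetical):
GET /test-index/_search
{
  "query": {
    "match": {
      "text": "chicken flavored stock"
    }
  }
}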
So here is something that you can try. Percolator can solve your problem but you will have to change the way you are indexing your documents.
So instead of indexing doc1 the way you are doing, index it like so:
PUT /test-index/.percolator/1
{
"query": {
"term": {
"text": {
"value": "chicken"
}
}
}
}
And, index doc2 like so:
PUT /test-index/.percolator/2
{
"query": {
"bool": {
"must": [
{
"term": {
"text": {
"value": "chicken"
}
}
},
{
"bool": {
"should": [
{
"term": {
"text": {
"value": "broth"
}
}
},
{
"term": {
"text": {
"value": "stock"
}
}
}
]
}
}
]
}
}
}
Now, instead of querying the way you were querying your documents earlier, percolate them:
GET /test-index/all_terms_search/_percolate
{
"doc": {
"text": "chicken flavored stock"
}
}
This will get both your documents. This also gives you the flexibility to control what and how much you want to match. While you are indexing your document's reverse queries in percolator, you provide an ID for that query and corresponding to that ID, you can maintain the text in a much simpler form for you to consume either in a separate index in Elasticsearch or may be some other datastore which can get matching documents really fast.