Proper way to convert the data type of a field in MongoDB

Possible duplicate of: How to change the type of a field?
I am new to MongoDB and I am facing a problem while converting the data type of a field's value to another data type.
Below is an example of my documents:
[
    {
        "Name of Restaurant": "Briyani Center",
        "Address": " 336 & 338, Main Road",
        "Location": "XYZQWE",
        "PriceFor2": "500.0",
        "Dining Rating": "4.3",
        "Dining Rating Count": "1500"
    },
    {
        "Name of Restaurant": "Veggie Conner",
        "Address": " New 14, Old 11/3Q, Railway Station Road",
        "Location": "ABCDEF",
        "PriceFor2": "1000.0",
        "Dining Rating": "4.4"
    }
]
Like the above, I have 12k documents. Notice that the data type of PriceFor2 is a string. I would like to convert it to an integer.
I have referred to many amazing answers given in the linked question, but when I try to run the query, I get a .save() is not a function error. Please advise what the problem is.
Below is the code I used:
db.chennaiData.find().forEach(function(x) {
    x.PriceFor2 = new NumberInt(x.PriceFor2);
    db.chennaiData.save(x);
});
This is the error I am getting:
TypeError: db.chennaiData.save is not a function

From MongoDB's save documentation:
Starting in MongoDB 4.2, the db.collection.save() method is deprecated. Use db.collection.insertOne() or db.collection.replaceOne() instead.
Likely you are running MongoDB 4.2+, so the save function is no longer available. Consider migrating to insertOne and replaceOne as suggested.
For your specific scenario, it is actually preferable to do this with a single update, as mentioned in another SO answer. It only does one db call, while your approach fetches all documents in the collection to the application level and then performs n db calls to save them back.
db.collection.update({},
    [
        {
            $set: {
                PriceFor2: {
                    $toDouble: "$PriceFor2"
                }
            }
        }
    ],
    { multi: true })
Mongo Playground
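If you would rather keep the forEach loop from your question, a minimal sketch of the same idea migrated to replaceOne (assuming mongosh and the chennaiData collection from your question) would be:
db.chennaiData.find().forEach(function(x) {
    // Convert the string price to an int and write the whole document back
    x.PriceFor2 = NumberInt(parseInt(x.PriceFor2, 10));
    db.chennaiData.replaceOne({ _id: x._id }, x);
});
That said, the single update above is still the better option, since it avoids round-tripping every document through the shell.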

Related

How to convert JSON to CSV and send it to BigQuery or a Google Cloud bucket

I'm new to NiFi and I want to convert a big amount of JSON data to CSV format.
This is what I am doing at the moment, but it is not giving the expected result.
These are the steps:
Processors to create the access_token and send the request body using InvokeHTTP (this part works fine and gives the expected result, so I won't name those processors), getting the response body in JSON.
Example of the JSON response:
[
    {
        "results": [
            {
                "customer": {
                    "resourceName": "customers/123456789",
                    "id": "11111111"
                },
                "campaign": {
                    "resourceName": "customers/123456789/campaigns/222456422222",
                    "name": "asdaasdasdad",
                    "id": "456456546546"
                },
                "adGroup": {
                    "resourceName": "customers/456456456456/adGroups/456456456456",
                    "id": "456456456546",
                    "name": "asdasdasdasda"
                },
                "metrics": {
                    "clicks": "11",
                    "costMicros": "43068982",
                    "impressions": "2079"
                },
                "segments": {
                    "device": "DESKTOP",
                    "date": "2021-11-22"
                },
                "incomeRangeView": {
                    "resourceName": "customers/456456456/incomeRangeViews/456456546~456456456"
                }
            },
            {
                "customer": {
                    "resourceName": "customers/123456789",
                    "id": "11111111"
                },
                "campaign": {
                    "resourceName": "customers/123456789/campaigns/222456422222",
                    "name": "asdasdasdasd",
                    "id": "456456546546"
                },
                "adGroup": {
                    "resourceName": "customers/456456456456/adGroups/456456456456",
                    "id": "456456456546",
                    "name": "asdasdasdas"
                },
                "metrics": {
                    "clicks": "11",
                    "costMicros": "43068982",
                    "impressions": "2079"
                },
                "segments": {
                    "device": "DESKTOP",
                    "date": "2021-11-22"
                },
                "incomeRangeView": {
                    "resourceName": "customers/456456456/incomeRangeViews/456456546~456456456"
                }
            },
            ....etc....
        ]
    }
]
Now I am using:
==> SplitJson ($[].results[]) ==> JoltTransformJSON with this spec:
[
    {
        "operation": "shift",
        "spec": {
            "customer": {
                "id": "customer_id"
            },
            "campaign": {
                "id": "campaign_id",
                "name": "campaign_name"
            },
            "adGroup": {
                "id": "ad_group_id",
                "name": "ad_group_name"
            },
            "metrics": {
                "clicks": "clicks",
                "costMicros": "cost",
                "impressions": "impressions"
            },
            "segments": {
                "device": "device",
                "date": "date"
            },
            "incomeRangeView": {
                "resourceName": "keywords_id"
            }
        }
    }
]
==> MergeContent (here is the problem, which I don't know how to fix):
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get a JSON file that has parts of the JSON data.
Example: I have 50k customer_ids that should end up in 1 JSON file, so I can send this data into a BigQuery table and have all the IDs under the same field, "customer_id".
The MergeContent processor uses the split JSON files and combines them, but I still get 10k customer_ids per file, i.e. I end up with 5 files and not 1 file with 50k customer_ids.
After the MergeContent I use ==> ConvertRecord with these settings:
Record Reader: JsonTreeReader (Schema Access Strategy: InferSchema)
Record Writer: CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set: UTF-8
)
==> UpdateAttribute (custom property filename: ${filename}.csv) ==> PutGCSObject, which puts the data into the Google bucket (this step works fine - I am able to put files there).
With this approach I am UNABLE to send data to BigQuery. After MergeContent I tried using PutBigQueryBatch and used this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed, and for the Load file type I tried NEWLINE_DELIMITED_JSON, or CSV when I converted the data to CSV. I am not getting errors, but no data is uploaded into the table.
What am I doing wrong? I basically want to map the data in such a way that each field's data ends up under the same field name.
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This skips a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
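For reference, with the shift spec from the question, a CSV Writer should flatten each result into one row; the header line would look roughly like this (a sketch - the exact column order depends on the schema the reader infers):
customer_id,campaign_id,campaign_name,ad_group_id,ad_group_name,clicks,cost,impressions,device,date,keywords_id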

GCP Dataflow job REST response: add a displayData object with { "key": "datasetName", ... }

Why doesn't this line of code generate a displayData object with { "key": "datasetName", ... }, and how can I generate it if it does not come by default when using the BigQuery source from Apache Beam?
bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(project=project,query=get_java_query))
[UPDATE] Adding the result I am trying to produce:
"displayData": [
{
"key": "table",
"namespace": "....",
"strValue": "..."
},
{
"key": "datasetName",
"strValue": "..."
}
]
From reading the implementation of display_data() for a BigQuerySource in the most recent version of Beam: it does not extract the table and dataset from the query (which is what your example uses), and, more significantly, it does not create any field specifically named datasetName.
I would recommend writing a subclass of _BigQuerySource which adds the fields you need to the display data, while preserving all the other behavior.

Handling multiple rows returned by the IMPORTJSON script in Google Sheets

I am trying to populate a Google Sheet using an API, but the API returns more than one row for a single query. Following is the JSON returned by the API.
# https://api.dictionaryapi.dev/api/v2/entries/en/ABANDON
[
    {
        "word": "abandon",
        "phonetics": [
            {
                "text": "/əˈbændən/",
                "audio": "https://lex-audio.useremarkable.com/mp3/abandon_us_1.mp3"
            }
        ],
        "meanings": [
            {
                "partOfSpeech": "transitive verb",
                "definitions": [
                    {
                        "definition": "Cease to support or look after (someone); desert.",
                        "example": "her natural mother had abandoned her at an early age",
                        "synonyms": [
                            "desert",
                            "leave",
                            "leave high and dry",
                            "turn one's back on",
                            "cast aside",
                            "break with",
                            "break up with"
                        ]
                    },
                    {
                        "definition": "Give up completely (a course of action, a practice, or a way of thinking)",
                        "example": "he had clearly abandoned all pretense of trying to succeed",
                        "synonyms": [
                            "renounce",
                            "relinquish",
                            "dispense with",
                            "forswear",
                            "disclaim",
                            "disown",
                            "disavow",
                            "discard",
                            "wash one's hands of"
                        ]
                    },
                    {
                        "definition": "Allow oneself to indulge in (a desire or impulse)",
                        "example": "they abandoned themselves to despair",
                        "synonyms": [
                            "indulge in",
                            "give way to",
                            "give oneself up to",
                            "yield to",
                            "lose oneself in",
                            "lose oneself to"
                        ]
                    }
                ]
            },
            {
                "partOfSpeech": "noun",
                "definitions": [
                    {
                        "definition": "Complete lack of inhibition or restraint.",
                        "example": "she sings and sways with total abandon",
                        "synonyms": [
                            "uninhibitedness",
                            "recklessness",
                            "lack of restraint",
                            "lack of inhibition",
                            "unruliness",
                            "wildness",
                            "impulsiveness",
                            "impetuosity",
                            "immoderation",
                            "wantonness"
                        ]
                    }
                ]
            }
        ]
    }
]
By using the following calls via IMPORTJSON:
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/phonetics/text", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/partOfSpeech", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/definitions/definition", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/definitions/synonyms", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/definitions/example", "noHeaders")
I am able to get the result shown in the first screenshot in Google Sheets, whereas the actual output according to the JSON should be as shown in the second screenshot. As you can see, a complete row is being overwritten. How can this be fixed?
EDIT
Following is the link to the sheet, for viewing only.
I believe your goal is as follows.
You want to achieve the layout shown in the bottom image of your question in Google Sheets.
Unfortunately, I couldn't find a way to directly retrieve that layout using ImportJSON. So in this answer, I would like to propose a sample script that retrieves the values you expect using Google Apps Script. I thought that creating a sample script that directly achieves your goal might be simpler than modifying ImportJSON.
Sample script:
function SAMPLE(url) {
  // Fetch the dictionary API response for the given URL.
  var res = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (res.getResponseCode() != 200) return res.getContentText();
  var obj = JSON.parse(res.getContentText());
  // Expand every definition into its own row:
  // [phonetic text, part of speech, definition, synonyms, example].
  // The phonetic text and part of speech are only written on the first row of each meaning.
  var values = obj[0].meanings.reduce((ar, {partOfSpeech, definitions}, i) => {
    definitions.forEach(({definition, example, synonyms}, j) => {
      var v = [definition, Array.isArray(synonyms) ? synonyms.join(",") : synonyms, example];
      var phonetics = obj[0].phonetics[i];
      ar.push(j == 0 ? [(phonetics ? phonetics.text : ""), partOfSpeech, ...v] : ["", "", ...v]);
    });
    return ar;
  }, []);
  return values;
}
When you use this script, please put =SAMPLE(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2)) into a cell as a custom formula.
Result:
When the above script is used, each definition is returned as its own row, giving the layout you expect.
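To illustrate, a rough sketch of the first rows the function would return for the "abandon" JSON above (synonym lists abridged):
[
  ["/əˈbændən/", "transitive verb", "Cease to support or look after (someone); desert.", "desert,leave,...", "her natural mother had abandoned her at an early age"],
  ["", "", "Give up completely (a course of action, a practice, or a way of thinking)", "renounce,relinquish,...", "he had clearly abandoned all pretense of trying to succeed"],
  ["", "", "Allow oneself to indulge in (a desire or impulse)", "indulge in,give way to,...", "they abandoned themselves to despair"]
]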
Note:
In this sample script, if the structure of the JSON object changes, the script might no longer work. So please be careful about this.
References:
Class UrlFetchApp
Custom Functions in Google Sheets

Kafka Connect S3 sink - how to use the timestamp from the message itself [timestamp extractor]

I've been struggling with a problem using Kafka Connect and the S3 sink.
First, the structure:
{
    Partition: number
    Offset: number
    Key: string
    Message: json string
    Timestamp: timestamp
}
Normally, when posting to Kafka, the timestamp should be set by the producer. Unfortunately, there seem to be cases where this didn't happen, which means that the Timestamp might sometimes be null.
To extract this timestamp, the connector was set to the following value:
"timestamp.extractor":"Record"
Now, it is certain that the Message field itself always contains a timestamp as well.
Message:
{
    timestamp: "2019-04-02T06:27:02.667Z",
    metadata: {
        creationTimestamp: "1554186422667"
    }
}
The question, however, is: how do I now use that field for the timestamp.extractor?
I was thinking that this would suffice, but this doesn't seem to work:
"timestamp.extractor":"RecordField",
"timestamp.field":"message.timestamp",
This results in a NullPointerException as well.
Any ideas as to how to use the timestamp from the Kafka message payload itself, instead of the default timestamp field that is set for Kafka v0.10+?
EDIT:
Full config:
{ "name": "<name>",
"config": {
"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"tasks.max":"4",
"topics":"<topic>",
"flush.size":"100",
"s3.bucket.name":"<bucket name>",
"s3.region": "<region>",
"s3.part.size":"<partition size>",
"rotate.schedule.interval.ms":"86400000",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"format.class":"io.confluent.connect.s3.format.json.JsonFormat",
"locale":"ENGLISH",
"timezone":"UTC",
"schema.generator.class":"io.confluent.connect.storage.hive.schema.TimeBasedSchemaGenerator",
"partitioner.class":"io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "3600000",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd",
"timestamp.extractor":"RecordField",
"timestamp.field":"message.timestamp",
"max.poll.interval.ms": "600000",
"request.timeout.ms": "610000",
"heartbeat.interval.ms": "6000",
"session.timeout.ms": "20000",
"s3.acl.canned":"bucket-owner-full-control"
}
}
EDIT 2:
Kafka message payload structure:
{
    "reference": "",
    "clientId": "",
    "gid": "",
    "timestamp": "2019-03-19T15:27:55.526Z"
}
EDIT 3:
{
    "transforms": "convert_op_creationDateTime",
    "transforms.convert_op_creationDateTime.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
    "transforms.convert_op_creationDateTime.target.type": "Timestamp",
    "transforms.convert_op_creationDateTime.field": "timestamp",
    "transforms.convert_op_creationDateTime.format": "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
}
So I tried doing a transform on the object, but it seems like I'm stuck again. The pattern is reported as invalid, even though, from looking around the internet, it does seem to be a valid SimpleDateFormat pattern. It seems to be complaining about the 'T'. I updated the message schema as well.
Based on the schema you've shared, you should be setting:
"timestamp.extractor":"RecordField",
"timestamp.field":"timestamp",
i.e. no message prefix to the timestamp field name.
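Applied to the full config from the question, the partitioner-related block would then look like this (only the last line changes; everything else stays as it is):
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "3600000",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd",
"timestamp.extractor": "RecordField",
"timestamp.field": "timestamp"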
If the data is a string, then Connect will try to parse it as milliseconds - source code here.
In any case, message.timestamp assumes the data looks like { "message" : { "timestamp": ... } }, so just timestamp would be correct. And having nested fields didn't use to be possible anyway, so you might want to clarify which version of Connect you have.
I'm not entirely sure how you would get instanceof Date to evaluate to true when using the JSON converter, and even if you had set schema.enable = true, you can see in the code that there are only conditions for schema types of numbers and strings, and those are still assumed to be milliseconds.
You can try using the TimestampConverter transformation to convert your date string.

Possible to use angular-datatables with server-side array-sourced data instead of object-sourced data?

I'm trying to use angular-datatables with server-side processing. However, it seems that angular-datatables expects the data from the server to be in object format (object vs. array data), with column names preceding each table data point. I'd like to configure angular-datatables to accept array-based data, since I can't modify my server-side output, which only produces data in array format.
I'm configuring DataTables in my JavaScript like so:
var vm = this;
vm.dtOptions = DTOptionsBuilder.newOptions()
    .withOption('ajax', {
        url: 'output/ss_results/' + $routeParams.uuid,
        type: 'GET'
    })
    .withDataProp('data')
    .withOption('processing', true)
    .withOption('serverSide', true);
My data from the server looks like this in array format:
var data = [
    [
        "Tiger Nixon",
        "System Architect",
        "$3,120"
    ],
    [
        "Garrett Winters",
        "Director",
        "$5,300"
    ]
];
But as far as I can tell, angular-datatables is expecting the data in object format like so:
[
    {
        "name": "Tiger Nixon",
        "position": "System Architect",
        "extn": "5421"
    },
    {
        "name": "Garrett Winters",
        "position": "Director",
        "extn": "8422"
    }
]
I tried not defining dtColumns, or setting it to an empty array like vm.dtColumns = [];, but I get an error message when I do that. When I configure dtColumns with a promise to load the column data via AJAX, I get DataTables error #4 because it can't find the column names preceding my table data points in the data retrieved from the server.
Is it possible to configure angular-datatables to accept array-based data? I can't find anything on the angular-datatables website that indicates it can be configured this way.
Edit: So I removed the .withDataProp('data'), which I think was causing the problem. The table works a little better now, but it's still broken. After it loads, I get the message "No matching records found", even though right below it it says "Showing 1 to 10 of 60,349 entries" and the pager shows "Previous 1 … 456 … 6035 Next". Does anyone know why this might be?
If you want to use an array of arrays instead of an array of objects, simply refer to the array indexes instead of the object names:
$scope.dtColumns = [
    DTColumnBuilder.newColumn(0).withTitle('Name'),
    DTColumnBuilder.newColumn(1).withTitle('Position'),
    DTColumnBuilder.newColumn(2).withTitle('Office'),
    DTColumnBuilder.newColumn(3).withTitle('Start date'),
    DTColumnBuilder.newColumn(4).withTitle('Salary')
]
Demo using the famous "Tiger Nixon" array loaded via AJAX: http://plnkr.co/edit/16UoRqF5hvg2YpvAP8J3?p=preview
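Putting it together with the options from the question (and with .withDataProp('data') removed, as in your edit), a minimal sketch of the controller might look like this - the three columns match your sample array; adjust the ajax URL and titles to your case:
var vm = this;
vm.dtOptions = DTOptionsBuilder.newOptions()
    .withOption('ajax', {
        // Server-side endpoint returning { draw, recordsTotal, recordsFiltered, data: [[...], [...]] }
        url: 'output/ss_results/' + $routeParams.uuid,
        type: 'GET'
    })
    .withOption('processing', true)
    .withOption('serverSide', true);
// Refer to array indexes instead of object property names
vm.dtColumns = [
    DTColumnBuilder.newColumn(0).withTitle('Name'),
    DTColumnBuilder.newColumn(1).withTitle('Position'),
    DTColumnBuilder.newColumn(2).withTitle('Salary')
];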