Extracting particular nested properties with a $ prefix in Amazon Redshift or Quicksight - sql

I am using PostHog for product analytics and have exported some event data to Amazon Redshift as well as S3 to be used in Quicksight.
Under the personal properties part of the JSON, each individual property is nested, and each key begins with a $.
I am quite new to SQL queries, and to extracting specific values from JSON in Quicksight using parseJson.
Here is an example of the JSON from PostHog
"properties": {
"$active_feature_flags": [],
"$browser": "Chrome",
"$browser_version": 98,
"$ce_version": 1,
"$device_type": "Desktop",
"$environment": "test",
"$event_type": "click",
"$lib": "web",
"$lib_version": "1.17.8",
"$os": "Mac OS X",
"$pathname": "/events",
"$plugins_deferred": [],
"$plugins_failed": [],
"$plugins_succeeded": [
"First Event Today (4914)",
"GeoIP (5539)"
],
I have sought help from a few sources who have mentioned it isn't as simple because of the $ symbol at the beginning.
So my question would be,
How would I query this in Redshift to successfully extract $device_type and $os for example?
How would I pull the same properties using parseJson in Amazon Quicksight?

I can answer #1.
The JSON provided looks like a snippet and is invalid as is, so I removed the trailing ',' and used SQL to supply the surrounding '{}'. Once it is valid JSON, this runs fine:
create table test as select '"properties": {
"$active_feature_flags": [],
"$browser": "Chrome",
"$browser_version": 98,
"$ce_version": 1,
"$device_type": "Desktop",
"$environment": "test",
"$event_type": "click",
"$lib": "web",
"$lib_version": "1.17.8",
"$os": "Mac OS X",
"$pathname": "/events",
"$plugins_deferred": [],
"$plugins_failed": [],
"$plugins_succeeded": [
"First Event Today (4914)",
"GeoIP (5539)"
]
}' as json_text;
select json_extract_path_text('{' || json_text ||'}', 'properties' ,'$device_type') as device_type,
json_extract_path_text('{' || json_text ||'}', 'properties' ,'$os') as os
from test;
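The same idea can be checked outside Redshift. Below is a minimal Python sketch, assuming the export really is a bare "properties" member like the one above; the function name extract_properties and the shortened fragment are illustrative only. It wraps the fragment in braces, parses it, and looks up the $-prefixed keys as ordinary strings, which shows that the $ needs no escaping at the JSON level.

```python
import json

# Shortened fragment in the exported shape: a bare "properties" member
# with no enclosing braces (hypothetical sample based on the question).
fragment = '''"properties": {
  "$device_type": "Desktop",
  "$os": "Mac OS X",
  "$plugins_succeeded": ["First Event Today (4914)", "GeoIP (5539)"]
}'''


def extract_properties(json_fragment, *keys):
    """Wrap the fragment in braces, parse it, and return the requested properties."""
    document = json.loads("{" + json_fragment + "}")
    properties = document["properties"]
    # "$" has no special meaning inside a JSON key, so a plain lookup works.
    return {key: properties.get(key) for key in keys}


result = extract_properties(fragment, "$device_type", "$os")
print(result)
```

Keys that are absent come back as None here, much as a missing path yields an empty result rather than an error in the SQL version.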

Related

Proper way to convert Data type of a field in MongoDB

Possible duplicate of How to change the type of a field?
I am new to MongoDB and am having trouble converting a field's value from one data type to another.
Below is an example of my document
[
{
"Name of Restaurant": "Briyani Center",
"Address": " 336 & 338, Main Road",
"Location": "XYZQWE",
"PriceFor2": "500.0",
"Dining Rating": "4.3",
"Dining Rating Count": "1500",
},
{
"Name of Restaurant": "Veggie Conner",
"Address": " New 14, Old 11/3Q, Railway Station Road",
"Location": "ABCDEF",
"PriceFor2": "1000.0",
"Dining Rating": "4.4",
}]
I have 12k documents like the ones above. Notice that the data type of PriceFor2 is string. I would like to convert it to an integer.
I have read many of the answers in the linked question, but when I run the query I get a ".save() is not a function" error. Please advise what the problem is.
Below is the code I used
db.chennaiData.find().forEach( function(x){ x.priceFor2= new NumberInt(x.priceFor2);
db.chennaiData.save(x);
db.chennaiData.save(x);});
This is the error I am getting..
TypeError: db.chennaiData.save is not a function
From MongoDB's save documentation:
Starting in MongoDB 4.2, the
db.collection.save()
method is deprecated. Use db.collection.insertOne() or db.collection.replaceOne() instead.
You are likely running MongoDB 4.2 or later, and the TypeError indicates that the shell you are using no longer provides save. Consider migrating to insertOne and replaceOne as suggested.
For your specific scenario, it is preferable to do the conversion with a single update, as mentioned in another SO answer. That takes one database call, whereas your approach fetches every document in the collection to the application level and then performs n database calls to save them back. Note that $toDouble yields a double rather than an integer.
db.collection.update({},
[
{
$set: {
PriceFor2: {
$toDouble: "$PriceFor2"
}
}
}
],
{
multi: true
})
Mongo Playground
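For readers driving MongoDB from Python rather than the shell, the update above can be sketched as follows. This is only an illustration: the helper name build_to_double_pipeline is hypothetical, and the connection details in the comments (including the database name) are assumptions. It relies on pipeline-style updates, which need MongoDB 4.2 or later.

```python
def build_to_double_pipeline(field_name):
    """Return an aggregation-pipeline update that casts one string field to a double."""
    return [
        # "$" + field_name is the field path that reads the current value.
        {"$set": {field_name: {"$toDouble": "$" + field_name}}}
    ]


pipeline = build_to_double_pipeline("PriceFor2")
print(pipeline)

# With a live connection (not run here), the pipeline would be passed as the
# update document, so the whole collection is converted in one call:
#   from pymongo import MongoClient
#   collection = MongoClient()["test"]["chennaiData"]  # database name is an assumption
#   collection.update_many({}, pipeline)
```

Building the pipeline in a small function keeps the field name in one place, so the same helper can be reused for "Dining Rating" or "Dining Rating Count".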

How to convert Json to CSV and send it to big query or google cloud bucket

I'm new to NiFi and I want to convert a large amount of JSON data to CSV format.
This is what I am doing at the moment, but it does not give the expected result.
These are the steps:
Processors that create an access_token and send the request body using InvokeHTTP, returning the response body as JSON. (This part works fine, so I won't name the processors.)
Example of json response:
[
{
"results":[
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdaasdasdad",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdasda"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdasdasdasd",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdas"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
....etc....
]
}
]
Now I am using:
===>SplitJson ($[].results[])==>JoltTransformJSON with this spec:
[{
"operation": "shift",
"spec": {
"customer": {
"id": "customer_id"
},
"campaign": {
"id": "campaign_id",
"name": "campaign_name"
},
"adGroup": {
"id": "ad_group_id",
"name": "ad_group_name"
},
"metrics": {
"clicks": "clicks",
"costMicros": "cost",
"impressions": "impressions"
},
"segments": {
"device": "device",
"date": "date"
},
"incomeRangeView": {
"resourceName": "keywords_id"
}
}
}]
==>> MergeContent (here is the problem, which I don't know how to fix)
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get JSON files that each contain only part of the JSON data.
Example: I have 50k customer_ids, and I would like them in 1 JSON file so that I can send the data to a BigQuery table and have all the ids under the same field "customer_id".
MergeContent takes the split JSON files and combines them, but I still get 10k customer_ids per file, i.e. I have 5 files and not 1 file with 50k customer_ids.
After the MergeContent I use ==>>ConvertRecord with these settings:
Record Reader JsonTreeReader (Schema Access Strategy: InferSchema)
Record Writer CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set UTF-8
)
==>>UpdateAttribute (custom prop: filename: ${filename}.csv) ==>> PutGCSObject(and put the data into the google bucket (this step works fine- I am able to put files there))
With this approach I am UNABLE to send data to BigQuery. After MergeContent I tried PutBigQueryBatch, and used this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed. For Load file type I tried NEWLINE_DELIMITED_JSON, or CSV when I had converted the data to CSV. I am not getting errors, but no data is uploaded into the table.
What am I doing wrong? I basically want to map the data so that each field's values end up under the same field name.
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This skips a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
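To make the record-oriented idea concrete outside NiFi, here is a rough Python sketch of what a single reader/transform/writer pass does: it walks every entry under "results", applies the same field mapping as the Jolt shift spec in the question, and writes one CSV with a shared header. The names COLUMN_PATHS and results_to_csv are hypothetical, and the sample is a one-result excerpt of the response shown in the question; this is not how NiFi implements it internally.

```python
import csv
import io
import json

# Output column -> (section, key) in each result, mirroring the Jolt shift spec.
COLUMN_PATHS = {
    "customer_id": ("customer", "id"),
    "campaign_id": ("campaign", "id"),
    "campaign_name": ("campaign", "name"),
    "ad_group_id": ("adGroup", "id"),
    "ad_group_name": ("adGroup", "name"),
    "clicks": ("metrics", "clicks"),
    "cost": ("metrics", "costMicros"),
    "impressions": ("metrics", "impressions"),
    "device": ("segments", "device"),
    "date": ("segments", "date"),
    "keywords_id": ("incomeRangeView", "resourceName"),
}


def results_to_csv(response_text):
    """Flatten every result in the response into one CSV, in a single pass."""
    payload = json.loads(response_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(COLUMN_PATHS), lineterminator="\n")
    writer.writeheader()
    for batch in payload:
        for result in batch.get("results", []):
            # Missing sections or keys become empty cells instead of errors.
            row = {
                column: result.get(section, {}).get(key, "")
                for column, (section, key) in COLUMN_PATHS.items()
            }
            writer.writerow(row)
    return buffer.getvalue()


sample = json.dumps([{"results": [{
    "customer": {"resourceName": "customers/123456789", "id": "11111111"},
    "campaign": {"name": "asdaasdasdad", "id": "456456546546"},
    "adGroup": {"id": "456456456546", "name": "asdasdasdasda"},
    "metrics": {"clicks": "11", "costMicros": "43068982", "impressions": "2079"},
    "segments": {"device": "DESKTOP", "date": "2021-11-22"},
    "incomeRangeView": {"resourceName": "customers/456456456/incomeRangeViews/456456546~456456456"},
}]}])

csv_text = results_to_csv(sample)
lines = csv_text.splitlines()
print(csv_text)
```

Because nothing is split into separate pieces, there is nothing to merge back together, so every customer_id ends up under one header in one output.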

GCP Dataflow JOB REST response add displayData object with { "key":"datasetName", ...}

Why doesn't this line of code generate a displayData object with { "key":"datasetName", ...}, and how can I generate it if it is not produced by default when using a BigQuery source from Apache Beam?
bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(project=project,query=get_java_query))
[UPDATE] Adding the result that I am trying to produce:
"displayData": [
{
"key": "table",
"namespace": "....",
"strValue": "..."
},
{
"key": "datasetName",
"strValue": "..."
}
]
From reading the implementation of display_data() for a BigQuerySource in the most recent version of Beam, it does not extract the table and dataset from the query, which your example uses. And more significantly, it does not create any fields specifically named datasetName.
I would recommend writing a subclass of _BigQuerySource which adds the fields you need to the display data, while preserving all the other behavior.

How to load a jsonl file into BigQuery when the file has mix data fields as columns

During my work flow, after extracting the data from API, the JSON has the following structure:
[
{
"fields":
[
{
"meta": {
"app_type": "ios"
},
"name": "app_id",
"value": 100
},
{
"meta": {},
"name": "country",
"value": "AE"
},
{
"meta": {
"name": "Top"
},
"name": "position",
"value": 1
}
],
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
}
]
It is then stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:
app_type  app_id  country  position  click  price  count
ios       100     AE       Top       1      1      1
Is there a way to define a manual schema in BigQuery to achieve this result? Or do I have to preprocess the jsonl file before loading it into BigQuery?
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you convert the JSON files to jsonl and store them in GCS, you will have to do some preprocessing.
You probably have two options:
pre-create the target table with app_id as an INTEGER field
preprocess the JSON file and enclose 100 in quotes, like "100"
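The preprocessing step could look roughly like the Python sketch below. It rests on assumptions inferred from the single example in the question: a "name" entry in meta replaces the field's value (so position becomes "Top"), any other meta entry becomes its own column (so app_type becomes "ios"), and every value is written as a string, as in the sample jsonl line above. The function names flatten_record and to_jsonl are hypothetical.

```python
import json


def flatten_record(record):
    """Flatten one API record into a single-level row of string values."""
    row = {}
    for field in record.get("fields", []):
        meta = dict(field.get("meta", {}))
        # Assumption: meta["name"], when present, is the value to keep.
        label = meta.pop("name", None)
        # Assumption: any other meta entry becomes a column of its own.
        row.update(meta)
        row[field["name"]] = label if label is not None else field["value"]
    row.update(record.get("metrics", {}))
    # Quote everything so no column mixes numbers and strings.
    return {key: str(value) for key, value in row.items()}


def to_jsonl(records):
    """Serialize flattened records as newline-delimited JSON."""
    return "\n".join(
        json.dumps(flatten_record(record), separators=(",", ":")) for record in records
    )


records = [{
    "fields": [
        {"meta": {"app_type": "ios"}, "name": "app_id", "value": 100},
        {"meta": {}, "name": "country", "value": "AE"},
        {"meta": {"name": "Top"}, "name": "position", "value": 1},
    ],
    "metrics": {"click": 1, "price": 1, "count": 1},
}]

jsonl_text = to_jsonl(records)
print(jsonl_text)
```

If the target table is pre-created with numeric columns instead, the final str() conversion can be dropped, provided each column then holds a single type.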

Possible to use angular-datatables with serverside array sourced data instead of object sourced data

I'm trying to use angular-datatables with serverside processing. However, it seems that angular-datatables expects that the data from the server is in object format (object vs array data described) with column names preceding each table datapoint. I'd like to configure angular-datatables to accept array based data since I can't modify my server side output which only outputs data in array format.
I'm configuring Datatables in my javascript like so:
var vm = this;
vm.dtOptions = DTOptionsBuilder.newOptions()
.withOption('ajax', {
url: 'output/ss_results/' + $routeParams.uuid,
type: 'GET'
})
.withDataProp('data')
.withOption('processing', true)
.withOption('serverSide', true);
My data from the server looks like this in array format:
var data = [
[
"Tiger Nixon",
"System Architect",
"$3,120"
],
[
"Garrett Winters",
"Director",
"$5,300"
]
]
But as far as I can tell, angular-datatables is expecting the data in object format like so:
[
{
"name": "Tiger Nixon",
"position": "System Architect",
"extn": "5421"
},
{
"name": "Garrett Winters",
"position": "Director",
"extn": "8422"
}
]
I tried not defining dtColumns or setting it to an empty array like vm.dtColumns = []; but I get an error message when I do that. When I configure dtColumns with a promise to load the column data via ajax I get datatables error #4 because it can't find the column name preceding my table datapoints in the data retrieved from the server.
Is it possible to configure angular-datatables to accept array based data? I can't find anything on the angular-datatables website that indicates it can be configured this way.
Edit: I removed .withDataProp('data'), which I think was causing the problem. The table works a little better now, but it is still broken. After it loads, I get the message "No matching records found", even though right below it the table says "Showing 1 to 10 of 60,349 entries" and shows the pagination controls. Does anyone know why this might be?
If you want to use an array of arrays instead of an array of objects, simply refer to the array indexes instead of the property names:
$scope.dtColumns = [
DTColumnBuilder.newColumn(0).withTitle('Name'),
DTColumnBuilder.newColumn(1).withTitle('Position'),
DTColumnBuilder.newColumn(2).withTitle('Office'),
DTColumnBuilder.newColumn(3).withTitle('Start date'),
DTColumnBuilder.newColumn(4).withTitle('Salary')
]
demo using the famous "Tiger Nixon" array loaded via AJAX -> http://plnkr.co/edit/16UoRqF5hvg2YpvAP8J3?p=preview