How to convert JSON to CSV and send it to BigQuery or a Google Cloud Storage bucket

I'm new to NiFi and I want to convert a large amount of JSON data to CSV format.
This is what I am doing at the moment, but it is not giving the expected result.
These are the steps:
Processors that create the access_token and send the request body using InvokeHTTP (this part works fine, so I won't name those processors since they give the expected result), returning the response body as JSON.
Example of the JSON response:
[
  {
    "results": [
      {
        "customer": {
          "resourceName": "customers/123456789",
          "id": "11111111"
        },
        "campaign": {
          "resourceName": "customers/123456789/campaigns/222456422222",
          "name": "asdaasdasdad",
          "id": "456456546546"
        },
        "adGroup": {
          "resourceName": "customers/456456456456/adGroups/456456456456",
          "id": "456456456546",
          "name": "asdasdasdasda"
        },
        "metrics": {
          "clicks": "11",
          "costMicros": "43068982",
          "impressions": "2079"
        },
        "segments": {
          "device": "DESKTOP",
          "date": "2021-11-22"
        },
        "incomeRangeView": {
          "resourceName": "customers/456456456/incomeRangeViews/456456546~456456456"
        }
      },
      {
        "customer": {
          "resourceName": "customers/123456789",
          "id": "11111111"
        },
        "campaign": {
          "resourceName": "customers/123456789/campaigns/222456422222",
          "name": "asdasdasdasd",
          "id": "456456546546"
        },
        "adGroup": {
          "resourceName": "customers/456456456456/adGroups/456456456456",
          "id": "456456456546",
          "name": "asdasdasdas"
        },
        "metrics": {
          "clicks": "11",
          "costMicros": "43068982",
          "impressions": "2079"
        },
        "segments": {
          "device": "DESKTOP",
          "date": "2021-11-22"
        },
        "incomeRangeView": {
          "resourceName": "customers/456456456/incomeRangeViews/456456546~456456456"
        }
      },
      ....etc....
    ]
  }
]
Now I am using:
===> SplitJson ($[*].results[*]) ==> JoltTransformJSON with this spec:
[
  {
    "operation": "shift",
    "spec": {
      "customer": {
        "id": "customer_id"
      },
      "campaign": {
        "id": "campaign_id",
        "name": "campaign_name"
      },
      "adGroup": {
        "id": "ad_group_id",
        "name": "ad_group_name"
      },
      "metrics": {
        "clicks": "clicks",
        "costMicros": "cost",
        "impressions": "impressions"
      },
      "segments": {
        "device": "device",
        "date": "date"
      },
      "incomeRangeView": {
        "resourceName": "keywords_id"
      }
    }
  }
]
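For reference, applied to one split result from the sample response above, this shift spec produces a flat object like:
{
  "customer_id": "11111111",
  "campaign_id": "456456546546",
  "campaign_name": "asdaasdasdad",
  "ad_group_id": "456456456546",
  "ad_group_name": "asdasdasdasda",
  "clicks": "11",
  "cost": "43068982",
  "impressions": "2079",
  "device": "DESKTOP",
  "date": "2021-11-22",
  "keywords_id": "customers/456456456/incomeRangeViews/456456546~456456456"
}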
==>> MergeContent (here is the problem, which I don't know how to fix)
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum Number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What result do I get?
I get a JSON file that contains only part of the JSON data.
Example: I have 50k customer_ids in one JSON file, and I would like to send this data into a BigQuery table with all the ids under the same field "customer_id".
MergeContent takes the split JSON files and combines them, but I still end up with 10k customer_ids per file, i.e. I get 5 files instead of 1 file with 50k customer_ids.
After the MergeContent I use ==>>ConvertRecord with these settings:
Record Reader: JsonTreeReader (Schema Access Strategy: Infer Schema)
Record Writer: CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set UTF-8
)
==>> UpdateAttribute (custom property: filename: ${filename}.csv) ==>> PutGCSObject, which puts the data into the Google Cloud Storage bucket (this step works fine; I am able to put files there).
With this approach I am UNABLE to send the data to BigQuery. (After MergeContent I tried using PutBigQueryBatch and used this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed. For Load file type I tried NEWLINE_DELIMITED_JSON, or CSV when I had converted the data to CSV. I get no errors, but no data is uploaded into the table.)
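For reference, the NEWLINE_DELIMITED_JSON load format expects one JSON object per line rather than a single JSON array, so a merged JSON array has to be rewritten before loading. A minimal Python sketch of that conversion (file names are illustrative):
import json

# Turn a merged JSON array into newline-delimited JSON: one object per line.
with open("merged.json") as src, open("merged.ndjson", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")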
What am I doing wrong? I basically want to map the data in such a way that each field's data ends up under the same field name.

The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This skips a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
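For illustration, with a JSON Reader and CSV Writer wrapped around the shift spec from the question, each result should come out as one CSV row along these lines (values taken from the sample response; exact column order may differ):
customer_id,campaign_id,campaign_name,ad_group_id,ad_group_name,clicks,cost,impressions,device,date,keywords_id
11111111,456456546546,asdaasdasdad,456456456546,asdasdasdasda,11,43068982,2079,DESKTOP,2021-11-22,customers/456456456/incomeRangeViews/456456546~456456456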

Related

Replace objects with the corresponding ObjectId with Mongoose if found in MongoDB

I have a MEAN stack setup in which I have Devices and Servicecases saved in the MongoDB database.
Devices can be the content of a Servicecase.
When a new case is created, my frontend delivers the following form data:
content: [
{
"device": 012345678909876,
"errorDesc": "lorem"
},
{
"device": 012345678909876,
"errorDesc": "ipsum"
}
]
There could be a device document with the submitted device number in the database. If so, the received doc should be populated with its ObjectId to look like this:
content: [
{
device: { type: Schema.Types.ObjectId, ref: 'Device' },
errorDesc: String
},
...
]
If not, it should stay as it is.
I could iterate through each device in the array, use a findOne() query and, if a doc is found, replace it, but is there a more efficient way using the populate() transformation?

GCP Dataflow JOB REST response add displayData object with { "key":"datasetName", ...}

Why doesn't this line of code generate a displayData object with { "key":"datasetName", ...}, and how can I generate it if it does not come by default when using the BigQuery source from Apache Beam?
bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(project=project,query=get_java_query))
[UPDATE] Adding the result that I am trying to produce:
"displayData": [
{
"key": "table",
"namespace": "....",
"strValue": "..."
},
{
"key": "datasetName",
"strValue": "..."
}
]
From reading the implementation of display_data() for a BigQuerySource in the most recent version of Beam: it does not extract the table and dataset from the query (which is what your example uses), and, more significantly, it does not create any field specifically named datasetName.
I would recommend writing a subclass of _BigQuerySource which adds the fields you need to the display data, while preserving all the other behavior.
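A rough sketch of such a subclass (a sketch only, assuming a Beam version where _BigQuerySource lives in apache_beam.io.gcp.bigquery and exposes display_data(); MyBigQuerySource and the dataset_name parameter are illustrative names):
import apache_beam as beam
from apache_beam.io.gcp.bigquery import _BigQuerySource  # exact name/path varies by Beam version
from apache_beam.transforms.display import DisplayDataItem


class MyBigQuerySource(_BigQuerySource):
    """Same behavior as the stock source, plus a datasetName display-data entry."""

    def __init__(self, dataset_name=None, **kwargs):
        super().__init__(**kwargs)
        self._dataset_name = dataset_name

    def display_data(self):
        items = super().display_data()  # keep whatever the base source already reports
        items['datasetName'] = DisplayDataItem(self._dataset_name, label='Dataset Name')
        return items


# Illustrative usage, mirroring the pipeline line from the question:
# bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(
#     MyBigQuerySource(project=project, query=get_java_query, dataset_name='my_dataset'))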

How to load a jsonl file into BigQuery when the file has mix data fields as columns

During my workflow, after extracting the data from the API, the JSON has the following structure:
[
  {
    "fields": [
      {
        "meta": {
          "app_type": "ios"
        },
        "name": "app_id",
        "value": 100
      },
      {
        "meta": {},
        "name": "country",
        "value": "AE"
      },
      {
        "meta": {
          "name": "Top"
        },
        "name": "position",
        "value": 1
      }
    ],
    "metrics": {
      "click": 1,
      "price": 1,
      "count": 1
    }
  }
]
Then it is stored as .jsonl and put on GCS. However, when I load it into BigQuery for further processing, the automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:
app_type | app_id | country | position | click | price | count
ios      | 100    | AE      | Top      | 1     | 1     | 1
Is there a way to define a manual schema in BigQuery to achieve this result, or do I have to preprocess the jsonl file before putting it into BigQuery?
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you handle the conversion of the JSON files to jsonl files and their upload to GCS, you will have to do some preprocessing (see the sketch below).
You probably have two options:
pre-create the target table with an app_id field typed as INTEGER
preprocess the json file and enclose 100 in quotes, like "100"
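In case it helps, a rough Python sketch of that preprocessing (a sketch only, assuming the mapping implied by the expected table in the question, where a meta "name" such as "Top" replaces the field's raw value and every value is written out as a string; file names are illustrative):
import json

def flatten(record):
    row = {}
    for field in record.get("fields", []):
        meta = field.get("meta", {})
        # Carry meta attributes such as app_type over as columns of their own.
        for key, value in meta.items():
            if key != "name":
                row[key] = str(value)
        # Per the expected table, a meta "name" (e.g. "Top") replaces the raw value.
        row[field["name"]] = str(meta.get("name", field["value"]))
    # Quote the metrics too, so BigQuery sees consistent string types.
    for key, value in record.get("metrics", {}).items():
        row[key] = str(value)
    return row

with open("input.json") as src, open("output.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(flatten(record)) + "\n")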

Pentaho: Getting only one in row from JSON input file

I am getting one JSON file from SFTP and trying to insert it into Oracle, but in the preview section I am getting only one row, and only one row is inserted into the table. I already tried changing the number of preview rows to 10000, but nothing worked.
{"postal_code":"XX","build_id":"XX","categories":[],"closed":false,"closed_reasons":[],"email":["XX"],"external_link":{"facebook":[],"yelp":[""]},"hq":false,"location":{"lat":xxxxx,"lon":xxxxx},"metro":"Cxxxxx, IL","naics_codes":[{"category_code":"XX","category_description":"XX"},{"category_code":"XX","category_description":"xxxxx "},{"category_code":"XX","category_description":"xxxxx "},{"category_code":xxxxx","category_description":"XX"}],"name":"XX","place_id":"XX","place_ids":["xxx","xx"],"sic_codes":[{"category_code":"XX","category_description":"XX"},{"category_code":"XX","category_description":"XX"}]}
Your example json is:
{
  "postal_code": "XX",
  "build_id": "XX",
  "categories": [],
  "closed": false,
  "closed_reasons": [],
  "email": ["XX"],
  "external_link": {
    "facebook": [],
    "yelp": [""]
  },
  "hq": false,
  "location": {
    "lat": xxxxx,
    "lon": xxxxx
  },
  "metro": "Cxxxxx, IL",
  "naics_codes": [
    {
      "category_code": "XX",
      "category_description": "XX"
    },
    {
      "category_code": "XX",
      "category_description": "xxxxx "
    },
    {
      "category_code": "XX",
      "category_description": "xxxxx "
    },
    {
      "category_code": xxxxx",
      "category_description": "XX"
    }
  ],
  "name": "XX",
  "place_id": "XX",
  "place_ids": ["xxx", "xx"],
  "sic_codes": [
    {
      "category_code": "XX",
      "category_description": "XX"
    },
    {
      "category_code": "XX",
      "category_description": "XX"
    }
  ]
}
So, if that is the total content of the response you are getting from the SFTP server, Pentaho is behaving correctly, because it is only one record. If you want PDI to recognize the fields inside that JSON and split its content, you need to specify the path to each field in the "Path" field available on the "Fields" tab of the "Json Input" step.
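For example, for the document above, the "Fields" tab could map output fields to JSONPath expressions such as (field names are illustrative):
Name          Path
postal_code   $.postal_code
build_id      $.build_id
metro         $.metro
latitude      $.location.lat
longitude     $.location.lon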
Try using the input json file as follows:
{
"data" : [
{"key":"val","key":"val"},
{"key":"val","key":"val"},
{"key":"val","key":"val"},
{"id":"666","name":"jnit"}
]
}

Possible to use angular-datatables with serverside array sourced data instead of object sourced data

I'm trying to use angular-datatables with server-side processing. However, it seems that angular-datatables expects the data from the server to be in object format (object vs array data described) with column names preceding each table datapoint. I'd like to configure angular-datatables to accept array-based data, since I can't modify my server-side output, which only outputs data in array format.
I'm configuring DataTables in my JavaScript like so:
var vm = this;
vm.dtOptions = DTOptionsBuilder.newOptions()
.withOption('ajax', {
url: 'output/ss_results/' + $routeParams.uuid,
type: 'GET'
})
.withDataProp('data')
.withOption('processing', true)
.withOption('serverSide', true);
My data from the server looks like this in array format:
var data = [
[
"Tiger Nixon",
"System Architect",
"$3,120"
],
[
"Garrett Winters",
"Director",
"$5,300"
]
]
But as far as I can tell, angular-datatables is expecting the data in object format like so:
[
{
"name": "Tiger Nixon",
"position": "System Architect",
"extn": "5421"
},
{
"name": "Garrett Winters",
"position": "Director",
"extn": "8422"
}
]
I tried not defining dtColumns or setting it to an empty array like vm.dtColumns = []; but I get an error message when I do that. When I configure dtColumns with a promise to load the column data via ajax I get datatables error #4 because it can't find the column name preceding my table datapoints in the data retrieved from the server.
Is it possible to configure angular-datatables to accept array based data? I can't find anything on the angular-datatables website that indicates it can be configured this way.
Edit: I removed the .withDataProp('data'), which I think was causing the problem. The table works a little better now, but it's still broken. After it loads, I get the message "No matching records found", even though right below it it says "Showing 1 to 10 of 60,349 entries" and the pager reads "Previous 1 … 456 … 6035 Next".
Does anyone know why this might be?
If you want to use an array of arrays instead of an array of objects, simply refer to the array indexes instead of the object names:
$scope.dtColumns = [
DTColumnBuilder.newColumn(0).withTitle('Name'),
DTColumnBuilder.newColumn(1).withTitle('Position'),
DTColumnBuilder.newColumn(2).withTitle('Office'),
DTColumnBuilder.newColumn(3).withTitle('Start date'),
DTColumnBuilder.newColumn(4).withTitle('Salary')
]
A demo using the famous "Tiger Nixon" array loaded via AJAX: http://plnkr.co/edit/16UoRqF5hvg2YpvAP8J3?p=preview
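Note that with serverSide enabled, DataTables also expects the response body to be wrapped in the standard server-side envelope (draw, recordsTotal, recordsFiltered, data); the rows inside data can still be plain arrays, for example:
{
  "draw": 1,
  "recordsTotal": 60349,
  "recordsFiltered": 60349,
  "data": [
    ["Tiger Nixon", "System Architect", "$3,120"],
    ["Garrett Winters", "Director", "$5,300"]
  ]
}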