Cannot COPY JSON from DynamoDB Streams to Redshift (via Amazon S3)

Following is the use case I am working on:
I have enabled Streams (with New and Old Image) when creating my DynamoDB table, and I have created a Kinesis Firehose delivery stream with Redshift as the destination (with an intermediate S3 bucket).
From DynamoDB my stream reaches Firehose and from there lands in the bucket as gzipped JSON, shown below. My problem is that I cannot COPY this JSON into Redshift.
Things I am not able to figure out:
What should the CREATE TABLE statement in Redshift be?
What should the COPY syntax in Kinesis Firehose be?
How should I use JSONPaths here? Kinesis Data Firehose is set to deliver only JSON to my S3 bucket.
How do I reference the manifest in the COPY command?
The JSON loaded to S3 is shown below:
{
"Keys": {
"vehicle_id": {
"S": "x011"
}
},
"NewImage": {
"heart_beat": {
"N": "0"
},
"cdc_id": {
"N": "456"
},
"latitude": {
"N": "1.30951"
},
"not_deployed_counter": {
"N": "1"
},
"reg_ind": {
"N": "0"
},
"operator": {
"S": "x"
},
"d_dttm": {
"S": "11/08/2018 2:43:46 PM"
},
"z_id": {
"N": "1267"
},
"last_end_trip_dttm": {
"S": "11/08/2018 1:43:46 PM"
},
"land_ind": {
"N": "1"
},
"s_ind": {
"N": "1"
},
"status_change_dttm": {
"S": "11/08/2018 2:43:46 PM"
},
"case_ind": {
"N": "1"
},
"last_po_change_dttm": {
"S": "11/08/2018 2:43:46 PM"
},
"violated_duration": {
"N": "20"
},
"vehicle_id": {
"S": "x011"
},
"longitude": {
"N": "103.7818"
},
"file_status": {
"S": "Trip_Start"
},
"unhired_duration": {
"N": "10"
},
"eo_lat": {
"N": "1.2345"
},
"reply_eo_ind": {
"N": "1"
},
"license_ind": {
"N": "0"
},
"indiscriminately_parked_ind": {
"N": "0"
},
"eo_lng": {
"N": "102.8978"
},
"officer_id": {
"S": "xxxx#gmail.com"
},
"case_status": {
"N": "0"
},
"color_status_cd": {
"N": "0"
},
"parking_id": {
"N": "2345"
},
"ttr_dttm": {
"S": "11/08/2018 2:43:46 PM"
},
"deployed_ind": {
"N": "1"
},
"status": {
"S": "PI"
}
},
"SequenceNumber": "1200000000000956615967",
"SizeBytes": 570,
"ApproximateCreationDateTime": 1535513040,
"eventName": "INSERT"
}
My CREATE TABLE statement:
create table vehicle_status(
heart_beat integer,
cdc_id integer,
latitude integer,
not_deployed_counter integer,
reg_ind integer,
operator varchar(10),
d_dttm varchar(30),
z_id integer,
last_end_trip_dttm varchar(30),
land_ind integer,
s_ind integer,
status_change_dttm varchar(30),
case_ind integer,
last_po_change_dttm varchar(30),
violated_duration integer,
vehicle_id varchar(8),
longitude integer,
file_status varchar(30),
unhired_duration integer,
eo_lat integer,
reply_eo_ind integer,
license_ind integer,
indiscriminately_parked_ind integer,
eo_lng integer,
officer_id varchar(50),
case_status integer,
color_status_cd integer,
parking_id integer,
ttr_dttm varchar(30),
deployed_ind varchar(3),
status varchar(8));
And my COPY statement (manually trying to resolve this from Redshift):
COPY vehicle_status (heart_beat, cdc_id, latitude, not_deployed_counter, reg_ind, operator, d_dttm, z_id, last_end_trip_dttm, land_ind, s_ind, status_change_dttm, case_ind, last_po_change_dttm, violated_duration, vehicle_id, longitude, file_status, unhired_duration, eo_lat, reply_eo_ind, license_ind, indiscriminately_parked_ind, eo_lng, officer_id, case_status, color_status_cd, parking_id, ttr_dttm, deployed_ind, status)
FROM 's3://<my-bucket>/2018/08/29/05/vehicle_status_change-2-2018-08-29-05-24-42-092c330b-e14a-4133-bf4a-5982f2e1f49e.gz' CREDENTIALS 'aws_iam_role=arn:aws:iam::<accountnum>:role/<RedshiftRole>' GZIP json 'auto';
When I try the above procedure, the records do get inserted, but all the columns and rows are NULL.
How can I COPY this JSON format into Redshift? I have been stuck here for the last 3 days; any help on this would do.
S3 Bucket:
Amazon S3/<My-bucket>/2018/08/29/05
Amazon S3/<My-bucket>/manifests/2018/08/29/05

I'm not very familiar with Amazon, but let me try to answer most of your questions so that you can move on. Others are most welcome to edit this answer with additional details. Thank you!
What should the CREATE TABLE statement in Redshift be?
Your create statement create table vehicle_status(...) is fine, though you could add a distribution key, sort key, and encodings based on your requirements; refer to more here and here.
As per the AWS Kinesis documentation, your table must already exist in Redshift, so you can connect to Redshift using the psql command and run the CREATE statement manually.
What should the COPY syntax in Kinesis Firehose be?
The COPY syntax stays the same whether you run it via psql or Firehose. Luckily the copy script you came up with runs without error; I tried it in my instance with a small modification (supplying an access key and secret key directly instead of an IAM role) and it works fine. Below is the SQL I ran, which copied one data record into the table vehicle_status.
Your JSON path structure is nested, so json 'auto' will not work. Here is the working command; I have created a sample JSONPaths file for you with 4 example fields, and you can follow the same structure to create a JSONPaths file covering all the data points.
COPY vehicle_status (heart_beat, cdc_id, operator, status) FROM 's3://XXX/development/test_file.json' CREDENTIALS 'aws_access_key_id=XXXXXXXXXXXXXXXXX;aws_secret_access_key=MYXXXXXXXXXXXXXXXXXXXXXX' json 's3://XXX/development/yourjsonpathfile';
And your JSONPaths file should have content similar to the below.
{
"jsonpaths": [
"$['NewImage']['heart_beat']['N']",
"$['NewImage']['cdc_id']['N']",
"$['NewImage']['operator']['S']",
"$['NewImage']['status']['S']"
]
}
I have tested it and it works.
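Hand-writing all 31 JSONPaths entries is tedious. Since every attribute sits under NewImage.<column>.<type tag>, a short script can generate the full file; the entries map positionally to the COPY column list. The column subset and type tags below are taken from the question's payload, but double-check each tag ("N" or "S") against your own stream records.

```python
import json

# Columns in the same order as the COPY column list; the DynamoDB type
# tag ("N" or "S") must match the stream payload for each attribute.
columns = [
    ("heart_beat", "N"), ("cdc_id", "N"), ("latitude", "N"),
    ("operator", "S"), ("d_dttm", "S"), ("vehicle_id", "S"),
    ("status", "S"),  # ...extend with the remaining attributes
]

jsonpaths = {
    "jsonpaths": [f"$['NewImage']['{name}']['{t}']" for name, t in columns]
}

# This output is what you upload to S3 as the jsonpaths file.
print(json.dumps(jsonpaths, indent=2))
```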
How should I use JSONPaths here? Kinesis Data Firehose is set to deliver only JSON to my S3 bucket.
I used your example JSON data and it works, so I see no issue here.
How do I reference the manifest in the COPY command?
This is a good question; let me try to explain. I assume you are referring to a manifest here.
The COPY command above works fine for one file or a couple of files, but if you have lots of files, that is where the concept of a manifest comes in.
Straight from the Amazon docs: "Instead of supplying an object path for the COPY command, you supply the name of a JSON-formatted text file that explicitly lists the files to be loaded."
In short, if you want to load multiple files in a single shot (which is also the way Redshift prefers), you can create a simple JSON manifest and supply it in the COPY command.
{
"entries": [
{"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},....
]
}
Upload the manifest to S3 and use it in your COPY command like below.
COPY vehicle_status (heart_beat, cdc_id, latitude, not_deployed_counter, reg_ind, operator, d_dttm, z_id, last_end_trip_dttm, land_ind, s_ind, status_change_dttm, case_ind, last_po_change_dttm, violated_duration, vehicle_id, longitude, file_status, unhired_duration, eo_lat, reply_eo_ind, license_ind, indiscriminately_parked_ind, eo_lng, officer_id, case_status, color_status_cd, parking_id, ttr_dttm, deployed_ind, status) FROM 's3://XXX/development/test.manifest' CREDENTIALS 'aws_access_key_id=XXXXXXXXXXXXXXXXX;aws_secret_access_key=MYXXXXXXXXXXXXXXXXXXXXXX' json 's3://yourbucket/jsonpath' manifest;
Here is a detailed reference for manifests.
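If you end up assembling the manifest programmatically, it is just a small JSON document. The bucket name and object keys below are placeholders, not values from the question:

```python
import json

bucket = "my-bucket"                      # placeholder bucket name
keys = [                                  # object keys to load in one COPY
    "2018/08/29/05/part-1.gz",
    "2018/08/29/05/part-2.gz",
]

manifest = {
    "entries": [
        {"url": f"s3://{bucket}/{key}", "mandatory": True} for key in keys
    ]
}

# Upload this JSON to S3 (e.g. with boto3) and reference it in the COPY:
#   COPY ... FROM 's3://my-bucket/manifests/load.manifest' ... MANIFEST;
manifest_json = json.dumps(manifest, indent=2)
```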
I hope this gives you some ideas on how to move on. If there is a specific error you see, I would be happy to refocus the answer.

Related

AWS MQTT SQL query

I have the following MQTT message:
{
"sensors": [
{
"lsid": 412618,
"data": [
{
"temp_in": 72.3,
"heat_index_in": 72,
"dew_point_in": 55.9,
"ts": 1652785241,
"hum_in": 56.3
}
],
"sensor_type": 243,
"data_structure_type": 12
},
{
"lsid": 421195,
}
I can get the "sensors,0.lsid" value and the entire "data" array using this query:
select get(sensors,0).lsid as ls, get(sensors, 0).data as data1 from "topic"
but what I really need is to get "temp_in": 72.3, i.e. the values from the second-level array.
I've tried using this AWS doc, but unless I'm not following it correctly, it doesn't seem to work.
Any help would be greatly appreciated.
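One pattern worth trying (an untested sketch, assuming AWS IoT SQL's get() can be applied again to the nested array, building on the query that already works above):

```sql
-- sketch: first sensor object, then first element of its "data" array
SELECT get(get(sensors, 0).data, 0).temp_in AS temp_in FROM 'topic'
```

If nested get() is not accepted, binding get(sensors, 0).data to an alias and indexing it in a second step may be needed; check the AWS IoT SQL reference for the supported forms.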

How to convert JSON to CSV and send it to BigQuery or a Google Cloud bucket

I'm new to NiFi and I want to convert a big amount of JSON data to CSV format.
This is what I am doing at the moment, but it is not giving the expected result.
These are the steps:
Processes to create the access_token and send the request body using InvokeHTTP (this part works fine, so I won't name those processors), getting the response body as JSON.
Example of json response:
[
{
"results":[
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdaasdasdad",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdasda"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdasdasdasd",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdas"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
....etc....
]
}
]
Now I am using:
===>SplitJson ($[].results[])==>JoltTransformJSON with this spec:
[{
"operation": "shift",
"spec": {
"customer": {
"id": "customer_id"
},
"campaign": {
"id": "campaign_id",
"name": "campaign_name"
},
"adGroup": {
"id": "ad_group_id",
"name": "ad_group_name"
},
"metrics": {
"clicks": "clicks",
"costMicros": "cost",
"impressions": "impressions"
},
"segments": {
"device": "device",
"date": "date"
},
"incomeRangeView": {
"resourceName": "keywords_id"
}
}
}]
==>> MergeContent (here is the problem, which I don't know how to fix):
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get a JSON file that has only parts of the JSON data.
Example: I want 50k customer_ids in 1 JSON file, so that I can send this data into a BigQuery table and have all the ids under the same field "customer_id".
MergeContent combines the split JSON files, but I still get 10k customer_ids per file, i.e. I end up with 5 files and not 1 file with 50k customer_ids.
After the MergeContent I use ==>>ConvertRecord with these settings:
Record Reader JsonTreeReader (Schema Access Strategy: InferSchema)
Record Writer CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set UTF-8
)
==>>UpdateAttribute (custom prop: filename: $(unknown).csv) ==>> PutGCSObject (putting the data into the Google bucket; this step works fine, I am able to put files there)
With this approach I am UNABLE to send data to BigQuery. (After MergeContent I tried using PutBigQueryBatch, and used this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed. For Load file type I tried NEWLINE_DELIMITED_JSON, or CSV when I had converted the data to CSV. I get no errors, but no data is uploaded into the table.)
What am I doing wrong? I basically want to map the data in such a way that each field's data will be under the same field name.
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This skips a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
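Outside NiFi, the flattening that the Jolt spec performs amounts to the sketch below (field names are taken from the question's spec; this only illustrates the target record shape, not the NiFi processors themselves):

```python
import csv
import io

def flatten(result):
    """Mirror the Jolt shift: pull nested ids/names into one flat record."""
    return {
        "customer_id": result["customer"]["id"],
        "campaign_id": result["campaign"]["id"],
        "campaign_name": result["campaign"]["name"],
        "ad_group_id": result["adGroup"]["id"],
        "ad_group_name": result["adGroup"]["name"],
        "clicks": result["metrics"]["clicks"],
        "cost": result["metrics"]["costMicros"],
        "impressions": result["metrics"]["impressions"],
        "device": result["segments"]["device"],
        "date": result["segments"]["date"],
        "keywords_id": result["incomeRangeView"]["resourceName"],
    }

def to_csv(results):
    """Render all flattened records as one CSV with a single header row."""
    rows = [flatten(r) for r in results]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

This is exactly what a JSON Reader + Jolt + CSV Writer pipeline produces record-by-record, which is why the split/merge dance is unnecessary.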

How to load a jsonl file into BigQuery when the file has mixed data fields as columns

During my work flow, after extracting the data from API, the JSON has the following structure:
[
{
"fields":
[
{
"meta": {
"app_type": "ios"
},
"name": "app_id",
"value": 100
},
{
"meta": {},
"name": "country",
"value": "AE"
},
{
"meta": {
"name": "Top"
},
"name": "position",
"value": 1
}
],
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
}
]
Then it is stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:
app_type | app_id | country | position | click | price | count
ios      | 100    | AE      | Top      | 1     | 1     | 1
Is there a way to define a manual schema in BigQuery to achieve this result? Or do I have to preprocess the jsonl file before putting it into BigQuery?
One of the limitations of loading JSON data from GCS into BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you process the conversion of the JSON files to jsonl files and store them on GCS, you will have to do some preprocessing.
You probably have two options:
pre-create the target table with an app_id field as an INTEGER, or
preprocess the JSON file and enclose 100 in quotes, like "100"
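A sketch of that preprocessing (key names are taken from the question's payload; the meta-handling rules are assumptions inferred from the desired output, where position becomes "Top" and app_type becomes its own column):

```python
import json

def to_flat_row(record):
    """Flatten one API record into the flat shape shown above."""
    row = {}
    for field in record["fields"]:
        meta = field.get("meta", {})
        # assumption: a meta "name" overrides the raw value (position -> "Top");
        # other meta keys become their own columns (app_type -> "ios")
        row[field["name"]] = str(meta.get("name", field["value"]))
        for k, v in meta.items():
            if k != "name":
                row[k] = str(v)
    # metrics are already flat; stringify so schema inference stays uniform
    row.update({k: str(v) for k, v in record["metrics"].items()})
    return row

def to_jsonl(records):
    """One JSON object per line, as BigQuery's NEWLINE_DELIMITED_JSON expects."""
    return "\n".join(json.dumps(to_flat_row(r)) for r in records)
```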

How to extract a field from a JSON object with QueryRecord

I have been struggling with this problem for a long time. I need to create a new JSON flowfile using QueryRecord, taking the array (field ref) from the input JSON field refs and skipping the object field, as shown in the example below:
Input JSON flowfile
{
"name": "name1",
"desc": "full1",
"refs": {
"ref": [
{
"source": "source1",
"url": "url1"
},
{
"source": "source2",
"url": "url2"
}
]
}
}
QueryRecord configuration
JSONTreeReader set up to Infer Schema, and a JSONRecordSetWriter
select name, description, (array[rpath(refs, '//ref[*]')]) as sources from flowfile
Output JSON (needed):
{
"name": "name1",
"desc": "full1",
"references": [
{
"source": "source1",
"url": "url1"
},
{
"source": "source2",
"url": "url2"
}
]
}
But I got this error:
QueryRecord Failed to write MapRecord[{references=[Ljava.lang.Object;#27fd935f, description=full1, name=name1}] with schema ["name" : "STRING", "description" : "STRING", "references" : "ARRAY[STRING]"] as a JSON Object due to java.lang.ClassCastException: null
Try the following approach; in your case it should work:
1) Read your JSON field fully (I imitated it with a GenerateFlowFile processor using your example).
2) Add an EvaluateJsonPath processor which will put the 2 header fields (name, desc) into attributes.
3) Add a SplitJson processor which will split your JSON by the refs/ref groups (split by "$.refs.ref").
4) Add a ReplaceText processor which will add your header fields (name, desc) to the split lines (replace the "[{]" value with "{"name":"${json.name}","desc":"${json.desc}",").
5) It's done.
Full process in my demo case:
Hope this helps.
Solution: use JoltTransformJSON to transform the JSON with a Jolt specification. More about this specification here.
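For reference, an untested sketch of such a Jolt shift spec that should produce the desired output (field names taken from the example above; verify against the Jolt documentation):

```json
[
  {
    "operation": "shift",
    "spec": {
      "name": "name",
      "desc": "desc",
      "refs": {
        "ref": "references"
      }
    }
  }
]
```

The whole refs.ref array is moved to the top-level references key, and the wrapping refs object is dropped because nothing else in the spec matches it.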

AWS: Functions other than COPY for transferring data from S3 to Redshift with amazon-data-pipeline

I'm trying to transfer data from Amazon S3 to Amazon Redshift with the AWS Data Pipeline tool.
Is it possible to change the data while transferring it, e.g. with an SQL statement, so that just the results of the SQL statement become the input into Redshift?
I only found the copy command, like:
{
"id": "S3Input",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Yes, it is possible. There are two approaches:
Use transformSQL of RedshiftCopyActivity
transformSQL is useful if the transformations are performed within the scope of the records that are getting loaded on a timely basis, e.g. every day or hour. That way changes are only applied to the batch and not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. So transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
Please find an example usage of transformSql below. Notice that the select is from the staging table. It will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift DB.
{
"id": "LoadUsersRedshiftCopyActivity",
"name": "Load Users",
"insertMode": "OVERWRITE_EXISTING",
"transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', cs.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', cs.updated_at_pst) AS updated_at_pst FROM staging u;",
"type": "RedshiftCopyActivity",
"runsOn": {
"ref": "OregonEc2Resource"
},
"schedule": {
"ref": "HourlySchedule"
},
"input": {
"ref": "OregonUsersS3DataNode"
},
"output": {
"ref": "OregonUsersDashboardRedshiftDatabase"
},
"onSuccess": {
"ref": "LoadUsersSuccessSnsAlarm"
},
"onFail": {
"ref": "LoadUsersFailureSnsAlarm"
},
"dependsOn": {
"ref": "BewteenRegionsCopyActivity"
}
}
Use script of SqlActivity
SqlActivity allows operations on the whole dataset and can be scheduled to run after particular events through the dependsOn mechanism:
{
"name": "Add location ID",
"id": "AddCardpoolLocationSqlActivity",
"type": "SqlActivity",
"script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
"database": {
"ref": "DashboardRedshiftDatabase"
},
"schedule": {
"ref": "HourlySchedule"
},
"output": {
"ref": "LocationsDashboardRedshiftDatabase"
},
"runsOn": {
"ref": "OregonEc2Resource"
},
"dependsOn": {
"ref": "LoadLocationsRedshiftCopyActivity"
}
}
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not personally used this, but from the looks of it, you treat your S3 data as being in a temp table and this SQL statement returns transformed data for Redshift to insert.
So you will need to list all fields in the select, whether or not you are transforming that field.
AWS Datapipeline SqlActivity
{
"id" : "MySqlActivity",
"type" : "SqlActivity",
"database" : { "ref": "MyDatabase" },
"script" : "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
"schedule" : { "ref": "Hour" },
"queue" : "priority"
}
So basically, in "script" you can put any SQL script/transformations/commands: Amazon Redshift SQL Commands.
transformSql is fine, but it supports only "The SQL SELECT expression used to transform the input data." Ref: RedshiftCopyActivity.