Multi level JSON from CSV using Pig and JSONStorage - apache-pig

I have a CSV file of the below format
customerid,period,credit,debit
100,jan-2017,500,300
100,jan-2017,300,0
100,feb-2017,200,100
100,mar-2017,200,10
200,jan-2017,100,200
200,feb-2017,100,200
My requirement is to group first by customer id and then by period, consolidate the transactions, and produce a hierarchical JSON like the one below using Apache Pig scripts:
[
{
"customerid": 100,
"periods": [{
"period": "jan-2017",
"transactions": [{"credit": 500,"debit": 300}, ...]
}, {
"period": "feb-2017",
"transactions": [...]
}, {
"period": "mar-2017",
"transactions": [...]
}]
}, {
"customerid": 200,
"periods": [{
"period": "jan-2017",
"transactions": [...]
}, {
"period": "feb-2017",
"transactions": [...]
}]
}
]
I am fairly new to Pig but managed to write the script below:
Data = LOAD 'data.csv' USING PigStorage(',') AS (
company_id:chararray,
period:chararray,
debit:chararray,
credit:chararray)
CompanyBag = GROUP Data BY (company_id);
final_trsnactionjson = FOREACH CompanyBag {
ByCompanyId = FOREACH Data {
PeriodBag = GROUP Data BY (period);
IdPeriodItemRoot = FOREACH PeriodBag{
ItemRecords = FOREACH Source GENERATE debit as debit, credit as credit
GENERATE group as period, TOTUPLE(ItemRecords) as transactions;
}
}
GENERATE group as customerid, TOTUPLE(PeriodBag) AS periods;
};
But this gives me the error below:
mismatched input '{' expecting GENERATE
I have searched a lot for how to generate nested JSON using Pig, but could not find any good pointers. Where am I going wrong? Thanks in advance for the help.

Please use the JsonLoader/JsonStorage functions available in Pig:
https://pig.apache.org/docs/r0.11.1/func.html#jsonloadstore
You can provide a nested schema in the AS clause.
For simpler handling of arbitrarily nested JSON arrays, use com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad').
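As a cross-check of the target structure, the same two-level grouping (customer, then period, then the list of transactions) can be sketched outside Pig in plain Python. This is only an illustration of the desired output shape, not a Pig solution:

```python
# Two-level grouping of the sample CSV into the nested JSON the question asks for.
import csv
import io
import json
from collections import OrderedDict

csv_text = """customerid,period,credit,debit
100,jan-2017,500,300
100,jan-2017,300,0
100,feb-2017,200,100
100,mar-2017,200,10
200,jan-2017,100,200
200,feb-2017,100,200
"""

# customerid -> period -> list of {credit, debit} transactions
customers = OrderedDict()
for row in csv.DictReader(io.StringIO(csv_text)):
    periods = customers.setdefault(row["customerid"], OrderedDict())
    periods.setdefault(row["period"], []).append(
        {"credit": int(row["credit"]), "debit": int(row["debit"])}
    )

result = [
    {
        "customerid": int(cid),
        "periods": [
            {"period": p, "transactions": txns} for p, txns in periods.items()
        ],
    }
    for cid, periods in customers.items()
]
print(json.dumps(result, indent=2))
```

In Pig the equivalent is two GROUP BY passes (by customer id, then by period within each group) before handing the relation to a JSON storer.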

Related

How to use PSQL to extract data from an object (inside an array inside an object inside an array)

This data currently sits in a single cell (the warehouse_data column of the warehouse table) in our database. I'm unable to change the structure/DB design, so I need to work with it as-is. How would I select the name of the shirt with the largest width? In this case I would expect the output to be tshirt_b (without quotation marks):
{
"wardrobe": {
"apparel": {
"variety": [
{
"data": {
"shirt": {
"size": {
"width": 30
}
}
},
"names": [
{
"name": "tshirt_a"
}
]
},
{
"data": {
"shirt": {
"size": {
"width": 40
}
}
},
"names": [
{
"name": "tshirt_b"
}
]
}
]
}
}
}
I've tried a select statement and have been able to extract
"names": [
{
"name": "tshirt_b"
}
]
but not too much further than that e.g.:
select jsonb_array_elements(warehouse_data#>'{wardrobe,apparel,variety}')->>'names'
from 'warehouse'
where id = 1;
In this table we'd have 2 columns: one with the data and one with a unique identifier. I imagine I'd need to select size->>'width', order it descending, and limit to 1 (or use the max() function), and then somehow get back the entire object with data & shirt.
I'm really stuck so any help would be appreciated, thank you!
You'll first want to normalise the data into a relational structure:
SELECT
(obj #>> '{data,shirt,size,width}')::int AS width,
(obj #>> '{names,0,name}') AS name
FROM warehouse, jsonb_array_elements(warehouse_data#>'{wardrobe,apparel,variety}') obj
WHERE id = 1;
Then you can do your processing on that as a subquery, e.g.
SELECT name
FROM (
SELECT
(obj #>> '{data,shirt,size,width}')::int AS width,
(obj #>> '{names,0,name}') AS name
FROM warehouse, jsonb_array_elements(warehouse_data#>'{wardrobe,apparel,variety}') obj
WHERE id = 1
) shirts
ORDER BY width DESC
LIMIT 1;
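The same normalise-then-sort logic as the SQL above can be checked in plain Python against the sample document (the field paths are taken from the question):

```python
# Flatten each "variety" element to (width, name), then take the widest.
doc = {
    "wardrobe": {"apparel": {"variety": [
        {"data": {"shirt": {"size": {"width": 30}}},
         "names": [{"name": "tshirt_a"}]},
        {"data": {"shirt": {"size": {"width": 40}}},
         "names": [{"name": "tshirt_b"}]},
    ]}}
}

rows = [
    (obj["data"]["shirt"]["size"]["width"], obj["names"][0]["name"])
    for obj in doc["wardrobe"]["apparel"]["variety"]
]
widest = max(rows, key=lambda r: r[0])[1]
print(widest)  # tshirt_b
```

This mirrors the SQL exactly: `jsonb_array_elements` plays the role of the list comprehension, and `ORDER BY width DESC LIMIT 1` plays the role of `max`.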

How to Extract the Fields in Bigquery in Nested JSON

I have the following BigQuery query:
select JSON_EXTRACT_SCALAR(payload, "$.payload") from mytable
It returns this result :
[
{
"productInfo": {
"productId": "123",
"productType": "Dolls"
},
"storefrontPricingList": [
{
"currentPrice": {
"unitValue": {
"currencyAmount": 10,
"currencyUnit": "USD"
},
"currentValue": {
"currencyAmount": 10,
"currencyUnit": "USD"
},
"variableUnitValue": {
"currencyAmount": 10,
"currencyUnit": "USD"
},
"sellValue": {
"currencyAmount": 10,
"currencyUnit": "USD"
},
"type": "EA"
},
"currentPriceType": "OKAY"
}
]
}
]
Now I want to access these attributes: productInfo.productId and currentPrice.unitValue.currencyAmount.
How can I access these elements? I tried a couple of things, but all of them give me null, like:
select JSON_EXTRACT_SCALAR(payload, "$.payload[0].productInfo.productId") from mytable
select JSON_EXTRACT_SCALAR(payload, "$.payload[0].storefrontPricingList[0]. currentPrice. unitValue. currencyAmount") from mytable
Can you try this?
-- Declaring a bigQuery variable
DECLARE json_data JSON DEFAULT (SELECT PARSE_JSON('{"payload": [{"productInfo": {"productId": "123","productType": "Dolls" },"storefrontPricingList": [{"currentPrice": {"unitValue": {"currencyAmount": 10,"currencyUnit": "USD"},"currentValue": {"currencyAmount": 10,"currencyUnit": "USD"},"variableUnitValue": {"currencyAmount": 10,"currencyUnit": "USD"},"sellValue": {"currencyAmount": 10,"currencyUnit": "USD"},"type": "EA"},"currentPriceType": "OKAY"}]}]}'));
-- Select statement for extraction
select ARRAY(
SELECT JSON_EXTRACT_SCALAR(payload, '$.productInfo.productId') from UNNEST(JSON_EXTRACT_ARRAY(json_data,"$.payload"))payload
)extracted_productID,
ARRAY(
SELECT JSON_EXTRACT_SCALAR(payload, '$.storefrontPricingList[0].currentPrice.unitValue.currencyAmount') from UNNEST(JSON_EXTRACT_ARRAY(json_data,"$.payload"))payload
)extracted_currencyAmount
This uses a combination of the ARRAY function and the JSON functions in BigQuery.
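As a sanity check of the two paths being extracted, here is the same traversal in plain Python over an abbreviated copy of the sample payload (only illustrative; the array indexing matches the JSONPath in the query above):

```python
# payload is the array sitting under "$.payload" in the sample document.
payload = [
    {
        "productInfo": {"productId": "123", "productType": "Dolls"},
        "storefrontPricingList": [
            {
                "currentPrice": {
                    "unitValue": {"currencyAmount": 10, "currencyUnit": "USD"},
                    "type": "EA",
                },
                "currentPriceType": "OKAY",
            }
        ],
    }
]

# One productId and one currencyAmount per array element, as in the ARRAY() subqueries.
product_ids = [item["productInfo"]["productId"] for item in payload]
amounts = [
    item["storefrontPricingList"][0]["currentPrice"]["unitValue"]["currencyAmount"]
    for item in payload
]
print(product_ids, amounts)  # ['123'] [10]
```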

Extract info from JSON string

I am querying CloudTrail logs from S3 bucket via Amazon Athena.
My goal is to extract start/stop time for given instance id.
I am using query as:
SELECT
eventName,
eventTime,
if(responseelements like '%i-0000000000000%' ,'i-0000000000000') as instanceId
FROM cloudtrail_logs_pp
WHERE (responseelements like '%i-0000000000000%' )
AND (eventName = 'StopInstances' OR eventName = 'StartInstances')
AND "timestamp" BETWEEN '2022/01/01'
AND '2022/08/01'
ORDER BY eventTime
The issue I am facing is duplicate entries generated via API calls.
The structure of json file is:
{
"eventVersion": "1.08",
"eventTime": "2022-06-22T05:15:33Z",
"eventName": "StartInstances",
"requestParameters": {
"instancesSet": {
"items": [{
"instanceId": "i-00000"
}]
}
},
"responseElements": {
"requestId": "e95d270a",
"instancesSet": {
"items": [{
"instanceId": "i-00000",
"currentState": {
"code": 0,
"name": "pending"
},
"previousState": {
"code": 80,
"name": "stopped"
}
}]
}
},
"sessionCredentialFromConsole": "true"
}
However, there are a few entries where the current and previous state are the same.
How can I enhance my query to remove those entries?
Also, there are cases where multiple instances were stopped/started, so I can't use a fixed array index in the query.
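The de-duplication rule being asked for can be sketched outside Athena: keep an event only if at least one instance in its `responseElements.instancesSet.items` actually changed state. This Python sketch uses the field names from the sample log (in Athena the same test would need `json_extract` over the `responseelements` string):

```python
# Two sample events: a real start, and a spurious duplicate with no state change.
events = [
    {"eventName": "StartInstances",
     "responseElements": {"instancesSet": {"items": [
         {"instanceId": "i-00000",
          "currentState": {"name": "pending"},
          "previousState": {"name": "stopped"}}]}}},
    {"eventName": "StartInstances",
     "responseElements": {"instancesSet": {"items": [
         {"instanceId": "i-00000",
          "currentState": {"name": "running"},
          "previousState": {"name": "running"}}]}}},
]

def state_changed(event):
    """True if any instance in the event moved to a different state."""
    items = event["responseElements"]["instancesSet"]["items"]
    return any(i["currentState"]["name"] != i["previousState"]["name"]
               for i in items)

real_events = [e for e in events if state_changed(e)]
```

Using `any` over the items list also covers the multi-instance case the question mentions, without relying on a fixed index.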

Athena query JSON Array without struct

In Athena how can I structure a select statement to query the below by timestamp? The data is stored as a string
[{
"data": [{
"ct": "26.7"
}, {
"ct": "24.9",
}, {
"ct": "26.8",
}],
"timestamp": "1658102460"
}, {
"data": [{
"ct": "26.7",
}, {
"ct": "25.0",
}],
"timestamp": "1658102520"
}]
I tried the below but it just came back empty.
SELECT json_extract_scalar(insights, '$.timestamp') as ts
FROM history
What I am trying to get to is returning only the data where a timestamp is between X & Y.
When I try doing this as a struct and a cross join with unnest it's very very slow so I am trying to find another way.
json_extract_scalar will not help here because it returns only one value. Trino has vastly improved its JSON path support, but Athena runs a much older version of the Presto engine, which does not support it. So you need to cast to an array and use unnest (trailing commas removed from the JSON):
-- sample data
WITH dataset (json_str) AS (
values ('[{
"data": [{
"ct": "26.7"
}, {
"ct": "24.9"
}, {
"ct": "26.8"
}],
"timestamp": "1658102460"
}, {
"data": [{
"ct": "26.7"
}, {
"ct": "25.0"
}],
"timestamp": "1658102520"
}]')
)
-- query
select mp['timestamp'] timestamp,
mp['data'] data
from dataset,
unnest(cast(json_parse(json_str) as array(map(varchar, json)))) as t(mp)
Output:

timestamp  | data
-----------|--------------------------------------------
1658102460 | [{"ct":"26.7"},{"ct":"24.9"},{"ct":"26.8"}]
1658102520 | [{"ct":"26.7"},{"ct":"25.0"}]
After that you can apply filtering and process data.
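The filtering step that follows the unnest can be illustrated in plain Python on the same sample: parse the string, then keep only the elements whose timestamp falls in the requested range (the bounds here are made up for the example):

```python
import json

json_str = ('[{"data": [{"ct": "26.7"}, {"ct": "24.9"}, {"ct": "26.8"}],'
            ' "timestamp": "1658102460"},'
            ' {"data": [{"ct": "26.7"}, {"ct": "25.0"}],'
            ' "timestamp": "1658102520"}]')
rows = json.loads(json_str)  # plays the role of cast + unnest

# Keep rows whose timestamp is between X and Y (example bounds).
lo, hi = 1658102400, 1658102500
selected = [r for r in rows if lo <= int(r["timestamp"]) <= hi]
print([r["timestamp"] for r in selected])  # ['1658102460']
```

In the Athena query this corresponds to a `WHERE cast(mp['timestamp'] as varchar) ...` (or a numeric cast) predicate applied after the unnest.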

How to create JSON Array Inside JSON object using FOR JSON SQL Server 2016

How to create JSON Array Inside JSON object using FOR JSON SQL Server 2016 (TABLE to JSON)
Here is my query:
SELECT
m.MeetingId AS tblMeeting_MeetingId,
m.Attended AS tblMeeting_Attended,
m3.CompanyId AS tblMeetingAttendants_CompanyId,
m3.MeetingAttendantsId AS tblMeetingAttendants_AttendantNameWithTitle,
m4.UserId AS tblMeetingAttendees_UserId,
m5.BrokerId AS tblMeetingBroker_BrokerId
FROM Bv.tblMeeting m
LEFT JOIN Bv.tblMeetingAttendants m3 ON m.MeetingId = m3.MeetingId
LEFT JOIN Bv.tblMeetingAttendees m4 ON m.MeetingId = m4.MeetingId
LEFT JOIN Bv.tblMeetingBroker m5 ON m.MeetingId = m5.MeetingId
WHERE m.MeetingId = 739
FOR JSON AUTO, INCLUDE_NULL_VALUES
The above query gives me a result like this:
[
{
"tblMeeting_MeetingId": 739,
"tblMeeting_Attended": false,
"tblMeeting_MeetingSubject": " Benchmark China Internet Analyst",
"m3": [
{
"tblMeetingAttendants_CompanyId": 83,
"tblMeetingAttendants_AttendantNameWithTitle": 499,
"m4": [
{
"tblMeetingAttendees_UserId": null,
"m5": [
{
"tblMeetingBroker_BrokerId": 275
}
]
}
]
},
{
"tblMeetingAttendants_CompanyId": 83,
"tblMeetingAttendants_AttendantNameWithTitle": 500,
"m4": [
{
"tblMeetingAttendees_UserId": null,
"m5": [
{
"tblMeetingBroker_BrokerId": 275
}
]
}
]
},
{
"tblMeetingAttendants_CompanyId": 83,
"tblMeetingAttendants_AttendantNameWithTitle": 501,
"m4": [
{
"tblMeetingAttendees_UserId": null,
"m5": [
{
"tblMeetingBroker_BrokerId": 275
}
]
}
]
}
]
}
]
But I want the result like this:
[
{
"tblMeeting_MeetingId": 739,
"tblMeeting_Attended": false,
"tblMeeting_MeetingSubject": " Benchmark China Internet Analyst",
"tblMeetingAttendants_AttendantNameWithTitle": [499,500,501],
"tblMeetingAttendees_UserId": null,
"tblMeetingBroker_BrokerId": 275
}
]
Please reply as soon as possible
Thanks in advance.
It seems like this is impossible without using string concatenation and writing your own functions. There is no magic JSON_ARRAY_AGGREGATE() function. I have been looking for one myself.
Here is a related question: SQL Server 2016 for JSON output integer array
You can use JSON_QUERY with FOR JSON PATH to format your data into a JSON array. I sampled just your MeetingID and AttendantID columns to demonstrate the concept. (Note that the CONCAT wraps each id in quotes, so this produces an array of strings; aggregating the bare ids instead yields an integer array like the one in your desired output.)
Build JSON Array using JSON_QUERY
DROP TABLE IF EXISTS #MeetingAttendance
CREATE TABLE #MeetingAttendance (MeetingID INT,AttendantID INT)
INSERT INTO #MeetingAttendance
VALUES (739,499)
,(739,500)
,(739,501)
SELECT tblMeeting_MeetingId = MeetingID
,tblMeetingAttendants_AttendantNameWithTitle = JSON_QUERY('['+STRING_AGG(CONCAT('"',AttendantID,'"'),',') + ']')
FROM #MeetingAttendance
GROUP BY MeetingID
FOR JSON PATH,WITHOUT_ARRAY_WRAPPER
Results
{
"tblMeeting_MeetingId": 739,
"tblMeetingAttendants_AttendantNameWithTitle": [
"499",
"500",
"501"
]
}