I would like to update multiple dates in a JSON file.
My input JSON contains many properties, but the extract below is the part that matters. I want to parse the date in metadata (here 2022-07-27), compute the delta between it and today's date (e.g. 2022-08-05, so 9 days here), and add that delta to every date found in "data_1h" / "time".
Edit (forgotten at first): I also need the dates in metadata to end up replaced by today's date.
I could write a small tool in any language, but I would like a Linux script that can be run from a GitLab pipeline. It is about preparing mock data for some tests.
So I started fighting with jq, awk and sed, but I am a bit confused. Maybe an experienced jq guy would find the solution immediately?
{
"metadata":
{
"modelrun_utc": "2022-07-27 00:00",
"modelrun_updatetime_utc": "2022-07-27 07:27"
},
"data_1h":
{
"time": ["2022-07-27 00:00", "2022-07-27 01:00", "2022-08-03 11:00", "2022-08-03 12:00", "2022-08-03 13:00", "2022-08-03 14:00"]
}
}
Any ideas?
The pseudocode would be:
base_date_str=$(jq .metadata.modelrun_utc $1)
echo $base_date_str
base_date=$(date -d $base_date_str)
today=$(date)
delta=$base_date-$today
input-data=$(jq .data_1h.time $1)
foreach (s in $input-data)
# transform s to date d, add delta to d, replace s by d in output string
# replace the date part of modelrun_utc and modelrun_updatetime_utc with today's date, keeping the time
# write output json
What does this look like in real shell commands?
Expected output:
{
"metadata": {
"modelrun_utc": "2022-08-05 00:00",
"modelrun_updatetime_utc": "2022-08-05 07:27"
},
"data_1h": {
"time": [
"2022-08-05 00:00",
"2022-08-05 01:00",
"2022-08-12 11:00",
"2022-08-12 12:00",
"2022-08-12 13:00",
"2022-08-12 14:00"
]
}
}
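Here's one way using jq logic, not shell commands. It shifts only the entries of data_1h.time; the metadata dates are left untouched, as the output shows: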
jq '(
.metadata.modelrun_utc | strptime("%Y-%m-%d %H:%M")
| (now - mktime) / (24 * 60 * 60)
) as $diffdays | .data_1h.time[] |= (
strptime("%Y-%m-%d %H:%M") | .[2] += $diffdays
| mktime | strftime("%Y-%m-%d %H:%M")
)'
{
"metadata": {
"modelrun_utc": "2022-07-27 00:00",
"modelrun_updatetime_utc": "2022-07-27 07:27"
},
"data_1h": {
"time": [
"2022-08-05 00:00",
"2022-08-05 01:00",
"2022-08-12 11:00",
"2022-08-12 12:00",
"2022-08-12 13:00",
"2022-08-12 14:00"
]
}
}
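For comparison, the same date arithmetic can be sketched in Python, this time also rewriting the two metadata fields as the question's edit asks. This is only an illustration under assumptions: the function name shift_dates and the list METADATA_KEYS are made up here, the delta is taken in whole calendar days, and the sample document is the one from the question.

```python
import copy
import json
from datetime import date, datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # timestamp layout used in the sample file
# Assumption: these are the only metadata fields that hold timestamps.
METADATA_KEYS = ("modelrun_utc", "modelrun_updatetime_utc")


def shift_dates(doc, today):
    """Return a copy of doc with the known timestamps moved by whole days."""
    base = datetime.strptime(doc["metadata"]["modelrun_utc"], FMT).date()
    delta = timedelta(days=(today - base).days)

    def shift(stamp):
        # Parse, add the day delta, and format back; the time of day is kept.
        return (datetime.strptime(stamp, FMT) + delta).strftime(FMT)

    out = copy.deepcopy(doc)
    for key in METADATA_KEYS:
        out["metadata"][key] = shift(doc["metadata"][key])
    out["data_1h"]["time"] = [shift(s) for s in doc["data_1h"]["time"]]
    return out


sample = {
    "metadata": {
        "modelrun_utc": "2022-07-27 00:00",
        "modelrun_updatetime_utc": "2022-07-27 07:27",
    },
    "data_1h": {
        "time": ["2022-07-27 00:00", "2022-07-27 01:00", "2022-08-03 11:00",
                 "2022-08-03 12:00", "2022-08-03 13:00", "2022-08-03 14:00"]
    },
}

# A fixed date reproduces the expected output; a pipeline would pass date.today().
shifted = shift_dates(sample, date(2022, 8, 5))
print(json.dumps(shifted, indent=2))
```

Working with whole days avoids fractional offsets, so the time of day of every entry is preserved.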
I have a products table which contains a JSON column product_logs. Inside of this, it contains something similar to:
{
"c8eebc99-d936-3245-bc8d-17694f4ecb58": {
"created_at": "2022-05-08T15:33:33.591166Z",
"event": "product-created",
"user": null
},
"ce7b171b-b479-332f-bf9e-54b948581179": {
"created_at": "2022-05-08T15:33:33.591174Z",
"event": "near-sell-by",
"user": null
}
}
I only want to return rows of products that have a near-sell-by event in the product_logs so I try to do this:
SELECT
products.*
FROM products,
JSON_TABLE(product_logs, '$[*]', COLUMNS (
created_at DATETIME PATH '$.created_at',
event VARCHAR(MAX) PATH '$.event'
) logs
WHERE
logs.event = 'near-sell-by'
However, I seem to be getting the following error:
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '(product_logs, '$[*]', COLUMNS (
created_at DATETIME PATH '$.cr...' at line 4
Any help as to where I'm going wrong would be greatly appreciated.
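You seem to have copied this from another database: there is no VARCHAR(MAX) in MySQL. The syntax is a bit complicated, and you need to understand JSON paths pretty well. In the working query below, the path is '$.*' because product_logs is an object keyed by ID rather than an array, there is no comma between the path and COLUMNS, and the JSON_TABLE call gets its missing closing parenthesis.
A GUI like Workbench can at least help you locate the error, but it will not help you fix it.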
CREATE TABLE products (product_logs varchar(1209))
INSERT INTO products VALUES ('{
"c8eebc99-d936-3245-bc8d-17694f4ecb58": {
"created_at": "2022-05-08T15:33:33.591166Z",
"event": "product-created",
"user": null
},
"ce7b171b-b479-332f-bf9e-54b948581179": {
"created_at": "2022-05-08T15:33:33.591174Z",
"event": "near-sell-by",
"user": null
}
}
')
SELECT
products.*,logs.created_at,logs.event
FROM products,
JSON_TABLE(products.product_logs, '$.*'
COLUMNS (
created_at DATETIME PATH '$.created_at',
event Text PATH '$.event'
)) logs
WHERE
logs.event = 'near-sell-by'
product_logs | created_at | event
:----------- | :------------------ | :-----------
{"c8eebc99-d936-3245-bc8d-17694f4ecb58": {"created_at": "2022-05-08T15:33:33.591166Z", "event": "product-created", "user": null}, "ce7b171b-b479-332f-bf9e-54b948581179": {"created_at": "2022-05-08T15:33:33.591174Z", "event": "near-sell-by", "user": null}} | 2022-05-08 15:33:34 | near-sell-by
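Why '$.*' works where '$[*]' does not may be easier to see outside SQL. The sketch below is a rough Python analogy, not MySQL behaviour reproduced exactly: '$.*' corresponds to walking the member values of an object, while '$[*]' corresponds to walking the elements of an array, which product_logs is not. The helper name has_event is made up for illustration.

```python
import json

product_logs = json.loads("""
{
  "c8eebc99-d936-3245-bc8d-17694f4ecb58": {
    "created_at": "2022-05-08T15:33:33.591166Z",
    "event": "product-created",
    "user": null
  },
  "ce7b171b-b479-332f-bf9e-54b948581179": {
    "created_at": "2022-05-08T15:33:33.591174Z",
    "event": "near-sell-by",
    "user": null
  }
}
""")


def has_event(logs, name):
    """True if any log entry (any member value of the object) has this event."""
    # logs is a dict keyed by ID, so we iterate over .values(), the analogue
    # of the '$.*' path; there are no array elements to iterate over.
    return any(entry.get("event") == name for entry in logs.values())


print(isinstance(product_logs, list))           # the top level is not an array
print(has_event(product_logs, "near-sell-by"))  # the product row would be kept
```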
SELECT
s."firstName",
jsonb_agg(
DISTINCT jsonb_build_object(
'yearId',
y.id,
'classes',
(
SELECT
jsonb_agg(
jsonb_build_object(
'classId',
c.id
)
)
FROM
classes AS c
WHERE
y.id = cy."yearId"
AND c.id = cy."classId"
AND s.id = cys."studentId"
)
)
) AS years
FROM
users AS s
LEFT JOIN "classYearStudents" AS cys ON cys."studentId" = s.id
LEFT JOIN "classYears" AS cy ON cy.id = cys."classYearId"
LEFT JOIN "years" AS y ON y.id = cy."yearId"
GROUP BY
s.id
SQL Output
firstName | years
-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Jarrell | [{"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "classes": [{"classId": "2590b596-e894-4af5-8ac5-68d109eee995"}]}, {"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "classes": [{"classId": "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99"}]}]
Kevon | [{"yearId": "7f5789b5-999e-45e4-aba4-9f45b29a69ef", "classes": [{"classId": "c8cda7d1-7321-443c-b0ad-6d18451613b5"}]}, {"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "classes": [{"classId": "2590b596-e894-4af5-8ac5-68d109eee995"}]}, {"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "classes": [{"classId": "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99"}]}]
Antone | [{"yearId": "7f5789b5-999e-45e4-aba4-9f45b29a69ef", "classes": [{"classId": "c8cda7d1-7321-443c-b0ad-6d18451613b5"}]}, {"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "classes": [{"classId": "2590b596-e894-4af5-8ac5-68d109eee995"}]}, {"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "classes": [{"classId": "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99"}]}]
(3 rows)
The problem
What I wanted was for the years with the same ID to be merged together, with multiple classes per year ID. As you can see, bd5b69ac-6638-4d3e-8a52-94c24ed9a039 on the first row (Jarrell) has two entries in the years column array, each with a single class.
Current JSON output
[
{
"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039",
"classes": [{ "classId": "2590b596-e894-4af5-8ac5-68d109eee995" }]
},
{
"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039",
"classes": [{ "classId": "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99" }]
}
]
Desired output
[
{
"yearId": "bd5b69ac-6638-4d3e-8a52-94c24ed9a039",
"classes": [
{ "classId": "2590b596-e894-4af5-8ac5-68d109eee995" },
{ "classId": "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99" }
]
}
]
Is this possible?
Hard to say without the exact definition of the underlying tables and the objective of the query.
You need two levels of aggregation in any case. And you can probably largely simplify:
SELECT sub.id, sub."firstName"
, jsonb_agg(jsonb_build_object('yearId', sub."yearId"
, 'classes', sub.classes)) AS years
FROM (
SELECT s.id, s."firstName", cy."yearId"
, jsonb_agg(jsonb_build_object('classId', cy."classId")) AS classes
FROM users s
LEFT JOIN "classYearStudents" cys ON cys."studentId" = s.id
LEFT JOIN "classYears" cy ON cy.id = cys."classYearId"
GROUP BY s.id, cy."yearId"
) sub
GROUP BY sub.id, sub."firstName";
Not sure if and where you need DISTINCT in this query.
I kept the user ID in the result, as first names are hardly unique.
Don't use CaMeL-case identifiers with Postgres if you can avoid it. See:
Are PostgreSQL column names case-sensitive?
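The two levels of aggregation can be pictured with a small Python sketch. It is only an analogy of what the nested GROUP BY does, under assumptions: the rows list stands in for the joined result (values taken from the output in the question), the function name nest is made up, and students are keyed by first name purely for readability, whereas the query above groups by ID.

```python
from collections import defaultdict

# Stand-in for the joined rows: (first name, year ID, class ID).
rows = [
    ("Jarrell", "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "2590b596-e894-4af5-8ac5-68d109eee995"),
    ("Jarrell", "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99"),
    ("Kevon", "7f5789b5-999e-45e4-aba4-9f45b29a69ef", "c8cda7d1-7321-443c-b0ad-6d18451613b5"),
    ("Kevon", "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "2590b596-e894-4af5-8ac5-68d109eee995"),
    ("Kevon", "bd5b69ac-6638-4d3e-8a52-94c24ed9a039", "fe4a11f2-5f38-4f7a-bbce-609bc7ad8f99"),
]


def nest(rows):
    # Level 1 (the subquery): collect the classes per (student, year).
    classes = defaultdict(list)
    for student, year_id, class_id in rows:
        classes[(student, year_id)].append({"classId": class_id})

    # Level 2 (the outer query): collect the years per student.
    years = defaultdict(list)
    for (student, year_id), class_list in classes.items():
        years[student].append({"yearId": year_id, "classes": class_list})
    return dict(years)


nested = nest(rows)
print(nested["Jarrell"])
```

With a single level of grouping there is nowhere to collect the classes of one year before the years themselves are collected, which is why the original query produced one entry per class.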
I want to understand whether it is possible, using the SQLAlchemy Core syntax, to get the data in the same format as with pure SQL.
DB relations
events = Table(
"events",
metadata,
Column("id", Integer(), primary_key=True),
......
Column("location_id", ForeignKey(locations.c.id, ondelete="SET NULL"), nullable=False),
Column("activities_id", ForeignKey(activities.c.id, ondelete="SET NULL"), nullable=False),
)
locations = Table(
"locations",
metadata,
Column("id", Integer(), primary_key=True),
....
....
)
activities = Table(
"activities",
metadata,
Column("id", Integer(), primary_key=True),
....
)
I want to group the fields from the tables locations and activities into separate groups. In plain SQL the query looks like this (using PostgreSQL):
SELECT json_build_object('id', e.id, 'title', e.title, 'creator', json_agg(c),
'activitie', json_agg(a), 'users', jsonb_agg(u), 'location', jsonb_agg(l))
AS event FROM events AS e
LEFT JOIN event_users AS eu ON e.id = eu.events_id
LEFT JOIN users AS u ON eu.users_id = u.id
LEFT JOIN users AS c ON e.creator = c.id
LEFT JOIN activities AS a ON e.activities_id = a.id
LEFT JOIN locations AS l ON e.location_id = l.id
WHERE e.id = {_id} GROUP BY e.id
And the result is
{'event':
'{
"id" : 1, "title" : "Test",
"creator" : [{"id":1,"created_at":"2021-03-03T23:39:23.469751+03:00","email":"test#email","phone":"232323","hashed_password":"sdsdsds","is_active":true}],
"activitie" : [{"id":1,"name":"basketball","is_active":true}],
"users" : [null],
"location" : [{"id": 3, "lat": 54.49142965, "city": "Berlin", "long": 26.9173560217231, "street": "Stephans", "building": "12"}]
}'
}
As you can see, the fields behind each foreign key are grouped under a common key.
Now I'm trying to do something similar with SQLAlchemy Core (without the users table):
query = (
select(
[
events.c.id,
events.c.title,
locations.c.city,
locations.c.street,
locations.c.building,
activities.c.name,
]
)
.select_from(
events.join(locations).join(activities)
)
.where(
and_(
events.c.id == pk,
locations.c.id == events.c.location_id,
activities.c.id == events.c.activities_id)
)
.order_by(desc(events.c.created_at))
)
print(query)
ev = dict(await database.fetch_one(query))
And I get the result
{'id': 1, 'title': 'Test', 'city': 'Berlin', 'street': 'Stephans', 'building': '12', 'name': 'basketball'}
How can I group the result like this?
{
'id': 1,
'title': 'Test',
'location': [{
'city': 'Berlin',
'street': 'Stephans',
'building': '12',
}],
'activity': [{
'name': 'basketball'
}]
}
P.S. The SQL query generated with @van's code:
SELECT json_build_object(:json_build_object_2, events.id, :json_build_object_3, events.title, :json_build_object_4, json_agg(json_build_object(:json_build_object_5, locations.city, :json_build_object_6, locations.street, :json_build_object_7, locations.building)), :json_build_object_8, json_agg(json_build_object(:json_build_object_9, locations.id, :json_build_object_10, locations.lat, :json_build_object_11, locations.long, :json_build_object_12, locations.city, :json_build_object_13, locations.street, :json_build_object_14, locations.building)), :json_build_object_15, json_agg(json_build_object(:json_build_object_16, activities.name))) AS json_build_object_1
The query below should do the job:
import itertools

from sqlalchemy import func
# ...
query = (
select(
[
func.json_build_object(
"id",
events.c.id,
"title",
events.c.title,
"location",
func.json_agg(
func.json_build_object(
"city",
locations.c.city,
"street",
locations.c.street,
"building",
locations.c.building,
)
),
"location_all_columns_example",
func.json_agg(func.json_build_object(
*itertools.chain(*[(_.name, _) for _ in locations.c])
)),
"activity",
func.json_agg(
func.json_build_object(
"name",
activities.c.name,
)
),
)
]
)
.select_from(events.join(locations).join(activities))
.where(
and_(
events.c.id == pk,
locations.c.id == events.c.location_id,
activities.c.id == events.c.activities_id,
)
)
.order_by(desc(events.c.created_at))
.group_by(events.c.id) # !!! <- IMPORTANT
)
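Please note that you need the group_by clause: json_agg is an aggregate function, while events.c.id and events.c.title are selected outside of any aggregate.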
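If changing the SQL is not an option, the flat row returned by the question's original query could also be regrouped on the client side. This is a minimal sketch rather than a SQLAlchemy feature: group_row is a hypothetical helper, the key names are taken from the flat result shown in the question, and each group is wrapped in a list to mirror what json_agg returns.

```python
def group_row(row):
    """Regroup one flat result row into nested 'location' and 'activity' keys."""
    location_keys = ("city", "street", "building")
    return {
        "id": row["id"],
        "title": row["title"],
        # One-element lists mirror the arrays produced by json_agg.
        "location": [{key: row[key] for key in location_keys}],
        "activity": [{"name": row["name"]}],
    }


flat = {
    "id": 1,
    "title": "Test",
    "city": "Berlin",
    "street": "Stephans",
    "building": "12",
    "name": "basketball",
}

grouped = group_row(flat)
print(grouped)
```

Doing the grouping in the database keeps the shape defined in one place; the client-side variant is mainly useful when the same flat query is already used elsewhere.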
How can I join two tables in a select statement in which I also use a UDF? I stored the SQL query and UDF function in two files that I call via the bq command line. However, when I run it, I get the following error:
BigQuery error in query operation: Error processing job
'[projectID]:bqjob_[error_number]':
Table name cannot be resolved: dataset name is missing.
Note that I'm logged in to the correct project via the gcloud auth method.
My SQL statement:
SELECT
substr(date,1,6) as date,
device,
channelGroup,
COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
COUNT(DISTINCT fullVisitorId) AS users,
FROM
defaultChannelGroup(
SELECT
a.date,
a.device.deviceCategory AS device,
b.hits.page.pagePath AS page,
a.fullVisitorId,
a.visitId,
a.trafficSource.source AS trafficSourceSource,
a.trafficSource.medium AS trafficSourceMedium,
a.trafficSource.campaign AS trafficSourceCampaign
FROM FLATTEN(
SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
,hits) as a
LEFT JOIN FLATTEN(
SELECT hits.page.pagePath,hits.time,visitID,fullVisitorId
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
WHERE
hits.time = 0
and trafficSource.medium = 'organic'
,hits) as b
ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
)
GROUP BY
date,
device,
channelGroup
ORDER BY sessions DESC
where I replaced my datasetname with the correct name of course;
and some of the UDF (which works with another query):
function defaultChannelGroup(row, emit)
{
function output(channelGroup) {
emit({channelGroup:channelGroup,
fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
device: row.device,
date: row.date
});
}
computeDefaultChannelGroup(row, output);
}
bigquery.defineFunction(
'defaultChannelGroup',
['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
//['device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
[{'name': 'channelGroup', 'type': 'string'},
{'name': 'fullVisitorId', 'type': 'string'},
{'name': 'visitId', 'type': 'integer'},
{'name': 'device', 'type': 'string'},
{'name': 'date', 'type': 'string'}
],
defaultChannelGroup
);
The SELECT statements inside the FLATTEN calls needed to be wrapped in parentheses, i.e. FLATTEN((SELECT ...), hits) rather than FLATTEN(SELECT ..., hits).
I ran the bq command in the shell:
bq query --udf_resource=udf.js "$(cat query.sql)"
query.sql contains the following query:
SELECT
substr(date,1,6) as date,
device,
channelGroup,
COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
COUNT(DISTINCT fullVisitorId) AS users,
COUNT(DISTINCT transactionId) as orders,
CAST(SUM(transactionRevenue)/1000000 AS INTEGER) as sales
FROM
defaultChannelGroup(
SELECT
a.date as date,
a.device.deviceCategory AS device,
b.hits.page.pagePath AS page,
a.fullVisitorId as fullVisitorId,
a.visitId as visitId,
a.trafficSource.source AS trafficSourceSource,
a.trafficSource.medium AS trafficSourceMedium,
a.trafficSource.campaign AS trafficSourceCampaign,
a.hits.transaction.transactionRevenue as transactionRevenue,
a.hits.transaction.transactionID as transactionId
FROM FLATTEN((
SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID,
hits.transaction.transactionID, hits.transaction.transactionRevenue
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
),hits) as a
LEFT JOIN FLATTEN((
SELECT hits.page.pagePath,hits.time,trafficSource.medium,visitID,fullVisitorId
FROM
TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
WHERE
hits.time = 0
and trafficSource.medium = 'organic'
),hits) as b
ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
)
GROUP BY
date,
device,
channelGroup
ORDER BY sessions DESC
and udf.js contains the following function (the 'computeDefaultChannelGroup' function is not included):
function defaultChannelGroup(row, emit)
{
function output(channelGroup) {
emit({channelGroup:channelGroup,
date: row.date,
fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
device: row.device,
transactionId: row.transactionId,
transactionRevenue: row.transactionRevenue,
});
}
computeDefaultChannelGroup(row, output);
}
bigquery.defineFunction(
'defaultChannelGroup',
['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId', 'transactionId', 'transactionRevenue'],
[{'name': 'channelGroup', 'type': 'string'},
{'name': 'date', 'type': 'string'},
{'name': 'fullVisitorId', 'type': 'string'},
{'name': 'visitId', 'type': 'integer'},
{'name': 'device', 'type': 'string'},
{'name': 'transactionId', 'type': 'string'},
{'name': 'transactionRevenue', 'type': 'integer'}
],
defaultChannelGroup
);
This ran without error, and the results matched the data in Google Analytics.