TL;DR: I need two UPDATE scripts that would turn (1) into (2) and vice-versa (add/delete color property from every object in the JSONB column)
(1)
id | name | squares
1 | s1 | [{"x": 5, "y": 5, "width": 10, "height": 10}, {"x": 0, "y": 0, "width": 20, "height": 20}]
2 | s2 | [{"x": 0, "y": 3, "width": 13, "height": 11}, {"x": 2, "y": 3, "width": 20, "height": 20}]
(2)
id | name | squares
1 | s1 | [{"x": 5, "y": 5, "width": 10, "height": 10, "color": "#FFFFFF"}, {"x": 0, "y": 0, "width": 20, "height": 20, "color": "#FFFFFF"}]
2 | s2 | [{"x": 0, "y": 3, "width": 13, "height": 11, "color": "#FFFFFF"}, {"x": 2, "y": 3, "width": 20, "height": 20, "color": "#FFFFFF"}]
My schema
I have a table called scene with a squares column of type JSONB. Inside this column I store values like this: [{"x": 5, "y": 5, "width": 10, "height": 10}, {"x": 0, "y": 0, "width": 20, "height": 20}].
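For context, here is a minimal sketch of that schema together with the sample rows from above (the id and name columns are assumptions inferred from the example, not taken from the real production table):

-- minimal sketch of the scene table; id/name assumed from the sample rows
create table scene (
    id      serial primary key,
    name    text,
    squares jsonb
);

insert into scene (name, squares) values
    ('s1', '[{"x": 5, "y": 5, "width": 10, "height": 10}, {"x": 0, "y": 0, "width": 20, "height": 20}]'),
    ('s2', '[{"x": 0, "y": 3, "width": 13, "height": 11}, {"x": 2, "y": 3, "width": 20, "height": 20}]');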
What I want to do
I want to now add color to my squares, which implies also adding some default color (like "#FFFFFF") to every square in every scene record in the existing production database, so I need a migration.
The problem
I need to write a migration that adds "color": "#FFFFFF" to every square in the production database. With a relational schema that would be as easy as ALTER TABLE square ADD color... for the forward migration and ALTER TABLE square DROP COLUMN color... for the rollback migration. But since square is not a separate table, just an array-like JSONB value, I need two UPDATE queries on the scene table instead.
(1) Adding color:
update scene set squares = (select array_to_json(array_agg(jsonb_insert(v.value, '{color}', '"#FFFFFF"')))
from jsonb_array_elements(squares) v);
select * from scene;
See demo.
(2) Removing color:
update scene set squares = (select array_to_json(array_agg(v.value - 'color'))
from jsonb_array_elements(squares) v);
select * from scene;
See demo.
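One hedged note on both queries: array_agg / jsonb_agg over zero rows yields NULL, so a scene whose squares array is empty ('[]') would get its column overwritten with NULL. If that matters for your data, a variant using jsonb_agg (which returns jsonb directly, so no json-to-jsonb conversion is involved) plus a guard could look like this; just a sketch, assuming PostgreSQL 9.6+ for jsonb_insert:

-- forward migration: add a default color to every square
update scene
set    squares = (select jsonb_agg(jsonb_insert(v.value, '{color}', '"#FFFFFF"'))
                  from jsonb_array_elements(squares) v)
where  jsonb_array_length(squares) > 0;

-- rollback migration: strip the color key again
update scene
set    squares = (select jsonb_agg(v.value - 'color')
                  from jsonb_array_elements(squares) v)
where  jsonb_array_length(squares) > 0;

Also note that, as far as I know, jsonb_insert will not overwrite an existing key, so the forward migration is not re-runnable as written; jsonb_set (whose create_if_missing parameter defaults to true) would be the alternative if you need idempotency, at the cost of resetting colors that are already there.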
I have a data set that combines two temporal measurement series with one row per measurement
time: 1, measurement: a, value: 5
time: 2, measurement: b, value: false
time: 10, measurement: a, value: 2
time: 13, measurement: b, value: true
time: 20, measurement: a, value: 4
time: 24, measurement: b, value: true
time: 30, measurement: a, value: 6
time: 32, measurement: b, value: false
In a visualization using Vega-Lite, I'd like to combine the two measurement series and encode measurements a and b in a single visualization, not by simply layering their representations on a temporal axis but by representing their values in a single encoding spec.
Either the measurement a values need to be interpolated and added as a new field to the measurement b rows,
e.g.:
time: 2, measurement: b, value: false, interpolatedMeasurementA: 4.6667
or the other way around, which raises the question of how to interpolate a boolean: maybe the closest value by time, or, simpler, the last value,
e.g.:
time: 30, measurement: a, value: 6, lastValueMeasurementB: true
I suppose this could be done query-side, in which case this question would be about the InfluxDB Flux query language,
or it could be done on the visualization side, in which case it would be about Vega-Lite.
There aren't any true linear interpolation schemes built into Vega-Lite (though the loess transform comes close), but you can achieve roughly what you want with a window transform.
Here is an example (view in editor):
{
"data": {
"values": [
{"time": 1, "measurement": "a", "value": 5},
{"time": 2, "measurement": "b", "value": false},
{"time": 10, "measurement": "a", "value": 2},
{"time": 13, "measurement": "b", "value": true},
{"time": 20, "measurement": "a", "value": 4},
{"time": 24, "measurement": "b", "value": true},
{"time": 30, "measurement": "a", "value": 6},
{"time": 32, "measurement": "b", "value": false}
]
},
"transform": [
{
"calculate": "datum.measurement == 'a' ? datum.value : null",
"as": "measurement_a"
},
{
"window": [
{"op": "mean", "field": "measurement_a", "as": "interpolated"}
],
"sort": [{"field": "time"}],
"frame": [1, 1]
},
{"filter": "datum.measurement == 'b'"}
],
"mark": "line",
"encoding": {
"x": {"field": "time"},
"y": {"field": "interpolated"},
"color": {"field": "value"}
}
}
This first uses a calculate transform to isolate the values to be interpolated, then a window transform that computes the mean over adjacent values (frame: [1, 1]), then a filter transform to isolate interpolated rows.
If you wanted to go the other route, you could do a similar sequence of transforms targeting the boolean value instead.
I currently have a data frame like this:
and I would like to explode the "listing" column into rows. I would like to use the keys in the dictionaries as column names, so ideally this is how I would like the data frame to look:
eventId listingId currentPrice
103337923 1307675567 ...
103337923 1307675567 ...
103337923 1307675567 ...
This is what I get with this: print(listing_df.head(3).to_dict())
Definitely there should be a better way to do this. But this works. :)
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1,2,3,4],
     "b": [5,6,7,8],
     "c": [[{"x": 17, "y": 18, "z": 19}, {"x": 27, "y": 28, "z": 29}],
           [{"x": 37, "y": 38, "z": 39}, {"x": 47, "y": 48, "z": 49}],
           [{"x": 57, "y": 58, "z": 59}, {"x": 27, "y": 68, "z": 69}],
           [{"x": 77, "y": 78, "z": 79}, {"x": 27, "y": 88, "z": 89}]]})
Now you can create a new DataFrame from the above:
df2 = pd.DataFrame(columns=df1.columns)
df2_index = 0
# copy each row once per element of its "c" list, replacing "c" with that element
for row in df1.iterrows():
    one_row = row[1]
    for list_value in row[1]["c"]:
        one_row["c"] = list_value
        df2.loc[df2_index] = one_row
        df2_index += 1
The output is in the shape you need.
Now that we have expanded list into separate rows, you can further expand json into columns with:
df2[list(df2["c"].head(1).tolist()[0].keys())] = df2["c"].apply(
lambda x: pd.Series([x[key] for key in x.keys()]))
Hope it helps!
Similar questions asked here before:
Count items for a single key: jq count the number of items in json by a specific key
Calculate the sum of object values:
How do I sum the values in an array of maps in jq?
Question
How can I emulate the COUNT aggregate function so that it behaves similarly to its SQL original? Let's extend this question even further to include other regular SQL functions:
COUNT
SUM / MAX / MIN / AVG
ARRAY_AGG
The last one is not a standard SQL function - it's from PostgreSQL but is quite useful.
The input is a stream of valid JSON objects. For demonstration, let's pick a simple story of owners and their pets.
Model and data
Base relation: Owner
id name age
1 Adams 25
2 Baker 55
3 Clark 40
4 Davis 31
Base relation: Pet
id name litter owner_id
10 Bella 4 1
20 Lucy 2 1
30 Daisy 3 2
40 Molly 4 3
50 Lola 2 4
60 Sadie 4 4
70 Luna 3 4
Source
From the above we get a derived relation Owner_Pet (the result of a SQL JOIN of the two relations, sketched after the listing), presented in JSON format for our jq queries (the source data):
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy", "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola", "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna", "litter": 3 }
Requests
Here are sample requests and their expected output:
COUNT the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
SUM up the number of whelps per owner and get their MAX (MIN/AVG):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
ARRAY_AGG pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }
Here's an alternative in basic jq, not using any custom functions. (I took the liberty of getting rid of redundant parts of the question.)
Count
In> jq -s 'group_by(.owner_id) | map({ owner_id: .[0].owner_id, count: map(.pet) | length})'
Out>[{"owner_id": "1","pets_count": 2}, ...]
Sum
In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": "1","sum": 6}, ...]
Max
In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": "1","max": 4}, ...]
Aggregate
In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet) })'
Out> [{"owner_id": "1","agg": ["Bella","Lucy"]}, ...]
Sure, these might not be the most efficient implementations, but they show nicely how to implement such functions oneself. All that changes between the different functions is inside the last map and the function after the pipe | (length, add, max).
The first map iterates over the different groups, taking the name from the first item, and using map again to iterate over the same-group items. Not as pretty as SQL, but not terribly more complicated.
I learned JQ today, and managed to do this already, so this should be encouraging for anyone getting started. JQ is neither like sed nor like SQL, but not terribly hard either.
Extended jq solution:
Custom count() function:
jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0]
| .pets_count = $l
| del(.pet_id, .pet, .litter);
count("owner_id")' source.data
The output:
{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}
Custom sum() function:
jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0]
| . + {litter_total: $litters | add, litter_max: $litters | max}
| del(.pet_id, .pet, .litter);
sum("owner_id")' source.data
The output:
{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}
Custom array_agg() function:
jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0]
| .pets = $pets | del(.pet_id, .pet, .litter);
array_agg("owner_id")' source.data
The output:
{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
This is a nice exercise, but SO is not a programming service, so I will focus here on some key concepts for generic solutions in jq that are efficient, even for very large collections.
GROUPS_BY
The key to efficiency here is avoiding the built-in group_by, as it requires sorting. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups, while avoiding calling tojson on strings:
# emit a stream of the groups defined by f
def GROUPS_BY(stream; f):
reduce stream as $x ({};
($x|f) as $s
| ($s|type) as $t
| (if $t == "string" then $s else ($s|tojson) end) as $y
| .[$t][$y] += [$x] )
| .[][] ;
distinct and count_distinct
# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream):
reduce stream as $x ({};
($x|type) as $t
| (if $t == "string" then $x else ($x|tojson) end) as $y
| if (.[$t] | has($y)) then . else .[$t][$y] += [$x] end )
| [.[][]] | add ;
# Emit the number of distinct items in the given stream
def count_distinct(stream):
def sum(s): reduce s as $x (0;.+$x);
reduce stream as $x ({};
($x|type) as $t
| (if $t == "string" then $x else ($x|tojson) end) as $y
| .[$t][$y] = 1 )
| sum( .[][] ) ;
Convenience function
def owner: {owner_id,owner,age};
Example: "COUNT the number of pets per owner"
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}
Invocation: jq -nc -f program1.jq input.json
Output:
{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}
Example: "SUM up the number of whelps per owner and get their MAX"
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
+ {litter_total: (map(.litter) | add)}
+ {litter_max: (map(.litter) | max)}
Invocation: jq -nc -f program2.jq input.json
Output: as given.
Example: "ARRAY_AGG pets per owner"
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}
Invocation: jq -nc -f program3.jq input.json
Output:
{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
Say I have the following product schema, which has common properties like title etc., as well as variants in an array.
How would I go about ordering the products by price, lowest to highest?
drop table if exists product;
create table product (
id int,
data jsonb
);
insert into product values (1, '
{
"product_id": 10000,
"title": "product 10000",
"variants": [
{
"variantId": 100,
"price": 9.95,
"sku": 100,
"weight": 388
},
{
"variantId": 101,
"price": 19.95,
"sku": 101,
"weight": 788
}
]
}');
insert into product values (2, '
{
"product_id": 10001,
"title": "product 10001",
"variants": [
{
"variantId": 200,
"price": 89.95,
"sku": 200,
"weight": 11
},
{
"variantId": 201,
"price": 99.95,
"sku": 201,
"weight": 22
}
]
}');
insert into product values (3, '
{
"product_id": 10002,
"title": "product 10002",
"variants": [
{
"variantId": 300,
"price": 1.00,
"sku": 300,
"weight": 36
}
]
}');
select * from product;
1;"{"title": "product 10000", "variants": [{"sku": 100, "price": 9.95, "weight": 388, "variantId": 100}, {"sku": 101, "price": 19.95, "weight": 788, "variantId": 101}], "product_id": 10000}"
2;"{"title": "product 10001", "variants": [{"sku": 200, "price": 89.95, "weight": 11, "variantId": 200}, {"sku": 201, "price": 99.95, "weight": 22, "variantId": 201}], "product_id": 10001}"
3;"{"title": "product 10002", "variants": [{"sku": 300, "price": 1.00, "weight": 36, "variantId": 300}], "product_id": 10002}"
Use jsonb_array_elements() to unnest variants, e.g.:
select
id, data->'product_id' product_id,
var->'sku' as sku, var->'price' as price
from
product, jsonb_array_elements(data->'variants') var
order by 4;
id | product_id | sku | price
----+------------+-----+-------
3 | 10002 | 300 | 1.00
1 | 10000 | 100 | 9.95
1 | 10000 | 101 | 19.95
2 | 10001 | 200 | 89.95
2 | 10001 | 201 | 99.95
(5 rows)
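The query above yields one row per variant. If you instead want one row per product, ordered by its cheapest variant, a variation along the same lines could be (a sketch against the sample schema above):

-- one row per product, ordered by the price of its cheapest variant
select id,
       data->>'title' as title,
       min((var->>'price')::numeric) as min_price
from   product, jsonb_array_elements(data->'variants') var
group  by id, data->>'title'
order  by min_price;

With the sample data this would list product 10002 first (1.00), then 10000 (9.95), then 10001 (89.95).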
I have the following table:
CREATE TABLE mytable (
id serial PRIMARY KEY
, employee text UNIQUE NOT NULL
, data jsonb
);
With the following data:
INSERT INTO mytable (employee, data)
VALUES
('Jim', '{"sales_tv": [{"value": 10, "yr": "2010", "loc": "us"}, {"value": 5, "yr": "2011", "loc": "europe"}, {"value": 40, "yr": "2012", "loc": "asia"}], "sales_radio": [{"value": 11, "yr": "2010", "loc": "us"}, {"value": 8, "yr": "2011", "loc": "china"}, {"value": 76, "yr": "2012", "loc": "us"}], "another_key": "another value"}'),
('Rob', '{"sales_radio": [{"value": 7, "yr": "2014", "loc": "japan"}, {"value": 3, "yr": "2009", "loc": "us"}, {"value": 37, "yr": "2011", "loc": "us"}], "sales_tv": [{"value": 4, "yr": "2010", "loc": "us"}, {"value": 18, "yr": "2011", "loc": "europe"}, {"value": 28, "yr": "2012", "loc": "asia"}], "another_key": "another value"}')
Notice that there are other keys in there besides just "sales_tv" and "sales_radio". For the queries below I just need to focus on "sales_tv" and "sales_radio".
I need to find all of Jim's sales for 2012, i.e. anything under a key that starts with "sales_", and put the result in a single object (I just need what was sold and its value), e.g.:
employee | sales_
Jim | {"sales_tv": 40, "sales_radio": 76}
I've got:
SELECT * FROM mytable,
(SELECT l.key, l.value FROM mytable, lateral jsonb_each_text(data) AS l
WHERE key LIKE 'sales_%') AS a,
jsonb_to_recordset(a.value::jsonb) AS d(yr text, value float)
WHERE mytable.employee = 'Jim'
AND d.yr = '2012'
But I can't seem to even get just Jim's data. Instead I get:
employee | key | value
-------- |------ | -----
Jim | sales_tv | [{"yr": "2010", "loc": "us", "value": 4}, {"yr": "2011", "loc": "europe", "value": 18}, {"yr": "2012", "loc": "asia", "value": 28}]
Jim | sales_tv | [{"yr": "2010", "loc": "us", "value": 10}, {"yr": "2011", "loc": "europe", "value": 5}, {"yr": "2012", "loc": "asia", "value": 40}]
Jim | sales_radio | [{"yr": "2010", "loc": "us", "value": 11}, {"yr": "2011", "loc": "china", "value": 8}, {"yr": "2012", "loc": "us", "value": 76}]
You need to treat the result of the first join as jsonb, not as a text string, so use jsonb_each() instead of jsonb_each_text():
SELECT t.employee, json_object_agg(a.k, d.value) AS sales
FROM mytable t
JOIN LATERAL jsonb_each(t.data) a(k,v) ON a.k LIKE 'sales_%'
JOIN LATERAL jsonb_to_recordset(a.v) d(yr text, value float) ON d.yr = '2012'
WHERE t.employee = 'Jim' -- works because employee is unique
GROUP BY 1;
GROUP BY 1 is shorthand for GROUP BY t.employee.
Result:
employee | sales
---------+--------
Jim | '{ "sales_tv" : 40, "sales_radio" : 76 }'
I also untangled and simplified your query.
json_object_agg() is instrumental in aggregating name/value pairs as JSON object. Optionally cast to jsonb if you need that - or use jsonb_object_agg() in Postgres 9.5 or later.
Using explicit JOIN syntax to attach conditions in their most obvious place.
The same without explicit JOIN syntax:
SELECT t.employee, json_object_agg(a.k, d.value) AS sales
FROM mytable t
, jsonb_each(t.data) a(k,v)
, jsonb_to_recordset(a.v) d(yr text, value float)
WHERE t.employee = 'Jim'
AND a.k LIKE 'sales_%'
AND d.yr = '2012'
GROUP BY 1;
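To make the jsonb note above concrete: the same query can produce jsonb instead of json simply by swapping the aggregate function (a sketch, for Postgres 9.5 or later):

SELECT t.employee, jsonb_object_agg(a.k, d.value) AS sales  -- jsonb directly, no cast needed
FROM   mytable t
JOIN   LATERAL jsonb_each(t.data) a(k,v) ON a.k LIKE 'sales_%'
JOIN   LATERAL jsonb_to_recordset(a.v) d(yr text, value float) ON d.yr = '2012'
WHERE  t.employee = 'Jim'
GROUP  BY 1;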
Your first query can be solved like this (shooting from the hip, no access to PG 9.4 here):
SELECT employee, json_object_agg(key, sales)::jsonb AS sales_
FROM (
SELECT t.employee, j.key, sum((e->>'value')::int) AS sales
FROM mytable t,
jsonb_each(t.data) j,
jsonb_array_elements(j.value) e
WHERE t.employee = 'Jim'
AND j.key like 'sales_%'
AND e->>'yr' = '2012'
GROUP BY t.employee, j.key) sub
GROUP BY employee;
The trick here is that you use LATERAL joins to "peel away" outer layers of the jsonb object to get at data deeper down. This query is assuming that Jim may have sales in multiple locations.
(Working on your query 2)