How to do calculations on json data in Postgres

I'm storing AdWords report data in Postgres. Each report is stored in a table named Reports, which has a jsonb column named 'data'. Each report has JSON stored in its 'data' field that looks like this:
[
  {
    "match_type": "exact",
    "search_query": "gm hubcaps",
    "conversions": 2,
    "cost": 1.24
  },
  {
    "match_type": "broad",
    "search_query": "gm auto parts",
    "conversions": 34,
    "cost": 21.33
  },
  {
    "match_type": "phrase",
    "search_query": "silverdo headlights",
    "conversions": 63,
    "cost": 244.05
  }
]
What I want to do is query off these data hashes and sum up the total number of conversions for a given report. I've looked through the PostgreSQL docs and it looks like you can only really do calculations on hashes, not arrays of hashes like this. Is what I'm trying to do possible in Postgres? Do I need to make a temp table out of this array and do calculations off that? Or can I use a stored procedure?
I'm using PostgreSQL 9.4
EDIT
The reason I'm not just using a regular, normalized table is that this is just one example of how report data could be structured. In my project, reports have to allow arbitrary keys, because they are populated by users uploading CSVs with any columns they like. It's basically just a way to get around having arbitrarily many user-created tables.

What I want to do is query off these data hashes and sum up the conversions
The fastest way should be with jsonb_populate_recordset(). But you need a registered row type for it.
CREATE TEMP TABLE report_data (
-- match_type text -- commented out, because we only need ..
-- , search_query text -- .. conversions for this query
conversions int
-- , cost numeric
);
A temp table is one way to register a row type ad-hoc. More explanation in this related answer:
jsonb query with nested objects in an array
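A standalone composite type works just as well if you don't want a temp table hanging around (a minimal sketch; report_data is the same assumed type name used above):
CREATE TYPE report_data AS (
  conversions int   -- only the key we need; keys in the JSON that aren't in the row type are simply ignored
);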
Assuming a table report with report_id as PK, for lack of further information.
SELECT r.report_id, sum(d.conversions) AS sum_conversions
FROM report r
LEFT JOIN LATERAL jsonb_populate_recordset(null::report_data, r.data) d ON true
-- WHERE r.report_id = 12345 -- only for given report?
GROUP BY 1;
The LEFT JOIN ensures you get a result, even if data is NULL or empty or the JSON array is empty.
For a sum from a single row in the underlying table, this is faster:
SELECT d.sum_conversions
FROM report r
LEFT JOIN LATERAL (
SELECT sum(conversions) AS sum_conversions
FROM jsonb_populate_recordset(null::report_data, r.data)
) d ON true
WHERE r.report_id = 12345; -- enter report_id here
Alternative with jsonb_array_elements() (no need for a registered row type):
SELECT d.sum_conversions
FROM report r
LEFT JOIN LATERAL (
SELECT sum((value->>'conversions')::int) AS sum_conversions
FROM jsonb_array_elements(r.data)
) d ON true
WHERE r.report_id = 12345; -- enter report_id here
Normally you would implement this as a plain, normalized table. I don't see the benefit of JSON here (except that your application seems to require it, as you added in your edit).
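For reference, a normalized design for this particular report shape could look roughly like this (a sketch only; the table and column names are assumptions, and in your case the keys vary per report, which is why you went with jsonb):
CREATE TABLE report_row (
  report_id    int REFERENCES report,  -- points to the PK assumed above
  match_type   text,
  search_query text,
  conversions  int,
  cost         numeric
);

-- the sum then becomes a plain aggregate
SELECT report_id, sum(conversions) AS sum_conversions
FROM   report_row
GROUP  BY 1;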

You could use unnest:
select sum(conv) from
(select d->'conversions' as conv from
(select unnest(data) as d from <your table>) all_data
) all_conv
Disclaimer: I don't have Pg 9.4 available, so I couldn't test it myself.
EDIT: this assumes that the array you mentioned is a PostgreSQL array, i.e. that the data type of your data column is character varying[]. If the data is a json/jsonb array, you should be able to use json_array_elements (or jsonb_array_elements) instead of unnest.
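Since your data column is in fact jsonb, a corrected version of the same idea might look like this (an untested sketch; reports stands in for your table, note the ->> operator plus the cast, and the key spelled conversions as in your sample):
select sum(conv) from
  (select (d->>'conversions')::int as conv from
     (select jsonb_array_elements(data) as d from reports) all_data
  ) all_conv;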

Related

How to run an array UDF on a BigQuery column

I would like to process a column in a table as an array using a User Defined Function (UDF) written in Javascript. The function prototype of the UDF looks like this:
CREATE TEMPORARY FUNCTION Summary(items ARRAY<FLOAT64>, seed FLOAT64)
RETURNS ARRAY<FLOAT64>
I would like to turn the output array into an additional column.
If the column of my table containing FLOAT64 elements is named computed_items, how would I create a new column item_summaries containing one value per row of the output of the function applied to computed_items?
If possible, try to make the UDF run on each row rather than on an array built from all rows. If that is not possible, check that the total number of rows does not exceed what can reasonably be packed into a single array.
Your UDF takes an input array items (the values of a column) and a seed value. Add an array of row ids as a further input. The output of the UDF cannot be returned as rows directly, so return an array of structs holding the id and the new item value.
See the JS body of the UDF below for details; as an example, the new item value is calculated as the old value plus the seed value.
If your table already has a unique row number, you can skip the first part: your table then plays the role of h1.
Your table is given as tbl and we generate some data. For testing, we shuffle the order of the table with order by rand(). The helper table adds a row_number as compute_id. In the next steps we need to query this table twice and must obtain the same row_number mapping both times; this can be achieved by storing the table in a recursive CTE h1.
The table h2 uses your UDF Summary; as inputs, the rows are built up into arrays using array_agg, and the output array is named Summary_. Next, Summary_ is unnested and the columns are renamed.
The final table h3 joins the unnested Summary output back to the table using the row numbers in column compute_id.
create temp function Summary(items ARRAY<FLOAT64>,id ARRAY<int64>, seed FLOAT64)
RETURNS ARRAY<struct<computed_items FLOAT64,id int64>>
language js as
"""
var calculated_items=[];
for(let i in items){
calculated_items[i]=items[i]+seed;
}
var out=[];
for(let i in calculated_items){
out[i]={computed_items:calculated_items[i],id:id[i]}
}
return out;
""";
with recursive
tbl as (select x, x+1000 as computed_items from unnest(generate_array(0,1000))x order by rand()
),
helper as (select row_number() over () as compute_id, x, computed_items from tbl),
h1 as (select * from helper union all select * from h1 where false),
h2 as (select t0.computed_items, t0.id as compute_id from (select Summary(array_agg(computed_items*1.0),array_agg(compute_id),10000.0) Summary_ from h1) as XX,unnest(XX.Summary_) as t0),
h3 as (select * from h1 left join h2 using (compute_id))
select * from h3
order by x

Is there a melt command in Snowflake?

Is there a Snowflake command that will transform a table like this:
a,b,c
1,10,0.1
2,11,0.12
3,12,0.13
to a table like this:
key,value
a,1
a,2
a,3
b,10
b,11
b,12
c,0.1
c,0.12
c,0.13
?
This operation is often called melt in other tabular systems, but the basic idea is to convert the table into a list of key value pairs.
There is an UNPIVOT in SnowSQL, but as I understand it UNPIVOT requires you to manually specify every single column. This doesn't seem practical for a large number of columns.
Snowflake's SQL is powerful enough to perform such an operation without the help of third-party tools or other extensions.
Data prep:
CREATE OR REPLACE TABLE t(a INT, b INT, c DECIMAL(10,2))
AS
SELECT 1,10,0.1
UNION SELECT 2,11,0.12
UNION SELECT 3,12,0.13;
Query (aka "dynamic" UNPIVOT):
SELECT f.KEY, f.VALUE
FROM (SELECT OBJECT_CONSTRUCT_KEEP_NULL(*) AS j FROM t) AS s
,TABLE(FLATTEN(input => s.j)) f
ORDER BY f.KEY;
Output: one KEY/VALUE row per column of each input row (keys A, B, C with their respective values).
How does it work?
Transform each row into JSON (row 1 becomes { "A": 1, "B": 10, "C": 0.1 })
Parse the JSON into key-value pairs using FLATTEN
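If you want to see step 1 in isolation, you can run the inner subquery from the query above on its own:
SELECT OBJECT_CONSTRUCT_KEEP_NULL(*) AS j FROM t;
-- each row comes back as a single JSON object, e.g. row 1 as { "A": 1, "B": 10, "C": 0.1 }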

How do I perform a join with a JSONB array structure in PostgreSQL

I'm struggling with the syntax for a join when I've got an array stored in JSONB. I've searched for examples and I can't find the magic sauce that makes this work in PostgreSQL 9.6.
I've got the following structure stored in a JSONB column in a table called disruption_history. The element is called data:
"message": {
"id": 352,
"preRecordedMessageList": {
"preRecordedMessageCodes": [804, 2110, 1864, 1599]
}
}
I then have another standard table called message_library
component_code | integer | not null
message_text | character varying(255) | not null
I'm trying to produce the text for each set of message codes. So something like
SELECT
ml.message_text
FROM
message_library ml, disruption_history dh
WHERE
jsonb_array_elements_text(dh.data->'message'->'preRecordedMessageList'
->'preRecordedMessageCodes') = ml.component_code
I get
ERROR: operator does not exist: text = integer
even if I try to cast the numbers to integer I get "argument of WHERE must not return a set".
Can someone help please?
select message_library.message_text
from disruption_history
join lateral jsonb_array_elements_text(data->'message'->'preRecordedMessageList'->'preRecordedMessageCodes') v
on true
join message_library
on v.value::int = message_library.component_code
You can use the following query:
SELECT
CAST(dh.data->'message'->>'id' AS INTEGER) AS message_id,
ml.message_text
FROM
disruption_history dh
JOIN message_library ml
ON ml.component_code IN
(SELECT
CAST(jsonb_array_elements_text(
dh.data->'message'->'preRecordedMessageList'->'preRecordedMessageCodes'
)
AS INTEGER)
) ;
Note that I have used an explicit join (avoid the implicit ones!).
The trick here is to convert your preRecordedMessageCodes into a set of text values by using the jsonb_array_elements_text function; these are then CAST to integer and compared to ml.component_code by using an IN condition.
You can check the whole setup at dbfiddle here
Note also that this structure produces an awful execution plan that requires full sequential scans of both tables. I have not been able to find any kind of index that helps these queries.
Note that this won't work if you have arrays with NULLs in them, which I assume wouldn't make sense.
Keeping order:
If you want to keep the elements of the array in order, you need to use a WITH ORDINALITY clause to obtain not only the array element but also its relative position, and then use that position to ORDER BY:
-- Keeping order
SELECT
CAST(dh.data->'message'->>'id' AS INTEGER) AS message_id,
ml.message_text
FROM
disruption_history dh
JOIN LATERAL
jsonb_array_elements_text(dh.data->'message'->'preRecordedMessageList'->'preRecordedMessageCodes')
WITH ORDINALITY AS x(mc, ord) /* We will want to use 'ord' to order by */
ON true
JOIN message_library ml ON ml.component_code = cast(mc AS INTEGER)
ORDER BY
message_id, ord ;
Watch this at dbfiddle here
Alternative:
If the structure of your json data is always the same, I would strongly recommend that you normalize your design (at least partially):
CREATE TABLE disruption_history_no_json
(
disruption_history_id SERIAL PRIMARY KEY,
message_id INTEGER,
pre_recorded_message_codes INTEGER[]
) ;
CREATE INDEX idx_disruption_history_no_json_pre_recorded_message_codes
ON disruption_history_no_json USING GIN (pre_recorded_message_codes) ;
This would allow for a much simpler and more efficient query:
SELECT
message_id,
ml.message_text
FROM
disruption_history_no_json dh
JOIN message_library ml
ON ml.component_code = ANY(pre_recorded_message_codes) ;
Check everything together at dbfiddle here
JSON(B) allows you not to normalize, and not to have to think much about your table structures, but you pay a steep price in performance and maintainability.

Compare comma separated list with individual row in table

I have to compare comma separated values with a column in the table and find out which values are not in the database [kind of master data validation]. Please have a look at the sample data below:
table data in database:
id name
1 abc
2 def
3 ghi
SQL part:
Here I am getting a comma separated list like ('abc','def','ghi','xyz').
Now xyz is an invalid value, so I want to take that value and return it as output saying "invalid value".
It is possible if I split those values into a temp table, loop through each value and compare them one by one.
But is there a more optimal way to do this?
I'm not sure if I got the question right; however, I would personally be trying to get to something like this:
SELECT
    D.id,
    CASE
        WHEN B.Name IS NULL THEN D.name
        ELSE 'invalid value'
    END
FROM
    data AS D
    LEFT JOIN badNames B ON B.Name = D.Name
    -- as SQL string comparison is case insensitive by default, the equals sign should work
There is one table with bad names, or invalid values if you prefer. This can be a temporary table as well, depending on usage (black-listed words should be a permanent table; ad hoc invalid values provided by a service should be a temp table, etc.).
NOTE: The select above can be nested in a view, so the data remains as it was, yet you gain the validity information. Otherwise I would create a cursor inside a function that goes through a select like the one above and alters the original data, if that is the goal...
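For example, wrapping the select in a view could look like this (a sketch using the same assumed table and column names as above):
CREATE VIEW data_validity AS
SELECT
    D.id,
    CASE
        WHEN B.Name IS NULL THEN D.name
        ELSE 'invalid value'
    END AS checked_name
FROM
    data AS D
    LEFT JOIN badNames B ON B.Name = D.Name;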
It sounds like you just need a NOT EXISTS / LEFT JOIN, as in:
SELECT tmp.InvalidValue
FROM dbo.HopeThisIsNotAWhileBasedSplit(@CSVlist) tmp
WHERE NOT EXISTS (
SELECT *
FROM dbo.Table tbl
WHERE tbl.Field = tmp.InvalidValue
);
Of course, depending on the size of the CSV list coming in, the number of rows in the table you are checking, and the style of splitter you are using, it might be better to dump the CSV to a temp table first (as you mentioned doing in the question).
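If you do dump the list into a temp table first, a sketch could look like this (assuming SQL Server 2016+ where STRING_SPLIT is available, a plain comma separated string without quotes in @CSVlist, and the same dbo.Table/Field names as above):
-- split the CSV string into a temp table
SELECT value AS InvalidValue
INTO   #csv_values
FROM   STRING_SPLIT(@CSVlist, ',');

-- then report values that have no match in the base table
SELECT tmp.InvalidValue
FROM   #csv_values tmp
WHERE NOT EXISTS (
    SELECT *
    FROM  dbo.Table tbl
    WHERE tbl.Field = tmp.InvalidValue
);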
Try the following query:
SELECT SplitedValues.name,
CASE WHEN YourTable.Id IS NULL THEN 'invalid value' ELSE NULL END AS Result
FROM SplitedValues
LEFT JOIN yourTable ON SplitedValues.name = YourTable.name

Query table by indexes from integer array

After getting excellent results converting data with "to_timestamp" and "to_number" from VB.NET, I am wondering whether PostgreSQL can query a table by an array of integers sent from .NET.
Say I have an array filled with (1, 3, 5, 6, 9).
Is there any way for PostgreSQL to return rows with data for those indexes to the "odbc.reader"?
That would be much faster than looping and querying 5 times like I do now.
Something like this:
SELECT myindexes, myname, myadress from mytable WHERE myindexes IS IN ARRAY
If this is possible, what would a simple query look like?
That's possible.
ANY
SELECT myindex, myname, myadress
FROM mytable
WHERE myindex = ANY ($my_array)
Example with integer-array:
...
WHERE myindex = ANY ('{1,3,5,6,9}'::int[])
Details about ANY in the manual.
IN
There is also the SQL IN() expression for the same purpose.
PostgreSQL in its current implementation transforms that to = ANY (array) internally prior to execution, so it's conceivably a bit slower.
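For completeness, the IN() form with a literal list looks like this:
SELECT myindex, myname, myadress
FROM   mytable
WHERE  myindex IN (1, 3, 5, 6, 9);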
Examples for joining to a long list (as per comment):
JOIN to VALUES expression
WITH x(myindex) AS (
VALUES
(1),(3),(5),(6),(9)
)
SELECT myindex, myname, myadress
FROM mytable
JOIN x USING (myindex)
I am using a CTE in the example (which is optional; it could be a sub-query as well). You need PostgreSQL 8.4 or later for that.
The manual about VALUES.
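The same join without a CTE, with the VALUES expression as a sub-query in the FROM list (a quick sketch, same assumed columns as above):
SELECT myindex, myname, myadress
FROM   mytable
JOIN  (VALUES (1),(3),(5),(6),(9)) x(myindex) USING (myindex);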
JOIN to unnested array
Or you could unnest() an array and JOIN to it:
SELECT myindex, myname, myadress
FROM mytable
JOIN (SELECT unnest('{1,3,5,6,9}'::int[]) AS myindex) x USING (myindex)
Each of these methods is far superior in performance to running a separate query per value.