Athena array aggregate and filter multiple columns on condition - sql

I have data as shown below.
uuid  movie   data
1     movie1  {title=rental, label=GA, price=50, feetype=rental, hidden=false}
1     movie1  {title=tax, label=GA, price=25, feetype=service-fees, hidden=true}
1     movie1  {title=rental, label=GA, price=50, feetype=rental, hidden=false}
1     movie1  {title=tax, label=GA, price=25, feetype=service-fees, hidden=true}
2     movie3  {title=rental, label=VIP, price=100, feetype=rental, hidden=false}
2     movie3  {title=tax, label=VIP, price=25, feetype=service-fees, hidden=true}
2     movie3  {title=promo, label=VIP, price=10, feetype=discount, hidden=false}
And this is how I want the result to look:
uuid  total_fee  total_discount  discount_type
1     150        0               NA
2     125        10              promo
I tried using
SELECT uuid
, sum("fee"."price") "total_fee"
, array_agg(distinct("fee"."feetype")) "fee_type"
, array_agg(distinct("fee"."title")) "fee_name"
This gives the result shown below:
uuid  total_fee  fee_type        fee_name
1     100        [rental]        [rental]
1     50         [service-fees]  [tax]
2     100        [rental]        [rental]
2     25         [service-fees]  [tax]
2     10         [discount]      [promo]
Now how do I aggregate on total_fee and filter fee_name based on fee_type?
I tried using
, CASE WHEN regexp_like(array_join(fee_type, ','), 'discount') THEN sum("fee") ELSE 0 END "discount"
but that resulted in
SYNTAX_ERROR: line 207:6: '(CASE WHEN "regexp_like"("array_join"(fee_type, ','), 'discount') THEN "sum"("fee") ELSE 0 END)' must be an aggregate expression or appear in GROUP BY clause

You should be able to do something like this:
SELECT
uuid,
SUM(fee.price) AS total_fee,
SUM(fee.price) FILTER (WHERE fee.feetype = 'discount') AS total_discount,
ARBITRARY(fee.title) FILTER (WHERE fee.feetype = 'discount') AS discount_type
FROM …
GROUP BY uuid
(I'm assuming the data column in your example is the same as the fee column in your query).
Aggregate functions support a FILTER clause that selects the rows to include in the aggregation. The same thing can also be achieved with e.g. SUM(IF(fee.feetype = 'discount', fee.price, 0)), which is more compact but not as elegant.
The ARBITRARY aggregate function picks an arbitrary value from the group. I don't know if that's appropriate in your case, but I assume there will only be one discount row per group. If there can be more than one, you might want to use ARRAY_AGG with the DISTINCT clause (e.g. ARRAY_AGG(DISTINCT fee.title)) to get them all.
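For reference, here is a minimal sketch of the IF-based variant, coalescing the no-discount case to the 0/NA defaults shown in the expected output (the table name fees is a placeholder):
SELECT
uuid,
SUM(fee.price) AS total_fee,
-- two-argument IF returns NULL when the condition is false,
-- so non-discount rows are simply ignored by the aggregates
COALESCE(SUM(IF(fee.feetype = 'discount', fee.price)), 0) AS total_discount,
COALESCE(ARBITRARY(IF(fee.feetype = 'discount', fee.title)), 'NA') AS discount_type
FROM fees
GROUP BY uuid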

Related

Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10 AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS measurement_table (
    hour smallint NOT NULL,
    measurement smallint NOT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
       ( 1, 1), ( 1, 1),
       (10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour  Count  Sum
0     1      1
1     2      2
2     0      0
3     0      0
4     0      0
5     0      0
6     0      0
7     0      0
8     0      0
9     0      0
10    3      10
11    0      0
12    0      0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's for the hours with no match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
hour  frequency  total
0     1          1
1     2          2
2     1          0
3     1          0
4     1          0
5     1          0
6     1          0
7     1          0
8     1          0
9     1          0
10    3          10
11    1          0
12    1          0
The only difference between these two examples is the count term:
count(*) -- a result of 1 on no match, and a correct count otherwise.
count(joined-table field) -- 0 on no match, and a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1 since the series row matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as an argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres-specific; it belongs to the ANSI SQL standard (all databases that I know of conform to it).
Bottom line:
in general, use count(*); this is more efficient, since the database does not need to check for nulls (and it makes clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression in the count()
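A minimal illustration of the difference:
-- count(*) counts the rows in the group; count(expr) counts non-null values
SELECT count(*) AS all_rows, count(x) AS non_null_values
FROM (VALUES (1), (NULL), (3)) AS t(x);
-- all_rows = 3, non_null_values = 2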

Order grouped table by id user sql

I want to order a grouped statement using as a reference the number chosen by a specific user.
SELECT *
FROM likes
WHERE /**/
GROUP BY type
TABLE
id_user type
1420 1
1421 3
1422 3
1424 7
1425 4
1426 2
1427 1
Expected result (at the end, what user 1425 chose):
1
2
3
7
4 -- chosen by id_user 1425
I want to put the number chosen by the user in the last row; I just can't figure that out.
You can aggregate and use a conditional max for ordering, like so:
select type
from likes
group by type
order by max(case when id_user = 1425 then 1 else 0 end), type
If any row for the given type has an id_user that matches the chosen value, the conditional max returns 1, which puts it last in the result set. The second ordering criterion breaks the ties for groups that do not fulfill the condition.
If you are running MySQL, you can simplify the order by clause a little:
order by max(id_user = 1425), type
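As a self-contained check against the sample data (using a VALUES list in FROM, which Postgres and most other engines accept; the inline rows mirror the question's table):
select type
from (values (1420, 1), (1421, 3), (1422, 3), (1424, 7),
             (1425, 4), (1426, 2), (1427, 1)) as likes(id_user, type)
group by type
order by max(case when id_user = 1425 then 1 else 0 end), type;
-- returns: 1, 2, 3, 7, 4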

SQL Rows to Columns if column values are unknown

I have a table that has demographic information about a set of users which looks like this:
User_id Category IsMember
1 College 1
1 Married 0
1 Employed 1
1 Has_Kids 1
2 College 0
2 Married 1
2 Employed 1
3 College 0
3 Employed 0
The result set I want is a table that looks like this:
User_Id | College | Married | Employed | Has_Kids
1       | 1       | 0       | 1        | 1
2       | 0       | 1       | 1        | 0
3       | 0       | 0       | 0        | 0
In other words, the table indicates the presence or absence of a category for each user. Sometimes the user will have a row for a category where the value is false; sometimes the user will have no row for a category at all, in which case IsMember is assumed to be false.
Also, from time to time additional categories will be added to the data set, and I'm wondering if it's possible to do this query without knowing up front all the possible category names; in other words, I won't be able to specify all the column names I want in the result. (Note: only user 1 has the category "has_kids", and user 3 is missing a row for the category "married".)
(using Postgres)
Thanks.
You can use jsonb functions.
with titles as (
    -- build one jsonb object mapping each category name to itself (the header row)
    -- and one mapping each category to a default value of -1
    select jsonb_object_agg(Category, Category) as titles,
           jsonb_object_agg(Category, -1) as defaults
    from demog
),
the_rows as (
    -- the header row (null id, so it sorts first), followed by one row per user
    -- with the defaults overridden by the user's actual IsMember values
    select null::bigint as id, titles as data
    from titles
    union
    select User_id, defaults || jsonb_object_agg(Category, IsMember)
    from demog, titles
    group by User_id, defaults
)
-- expand each jsonb row into key/value pairs and reassemble them
-- in a stable (alphabetical) column order
select id, string_agg(value, '|' order by key)
from (
    select id, key, value
    from the_rows, jsonb_each_text(data)
) x
group by id
order by id nulls first
You can see a running example at http://rextester.com/QEGT70842
You can replace -1 with 0 for the default value, and '|' with ',' for the separator.
You can also install the tablefunc module and use the crosstab function.
https://www.postgresql.org/docs/9.1/static/tablefunc.html
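A minimal sketch of the crosstab approach, assuming the table is named demog as in the answer above. Note that crosstab still requires the output columns to be listed explicitly in the AS clause, so it only partly addresses the unknown-categories requirement:
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    -- source rows: row identifier, category, value
    'select User_id, Category, IsMember from demog order by 1, 2',
    -- the full list of categories, in a fixed order matching the AS clause
    'select distinct Category from demog order by 1'
) AS ct(User_id int, College int, Employed int, Has_Kids int, Married int);
-- categories with no row for a user come back as NULL rather than 0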
I found a Postgres function script called colpivot, which does the trick. I ran the script to create the function, then created the table in one statement:
select colpivot ('_pivoted', 'select * from user_categories', array['user_id'],
array ['category'], '#.is_member', null);

Sybase SQL CASE with CAST

I have a Sybase table (which I can't alter) that I am trying to get into a specific format. The table contains three columns, all string values: an id (which is not unique), a "position" number that represents a field name, and a field column that holds the value. The table looks like:
id position field
100 0 John
100 1 Jane
100 2 25
100 3 50
101 0 Dave
101 3 30
Position 0 means "SalesRep1", Position 1 means "SalesRep2", Position 2 means "SR1Commission", and Position 3 means "SR2Commission".
I am trying to get a table that looks like following, with the Commission columns being decimals instead of strings:
id   SalesRep1  SR1Commission  SalesRep2  SR2Commission
100  John       25             Jane       50
101  Dave       NULL           NULL       30
I've gotten close using CASE, but I end up with only one value per row, and I'm not sure there's a way to do what I want. I also have trouble getting CAST included to change the commission values from strings to decimals. Here's what I have so far:
SELECT id,
    CASE "position" WHEN '0' THEN field END AS SalesRep1,
    CASE "position" WHEN '2' THEN field END AS SR1Commission,
    CASE "position" WHEN '1' THEN field END AS SalesRep2,
    CASE "position" WHEN '3' THEN field END AS SR2Commission
FROM v_custom_field WHERE id = ?
This gives me the following result when querying for id 100:
id SalesRep1 SR1Commission SalesRep2 SR2Commission
100 John NULL NULL NULL
100 NULL 25 NULL NULL
100 NULL NULL Jane NULL
100 NULL NULL NULL 50
This is close, but I want to 'collapse' the rows down into one row based on the id, as well as cast the commission values to numbers. I tried adding in a CAST(field AS DECIMAL), but I'm not sure if this is even the right direction to go. I was also looking into PIVOT, but Sybase doesn't seem to support that.
This is known as an entity-attribute-value table. They're a pain to work with because they're one step removed from being relational data, but they're very common for user-defined fields in applications.
If you can't use PIVOT, you'll need to do something like this:
SELECT DISTINCT s.id,
    f0.field AS SalesRep1,
    CAST(f2.field AS DECIMAL(20,5)) AS SR1Commission,
    f1.field AS SalesRep2,
    CAST(f3.field AS DECIMAL(20,5)) AS SR2Commission
FROM UnnamedSalesTable s
LEFT JOIN UnnamedSalesTable f0
ON f0.id = s.id AND f0.position = 0
LEFT JOIN UnnamedSalesTable f1
ON f1.id = s.id AND f1.position = 1
LEFT JOIN UnnamedSalesTable f2
ON f2.id = s.id AND f2.position = 2
LEFT JOIN UnnamedSalesTable f3
ON f3.id = s.id AND f3.position = 3
It's not very fast because it's a ton of self-joins followed by a DISTINCT, but it does work.
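Alternatively, here is a sketch of how the asker's own CASE query could be collapsed with conditional aggregation instead of self-joins (one MAX per position, grouped by id, keeping the same position-to-column mapping):
SELECT id,
    MAX(CASE "position" WHEN '0' THEN field END) AS SalesRep1,
    CAST(MAX(CASE "position" WHEN '2' THEN field END) AS DECIMAL(20,5)) AS SR1Commission,
    MAX(CASE "position" WHEN '1' THEN field END) AS SalesRep2,
    CAST(MAX(CASE "position" WHEN '3' THEN field END) AS DECIMAL(20,5)) AS SR2Commission
FROM v_custom_field
GROUP BY id
-- MAX ignores the NULLs produced by the non-matching CASE branches,
-- so each id collapses to a single row without the self-joins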

My aggregate is not affected by ROLLUP

I have a query similar to the following:
SELECT CASE WHEN (GROUPING(Name) = 1) THEN 'All' ELSE Name END AS Name,
       CASE WHEN (GROUPING(Type) = 1) THEN 'All' ELSE Type END AS Type,
       sum(quantity) AS [Quantity],
       CAST(sum(quantity) * (SELECT QuantityMultiplier FROM QuantityMultipliers WHERE a = t.b) AS DECIMAL(18,2)) AS [Multiplied Quantity]
FROM #Table t
GROUP BY Name, Type WITH ROLLUP
I'm trying to return a list of Names, Types, a summed Quantity and a summed quantity multiplied by an arbitrary number. All fine so far. I also need to return a sub-total row per Name and per Type, such as the following
Name    Type    Quantity    Multiplied Quantity
------- ------- ----------- -------------------
a       1       2           4
a       2       3           3
a       ALL     5           7
b       1       6           12
b       2       1           1
b       ALL     7           13
ALL     ALL     12          20
The first three columns are fine. I'm getting null values in the rollup rows for the multiplied quantity, though. The only reason I can think of for this happening is that SQL doesn't recognize the last column as an aggregate now that I've multiplied it by something.
Can I somehow work around this without things getting too convoluted?
I will be falling back onto temporary tables if this can't be done.
In your sub-query to acquire the multiplier, you have WHERE a = t.b. Is either a or b from the tables in your main query?
If these values are static (nothing to do with the main query), it looks like it should be fine...
If the a or b values are the Name or Type field, they can be NULL for the rollup records. If so, you can change to something similar to...
CAST(sum(quantity * (<multiplier_query>)) AS DECIMAL(18,2))
If a or b is some other field from your main query, you'd be getting multiple records back, not just a single multiplier. You could change to something like...
CAST(sum(quantity) * (SELECT MAX(multiplier) FROM ...) AS DECIMAL(18,2))
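For the first case, a sketch of the same idea written as a join, so the multiplication happens per row inside SUM and ROLLUP fills the sub-total rows like any other aggregate (the join condition qm.a = t.b is taken from the correlated sub-query, and this assumes each row matches exactly one multiplier):
SELECT CASE WHEN (GROUPING(t.Name) = 1) THEN 'All' ELSE t.Name END AS Name,
       CASE WHEN (GROUPING(t.Type) = 1) THEN 'All' ELSE t.Type END AS Type,
       sum(t.quantity) AS [Quantity],
       -- multiplying inside the aggregate keeps this a true aggregate
       -- expression, so the rollup rows are no longer NULL
       CAST(sum(t.quantity * qm.QuantityMultiplier) AS DECIMAL(18,2)) AS [Multiplied Quantity]
FROM #Table t
JOIN QuantityMultipliers qm ON qm.a = t.b
GROUP BY t.Name, t.Type WITH ROLLUP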