I have a table in AWS Athena with the following structure.
Order_id Item_Count Path_List order_date
1 1 A, B, C, A, W 2022-08-23
2 3 C, A, D, Z 2022-08-21
Path_List is of type array of strings.
The first row indicates that order_id 1 had 1 item which was purchased after the item passed path A->B->C->A->W.
Similarly, 2nd row indicates that order_id 2 had 3 items which were purchased after the items passed path C->A->D->Z
Now I need to write a SQL query which gives me the list of all the path stages and their total contribution to the orders for a given time range.
So to print distinct path items, I have written the below query.
select path_item
from table_name
cross join unnest(path_list) as t(path_item)
where date(order_date) <= current_date
  and date(order_date) >= current_date - interval '6' day
group by 1
So I get all the individual path stages for all the path_list. The output is as follows:
A
B
C
W
D
Z
Now I want to find how much each stage (A, C, D, etc.) contributed to the overall products purchased.
To find the contribution of A, it should be
(products that passed through A) / (total path items)
= 5/17
total path items = 1*5 + 3*4 = 17 (order 1: 1 item * 5 stages; order 2: 3 items * 4 stages)
items with path A = 2*1 + 1*3 = 5 (A appears twice in order 1 and once in order 2)
Similarly, for B and W, it will be
1/17
For C, 4/17
for D, 3/17
For Z, 3/17
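As a sanity check, the counting rule above can be reproduced in plain Python (this is just an illustration using the two sample orders, not Athena code):

```python
from collections import Counter

# Sample orders: (item_count, path_list), taken from the table above
orders = [
    (1, ["A", "B", "C", "A", "W"]),
    (3, ["C", "A", "D", "Z"]),
]

# Each order contributes item_count to every stage occurrence in its path
stage_counts = Counter()
for item_count, path in orders:
    for stage in path:
        stage_counts[stage] += item_count

total = sum(stage_counts.values())  # 1*5 + 3*4 = 17
for stage in sorted(stage_counts):
    print(f"{stage} {stage_counts[stage]}/{total}")
```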
Can you suggest the SQL to print the output below?
A 5/17
B 1/17
C 4/17
W 1/17
D 3/17
Z 3/17
You can use GROUP BY to process the unnested data (note the succinct comma-join style, which lets you skip CROSS JOIN for UNNEST), and either a window function or a subquery to count the total number of elements for the divisor. With a window function:
-- sample data
with dataset (Order_id, Item_Count, Path_List, order_date) as (
values (1, 1, array['A', 'B', 'C', 'A', 'W'], '2022-08-23'),
(2, 3, array['C', 'A', 'D', 'Z' ], '2022-08-21')
)
-- query
select edge,
cast(edge_cnt as varchar)
|| '/'
|| cast(sum(edge_cnt) over (range between unbounded preceding and UNBOUNDED FOLLOWING) as varchar)
from (select sum(Item_Count) edge_cnt, edge
from dataset,
unnest(Path_List) as t(edge)
group by edge)
order by edge;
With subquery:
-- query
select edge,
cast(edge_cnt as varchar)
|| '/'
|| cast((select sum(Item_Count * cardinality(Path_List)) from dataset) as varchar)
from (select sum(Item_Count) edge_cnt, edge
from dataset,
unnest(Path_List) as t(edge)
group by edge)
order by edge;
Output:

edge  _col1
A     5/17
B     1/17
C     4/17
D     3/17
W     1/17
Z     3/17
Suppose I have the following table:
Day         ID  Value
2022-11-05  0   A
2022-11-06  1   B
2022-11-07  0   C
Now, given a time window of 1 day, I want to create a time-series table where:
- the Day column's granular unit is 1 day
- each Day row displays every ID in the table (like a cross join)
- if the ID is not recorded for that day, it uses the Value from the previous day; if it does not exist before that day, we can ignore it.
Let's say I want to view this time series from 2022-11-05 to 2022-11-08, this is the desired output:
Day         ID  Value
2022-11-05  0   A
2022-11-06  0   A
2022-11-06  1   B
2022-11-07  0   C
2022-11-07  1   B
2022-11-08  0   C
2022-11-08  1   B
Explanation: ID=0 is not recorded on 11-06 so it uses the value from the previous day. ID=1 does not record new value on 11-07 so it uses the value from 11-06.
Note that the number of columns can be large, so if possible, I am looking for a solution that handles it too.
Way One:
first we start with some data
then we find the_days in the period we are interested in
then we find the data_start for each id
then we join those values together, and use LAG ... IGNORE NULLS over a window to find the "prior value" when the current value is not present, via NVL
with data(Day, ID, Value) as (
select * from values
('2022-11-05'::date, 0, 'A'),
('2022-11-06'::date, 1, 'B'),
('2022-11-07'::date, 0, 'C')
), the_days as (
select
row_number() over (order by null)-1 as rn
,dateadd('day', rn, from_day) as day
from (
select
min(day) as from_day
,'2022-11-08' as to_day
,datediff('days', from_day, to_day) as days
from data
), table(generator(ROWCOUNT => 200))
qualify rn <= days
), data_starts as (
select
id,
min(day) as start_day
from data
group by 1
)
select
td.day,
ds.id,
nvl(d.value, lag(d.value) ignore nulls over (partition by ds.id order by td.day)) as value
from data_starts as ds
join the_days as td
on td.day >= ds.start_day
left join data as d
on ds.id = d.id and d.day = td.day
order by 1,2;
gives:
DAY         ID  VALUE
2022-11-05  0   A
2022-11-06  0   A
2022-11-06  1   B
2022-11-07  0   C
2022-11-07  1   B
2022-11-08  0   C
2022-11-08  1   B
Way Two:
with data(Day, ID, Value) as (
select * from values
('2022-11-05'::date, 0, 'A'),
('2022-11-06'::date, 1, 'B'),
('2022-11-07'::date, 0, 'C')
), the_days as (
select
dateadd('day', row_number() over (order by null)-1, '2022-11-05') as day
from table(generator(ROWCOUNT => 4))
)
select
td.day,
i.id,
nvl(d.value, lag(d.value) ignore nulls over (partition by i.id order by td.day)) as _value
from the_days as td
cross join (select distinct id from data) as i
left join data as d
on i.id = d.id and d.day = td.day
qualify _value is not null
order by 1,2;
This requires a unique name for the _value output column so it can be referenced in the QUALIFY clause without duplicating the expression.
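The carry-forward logic that LAG ... IGNORE NULLS plus QUALIFY implements can be sketched in plain Python (a minimal illustration of the same fill-forward rule against the sample data, not Snowflake code):

```python
from datetime import date, timedelta

# Sample data: day -> {id: value}
data = {
    date(2022, 11, 5): {0: "A"},
    date(2022, 11, 6): {1: "B"},
    date(2022, 11, 7): {0: "C"},
}

start, end = date(2022, 11, 5), date(2022, 11, 8)
ids = {i for vals in data.values() for i in vals}

rows = []
last_seen = {}  # id -> last recorded value (the LAG ... IGNORE NULLS role)
day = start
while day <= end:
    for i in sorted(ids):
        if i in data.get(day, {}):
            last_seen[i] = data[day][i]
        if i in last_seen:  # skip ids not yet recorded (QUALIFY _value IS NOT NULL)
            rows.append((day.isoformat(), i, last_seen[i]))
    day += timedelta(days=1)

for r in rows:
    print(*r)
```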
I am not very fluent with SQL; I'm just facing a little issue writing an efficient query. I have a table with a composite key of columns A and B, as shown below:
A  B  C
1  1  4
1  2  5
1  3  3
2  2  4
2  1  5
3  1  4
So what I need is to find rows where column C has both values of 4 and 5 (4 and 5 are just examples) for a particular value of column A. So 4 and 5 are present for two A values (1 and 2). For A value 3, 4 is present but 5 is not, hence we cannot take it.
My explanation is so confusing. I hope you get it.
After this, I need to keep only those where the B value for 4 (the first number) is less than the B value for 5 (the second number). In this case, for A=1, row (A=1, B=1, C=4) has a smaller B value than row (A=1, B=2, C=5), so we take it. For A=2, row (A=2, B=2, C=4) has a B value greater than row (A=2, B=1, C=5), hence we cannot take it.
I Hope someone gets it and helps. Thanks.
This finds rows where both c=4 and c=5 are present for a given a, and where ordering by b and ordering by c agree:
select a, b, c
from (
select tbl.*,
count(*) over(partition by a) cnt,
row_number() over (partition by a order by b) brn,
row_number() over (partition by a order by c) crn
from tbl
where c in (4, 5)
) t
where cnt = 2 and brn = crn;
EDIT
If the order of the parameters matters, the position of each parameter must be set explicitly, comparing the ordering by b to the explicit parameter position:
with params(val, pos) as (
select 4,2 union all
select 5,1
)
select a, b, c
from (
select tbl.*,
count(*) over(partition by a) cnt,
row_number() over (partition by a order by b) brn,
p.pos
from tbl
join params p on tbl.c = p.val
) t
where cnt = (select count(*) from params) and brn = pos;
I assume you want the values of a where this is true. If so, you can use aggregation:
select a
from t
where c in (4, 5)
group by a
having count(distinct c) = 2;
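The combined condition from the question (both values present for an a, and the b for c=4 smaller than the b for c=5) can be cross-checked in plain Python against the sample data (an illustration only, not part of either SQL answer):

```python
# Rows (a, b, c) from the sample table
rows = [
    (1, 1, 4), (1, 2, 5), (1, 3, 3),
    (2, 2, 4), (2, 1, 5),
    (3, 1, 4),
]

target_first, target_second = 4, 5  # c must contain both, with b(4) < b(5)

result = []
for a in sorted({r[0] for r in rows}):
    # Map each target c value to its b value for this a
    b_for = {c: b for (a2, b, c) in rows if a2 == a and c in (target_first, target_second)}
    if (target_first in b_for and target_second in b_for
            and b_for[target_first] < b_for[target_second]):
        result.append(a)

print(result)  # only a=1 satisfies both conditions
```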
I'm trying to get exclusive max values from a query.
My first query (raw data) is something like that:
Material¦Fornecedor
X B
X B
X B
X C
X C
Y B
Y D
Y D
Firstly, I need to create the max-values query for the table above. For that, I need to count consecutive rows with the same Material AND Fornecedor; I mean, I need to count until SQL finds a line that shows a different Material and Fornecedor.
After that, I'll get an result as showed below (max_line is the number of times that it found a line with same material and fornecedor):
max_line¦Material¦Fornecedor
3 X B
2 X C
1 Y B
2 Y D
In the end, I need to get, for each Material, the row with the highest max_line. The result of the query that I need to construct, based on the table above, should be like this:
max_line¦Material¦Fornecedor
3 X B
2 Y D
My code, so far, is showed below:
select * from
(SELECT max(w2.line) as max_line, w2.Material, w2.[fornecedor] FROM
(SELECT w.Material, ROW_NUMBER() OVER(PARTITION BY w.Material, w.[fornecedor]
ORDER BY w.[fornecedor] DESC) as line, w.[fornecedor]
FROM [Database].[dbo].['Table1'] w) as w2
group by w2.Material, w2.[fornecedor]) as w1
inner join (SELECT w1.Material, MAX(w1.max_line) AS maximo FROM w1 GROUP BY w1.material) as w3
ON w1.Material = w3.Material AND w1.row = w3.maximo
I'm stuck on the inner join, since I can't alias a subquery and then reference that alias in a later join.
Could you, please, help me?
Thank you,
Use a window function to find the max row number then filter by it.
SELECT MAXROW, w1.Material, w1.[fornecedor]
FROM (
SELECT w2.Material, w2.[fornecedor]
, max([ROW]) over (partition by Material) MAXROW
FROM (
SELECT w.Material, w.[fornecedor]
, ROW_NUMBER() OVER (PARTITION BY w.Material, w.[fornecedor] ORDER BY w.[fornecedor] DESC) as [ROW]
FROM [Database].[dbo].['Table1'] w
) AS w2
) AS w1
WHERE w1.[ROW] = w1.MAXROW;
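The two steps (count runs, then keep the biggest per Material) can be sketched in plain Python; note that itertools.groupby counts strictly consecutive pairs, which matches the "count until a different line appears" wording, whereas the ROW_NUMBER approach above counts all rows per pair (equivalent when the data is already grouped, as in the sample):

```python
from itertools import groupby

# Raw rows: (Material, Fornecedor), in their original order
rows = [
    ("X", "B"), ("X", "B"), ("X", "B"),
    ("X", "C"), ("X", "C"),
    ("Y", "B"),
    ("Y", "D"), ("Y", "D"),
]

# Step 1: run lengths of consecutive identical (Material, Fornecedor) pairs
runs = [(len(list(g)), mat, forn) for (mat, forn), g in groupby(rows)]

# Step 2: for each Material, keep the pair with the highest count
best = {}
for cnt, mat, forn in runs:
    if mat not in best or cnt > best[mat][0]:
        best[mat] = (cnt, forn)

for mat in sorted(best):
    print(best[mat][0], mat, best[mat][1])
```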
I have a table in the following format:
A B C D
7 7 2 12
2 2 3 4
2 2 2 4
2 2 2 3
5 5 2 7
I would like to calculate correlations between each of the columns using the built-in aggregate function corr(y, x) → double (https://prestodb.io/docs/current/functions/aggregate.html).
I could run over all the columns and perform the corr calculation each time with:
select corr(A,B) from table
but I would like to reduce the number of times I access Presto and run it in one query if it's possible.
Would it be possible to get as a result the column names that pass a certain threshold or at least the correlation scores between all possible combinations in one query?
Thanks.
I would like to calculate correlations between each of the columns
Correlation involves two series of data (in SQL, two columns). So I understand your question as: how to compute the correlation for each and every possible combination of columns in the table. That would look like:
select
corr(a, b) corr_a_b,
corr(a, c) corr_a_c,
corr(a, d) corr_a_d,
corr(b, c) corr_b_c,
corr(b, d) corr_b_d,
corr(c, d) corr_c_d
from mytable
You can use a lateral join to unpivot the table, then a self join and aggregation:
with v as (
select v.*, t.id
from (select t.*,
row_number() over (order by a) as id
from t
) t cross join lateral
(values ('a', a), ('b', b), ('c', c), ('d', d)
) v(col, val)
)
select v1.col, v2.col, corr(v1.val, v2.val)
from v v1 join
v v2
on v1.id = v2.id and v1.col < v2.col
group by v1.col, v2.col;
The row_number() is only to generate a unique id for each row, which is then used for the self-join. You may already have a column with this information, so that might not be necessary.
I have a problem where I have a large count of values on one side (a) and need to sum them up to a single value on the other side (z). There is no logical grouping to get to the total value (z).
On side (a) there are 10000+ items that need to be summed to a single value on the (z) side. Not all of the values on side (a) are needed to sum up to (z).
(a)   (z)
123   2
321   19
234   100
122
1
23
1
19
77
Expected output:
(a) 1 + 1   = (z) 2
(a) 19      = (z) 19
(a) 23 + 77 = (z) 100
Sum(a) to equal a value in (z)
My current code groups on date but now that will not work as I do not have a predefined date range.
Current code:
Select * From
(
Select sum(amount), date
From (a)
Group by date
) a
Inner join
(
Select amount,date
From (z)
) b on a.date = b.date
Where a.Amount - b.Amount = 0
This sounds like a self-join (note it only covers sums of one or two values from (a)):
select z.amount, a1.amount, a2.amount
from z z left join
     (a a1 left join
      a a2
      on a1.amount < a2.amount
     )
     on z.amount = a1.amount + coalesce(a2.amount, 0);
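In general, "which subset of (a) sums to each (z)" is a subset-sum problem, which plain SQL joins handle poorly beyond pairs. A brute-force sketch in Python, with an assumed subset-size cap of 3 to keep the search tractable (for 10000+ values a real solver or dynamic programming would be needed):

```python
from itertools import combinations

a = [123, 321, 234, 122, 1, 23, 1, 19, 77]
z = [2, 19, 100]

# Brute force: for each target, try subsets of (a) of increasing size
matches = {}
for target in z:
    for size in range(1, 4):  # cap subset size to keep the search tractable
        for combo in combinations(a, size):
            if sum(combo) == target:
                matches[target] = combo
                break
        if target in matches:
            break

print(matches)
```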