Efficient query to Group by column name in SQL or hive - sql

Imagine I have a table with 2 columns m_1 and m_2:
m1 | m2
3 | 17
3 | 18
4 | 17
9 | 9
I would like to get a table with 3 columns:
m identifies the source column (m_1 or m_2 in my example).
d is the value stored in that column.
count is the number of occurrences of each value, grouped by value and column.
In the example, the result is:
m | d | count
m_1 | 3 | 2
m_1 | 4 | 1
m_1 | 9 | 1
m_2 | 17| 2
m_2 | 18| 1
m_2 | 9 | 1
The first line must be read as 'data 3 occurs 2 times in column m_1'.
A naive solution is to run a parameterized query once per column, like this:
for (i in 1 .. 2)
SELECT CONCAT('m_', i), m_i, count(*) FROM table GROUP BY m_i
But this approach scans the table once per column (twice in this example). This is a problem since I have 255 m columns and billions of rows.
Would the solution be easier if I used Hive instead of a relational database?

You can write this using union all and group by:
select colname, d, count(*)
from ((select 'm_1' as colname, m1 as d from t) union all
(select 'm_2' as colname, m2 as d from t)
) m12
group by colname, d;
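To check the query above against the sample data, here is a minimal sketch that runs it in an in-memory SQLite database from Python. The table name t and the alias cnt are assumptions for illustration, and SQLite stands in for whatever engine is actually used. Note that each branch of the UNION ALL reads t, so with 255 columns an engine may still scan the table many times unless it optimizes this.

```python
import sqlite3

# In-memory SQLite database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (m1 INTEGER, m2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(3, 17), (3, 18), (4, 17), (9, 9)])

# Same shape as the query above. SQLite does not accept parentheses around
# the branches of a compound SELECT, so they are dropped here.
query = """
SELECT colname, d, COUNT(*) AS cnt
FROM (
    SELECT 'm_1' AS colname, m1 AS d FROM t
    UNION ALL
    SELECT 'm_2' AS colname, m2 AS d FROM t
) AS m12
GROUP BY colname, d
ORDER BY colname, d
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```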

In Hive, you can use posexplode(array(m1,m2)):
select concat('m_',cast(pe.pos+1 as string)) as m
,pe.val as d
,count(*) as `count`
from mytable t
lateral view posexplode(array(m1,m2)) pe
group by pos
,val
;
+------+-----+--------+
| m | d | count |
+------+-----+--------+
| m_1 | 3 | 2 |
| m_1 | 4 | 1 |
| m_1 | 9 | 1 |
| m_2 | 9 | 1 |
| m_2 | 17 | 2 |
| m_2 | 18 | 1 |
+------+-----+--------+
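What posexplode does here can be sketched in plain Python: every row is visited once, each value is emitted together with its position, and the (position, value) pairs are counted. This only illustrates the semantics on the sample data; it is not how Hive executes the query.

```python
from collections import Counter

# Sample rows from the question: (m1, m2).
rows = [(3, 17), (3, 18), (4, 17), (9, 9)]

counts = Counter()
for row in rows:                      # single pass over the table
    for pos, val in enumerate(row):   # like posexplode(array(m1, m2))
        counts[(f"m_{pos + 1}", val)] += 1

# Flatten to (m, d, count) tuples, ordered by column name and value.
result = sorted((m, d, n) for (m, d), n in counts.items())
for line in result:
    print(line)
```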

Related

Postgres - Unique values for id column using CTE, Joins alongside GROUP BY

I have a table referrals:
id | user_id_owner | firstname | is_active | user_type | referred_at
----+---------------+-----------+-----------+-----------+-------------
3 | 2 | c | t | agent | 3
5 | 3 | e | f | customer | 5
4 | 1 | d | t | agent | 4
2 | 1 | b | f | agent | 2
1 | 1 | a | t | agent | 1
And another table activations
id | user_id_owner | referral_id | amount_earned | activated_at | app_id
----+---------------+-------------+---------------+--------------+--------
2 | 2 | 3 | 3.0 | 3 | a
4 | 1 | 1 | 6.0 | 5 | b
5 | 4 | 4 | 3.0 | 6 | c
1 | 1 | 2 | 2.0 | 2 | b
3 | 1 | 2 | 5.0 | 4 | b
6 | 1 | 2 | 7.0 | 8 | a
I am trying to generate another table from the two tables that has only unique values for referrals.id and returns as one of the columns the count for each apps as best_selling_app_count.
Here is the query I ran:
with agents
as
(select
referrals.id,
referral_id,
amount_earned,
referred_at,
activated_at,
activations.app_id
from referrals
left outer join activations
on (referrals.id = activations.referral_id)
where referrals.user_id_owner = 1),
distinct_referrals_by_id
as
(select
id,
count(referral_id) as activations_count,
sum(coalesce(amount_earned, 0)) as amount_earned,
referred_at,
max(activated_at) as last_activated_at
from
agents
group by id, referred_at),
distinct_referrals_by_app_id
as
(select id, app_id as best_selling_app,
count(app_id) as best_selling_app_count
from agents
group by id, app_id )
select *, dense_rank() over (order by best_selling_app_count desc) best_selling_app_rank
from distinct_referrals_by_id
inner join distinct_referrals_by_app_id
on (distinct_referrals_by_id.id = distinct_referrals_by_app_id.id);
Here is the result I got:
id | activations_count | amount_earned | referred_at | last_activated_at | id | best_selling_app | best_selling_app_count | best_selling_app_rank
----+-------------------+---------------+-------------+-------------------+----+------------------+------------------------+-----------------------
2 | 3 | 14.0 | 2 | 8 | 2 | b | 2 | 1
1 | 1 | 6.0 | 1 | 5 | 1 | b | 1 | 2
2 | 3 | 14.0 | 2 | 8 | 2 | a | 1 | 2
4 | 1 | 3.0 | 4 | 6 | 4 | c | 1 | 2
The problem with this result is that the table has a duplicate id of 2. I only need unique values for the id column.
I tried a workaround using DISTINCT ON that gave the desired result, but I fear the query results may not be reliable or consistent.
Here is the workaround query:
with agents
as
(select
referrals.id,
referral_id,
amount_earned,
referred_at,
activated_at,
activations.app_id
from referrals
left outer join activations
on (referrals.id = activations.referral_id)
where referrals.user_id_owner = 1),
distinct_referrals_by_id
as
(select
id,
count(referral_id) as activations_count,
sum(coalesce(amount_earned, 0)) as amount_earned,
referred_at,
max(activated_at) as last_activated_at
from
agents
group by id, referred_at),
distinct_referrals_by_app_id
as
(select
distinct on (id) id, app_id as best_selling_app,
count(app_id) as best_selling_app_count
from agents
group by id, app_id
order by id, best_selling_app_count desc)
select *, dense_rank() over (order by best_selling_app_count desc) best_selling_app_rank
from distinct_referrals_by_id
inner join distinct_referrals_by_app_id
on (distinct_referrals_by_id.id = distinct_referrals_by_app_id.id);
I need a recommendation on how best to achieve this.
I am trying to generate another table from the two tables that has only unique values for referrals.id and returns as one of the columns the count for each apps as best_selling_app_count.
Your question is really complicated with a very complicated SQL query. However, the above is what looks like the actual question. If so, you can use:
select r.*,
a.app_id as most_common_app_id,
a.cnt as most_common_app_id_count
from referrals r left join
(select distinct on (a.referral_id) a.referral_id, a.app_id, count(*) as cnt
from activations a
group by a.referral_id, a.app_id
order by a.referral_id, count(*) desc
) a
on a.referral_id = r.id;
You have not explained the other columns that are in your result set.
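The DISTINCT ON step in the query above can be sketched in Python as "sort by referral, then by count descending, and keep the first row per referral". The data is the (referral_id, app_id) pairs from the sample activations table. The extra app_id tiebreaker in the sort key is an assumption added here: without a tiebreaker in ORDER BY, DISTINCT ON may return any of the tied rows.

```python
from collections import Counter

# (referral_id, app_id) pairs taken from the sample activations table.
activations = [(3, "a"), (1, "b"), (4, "c"), (2, "b"), (2, "b"), (2, "a")]

# GROUP BY referral_id, app_id with COUNT(*).
counts = Counter(activations)

# ORDER BY referral_id, count DESC, app_id (the app_id tiebreaker is an
# addition for determinism), then keep the first row per referral_id.
ordered = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1]))
best = {}
for (referral_id, app_id), cnt in ordered:
    best.setdefault(referral_id, (app_id, cnt))

for referral_id, (app_id, cnt) in best.items():
    print(referral_id, app_id, cnt)
```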

Iterate over the rows of a second table to return resultset with cumulative sum

Yesterday, with the help of an SO user on the question "Iterate over the rows of a second table to return resultset", I was able to build the combinations of rows with a self-join.
After some modifications to adapt it to my implementation, I ran into a new challenge that has me stuck: how do I compute an aggregate sum over a third column?
My issue is better explained in the image below:
Based on the code
SELECT
b1.table_a_id,
b1.label_x,
b2.label_y
FROM table_a a
INNER JOIN table_b b1
ON b1.table_a_id = a.table_a_id
INNER JOIN table_b b2
ON b2.table_a_id = b1.table_a_id AND
b2.label_y > b1.label_x
ORDER BY
b1.table_a_id,
b1.label_x,
b2.label_y;
I was able to acquire the combinations.
What should be the next step to get the cumulative sum based on a third column?
I couldn't think of a solution that avoids a second service, such as Python with pandas and its cumsum function.
To generate the expected resultset, you would need to join the table with itself with an inequality condition on the order column. Then, you can do a window sum:
select
t1.table_a_id,
t1.label_x,
t2.label_y,
sum(t2.value) over(
partition by t1.table_a_id, t1.label_x
order by t1."order", t2."order"
) agg_value
from
table_b t1
inner join table_b t2
on t1.table_a_id = t2.table_a_id
and t2."order" >= t1."order"
order by t1."order", t2."order"
Note: order is a reserved word, so it needs to be quoted; if your actual database column has a different name, you can remove the double quotes.
Demo on DB Fiddle:
TABLE_A_ID | LABEL_X | LABEL_Y | AGG_VALUE
---------: | :------ | :------ | --------:
1 | A | B | 1
1 | A | C | 3
1 | A | D | 6
1 | A | E | 10
1 | A | F | 15
1 | B | C | 2
1 | B | D | 5
1 | B | E | 9
1 | B | F | 14
1 | C | D | 3
1 | C | E | 7
1 | C | F | 12
1 | D | E | 4
1 | D | F | 9
1 | E | F | 5
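The running total in that output can be reproduced with a small Python sketch. The per-label values below are an assumption inferred from the demo output (B=1 through F=5; A's value is never used because only strictly later labels are paired), and table_a_id is fixed to 1 as in the demo.

```python
from itertools import accumulate

# (label, value) pairs in "order" sequence. The values are assumed from
# the demo output, not taken from the original table.
rows = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4), ("F", 5)]
table_a_id = 1

result = []
for i, (label_x, _) in enumerate(rows):
    later = rows[i + 1:]                                 # labels after label_x
    running = accumulate(value for _, value in later)    # cumulative sum
    for (label_y, _), agg_value in zip(later, running):
        result.append((table_a_id, label_x, label_y, agg_value))

for line in result:
    print(line)
```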
You seem to want a cumulative sum:
SELECT b1.table_a_id, b1.label_x, b2.label_y,
SUM(b1.value) OVER (PARTITION BY b1.table_a_id, b1.label_x
ORDER BY b2.order
) as AGG_VALUE

Grouping over the subquery in SQL on unique id

I have a query that gets results from a temp table. It has aggregate columns which are derived from the temp table:
SELECT
DISTINCT
SUM(a),
SUM(b),
c,
d,
id1
FROM
#tmpTable
.
.
.
join with many other tables
I want to now get the SUM of columns c & d returned from the query along with all other columns. It will be group by id1. It should look something like:
Sum(A) | Sum(B) | C | D | id1
-------+--------+---+---+-----
12     | 34     | 1 | 3 | 1
22     | 37     | 2 | 4 | 2
33     | 55     | 3 | 5 | 1
44     | 25     | 5 | 6 | 2
Final result should be this:
Sum(A) | Sum(B) | Sum(C) | Sum(D) | id1
-------+--------+--------+--------+-----
12     | 34     | 4      | 8      | 1
22     | 37     | 7      | 10     | 2
33     | 55     | 4      | 8      | 1
44     | 25     | 7      | 10     | 2
select
x.sum_a,
x.sum_b,
x.sum_c,
x.sum_d,
t.id1
from
tmpTable t
join
(
select
id1,
sum(A) as sum_a,
sum(B) as sum_b,
sum(C) as sum_c,
sum(D) as sum_d
from
tmpTable
group by
id1
) x on t.id1 = x.id1
Seeing as you have different grouping criteria for A and B, you can group them separately from C and D. The below (using a common table expression) might start you on the right track:
; with SummaryValues AS
(
select id1, sum(C) as SumC, SUM(D) as SumD
from #SourceTable
group by id1
)
select SUM(st.A), SUM(st.b), sv.SumC, sv.SumD, st.id1
from #SourceTable st
inner join SummaryValues sv
on st.id1 = sv.id1
group by <whatever grouping you are using>
If your current real query is summing up a and b the way you want and generating that first sample output, maybe something like:
SELECT DISTINCT
SUM(a),
SUM(b),
SUM(c) OVER (PARTITION BY id1),
SUM(d) OVER (PARTITION BY id1),
id1
FROM
#tmpTable
.
.
.
join with many other tables
to get the second one.
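The effect of SUM(...) OVER (PARTITION BY id1) can be sketched in Python on the sample rows from the question: the totals are computed per id1, then attached to every row without collapsing the rows. The tuple layout is an assumption for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (sum_a, sum_b, c, d, id1).
rows = [
    (12, 34, 1, 3, 1),
    (22, 37, 2, 4, 2),
    (33, 55, 3, 5, 1),
    (44, 25, 5, 6, 2),
]

# First pass: totals of c and d per id1 (the window partitions).
totals = defaultdict(lambda: [0, 0])
for _, _, c, d, id1 in rows:
    totals[id1][0] += c
    totals[id1][1] += d

# Second pass: keep every row, replacing c and d with the partition totals.
result = [(a, b, totals[id1][0], totals[id1][1], id1) for a, b, _, _, id1 in rows]
for line in result:
    print(line)
```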

Comparing rows vs array elements postgres

I have a table A with n rows (200+) and different numeric columns.
I have a table B with m rows (100K+) and a column called multipliers, which is of type array (REAL[]). For every row in B, this array's length is n, i.e. there is one multiplier for each of the n rows in A. The array is sorted to match the alphabetical order of the id field in A.
Table A
id | values_1 | values_2
---|----------|-------------
1 | 11.2 | 10.2
2 | 21.9 | 12.5
3 | 30.0 | 26.0
4 | 98.0 | 11.8
Table B
id | multipliers
--------|-------------
dafcec | {2,3,4,9}
nkjhbn | {0,0,1,5}
ccseff | {1,2,0,5}
ddeecc | {0,0,0,1}
I need a query that returns the SUM( multipliers * values_1 ).
Like this:
b.id | sum(b.multipliers*a.values_1)
--------|----------------------------------
dafcec | 2*11.2 + 3*21.9 + 4*30.0 + 9*98.0
nkjhbn | 0*11.2 + 0*21.9 + 1*30.0 + 5*98.0
ccseff | 1*11.2 + 2*21.9 + 0*30.0 + 5*98.0
ddeecc | 0*11.2 + 0*21.9 + 0*30.0 + 1*98.0
I have tried with different subquerys, LATERAL joins and UNNEST, but I can't get a working result.
Any hints? Thanks!
The easiest, though I believe expensive, way to sum an array would be:
t=# with b as (select id,unnest(multipliers) u from b)
select distinct id, sum(u)over (partition by id) from b;
id | sum
----------+-----
ccseff | 8
nkjhbn | 6
ddeecc | 1
dafcec | 18
(4 rows)
No faster alternative comes to mind here.
Further, if I understand correctly, you want a Cartesian product (all against all):
t=# with b as (select id,unnest(multipliers) u from b)
, ag as (select distinct id, sum(u)over (partition by id) from b)
select ag.sum * v1, a.id aid, ag.id idb
from ag
join a on true;
?column? | aid | idb
----------+-----+----------
89.6 | 1 | ccseff
175.2 | 2 | ccseff
240 | 3 | ccseff
784 | 4 | ccseff
67.2 | 1 | nkjhbn
131.4 | 2 | nkjhbn
180 | 3 | nkjhbn
588 | 4 | nkjhbn
11.2 | 1 | ddeecc
21.9 | 2 | ddeecc
30 | 3 | ddeecc
98 | 4 | ddeecc
201.6 | 1 | dafcec
394.2 | 2 | dafcec
540 | 3 | dafcec
1764 | 4 | dafcec
(16 rows)
Solved it.
It just needs to pack the values into an array and unpack them so they are comparable. It worked for me. The ORDER BY makes sure the packing occurs in the desired order.
SELECT id, SUM (field * multiplier) result FROM
(
with c as (
SELECT array_agg( values_1 ORDER BY name ASC) val1
from A
)
, ag as (
select
distinct id,
multipliers
from B
)
SELECT
ag.id,
unnest(c.val1) field,
unnest(ag.multipliers) multiplier
FROM
c, ag
) s
GROUP BY id
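What that query computes is a dot product between each multipliers array and the ordered values_1 column. A minimal Python sketch on the sample data from the question, assuming the values are already in the same order as the multipliers:

```python
# values_1 from table A, in the order the multipliers refer to.
values_1 = [11.2, 21.9, 30.0, 98.0]

# multipliers from table B, keyed by B.id.
multipliers = {
    "dafcec": [2, 3, 4, 9],
    "nkjhbn": [0, 0, 1, 5],
    "ccseff": [1, 2, 0, 5],
    "ddeecc": [0, 0, 0, 1],
}

# Pair each multiplier with the value at the same position and sum the
# products, like unnest(...) side by side followed by SUM(field * multiplier).
result = {
    b_id: sum(m * v for m, v in zip(ms, values_1))
    for b_id, ms in multipliers.items()
}
for b_id, total in result.items():
    print(b_id, total)
```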

PostgreSQL distinct rows joined with a count of distinct values in one column

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
t.a, t.b, t.u, t.t, a.z
FROM (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM (
SELECT DISTINCT ON (1, 2, 3, 4)
md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
FROM t
) t
JOIN (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
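The logic of the queries above can be checked on the sample data with a short Python sketch: count the distinct b values per a, drop duplicate rows, and keep only groups where the count exceeds 2. This illustrates the expected result only and says nothing about performance on 13 million rows.

```python
from collections import defaultdict

# Sample rows from the question: (a, b, u, t).
rows = [
    ("foo", 1, 1, 10), ("foo", 1, 2, 11), ("foo", 1, 2, 11),
    ("foo", 2, 4, 1), ("foo", 3, 5, 2),
    ("bar", 1, 6, 2), ("bar", 2, 7, 2), ("bar", 2, 8, 3),
    ("bar", 3, 9, 4), ("bar", 4, 10, 5), ("bar", 5, 11, 6),
    ("baz", 1, 12, 1), ("baz", 1, 13, 2), ("baz", 1, 13, 2), ("baz", 1, 13, 3),
]

# count(DISTINCT b) per a.
distinct_b = defaultdict(set)
for a, b, _, _ in rows:
    distinct_b[a].add(b)

# Deduplicate rows, attach z, and keep only groups with z > 2.
seen = set()
result = []
for row in rows:
    z = len(distinct_b[row[0]])
    if z > 2 and row not in seen:
        seen.add(row)
        result.append(row + (z,))

for line in result:
    print(line)
```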