Comparing rows vs array elements postgres - sql

I have a table A with n rows (200+) and different numeric columns.
I have a table B with m rows (100K+) and a column called multipliers, which is of type array (REAL[]). For every row in B, this array's length is n, i.e., there is a multiplier for every numeric variable in A. The array is sorted to match the alphabetical order of the id field in A.
Table A
id | values_1 | values_2
---|----------|-------------
1 | 11.2 | 10.2
2 | 21.9 | 12.5
3 | 30.0 | 26.0
4 | 98.0 | 11.8
Table B
id | multipliers
--------|-------------
dafcec | {2,3,4,9}
nkjhbn | {0,0,1,5}
ccseff | {1,2,0,5}
ddeecc | {0,0,0,1}
I need a query that returns the SUM( multipliers * values_1 ).
Like this:
b.id | sum(b.multipliers*a.value_1)
--------|----------------------------------
dafcec | 2*11.2 + 3*21.9 + 4*30.0 + 9*98.0
nkjhbn | 0*11.2 + 0*21.9 + 1*30.0 + 5*98.0
ccseff | 1*11.2 + 2*21.9 + 0*30.0 + 5*98.0
ddeecc | 0*11.2 + 0*21.9 + 0*30.0 + 1*98.0
I have tried with different subqueries, LATERAL joins and UNNEST, but I can't get a working result.
Any hints? Thanks!

The easiest, but I believe expensive, way to sum the array would be:
t=# with b as (select id,unnest(multipliers) u from b)
select distinct id, sum(u)over (partition by id) from b;
id | sum
----------+-----
ccseff | 8
nkjhbn | 6
ddeecc | 1
dafcec | 18
(4 rows)
and no fast alternative comes to my mind here...
Further, if I get it right, you want a Cartesian product (all against all); then:
t=# with b as (select id,unnest(multipliers) u from b)
, ag as (select distinct id, sum(u)over (partition by id) from b)
select ag.sum * a.values_1, a.id aid, ag.id idb
from ag
join a on true;
?column? | aid | idb
----------+-----+----------
89.6 | 1 | ccseff
175.2 | 2 | ccseff
240 | 3 | ccseff
784 | 4 | ccseff
67.2 | 1 | nkjhbn
131.4 | 2 | nkjhbn
180 | 3 | nkjhbn
588 | 4 | nkjhbn
11.2 | 1 | ddeecc
21.9 | 2 | ddeecc
30 | 3 | ddeecc
98 | 4 | ddeecc
201.6 | 1 | dafcec
394.2 | 2 | dafcec
540 | 3 | dafcec
1764 | 4 | dafcec
(16 rows)

Solved it.
The trick is to pack the values into an array and then unpack both arrays in parallel so the elements line up. It worked for me. The ORDER BY makes sure the packing happens in the desired order.
SELECT id, SUM(field * multiplier) AS result FROM
(
  with c as (
    -- pack values_1 into one array, ordered by id so it lines up with the multipliers
    SELECT array_agg(values_1 ORDER BY id ASC) AS val1
    from A
  )
  , ag as (
    select distinct id, multipliers
    from B
  )
  SELECT
    ag.id,
    unnest(c.val1)         AS field,
    unnest(ag.multipliers) AS multiplier
  FROM c, ag
) s
GROUP BY id
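For what it's worth, a shorter sketch of the same dot product is possible with unnest(...) WITH ORDINALITY (Postgres 9.4+). This is my own variant, not part of the solution above, and it assumes the array positions follow ascending a.id:
SELECT b.id,
       SUM(m.multiplier * a2.values_1) AS result
FROM   B b
CROSS  JOIN LATERAL unnest(b.multipliers) WITH ORDINALITY AS m(multiplier, pos)
JOIN  (SELECT id, values_1,
              row_number() OVER (ORDER BY id) AS pos  -- position that matches the array order
       FROM A) a2 ON a2.pos = m.pos
GROUP  BY b.id;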

Related

Postgres - Unique values for id column using CTE, Joins alongside GROUP BY

I have a table referrals:
id | user_id_owner | firstname | is_active | user_type | referred_at
----+---------------+-----------+-----------+-----------+-------------
3 | 2 | c | t | agent | 3
5 | 3 | e | f | customer | 5
4 | 1 | d | t | agent | 4
2 | 1 | b | f | agent | 2
1 | 1 | a | t | agent | 1
And another table activations
id | user_id_owner | referral_id | amount_earned | activated_at | app_id
----+---------------+-------------+---------------+--------------+--------
2 | 2 | 3 | 3.0 | 3 | a
4 | 1 | 1 | 6.0 | 5 | b
5 | 4 | 4 | 3.0 | 6 | c
1 | 1 | 2 | 2.0 | 2 | b
3 | 1 | 2 | 5.0 | 4 | b
6 | 1 | 2 | 7.0 | 8 | a
I am trying to generate another table from the two tables that has only unique values for referrals.id and returns, as one of the columns, the count for each app as best_selling_app_count.
Here is the query I ran:
with agents
as
(select
referrals.id,
referral_id,
amount_earned,
referred_at,
activated_at,
activations.app_id
from referrals
left outer join activations
on (referrals.id = activations.referral_id)
where referrals.user_id_owner = 1),
distinct_referrals_by_id
as
(select
id,
count(referral_id) as activations_count,
sum(coalesce(amount_earned, 0)) as amount_earned,
referred_at,
max(activated_at) as last_activated_at
from
agents
group by id, referred_at),
distinct_referrals_by_app_id
as
(select id, app_id as best_selling_app,
count(app_id) as best_selling_app_count
from agents
group by id, app_id )
select *, dense_rank() over (order by best_selling_app_count desc) best_selling_app_rank
from distinct_referrals_by_id
inner join distinct_referrals_by_app_id
on (distinct_referrals_by_id.id = distinct_referrals_by_app_id.id);
Here is the result I got:
id | activations_count | amount_earned | referred_at | last_activated_at | id | best_selling_app | best_selling_app_count | best_selling_app_rank
----+-------------------+---------------+-------------+-------------------+----+------------------+------------------------+-----------------------
2 | 3 | 14.0 | 2 | 8 | 2 | b | 2 | 1
1 | 1 | 6.0 | 1 | 5 | 1 | b | 1 | 2
2 | 3 | 14.0 | 2 | 8 | 2 | a | 1 | 2
4 | 1 | 3.0 | 4 | 6 | 4 | c | 1 | 2
The problem with this result is that the table has a duplicate id of 2. I only need unique values for the id column.
I tried a workaround using DISTINCT ON that gave the desired result, but I fear the query results may not be reliable and consistent.
Here is the workaround query:
with agents
as
(select
referrals.id,
referral_id,
amount_earned,
referred_at,
activated_at,
activations.app_id
from referrals
left outer join activations
on (referrals.id = activations.referral_id)
where referrals.user_id_owner = 1),
distinct_referrals_by_id
as
(select
id,
count(referral_id) as activations_count,
sum(coalesce(amount_earned, 0)) as amount_earned,
referred_at,
max(activated_at) as last_activated_at
from
agents
group by id, referred_at),
distinct_referrals_by_app_id
as
(select
distinct on (id) id, app_id as best_selling_app,
count(app_id) as best_selling_app_count
from agents
group by id, app_id
order by id, best_selling_app_count desc)
select *, dense_rank() over (order by best_selling_app_count desc) best_selling_app_rank
from distinct_referrals_by_id
inner join distinct_referrals_by_app_id
on (distinct_referrals_by_id.id = distinct_referrals_by_app_id.id);
I need a recommendation on how best to achieve this.
I am trying to generate another table from the two tables that has only unique values for referrals.id and returns, as one of the columns, the count for each app as best_selling_app_count.
Your question is really complicated with a very complicated SQL query. However, the above is what looks like the actual question. If so, you can use:
select r.*,
a.app_id as most_common_app_id,
a.cnt as most_common_app_id_count
from referrals r left join
(select distinct on (a.referral_id) a.referral_id, a.app_id, count(*) as cnt
from activations a
group by a.referral_id, a.app_id
order by a.referral_id, count(*) desc
) a
on a.referral_id = r.id;
You have not explained the other columns that are in your result set.
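If the other aggregate columns from your CTEs are needed as well, here is a hedged sketch of my own (assuming the referrals/activations schema above and user_id_owner = 1) that still keeps one row per referral:
select r.id,
       count(a.id)                       as activations_count,
       coalesce(sum(a.amount_earned), 0) as amount_earned,
       r.referred_at,
       max(a.activated_at)               as last_activated_at,
       best.app_id                       as best_selling_app,
       best.cnt                          as best_selling_app_count
from referrals r
left join activations a on a.referral_id = r.id
left join lateral (
    -- most common app for this referral; ties broken arbitrarily
    select a2.app_id, count(*) as cnt
    from activations a2
    where a2.referral_id = r.id
    group by a2.app_id
    order by count(*) desc
    limit 1
) best on true
where r.user_id_owner = 1
group by r.id, r.referred_at, best.app_id, best.cnt;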

Count the number of appearances of char given a ID

I have to perform a query where I can count the number of distinct codes per Id.
|Id | Code
------------
| 1 | C
| 1 | I
| 2 | I
| 2 | C
| 2 | D
| 2 | D
| 3 | C
| 3 | I
| 3 | D
| 4 | I
| 4 | C
| 4 | C
The output should be something like:
|Id | Count | #Code C | #Code I | #Code D
-------------------------------------------
| 1 | 2 | 1 | 1 | 0
| 2 | 3 | 1 | 0 | 2
| 3 | 3 | 1 | 1 | 1
| 4 | 2 | 2 | 1 | 0
Can you give me some advice on this?
This answers the original version of the question.
You are looking for count(distinct):
select id, count(distinct code)
from t
group by id;
If the codes are limited to the provided ones, the following query can produce the desired result.
select
pvt.Id,
codes.total As [Count],
COALESCE(C, 0) AS [#Code C],
COALESCE(I, 0) AS [#Code I],
COALESCE(D, 0) AS [#Code D]
from
( select Id, Code, Count(code) cnt
from t
Group by Id, Code) s
PIVOT(MAX(cnt) FOR Code IN ([C], [I], [D])) pvt
join (select Id, count(distinct Code) total from t group by Id) codes on pvt.Id = codes.Id ;
Note: as I can see from the sample input data, code 'I' appears for every Id, yet its count is shown as zero for Id = 2 in the expected output in the question.
Here is the correct output:
DB Fiddle
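Note that the PIVOT syntax above is SQL Server; if you are on Postgres or another database without PIVOT, a hedged sketch of the same result with conditional aggregation (assuming the table is called t) would be:
select id,
       count(distinct code)                        as total,
       sum(case when code = 'C' then 1 else 0 end) as code_c,
       sum(case when code = 'I' then 1 else 0 end) as code_i,
       sum(case when code = 'D' then 1 else 0 end) as code_d
from t
group by id
order by id;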

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1 | 8091 | 0.9 |
| 1 | 8092 | 20 |
| 1 | 8095 | 0.22 |
| 1 | 8096 | 0.23 |
| 1 | 8098 | 0.23 |
| 2 | 8095 | 12 |
| 2 | 8096 | 18 |
| 2 | 8097 | 3 |
| 2 | 8098 | 0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap in the date_key column, the range must not include the next value, as for ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum that classifies rows into groups, incrementing by 1 every time a row with credit >= 1 is found (in date_key order). Thereafter it is just a group by.
select id, count(*) as range_days_credit_lt_1
from (select t.*
             -- grp stays constant within a run of credit < 1 rows and increments at each credit >= 1 row
            ,sum(case when credit<1 then 0 else 1 end) over(partition by id order by date_key) as grp
      from tbl t
     ) t
where credit<1
group by id, grp
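If the gaps in date_key also have to break a range (as in the expected output, where 8098 starts a new range for ID 1), a sketch of the classic row_number difference trick can be used; this assumes date_key is an integer day number and the table is called tbl:
select id, count(*) as range_days_credit_less_than_1
from (select id, date_key,
             -- constant within each run of consecutive days having credit < 1
             date_key - row_number() over (partition by id order by date_key) as grp
      from tbl
      where credit < 1
     ) t
group by id, grp;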
The key is to collapse each consecutive sequence and compute its length. I managed to achieve this, in a relatively clumsy way:
with t_test as
(
  select num, row_number() over (order by num) as rn
  from
  (
    select explode(array(1,3,4,5,6,9,10,15)) as num
  ) nums
)
select length(sign) + 1 as run_length from
(
  select explode(continue_sign) as sign
  from
  (
    select split(concat_ws('', collect_list(if(d > 1, 'v', d))), 'v') as continue_sign
    from
    (
      select t0.num - t1.num as d
      from t_test t0
      join t_test t1 on t0.rn = t1.rn + 1
    ) diffs
  ) marks
) runs
Get the previous number b in the sequence for each original a;
Check whether a - b == 1; if not, there is a gap, marked as 'v';
Merge all the a - b markers into one string, split it on 'v', and compute each segment's length.
To get the ID column out as well, another string encoding the id would have to be considered.

Efficient query to Group by column name in SQL or hive

Imagine I have a table with 2 columns m_1 and m_2:
m1 | m2
3 | 17
3 | 18
4 | 17
9 | 9
I would like to get a table with 3 columns:
m is the index of the m column (in my example 1 or 2),
d is the data contained in the table,
count is the number of occurrences of each value, grouped by value and index.
In the example, the result is:
m | d | count
m_1 | 3 | 2
m_1 | 4 | 1
m_1 | 9 | 1
m_2 | 17| 2
m_2 | 18| 1
m_2 | 9 | 1
The first line must be read as 'data 3 occurs 2 times in column m_1'.
A naive solution is to execute a parametric query like this twice:
for (i in 1 .. 2)
SELECT CONCAT('m_', i), m_i, count(*) FROM table GROUP BY m_i
But this algorithm scans my table twice. This is a problem since I have 255 m columns and billions of rows.
Will the solution become easier if I use Hive instead of a relational database?
You can write this using union all and group by:
select colname, d, count(*)
from ((select 'm_1' as colname, m1 as d from t) union all
(select 'm_2' as colname, m2 as d from t)
) m12
group by colname, d;
A Hive alternative that scans the table only once is posexplode(array(m1,m2)):
select concat('m_',cast(pe.pos+1 as string)) as m
      ,pe.val as d
      ,count(*) as `count`
from mytable t
lateral view posexplode(array(m1,m2)) pe as pos, val
group by pos
        ,val
;
+------+-----+--------+
| m | d | count |
+------+-----+--------+
| m_1 | 3 | 2 |
| m_1 | 4 | 1 |
| m_1 | 9 | 1 |
| m_2 | 9 | 1 |
| m_2 | 17 | 2 |
| m_2 | 18 | 1 |
+------+-----+--------+

Display Rows side by side, Kind of pivot

I have a table like this
Column1 | Column2
-------------------
A | 1
A | 2
A | 3
B | 4
B | 5
B | 3
C | 2
C | 2
C | 2
D | 7
D | 8
D | 9
I want to output it as
A | B | C | D
--------------------
1 | 4 | 2 | 7
2 | 5 | 2 | 8
3 | 3 | 2 | 9
It will have fixed Rows/Columns like A,B,C,D.
Could you suggest a query in SQL Server 2005/2008?
It is better to know the clustered key of your table, since the row order in the result is otherwise not guaranteed. Martin is right; try this out, it will get you started:
SELECT pvt.A,
pvt.B,
pvt.C,
pvt.D
FROM (SELECT *,
row=ROW_NUMBER() OVER(PARTITION BY Column1 ORDER BY (SELECT 1))
FROM yourtable) AS A
PIVOT (MIN(Column2) FOR Column1 IN ([A], [B], [C], [D]))
AS pvt
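The same output can also be produced without PIVOT, as a sketch using conditional aggregation over the same ROW_NUMBER (again assuming your table is called yourtable):
SELECT MAX(CASE WHEN Column1 = 'A' THEN Column2 END) AS A,
       MAX(CASE WHEN Column1 = 'B' THEN Column2 END) AS B,
       MAX(CASE WHEN Column1 = 'C' THEN Column2 END) AS C,
       MAX(CASE WHEN Column1 = 'D' THEN Column2 END) AS D
FROM (SELECT Column1, Column2,
             ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY (SELECT 1)) AS rn
      FROM yourtable) t
GROUP BY rn
ORDER BY rn;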