Get MAX from row with column name (SQL) - google-bigquery

Sorry if my question is simple, but I spent a day googling and still can't figure out how to solve this:
I have a table like:
userId A B C D E
1 5 0 2 3 2
2 3 2 0 7 3
And for each row I need the name of the column that holds the MAX value:
userId MAX
1 A
2 D
All ideas will be much appreciated! Thanks!
I use Google BigQuery, so my possibilities are different from MySQL as I understand it, but I will try to translate if you have ideas in the MySQL way.

You can use GREATEST:
SELECT userid, CASE GREATEST(A,B,C,D,E)
WHEN A THEN 'A'
WHEN B THEN 'B'
WHEN C THEN 'C'
WHEN D THEN 'D'
WHEN E THEN 'E'
END AS MAX
FROM TableName
Result:
userId MAX
1 A
2 D
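As a quick way to check the CASE logic without a BigQuery or MySQL session, here is a minimal Python sketch that runs the same query shape against an in-memory SQLite database. It is an assumption-laden stand-in: SQLite has no GREATEST, so its multi-argument max() is swapped in, and the output alias max_column is chosen only for this illustration.

```python
import sqlite3

# In-memory SQLite table standing in for the real one (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableName "
    "(userId INTEGER, A INTEGER, B INTEGER, C INTEGER, D INTEGER, E INTEGER)"
)
conn.executemany(
    "INSERT INTO TableName VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 5, 0, 2, 3, 2), (2, 3, 2, 0, 7, 3)],
)

# SQLite's multi-argument max() plays the role of GREATEST here.
query = """
SELECT userId,
       CASE max(A, B, C, D, E)
           WHEN A THEN 'A'
           WHEN B THEN 'B'
           WHEN C THEN 'C'
           WHEN D THEN 'D'
           WHEN E THEN 'E'
       END AS max_column
FROM TableName
ORDER BY userId
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'A'), (2, 'D')]
```

When two columns tie for the maximum, the CASE returns the first matching branch, so the earlier column name wins. In this SQLite stand-in, a NULL in any column makes max() return NULL, so no branch matches and the result is NULL.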

Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
( SELECT key
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'".*?":[^,}]*')) kv,
UNNEST([STRUCT(TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') AS key, SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) AS value)])
WHERE key != 'userId'
ORDER BY value DESC
LIMIT 1
) max_column
FROM `project.dataset.table` t
You can test it with the sample data from the question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 userId, 5 A, 0 B, 2 C, 3 D, 2 E UNION ALL
SELECT 2, 3, 2, 0, 7, 3 UNION ALL
SELECT 3, 1, 2, NULL, 4, 5
)
SELECT *,
( SELECT key
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'".*?":[^,}]*')) kv,
UNNEST([STRUCT(TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') AS key, SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) AS value)])
WHERE key != 'userId'
ORDER BY value DESC
LIMIT 1
) max_column
FROM `project.dataset.table` t
The output is:
Row userId A B C D E max_column
1 1 5 0 2 3 2 A
2 2 3 2 0 7 3 D
3 3 1 2 null 4 5 E

Related

Find records on group level which are connected to all other record within the group

I have a scenario where I have to find IDs within each group which are connected to all other IDs in the same group. So basically we have to treat each group separately.
In the table below, the group A has 3 IDs 1, 2 and 3. 1 is connected to both 2 and 3, 2 is connected to both 1 and 3, but 3 is not connected to 1 and 2. So 1 and 2 should be output from group A. Similarly in group B only 5 is connected to all other IDs namely 4 and 6 within group B, so 5 should be output. Similarly from group C, that should be 8, and from group D no records should be output.
So the output of the select statement should be 1, 2, 5, 8.
GRP ID CONNECTED_TO
A 1 2
A 1 3
A 2 3
A 2 1
A 3 5
B 4 5
B 5 4
B 5 6
B 6 4
C 7 21
C 7 25
C 8 7
D 9 31
D 10 35
D 11 37
I was able to do this when the group level was not required, using the SQL below:
SELECT ID FROM <table>
where CONNECTED_TO in (select ID from <table>)
group by ID
having count(*) = <number of records - 1>
But I am not able to find the correct SQL for my scenario. Any help is appreciated.
You may use the count and count(distinct) functions as follows:
select id
from tbl T
where connected_to in
(
select id from tbl T2
where T2.grp = T.grp
)
group by grp, id
having count(connected_to) =
(
select count(distinct D.id) - 1
from tbl D
where T.grp = D.grp
)
When count(connected_to), grouped by grp and id, equals count(distinct id) - 1 for the same grp, the ID is connected to all other IDs in its group.
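The counting rule can be cross-checked outside the database. The following is a minimal Python sketch of the same idea, not a translation of the query plan; the function name fully_connected_ids is made up for this illustration, and it assumes the table has no duplicate rows and no self-links, since either would distort the count.

```python
from collections import defaultdict

# (grp, id, connected_to) rows from the question's sample table.
rows = [
    ("A", 1, 2), ("A", 1, 3), ("A", 2, 3), ("A", 2, 1), ("A", 3, 5),
    ("B", 4, 5), ("B", 5, 4), ("B", 5, 6), ("B", 6, 4),
    ("C", 7, 21), ("C", 7, 25), ("C", 8, 7),
    ("D", 9, 31), ("D", 10, 35), ("D", 11, 37),
]

def fully_connected_ids(rows):
    # Distinct IDs per group (the count(distinct id) side of the comparison).
    ids_in_group = defaultdict(set)
    for grp, id_, _ in rows:
        ids_in_group[grp].add(id_)

    # Per (grp, id), count the links whose target is an ID in the same group.
    in_group_links = defaultdict(int)
    for grp, id_, target in rows:
        if target in ids_in_group[grp]:
            in_group_links[(grp, id_)] += 1

    # Keep IDs linked to every other ID of their group.
    return sorted(
        id_
        for (grp, id_), n in in_group_links.items()
        if n == len(ids_in_group[grp]) - 1
    )

result = fully_connected_ids(rows)
print(result)  # [1, 2, 5, 8]
```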

SQL how to select n row from each interval of one column

For example, the table looks like
a b c
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
I want to randomly pick 2 rows from every interval based on column a, where a ~ [0, 2], a ~ [4, 6], a ~ [9, 20].
Another more complicated case would be to select n rows from every interval based on multiple columns, for example in this case the intervals would be a ~ [0, 2], a ~ [4, 6], b ~ [7, 9], ...
Is there a way to do so with just SQL?
Find out to which interval each row belongs, order by random partitioned by an interval id, get the top n rows for each interval:
create transient table mytable as
select seq8() id, random() data
from table(generator(rowcount => 100)) v;
create transient table intervals as
select 0 i_start, 6 i_end, 2 random_rows
union all select 7, 20, 1
union all select 21, 30, 3
union all select 31, 50, 1;
select *
from (
select *
, row_number() over(partition by i_start order by random()) rn
from mytable a
join intervals b
on a.id between b.i_start and b.i_end
)
where rn<=random_rows
Edit: Shorter and cleaner.
select a.*
from mytable a
join intervals b
on a.id between b.i_start and b.i_end
qualify row_number() over(partition by i_start order by random()) <= random_rows
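The same three steps (assign each row to an interval, randomize within the interval, keep the top n) can be sketched in plain Python. This is only an illustration under assumptions: sample_per_interval is a hypothetical helper, the interval bounds are inclusive as with BETWEEN, and the seeded random.Random is there only to make runs repeatable.

```python
import random

def sample_per_interval(ids, intervals, rng):
    """Pick up to n random ids from each (start, end, n) interval, bounds inclusive."""
    picked = {}
    for start, end, n in intervals:
        members = [i for i in ids if start <= i <= end]
        # An interval with fewer than n rows simply returns all of them.
        picked[(start, end)] = rng.sample(members, min(n, len(members)))
    return picked

# Same interval definitions as the intervals table above.
intervals = [(0, 6, 2), (7, 20, 1), (21, 30, 3), (31, 50, 1)]
picked = sample_per_interval(range(100), intervals, random.Random(0))

for (start, end), chosen in picked.items():
    print(start, end, chosen)
```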
To get two rows per group, you want to use row_number(). To define the groups, you can use a lateral join to define the groupings:
select t.*
from (select t.*,
row_number() over (partition by v.grp order by random()) as seqnum
from t cross join lateral
(values (case when a between 0 and 2 then 1
when a between 4 and 6 then 2
when a between 7 and 9 then 3
end)
) v(grp)
where grp is not null
) t
where seqnum <= 2;
You can adjust the case expression to define whatever groups you like.

Is there a way to find active users in SQL?

I'm trying to find the total count of active users in a database. "Active" users here are defined as those who have registered an event on the selected day or later than the selected day. So if a user registered an event on days 1, 2 and 5, they are counted as "active" throughout days 1, 2, 3, 4 and 5.
My original dataset looks like this (note that this is a sample - the real dataset will run to up to 365 days, and has around 1000 users).
Day ID
0 1
0 2
0 3
0 4
0 5
1 1
1 2
2 1
3 1
4 1
4 2
As you can see, all 5 IDs are active on Day 0, and 2 IDs (1 and 2) are active until Day 4, so I'd like the finished table to look like this:
Day Count
0 5
1 2
2 2
3 2
4 2
I've tried using the following query:
select Day as days, sum(case when Day <= days then 1 else 0 end)
from df
But it gives incorrect output (it only counts users who were active on each specific day).
I'm at a loss as to what I could try next. Does anyone have any ideas? Many thanks in advance!
I think I would just use generate_series():
select gs.d, count(*)
from (select id, min(day) as min_day, max(day) as max_day
from t
group by id
) t cross join lateral
generate_series(t.min_day, t.max_day, 1) gs(d)
group by gs.d
order by gs.d;
If you want to count everyone as active from day 1 -- but not all have a value on day 1 -- then use 1 instead of min_day.
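To see what the min/max expansion computes, here is a small Python sketch of the same idea on the question's sample data. It is an illustration only; active_counts is a hypothetical name, and like the query it treats a user as active on every day between their first and last event.

```python
from collections import Counter

# (day, id) events from the question's sample data.
events = [
    (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
    (1, 1), (1, 2), (2, 1), (3, 1), (4, 1), (4, 2),
]

def active_counts(events):
    # First and last event day per user (min(day) and max(day) grouped by id).
    first, last = {}, {}
    for day, user in events:
        first[user] = min(day, first.get(user, day))
        last[user] = max(day, last.get(user, day))

    # Expand each user's range day by day, as generate_series does, then count.
    counts = Counter()
    for user in first:
        for day in range(first[user], last[user] + 1):
            counts[day] += 1
    return dict(sorted(counts.items()))

result = active_counts(events)
print(result)  # {0: 5, 1: 2, 2: 2, 3: 2, 4: 2}
```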
A bit verbose, but this should do:
with dt as (
select 0 d, 1 id
union all
select 0 d, 2 id
union all
select 0 d, 3 id
union all
select 0 d, 4 id
union all
select 0 d, 5 id
union all
select 1 d, 1 id
union all
select 1 d, 2 id
union all
select 2 d, 1 id
union all
select 3 d, 1 id
union all
select 4 d, 1 id
union all
select 4 d, 2 id
)
, active_periods as (
select id
, min(d) min_d
, max(d) max_d
from dt
group by id
)
, days as (
select distinct d
from dt
)
select d.d
, count(ap.id)
from days d
join active_periods ap on d.d between ap.min_d and ap.max_d
group by 1
order by 1 asc
You need count by day.
select
id,
count(*)
from df
GROUP BY
id

Shuffle column in Google's BigQuery based on groupby

I want to randomly shuffle the values for one single column of a table based on a groupby. E.g., I have two columns A and B. Now, I want to randomly shuffle column B based on a groupby on A.
For an example, suppose that there are three distinct values in A. Now for each distinct value of A, I want to shuffle the values in B, but just with values having the same A.
Example input:
A B C
-------------------
1 1 x
1 3 a
2 4 c
3 6 d
1 2 a
3 5 v
Example output:
A B C
------------------
1 3 x
1 2 a
2 4 c
3 6 d
1 1 a
3 5 v
In this case, for A=1 the values for B got shuffled. The same happened for A=2, but as there is only one row it stayed like it was. For A=3 by chance the values for B also stayed like they were. The values for column C stay as they are.
Maybe this can be solved by using window functions, but I am unsure how exactly.
As a side note: This should be achieved in Google's BigQuery.
Is this what you're after? (You tagged the question with both MySQL and Oracle, so I answer here using Oracle.)
[edit] corrected based on confirmed logic [/edit]
with w_data as (
select 1 a, 1 b from dual union all
select 1 a, 3 b from dual union all
select 2 a, 4 b from dual union all
select 3 a, 6 b from dual union all
select 1 a, 2 b from dual union all
select 3 a, 5 b from dual
),
w_suba as (
select a, row_number() over (partition by a order by dbms_random.value) aid
from w_data
),
w_subb as (
select a, b, row_number() over (partition by a order by dbms_random.value) bid
from w_data
)
select sa.a, sb.b
from w_suba sa,
w_subb sb
where sa.aid = sb.bid
and sa.a = sb.a
/
A B
---------- ----------
1 3
1 1
1 2
2 4
3 6
3 5
6 rows selected.
SQL> /
A B
---------- ----------
1 3
1 1
1 2
2 4
3 5
3 6
6 rows selected.
SQL>
Logic breakdown:
1) w_data is just your sample data set ...
2) randomize column a (not really needed, you could just rownum this and let b randomize ... but I do so love (over)using dbms_random :) heh )
3) randomize column b (using partition by in the analytic function creates "groups"; order by random randomizes the items within each group)
4) join them, using both the group (a) and the randomized id to find a random item within each group.
By doing the randomization this way you can ensure that you get the same number of each value back, i.e. you start with one "3", you end with one "3", etc.
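The pairing idea in this breakdown can be sketched in Python as well. This is a minimal illustration rather than a port of the Oracle query: shuffle_within_groups is a hypothetical helper, it keeps column C in place (which the query above does not select), and the seeded random.Random is only there to make the run repeatable.

```python
import random
from collections import defaultdict

def shuffle_within_groups(rows, rng):
    """rows: list of (a, b, c). Returns rows with b shuffled among rows sharing the same a."""
    # Remember which row positions belong to each value of a.
    positions = defaultdict(list)
    for idx, (a, _, _) in enumerate(rows):
        positions[a].append(idx)

    out = list(rows)
    for idxs in positions.values():
        # Shuffle only this group's b values, then hand them back to the group's rows.
        values = [rows[i][1] for i in idxs]
        rng.shuffle(values)
        for i, b in zip(idxs, values):
            out[i] = (rows[i][0], b, rows[i][2])
    return out

rows = [(1, 1, "x"), (1, 3, "a"), (2, 4, "c"), (3, 6, "d"), (1, 2, "a"), (3, 5, "v")]
shuffled = shuffle_within_groups(rows, random.Random(42))
for row in shuffled:
    print(row)
```

Because each group's values are only permuted, every value of B that goes in comes out exactly once, and a group with a single row is left unchanged.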
I feel the query below should work in BigQuery:
SELECT
x.A as A, x.B as Old_B, x.c as C, y.B as New_B
FROM (
SELECT A, B, C,
ROW_NUMBER() OVER(PARTITION BY A ORDER BY B, C) as pos
FROM [your_table]
) as x
JOIN (
SELECT
A, B, ROW_NUMBER() OVER(PARTITION BY A ORDER BY rnd) as pos
FROM (
SELECT A, B, RAND() as rnd
FROM [your_table]
)
) as y
ON x.A = y.A AND x.pos = y.pos

SQL query to group based on sum

I have a simple table with values that I want to chunk/partition into distinct groups based on the sum of those values (up to a certain limit group sum total).
E.g., imagine a table like the following:
Key Value
-----------
A 1
B 4
C 2
D 2
E 5
F 1
And I would like to group into sets such that no one grouping's sum will exceed some given value (say, 5).
The result would be something like:
Group Key Value
-------------------
1 A 1
B 4
--------
Total: 5
2 C 2
D 2
--------
Total: 4
3 E 5
--------
Total: 5
4 F 1
--------
Total: 1
Is such a query possible?
While I am inclined to agree with the comments that this is best done outside of SQL, here is some SQL which would seem to do roughly what you're asking:
with mytable AS (
select 'A' AS [Key], 1 AS [Value] UNION ALL
select 'B', 4 UNION ALL
select 'C', 2 UNION ALL
select 'D', 2 UNION ALL
select 'E', 5 UNION ALL
select 'F', 1
)
, Sums AS (
select T1.[Key] AS T1K
, T2.[Key] AS T2K
, (SELECT SUM([Value])
FROM mytable T3
WHERE T3.[Key] <= T2.[Key]
AND T3.[Key] >= T1.[Key]) AS TheSum
from mytable T1
inner join mytable T2
on T2.[Key] >= T1.[Key]
)
select S1.T1K AS StartKey
, S1.T2K AS EndKey
, S1.TheSum
from Sums S1
left join Sums S2
on (S1.T1K >= S2.T1K and S1.T2K <= S2.T2K)
and S2.TheSum > S1.TheSum
and S2.TheSum <= 5
where S1.TheSum <= 5
AND S2.T1K IS NULL
When I ran this code on SQL Server 2008 I got the following results:
StartKey EndKey Sum
A B 5
C D 4
E E 5
F F 1
It should be straightforward to construct the required groups from these results.
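For comparison, and in line with the comments that this is easier outside of SQL, here is a minimal Python sketch of a greedy pass that produces the same groups for the sample data. It is a different technique from the self-join above, not a translation of it; chunk_by_sum is a hypothetical name, and the sketch assumes rows are processed in key order and that no single value exceeds the limit.

```python
def chunk_by_sum(items, limit):
    """Greedily pack (key, value) pairs, in order, into groups whose sum stays <= limit."""
    groups, current, total = [], [], 0
    for key, value in items:
        if value > limit:
            raise ValueError(f"value for {key!r} exceeds the limit on its own")
        # Close the current group when the next value would push it over the limit.
        if current and total + value > limit:
            groups.append(current)
            current, total = [], 0
        current.append(key)
        total += value
    if current:
        groups.append(current)
    return groups

items = [("A", 1), ("B", 4), ("C", 2), ("D", 2), ("E", 5), ("F", 1)]
groups = chunk_by_sum(items, 5)
print(groups)  # [['A', 'B'], ['C', 'D'], ['E'], ['F']]
```

A greedy pass like this only respects the limit; it makes no attempt to minimize the number of groups.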
If you want at most two members in each set, you can use the following query:
Select
A.[Key] as K1 ,
B.[Key] as K2 ,
isnull(A.value,0) as V1 ,
isnull(B.value,0) as V2 ,
(A.value+B.value)as Total
from Table_1 as A left join Table_1 as B
on A.value+B.value<=5 and A.[Key]<>B.[Key]
For finding sets having more members, you can continue to use joins.