Sorting columns within rows in BigQuery - google-bigquery

I have a table like the one below:
id  a   b   c
1   2   1   3
2   3   2   1
3   16  14  15
4   10  12  13
5   15  16  14
6   10  12  8
I need to "normalize" this table by sorting the values in columns a, b, c within each row, then deduplicating the resulting rows and counting the duplicates.
Expected result:
a   b   c   dups
1   2   3   2
14  15  16  2
10  12  13  1
8   10  12  1
I do have a solution, but I don't see how to scale it easily to the case where there are more than 3 columns to normalize. The first and last columns, as you can see below, are not an issue; things get messy for the columns in the middle when the number of columns is greater than 3.
select a, b, c, count(1) as dups from (
  select a1 as a, if(a != a1 and a != c1, a, if(b != a1 and b != c1, b, c)) as b, c1 as c
  from (select a, b, c, least(a, b, c) as a1, greatest(a, b, c) as c1 from table)
) group by a, b, c
Can anyone suggest another approach?
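The intended transformation is easy to state outside SQL: sort each row's a/b/c values, then count identical sorted tuples. A minimal Python sketch of the intent, using the sample data from the question (purely illustrative, not a BigQuery solution):

```python
from collections import Counter

# Sample rows copied from the question: (id, a, b, c)
rows = [
    (1, 2, 1, 3),
    (2, 3, 2, 1),
    (3, 16, 14, 15),
    (4, 10, 12, 13),
    (5, 15, 16, 14),
    (6, 10, 12, 8),
]

# Sort the a/b/c values within each row, then count identical sorted tuples
counts = Counter(tuple(sorted(row[1:])) for row in rows)

for (a, b, c), dups in counts.items():
    print(a, b, c, dups)
```

Running this reproduces the expected result table above.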

The example below works for 4 columns and can be adjusted to any number of columns by adding an extra STRING(x) to CONCAT() and an extra REGEXP_EXTRACT line per additional column.
SELECT a, b, c, d, COUNT(1) AS dups
FROM (
  SELECT id,
    REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){0}(.*),') AS a,
    REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){1}(.*),') AS b,
    REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){2}(.*),') AS c,
    REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){3}(.*),') AS d
  FROM (
    SELECT id, GROUP_CONCAT(s) AS s FROM (
      SELECT id, s,
        INTEGER(s) AS e,
        ROW_NUMBER() OVER(PARTITION BY id ORDER BY e) pos
      FROM (
        SELECT id,
          SPLIT(CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d))) AS s
        FROM table
      ) ORDER BY id, pos
    ) GROUP BY id
  )
) GROUP BY a, b, c, d
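The trick the query relies on is: serialize each row's values into a sorted comma-joined string, then pull the N-th piece back out with a regex. A small Python sketch of the same mechanics (the sample rows are hypothetical; note that RE2's `(?U)` flag makes quantifiers non-greedy, which Python spells `.*?`):

```python
import re

# Hypothetical 4-column rows (a, b, c, d)
rows = [(3, 1, 4, 2), (13, 10, 12, 11)]

results = []
for row in rows:
    # Sort the values and join them, as the GROUP_CONCAT step does
    s = ','.join(str(v) for v in sorted(row))
    # REGEXP_EXTRACT(s + ',', r'^(?:.*,){n}(.*),') pulls out the n-th piece;
    # RE2's (?U) flips quantifiers to non-greedy, i.e. Python's .*?
    cols = [re.match(r'^(?:.*?,){%d}(.*?),' % n, s + ',').group(1)
            for n in range(len(row))]
    results.append(cols)

print(results)
```

Each row comes back as its sorted values, one per extracted column position.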

Related

Switch column values to the left

I have a table like below:
Column A   Column B   Column C
---------- ---------- ----------
1          1          2
4          3          4
3          null       2
12         12         12
15         7          7
8          9          6
null       2          2
null       null       3
I need to shift each value to the left so that every row holds only distinct values, with the null values removed and used as right-padding. The output must be like this:
Column A   Column B   Column C
---------- ---------- ----------
1          2          null
4          3          null
3          2          null
12         null       null
15         7          null
8          9          6
2          null       null
3          null       null
What is the simplest way to do this?
Thanks,
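Stated procedurally, the requirement is: walk each row left to right, keep the first occurrence of every non-null value, and pad the remainder with nulls. A Python sketch of that specification (illustrative only), checked against the question's expected output:

```python
def shift_left(row):
    # Keep the first occurrence of each non-null value, pad with None
    kept = []
    for v in row:
        if v is not None and v not in kept:
            kept.append(v)
    return kept + [None] * (len(row) - len(kept))

# Rows from the question, with null as None
rows = [
    (1, 1, 2), (4, 3, 4), (3, None, 2), (12, 12, 12),
    (15, 7, 7), (8, 9, 6), (None, 2, 2), (None, None, 3),
]
result = [shift_left(list(r)) for r in rows]
print(result)
```

The SQL answers below implement this same rule declaratively.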
You can use regexp_substr as follows:
Table data:
SQL> select a,b,c from t;
A    B    C
---- ---- ----
1    1    2
4    3    4
3         2
12   12   12
15   7    7
8    9    6
     2    2
          3
8 rows selected.
Your query:
SQL> select regexp_substr(vals,'[^,]+',1,1) as a,
            regexp_substr(vals,'[^,]+',1,2) as b,
            regexp_substr(vals,'[^,]+',1,3) as c
     from (select rtrim(case when a is not null then a || ',' end ||
                        case when b = a or b is null then null else b || ',' end ||
                        case when b = c or a = c or c is null then null else c end, ',') as vals
           from t
          );
A    B    C
---- ---- ----
1    2
4    3
3    2
12
15   7
8    9    6
2
3
8 rows selected.
SQL>
This combination of unpivot, pivot and conditional ordering returns the desired output:
with
  nn as (
    select id, col, case rn when 1 then val end val
    from (
      select id, col, val, row_number() over (partition by id, val order by col) rn
      from (select rownum id, a, b, c from t)
      unpivot (val for col in (a, b, c)) ) ),
  mov as (
    select id, val,
           row_number() over (partition by id
                              order by case when val is not null then col end) rn
    from nn)
select * from mov pivot (max(val) for rn in (1 a, 2 b, 3 c))
dbfiddle
Subquery nn removes duplicated values; subquery mov, based on conditional ordering, moves them up. The final pivot then transposes the rows back into columns, since they were unpivoted in the first step.
If the input (and output) is three columns, as in your example, then a bit of brute force can give you an efficient query:
select coalesce(a, b, c) as a
, case when b != a then b
when c != coalesce(a, b) then c end as b
, case when a != b and b != c and a != c then c end as c
from t
;
It takes just a moment's thought to understand why this is correct (or, alternatively, you can throw a lot of test cases at it and be satisfied that, since it gives the correct answer in all cases, it must be correct even if you don't understand why).
This is not easy to generalize though; if you had, say, eight columns in the input (and in the output), you would do better with a solution like Ponder Stibbons proposed. Note, though, that the number of columns (whether it is three or eight or 250) has to be known in advance for a standard SQL query; otherwise you will need to write a dynamic query, which is generally not seen as a good practice.
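The "throw a lot of test cases at it" suggestion is easy to automate. Here is a Python sketch (illustrative, not part of the SQL solution) that transcribes the three CASE expressions, using SQL's NULL comparison semantics, and checks them exhaustively against a straightforward shift-left specification over a small value domain:

```python
from itertools import product

def neq(x, y):
    # SQL-style inequality: a comparison involving NULL is never true
    return x is not None and y is not None and x != y

def coalesce(*vals):
    return next((v for v in vals if v is not None), None)

def case_logic(a, b, c):
    # Direct transcription of the three CASE expressions above
    new_a = coalesce(a, b, c)
    new_b = b if neq(b, a) else (c if neq(c, coalesce(a, b)) else None)
    new_c = c if neq(a, b) and neq(b, c) and neq(a, c) else None
    return [new_a, new_b, new_c]

def reference(row):
    # Straightforward spec: drop NULLs and duplicates, shift left, pad
    kept = []
    for v in row:
        if v is not None and v not in kept:
            kept.append(v)
    return kept + [None] * (len(row) - len(kept))

# Exhaustive check over a small value domain (64 combinations)
for a, b, c in product([None, 1, 2, 3], repeat=3):
    assert case_logic(a, b, c) == reference([a, b, c])
print("all cases match")
```

Every combination agrees, which is as much assurance as a brute-force check can give for three columns.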
EDIT:
Here is a solution that generalizes easily to any number of columns (same in the input and in the output); the number of columns, though, must be known in advance (as well as the names of the columns and their order).
This solution is similar to the one posted by Ponder Stibbons. There are two main differences. First, I use the lateral clause, which is available in Oracle 12.1 and higher; this allows computations to be done separately in each row (instead of mixing all values from all rows together after unpivot, only to group them back into the initial rows via pivot). Second, the code is a bit more complicated, to handle the case when all values in a row are null.
select l.a, l.b, l.c
from t,
     lateral (
       select a, b, c
       from (select 1, row_number() over (order by nvl2(v, min(o), null)) o, v
             from (select t.a, t.b, t.c from dual)
             unpivot include nulls (v for o in (a as 1, b as 2, c as 3))
             group by v)
       pivot (max(v) for o in (1 as a, 2 as b, 3 as c))
     ) l
;
EDIT 2
In comments after this answer, the OP states that his real-life data has five columns, rather than three. The first solution is hard to extend to five columns, but the second solution is easy to extend. I show how on dbfiddle.
Hmmm . . . one method would be to construct a string and then break it apart. For example:
select regexp_substr(str, '[^|]+', 1, 1) as a,
       regexp_substr(str, '[^|]+', 1, 2) as b,
       regexp_substr(str, '[^|]+', 1, 3) as c
from (select t.*,
             trim(leading '|' from
                  (case when a is not null then '|' || a end) ||
                  (case when b is not null then '|' || b end) ||
                  (case when c is not null then '|' || c end)
                 ) str
      from t
     ) t
Here is a db<>fiddle.

Shuffle column in Google's BigQuery based on groupby

I want to randomly shuffle the values for one single column of a table based on a groupby. E.g., I have two columns A and B. Now, I want to randomly shuffle column B based on a groupby on A.
For example, suppose there are three distinct values in A. For each distinct value of A, I want to shuffle the values in B, but only among rows having the same A.
Example input:
A  B  C
-------
1  1  x
1  3  a
2  4  c
3  6  d
1  2  a
3  5  v
Example output:
A  B  C
-------
1  3  x
1  2  a
2  4  c
3  6  d
1  1  a
3  5  v
In this case, for A=1 the values for B got shuffled. The same happened for A=2, but as there is only one row, it stayed as it was. For A=3, by chance, the values for B also stayed as they were. The values in column C stay as they are.
Maybe this can be solved by using window functions, but I am unsure how exactly.
As a side note: This should be achieved in Google's BigQuery.
Is this what you're after? (You tagged both MySQL and Oracle, so I'm answering here using Oracle.)
[edit] corrected based on confirmed logic [/edit]
with w_data as (
select 1 a, 1 b from dual union all
select 1 a, 3 b from dual union all
select 2 a, 4 b from dual union all
select 3 a, 6 b from dual union all
select 1 a, 2 b from dual union all
select 3 a, 5 b from dual
),
w_suba as (
select a, row_number() over (partition by a order by dbms_random.value) aid
from w_data
),
w_subb as (
select a, b, row_number() over (partition by a order by dbms_random.value) bid
from w_data
)
select sa.a, sb.b
from w_suba sa,
w_subb sb
where sa.aid = sb.bid
and sa.a = sb.a
/
A    B
---- ----
1    3
1    1
1    2
2    4
3    6
3    5
6 rows selected.
SQL> /
A    B
---- ----
1    3
1    1
1    2
2    4
3    5
3    6
6 rows selected.
SQL>
Logic breakdown:
1) w_data is just your sample data set ...
2) w_suba randomizes column a (not really needed, you could just rownum this and let b randomize ... but I do so love (over)using dbms_random :) heh )
3) w_subb randomizes column b (the partition by in the analytic creates the "groups"; ordering by a random value randomizes the items within each group)
4) join them ... using both the group (a) and the randomized id to find a random item within each group.
By doing the randomization this way you ensure that you get the same values back ... i.e. if you start with one "3", you end with one "3", etc.
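The effect of the randomize-and-join approach can be sketched in a few lines of Python (sample data from the question; this is an equivalent per-group shuffle, not a literal transcription of the two randomized CTEs). Because the shuffle happens inside each A-group, every A keeps exactly the same multiset of B values:

```python
import random
from collections import defaultdict

# Sample rows from the question: (A, B, C)
rows = [(1, 1, 'x'), (1, 3, 'a'), (2, 4, 'c'),
        (3, 6, 'd'), (1, 2, 'a'), (3, 5, 'v')]

# Collect the B values of each A-group and shuffle each pool independently
pools = defaultdict(list)
for a, b, _ in rows:
    pools[a].append(b)
for pool in pools.values():
    random.shuffle(pool)

# Hand the shuffled B values back out in row order; A and C are untouched
shuffled = [(a, pools[a].pop(), c) for a, b, c in rows]
print(shuffled)
```

The A and C columns are unchanged, and each group's B values are merely permuted.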
I feel the query below should work in BigQuery:
SELECT
x.A as A, x.B as Old_B, x.c as C, y.B as New_B
FROM (
SELECT A, B, C,
ROW_NUMBER() OVER(PARTITION BY A ORDER BY B, C) as pos
FROM [your_table]
) as x
JOIN (
SELECT
A, B, ROW_NUMBER() OVER(PARTITION BY A ORDER BY rnd) as pos
FROM (
SELECT A, B, RAND() as rnd
FROM [your_table]
)
) as y
ON x.A = y.A AND x.pos = y.pos

SQLSERVER group by (aggregate column based on other column)

I have a table which has 3 columns A, B, C
I want to do a query like this:
select A, Max(B), ( C in the row having max B ) from Table group by A.
Is there a way to write such a query?
Test Data:
A  B  C
2  5  3
2  6  1
4  5  1
4  7  9
6  5  0
the expected result would be:
2  6  1
4  7  9
6  5  0
;WITH CTE AS
(
SELECT A,
B,
C,
RN = ROW_NUMBER() OVER(PARTITION BY A ORDER BY B DESC)
FROM YourTable
)
SELECT A, B, C
FROM CTE
WHERE RN = 1
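The ROW_NUMBER pattern above is the standard greatest-n-per-group technique: within each A-partition, the row with the highest B gets RN = 1, and its C value rides along. A quick Python sketch of the same selection on the test data (illustrative only):

```python
# Test data from the question: (A, B, C)
rows = [(2, 5, 3), (2, 6, 1), (4, 5, 1), (4, 7, 9), (6, 5, 0)]

# For each A, keep the whole row with the highest B; its C value comes
# along for free, which is exactly what RN = 1 selects in the CTE above
best = {}
for a, b, c in rows:
    if a not in best or b > best[a][1]:
        best[a] = (a, b, c)

result = sorted(best.values())
print(result)
```

This yields the three expected rows, one per distinct A.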
Try this
select t.*
from table t
join (select A, max(B) B from table group by A) c
  on c.A = t.A
 and c.B = t.B

Categorize by columns

I have no idea what to name this request, which is why I have not found any answers to it.
Basically a common statement like:
SELECT A,B,C,D,E FROM TABLE
Example result:
A  B  C  D  E
1  1  2  3  4
1  2  3  4  5
1  2  7  8  9
2  1  4  5  6
How do I 'categorize' by certain columns (A and B in the example) so that repeated values in those columns are omitted?
Preferred result:
A  B  C  D  E
1  1  2  3  4
1  2  3  4  5
      7  8  9
2  1  4  5  6
In your case I guess it would make sense to have the result ordered by A,B?
In that case you could use:
SELECT DECODE(RN,1,A,NULL) AS A,
       DECODE(RN,1,B,NULL) AS B,
       C,
       D,
       E
FROM (SELECT A,
             B,
             row_number() over (partition BY A,B order by A,B) AS RN,
             C,
             D,
             E
      FROM (SELECT * FROM TEST_TABLE ORDER BY A,B));
You can use the LAG analytic function to access previous row values. See the example below:
SELECT case
         when LAG(a, 1, NULL) OVER (ORDER BY a, b, c, d, e) = a
          and LAG(b, 1, NULL) OVER (ORDER BY a, b, c, d, e) = b
         then null
         else a
       end new_a,
       case
         when LAG(a, 1, NULL) OVER (ORDER BY a, b, c, d, e) = a
          and LAG(b, 1, NULL) OVER (ORDER BY a, b, c, d, e) = b
         then null
         else b
       end new_b,
       c,
       d,
       e
FROM t_table t
ORDER BY a, b, c, d, e;
SQLFiddle.
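Both answers hinge on the same rule: a row shows its A/B values only if it is the first row of its (A, B) group in the sorted output. A compact Python sketch of that rule on the question's example (illustrative only):

```python
# Example rows from the question: (A, B, C, D, E)
rows = [(1, 1, 2, 3, 4), (1, 2, 3, 4, 5), (1, 2, 7, 8, 9), (2, 1, 4, 5, 6)]

# Show A and B only on the first row of each (A, B) group in sorted order,
# i.e. blank them whenever the previous row carries the same (A, B)
out, prev = [], None
for a, b, c, d, e in sorted(rows):
    if (a, b) == prev:
        out.append((None, None, c, d, e))
    else:
        out.append((a, b, c, d, e))
    prev = (a, b)
print(out)
```

The blanked tuple corresponds to the row where A and B are omitted in the preferred result.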

SQL grouping

I have a table with the following columns:
A  B   C
--------
1  10  X
1  11  X
2  15  X
3  20  Y
4  15  Y
4  20  Y
I want to group the data by the B and C columns and count the distinct values of the A column. But if there are two or more rows where the value in the A column is the same, I want to take the maximum value of the B column.
If I do a simple group by the result would be:
B   C  Count
------------
10  X  1
11  X  1
15  X  1
20  Y  2
15  Y  1
What I want is this result:
B   C  Count
------------
11  X  1
15  X  1
20  Y  2
Is there any query that can return this result? The server is SQL Server 2005.
I like to work in steps: first get rid of duplicate A records, then group. Not the most efficient, but it works on your example.
with t1 as (
select A, max(B) as B, C
from YourTable
group by A, C
)
select count(A) as CountA, B, C
from t1
group by B, C
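The two steps translate directly into procedural terms: first collapse duplicate A's by taking max(B) per (A, C), then group by (B, C) and count. A Python sketch against the question's data (illustrative only):

```python
from collections import defaultdict

# Table from the question: (A, B, C)
rows = [(1, 10, 'X'), (1, 11, 'X'), (2, 15, 'X'),
        (3, 20, 'Y'), (4, 15, 'Y'), (4, 20, 'Y')]

# Step 1: collapse duplicate A's, keeping max(B) per (A, C)
step1 = {}
for a, b, c in rows:
    key = (a, c)
    step1[key] = max(step1.get(key, b), b)

# Step 2: group by (B, C) and count the A's
counts = defaultdict(int)
for (a, c), b in step1.items():
    counts[(b, c)] += 1
print(sorted(counts.items()))
```

This reproduces the wanted result: (11, X) and (15, X) once each, (20, Y) twice.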
I have actually tested this:
SELECT MAX(B) AS B,
       C,
       Count
FROM (SELECT B, C, COUNT(DISTINCT A) AS Count
      FROM t
      GROUP BY B, C) X
GROUP BY C, Count
and it gives me:
B    C    Count
---- ---- -----
15   X    1
15   y    1
20   y    2
WITH cteA AS
(
SELECT
A, C,
MAX(B) OVER(PARTITION BY A, C) [Max]
FROM T1
)
SELECT
[Max] AS B, C,
COUNT(DISTINCT A) AS [Count]
FROM cteA
GROUP BY C, [Max];
Check this out. This should work in Oracle, although I haven't tested it:
select count(a), BB, CC
from (select a, max(B) BB, Max(C) CC
      from yourtable
      group by a)
group by BB, CC