BigQuery, FIRST_VALUE, and null - google-bigquery

In the following example, I would have expected these results:
Row a b f0_
1 1 1 3
2 1 2 3
3 1 3 5
4 1 4 5
5 1 5 null
because, in general, aggregates tend to ignore nulls. If FIRST_VALUE doesn't ignore nulls, what value does it have over using LEAD?
Example:
select a, b, first_value(c) over (partition by a order by b asc rows BETWEEN 1 following AND UNBOUNDED FOLLOWING)
from
(select 1 as a, 1 as b, 1 as c),
(select 1 as a, 2 as b, null as c),
(select 1 as a, 3 as b, 3 as c),
(select 1 as a, 4 as b, null as c),
(select 1 as a, 5 as b, 5 as c)
gives
Row a b f0_
1 1 1 null
2 1 2 3
3 1 3 null
4 1 4 5
5 1 5 5

The trick below gives the result you expected in your question:
SELECT
a, b,
MAX(c) OVER (PARTITION BY a ORDER BY grp ASC RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
FROM (
SELECT
a, b, c,
COUNT(c) OVER (PARTITION BY a ORDER BY b ASC rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) grp
FROM
(SELECT 1 AS a, 1 AS b, 1 AS c),
(SELECT 1 AS a, 2 AS b, NULL AS c),
(SELECT 1 AS a, 3 AS b, 3 AS c),
(SELECT 1 AS a, 4 AS b, NULL AS c),
(SELECT 1 AS a, 5 AS b, 5 AS c)
)
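To see why the trick above works: COUNT(c) skips nulls, so the running count grp only increases on rows where c is non-null, and each null row shares the grp of the last non-null row before it. Looking exactly one grp ahead therefore lands on the next non-null c, and MAX ignores the nulls that share that grp. Below is a minimal Python sketch of that logic, not BigQuery code; the function name next_non_null and the (a, b, c) tuple layout are assumptions made for illustration.

```python
from itertools import groupby


def next_non_null(rows):
    """Emulate the grp trick on (a, b, c) tuples; None stands in for NULL."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for a, part in groupby(ordered, key=lambda r: r[0]):
        part = list(part)
        # Step 1: running COUNT(c). Nulls are not counted, so grp stays flat on them.
        grp, running = [], 0
        for _, _, c in part:
            if c is not None:
                running += 1
            grp.append(running)
        # Step 2: MAX(c) over the rows whose grp is exactly current grp + 1.
        for (_, b, _), g in zip(part, grp):
            candidates = [c for (_, _, c), g2 in zip(part, grp)
                          if g2 == g + 1 and c is not None]
            out.append((a, b, max(candidates) if candidates else None))
    return out


rows = [(1, 1, 1), (1, 2, None), (1, 3, 3), (1, 4, None), (1, 5, 5)]
for row in next_non_null(rows):
    print(row)
```

On the sample data the third element of each printed tuple is 3, 3, 5, 5, None, which is the result the question expected.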
As for "what value does it have over using LEAD":
LEAD has a richer signature, LEAD(<expr>[, <offset>[, <default_value>]]), so if you just need the first value you can shorten it to FIRST_VALUE(<field_name>). I think this is the major practical difference.

Related

Sorting columns within rows in BigQuery

I have table like below
id a b c
1 2 1 3
2 3 2 1
3 16 14 15
4 10 12 13
5 15 16 14
6 10 12 8
I need to "normalize" this table by sorting the values in columns a, b, c within each row, then deduplicating the rows and counting the duplicates.
Expected result:
a b c dups
1 2 3 2
14 15 16 2
10 12 13 1
8 10 12 1
I do have a solution, but I don't see how to scale it easily to the case where I have more than 3 columns to normalize. As you can see below, the first and last columns are not an issue. Things get messy for the middle columns when the number of columns is greater than 3.
select a, b, c, count(1) as dups from (
select a1 as a, if(a != a1 and a != c1, a, if(b != a1 and b != c1, b, c)) as b, c1 as c
from (select a, b, c, least(a, b, c) as a1, greatest(a, b, c) as c1 from table)
) group by a, b, c
Can anyone suggest another approach?
The example below works for 4 columns and can be adjusted to any number of columns by adding an extra STRING(x) to CONCAT() and an extra REGEXP_EXTRACT line for each additional column.
SELECT a, b, c, d, COUNT(1) AS dups
FROM (
SELECT id,
REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){0}(.*),') AS a,
REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){1}(.*),') AS b,
REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){2}(.*),') AS c,
REGEXP_EXTRACT(s + ',', r'(?U)^(?:.*,){3}(.*),') AS d
FROM (
SELECT id, GROUP_CONCAT(s) AS s FROM (
SELECT id, s,
INTEGER(s) AS e,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY e) pos
FROM (
SELECT id,
SPLIT(CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d))) AS s
FROM table
) ORDER BY id, pos
) GROUP BY id
)
) GROUP BY a, b, c, d
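The idea behind the query above is independent of the column count: sort the values inside each row, then group identical sorted rows and count them. A minimal Python sketch of that idea follows, using the 3-column sample from the question; the function name normalize_rows is hypothetical and the id column is left out because it plays no part in the result.

```python
from collections import Counter


def normalize_rows(rows):
    """Sort the values within each row, then count identical sorted rows."""
    counts = Counter(tuple(sorted(values)) for values in rows)
    # Counter keeps first-seen order, so results follow the input order.
    return [key + (dups,) for key, dups in counts.items()]


rows = [(2, 1, 3), (3, 2, 1), (16, 14, 15), (10, 12, 13), (15, 16, 14), (10, 12, 8)]
for row in normalize_rows(rows):
    print(row)
```

Each printed tuple is (a, b, c, dups). The same function works unchanged for rows with 4 or more values, which is the scaling property the question asks about.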

Shuffle column in Google's BigQuery based on groupby

I want to randomly shuffle the values for one single column of a table based on a groupby. E.g., I have two columns A and B. Now, I want to randomly shuffle column B based on a groupby on A.
For an example, suppose that there are three distinct values in A. Now for each distinct value of A, I want to shuffle the values in B, but just with values having the same A.
Example input:
A B C
-------------------
1 1 x
1 3 a
2 4 c
3 6 d
1 2 a
3 5 v
Example output:
A B C
------------------
1 3 x
1 2 a
2 4 c
3 6 d
1 1 a
3 5 v
In this case, for A=1 the values for B got shuffled. The same happened for A=2, but as there is only one row it stayed like it was. For A=3 by chance the values for B also stayed like they were. The values for column C stay as they are.
Maybe this can be solved by using window functions, but I am unsure how exactly.
As a side note: This should be achieved in Google's BigQuery.
Is this what you're after? (You tagged the question with both MySQL and Oracle, so I am answering here using Oracle.)
[edit] corrected based on confirmed logic [/edit]
with w_data as (
select 1 a, 1 b from dual union all
select 1 a, 3 b from dual union all
select 2 a, 4 b from dual union all
select 3 a, 6 b from dual union all
select 1 a, 2 b from dual union all
select 3 a, 5 b from dual
),
w_suba as (
select a, row_number() over (partition by a order by dbms_random.value) aid
from w_data
),
w_subb as (
select a, b, row_number() over (partition by a order by dbms_random.value) bid
from w_data
)
select sa.a, sb.b
from w_suba sa,
w_subb sb
where sa.aid = sb.bid
and sa.a = sb.a
/
A B
---------- ----------
1 3
1 1
1 2
2 4
3 6
3 5
6 rows selected.
SQL> /
A B
---------- ----------
1 3
1 1
1 2
2 4
3 5
3 6
6 rows selected.
SQL>
Logic breakdown:
1) w_data is just your sample data set ...
2) randomize column a (not really needed; you could just use rownum here and let b randomize, but I do so love (over)using dbms_random :) heh)
3) randomize column b (the partition by in the analytic function creates "groups"; ordering by a random value randomizes the items within each group)
4) join them, using both the group (a) and the randomized id to find a random item within each group.
By randomizing this way you ensure that you get the same number of each value, i.e. you start with one "3" and you end with one "3", etc.
I think the query below should work in BigQuery:
SELECT
x.A as A, x.B as Old_B, x.c as C, y.B as New_B
FROM (
SELECT A, B, C,
ROW_NUMBER() OVER(PARTITION BY A ORDER BY B, C) as pos
FROM [your_table]
) as x
JOIN (
SELECT
A, B, ROW_NUMBER() OVER(PARTITION BY A ORDER BY rnd) as pos
FROM (
SELECT A, B, RAND() as rnd
FROM [your_table]
)
) as y
ON x.A = y.A AND x.pos = y.pos
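Both answers follow the same pattern: number the rows inside each A group twice, once in a fixed order and once in a random order, and join on the two numbers. A minimal Python sketch of the same effect is shown below. It is not BigQuery code; the function name shuffle_b_within_a and the seed parameter are assumptions added so the result can be reproduced.

```python
import random
from collections import defaultdict


def shuffle_b_within_a(rows, seed=None):
    """Shuffle the B values among rows that share the same A; A and C stay put."""
    rng = random.Random(seed)
    b_by_a = defaultdict(list)
    for a, b, _ in rows:
        b_by_a[a].append(b)
    for values in b_by_a.values():
        rng.shuffle(values)  # random order inside each A group
    # Hand the shuffled B values back out in the original row order.
    cursors = {a: iter(values) for a, values in b_by_a.items()}
    return [(a, next(cursors[a]), c) for a, _, c in rows]


rows = [(1, 1, "x"), (1, 3, "a"), (2, 4, "c"), (3, 6, "d"), (1, 2, "a"), (3, 5, "v")]
for row in shuffle_b_within_a(rows):
    print(row)
```

Because values are only moved around within a group, each A keeps exactly the same set of B values it started with, which matches the guarantee described for the Oracle version.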

SQLSERVER group by (aggregate column based on other column)

I have a table which has 3 columns A, B, C
I want to do a query like this:
select A, Max(B), ( C in the row having max B ) from Table group by A.
Is there a way to write such a query?
Test Data:
A B C
2 5 3
2 6 1
4 5 1
4 7 9
6 5 0
the expected result would be:
2 6 1
4 7 9
6 5 0
;WITH CTE AS
(
SELECT A,
B,
C,
RN = ROW_NUMBER() OVER(PARTITION BY A ORDER BY B DESC)
FROM YourTable
)
SELECT A, B, C
FROM CTE
WHERE RN = 1
Try this
select t.*
from table t
join (Select A,max(b) B from table group by A) c
on c.a=t.a
and c.b=t.b
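One difference between the two approaches is worth keeping in mind: the ROW_NUMBER version returns exactly one row per A even when several rows tie on the maximum B, while the join against the grouped maximum returns every tied row. The Python sketch below mirrors the one-row-per-group behaviour; the function name row_with_max_b is hypothetical, and keeping the first row seen on a tie is an arbitrary choice made here (ROW_NUMBER does not promise which tied row it picks).

```python
def row_with_max_b(rows):
    """For each A, keep the whole (a, b, c) row that has the largest B."""
    best = {}
    for a, b, c in rows:
        # Strict comparison: on a tie the first row seen is kept.
        if a not in best or b > best[a][1]:
            best[a] = (a, b, c)
    return [best[a] for a in sorted(best)]


rows = [(2, 5, 3), (2, 6, 1), (4, 5, 1), (4, 7, 9), (6, 5, 0)]
for row in row_with_max_b(rows):
    print(row)
```

On the test data from the question this prints (2, 6, 1), (4, 7, 9) and (6, 5, 0).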

Categorize by columns

I have no idea what to call this request, so I have not been able to find any answers to it.
Basically a common statement like:
SELECT A,B,C,D,E FROM TABLE
Example result:
A B C D E
1 1 2 3 4
1 2 3 4 5
1 2 7 8 9
2 1 4 5 6
How do I 'categorize' by certain columns (in example A and B) so column values are omitted?
Preferred result:
A B C D E
1 1 2 3 4
1 2 3 4 5
    7 8 9
2 1 4 5 6
In your case I guess it would make sense to have the result ordered by A,B?
In that case you could use:
SELECT DECODE(RN,1,A,NULL) AS A,
DECODE(RN,1,B,NULL) AS B,
C,
D,
E
FROM
(SELECT A,
B,
row_number() over (partition BY A,B order by A,B) AS RN,
C,
D,
E
FROM
(SELECT * FROM TEST_TABLE ORDER BY A,B
)
);
You can use LAG analytic function to access previous row values. See below example:
SELECT case
when LAG(a, 1, NULL)
OVER(ORDER BY a, b, c, d, e) = a and LAG(b, 1, NULL)
OVER(ORDER BY a, b, c, d, e) = b then
null
else
a
end new_a,
case
when LAG(a, 1, NULL)
OVER(ORDER BY a, b, c, d, e) = a and LAG(b, 1, NULL)
OVER(ORDER BY a, b, c, d, e) = b then
null
else
b
end new_b,
c,
d,
e
FROM t_table t
ORDER BY a, b, c, d, e;
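Both queries rely on the same rule: once the rows are ordered, blank out A and B whenever they equal the A and B of the previous row. A minimal Python sketch of that rule follows; the function name blank_repeated_keys and the key_len parameter are assumptions for illustration, and None stands in for the blank cells.

```python
def blank_repeated_keys(rows, key_len=2):
    """Replace the leading key columns with None when they repeat the previous row's."""
    out, prev_key = [], None
    for row in sorted(rows):
        key = row[:key_len]
        if key == prev_key:
            # Same (A, B) as the row before: suppress the key columns.
            out.append((None,) * key_len + row[key_len:])
        else:
            out.append(row)
        prev_key = key
    return out


rows = [(1, 1, 2, 3, 4), (1, 2, 3, 4, 5), (1, 2, 7, 8, 9), (2, 1, 4, 5, 6)]
for row in blank_repeated_keys(rows):
    print(row)
```

On the sample data only the third row changes, to (None, None, 7, 8, 9), which corresponds to the preferred result in the question.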

SQL grouping

I have a table with the following columns:
A B C
---------
1 10 X
1 11 X
2 15 X
3 20 Y
4 15 Y
4 20 Y
I want to group the data based on the B and C columns and count the distinct values of the A column. But if there are two or more rows where the value in the A column is the same, I want to use the maximum value from the B column.
If I do a simple group by the result would be:
B C Count
--------------
10 X 1
11 X 1
15 X 1
20 Y 2
15 Y 1
What I want is this result:
B C Count
--------------
11 X 1
15 X 1
20 Y 2
Is there any query that can return this result? The server is SQL Server 2005.
I like to work in steps: first get rid of duplicate A records, then group. Not the most efficient, but it works on your example.
with t1 as (
select A, max(B) as B, C
from YourTable
group by A, C
)
select count(A) as CountA, B, C
from t1
group by B, C
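The two steps can be traced by hand on the sample data. The Python sketch below does the same thing outside the database; the function name count_after_dedup is hypothetical, and the output is sorted only to make it easy to compare with the desired result.

```python
from collections import Counter


def count_after_dedup(rows):
    """Step 1: keep MAX(B) per (A, C). Step 2: count the A values per (B, C)."""
    max_b = {}
    for a, b, c in rows:
        key = (a, c)
        if key not in max_b or b > max_b[key]:
            max_b[key] = b
    counts = Counter((b, c) for (_, c), b in max_b.items())
    return sorted((b, c, n) for (b, c), n in counts.items())


rows = [(1, 10, "X"), (1, 11, "X"), (2, 15, "X"), (3, 20, "Y"), (4, 15, "Y"), (4, 20, "Y")]
for row in count_after_dedup(rows):
    print(row)
```

This prints (11, 'X', 1), (15, 'X', 1) and (20, 'Y', 2), the result the question asks for.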
I have actually tested this:
SELECT
MAX( B ) AS B,
C,
Count
FROM
(
SELECT
B, C, COUNT(DISTINCT A) AS Count
FROM
t
GROUP BY
B, C
) X
GROUP BY C, Count
and it gives me:
B C Count
---- ---- --------
15 X 1
15 y 1
20 y 2
WITH cteA AS
(
SELECT
A, C,
MAX(B) OVER(PARTITION BY A, C) [Max]
FROM T1
)
SELECT
[Max] AS B, C,
COUNT(DISTINCT A) AS [Count]
FROM cteA
GROUP BY C, [Max];
Check this out. This should work in Oracle, although I haven't tested it:
select count(a), BB, CC from
(
select a, max(B) BB, Max(C) CC
from yourtable
group by a
)
group by BB,CC