Group by and aggregation on BigQuery

I have a table in Google BigQuery with the following format:
user  url  val1  val2  val3  ...  val300
A     a    0.5   0     -3    ...  1
A     b    1     2     3     ...  2
B     c    5     4     -10   ...  2
I would like to obtain a new table with the number of URLs per user and the vals aggregated by average. (The number of val columns can vary, so I would like something fairly flexible.)
user  nb_url  val1  val2  val3  ...  val300
A     2       0.75  1     0     ...  1.5
B     1       ...
What is the right syntax?
Thank you in advance

Aggregate by user, select the count of URLs, and the average of the other columns.
SELECT
user,
COUNT(*) AS nb_url,
AVG(val1) AS val1,
AVG(val2) AS val2,
AVG(val3) AS val3,
...
AVG(val300) AS val300
FROM yourTable
GROUP BY user
ORDER BY user;
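Since the number of val columns can vary, one option is to generate the SELECT list instead of typing it out. Below is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for BigQuery. The helper name build_query and the three-column sample are assumptions for illustration only.

```python
import sqlite3

def build_query(n_vals, table="yourTable"):
    """Build the GROUP BY query for columns val1..val<n_vals> (hypothetical helper)."""
    averages = ",\n  ".join(f"AVG(val{i}) AS val{i}" for i in range(1, n_vals + 1))
    return (
        "SELECT\n  user,\n  COUNT(*) AS nb_url,\n  "
        f"{averages}\n"
        f"FROM {table}\nGROUP BY user\nORDER BY user"
    )

# Sample rows from the question, truncated to val1..val3 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (user TEXT, url TEXT, val1 REAL, val2 REAL, val3 REAL)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [("A", "a", 0.5, 0, -3), ("A", "b", 1, 2, 3), ("B", "c", 5, 4, -10)],
)

query = build_query(3)
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The generated text is the same query as above, so it can be pasted into BigQuery once the column count matches the real table.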

Generating a pivot for 300 columns can be quite expensive, even for BigQuery. Instead, I would recommend the unpivoted solution below:
select user, count(url) nb_url,
offset + 1 col, avg(cast(val as float64)) as val
from your_table t,
unnest(split(translate(format('%t', (select as struct * except(user, url) from unnest([t]))), '() ', ''))) val with offset
group by user, col
Applied to the sample data in your question, this returns one row per user and column position (col), with nb_url and the average val.
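To make the unpivoted shape concrete, here is a rough Python analogue of the same idea (not the BigQuery query itself): each value becomes its own (user, col) entry, and the average is taken per entry. The sample is truncated to val1..val3, which is an assumption.

```python
from collections import defaultdict

# Sample rows from the question, truncated to val1..val3 (assumption).
rows = [
    ("A", "a", [0.5, 0, -3]),
    ("A", "b", [1, 2, 3]),
    ("B", "c", [5, 4, -10]),
]

# Unpivot: one (user, col) bucket per value, where col is the 1-based position.
buckets = defaultdict(list)
for user, url, vals in rows:
    for offset, val in enumerate(vals):
        buckets[(user, offset + 1)].append(float(val))

# Aggregate each bucket: number of urls and average value.
result = [
    (user, len(vals), col, sum(vals) / len(vals))
    for (user, col), vals in sorted(buckets.items())
]
for row in result:
    print(row)
```

Each tuple is (user, nb_url, col, val), in the same column order as the SELECT list of the query.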

Related

Display query results based on condition

More of a conceptual question. I have a query that calculates a sum of some values and checks it against a template value X, something like:
SELECT
SUM(...),
X,
SUM(...) - X AS delta
FROM
...
Now the query by itself works fine. The problem is that I need it to only display some results if the delta variable is non-zero, meaning there is a difference between the calculated sum and template X. If delta is zero, I need it to display nothing.
Is this possible to achieve in SQL? If yes, how would I go about it?
You have an aggregation query. Some databases, such as MySQL and SQLite, support column aliases in the HAVING clause:
select . . .
from . . .
group by . . .
having delta <> 0;
For those that don't, it is probably simplest to repeat the expressions:
having sum( . . . ) <> X
You can also put the query into a CTE or subquery, and then use where on the subquery.
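The CTE variant can be sketched as follows, using Python's sqlite3 module so it is runnable. The table items and its columns grp, amount and x are hypothetical, since the original query is elided.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per item, with the template value x stored per group.
conn.execute("CREATE TABLE items (grp TEXT, amount INTEGER, x INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("g1", 4, 10), ("g1", 6, 10), ("g2", 3, 10), ("g2", 5, 10)],
)

# The aggregation lives in the CTE, so the outer WHERE can refer to delta by name.
query = """
WITH totals AS (
    SELECT grp, SUM(amount) AS total, x, SUM(amount) - x AS delta
    FROM items
    GROUP BY grp, x
)
SELECT grp, total, x, delta
FROM totals
WHERE delta <> 0
ORDER BY grp
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Group g1 sums to exactly 10 and is dropped; only g2, with a delta of -2, is returned.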
Oracle does not support column aliases in the HAVING clause, so in Oracle you have to repeat the aggregation in the HAVING clause.
See this:
SQL> SELECT MAX(COL1) AS RES,
2 COL2
3 FROM (select 1 as col1, 1 as col2 from dual
4 union all
5 select 10 as col1, 2 as col2 from dual)
6 GROUP BY COL2
7 HAVING RES > 5;
HAVING RES > 5
*
ERROR at line 7:
ORA-00904: "RES": invalid identifier
SQL> SELECT MAX(COL1) AS RES,
2 COL2
3 FROM (select 1 as col1, 1 as col2 from dual
4 union all
5 select 10 as col1, 2 as col2 from dual)
6 GROUP BY COL2
7 HAVING MAX(COL1) > 5;
RES COL2
---------- ----------
10 2
SQL>

Find min max over all columns without listing down each column name in SQL

I have a SQL table (actually a BigQuery table) that has a huge number of columns (over a thousand). I want to quickly find the min and max value of each column. Is there a way to do that?
It is impossible for me to list all the columns. Looking for ways to do something like
SELECT MAX(*) FROM mytable;
and then running
SELECT MIN(*) FROM mytable;
I have been unable to Google a way of doing that. Not sure that's even possible.
For example, if my table has the following schema:
col1 col2 col3 .... col1000
the (say, max) query should return
Row col1 col2 col3 ... col1000
1 3 18 0.6 ... 45
and the min query should return (say)
Row col1 col2 col3 ... col1000
1 -5 4 0.1 ... -5
The numbers are just for illustration. The column names could be different strings and not easily scriptable.
See the example below for BigQuery Standard SQL. It works for any number of columns and does not require naming the columns explicitly.
#standardSQL
WITH `project.dataset.mytable` AS (
SELECT 1 AS col1, 2 AS col2, 3 AS col3, 4 AS col4 UNION ALL
SELECT 7,6,5,4 UNION ALL
SELECT -1, 11, 5, 8
)
SELECT
MIN(CAST(value AS INT64)) AS min_value,
MAX(CAST(value AS INT64)) AS max_value
FROM `project.dataset.mytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'":(.*?)(?:,"|})')) value
with result
Row min_value max_value
1 -1 11
Note: if your columns are of STRING data type, remove the CAST ... AS INT64.
If they are FLOAT64, replace INT64 with FLOAT64 in the CAST function.
Update
Below is an option to get the MIN/MAX for each column and present the result as a list of the respective values, in column order:
#standardSQL
WITH `project.dataset.mytable` AS (
SELECT 1 AS col1, 2 AS col2, 3 AS col3, 14 AS col4 UNION ALL
SELECT 7,6,5,4 UNION ALL
SELECT -1, 11, 5, 8
), temp AS (
SELECT pos, MIN(CAST(value AS INT64)) min_value, MAX(CAST(value AS INT64)) max_value
FROM `project.dataset.mytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'":(.*?)(?:,"|})')) value WITH OFFSET pos
GROUP BY pos
)
SELECT 'min_values' stats, TO_JSON_STRING(ARRAY_AGG(min_value ORDER BY pos)) vals FROM temp UNION ALL
SELECT 'max_values', TO_JSON_STRING(ARRAY_AGG(max_value ORDER BY pos)) FROM temp
with result
Row stats vals
1 min_values [-1,2,3,4]
2 max_values [7,11,5,14]
Hope this is something you can still apply to whatever your final goal is.
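To see what the regular expression is doing, here is a small Python sketch that applies the same pattern to a compact JSON rendering of each row and then takes the min and max per position. Using json.dumps with compact separators to approximate the JSON text the query parses is an assumption.

```python
import json
import re

# Same pattern as in the query: capture whatever follows '":' up to ',"' or '}'.
VALUE_PATTERN = re.compile(r'":(.*?)(?:,"|})')

rows = [
    {"col1": 1, "col2": 2, "col3": 3, "col4": 14},
    {"col1": 7, "col2": 6, "col3": 5, "col4": 4},
    {"col1": -1, "col2": 11, "col3": 5, "col4": 8},
]

def extract_values(row):
    # Compact separators give text like {"col1":1,"col2":2,...} (assumption).
    text = json.dumps(row, separators=(",", ":"))
    return [int(v) for v in VALUE_PATTERN.findall(text)]

# zip(*...) regroups the per-row lists by column position, like GROUP BY pos.
columns = list(zip(*(extract_values(r) for r in rows)))
min_values = [min(c) for c in columns]
max_values = [max(c) for c in columns]
print(min_values, max_values)
```

The column names never appear in the code that computes the statistics; only the position of each value in the serialized row matters.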

How to update rows based on a shared ID within a single table

Currently I have a table that looks like below:
ID|Date |Val1|Val2|
1 |1/1/2016|1000|0
2 |1/1/2016|Null|0
3 |1/1/2016|Null|0
1 |2/1/2016|1000|0
2 |2/1/2016|Null|0
3 |2/1/2016|1000|0
1 |3/1/2016|1000|0
2 |3/1/2016|1000|0
3 |3/1/2016|1000|0
I want val2 to become 1 if Val1 is populated in the previous month, so the output would look like:
ID|Date |Val1|Val2|
1 |1/1/2016|1000|0
2 |1/1/2016|Null|0
3 |1/1/2016|Null|0
1 |2/1/2016|1000|1
2 |2/1/2016|Null|0
3 |2/1/2016|1000|0
1 |3/1/2016|1000|1
2 |3/1/2016|1000|0
3 |3/1/2016|1000|1
I've tried a few code combinations, but the conditional of updating the value by the previous date where Val1 first appears is tripping me up. I'd appreciate any help!
You can do this with a windowed LAG() to find the previous Val1, and update Val2 when that previous value is not NULL.
;With Cte As
(
Select Id, [Date], Val1, Val2,
Lag(Val1) Over (Partition By Id Order By [Date] Asc) As Prev
From LikeBelow
)
Update Cte
Set Val2 = 1
Where Prev Is Not Null;
If you are actually storing your dates as a VARCHAR and not a DATE, you'll need to convert it:
;With Cte As
(
Select Id, [Date], Val1, Val2,
Lag(Val1) Over (Partition By Id
Order By Convert(Date, [Date]) Asc) As Prev
From LikeBelow
)
Update Cte
Set Val2 = 1
Where Prev Is Not Null;
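For anyone who wants to try the LAG() logic without SQL Server, here is a sketch using Python's sqlite3 module. SQLite cannot update through a CTE, so the sketch first selects the rows to flag and then updates them by rowid. It assumes SQLite 3.25 or later (for window functions) and ISO-formatted dates, which sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LikeBelow (Id INTEGER, Date TEXT, Val1 INTEGER, Val2 INTEGER)"
)
data = [
    (1, "2016-01-01", 1000), (2, "2016-01-01", None), (3, "2016-01-01", None),
    (1, "2016-02-01", 1000), (2, "2016-02-01", None), (3, "2016-02-01", 1000),
    (1, "2016-03-01", 1000), (2, "2016-03-01", 1000), (3, "2016-03-01", 1000),
]
conn.executemany("INSERT INTO LikeBelow VALUES (?, ?, ?, 0)", data)

# Step 1: find rows whose previous row (per Id, by date) has a non-NULL Val1.
to_flag = conn.execute("""
    SELECT rid FROM (
        SELECT rowid AS rid,
               LAG(Val1) OVER (PARTITION BY Id ORDER BY Date) AS Prev
        FROM LikeBelow
    )
    WHERE Prev IS NOT NULL
""").fetchall()

# Step 2: set Val2 = 1 on exactly those rows.
conn.executemany("UPDATE LikeBelow SET Val2 = 1 WHERE rowid = ?", to_flag)

flagged = conn.execute(
    "SELECT Id, Date FROM LikeBelow WHERE Val2 = 1 ORDER BY Id, Date"
).fetchall()
print(flagged)
```

The flagged rows match the desired output in the question: ID 1 in February and March, and ID 3 in March.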

SQL - Group by numbers according to their difference

I have a table and I want to group rows whose col2 values differ by at most x.
For example,
col1 col2
abg 3
abw 4
abc 5
abd 6
abe 20
abf 21
After the query I want to get groups such as:
group 1: abg 3
         abw 4
         abc 5
         abd 6
group 2: abe 20
         abf 21
In this example the maximum difference is 1.
How can I write such a query?
For Oracle (or anything that supports window functions) this will work:
select col1, col2, sum(group_gen) over (order by col2) as grp
from (
select col1, col2,
case when col2 - lag(col2) over (order by col2) > 1 then 1 else 0 end as group_gen
from some_table
)
Check it on SQLFiddle.
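If the fiddle is not available, the same query can be tried locally. This is a sketch using Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions), with the allowed gap passed as a parameter named MAX_GAP, which is a name introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [("abg", 3), ("abw", 4), ("abc", 5), ("abd", 6), ("abe", 20), ("abf", 21)],
)

MAX_GAP = 1  # largest difference allowed between neighbours in one group

# Inner query: mark a row with 1 when it starts a new group.
# Outer query: a running sum of those marks gives the group number.
query = """
SELECT col1, col2, SUM(group_gen) OVER (ORDER BY col2) AS grp
FROM (
    SELECT col1, col2,
           CASE WHEN col2 - LAG(col2) OVER (ORDER BY col2) > ?
                THEN 1 ELSE 0 END AS group_gen
    FROM some_table
)
ORDER BY col2
"""
rows = conn.execute(query, (MAX_GAP,)).fetchall()
for row in rows:
    print(row)
```

Group numbers start at 0 because the first row has no predecessor, so its LAG() is NULL and the CASE falls through to 0.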
This MySQL query should get what you need, and changing the gap to 5, or any other number, is a single change at the @lastVal + 1 expression. The prequery "PreSorted" is required to make sure the data is processed sequentially, so you don't get out-of-order entries.
As each row is processed, its column 2 value is stored in @lastVal for comparison against the next row, but remains available as the column "Col2". There is no GROUP BY, since you just want a column identifying which group each row belongs to rather than any aggregation.
select
@grp := if( PreSorted.col2 > @lastVal + 1, @grp + 1, @grp ) as GapGroup,
PreSorted.col1,
@lastVal := PreSorted.col2 as Col2
from
( select
YT.col1,
YT.col2
from
YourTable YT
order by
YT.col2 ) PreSorted,
( select @grp := 1,
@lastVal := -1 ) sqlvars
Try this query; you can pass 1 or 2 as input to get each group:
var grp number(5)
exec :grp :=1
select * from YourTABLE
where (:grp = 1 and col2 < 20) or (:grp = 2 and col2 > 6);

Oracle SQL -- select from two columns and combine into one

I have this table:
Vals
Val1 Val2 Score
A B 1
C 2
D 3
I would like the output to be a single column that is the "superset" of the Val1 and Val2 columns, keeping the "score" associated with each value.
The output should be:
Val Score
A 1
B 1
C 2
D 3
Selecting from this table twice and then unioning is absolutely not a possibility because producing it is very expensive. In addition I cannot use a with clause because this query uses one in a sub-query and for some reason Oracle doesn't support two with clauses.
I don't really care about how repeat values are dealt with, whatever is easiest/fastest.
How can I generate my appropriate output?
Here is a solution without using UNPIVOT.
with columns as (
select level as colNum from dual connect by level <= 2
),
results as (
select case colNum
when 1 then Val1
when 2 then Val2
end Val,
score
from vals,
columns
)
select * from results where val is not null
Here is essentially the same query without the WITH clause:
select case colNum
when 1 then Val1
when 2 then Val2
end Val,
score
from vals,
(select level as colNum from dual connect by level <= 2) columns
where case colNum
when 1 then Val1
when 2 then Val2
end is not null
Or a bit more concisely
select *
from ( select case colNum
when 1 then Val1
when 2 then Val2
end Val,
score
from vals,
(select level as colNum from dual connect by level <= 2) columns
) results
where val is not null
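The cross-join idea is not Oracle-specific. As a sketch, here it is in Python's sqlite3 module, where a two-row inline table replaces the connect by level generator. Treating the empty cells in the sample as NULL values in Val2 is an assumption, as is the alias col_numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vals (Val1 TEXT, Val2 TEXT, Score INTEGER)")
# Empty cells from the question are stored as NULL in Val2 (assumption).
conn.executemany(
    "INSERT INTO vals VALUES (?, ?, ?)",
    [("A", "B", 1), ("C", None, 2), ("D", None, 3)],
)

# Each row of vals is paired with colNum 1 and 2; CASE picks the matching column.
query = """
SELECT Val, Score
FROM (
    SELECT CASE colNum WHEN 1 THEN Val1 WHEN 2 THEN Val2 END AS Val,
           Score
    FROM vals
    CROSS JOIN (SELECT 1 AS colNum UNION ALL SELECT 2) AS col_numbers
) AS results
WHERE Val IS NOT NULL
ORDER BY Val
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The vals table appears only once in the FROM clause, which is the point of the technique when producing that table is expensive.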
Try this; it looks like you want to convert column values into rows:
select val1, score from vals where val1 is not null
union
select val2,score from vals where val2 is not null
If you're on Oracle 11g or later, UNPIVOT will help:
SELECT *
FROM vals
UNPIVOT ( val FOR origin IN (val1, val2) )
You can choose any names instead of 'val' and 'origin'.
See the Oracle article on PIVOT / UNPIVOT.