BigQuery - replicate rows with modified values - sql

The title of the post might not accurately represent what I want to do. I have a BigQuery table with a userId column and a bunch of feature columns. Let's say the table is like this.
| userId | col1 | col2 | col3 |
| ------ | ---- | ---- | ---- |
| u1     | 0.3  | 0.0  | 0.0  |
| u2     | 0.0  | 0.1  | 0.6  |
Each row has a userId (userIds may or may not be distinct across rows), followed by some feature values. Most of those are 0 except a few.
Now, for each of the rows, I want to create additional rows where only one non-zero feature is substituted with 0. With the example above, the resulting table would look like this.
| userId | col1 | col2 | col3 |
| ------ | ---- | ---- | ---- |
| u1     | 0.3  | 0.0  | 0.0  |
| u1     | 0.0* | 0.0  | 0.0  |
| u2     | 0.0  | 0.1  | 0.6  |
| u2     | 0.0  | 0.0* | 0.6  |
| u2     | 0.0  | 0.1  | 0.0* |
Values with an asterisk mark the column whose non-zero value was set to 0. Since u1 had one non-zero feature, only one additional row was added for it, with col1 set to 0. u2 had two non-zero columns (col2 and col3), so two additional rows were added: one with col2 set to 0 and the other with col3 set to 0.
The table has around 2000 columns and more than 20 million rows.
Normally I would post whatever crude attempts I could come up with, but in this case I don't even know where to start. I did have one idea of joining this table with an unpivoted version of itself, but I don't know how to unpivot a BigQuery table (see the sketch below for one option).
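For reference, BigQuery Standard SQL now has a native UNPIVOT operator. A minimal sketch for the three sample columns above (your real table would need all 2000 column names in the IN list, e.g. generated from INFORMATION_SCHEMA.COLUMNS) could look like:
-- unpivot the sample table into (userId, column name, value) rows
select userId, col, val
from `project.dataset.table`
unpivot (val for col in (col1, col2, col3));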

Below is for BigQuery Standard SQL
It is generic enough that you don't need to specify column names or repeat the same chunk of code 2000 times.
Assuming that your initial data is in the `project.dataset.table` table:
#standardSQL
create temp table flatten as
with temp as (
  -- flatten each row into (userid, offset, column name, value) via its JSON representation
  select userid, offset,
    split(col_kv, ':')[offset(0)] as col,
    cast(split(col_kv, ':')[offset(1)] as float64) as val
  from `project.dataset.table` t,
  unnest(split(translate(to_json_string(t), '{}"', ''))) col_kv with offset
  where split(col_kv, ':')[offset(0)] != 'userid'
), numbers as (
  -- one candidate offset per feature column
  select * from unnest((
    select generate_array(1, max(offset))
    from temp)) as grp
), targets as (
  -- for each userid, the offsets of its non-zero columns
  select userid, grp from temp, numbers
  where grp = offset and val != 0
), flatten_result as (
  -- original rows (grp = 0) plus, per non-zero column, a copy with that column zeroed
  select *, 0 as grp from temp union all
  select userid, offset, col, if(offset = grp, 0, val) as val, grp
  from temp left join targets using(userid)
)
select * from flatten_result;
execute immediate '''create temp table pivot as
select userid, ''' || (
  -- build one "max(if(col = '<name>', val, null)) as <name>" expression per column
  select string_agg(distinct "max(if(col = '" || col || "', val, null)) as " || col)
  from flatten
) || ''' from flatten group by userid, grp''';
select * from pivot order by userid;
Your final output is in the temp table pivot.
If you apply the above to the sample data from your question, the pivot table contains the expected rows: the originals plus one extra row per non-zero column, with that column set to 0.

One method is brute force:
select userid, col1, col2, col3
from t
union all
select userid, 0 as col1, col2, col3
from t
where col1 <> 0
union all
select userid, col1, 0 as col2, col3
from t
where col2 <> 0
union all
select userid, col1, col2, 0 as col3
from t
where col3 <> 0;
This is verbose, and it gets convoluted with hundreds of columns. I can't readily think of a simpler method.
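One way to avoid hand-writing 2000 branches is to generate the brute-force UNION ALL with dynamic SQL from INFORMATION_SCHEMA.COLUMNS. This is only a sketch with placeholder project/dataset/table names; it assumes userId is the table's first column, and with 2000 columns the generated statement may run into BigQuery's query-length limit:
execute immediate (
  select 'select * from `project.dataset.table` union all ' ||
         string_agg(branch, ' union all ')
  from (
    -- one generated SELECT per feature column, with that column replaced by 0,
    -- restricted to rows where it is currently non-zero
    select
      'select userId, ' ||
      string_agg(if(c2.column_name = c1.column_name,
                    '0 as ' || c2.column_name,
                    c2.column_name), ', ' order by c2.ordinal_position) ||
      ' from `project.dataset.table` where ' || c1.column_name || ' != 0' as branch
    from `project.dataset`.INFORMATION_SCHEMA.COLUMNS c1
    join `project.dataset`.INFORMATION_SCHEMA.COLUMNS c2
      on c1.table_name = c2.table_name
    where c1.table_name = 'table'
      and c1.column_name != 'userId'
      and c2.column_name != 'userId'
    group by c1.column_name
  )
);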

Related

How to select only rows which having the values excluding the column which have the value 0 in BigQuery

I have a data set with columns user_id, photo_taken, phot_uploaded, and photo_upload_erro. For every user there is a count for each of these columns:
| user_id     | photo_taken | phot_uploaded | photo_upload_erro |
| ----------- | ----------- | ------------- | ----------------- |
| 34645654645 | 6           | 7             | 9                 |
| 65543545435 | 0           | 0             | 0                 |
| 65455545435 | 0           | 0             | 0                 |
| 44553535435 | 1           | 1             | 1                 |
I want to take the columns that have values and I want to exclude the columns which have the value 0.
| user_id     | photo_taken | phot_uploaded | photo_upload_erro |
| ----------- | ----------- | ------------- | ----------------- |
| 34645654645 | 6           | 7             | 9                 |
| 44553535435 | 1           | 1             | 1                 |
You seem to want to exclude rows where all the values are 0. If that is the case:
select t.*
from t
where photo_taken > 0 or phot_uploaded > 0 or photo_upload_erro > 0;
Actually, it is not clear whether you want to filter out rows where all the values are 0 or rows where any value is 0. The above filters out rows where all the values are 0; to filter out rows where any value is 0, change the or to and, as shown below.
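For concreteness, the variant that filters out rows where any value is 0 would be:
-- keeps only rows where every count is non-zero
select t.*
from t
where photo_taken > 0 and phot_uploaded > 0 and photo_upload_erro > 0;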
How to select only rows which having the values ...
Consider the below option - it DOES NOT require explicitly mentioning all the columns to be checked, which can be quite handy if you have more than just a few:
select *
from `project.dataset.table` t
where translate(format('%t',(select as struct * except(user_id) from unnest([t]))),'0, ','')!='()'
If applied to the sample data in your question, the output is the first and last rows (user_id 34645654645 and 44553535435).
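For intuition, here is roughly what that expression evaluates, sketched against a hypothetical all-zero row: format('%t', struct) renders the non-key columns as a tuple string such as '(0, 0, 0)', and translate(..., '0, ', '') strips zeroes, commas and spaces, so an all-zero row collapses to '()' and is filtered out, while any row with a non-zero value does not.
select
  format('%t', (select as struct 0 as photo_taken, 0 as phot_uploaded, 0 as photo_upload_erro)) as rendered,    -- '(0, 0, 0)'
  translate(format('%t', (select as struct 0 as photo_taken, 0 as phot_uploaded, 0 as photo_upload_erro)),
            '0, ', '') as collapsed                                                                             -- '()'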
I want to exclude the columns which have the value 0 [ ... in all rows ...]
Consider below
execute immediate (
select 'select ' || string_agg(col, ', ' order by offset) || ' from `project.dataset.table`'
from (
select distinct offset, col,
logical_and(val = '0') over(partition by col) all_zeroes
from `project.dataset.table` t,
unnest(split(translate(to_json_string(t), '{}"', ''))) kv with offset,
unnest([struct(split(kv, ':')[offset(0)] as col, split(kv, ':')[offset(1)] as val)])
)
where not all_zeroes
)
If applied to data in which photo_taken is 0 in every row, the generated statement selects only the remaining columns, so photo_taken is excluded from the output because it contains all zeroes.
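Roughly, for such data the inner query would assemble a statement like this (table name is a placeholder):
select user_id, phot_uploaded, photo_upload_erro from `project.dataset.table`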

SQL - Create a formatted output with placeholder rows

For reasons of our IT department, I am stuck doing this entirely within an SQL query.
Simplified, I have this as an input table:
| col1 | col2  |
| ---- | ----- |
| SetA | BH101 |
| SetA | BH102 |
| SetA | BH103 |
| SetB | BH201 |
| SetB | BH202 |
| SetB | BH203 |
And I need to create this:
| col1 | col2  |
| ---- | ----- |
| SetA |       |
|      | BH101 |
|      | BH102 |
|      | BH103 |
| SetB |       |
|      | BH201 |
|      | BH202 |
|      | BH203 |
And I am just not sure where to start with this. In my normal C# way of thinking it's easy: Column1 is ordered; if the value in Col1 is new, add a new row to the output and put the contents of Column1 in it. Then, while the contents of the input Column1 are unchanged, keep adding the contents of Column2 to new rows.
In SQL... nope, I just cannot see the right way to start!
This is a presentation issue that is easily handled in the application or presentation layer; in SQL it can be clunky. The goal of a database is not to render a UI but to store and retrieve data quickly and efficiently, so it can serve as many clients as possible within the same hardware and software resource constraints.
The query that could do this can look like:
with
y as (
select col1, row_number() over(order by col1) as r1
from (select distinct col1 as col1 from t) x
),
z as (
select
t.col1, y.r1, t.col2,
row_number() over(partition by t.col1 order by t.col2) as r2
from t
join y on y.col1 = t.col1
)
select col1, col2
from (
select col1, null as col2, r1, 0 as r2 from y
union all
select null, col2, r1, r2 from z
) w
order by r1, r2
As you see, it looks clunky and bloated.
You need a header row for each group, consisting of col1 and null, plus all the rows of the table with null as col1.
You can do it with UNION ALL and conditional sorting:
select
case when t.col2 is null then t.col1 end col1,
t.col2
from (
select col1, col2 from tablename
union all
select distinct col1, null from tablename
) t
order by
t.col1,
case when t.col2 is null then 1 else 2 end,
t.col2
See the demo (it is written for MySQL, but the query is standard SQL).
Results:
| col1 | col2 |
| ---- | ----- |
| SetA | |
| | BH101 |
| | BH102 |
| | BH103 |
| SetB | |
| | BH201 |
| | BH202 |
| | BH203 |
I agree, formatting should be done outside of SQL, but if you have no choice, here is some SQL Server code that will generate your output. GROUP BY ... WITH ROLLUP adds a subtotal row per col1 group with col2 set to NULL; that NULL row becomes the 'SetA'/'SetB' header line, the detail rows get an empty first column, and the final WHERE drops the grand-total row.
select *
from (
select top 100
case
when col2 is null then ' '+col1
else '' end as firstCol,
IsNull(col2,'') as Col2
from dbo.test t1
group by col1,col2 with rollup
order by col1,col2
) x
where x.firstcol is not null

How can I array aggregate per column where distinct values are less than a given number in Google BigQuery?

I have a dataset table like this in Google Big Query:
| col1 | col2 | col3 | col4 | col5 | col6 |
-------------------------------------------
| a1 | b1 | c1 | d1 | e2 | f1 |
| a2 | b2 | c2 | d1 | e2 | f2 |
| a1 | b3 | c3 | d1 | e3 | f2 |
| a2 | b1 | c4 | d1 | e4 | f2 |
| a1 | b2 | c5 | d1 | e5 | f2 |
Let's say the given threshold number is 4, in that case, I want to transform this into one of the tables given below:
| col1    | col2       | col4 | col5          | col6    |
| ------- | ---------- | ---- | ------------- | ------- |
| [a1,a2] | [b1,b2,b3] | [d1] | [e2,e3,e4,e5] | [f1,f2] |
Or like this:
| col  | values        |
| ---- | ------------- |
| col1 | [a1,a2]       |
| col2 | [b1,b2,b3]    |
| col4 | [d1]          |
| col5 | [e2,e3,e4,e5] |
| col6 | [f1,f2]       |
Please note col3 was removed because it contained more than 4 (the threshold) distinct values. I explored a lot of documents here but was not able to figure out the required query. Can somebody help or point me in the right direction?
Edit: I have one solution in mind, where I do something like this:
select * from (
  select 'col1' as col, array_agg(distinct col1) as values from `project.dataset.table` union all
  select 'col2', array_agg(distinct col2) from `project.dataset.table` union all
  select 'col3', array_agg(distinct col3) from `project.dataset.table` union all
  select 'col4', array_agg(distinct col4) from `project.dataset.table` union all
  select 'col5', array_agg(distinct col5) from `project.dataset.table` union all
  select 'col6', array_agg(distinct col6) from `project.dataset.table`
) X where array_length(values) <= 4;
This will give me the second result, but it requires complex query construction given that I don't know the number and names of the columns up front. Also, this might cross the 100MB-per-row limit for a BigQuery table, as I will have more than a billion rows. Please also suggest if there is a better way to do this.
How about:
WITH arrays AS (
SELECT * FROM UNNEST((
SELECT [
STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
, ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
, ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
, ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
]
FROM `githubarchive.year.2017`
))
)
SELECT *
FROM arrays
WHERE ARRAY_LENGTH(values)<=1000
This query processed 20.6GB in 11.9s (half billion rows). It only returned one row, because every other row had more than 1000 unique values (my threshold).
That's traditional SQL, but here is an even simpler query that produces similar results:
SELECT col, ARRAY_AGG(DISTINCT value IGNORE NULLS LIMIT 1001) values
FROM (
SELECT REGEXP_EXTRACT(x, r'"([^\"]*)"') col , REGEXP_EXTRACT(x, r'":"([^\"]*)"') value
FROM (
SELECT SPLIT(TO_JSON_STRING(STRUCT(repo.name, actor.login, type, org.login)), ',') x
FROM `githubarchive.year.2017`
), UNNEST(x) x
)
GROUP BY col
HAVING ARRAY_LENGTH(values)<=1000
# 17.0 sec elapsed, 20.6 GB processed
Caveat: This will only work if there are no special characters in the column values, like quotes or commas. If you have those, it won't be as straightforward (but it is still possible).
Below is for BigQuery Standard SQL
#standardSQL
SELECT col, STRING_AGG(DISTINCT value) `values`
FROM (
SELECT
TRIM(z[OFFSET(0)], '"') col,
TRIM(z[OFFSET(1)], '"') value
FROM `project.dataset.table` t,
UNNEST(SPLIT(TRIM(TO_JSON_STRING(t), '{}'))) kv,
UNNEST([STRUCT(SPLIT(kv, ':') AS z)])
)
GROUP BY col
HAVING COUNT(DISTINCT value) < 5
You can test and play with the above using the sample data from your question; the result will be:
Row col values
1 col1 a1,a2
2 col2 b1,b2,b3
3 col4 d1
4 col5 e2,e3,e4,e5
5 col6 f1,f2
@FelipeHoffa I was able to use your idea with a little modification in the query for my use case.
SELECT * FROM UNNEST((
SELECT [
STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
, ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
, ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
, ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
]
FROM `githubarchive.year.2017`
))
This UNNEST on an array of structs will not work as-is, because the underlying columns have different data types and BigQuery will not be able to put the arrays under a single column (with an error like: Array elements of types {STRUCT<...>, STRUCT<...>} do not have a common supertype). I modified it to something like this to serve my use case:
SELECT * FROM UNNEST((
SELECT [
STRUCT("col_repo_name" AS col, to_json_string(ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001)) AS values)
, ('col_actor_login', to_json_string(ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001)))
, ('col_type', to_json_string(ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001)))
, ('col_org_login', to_json_string(ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001)))
]
FROM `githubarchive.year.2017`
))
And this worked well!

SQL rank/dense_rank and how to query/calculate with the result

So I have a table where it dense_ranks my rows.
Here is the table:
| COL1 | COL2 | COL3 | DENSE_RANK |
| ---- | ---- | ---- | ---------- |
| a    | b    | c    | 1          |
| a    | s    | r    | 1          |
| a    | w    | f    | 1          |
| b    | b    | c    | 2          |
| c    | f    | r    | 3          |
| c    | q    | d    | 3          |
So now I want to select any rows whose rank is represented only once - here the 2 is all alone, but not the 1 or the 3. I want to select all the rows where this occurs, but how do I do that?
Some ideas:
- COUNT DISTINCT (RANK())
- COUNT RANK()
but neither of those is working. Any ideas? Please and thank you!
happy hacking
actual code:
SELECT events.event_type AS "event",
DENSE_RANK() OVER (ORDER BY bw_user_event.pad_id) as rank
FROM user_event
WHERE (software_events.software_id = '8' OR software_events.software_id = '14')
AND (software_events.event_type = 'install')
WITH Dense_ranked_table as (
-- Your select query that generates the table with dense ranks
)
SELECT DENSE_RANK
FROM Dense_ranked_table
GROUP BY DENSE_RANK
HAVING COUNT(DENSE_RANK) = 1;
I don't have SQL Server to test this. So please let me know whether this works or not.
I would think you can add a COUNT(*) OVER (PARTITION BY XXXXX) where XXXXX is what you include in your dense rank.
Then wrap this in a Common Table Expression and select where your new Count is = 1.
Something like this fiddle:
http://sqlfiddle.com/#!6/ae774/1
Code included here as well:
CREATE TABLE T
(
COL1 CHAR,
COL2 CHAR,
COL3 CHAR
);
INSERT INTO T
VALUES
('a','b','c'),
('a','s','r'),
('a','w','f'),
('b','b','c'),
('c','f','r'),
('c','q','d');
WITH CTE AS (
SELECT COL1 ,
COL2 ,
COL3,
DENSE_RANK() OVER (ORDER BY COL1) AS DR,
COUNT(*) OVER (PARTITION BY COL1) AS C
FROM dbo.T AS t
)
SELECT COL1, COL2, COL3, DR
FROM CTE
WHERE C = 1
Would return just the
b, b, c, 2
row from your test data.

Oracle SQL Transpose

Before I begin: I know there are a whole bunch of questions on Stack Overflow on this topic, but I could not find any of them relevant to my case because they involve something much more complicated than what I need.
What I want is a simple dumb transpose with no logic involved.
Here is the original table that my select query returns:
Name Age Sex DOB Col1 Col2 Col3 ....
A 12 M 8/7 aa bb cc
Typically, this is going to contain only one record, i.e. for one person.
Now what I want is
Field Value
Name A
Age 12
Sex M
DOB 8/7
Col1 aa
Col2 bb
Col3 cc
.
.
So there is no counting, summing or any complicated logic involved like most of the similar question on Stackoverflow.
How do I do it?
I read through the PIVOT and UNPIVOT help and it was not that helpful at all.
PS: If, by chance, it contains more than one record, is it possible to return each record as a field, somewhat like:
Field Value1 Value2 Value3 ...
Name A B C ...
Age .. .. .. ...
.
.
I want to know how to do this for Oracle 10g and 11g.
PS: Feel free to tag this as a duplicate if you find a question that is truly similar to mine.
I would suggest applying the UNPIVOT function first to your multiple columns, then using row_number() to create your new column names that will be used in the PIVOT.
The basic syntax for the unpivot will be
select field,
value,
'value'||
to_char(row_number() over(partition by field
order by value)) seq
from yourtable
unpivot
(
value
for field in (Name, Age, Sex, DOB, col1, col2, col3)
) u;
See SQL Fiddle with Demo. This is going to convert your multiple columns of data into multiple rows. I used row_number() to create a unique value for your new column names; the data from this query looks like:
| FIELD | VALUE | SEQ    |
| ----- | ----- | ------ |
| AGE   | 12    | value1 |
| AGE   | 15    | value2 |
| COL1  | aa    | value1 |
| COL1  | xx    | value2 |
Then you can apply the PIVOT function to this result:
select field, value1, value2
from
(
select field,
value,
'value'||
to_char(row_number() over(partition by field
order by value)) seq
from yourtable
unpivot
(
value
for field in (Name, Age, Sex, DOB, col1, col2, col3)
) u
) d
pivot
(
max(value)
for seq in ('value1' as value1, 'value2' as value2)
) piv
See SQL Fiddle with Demo. This gives a final result:
| FIELD | VALUE1 | VALUE2 |
|-------|-------------------------|-------------------------|
| AGE | 12 | 15 |
| COL1 | aa | xx |
| COL2 | bb | yy |
| COL3 | cc | zz |
| DOB | 07-Aug-2001 12:00:00 AM | 26-Aug-2001 12:00:00 AM |
| NAME | A | B |
| SEX | F | M |
Note: when you are applying the unpivot function, the datatype of all of the columns must be the same, so you might have to convert your data in a subquery before you can unpivot it.
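A minimal sketch of that conversion step for the sample table (assuming, hypothetically, that Age is numeric and DOB is a date):
select field, value
from (
      -- cast everything to a common type (VARCHAR2) before UNPIVOT
      select to_char(Name) as Name,
             to_char(Age) as Age,
             to_char(Sex) as Sex,
             to_char(DOB, 'DD-Mon-YYYY') as DOB,
             to_char(col1) as col1,
             to_char(col2) as col2,
             to_char(col3) as col3
      from yourtable
     )
unpivot
(
  value
  for field in (Name, Age, Sex, DOB, col1, col2, col3)
) u;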
The UNPIVOT/PIVOT functions were introduced in Oracle 11g; if you are using Oracle 10g, you can edit the query to use:
with cte as
(
select 'name' field, name value
from yourtable
union all
select 'Age' field, Age value
from yourtable
union all
select 'Sex' field, Sex value
from yourtable
union all
select 'DOB' field, DOB value
from yourtable
union all
select 'col1' field, col1 value
from yourtable
union all
select 'col2' field, col2 value
from yourtable
union all
select 'col3' field, col3 value
from yourtable
)
select
field,
max(case when seq = 'value1' then value end) value1,
max(case when seq = 'value2' then value end) value2
from
(
select field, value,
'value'||
to_char(row_number() over(partition by field
order by value)) seq
from cte
) d
group by field;
See SQL Fiddle with Demo