one hot encode list in bigquery

I would like to use BigQuery instead of Pandas to create dummy variables (one-hot encoding, as with MultiLabelBinarizer) for my categories. I have a large number of columns, so I can't do it manually and hard-code it.
Test dataset (the actual one has many more variables than this one):
WITH table AS (
SELECT 1001 as ID, ['blue','green'] As Color, ['big'] AS size UNION ALL
SELECT 1002 as ID, ['green','yellow'] As Color, ['medium','large'] AS size UNION ALL
SELECT 1003 as ID, ['red'] As Color, ['big'] AS size UNION ALL
SELECT 1004 as ID, ['blue'] As Color, ['big'] AS size)
SELECT *
FROM table
Expected output
I wish to store it as a table/dataframe as shown in the image (reconstructed below from the sample data). I have more columns like color, size, products, etc.
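ID    Color_blue  Color_green  Color_yellow  Color_red  size_big  size_medium  size_large
1001  1           1            0             0          1         0            0
1002  0           1            1             0          0         1            1
1003  0           0            0             1          1         0            0
1004  1           0            0             0          1         0            0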
Related answer (not a list): one-hot-encoding (dummy variables) with BigQuery

The query below returns your expected output.
WITH table AS (
SELECT 1001 as ID, ['blue','green'] As Color, ['big'] AS size UNION ALL
SELECT 1002 as ID, ['green','yellow'] As Color, ['medium','large'] AS size UNION ALL
SELECT 1003 as ID, ['red'] As Color, ['big'] AS size UNION ALL
SELECT 1004 as ID, ['blue'] As Color, ['big'] AS size
)
SELECT * FROM (
SELECT ID, type, value FROM table UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
) PIVOT (COUNT(1) FOR type || '_' || value IN (
'Color_blue', 'Color_green', 'Color_yellow', 'Color_red', 'size_big', 'size_medium', 'size_large'
));
Query results
Based on @Mikhail's answer, using dynamic SQL you can partially generalize the query (column names are still hard-coded).
DECLARE Colors, Sizes ARRAY<STRING>;
CREATE TEMP TABLE sample_table AS (
SELECT 1001 as ID, ['blue','green'] As Color, ['big'] AS size UNION ALL
SELECT 1002 as ID, ['green','yellow'] As Color, ['medium','large'] AS size UNION ALL
SELECT 1003 as ID, ['red'] As Color, ['big'] AS size UNION ALL
SELECT 1004 as ID, ['blue'] As Color, ['big'] AS size
);
SET (Colors, Sizes) = (
SELECT AS STRUCT ARRAY_AGG(DISTINCT IF(type = 'Color', value, NULL) IGNORE NULLS),
ARRAY_AGG(DISTINCT IF(type = 'size', value, NULL) IGNORE NULLS),
FROM sample_table UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
);
EXECUTE IMMEDIATE FORMAT("""
CREATE OR REPLACE TABLE `your-project.your-dataset.output_table` AS
SELECT * FROM (
SELECT ID, type, value FROM sample_table UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
) PIVOT (COUNT(1) FOR type || '_' || value IN (%s,%s)) ORDER BY ID;
""", (SELECT STRING_AGG(FORMAT("'Color_%s'", color)) FROM UNNEST(Colors) color),
(SELECT STRING_AGG(FORMAT("'size_%s'", size)) FROM UNNEST(Sizes) size)
);
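For the sample data, the EXECUTE IMMEDIATE above expands to roughly the following statement (the exact order of the pivot columns depends on ARRAY_AGG, which does not guarantee ordering):
CREATE OR REPLACE TABLE `your-project.your-dataset.output_table` AS
SELECT * FROM (
SELECT ID, type, value FROM sample_table UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
) PIVOT (COUNT(1) FOR type || '_' || value IN ('Color_blue','Color_green','Color_yellow','Color_red','size_big','size_medium','size_large')) ORDER BY ID;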

Consider the approach below. It is the most generic I can think of, and it does not require any knowledge of the number or names of the columns.
-- JS helpers to pull column names and values out of a JSON row
create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));""";
-- one row per (id, value) pair, with col like 'Color_blue' or 'size_big'
create temp table flatten_list as (
select id, format('%s_%s', key, val) col from your_table t,
-- serialize every column except id into a single JSON string
unnest([to_json_string((select as struct * except(id) from unnest([t])))]) json,
-- pair each column name with its (comma-joined) array of values, then split
unnest(extract_keys(json)) key with offset
join unnest(extract_values(json)) vals with offset
using (offset), unnest(split(vals)) val
);
execute immediate format(
'create temp table pivot_table as select * from flatten_list pivot (count(*) for col in (%s)) order by id',
(select string_agg("'" || col || "'", "," order by col)
from (select distinct col from flatten_list))
);
select * from pivot_table;
If applied to the sample data in your question, the output matches the expected table shown earlier (with the pivot columns ordered alphabetically).

Related

Snowflake SQL - OBJECT_CONSTRUCT from COUNT and GROUP BY

I'm trying to summarize data in a table:
- counting total rows
- counting values on specific fields
- getting the distinct values on specific fields

and, more importantly, I'm struggling with:
- getting the count for each field nested in an object
given this data:
COL1  COL2
----  ----
A     0
null  1
B     null
B     null
the expected result from this query would be:
with dummy as (
select 'A' as col1, 0 as col2
union all
select null, 1
union all
select 'B', null
union all
select 'B', null
)
select
count(1) as total
,count(col1) as col1
,array_agg(distinct col1) as dist_col1
--,object_construct(???) as col1_object_count
,count(col2) as col2
,array_agg(distinct col2) as dist_col2
--,object_construct(???) as col2_object_count
from
dummy
TOTAL  COL1  DIST_COL1   COL1_OBJECT_COUNT          COL2  DIST_COL2  COL2_OBJECT_COUNT
4      3     ["A", "B"]  {"A": 1, "B": 2, null: 1}  2     [0, 1]     {0: 1, 1: 1, null: 2}
I've tried several functions inside OBJECT_CONSTRUCT mixed with ARRAY_AGG, but all failed.
OBJECT_CONSTRUCT can work with several columns, but only when given all of them (*); if you try a SELECT statement inside it, it fails.
Another issue is that Snowflake's object and array functions do not readily accept analytic functions as arguments.
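For the object-count columns on their own, Snowflake's OBJECT_AGG over a pre-grouped subquery gets most of the way there. A minimal sketch for col1, using the dummy data above (OBJECT_AGG skips null keys, so nulls are first converted to the string 'null', and the count is cast to VARIANT):
with dummy as (
select 'A' as col1, 0 as col2 union all
select null, 1 union all
select 'B', null union all
select 'B', null
)
select object_agg(k, cnt::variant) as col1_object_count -- {"A":1,"B":2,"null":1}
from (
select ifnull(col1, 'null') as k, count(*) as cnt
from dummy
group by k
);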
You could use Snowflake Scripting or Snowpark for this, but here's a solution that is somewhat flexible, so you can apply it to different tables and column sets.
Create test table/view:
Create or Replace View dummy as (
select 'A' as col1, 0 as col2
union all
select null, 1
union all
select 'B', null
union all
select 'B', null
);
Set session variables for the table name and column names:
set tbname = 'DUMMY';
set colnames = '["COL1", "COL2"]';
Create a view that generates the required table_column_summary data:
Create or replace View table_column_summary as
with
-- Create table of required column names
cn as (
select VALUE::VARCHAR CNAME
from table(flatten(input => parse_json($colnames)))
)
-- Convert rows into objects
,ro as (
select
object_construct_keep_null(*) row_object
-- using identifier on session variable to dynamically supply table/view name
from identifier($tbname) )
-- Flatten row objects into key/values
,rof as (
select
key col_name,
ifnull(value,'null')::VARCHAR col_value
from ro, lateral flatten(input => row_object), cn
-- You will only need this filter if you need a subset
-- of columns from the source table/query summarised
where col_name = cn.cname)
-- Get the column value distinct value counts
,cdv as (
select col_name,
col_value,
sum(1) col_value_count
from rof
group by 1,2
)
-- and derive required column level stats and combine with cdv
,cv as (
select
(select count(1) from identifier($tbname)) total, -- use the session variable here too, not the hard-coded DUMMY view
col_name,
object_construct('COL_COUNT', count(col_value) ,
'COL_DIST', array_agg(distinct col_value),
'COL_OBJECT_COUNT', object_agg(col_value,col_value_count)) col_values
from cdv
group by 1,2)
-- Return result
Select * from cv;
Use this final query if you want a solution that works flexibly with any table/columns provided as input:
Select total, object_agg(col_name, col_values) col_values_obj
From table_column_summary
Group by 1;
Or use this final query if you want the fixed-column output described in your question:
Select total,
COL1[0]:COL_COUNT COL1,
COL1[0]:COL_DIST DIST_COL1,
COL1[0]:COL_OBJECT_COUNT COL1_OBJECT_COUNT,
COL2[0]:COL_COUNT COL2,
COL2[0]:COL_DIST DIST_COL2,
COL2[0]:COL_OBJECT_COUNT COL2_OBJECT_COUNT
from table_column_summary
PIVOT ( ARRAY_AGG ( col_values )
FOR col_name IN ( 'COL1', 'COL2' ) ) as pt (total, col1, col2);

Generate CASE WHEN statement using another table

I would like to create a query that does the following:
Using a regex_mapping table, find all rows in the sample data that REGEXP_MATCH on x
WITH sample_data AS (
SELECT x, y
FROM (SELECT "asd rmkt asdf" AS x, true AS y UNION ALL -- should map to remekartier
SELECT "as asdf", true UNION ALL -- should map to ali sneider
SELECT "asdafsd", false) -- should map to NULL
),
regex_mapping AS (
SELECT regex, map
FROM (SELECT "as" AS regex, "ali sneider" AS map UNION ALL
SELECT "rmkt" AS regex, "remekartier" AS map )
)
SELECT sample_data.*, mapped_item
FROM sample_data
-- but here, use multiple REGEXP_MATCH with CASE WHEN looping over the regex_mappings.
-- e.g. CASE WHEN REGEXP_MATCH(x, "as") THEN "ali sneider"
--      WHEN REGEXP_MATCH(x, "rmkt") THEN "remekartier" END AS mapped_item
Try this -
WITH sample_data AS (
SELECT x, y
FROM (SELECT "asd rmkt asdf" AS x, true AS y UNION ALL -- should map to remekartier
SELECT "as asdf", true UNION ALL -- should map to ali sneider
SELECT "asdafsd", false)
),
regex_mapping AS (
SELECT regex, map
FROM (SELECT "as" AS regex, "ali sneider" AS map UNION ALL
SELECT "rmkt" AS regex, "remekartier" AS map )
)
SELECT s.*, r.map
FROM sample_data s, regex_mapping r
WHERE regexp_contains(s.x,concat('\\b',r.regex,'\\b'))
The results:
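(Reconstructed by hand from the sample data, since the result image is missing here.)
x              y     map
asd rmkt asdf  true  remekartier
as asdf        true  ali sneider
Note that the "asdafsd" row is dropped entirely: the comma join plus WHERE filter behaves as an inner join, so rows with no matching regex never appear.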
Second way: instead of a cross join, use a scalar subquery. LIMIT 1 ensures the subquery never returns more than one row; if multiple regexes match, only one of them is returned.
--- same WITH clause as above query ---
SELECT s.*, (SELECT r.map
FROM regex_mapping r
WHERE regexp_contains(s.x,concat('\\b',r.regex,'\\b'))
LIMIT 1) as map
FROM sample_data s
The results:
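(Again reconstructed by hand.) The scalar subquery preserves all rows, so the unmatched row now appears with a NULL map:
x              y      map
asd rmkt asdf  true   remekartier
as asdf        true   ali sneider
asdafsd        false  null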
Third way: deduplicated data (one match per row, chosen by priority)
WITH sample_data AS (
SELECT campaign_name, placement_name
FROM (SELECT "as_rmkt_asdf" AS campaign_name, "xdd" AS placement_name UNION ALL -- should map to remekartier
SELECT "as_asdf", "sdfsdf" UNION ALL -- should map to ali sneider
SELECT "as_rmkt_dafsd", "sdfg" UNION ALL -- should map to rmkt
SELECT "asf_adsdf", "gdf" -- should map to NULL (because higher priority)
)
),
regex_mapping AS (
SELECT regex, map, priority
FROM (SELECT "rmkt" AS regex, "remekartier" AS map, 1 AS priority UNION ALL
SELECT "as" AS regex, "ali sneider" AS map, 2 AS priority)
),
X AS (
SELECT s.*,
CASE WHEN regexp_contains(s.campaign_name, concat('(^|_)',r.regex,'($|_)')) THEN r.map ELSE NULL END AS map,
ROW_NUMBER() OVER (PARTITION BY s.campaign_name ORDER BY regexp_contains(s.campaign_name, concat('(^|_)',r.regex,'($|_)')) DESC, r.priority) AS rn
FROM sample_data s
CROSS JOIN regex_mapping r
)
SELECT * EXCEPT (rn)
FROM X
WHERE rn = 1
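Traced by hand, one row per campaign_name remains, with the highest-priority matching regex winning:
campaign_name  placement_name  map
as_rmkt_asdf   xdd             remekartier
as_asdf        sdfsdf          ali sneider
as_rmkt_dafsd  sdfg            remekartier
asf_adsdf      gdf             null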

pivot multi-level nested fields in bigquery

My bq table schema:
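(The schema image is missing from this copy. Inferred from the query below, the relevant part of the schema presumably looks roughly like this; the exact types are guesses:)
flow_timestamp  TIMESTAMP
timeseries      STRUCT<
  channel_name        STRING,
  channel_properties  ARRAY<STRUCT<key STRING, value STRING>>,
  data                ARRAY<STRUCT<
    timestamp   STRING,
    value       INT64,
    annotation  ARRAY<STRUCT<properties ARRAY<STRUCT<key STRING, value STRING>>>>
  >>
>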
Continuing this post: bigquery pivoting with nested field
I'm trying to flatten this table. I would like to unnest the timeseries.data fields, i.e. the final number of rows should equal the total length of the timeseries.data arrays. I would also like to add annotation.properties.key with a certain value as additional columns, with annotation.properties.value as their values; in this case, that is the "margin" column. However, the following query gives me the error "Unrecognized name: data", even though after the last FROM I already have unnest(timeseries.data) as data.
Expected output:
flow_timestamp, channel_name, number_of_digits, timestamp, value, margin
2019-10-31 15:31:15.079674 UTC, channel_1, 4, 2018-02-28T02:00:00, 50, 0.01
Query:
SELECT
flow_timestamp, timeseries.channel_name,
( SELECT MAX(IF(channel_properties.key = 'number_of_digits', channel_properties.value, NULL))
FROM UNNEST(timeseries.channel_properties) AS channel_properties
),
data.timestamp ,data.value
,(with subq as (select * from unnest(data.annotation))
select max(if (properties.key = 'margin', properties.value, null))
from (
select * from unnest(subq.properties)
) as properties
) as margin
FROM my_table
left join unnest(timeseries.data) as data
WHERE DATE(flow_timestamp) between "2019-10-28" and "2019-11-02"
order by flow_timestamp
Try below. The error most likely occurs because the WITH clause inside your expression subquery cannot see the correlated alias data from the outer query; the version below drops the WITH and reads item.annotation directly.
#standardSQL
SELECT
flow_timestamp,
timeseries.channel_name,
( SELECT MAX(IF(channel_properties.key = 'number_of_digits', channel_properties.value, NULL))
FROM UNNEST(timeseries.channel_properties) AS channel_properties
) AS number_of_digits,
item.timestamp,
item.value,
( SELECT MAX(IF(prop.key = 'margin', prop.value, NULL))
FROM UNNEST(item.annotation) AS annot, UNNEST(annot.properties) prop
) AS margin
FROM my_table
LEFT JOIN UNNEST(timeseries.data) item
WHERE DATE(flow_timestamp) BETWEEN '2019-10-28' AND '2019-11-02'
ORDER BY flow_timestamp
Below is a slightly less verbose version of the same solution, but I usually prefer the one above as it is simpler to maintain.
#standardSQL
SELECT
flow_timestamp,
timeseries.channel_name,
( SELECT MAX(IF(key = 'number_of_digits', value, NULL))
FROM UNNEST(timeseries.channel_properties) AS channel_properties
) AS number_of_digits,
timestamp,
value,
( SELECT MAX(IF(key = 'margin', value, NULL))
FROM UNNEST(annotation), UNNEST(properties)
) AS margin
FROM my_table
LEFT JOIN UNNEST(timeseries.data)
WHERE DATE(flow_timestamp) BETWEEN '2019-10-28' AND '2019-11-02'
ORDER BY flow_timestamp

Find a way to gather data and replace it with values from another table

I am looking for an Oracle SQL query that finds specific patterns and replaces them with values from another table.
Scenario:
Table 1:
No     column1
-----  ----------------------------------
12345  user:12345;group:56789;group:6785;...

Note: column1 may contain one or more such patterns.
Table 2:
Id     name       type
-----  ---------  -----
12345  admin      user
56789  testgroup  group
The result must be:
No     column1
-----  --------------------------
12345  user:admin;group:testgroup
Logic:
1. Split the concatenated string into individual rows using a CONNECT BY clause and regex.
2. Join the newly created table (split_tab) with Table2 (tab2).
3. Use the LISTAGG function to concatenate the data back into a single column.
Query:
WITH tab1 AS
( SELECT '12345' NO
,'user:12345;group:56789;group:6785;' column1
FROM DUAL )
,tab2 AS
( SELECT 12345 id
,'admin' name
,'user' TYPE
FROM DUAL
UNION
SELECT 56789 id
,'testgroup' name
,'group' TYPE
FROM DUAL )
SELECT no
,listagg(category||':'||name,';') WITHIN GROUP (ORDER BY tab2.id) column1
FROM ( SELECT NO
,REGEXP_SUBSTR( column1, '(\d+)', 1, LEVEL ) id
,REGEXP_SUBSTR( column1, '([a-z]+)', 1, LEVEL ) CATEGORY
FROM tab1
CONNECT BY LEVEL <= regexp_count( column1, '\d+' ) ) split_tab
,tab2
WHERE split_tab.id = tab2.id
GROUP BY no
Output:
No Column1
12345 user:admin;group:testgroup
A different approach, which also keeps substrings that have no substitute in the lookup table (such as group:6785):
with t1 (no, col) as
(
-- start of test data
select 1, 'user:12345;group:56789;group:6785;' from dual union all
select 2, 'user:12345;group:56789;group:6785;' from dual
-- end of test data
)
-- the lookup table which has the substitute strings
-- nid : concatenation of name and id as in table t1 which requires the lookup
-- tname : required substitute for each nid
, t2 (id, name, type, nid, tname) as
(
select t.*, type || ':' || id, type || ':' || name from
(
select 12345 id, 'admin' name, 'user' type from dual union all
select 56789, 'testgroup', 'group' from dual
) t
)
--select * from t2;
-- cte table calculates the indexes for the substrings (eg, user:12345)
-- no : sequence no in t1
-- col : the input string in t1
-- si : starting index of each substring in the 'col' input string that needs attention later
-- ei : ending index of each substring in the 'col' input string
-- idx : the order of substring to put them together later
,cte (no, col, si, ei, idx) as
(
select no, col, 1, case when instr(col,';') = 0 then length(col)+1 else instr(col,';') end, 1 from t1 union all
select no, col, ei+1, case when instr(col,';', ei+1) = 0 then length(col)+1 else instr(col,';', ei+1) end, idx+1 from cte where ei + 1 <= length(col)
)
,coll(no, col, sstr, idx, newstr) as
(
select
a.no, a.col, a.sstr, a.idx,
-- when a substitute is not found in t2, use the same input substring (eg. group:6785)
case when t2.tname is null then a.sstr else t2.tname end
from
(select cte.*, substr(col, si, ei-si) as sstr from cte) a
-- we don't want to miss if there is no substitute available in t2 for a substring
left outer join
t2
on (a.sstr = t2.nid)
)
select no, col, listagg(newstr, ';') within group (order by no, col, idx) from coll
group by no, col;
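Traced by hand, both test rows yield the same substituted string, with the unmatched group:6785 passed through unchanged (the aggregate column is unaliased in the query; a label is added here for readability):
NO  COL                                 NEWSTR_LIST
1   user:12345;group:56789;group:6785;  user:admin;group:testgroup;group:6785
2   user:12345;group:56789;group:6785;  user:admin;group:testgroup;group:6785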

How do I combine 2 records with a single field into 1 row with 2 fields (Oracle 11g)?

Here's some sample data:
record1: field1 = test2
record2: field1 = test3
The actual output I want is
record1: field1 = test2 | field2 = test3
I've looked around the net but can't find what I'm looking for. I can use a custom function to get it in this format but I'm trying to see if there's a way to make it work without resorting to that.
thanks a lot
You need to use pivot:
with t(id, d) as (
select 1, 'field1 = test2' from dual union all
select 2, 'field1 = test3' from dual
)
select *
from t
pivot (max (d) for id in (1, 2))
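For the sample rows this produces a single row with the two values side by side (the output columns are named after the pivoted id values):
1               2
field1 = test2  field1 = test3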
If you don't have the id field you can generate it, but you will have XML type:
with t(d) as (
select 'field1 = test2' from dual union all
select 'field1 = test3' from dual
), t1(id, d) as (
select ROW_NUMBER() OVER(ORDER BY d), d from t
)
select *
from t1
pivot xml (max (d) for id in (select id from t1))
There are several ways to approach this - google pivot rows to columns. Here is one set of answers: http://www.dba-oracle.com/t_converting_rows_columns.htm