Using UNNEST Function in BigQuery - google-bigquery

I need help on how to use BigQuery UNNEST function. My query:
I have table as shown in the image and I want to unnest the field "domains" (string type) currently separated by comma, so that I get each comma separated domain into a different row for each "acname". The output needed is also enclosed in the image:
I tried this logic, but it did not work:
select acc.acname, acc.amount, acc.domains as accdomains from project.dataset.dummy_account as acc
CROSS JOIN UNNEST(acc.domains)
But this gave the error "Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRING". The error makes complete sense, but I did not understand how to convert the string to an array.
Can someone please help with a solution and also explain a bit how it actually works? Thank you.

Below is for BigQuery Standard SQL
#standardSQL
SELECT acname, amount, domain
FROM `project.dataset.dummy`,
UNNEST(SPLIT(domains)) domain
You can test and play with the above using dummy data from your question, as in the example below
#standardSQL
WITH `project.dataset.dummy` AS (
SELECT 'abc' acname, 100 amount, 'a,b,c' domains UNION ALL
SELECT 'pqr', 300, 'p,q,r' UNION ALL
SELECT 'lmn', 500, 'l,m,n'
)
SELECT acname, amount, domain
FROM `project.dataset.dummy`,
UNNEST(SPLIT(domains)) domain
with output
Row acname amount domain
1 abc 100 a
2 abc 100 b
3 abc 100 c
4 pqr 300 p
5 pqr 300 q
6 pqr 300 r
7 lmn 500 l
8 lmn 500 m
9 lmn 500 n
The source table project.dataset.dummy has comma-separated values in the field "domains", but there is a space after each comma (e.g. 'a, b, c'). This leaves a leading space on the values b, c, q, r, m, n in the "Output After Unnest" table. I am then joining this table with "salesdomain" as a key, and because of the leading space on b, c, q, r, m, n, the output is not correct.
To address this, you can simply use the TRIM function, which removes all leading and trailing spaces, as in the example below
#standardSQL
WITH `project.dataset.dummy` AS (
SELECT 'abc' acname, 100 amount, 'a, b, c' domains UNION ALL
SELECT 'pqr', 300, 'p, q, r' UNION ALL
SELECT 'lmn', 500, 'l, m, n'
)
SELECT acname, amount, TRIM(domain, ' ') domain
FROM `project.dataset.dummy`,
UNNEST(SPLIT(domains)) domain
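For intuition, the same SPLIT/UNNEST/TRIM pipeline can be sketched outside SQL as well; here is a minimal Python illustration (the sample rows are hypothetical, mirroring the dummy data above):

```python
# Mirror of the BigQuery pipeline: SPLIT(domains) on comma, UNNEST into
# one row per element, TRIM the leading space left by ", " separators.
rows = [
    ("abc", 100, "a, b, c"),
    ("pqr", 300, "p, q, r"),
]

def unnest_domains(rows):
    return [
        (acname, amount, domain.strip())   # strip() plays the role of TRIM
        for acname, amount, domains in rows
        for domain in domains.split(",")   # split(",") plays the role of SPLIT
    ]

flat = unnest_domains(rows)
# each input row expands into one output row per domain
```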

Related

SQL query multiple values in just one cell

Hello, I am kind of new to SQL. I just want to know if this is possible via SQL:
Table (multiple values are in just one cell):
COLUMN 1                                  COLUMN 2
"2023-01-01", "2023-01-02", "2023-01-03"  "User A, User B, User C"
Needed output:
COLUMN 1    COLUMN 2
2023-01-01  User A
2023-01-02  User A
2023-01-03  User A
2023-01-01  User B
2023-01-02  User B
2023-01-03  User B
2023-01-01  User C
2023-01-02  User C
2023-01-03  User C
Basically, each date from the row is assigned to all users in that same row. Any help or tip will be appreciated.
Thank you!
I have no idea yet how to go about this.
You can use the string_to_array function to get all parts of a string as elements of an array, then use the unnest function on that array to get the desired result, check the following:
select col1,
unnest(string_to_array(replace(replace(COLUMN2,'"',''),', ',','), ',')) as col2
from
(
select unnest(string_to_array(replace(replace(COLUMN1,'"',''),', ',','), ',')) as col1
, COLUMN2
from table_name
) T
order by col1, col2
We can use a combination of STRING_TO_ARRAY with UNNEST and LATERAL JOIN here:
SELECT col1.column1, col2.column2
FROM
(SELECT UNNEST(
STRING_TO_ARRAY(column1,',')
) AS column1 FROM test) col1
LEFT JOIN LATERAL
(SELECT UNNEST(
STRING_TO_ARRAY(column2,',')
) AS column2 FROM test) col2
ON true
ORDER BY col2.column2, col1.column1;
STRING_TO_ARRAY will split the different dates and the different users into separate items.
UNNEST will write those items in separate rows.
LATERAL JOIN will pair the three dates with the three users (or fewer/more, depending on your data) and so creates the nine rows shown in your question. It works similarly to the CROSS APPLY approach you would use on a SQL Server DB.
The ORDER BY clause just produces the same order as shown in your question; we can remove it if it is not required. The question doesn't really tell us whether it's needed.
Because implementation details can change between DBMSs, here is an example of how to do it in MySQL (8.0+):
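To see what the cross product produces, here is a minimal Python sketch of the same idea: clean and split both columns, then pair every date with every user (sample data taken from the question; the helper name split_clean is mine):

```python
# Cross-joining the unnested dates with the unnested users reproduces
# the nine-row result: every date paired with every user.
column1 = '"2023-01-01", "2023-01-02", "2023-01-03"'
column2 = "User A, User B, User C"

def split_clean(s):
    # like string_to_array + replace: split on comma, drop quotes and spaces
    return [part.strip().strip('"') for part in s.split(",")]

pairs = sorted(
    (d, u) for d in split_clean(column1) for u in split_clean(column2)
)
```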
WITH column1 as (
SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(column1,',',x),',',-1)) as Value
FROM test
CROSS JOIN (select 1 as x union select 2 union select 3 union select 4) x
WHERE x <= LENGTH(Column1)-LENGTH(REPLACE(Column1,',',''))+1
),
column2 as (
SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(column2,',',x),',',-1)) as Value
FROM test
CROSS JOIN (select 1 as x union select 2 union select 3 union select 4) x
WHERE x <= LENGTH(Column2)-LENGTH(REPLACE(Column2,',',''))+1
)
SELECT *
FROM column1, column2;
see: DBFIDDLE
NOTE:
The CROSS JOIN, with only 4 values, should be expanded when more than 4 items exist.
No data type is attached to the values that are fetched. This implementation does not know that "2023-01-08" is (sorry: CAN BE) a date. It just sticks to strings.
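For intuition, the nested SUBSTRING_INDEX call above picks out the x-th comma-separated token; a hypothetical Python equivalent makes the mechanics explicit (the function names are mine, not MySQL's):

```python
def substring_index(s, delim, count):
    # MySQL SUBSTRING_INDEX: everything before the count-th delimiter,
    # counted from the left for positive count, from the right for negative.
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])

def nth_token(s, x):
    # SUBSTRING_INDEX(SUBSTRING_INDEX(s, ',', x), ',', -1), then TRIM
    return substring_index(substring_index(s, ",", x), ",", -1).strip()
```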
In SQL Server this can be done using STRING_SPLIT:
select x.value as date_val,y.value as user_val
from test a
CROSS APPLY string_split(Column1,',')x
CROSS APPLY string_split(Column2,',')y
order by y.value,x.value
date_val user_val
2023-01-01 User A
2023-01-02 User A
2023-01-03 User A
2023-01-03 User B
2023-01-02 User B
2023-01-01 User B
2023-01-01 User C
2023-01-02 User C
2023-01-03 User C
db fiddle link
https://dbfiddle.uk/YNJWDPBq
In MySQL you can do it as follows:
WITH dates as (
select TRIM(SUBSTRING_INDEX(_date, ',', 1)) AS 'dates'
from _table
union
select TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(_date, ',', 2), ',', -1)) AS 'dates'
from _table
union
select TRIM(SUBSTRING_INDEX(_date, ',', -1)) AS 'dates'
from _table
),
users as
( select TRIM(SUBSTRING_INDEX(user, ',', 1)) AS 'users'
from _table
union
select TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(user, ',', 2), ',', -1)) AS 'users'
from _table
union
select TRIM(SUBSTRING_INDEX(user, ',', -1)) AS 'users'
from _table
)
select *
from dates, users
order by dates, users;
check it here : https://dbfiddle.uk/_oGix9PD

How to split comma delimited data from one column into multiple rows

I'm trying to write a query that will have a column show a specific value depending on another comma delimited column. The codes are meant to denote Regular time/overtime/doubletime/ etc. and they come from the previously mentioned comma delimited column. In the original view, there are columns for each of the different hours accrued separately. For the purposes of this, we can say A = regular time, B = doubletime, C = overtime. However, we have many codes that can represent the same type of time.
What my original view looks like:
Employee_FullName  EmpID  Code  Regular Time  Double Time  Overtime
John Doe           123    A,B   7             2            0
Jane Doe           234    B     4             0            1
What my query outputs:
Employee_FullName  EmpID  Code  Hours
John Doe           123    A, B  10
John Doe           123    A, B  5
Jane Doe           234    B     5
What I want the output to look like:
Employee_FullName  EmpID  Code  Hours
John Doe           123    A     10
John Doe           123    B     5
Jane Doe           234    B     5
It looks the way it does in the first table because currently it's only pulling from the regular time column. I've tried using a case switch to have it look for a specific code and then pull the number, but I get a variety of errors no matter how I write it. Here's what my query looks like:
SELECT [Employee_FullName],
SUBSTRING(col, 1, CHARINDEX(' ', col + ' ' ) -1)'Code',
hrsValue
FROM
(
SELECT [Employee_FullName], col, hrsValue
FROM myTable
CROSS APPLY
(
VALUES ([Code],[RegularHours])
) C (COL, hrsValue)
) SRC
Any advice on how to fix it or perspective on what to use is appreciated!
Edit: I cannot change the comma delimited data, it is provided that way. I think a case within a cross apply will solve it but I honestly don't know.
Edit 2: I will be using a unique EmployeeID to identify them. In this case yes A is regular time, B is double time, C is overtime. The complication is that there are a variety of different codes and multiple refer to each type of time. There is never a case where A would refer to regular time for one employee and double time for another, etc. I am on SQL Server 2017. Thank you all for your time!
If you are on SQL Server 2016 or better, you can use OPENJSON() to split up the code values instead of cumbersome string operations:
SELECT t.Employee_FullName,
Code = LTRIM(j.value),
Hours = MAX(CASE j.[key]
WHEN 0 THEN RegularTime
WHEN 1 THEN DoubleTime
WHEN 2 THEN Overtime END)
FROM dbo.MyTable AS t
CROSS APPLY OPENJSON('["' + REPLACE(t.Code,',','","') + '"]') AS j
GROUP BY t.Employee_FullName, LTRIM(j.value);
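The key idea — the array index j.[key] of each code selects the matching hours column — can be sketched in Python (hypothetical data mirroring the question; split_codes is my name, not part of any library):

```python
# name, comma-separated codes, then RegularTime, DoubleTime, Overtime
rows = [
    ("John Doe", "A,B", 10, 5, 0),
    ("Jane Doe", "B", 5, 0, 0),
]

def split_codes(name, codes, *hours):
    # The position of the code within the list picks the hours column,
    # like CASE j.[key] WHEN 0 THEN RegularTime WHEN 1 THEN DoubleTime ...
    return [
        (name, code.strip(), hours[i])
        for i, code in enumerate(codes.split(","))
    ]

result = [r for row in rows for r in split_codes(*row)]
```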
You can use the following code to split up the values.
Note how NULLIF nulls out the CHARINDEX if it returns 0; the second half of the second APPLY is conditional on that NULL:
SELECT
t.[Employee_FullName],
Code = TRIM(v2.Code),
v2.Hours
FROM myTable t
CROSS APPLY (VALUES( NULLIF(CHARINDEX(',', t.Code), 0) )) v1(comma)
CROSS APPLY (
SELECT Code = ISNULL(LEFT(t.Code, v1.comma - 1), t.Code), Hours = t.RegularTime
UNION ALL
SELECT SUBSTRING(t.Code, v1.comma + 1, LEN(t.Code)), t.DoubleTime
WHERE v1.comma IS NOT NULL
) v2;
You can go for a CROSS APPLY based approach as given below.
Thanks to @Chalieface for the insert script.
CREATE TABLE mytable (
"Employee_FullName" VARCHAR(8),
"Code" VARCHAR(3),
"RegularTime" INTEGER,
"DoubleTime" INTEGER,
"Overtime" INTEGER
);
INSERT INTO mytable
("Employee_FullName", "Code", "RegularTime", "DoubleTime", "Overtime")
VALUES
('John Doe', 'A,B', '10', '5', '0'),
('Jane Doe', 'B', '5', '0', '0');
SELECT
t.[Employee_FullName],
c.Code,
CASE WHEN c.code = 'A' THEN t.RegularTime
WHEN c.code = 'B' THEN t.DoubleTime
WHEN c.code = 'C' THEN t.Overtime
END AS Hours
FROM myTable t
CROSS APPLY (select value from string_split(t.code,',')
) c(code)
Employee_FullName  Code  Hours
John Doe           A     10
John Doe           B     5
Jane Doe           B     0

Merge similar rows based on name and count in Oracle SQL or PL/SQL

We have data as below
CompanyID CompanyName
1000 Decisive Data
1001 Decisive Data, Inc.
1002 Decisive Data Inc.
1003 Thomson ABC Data
1004 Thomson ABC Data Pvt Ltd
1005 Susheel Solutions R K
1006 Susheel R K Sol
1007 R K Susheel Data Solutions
1008 GMR Infra
1009 GMR Infra Projects
1010 GMR Infrastructure Projects Ltd
Expected Query Result:
CompanyName Count
Decisive Data, Inc. 3
Thomson ABC Data Pvt Ltd 2
R K Susheel Data Solutions 3
GMR Infrastructure Projects Ltd 3
Is it possible, using some match & merge logic, to show the expected result?
The following is rather expensive, but this should work on your particular data. To get the "parent name":
select t.companyName, min(tp.companyname) as parent_companyname
from t join
t tp
on t.companyname like tp.companyname || '%';
Then aggregate:
select parent_companyname, count(*)
from (select t.companyName, min(tp.companyname) as parent_companyname
from t join
t tp
on t.companyname like tp.companyname || '%'
) t
group by parent_companyname;
Notes:
This will NOT scale very well.
This is highly data dependent but it should work on your example.
Fixing names is a HARD problem. My advice is really to put the names in a spreadsheet and manually add the canonical name.
This is hard to do in SQL, but I will suggest an approach.
Split the names into tokens based on spaces, then see how many tokens match between the company names.
Once you define a "threshold" for that number, you would need some manual intervention to decide which of them are good matches.
After that you would get an idea of how many of them are likely matches, which should help with the aggregation logic.
E.g., in the last query the fields cnt_tokens and cnt_of_matching_tokens tell you that "Decisive Data" has 2 out of 2 tokens matching "Decisive Data Inc."
X Y B_Y TOKEN_VAL CNT_TOKENS CNT_OF_MATCHING_TOKENS
1000 Decisive Data Decisive Data Inc. Decisive 2 2
1000 Decisive Data Decisive Data Inc. Data 2 2
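The token-matching idea can also be sketched in Python (illustrative only; the normalization choices — lowercasing, stripping punctuation — are my assumptions):

```python
def token_overlap(a, b):
    # Tokenize on whitespace, strip trailing commas/dots, compare as sets.
    norm = lambda s: {t.strip(".,").lower() for t in s.split()}
    ta, tb = norm(a), norm(b)
    # returns (matching tokens, total tokens of a), like
    # cnt_of_matching_tokens and cnt_tokens in the query below
    return len(ta & tb), len(ta)

matching, total = token_overlap("Decisive Data", "Decisive Data Inc.")
# 2 of 2 tokens of "Decisive Data" also appear in "Decisive Data Inc."
```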
create table t(x int, y varchar2(500));
insert
into t
select 1000 ,'Decisive Data' from dual union all
select 1001 ,'Decisive Data, Inc.' from dual union all
select 1002 ,'Decisive Data Inc.' from dual union all
select 1003 ,'Thomson ABC Data ' from dual union all
select 1004 ,'Thomson ABC Data Pvt Ltd' from dual union all
select 1005 ,'Susheel Solutions R K' from dual union all
select 1006 ,'Susheel R K Sol' from dual union all
select 1007 ,'R K Susheel Data Solutions' from dual union all
select 1008 ,'GMR Infra' from dual union all
select 1009 ,'GMR Infra Projects' from dual union all
select 1010 ,'GMR Infrastructure Projects Ltd' from dual;
commit;
--Example using jaro_winkler_similarity of string.
select * from(
select a.x,a.y as a_y,b.x as b_x,b.y,round(utl_match.jaro_winkler_similarity(a.y,b.y),2) as similar_dist
from t a
join t b
on a.x <> b.x
)m
where m.similar_dist>=80
--comparision based on tokens of the name
with data /*This would split the name into rows based on <space>*/
as (select distinct x,y, replace(trim(regexp_substr(y,'[^ ]+', 1, level) ),',','') as token_val, level
from t
connect by regexp_substr(y, '[^ ]+', 1, level) is not null
)
,data2
as(
select x,count(token_val) as cnt_tokens
from data
group by x
)
select * from (
select a.x,a.y,b.y as b_y,a.token_val
,a1.cnt_tokens
,count(*) over(partition by a.y,b.y) as cnt_of_matching_tokens
from data a
join data2 a1
on a.x=a1.x
left join data b
on a.token_val=b.token_val
and a.x <> b.x
)y
In my opinion this is hard to do in Oracle.
You can go to other languages like Java or, in my case, Python.
The result does not guarantee every case of yours, but it is a good approach.
Let me give you my opinion, and if you are interested you can integrate it into your work:
First of all, note that difflib is part of the Python standard library, so there is nothing to install.
Then get your samples: open a Python script or prompt.
Here is my code to create a similarity ratio:
from difflib import SequenceMatcher

def s_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

lista_1 = [
    'Decisive Data',
    ...
    'GMR Infrastructure Projects Ltd'
]
lista_2 = [data.split() for data in lista_1]
for data in lista_2:
    data.sort()
lista_3 = [' '.join(data) for data in lista_2]
print(s_ratio(lista_3[0], lista_3[1]))  # -> Result 0.8125, which means the data is compatible
When you join all the data, you need to know a few things: first, whether your statements are ordered, and, when you compare, whether you continue or compare one by one.
You also need to define the ratio threshold used to find the parent.
And finally you have to write the data into a file (which is very easy) to parse the data in SQL.

Convert a categorical column to binary representation in SQL

Consider there is a column of array of strings in a table containing categorical data. Is there an easy way to convert this schema so there is number of categories boolean columns representing binary encoding of that categorical column?
Example:
id type
-------------
1 [A, C]
2 [B, C]
being converted to :
id is_A is_B is_C
1 1 0 1
2 0 1 1
I know I can do this 'by hand', i.e. using:
WITH flat AS (SELECT * FROM t, UNNEST(type) type),
mid AS (SELECT id, (type='A') AS is_A, (type='B') AS is_B, (type='C') AS is_C FROM flat)
SELECT id, SUM(is_A), SUM(is_B), SUM(is_C) FROM mid GROUP BY id
But I am looking for a solution that works when the number of categories is around 1-10K
By the way I am using BigQuery SQL.
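For reference, the desired transformation is essentially a one-hot encoding; a small Python sketch of the reshaping (data taken from the question; one_hot is a hypothetical helper, not a BigQuery feature):

```python
# id -> list of categories, as in the question's input table
rows = {1: ["A", "C"], 2: ["B", "C"]}

def one_hot(rows):
    # collect the distinct categories, then emit one is_<cat> flag per id
    categories = sorted({c for cats in rows.values() for c in cats})
    return {
        rid: {f"is_{c}": int(c in cats) for c in categories}
        for rid, cats in rows.items()
    }

encoded = one_hot(rows)
```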
looking for a solution that works when the number of categories is around 1-10K
Below is for BigQuery SQL
Step 1 - dynamically produce the query (similar to the one used in your question, but now it is built dynamically based on your table - yourTable)
#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT CONCAT(
"WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat), ",
"ids AS (SELECT DISTINCT id FROM yourTable), ",
"pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), ",
"flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat), ",
"combinations AS ( ",
" SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag ",
" FROM pairs AS p LEFT JOIN flat AS f ",
" ON p.cat = f.cat AND p.id=f.id ",
") ",
"SELECT id, ",
STRING_AGG(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat) ORDER BY cat),
" FROM combinations ",
"GROUP BY id ",
"ORDER BY id"
) as query
FROM categories
Step 2 - copy the result of the above query, paste it back into the Web UI, and run it
I think you've got the idea. You can implement it as above purely in SQL, or you can generate the final query in any client of your choice
I had tried this approach of generating the query (but in Python); the problem is that the query can easily reach BigQuery's 256KB limit on query size
First, let's see how "easily" the 256KB limit is reached.
Assuming an average category name length of 10 characters, this approach can cover about 4,750 categories.
With an average of 20 characters, coverage is about 3,480, and for 30 characters, about 2,750.
If you "compress" the SQL a little by removing spaces, AS, etc., you can raise that to about 5,400, 3,800, and 2,970 for 10, 20, and 30 characters respectively.
So, I would say: yes, agreed, it will most likely reach the limit before 5K categories in a real-world case.
So, secondly, let's see whether this is actually that big of a problem!
Just as an example, assume you need 6K categories. Let's see how you can split this into two batches (assuming the 3K scenario works as per the initial solution).
What we need to do is split the categories into two groups, just based on category names.
So the first group will be BETWEEN 'cat1' AND 'cat3000'
And the second group will be BETWEEN 'cat3001' AND 'cat6000'
So, now run both groups through Step 1 and Step 2, with temp1 and temp2 tables as destinations
In Step 1, add (at the very bottom of the query, after FROM categories)
WHERE cat BETWEEN 'cat1' AND 'cat3000'
for the first batch, and
WHERE cat BETWEEN 'cat3001' AND 'cat6000'
for the second batch
Now, proceed to Step 3
Now, proceed to Step 3
Step 3 – Combining partial results
#standardSQL
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
-- ORDER BY id
You can test the last logic with the simple/dummy data below
WITH temp1 AS (
SELECT 1 AS id, 1 AS is_A, 0 AS is_B UNION ALL
SELECT 2 AS id, 0 AS is_A, 1 AS is_B UNION ALL
SELECT 3 AS id, 1 AS is_A, 0 AS is_B
),
temp2 AS (
SELECT 1 AS id, 1 AS is_C, 0 AS is_D UNION ALL
SELECT 2 AS id, 1 AS is_C, 0 AS is_D UNION ALL
SELECT 3 AS id, 0 AS is_C, 1 AS is_D
)
Above can easily be extended to more than just two batches
Hope this helped

How to split a cell and create a new row in sql

I have a column which stores multiple comma separated values. I need to split it in a way so that it gets split into as many rows as values in that column along with remaining values in that row.
eg:
John 111 2Jan
Sam 222,333 3Jan
Jame 444,555,666 2Jan
Jen 777 4Jan
Output:
John 111 2Jan
Sam 222 3Jan
Sam 333 3Jan
Jame 444 2Jan
Jame 555 2Jan
Jame 666 2Jan
Jen 777 4Jan
P.S : I have seen multiple questions similar to this, but could not find a way to split in such a way.
This solution is built on Vertica, but it works for every database that offers a function corresponding to SPLIT_PART().
Part of it corresponds to the un-pivoting technique that works with every ANSI compliant database platform that I explain here (just the un-pivoting part of the script):
Pivot sql convert rows to columns
So I would do it like here below. I'm assuming that the minimalistic date representation is part of the second column of a two-column input table. So I'm first splitting that short date literal away, in a first Common Table Expression (and, in a comment, I list that CTE's output), before splitting the comma separated list into tokens.
Here goes:
WITH
-- input
input(name,the_string) AS (
SELECT 'John', '111 2Jan'
UNION ALL SELECT 'Sam' , '222,333 3Jan'
UNION ALL SELECT 'Jame', '444,555,666 2Jan'
UNION ALL SELECT 'Jen' , '777 4Jan'
)
,
-- put the strange date literal into a separate column
the_list_and_the_date(name,list,datestub) AS (
SELECT
name
, SPLIT_PART(the_string,' ',1)
, SPLIT_PART(the_string,' ',2)
FROM input
)
-- debug
-- SELECT * FROM the_list_and_the_date;
-- name|list |datestub
-- John|111 |2Jan
-- Sam |222,333 |3Jan
-- Jame|444,555,666|2Jan
-- Jen |777 |4Jan
,
-- ten integers (too many for this example) to use as pivoting value and as "index"
ten_ints(idx) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
UNION ALL SELECT 7
UNION ALL SELECT 8
UNION ALL SELECT 9
UNION ALL SELECT 10
)
-- the final query - pivoting prepared input using a CROSS JOIN with ten_ints
-- and filter out where the SPLIT_PART() expression evaluates to the empty string
SELECT
name
, SPLIT_PART(list,',',idx) AS token
, datestub
FROM the_list_and_the_date
CROSS JOIN ten_ints
WHERE SPLIT_PART(list,',',idx) <> ''
;
name|token|datestub
John|111 |2Jan
Jame|444 |2Jan
Jame|555 |2Jan
Jame|666 |2Jan
Sam |222 |3Jan
Sam |333 |3Jan
Jen |777 |4Jan
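The numbers-table pivot above boils down to: for each row, evaluate SPLIT_PART(list, ',', idx) for idx 1 through 10 and keep the non-empty tokens. A hypothetical Python rendering of the same logic (sample rows from the question):

```python
def split_part(s, delim, idx):
    # Vertica-style SPLIT_PART: 1-based index; empty string past the end.
    parts = s.split(delim)
    return parts[idx - 1] if 0 < idx <= len(parts) else ""

rows = [("Sam", "222,333", "3Jan"), ("Jen", "777", "4Jan")]
result = [
    (name, split_part(lst, ",", idx), date)
    for name, lst, date in rows
    for idx in range(1, 11)                    # the ten_ints CTE
    if split_part(lst, ",", idx) != ""         # the WHERE ... <> '' filter
]
```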
Happy playing ...
Marco the Sane