How to pivot array build from REGEXP_EXTRACT_ALL - sql

I'm collecting URLs with query parameters in a BigQuery table. I want to parse these URLs and then pivot the table. Input data and expected output are at the end.
I found two queries that I want to merge.
This one to pivot my parsed url:
select id,
max(case when test.name='a' then test.score end) as a,
max(case when test.name='b' then test.score end) as b,
max(case when test.name='c' then test.score end) as c
from
(
select a.id, t
from `table` as a,
unnest(test) as t
)A group by id
Then I have this query to parse the URL:
WITH examples AS (
SELECT 1 AS id,
'?foo=bar' AS query,
'simple' AS description
UNION ALL SELECT 2, '?foo=bar&bar=baz', 'multiple params'
UNION ALL SELECT 3, '?foo[]=bar&foo[]=baz', 'arrays'
UNION ALL SELECT 4, '', 'no query'
)
SELECT
id,
query,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values,
description
FROM examples
I'm not sure how to explain my issue, but I think the problem is that when I split my query parameters into separate columns, it doesn't match the format of the first query, where I need to keep the keys and values in the same column so I can unnest them correctly.
Input data:
| id | url |
|---- |-------------------- |
| 1 | url/?foo=aaa&bar=ccc |
| 2 | url/?foo=bbb&bar=ccc |
expected output:
| id | foo | bar |
|---- |---- |---- |
| 1 | aaa | ccc |
| 2 | bbb | ccc |
Every row has exactly the same number of parameters.

Use below
select id,
max(if(split(kv, '=')[offset(0)] = 'foo', split(kv, '=')[offset(1)], null)) as foo,
max(if(split(kv, '=')[offset(0)] = 'bar', split(kv, '=')[offset(1)], null)) as bar
from `project.dataset.table` t,
unnest(regexp_extract_all(url, r'[?&](\w+=\w+)')) kv
group by id
if applied to the sample data in your question, the output is

| id | foo | bar |
|----|-----|-----|
| 1  | aaa | ccc |
| 2  | bbb | ccc |
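The same extract-then-pivot logic can be sketched in plain Python for illustration (the regex is the one from the query above; the function name `pivot_params` is mine, not part of the answer):

```python
import re

def pivot_params(rows):
    # Mirrors unnest(regexp_extract_all(url, r'[?&](\w+=\w+)')) followed by
    # the max(if(key = ..., value, null)) pivot: one dict of columns per id.
    out = {}
    for id_, url in rows:
        params = dict(kv.split("=", 1) for kv in re.findall(r"[?&](\w+=\w+)", url))
        out[id_] = {"foo": params.get("foo"), "bar": params.get("bar")}
    return out

rows = [(1, "url/?foo=aaa&bar=ccc"), (2, "url/?foo=bbb&bar=ccc")]
print(pivot_params(rows))
# {1: {'foo': 'aaa', 'bar': 'ccc'}, 2: {'foo': 'bbb', 'bar': 'ccc'}}
```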


Big query query is too complex after pivot

Assume I have the following table and a list of interests (cat, dog, music, soccer, coding):
| userId | user_interest | label |
| -------- | -------------- |----------|
| 12345 | cat | 1 |
| 12345 | dog | 1 |
| 6789 | music | 1 |
| 6789 | soccer | 1 |
I want to transform the user interest into a binary array (i.e. binarization), and the resulting table will be something like
| userId | labels |
| -------- | -------------- |
| 12345 | [1,1,0,0,0] |
| 6789 | [0,0,1,1,0] |
I am able to do it with PIVOT and ARRAY, e.g.
WITH user_interest_pivot AS (
SELECT
*
FROM (
SELECT userId, user_interest, label FROM table
) AS T
PIVOT
(
MAX(label) FOR user_interest IN ('cat', 'dog', 'music', 'soccer', 'coding')
) AS P
)
SELECT
userId,
ARRAY[IFNULL(cat,0), IFNULL(dog,0), IFNULL(music,0), IFNULL(soccer,0), IFNULL(coding,0)] AS labels,
FROM user_interest_pivot
HOWEVER, in reality I have a very long list of interests, and the above method in BigQuery seems to fail with:
Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex
Please help to let me know if there is anything I can do to deal with this situation. Thanks!
You are still likely to face resource problems depending on your real data, but the following approach without PIVOT is worth trying.
First, create an interests table with an additional index column:
+----------+-----+-----------------+
| interest | idx | total_interests |
+----------+-----+-----------------+
| cat | 0 | 5 |
| dog | 1 | 5 |
| music | 2 | 5 |
| soccer | 3 | 5 |
| coding | 4 | 5 |
+----------+-----+-----------------+
Find the idx of each user interest and aggregate them as below (assuming each user's interests are sparse relative to the overall interest list):
SELECT userId, ARRAY_AGG(idx) user_interests
FROM sample_table t JOIN interests i ON t.user_interest = i.interest
GROUP BY 1
Lastly, create the labels vector using the sparse user-interest array and the dimension of the interest space (i.e. total_interests), like below:
ARRAY(SELECT IF(ui IS NULL, 0, 1)
FROM UNNEST(GENERATE_ARRAY(0, total_interests - 1)) i
LEFT JOIN t.user_interests ui ON i = ui
ORDER BY i
) AS labels
Query
CREATE TEMP TABLE sample_table AS
SELECT '12345' AS userId, 'cat' AS user_interest, 1 AS label UNION ALL
SELECT '12345' AS userId, 'dog' AS user_interest, 1 AS label UNION ALL
SELECT '6789' AS userId, 'music' AS user_interest, 1 AS label UNION ALL
SELECT '6789' AS userId, 'soccer' AS user_interest, 1 AS label;
CREATE TEMP TABLE interests AS
SELECT *, COUNT(1) OVER () AS total_interests
FROM UNNEST(['cat', 'dog', 'music', 'soccer', 'coding']) interest
WITH OFFSET idx
;
SELECT userId,
ARRAY(SELECT IF(ui IS NULL, 0, 1)
FROM UNNEST(GENERATE_ARRAY(0, total_interests - 1)) i
LEFT JOIN t.user_interests ui ON i = ui
ORDER BY i
) AS labels
FROM (
SELECT userId, total_interests, ARRAY_AGG(idx) user_interests
FROM sample_table t JOIN interests i ON t.user_interest = i.interest
GROUP BY 1, 2
) t;
Query results
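The core of the approach above (sparse interest indices expanded into a dense 0/1 vector) can be sketched in Python; `binarize` is my name for the step the `ARRAY(SELECT IF(ui IS NULL, 0, 1) ...)` expression performs:

```python
def binarize(user_idx, total):
    # A 0/1 vector of length `total` with 1s at each index the user has,
    # mirroring the LEFT JOIN of generate_array(0, total - 1) against the
    # user's sparse idx array.
    owned = set(user_idx)
    return [1 if i in owned else 0 for i in range(total)]

interests = ['cat', 'dog', 'music', 'soccer', 'coding']
idx = {name: i for i, name in enumerate(interests)}

user_interests = {'12345': ['cat', 'dog'], '6789': ['music', 'soccer']}
labels = {u: binarize([idx[x] for x in xs], len(interests))
          for u, xs in user_interests.items()}
print(labels)  # {'12345': [1, 1, 0, 0, 0], '6789': [0, 0, 1, 1, 0]}
```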
I think below approach will "survive" any [reasonable] data
create temp function base10to2(x float64) returns string
language js as r'return x.toString(2);';
with your_table as (
select '12345' as userid, 'cat' as user_interest, 1 as label union all
select '12345' as userid, 'dog' as user_interest, 1 as label union all
select '6789' as userid, 'music' as user_interest, 1 as label union all
select '6789' as userid, 'soccer' as user_interest, 1 as label
), interests as (
select *, pow(2, offset) weight, max(offset + 1) over() as len
from unnest(['cat', 'dog', 'music', 'soccer', 'coding']) user_interest
with offset
)
select userid,
split(rpad(reverse(base10to2(sum(weight))), any_value(len), '0'), '') labels,
from your_table
join interests
using(user_interest)
group by userid
with output
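The powers-of-two trick above compresses each user's interests into a single integer, then renders it in base 2. A rough Python equivalent, for illustration (the helper name `labels_via_weights` is mine):

```python
def labels_via_weights(user_interest_names, all_interests):
    # Each interest gets weight 2^offset; the per-user sum is rendered in
    # base 2, reversed, and zero-padded to the number of interests --
    # mirroring split(rpad(reverse(base10to2(sum(weight))), len, '0'), '').
    weight = {name: 2 ** i for i, name in enumerate(all_interests)}
    total = sum(weight[n] for n in user_interest_names)
    bits = bin(total)[2:][::-1]                  # reverse(base10to2(...))
    bits = bits.ljust(len(all_interests), '0')   # rpad(..., len, '0')
    return [int(b) for b in bits]

interests = ['cat', 'dog', 'music', 'soccer', 'coding']
print(labels_via_weights(['cat', 'dog'], interests))       # [1, 1, 0, 0, 0]
print(labels_via_weights(['music', 'soccer'], interests))  # [0, 0, 1, 1, 0]
```

Note the float-based `pow(2, offset)` in the SQL version will lose precision past ~53 interests; the Python sketch uses exact integers.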

Transforming columns to row values

I have a table in Google BigQuery like this, with 1 id column (customers) and 3 store-name columns:
id |PA|PB|Mall|
----|--|--|----|
3699|1 |1 | 1 |
1017| |1 | 1 |
9991|1 | | |
My objective is to have the option to select customers (id's) who visited for example:
ONLY PA
PA and PB
PA and Mall
PA, PB and Mall
One alternative output could be:
id |Store |
----|--------- |
3699|PA+PB+Mall|
1017|PB+Mall |
9991|PA |
However, this would not give me counts of everyone stopping by PA regardless of other stores visited. In the example above, that count would have been 2 (3699 and 9991).
A second alternative output could be:
id |Store|
----|-----|
3699|PA |
3699|PB |
3699|Mall |
1017|PB |
1017|Mall |
9991|PA |
However, this would not allow me (I think) to select/filter those who have visited, for example, BOTH PA and Mall (only 3699).
A third alternative output could be a combo:
id |Store| Multiple store|
----|-----|---------------|
3699|PA | PA+PB+Mall |
3699|PB | PA+PB+Mall |
3699|Mall | PA+PB+Mall |
1017|PB | PB+Mall |
1017|Mall | PB+Mall |
9991|PA | |
Which option is best, and are there any other alternatives to achieve my objective? I believe alternative 3 could be best, but I am not sure how to achieve it.
It depends on what you want. For instance, the third would simply be:
select t.*,
string_agg(store, '+') over (partition by id)
from t;
The second would be:
select id, string_agg(store, '+')
from t
group by id;
For the third option, you may try unpivoting your current table, then applying STRING_AGG to get the computed column containing all stores for each id:
WITH cte AS (
SELECT id, CASE WHEN PA = 1 THEN 'PA' END AS Store
FROM yourTable
UNION ALL
SELECT id, CASE WHEN PB = 1 THEN 'PB' END
FROM yourTable
UNION ALL
SELECT id, CASE WHEN Mall = 1 THEN 'Mall' END
FROM yourTable
)
SELECT id, Store,
STRING_AGG(store, '+') OVER (PARTITION BY id) All_Stores
FROM cte
WHERE Store IS NOT NULL
ORDER BY id, Store;
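The unpivot-then-window-aggregate shape of the query above can be sketched in Python for illustration (the function name `unpivot_with_all_stores` is mine, not from the answer):

```python
def unpivot_with_all_stores(rows):
    # Mirrors the UNION ALL unpivot plus STRING_AGG(store, '+') OVER
    # (PARTITION BY id): one (id, store, all_stores) row per visited store.
    out = []
    for id_, cols in rows.items():
        visited = [s for s in ('PA', 'PB', 'Mall') if cols.get(s) == 1]
        all_stores = '+'.join(visited)
        out.extend((id_, s, all_stores) for s in visited)
    return out

rows = {3699: {'PA': 1, 'PB': 1, 'Mall': 1},
        1017: {'PA': None, 'PB': 1, 'Mall': 1},
        9991: {'PA': 1, 'PB': None, 'Mall': None}}
for row in unpivot_with_all_stores(rows):
    print(row)
# (3699, 'PA', 'PA+PB+Mall') ... (1017, 'PB', 'PB+Mall') ... (9991, 'PA', 'PA')
```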
Consider the approaches below for all three options.
Assuming the input data contains NULLs where the question's sample shows empty cells:
with `project.dataset.table` as (
select 3699 id, 1 PA, 1 PB, 1 Mall union all
select 1017, null, 1, 1 union all
select 9991, 1, null, null
)
Option #1
select id, string_agg(key, '+') as Store
from `project.dataset.table` t,
unnest(split(translate(to_json_string(t), '{}"', ''))) kv,
unnest([struct(split(kv,':')[offset(0)] as key, split(kv,':')[offset(1)] as value)])
where key !='id'
and value != 'null'
group by id
with output
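The `to_json_string` + `translate` trick above turns each row into `key:value` pairs without naming the columns. A rough Python equivalent of Option #1, for illustration (the helper name `stores` is mine):

```python
import json

def stores(row):
    # Mirrors to_json_string(t) stripped of '{}"', split on ',' into
    # key:value pairs, keeping non-id keys whose value is not null and
    # joining them with '+'.
    s = json.dumps(row, separators=(",", ":"))
    s = s.translate(str.maketrans("", "", '{}"'))
    pairs = [kv.split(":", 1) for kv in s.split(",")]
    return "+".join(k for k, v in pairs if k != "id" and v != "null")

print(stores({"id": 3699, "PA": 1, "PB": 1, "Mall": 1}))     # PA+PB+Mall
print(stores({"id": 1017, "PA": None, "PB": 1, "Mall": 1}))  # PB+Mall
```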
Option #2
select id, key as Store
from `project.dataset.table` t,
unnest(split(translate(to_json_string(t), '{}"', ''))) kv,
unnest([struct(split(kv,':')[offset(0)] as key, split(kv,':')[offset(1)] as value)])
where key !='id'
and value != 'null'
with output
Option #3
select id, key as Store,
string_agg(key, '+') over(partition by id) as Multiple_Store
from `project.dataset.table` t,
unnest(split(translate(to_json_string(t), '{}"', ''))) kv,
unnest([struct(split(kv,':')[offset(0)] as key, split(kv,':')[offset(1)] as value)])
where key !='id'
and value != 'null'
with output

Oracle - how to select multiple rows as 1 row with columns

I am using Oracle 12c and have data coming back as multiple rows that I would like to switch to a single-row select statement with headers describing the data. The twist here is that the data column is a CLOB.
Here is an example (in reality, this would be a dozen rows):
select ID, description, data from dual
|---------------------|------------------|------------------|
| ID | Description | Data |
|---------------------|------------------|------------------|
| 1 | DescriptionA | TestA |
|---------------------|------------------|------------------|
| 2 | DescriptionB | TestB |
|---------------------|------------------|------------------|
I would like it to look like this instead:
|---------------------|------------------|
| DescriptionA | DescriptionB |
|---------------------|------------------|
| TestA | TestB |
|---------------------|------------------|
Any ideas are greatly appreciated!
Thank you!
You can use CASE WHEN:
with t(ID, Description,Data) as
(
select 1, 'DescriptionA','TestA' from dual
union all
select 2, 'DescriptionB','TestB' from dual
)
select max(case when Description='DescriptionA' then Data end) as DescriptionA,
max(case when Description='DescriptionB' then Data end) as DescriptionB from t
DESCRIPTIONA DESCRIPTIONB
TestA TestB
Here is another option. If you want dynamic values instead of hard-coding against the ID column, use a dynamic query.
SELECT MAX(DECODE(T.ID, 1, T.TE)) AS DES1,
MAX(DECODE(T.ID, 2, T.TE)) AS DES2
FROM (SELECT 1 as id, 'DescriptionA' AS DES, 'TestA' AS TE FROM DUAL
UNION ALL
SELECT 2 as id, 'DescriptionB' AS DES, 'TestB' AS TE FROM DUAL)T

SQL: List/aggregate all items for one corresponding transaction id

I have the following table in a vertica db:
+-----+------+
| Tid | Item |
+-----+------+
| 1 | A |
| 1 | B |
| 1 | C |
| 2 | B |
| 2 | D |
+-----+------+
And I want to get this table:
+-----+-------+-------+-------+
| Tid | Item1 | Item2 | Item3 |
+-----+-------+-------+-------+
| 1 | A | B | C |
| 2 | B | D | |
+-----+-------+-------+-------+
Keep in mind that I don't know the maximum number of items a transaction id (Tid) can have, and the number of items per Tid is not constant. I tried using join and where but could not get it to work properly. Thanks for the help.
There is no PIVOT ability in Vertica. Columns cannot be defined on the fly as part of the query; you have to specify them explicitly.
There are perhaps other options, such as concatenating them in an aggregate using a UDX, like the one you will find in this Stack Overflow answer. But this will put them into a single field.
The only other alternative would be to build the pivot on the client side using something like Python. Else you have to have a way to generate the column lists for your query.
For my example, I am assuming you are dealing with a unique (Tid, Item) set. You may need to modify it to suit your needs.
First you would need to determine the max number of items you need to support:
with Tid_count as (
select Tid, count(*) cnt
from mytable
group by 1
)
select max(cnt)
from Tid_count;
And let's say the most items you had to support was 4; you would then generate SQL to pivot:
with numbered_mytable as (
select Tid,
Item,
row_number() over (partition by Tid order by Item) rn
from mytable
)
select Tid,
MAX(decode(rn,1,Item)) Item1,
MAX(decode(rn,2,Item)) Item2,
MAX(decode(rn,3,Item)) Item3,
MAX(decode(rn,4,Item)) Item4
from numbered_mytable
group by 1
order by 1;
Or if you don't want to generate SQL, but know you'll never have more than X items, you can just create a static form that goes to X.
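The numbering-then-pivoting step (`row_number()` per Tid, then `MAX(decode(rn, k, Item))`) can be sketched in Python for illustration; `pivot_items` and the `width` parameter are my names, not part of the answer:

```python
from collections import defaultdict

def pivot_items(rows, width):
    # Items are grouped per Tid, ordered (the row_number ordering), and
    # padded with None out to `width` columns (the fixed Item1..ItemN list).
    grouped = defaultdict(list)
    for tid, item in rows:
        grouped[tid].append(item)
    return {tid: sorted(items) + [None] * (width - len(items))
            for tid, items in sorted(grouped.items())}

rows = [(1, 'A'), (1, 'B'), (1, 'C'), (2, 'B'), (2, 'D')]
print(pivot_items(rows, 3))
# {1: ['A', 'B', 'C'], 2: ['B', 'D', None]}
```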
You can try this:
Create table #table(id int,Value varchar(1))
insert into #table
select 1,'A'
union
select 1,'B'
union
select 1,'C'
union
select 2,'B'
union
select 2,'D'
select id, [1] Item1, [2] Item2, [3] Item3 from
(
select id, dense_rank() over(partition by id order by Value) rnk, Value from #table
) d
pivot
(min(Value) for rnk in ([1],[2],[3])) p
drop table #table

SQL Select a group when attributes match at least a list of values

Given a table with a (non-distinct) identifier and a value:
| ID | Value |
|----|-------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 2 | C |
| 3 | A |
| 3 | B |
How can you select the grouped identifiers, which have values for a given list? (e.g. ('B', 'C'))
This list might also be the result of another query (like SELECT Value from Table1 WHERE ID = '2' to find all IDs which have a superset of values, compared to ID=2 (only ID=1 in this example))
Result
| ID |
|----|
| 1 |
| 2 |
1 and 2 are part of the result, as they have both B and C in their Value column. 3 is not included, as it is missing C.
Thanks to the answer to this question: SQL Select only rows where exact multiple relationships exist, I created a query which works for a fixed list. However, I need to be able to use the results of another query without changing the query. (It also requires the Access-specific IIF function):
SELECT ID FROM Table1
GROUP BY ID
HAVING SUM(Value NOT IN ('A', 'B')) = 0
AND SUM(IIF(Value='A', 1, 0)) = 1
AND SUM(IIF(Value='B', 1, 0)) = 1
In case it matters: The SQL is run on a Excel-table via VBA and ADODB.
Filter on the list of values you would like to see in the WHERE clause, group by id, and in the HAVING clause keep only those ids which have 3 matching rows.
select id from table1
where value in ('A', 'B', 'C') --you can use a result of another query here
group by id
having count(*)=3
If you can have the same id - value pair more than once, then you need to slightly alter the having clause: having count(distinct value)=3
If you want to make it completely dynamic based on a subquery, then:
select id, min(valcount) as minvalcount from table1
cross join (select count(*) as valcount from table1 where id=2) as t1
where value in (select value from table1 where id=2) --you can use a result of another query here
group by id
having count(*)=minvalcount
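The filter-group-count idea above amounts to a set-containment check, which can be sketched in Python for illustration (the function name `ids_with_all` is mine):

```python
def ids_with_all(rows, wanted):
    # Mirrors WHERE value IN (...) GROUP BY id HAVING count(distinct value) = n:
    # an id qualifies when its value set contains every wanted value.
    wanted = set(wanted)
    values = {}
    for id_, v in rows:
        values.setdefault(id_, set()).add(v)
    return sorted(id_ for id_, vs in values.items() if wanted <= vs)

rows = [(1, 'A'), (1, 'B'), (1, 'C'), (1, 'D'),
        (2, 'A'), (2, 'B'), (2, 'C'),
        (3, 'A'), (3, 'B')]
print(ids_with_all(rows, ['B', 'C']))  # [1, 2]
```

Using `count(distinct value)` (a set, here) rather than `count(*)` is what makes the check safe against duplicate id-value pairs.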