I have a SQL table of the following format:
ID Cat
1 A
1 B
1 D
1 F
2 B
2 C
2 D
3 A
3 F
Now, I want to create a table with one ID per row and multiple Cats per row. My desired output looks as follows:
ID A B C D E F
1 1 1 0 1 0 1
2 0 1 1 1 0 0
3 1 0 0 0 0 1
I have found:
Transform table to one-hot-encoding of single column value
However, I have more than 1000 Cats, so I am looking for code that writes this automatically rather than manually. Who can help me with this?
First let me transform the data you pasted into an actual table:
WITH data AS (
SELECT REGEXP_EXTRACT(data2, '[0-9]') id, REGEXP_EXTRACT(data2, '[A-Z]') cat
FROM (
SELECT SPLIT("""1 A
1 B
1 D
1 F
2 B
2 C
2 D
3 A
3 F""", '\n') AS data1
), UNNEST(data1) data2
)
SELECT * FROM data
(try sharing a table next time)
Now we can do some manual 1-hot encoding:
SELECT id
, MAX(IF(cat='A',1,0)) cat_A
, MAX(IF(cat='B',1,0)) cat_B
, MAX(IF(cat='C',1,0)) cat_C
FROM data
GROUP BY id
Now we want to write a script that will automatically create the columns we want:
SELECT STRING_AGG(FORMAT("MAX(IF(cat='%s',1,0))cat_%s", cat, cat), ', ')
FROM (
SELECT DISTINCT cat
FROM data
ORDER BY 1
)
That generates a string you can copy-paste into a query that one-hot encodes your rows:
SELECT id
,
MAX(IF(cat='A',1,0))cat_A, MAX(IF(cat='B',1,0))cat_B, MAX(IF(cat='C',1,0))cat_C, MAX(IF(cat='D',1,0))cat_D, MAX(IF(cat='F',1,0))cat_F
FROM data
GROUP BY id
And that's exactly what the question was asking for. You can generate SQL with SQL, but you'll need to write a new query using that result.
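The same two-step pattern (query the distinct categories, then build and run the pivot query) can be sketched end-to-end in Python with SQLite; the table and column names here just mirror the example above:

```python
import sqlite3

# In-memory table mirroring the sample data from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data (id INTEGER, cat TEXT);
INSERT INTO data VALUES (1,'A'),(1,'B'),(1,'D'),(1,'F'),
                        (2,'B'),(2,'C'),(2,'D'),(3,'A'),(3,'F');
""")

# Step 1: discover the distinct categories.
cats = [r[0] for r in con.execute("SELECT DISTINCT cat FROM data ORDER BY cat")]

# Step 2: build the one-hot columns and run the generated query --
# the same "generate SQL with SQL" idea, done client-side.
cols = ", ".join(
    f"MAX(CASE WHEN cat='{c}' THEN 1 ELSE 0 END) AS cat_{c}" for c in cats
)
rows = list(con.execute(f"SELECT id, {cols} FROM data GROUP BY id ORDER BY id"))
print(rows)
# [(1, 1, 1, 0, 1, 1), (2, 0, 1, 1, 1, 0), (3, 1, 0, 0, 0, 1)]
```

Note the output has no cat_E column: unlike the hand-written desired output, a generated query can only pivot on categories that actually occur in the data.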
BigQuery standard SQL has no dynamic columns, but depending on what you want to do in the next step, there might be a way to make it easier.
The following code sample groups Cat by ID and uses a JavaScript function to do the one-hot encoding and return a JSON string.
CREATE TEMP FUNCTION trans(cats ARRAY<STRING>)
RETURNS STRING
LANGUAGE js
AS
"""
// TODO: Doing one hot encoding for one cat and return as JSON string
return "{a:1}";
"""
;
WITH id_cat AS (
SELECT 1 as ID, 'A' As Cat UNION ALL
SELECT 1 as ID, 'B' As Cat UNION ALL
SELECT 1 as ID, 'C' As Cat UNION ALL
SELECT 2 as ID, 'A' As Cat UNION ALL
SELECT 3 as ID, 'C' As Cat)
SELECT ID, trans(ARRAY_AGG(Cat))
FROM id_cat
GROUP BY ID;
I am giving up the SQL solution, and now switching to Pandas.
My goal is to merge the integer data as below:
Data input:
ACCT  SOURCES
A     1
A     2
B     1
C     4
expected output:
ACCT  SOURCES
A     1,2
B     1
C     4
Given:
ACCT SOURCES
0 A 1
1 A 2
2 B 1
3 C 4
Doing:
df.SOURCES = df.SOURCES.astype(str)
df = df.groupby('ACCT', as_index=False)['SOURCES'].agg(','.join)
print(df)
Output:
ACCT SOURCES
0 A 1,2
1 B 1
2 C 4
You can use XMLAGG to concatenate them together. It puts spaces between the values; you can replace those with a comma.
The innermost cast is needed if sources is actually defined as an integer rather than char/varchar.
select
acct,
oreplace(cast(xmlagg(cast(sources as varchar(5))) as varchar(10000)),' ',',')
from
<your table>
group by
acct
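If you are not tied to Teradata, most databases have a native string aggregate that avoids the XMLAGG/OREPLACE workaround entirely. Here is a sketch using SQLite's GROUP_CONCAT (note that, unlike LISTAGG, its concatenation order is not guaranteed by the SQL standard):

```python
import sqlite3

# GROUP_CONCAT does the same job as the XMLAGG + OREPLACE trick above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (acct TEXT, sources INTEGER);
INSERT INTO t VALUES ('A',1),('A',2),('B',1),('C',4);
""")
rows = list(con.execute(
    "SELECT acct, GROUP_CONCAT(sources, ',') FROM t GROUP BY acct ORDER BY acct"
))
print(rows)  # one row per acct, e.g. 'A' -> '1,2'
```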
I just can’t figure this one out. I've been trying for hours.
I have a table like this…
ID  sample
1   A
1   B
1   C
1   D
2   A
2   B
3   A
4   A
4   B
4   C
5   B
I'm interested in getting all the samples that match 'A', 'B' and 'C' for a given ID. The ID must contain all 3 sample types. There are a lot more sample types in the table but I'm interested in just A, B and C.
Here's my desired output...
ID  sample
1   A
1   B
1   C
4   A
4   B
4   C
If I use this:
WHERE sample in ('A', 'B', 'C')
I get this result:
ID  sample
1   A
1   B
1   C
1   D
2   A
2   B
3   A
4   A
4   B
4   C
5   B
Any ideas on how I can get my desired output?
One ANSI-compliant way is to aggregate using a distinct count:
select id, sample
from t
where sample in ('A','B','C')
and id in (
select id from t
where sample in ('A','B','C')
group by id
having Count(distinct sample)=3
);
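For a quick sanity check outside the database, the same distinct-count query runs unchanged in SQLite:

```python
import sqlite3

# The distinct-count filter from above, run against the sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, sample TEXT);
INSERT INTO t VALUES (1,'A'),(1,'B'),(1,'C'),(1,'D'),
 (2,'A'),(2,'B'),(3,'A'),(4,'A'),(4,'B'),(4,'C'),(5,'B');
""")
rows = list(con.execute("""
    SELECT id, sample
    FROM t
    WHERE sample IN ('A','B','C')
      AND id IN (SELECT id FROM t
                 WHERE sample IN ('A','B','C')
                 GROUP BY id
                 HAVING COUNT(DISTINCT sample) = 3)
    ORDER BY id, sample
"""))
print(rows)
# [(1, 'A'), (1, 'B'), (1, 'C'), (4, 'A'), (4, 'B'), (4, 'C')]
```

Only IDs 1 and 4 carry all three of A, B, and C, matching the desired output.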
WHERE sample IN ('A', 'B', 'C')
should eliminate any other samples such as 'D'.
You could also write the same filter out explicitly:
WHERE sample = 'A' OR sample = 'B' OR sample = 'C'
Not sure what flavor of SQL is being used, but here are examples to work from:
Postgres - db-fiddle
SELECT id
FROM t
GROUP BY id
HAVING array_agg(sample) #> array['A', 'B', 'C']::varchar[];
-- HAVING 'A' = ANY (array_agg(sample))
-- AND 'B' = ANY (array_agg(sample))
-- AND 'C' = ANY (array_agg(sample))
Presto
SELECT id
FROM t
GROUP BY id
HAVING contains(array_agg(sample), 'A')
AND contains(array_agg(sample), 'B')
AND contains(array_agg(sample), 'C')
Currently, I've got a single set of data from which I want to exclude rows when a condition is met. The group has a common column reference.
Name Sequence Value
-----------------------------------
Text 1 1
Don 1 30
Text 2 0
Sid 2 240
Florence 2 300
Text 3 200
Casper 3 20
Cat 3 10
Text 4 0
Dem 4 50
Basically, any row in which Text's value is not equal to 0 needs to be excluded, along with the rows that share the same sequence. The expected outcome is to only have data from sequences 2 and 4.
You can try with NOT EXISTS as below-
SELECT Name,
       [Sequence],
       Value
FROM your_table t
WHERE NOT EXISTS (
    SELECT 1
    FROM your_table x
    WHERE x.[Sequence] = t.[Sequence]
      AND x.[Name] = 'Text'
      AND x.[Value] <> 0
)
As you are looking for options other than NOT EXISTS, you can try this below-
SELECT *
FROM your_table
WHERE [Sequence] NOT IN (
SELECT DISTINCT [Sequence]
FROM your_table
WHERE [Name] = 'Text'
AND [Value] <> 0
)
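A quick way to check the NOT IN approach is to run the same logic in SQLite; "seq" stands in for the [Sequence] column here since SEQUENCE is a reserved word in some dialects:

```python
import sqlite3

# Exclude every sequence that has a nonzero 'Text' row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (name TEXT, seq INTEGER, value INTEGER);
INSERT INTO t VALUES
 ('Text',1,1),('Don',1,30),('Text',2,0),('Sid',2,240),('Florence',2,300),
 ('Text',3,200),('Casper',3,20),('Cat',3,10),('Text',4,0),('Dem',4,50);
""")
rows = list(con.execute("""
    SELECT name, seq, value FROM t
    WHERE seq NOT IN (SELECT seq FROM t WHERE name = 'Text' AND value <> 0)
    ORDER BY seq, name
"""))
print(rows)
# [('Florence', 2, 300), ('Sid', 2, 240), ('Text', 2, 0),
#  ('Dem', 4, 50), ('Text', 4, 0)]
```

As expected, only sequences 2 and 4 survive.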
I have data like this:
Each column has the same number of elements in a given row, where the first element in the first column corresponds to the first element in the second column, etc.
How can I flatten this to get the below?
With a single column I am able to do this by combining a CROSS JOIN with an UNNEST, but I cannot get this to work with multiple columns: the join ends up creating every combination, and UNNEST loses the order of the array, so I can't match the elements up.
If I were building the arrays from scratch, I would use some kind of STRUCT element in there, but I can't find a way of doing this when the arrays are created by SPLIT().
WITH OFFSET is your friend here:
WITH strings AS (
SELECT "a,b,c" a, "aa,bb,cc" b
UNION ALL
SELECT "a1,b1,c1" a, "aa1,bb1,cc1" b
)
SELECT x_a, x_b
FROM strings
, UNNEST(SPLIT(a)) x_a WITH OFFSET o_a
JOIN UNNEST(SPLIT(b)) x_b WITH OFFSET o_b
ON o_a=o_b
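The idea translates directly to Python: pairing by offset is just zip() over the two split lists (sample strings taken from the query above):

```python
# Pair array elements by position, the way the WITH OFFSET join does.
rows = [("a,b,c", "aa,bb,cc"), ("a1,b1,c1", "aa1,bb1,cc1")]
pairs = [(x_a, x_b)
         for a, b in rows
         for x_a, x_b in zip(a.split(","), b.split(","))]
print(pairs)
# [('a', 'aa'), ('b', 'bb'), ('c', 'cc'),
#  ('a1', 'aa1'), ('b1', 'bb1'), ('c1', 'cc1')]
```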
Another approach for BigQuery Standard SQL is shown below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 'a|b|c' col1, 'n|o|p' col2 UNION ALL
SELECT 2, 'd|e', 'q|r' UNION ALL
SELECT 3, 'f|g|h|i', 's|t|u|v' UNION ALL
SELECT 4, 'j', 'w' UNION ALL
SELECT 5, 'k|l|m', 'x|y|z'
)
SELECT
id,
SPLIT(col1, '|')[SAFE_ORDINAL(pos)] value1,
SPLIT(col2, '|')[SAFE_ORDINAL(pos)] value2
FROM `project.dataset.table`,
UNNEST(GENERATE_ARRAY(1, ARRAY_LENGTH(SPLIT(col1, '|')))) pos
with expected result
Row id value1 value2
1 1 a n
2 1 b o
3 1 c p
4 2 d q
5 2 e r
6 3 f s
7 3 g t
8 3 h u
9 3 i v
10 4 j w
11 5 k x
12 5 l y
13 5 m z
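Mentally, the GENERATE_ARRAY + SAFE_ORDINAL trick is just indexing both split lists with the same 1-based position; a Python sketch of the pattern (using only the first two sample rows):

```python
# Index both split lists with the same 1-based position, as the
# GENERATE_ARRAY + SAFE_ORDINAL query above does.
table = [(1, 'a|b|c', 'n|o|p'), (2, 'd|e', 'q|r')]
out = [(id_, c1.split('|')[pos - 1], c2.split('|')[pos - 1])
       for id_, c1, c2 in table
       for pos in range(1, len(c1.split('|')) + 1)]
print(out)
# [(1, 'a', 'n'), (1, 'b', 'o'), (1, 'c', 'p'), (2, 'd', 'q'), (2, 'e', 'r')]
```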
I have table like this
create table aaa (id int not null, data varchar(50), numb int);
with data like this
begin
for i in 1..30 loop
insert into aaa
values (i, dbms_random.string('L',1),dbms_random.value(0,10));
end loop;
end;
Now I'm running this:
select a.id, a.data, a.numb,
count(*) over (partition by a.numb order by a.data) count,
b.id, b.data,b.numb
from aaa a, aaa b
where a.numb=b.numb
and a.data!=b.data
order by a.data;
I want to update every row where the numbers are the same but the letters differ, so that the result has more than one letter in the data column (for example "a c d e"); I just want to concatenate them. How can I do that? The point is to do something like a GROUP BY on the number, but put an additional concatenated value into the grouped column.
This is how it looks at the beginning:
id | data |numb
1 q 1
2 z 8
3 i 7
4 a 2
5 q 4
6 h 1
7 b 9
8 u 9
9 s 4
This is what I would like to get at the end:
id | data |numb
1 q h 1
2 z 8
3 i 7
4 a 2
5 q s 4
7 b u 9
Try this
SELECT MIN(id),
LISTAGG(data,' ') WITHIN GROUP(
ORDER BY data
) data,
numb
FROM aaa GROUP BY numb
ORDER BY 1
Demo
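A portable way to check the shape of this result is SQLite's GROUP_CONCAT; unlike LISTAGG ... WITHIN GROUP, its element order is not guaranteed, so the test below compares letter sets rather than exact strings:

```python
import sqlite3

# Same grouping as the LISTAGG answer: one row per numb, letters
# concatenated, smallest id kept.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE aaa (id INTEGER, data TEXT, numb INTEGER);
INSERT INTO aaa VALUES (1,'q',1),(2,'z',8),(3,'i',7),(4,'a',2),
 (5,'q',4),(6,'h',1),(7,'b',9),(8,'u',9),(9,'s',4);
""")
rows = list(con.execute("""
    SELECT MIN(id), GROUP_CONCAT(data, ' '), numb
    FROM aaa GROUP BY numb ORDER BY 1
"""))
print(rows)  # six rows: numb 1 and 4 and 9 each carry two letters
```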
This selects 10 random strings 1 to 4 letters long, letters in words may repeat:
select level, dbms_random.string('l', dbms_random.value(1, 4))
from dual connect by level <= 10
This selects up to 11 random strings (one per distinct numb value, 0 through 10), each 1 to 26 letters long; letters do not repeat and are sorted:
with aaa(id, data, numb) as (
select level, dbms_random.string('L', 1),
round(dbms_random.value(0, 10))
from dual connect by level <= 30)
select numb, listagg(data) within group (order by data) list
from (select distinct data, numb from aaa)
group by numb