I have a SQL table of the following format:
ID Cat
1 A
1 B
1 D
1 F
2 B
2 C
2 D
3 A
3 F
Now, I want to create a table with one ID per row and multiple Cats per row. My desired output looks as follows:
ID A B C D E F
1 1 1 0 1 0 1
2 0 1 1 1 0 0
3 1 0 0 0 0 1
I have found:
Transform table to one-hot-encoding of single column value
However, I have more than 1000 Cats, so I am looking for code that writes this automatically rather than manually. Who can help me with this?
First let me transform the data you pasted into an actual table:
WITH data AS (
SELECT REGEXP_EXTRACT(data2, '[0-9]') id, REGEXP_EXTRACT(data2, '[A-Z]') cat
FROM (
SELECT SPLIT("""1 A
1 B
1 D
1 F
2 B
2 C
2 D
3 A
3 F""", '\n') AS data1
), UNNEST(data1) data2
)
SELECT * FROM data
(try sharing a table next time)
Now we can do some manual 1-hot encoding:
SELECT id
, MAX(IF(cat='A',1,0)) cat_A
, MAX(IF(cat='B',1,0)) cat_B
, MAX(IF(cat='C',1,0)) cat_C
FROM data
GROUP BY id
Now we want to write a script that will automatically create the columns we want:
SELECT STRING_AGG(FORMAT("MAX(IF(cat='%s',1,0))cat_%s", cat, cat), ', ')
FROM (
SELECT DISTINCT cat
FROM data
ORDER BY 1
)
That generates a string you can copy-paste into a query that one-hot encodes your rows:
SELECT id
,
MAX(IF(cat='A',1,0))cat_A, MAX(IF(cat='B',1,0))cat_B, MAX(IF(cat='C',1,0))cat_C, MAX(IF(cat='D',1,0))cat_D, MAX(IF(cat='F',1,0))cat_F
FROM data
GROUP BY id
And that's exactly what the question was asking for. You can generate SQL with SQL, but you'll need to write a new query using that result.
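The same two-step pattern (query the distinct categories, then build and run the pivot query) can be sketched end-to-end in Python with SQLite; the table and column names here just mirror the example above:

```python
import sqlite3

# In-memory table mirroring the sample data from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data (id INTEGER, cat TEXT);
INSERT INTO data VALUES (1,'A'),(1,'B'),(1,'D'),(1,'F'),
                        (2,'B'),(2,'C'),(2,'D'),(3,'A'),(3,'F');
""")

# Step 1: discover the distinct categories.
cats = [r[0] for r in con.execute("SELECT DISTINCT cat FROM data ORDER BY cat")]

# Step 2: build the one-hot columns and run the generated query --
# the same "generate SQL with SQL" idea, done client-side.
cols = ", ".join(
    f"MAX(CASE WHEN cat='{c}' THEN 1 ELSE 0 END) AS cat_{c}" for c in cats
)
rows = list(con.execute(f"SELECT id, {cols} FROM data GROUP BY id ORDER BY id"))
print(rows)
# [(1, 1, 1, 0, 1, 1), (2, 0, 1, 1, 1, 0), (3, 1, 0, 0, 0, 1)]
```

Note the output has no cat_E column: unlike the hand-written desired output, a generated query can only pivot on categories that actually occur in the data.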
BigQuery standard SQL has no dynamic columns, but depending on what you want to do in the next step, there might be a way to make it easier.
The following code sample groups Cat by ID and uses a JavaScript function to do the one-hot encoding and return a JSON string.
CREATE TEMP FUNCTION trans(cats ARRAY<STRING>)
RETURNS STRING
LANGUAGE js
AS
"""
// TODO: Doing one hot encoding for one cat and return as JSON string
return "{a:1}";
"""
;
WITH id_cat AS (
SELECT 1 as ID, 'A' As Cat UNION ALL
SELECT 1 as ID, 'B' As Cat UNION ALL
SELECT 1 as ID, 'C' As Cat UNION ALL
SELECT 2 as ID, 'A' As Cat UNION ALL
SELECT 3 as ID, 'C' As Cat)
SELECT ID, trans(ARRAY_AGG(Cat))
FROM id_cat
GROUP BY ID;
I am giving up the SQL solution, and now switching to Pandas.
My goal is to merge the integer data as below:
Data input:
ACCT  SOURCES
A     1
A     2
B     1
C     4
expected output:
ACCT  SOURCES
A     1,2
B     1
C     4
Given:
ACCT SOURCES
0 A 1
1 A 2
2 B 1
3 C 4
Doing:
df.SOURCES = df.SOURCES.astype(str)
df = df.groupby('ACCT', as_index=False)['SOURCES'].agg(','.join)
print(df)
Output:
ACCT SOURCES
0 A 1,2
1 B 1
2 C 4
You can use XMLAGG to concatenate them together. It puts spaces between the values; you can replace those with a comma.
The innermost cast is needed if sources is actually defined as an integer rather than char/varchar.
select
acct,
oreplace(cast(xmlagg(cast(sources as varchar(5))) as varchar(10000)),' ',',')
from
<your table>
group by
acct
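If you are not tied to Teradata, most databases have a native string aggregate that avoids the XMLAGG/OREPLACE workaround entirely. Here is a sketch using SQLite's GROUP_CONCAT (note that, unlike LISTAGG, its concatenation order is not guaranteed by the SQL standard):

```python
import sqlite3

# GROUP_CONCAT does the same job as the XMLAGG + OREPLACE trick above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (acct TEXT, sources INTEGER);
INSERT INTO t VALUES ('A',1),('A',2),('B',1),('C',4);
""")
rows = list(con.execute(
    "SELECT acct, GROUP_CONCAT(sources, ',') FROM t GROUP BY acct ORDER BY acct"
))
print(rows)  # one row per acct, e.g. 'A' -> '1,2'
```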
I just can’t figure this one out. I've been trying for hours.
I have a table like this…
ID  sample
1   A
1   B
1   C
1   D
2   A
2   B
3   A
4   A
4   B
4   C
5   B
I'm interested in getting all the samples that match 'A', 'B' and 'C' for a given ID. The ID must contain all 3 sample types. There are a lot more sample types in the table but I'm interested in just A, B and C.
Here's my desired output...
ID  sample
1   A
1   B
1   C
4   A
4   B
4   C
If I use this:
WHERE sample in ('A', 'B', 'C')
I get this result:
ID  sample
1   A
1   B
1   C
1   D
2   A
2   B
3   A
4   A
4   B
4   C
5   B
Any ideas on how I can get my desired output?
One ANSI-compliant way is to aggregate using a distinct count:
select id, sample
from t
where sample in ('A','B','C')
and id in (
select id from t
where sample in ('A','B','C')
group by id
having Count(distinct sample)=3
);
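For a quick sanity check outside the database, the same distinct-count query runs unchanged in SQLite:

```python
import sqlite3

# The distinct-count filter from above, run against the sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER, sample TEXT);
INSERT INTO t VALUES (1,'A'),(1,'B'),(1,'C'),(1,'D'),
 (2,'A'),(2,'B'),(3,'A'),(4,'A'),(4,'B'),(4,'C'),(5,'B');
""")
rows = list(con.execute("""
    SELECT id, sample
    FROM t
    WHERE sample IN ('A','B','C')
      AND id IN (SELECT id FROM t
                 WHERE sample IN ('A','B','C')
                 GROUP BY id
                 HAVING COUNT(DISTINCT sample) = 3)
    ORDER BY id, sample
"""))
print(rows)
# [(1, 'A'), (1, 'B'), (1, 'C'), (4, 'A'), (4, 'B'), (4, 'C')]
```

Only IDs 1 and 4 carry all three of A, B, and C, matching the desired output.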
WHERE sample IN ('A', 'B', 'C')
should eliminate any other samples such as 'D'.
You could also write the same filter out explicitly:
WHERE sample = 'A' OR sample = 'B' OR sample = 'C'
Not sure what flavor of SQL is being used, but here are examples to work from:
Postgres - db-fiddle
SELECT id
FROM t
GROUP BY id
HAVING array_agg(sample) #> array['A', 'B', 'C']::varchar[];
-- HAVING 'A' = ANY (array_agg(sample))
-- AND 'B' = ANY (array_agg(sample))
-- AND 'C' = ANY (array_agg(sample))
Presto
SELECT id
FROM t
GROUP BY id
HAVING contains(array_agg(sample), 'A')
AND contains(array_agg(sample), 'B')
AND contains(array_agg(sample), 'C')
Currently, I've got a single set of data from which I want to exclude rows when a condition is met. The group has a common column reference.
Name Sequence Value
-----------------------------------
Text 1 1
Don 1 30
Text 2 0
Sid 2 240
Florence 2 300
Text 3 200
Casper 3 20
Cat 3 10
Text 4 0
Dem 4 50
Basically, any row in which Text's value is not equal to 0 needs to be excluded, along with the rows that share the same sequence. The expected outcome is to only have data from sequences 2 and 4.
You can try with NOT EXISTS as below-
SELECT Name,
       [Sequence],
       Value
FROM your_table t
WHERE NOT EXISTS (
    SELECT 1
    FROM your_table x
    WHERE x.[Sequence] = t.[Sequence]
      AND x.[Name] = 'Text'
      AND x.[Value] <> 0
)
As you are looking for options other than NOT EXISTS, you can try this below-
SELECT *
FROM your_table
WHERE [Sequence] NOT IN (
SELECT DISTINCT [Sequence]
FROM your_table
WHERE [Name] = 'Text'
AND [Value] <> 0
)
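A quick way to check the NOT IN approach is to run the same logic in SQLite; "seq" stands in for the [Sequence] column here since SEQUENCE is a reserved word in some dialects:

```python
import sqlite3

# Exclude every sequence that has a nonzero 'Text' row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (name TEXT, seq INTEGER, value INTEGER);
INSERT INTO t VALUES
 ('Text',1,1),('Don',1,30),('Text',2,0),('Sid',2,240),('Florence',2,300),
 ('Text',3,200),('Casper',3,20),('Cat',3,10),('Text',4,0),('Dem',4,50);
""")
rows = list(con.execute("""
    SELECT name, seq, value FROM t
    WHERE seq NOT IN (SELECT seq FROM t WHERE name = 'Text' AND value <> 0)
    ORDER BY seq, name
"""))
print(rows)
# [('Florence', 2, 300), ('Sid', 2, 240), ('Text', 2, 0),
#  ('Dem', 4, 50), ('Text', 4, 0)]
```

As expected, only sequences 2 and 4 survive.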
I have data like this:
Each column has the same number of elements in a given row, where the first element in the first column corresponds to the first element in the second column, etc.
How can I flatten this to get the below?
With a single column I am able to do this by combining a CROSS JOIN with an UNNEST, but I cannot get this to work with multiple columns: the join ends up creating every combination, and UNNEST loses the order of the array, so I can't match the elements up.
If I were building the arrays from scratch, I would use some kind of STRUCT element in there, but I can't find a way of doing this when the arrays are created by SPLIT().
WITH OFFSET is your friend here:
WITH strings AS (
SELECT "a,b,c" a, "aa,bb,cc" b
UNION ALL
SELECT "a1,b1,c1" a, "aa1,bb1,cc1" b
)
SELECT x_a, x_b
FROM strings
, UNNEST(SPLIT(a)) x_a WITH OFFSET o_a
JOIN UNNEST(SPLIT(b)) x_b WITH OFFSET o_b
ON o_a=o_b
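The idea translates directly to Python: pairing by offset is just zip() over the two split lists (sample strings taken from the query above):

```python
# Pair array elements by position, the way the WITH OFFSET join does.
rows = [("a,b,c", "aa,bb,cc"), ("a1,b1,c1", "aa1,bb1,cc1")]
pairs = [(x_a, x_b)
         for a, b in rows
         for x_a, x_b in zip(a.split(","), b.split(","))]
print(pairs)
# [('a', 'aa'), ('b', 'bb'), ('c', 'cc'),
#  ('a1', 'aa1'), ('b1', 'bb1'), ('c1', 'cc1')]
```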
Another approach for BigQuery Standard SQL is shown below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 'a|b|c' col1, 'n|o|p' col2 UNION ALL
SELECT 2, 'd|e', 'q|r' UNION ALL
SELECT 3, 'f|g|h|i', 's|t|u|v' UNION ALL
SELECT 4, 'j', 'w' UNION ALL
SELECT 5, 'k|l|m', 'x|y|z'
)
SELECT
id,
SPLIT(col1, '|')[SAFE_ORDINAL(pos)] value1,
SPLIT(col2, '|')[SAFE_ORDINAL(pos)] value2
FROM `project.dataset.table`,
UNNEST(GENERATE_ARRAY(1, ARRAY_LENGTH(SPLIT(col1, '|')))) pos
with expected result
Row id value1 value2
1 1 a n
2 1 b o
3 1 c p
4 2 d q
5 2 e r
6 3 f s
7 3 g t
8 3 h u
9 3 i v
10 4 j w
11 5 k x
12 5 l y
13 5 m z
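Mentally, the GENERATE_ARRAY + SAFE_ORDINAL trick is just indexing both split lists with the same 1-based position; a Python sketch of the pattern (using only the first two sample rows):

```python
# Index both split lists with the same 1-based position, as the
# GENERATE_ARRAY + SAFE_ORDINAL query above does.
table = [(1, 'a|b|c', 'n|o|p'), (2, 'd|e', 'q|r')]
out = [(id_, c1.split('|')[pos - 1], c2.split('|')[pos - 1])
       for id_, c1, c2 in table
       for pos in range(1, len(c1.split('|')) + 1)]
print(out)
# [(1, 'a', 'n'), (1, 'b', 'o'), (1, 'c', 'p'), (2, 'd', 'q'), (2, 'e', 'r')]
```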
I have table like this
create table aaa (id int not null, data varchar(50), numb int);
with data like this
begin
for i in 1..30 loop
insert into aaa
values (i, dbms_random.string('L',1),dbms_random.value(0,10));
end loop;
end;
Now I'm running this:
select a.id, a.data, a.numb,
count(*) over (partition by a.numb order by a.data) count,
b.id, b.data,b.numb
from aaa a, aaa b
where a.numb=b.numb
and a.data!=b.data
order by a.data;
I want to update every row where the numbers are the same but the letters differ, so that the result has more than one letter in the data column (for example "a c d e"); I just want to concatenate them. How can I do that? The point is to do something like a GROUP BY on the number, but put an additional concatenated value into the grouped column.
This is how it looks at the beginning:
id | data |numb
1 q 1
2 z 8
3 i 7
4 a 2
5 q 4
6 h 1
7 b 9
8 u 9
9 s 4
This is what I would like to get at the end:
id | data |numb
1 q h 1
2 z 8
3 i 7
4 a 2
5 q s 4
7 b u 9
Try this
SELECT MIN(id),
LISTAGG(data,' ') WITHIN GROUP(
ORDER BY data
) data,
numb
FROM aaa GROUP BY numb
ORDER BY 1
Demo
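A portable way to check the shape of this result is SQLite's GROUP_CONCAT; unlike LISTAGG ... WITHIN GROUP, its element order is not guaranteed, so the test below compares letter sets rather than exact strings:

```python
import sqlite3

# Same grouping as the LISTAGG answer: one row per numb, letters
# concatenated, smallest id kept.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE aaa (id INTEGER, data TEXT, numb INTEGER);
INSERT INTO aaa VALUES (1,'q',1),(2,'z',8),(3,'i',7),(4,'a',2),
 (5,'q',4),(6,'h',1),(7,'b',9),(8,'u',9),(9,'s',4);
""")
rows = list(con.execute("""
    SELECT MIN(id), GROUP_CONCAT(data, ' '), numb
    FROM aaa GROUP BY numb ORDER BY 1
"""))
print(rows)  # six rows: numb 1 and 4 and 9 each carry two letters
```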
This selects 10 random strings 1 to 4 letters long, letters in words may repeat:
select level, dbms_random.string('l', dbms_random.value(1, 4))
from dual connect by level <= 10
This selects up to 11 random strings (one per distinct numb value, 0 through 10), each 1 to 26 letters long; letters do not repeat and are sorted:
with aaa(id, data, numb) as (
select level, dbms_random.string('L', 1),
round(dbms_random.value(0, 10))
from dual connect by level <= 30)
select numb, listagg(data) within group (order by data) list
from (select distinct data, numb from aaa)
group by numb