I am giving up on the SQL solution and switching to Pandas.
My goal is to merge the integer values per account, as below:
Data input:

ACCT  SOURCES
A     1
A     2
B     1
C     4
Expected output:

ACCT  SOURCES
A     1,2
B     1
C     4
Given:
ACCT SOURCES
0 A 1
1 A 2
2 B 1
3 C 4
Doing:
df.SOURCES = df.SOURCES.astype(str)
df = df.groupby('ACCT', as_index=False)['SOURCES'].agg(','.join)
print(df)
Output:
ACCT SOURCES
0 A 1,2
1 B 1
2 C 4
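Put together, the snippet above runs end-to-end; a minimal sketch using the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"ACCT": ["A", "A", "B", "C"], "SOURCES": [1, 2, 1, 4]})

# join() needs strings, so cast first, then concatenate per account
df["SOURCES"] = df["SOURCES"].astype(str)
out = df.groupby("ACCT", as_index=False)["SOURCES"].agg(",".join)
print(out)
#   ACCT SOURCES
# 0    A     1,2
# 1    B       1
# 2    C       4
```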
You can use XMLAGG to concatenate them. It puts spaces between the values, which you can then replace with commas.
The innermost cast is needed only if sources is actually defined as an integer rather than char/varchar.
select
    acct,
    oreplace(cast(xmlagg(cast(sources as varchar(5))) as varchar(10000)), ' ', ',')
from <your table>
group by acct
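XMLAGG plus oreplace is Teradata-specific; most other engines expose a direct string-aggregation function (GROUP_CONCAT in MySQL/SQLite, LISTAGG in Oracle, STRING_AGG in SQL Server/Postgres). A sketch of the same query against SQLite, driven from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (acct TEXT, sources INTEGER);
    INSERT INTO t VALUES ('A', 1), ('A', 2), ('B', 1), ('C', 4);
""")

# GROUP_CONCAT aggregates the values directly, comma-separated
rows = conn.execute("""
    SELECT acct, GROUP_CONCAT(sources, ',')
    FROM t
    GROUP BY acct
    ORDER BY acct
""").fetchall()
print(rows)  # e.g. [('A', '1,2'), ('B', '1'), ('C', '4')]
```

Note that the order of values inside GROUP_CONCAT is formally unspecified, so add your own ordering logic if it matters.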
Sample Data:
P Q R
1 A 3
1 A 3
1 A 2
1 B 5
1 C 7
2 A 3
2 A 3
Expected Output:
P Q R
1 A 5
1 B 5
1 C 7
2 A 3
I have tried Sum(Distinct R), but it is not working. I need to group by the P and Q columns and sum the distinct values of R within each group. Please support.
In a chart, you have to add the P and Q fields as dimensions; then your expression should work just fine.
In the script, your code should look like this:
Load
    P,
    Q,
    Sum(Distinct R) as sum_of_R
FROM sample_data
Group By P, Q;
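The same aggregation outside QlikView is a de-duplication followed by a grouped sum; a pandas sketch on the sample data (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "P": [1, 1, 1, 1, 1, 2, 2],
    "Q": ["A", "A", "A", "B", "C", "A", "A"],
    "R": [3, 3, 2, 5, 7, 3, 3],
})

# Sum(Distinct R): drop duplicate (P, Q, R) rows first, then sum per group
out = (df.drop_duplicates()
         .groupby(["P", "Q"], as_index=False)["R"]
         .sum())
print(out)
#    P  Q  R
# 0  1  A  5
# 1  1  B  5
# 2  1  C  7
# 3  2  A  3
```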
Given two tables or dataframes: one has datasets and their corresponding tables; the other has source and target columns.
I need a solution for the following condition:
Once we find ft.dataset = st.source, we need to replace st.source with ft.table and drop the remaining records of that block.
For example: in the first block of the second table, which runs from seq_no 1 to 6, we have a match at Abc, so we replace it with db.table1 and drop the remaining records in that block. We need to do the same for every block of the second table.
Note that Target is the same in all rows of the second table.
Please help me with a possible solution in PySpark or Hive.
First table (ft):

Dataset | Table
--------|----------
Abc     | db.table1
Xyz     | db.table2
Def     | db.table3
Second table (st):

Target | seq_no | source
-------|--------|-------
A      | 1      | A
A      | 2      | B1
A      | 3      | C1
A      | 4      | D1
A      | 5      | Abc
A      | 6      | Xyz
A      | 1      | A
A      | 2      | B1
A      | 3      | C1
A      | 4      | D1
A      | 5      | Def
A      | 6      | Abc
A      | 7      | Xyz
Expected output:

Target | seq_no | source
-------|--------|----------
A      | 1      | A
A      | 2      | B1
A      | 3      | C1
A      | 4      | D1
A      | 5      | db.table1
A      | 1      | A
A      | 2      | B1
A      | 3      | C1
A      | 4      | D1
A      | 5      | db.table3
In Hive, you can use a left join to search for a match in the first table, and a window min() to identify the seq_no of the first match:
select target, seq_no, source
from (
    select
        st.target,
        st.seq_no,
        coalesce(ft.`table`, st.source) as source,
        min(case when ft.dataset is not null then st.seq_no end)
            over (partition by st.target) as first_matched_seq_no
    from st
    left join ft on ft.dataset = st.source
) t
where first_matched_seq_no is null or seq_no <= first_matched_seq_no
order by target, seq_no
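The left-join-plus-window-min approach can be sanity-checked on any engine with window functions; a SQLite sketch (3.25+) using the first block of the sample data. Note the coalesce must take ft.table first so that a match actually replaces the source, and `table` is a reserved word, so it is quoted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ft (dataset TEXT, "table" TEXT);
    INSERT INTO ft VALUES ('Abc','db.table1'), ('Xyz','db.table2'), ('Def','db.table3');
    CREATE TABLE st (target TEXT, seq_no INTEGER, source TEXT);
    INSERT INTO st VALUES ('A',1,'A'), ('A',2,'B1'), ('A',3,'C1'),
                          ('A',4,'D1'), ('A',5,'Abc'), ('A',6,'Xyz');
""")

# Left join to find a match, window min() to find the first matching seq_no,
# then keep only rows up to (and including) that seq_no
rows = conn.execute("""
    SELECT target, seq_no, source
    FROM (
        SELECT st.target, st.seq_no,
               COALESCE(ft."table", st.source) AS source,
               MIN(CASE WHEN ft.dataset IS NOT NULL THEN st.seq_no END)
                   OVER (PARTITION BY st.target) AS first_matched_seq_no
        FROM st LEFT JOIN ft ON ft.dataset = st.source
    ) t
    WHERE first_matched_seq_no IS NULL OR seq_no <= first_matched_seq_no
    ORDER BY target, seq_no
""").fetchall()
print(rows)
# [('A', 1, 'A'), ('A', 2, 'B1'), ('A', 3, 'C1'), ('A', 4, 'D1'), ('A', 5, 'db.table1')]
```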
I have a SQL table of the following format:
ID Cat
1 A
1 B
1 D
1 F
2 B
2 C
2 D
3 A
3 F
Now, I want to create a table with one ID per row and multiple Cats per row. My desired output looks as follows:
ID A B C D E F
1 1 1 0 1 0 1
2 0 1 1 1 0 0
3 1 0 0 0 0 1
I have found:
Transform table to one-hot-encoding of single column value
However, I have more than 1000 Cats, so I am looking for code that generates this automatically rather than writing it by hand. Who can help me with this?
First let me transform the data you pasted into an actual table:
WITH data AS (
SELECT REGEXP_EXTRACT(data2, '[0-9]') id, REGEXP_EXTRACT(data2, '[A-Z]') cat
FROM (
SELECT SPLIT("""1 A
1 B
1 D
1 F
2 B
2 C
2 D
3 A
3 F""", '\n') AS data1
), UNNEST(data1) data2
)
SELECT * FROM data
(try sharing a table next time)
Now we can do some manual 1-hot encoding:
SELECT id
, MAX(IF(cat='A',1,0)) cat_A
, MAX(IF(cat='B',1,0)) cat_B
, MAX(IF(cat='C',1,0)) cat_C
FROM data
GROUP BY id
Now we want to write a script that will automatically create the columns we want:
SELECT STRING_AGG(FORMAT("MAX(IF(cat='%s',1,0))cat_%s", cat, cat), ', ')
FROM (
SELECT DISTINCT cat
FROM data
ORDER BY 1
)
That generates a string that you can copy-paste into a query that one-hot encodes your rows:
SELECT id
,
MAX(IF(cat='A',1,0))cat_A, MAX(IF(cat='B',1,0))cat_B, MAX(IF(cat='C',1,0))cat_C, MAX(IF(cat='D',1,0))cat_D, MAX(IF(cat='F',1,0))cat_F
FROM data
GROUP BY id
And that's exactly what the question was asking for. You can generate SQL with SQL, but you'll need to write a new query using that result.
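If the data can leave the database, the same pivot is a single call in pandas; a sketch using crosstab (note that categories absent from the data, such as E in the desired output above, simply produce no column):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":  [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Cat": ["A", "B", "D", "F", "B", "C", "D", "A", "F"],
})

# crosstab counts occurrences per (ID, Cat); clip to 0/1 for one-hot.
# Scales to 1000+ categories with no hand-written column list.
onehot = pd.crosstab(df["ID"], df["Cat"]).clip(upper=1)
print(onehot)
# Cat  A  B  C  D  F
# ID
# 1    1  1  0  1  1
# 2    0  1  1  1  0
# 3    1  0  0  0  1
```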
BigQuery standard SQL has no dynamic columns, but depending on what you want to do in the next step, there might be a way to make it easier.
The following code sample groups Cat by ID and uses a JavaScript function to do the one-hot encoding and return a JSON string.
CREATE TEMP FUNCTION trans(cats ARRAY<STRING>)
RETURNS STRING
LANGUAGE js
AS
"""
// TODO: Doing one hot encoding for one cat and return as JSON string
return "{a:1}";
"""
;
WITH id_cat AS (
SELECT 1 as ID, 'A' As Cat UNION ALL
SELECT 1 as ID, 'B' As Cat UNION ALL
SELECT 1 as ID, 'C' As Cat UNION ALL
SELECT 2 as ID, 'A' As Cat UNION ALL
SELECT 3 as ID, 'C' As Cat)
SELECT ID, trans(ARRAY_AGG(Cat))
FROM id_cat
GROUP BY ID;
I currently have a table with a quantity in it.
ID Code Quantity
1 A 1
2 B 3
3 C 2
4 D 1
Is there any way to write a SQL statement that would get me
ID Code Quantity
1 A 1
2 B 1
2 B 1
2 B 1
3 C 1
3 C 1
4 D 1
I need to break out the quantity and produce that many rows, each with a quantity of 1.
Thanks
Here's one option using a numbers table to join to:
with numberstable as (
select 1 AS Number
union all
select Number + 1 from numberstable where Number<100
)
select t.id, t.code, 1
from yourtable t
join numberstable n on t.quantity >= n.number
order by t.id
Please note, depending on which database you are using, this may not be the correct approach to creating the numbers table. This works in most databases supporting common table expressions. But the key to the answer is the join and the on criteria.
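The numbers-table join can be tried directly in SQLite, where the CTE is spelled with the RECURSIVE keyword; a sketch via Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourtable (id INTEGER, code TEXT, quantity INTEGER);
    INSERT INTO yourtable VALUES (1,'A',1), (2,'B',3), (3,'C',2), (4,'D',1);
""")

# Join each row to the numbers 1..quantity, emitting one row per unit
rows = conn.execute("""
    WITH RECURSIVE numberstable(number) AS (
        SELECT 1
        UNION ALL
        SELECT number + 1 FROM numberstable WHERE number < 100
    )
    SELECT t.id, t.code, 1
    FROM yourtable t
    JOIN numberstable n ON t.quantity >= n.number
    ORDER BY t.id
""").fetchall()
print(rows)
# [(1, 'A', 1), (2, 'B', 1), (2, 'B', 1), (2, 'B', 1), (3, 'C', 1), (3, 'C', 1), (4, 'D', 1)]
```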
One way would be to generate an array with X elements (where X is the quantity). So for rows
ID Code Quantity
1 A 1
2 B 3
3 C 2
you would get

ID Code Quantity ArrayVar
1  A    1        [1]
2  B    3        [1,2,3]
3  C    2        [1,2]

using a sequence function (e.g., in PrestoDB, sequence(start, stop) -> array(bigint)).
Then unnest the array, so for each ID you get X rows, and set the quantity to 1. Not sure which SQL engine you're using, but this should work!
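The array-and-unnest idea maps directly onto pandas: repeat each row Quantity times, then reset the quantity to 1. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Code": ["A", "B", "C", "D"],
                   "Quantity": [1, 3, 2, 1]})

# Repeat each row Quantity times (the unnest step), then set Quantity to 1
out = (df.loc[df.index.repeat(df["Quantity"])]
         .assign(Quantity=1)
         .reset_index(drop=True))
print(out)
```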
You can use a connect by statement to cross join the tables in order to get your desired output.
Check my solution; it is quite robust.
select
    "ID",
    "Code",
    1 as QUANTITY
from Table1,
     table(cast(multiset(
         select level from dual
         connect by level <= Table1."Quantity"
     ) as sys.OdciNumberList));
I would like to know how I can sort values from a table based on a custom hierarchy of values in a field.
EX:
A B
--------
1 A
2 F
3 A
4 P
5 O
6 F
I would like to sort the values by the B field so that the F values appear first, then the A values, then the P values, and finally the O values.
In the end, the result must be like this:
2 F
6 F
1 A
3 A
4 P
5 O
Use a case expression in order by.
select *
from tablename
order by case when B = 'F' then 1
when B = 'A' then 2
when B = 'P' then 3
when B = 'O' then 4
end, A
More compact:
order by translate (B, 'FAPO', '1234')
This will also allow you, if needed (now or in the future), to compare multi-letter values such as PF against PA, rather than just single-letter values in column B.
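The same custom ordering in pandas uses an ordered Categorical rather than a case expression; a sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6],
                   "B": ["A", "F", "A", "P", "O", "F"]})

# Declare the custom hierarchy F < A < P < O, then sort on it (ties broken by A)
df["B"] = pd.Categorical(df["B"], categories=["F", "A", "P", "O"], ordered=True)
out = df.sort_values(["B", "A"]).reset_index(drop=True)
print(out)
#    A  B
# 0  2  F
# 1  6  F
# 2  1  A
# 3  3  A
# 4  4  P
# 5  5  O
```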