How to melt four columns of the same table into one column with pandas.melt - pandas

I am learning data analytics and have a project which needs 4 columns melted into one column, using - between the values.
For example, given puppo none doggo none,
we want them in one column,
as puppo-doggo.

Assume that your source DataFrame contains:
id A B C D
0 X1 1 5 9 12
1 X2 2 6 10 14
2 X3 3 7 11 15
3 X4 4 8 12 16
To melt columns A through D, you can run:
result = df.melt(id_vars=['id'], value_vars=['A', 'B', 'C', 'D'],
                 var_name='Column', value_name='Value')
The result is:
id Column Value
0 X1 A 1
1 X2 A 2
2 X3 A 3
3 X4 A 4
4 X1 B 5
5 X2 B 6
6 X3 B 7
7 X4 B 8
8 X1 C 9
9 X2 C 10
10 X3 C 11
11 X4 C 12
12 X1 D 12
13 X2 D 14
14 X3 D 15
15 X4 D 16
Read the documentation concerning melt and experiment with other settings of the available parameters and their default values.
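The original question also asks to combine the melted values with - between them (e.g. puppo-doggo). A minimal sketch of one way to do that, assuming hypothetical stage columns whose cells hold either the stage name or 'none' (the column names floofer and pupper are made up for illustration):

import pandas as pd

# Hypothetical example data: four stage columns, values are the stage name or 'none'
df = pd.DataFrame({
    'id': [1, 2, 3],
    'doggo': ['doggo', 'none', 'doggo'],
    'floofer': ['none', 'none', 'none'],
    'pupper': ['none', 'pupper', 'none'],
    'puppo': ['puppo', 'none', 'none'],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Melt the four columns into one, drop the 'none' placeholders,
# then join the remaining values per id with '-'
melted = df.melt(id_vars='id', value_vars=stage_cols, value_name='stage')
stages = (melted[melted['stage'] != 'none']
          .groupby('id')['stage']
          .agg('-'.join)
          .reindex(df['id'], fill_value='none'))
print(stages)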

Related

Cartesian product in R

What is the fastest way to find the Cartesian product of two lists in R? For example, I have:
x <- c("a", "b", "c", "d")
y <- c(1, 2, 3)
I need to make the following data.frame from them:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
Assuming x cross y, this would be one way:
# Tidyverse solution
library(tidyr)
x <- letters[1:4]
y <- c(1, 2, 3)
tibble(
  x = x,
  y = list(y)
) %>%
  unnest(y)
# A tibble: 12 x 2
x y
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
# Base R solution
expand.grid(y = y, x = x)
y x
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
7 1 c
8 2 c
9 3 c
10 1 d
11 2 d
12 3 d
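For readers following this thread from the pandas question above, roughly the same cross product can be built in Python as well; this is only a side note under that assumption, not part of the original R answer:

import pandas as pd

x = list("abcd")
y = [1, 2, 3]

# Cartesian product of the two vectors, analogous to expand.grid(y = y, x = x)
result = pd.MultiIndex.from_product([x, y], names=["x", "y"]).to_frame(index=False)
print(result)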

Stack multiple columns into single column while maintaining other columns in Pandas?

Given a pandas DataFrame with multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the columns cl_c, cl_d and cl_e into a single column with the name ax. But please note that the columns cl_a and cl_b should be maintained.
cl_a cl_b ax from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job
df = pd.DataFrame({'cl_a': [1, 2, 3, 4, 5],
                   'cl_b': ['a', 'b', 'c', 'd', 'e'],
                   'cl_c': [5, 4, 3, 2, 1],
                   'cl_d': [6, 7, 8, 9, 10],
                   'cl_e': [20, 21, 22, 23, 24]})
df_new = pd.DataFrame()
for col_name in ['cl_c', 'cl_d', 'cl_e']:
    df_new = df_new.append(df[['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
However, I am curious whether there is a pandas built-in approach that can do the trick.
Edit:
Upon Quong's answer, I realised the need to include another column (i.e., from_col) besides ax. The from_col indicates the origin of ax, that is, its previous column name.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
Or equivalently set_index().stack():
(df.set_index(['cl_a', 'cl_b']).stack()
   .reset_index(level=-1, drop=True)
   .reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24
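To also keep the originating column name (the from_col requested in the question's edit), melt's var_name parameter can name that column directly instead of dropping it; a small self-contained sketch:

import pandas as pd

df = pd.DataFrame({'cl_a': [1, 2, 3, 4, 5],
                   'cl_b': ['a', 'b', 'c', 'd', 'e'],
                   'cl_c': [5, 4, 3, 2, 1],
                   'cl_d': [6, 7, 8, 9, 10],
                   'cl_e': [20, 21, 22, 23, 24]})

# Keep the source column name as 'from_col' instead of dropping the 'variable' column
out = df.melt(id_vars=['cl_a', 'cl_b'], value_name='ax', var_name='from_col')
print(out)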

Split a column by element and create new ones with pandas

Goal: I want to split one single column by its elements (not the string cells) and, from that division, create new columns, where each element becomes the title of a new column and the values from the other column make up the respective column.
Is there a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative, if the input data has 2 columns:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount for a counter, then reshape with DataFrame.set_index and Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If the input data has 2 columns:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18

Select dataframe columns based on column values in Pandas

My dataframe looks like:
A B C D .... Y Z
0 5 12 14 4 2
3 6 15 10 1 30
2 10 20 12 5 15
I want to create another dataframe that only contains the columns with an average value greater than 10:
C D .... Z
12 14 2
15 10 30
20 12 15
Use:
df = df.loc[:, df.mean() > 10]
print (df)
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.mean())
A 1.666667
B 7.000000
C 15.666667
D 12.000000
Y 3.333333
Z 15.666667
dtype: float64
print (df.mean() > 10)
A False
B False
C True
D True
Y False
Z True
dtype: bool
Alternative:
print (df[df.columns[df.mean() > 10]])
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.columns[df.mean() > 10])
Index(['C', 'D', 'Z'], dtype='object')
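For reference, a self-contained version of the above, reconstructing the question's example frame (the columns hidden behind '....' are omitted):

import pandas as pd

df = pd.DataFrame({'A': [0, 3, 2], 'B': [5, 6, 10], 'C': [12, 15, 20],
                   'D': [14, 10, 12], 'Y': [4, 1, 5], 'Z': [2, 30, 15]})

# Keep only the columns whose mean is greater than 10
print(df.loc[:, df.mean() > 10])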

Summing records in a table based on a key in the row

The table is a big table, with millions of records.
Column X represents an item, e.g. table, chair, ...
Column Y has values which I would like to sum up.
Column Z indicates whether the row's value needs to be summed.
Table 1
ID Column X Column Y Column Z
1 X1 5 1
2 x1 4 1
3 x1 Null 0
4 x1 5 1
5 x1 Null 0
6 x2 5 1
7 x2 5 1
8 x2 Null 0
9 x3 2 1
10 x3 Null 0
11 x3 2 1
12 x4 Null 0
13 x4 Null 0
14 x5 Null 0
... ...
the list goes on
Wanted Result
Table 1
ID Column X Column YY Column Z
1 X1 14 1
2 x1 14 1
3 x1 14 0
4 x1 14 1
5 x1 14 0
6 x2 10 1
7 x2 10 1
8 x2 10 0
9 x3 4 1
10 x3 4 0
11 x3 4 1
12 x4 0 0
13 x4 0 0
14 x5 0 0
I would require a select statement to get the intended result.
Try this:
SELECT ID, X, SUM(Y) OVER (PARTITION BY X) AS YY, Z
FROM Table
Try:
SELECT
    T1.ID,
    T1.X,
    (
        SELECT SUM(T2.Y * T2.Z)
        FROM Table AS T2
        WHERE T2.X = T1.X
    ) AS YY,
    T1.Z
FROM Table AS T1
Is this OK for millions of records, compared to @Naveen's? @Naveen's query is very short, though.
WITH cte AS (
    SELECT ColumnX, SUM(ColumnY) AS ColumnY
    FROM table1
    GROUP BY ColumnX
)
SELECT a.id, a.ColumnX, b.ColumnY, a.ColumnZ
FROM table1 a
INNER JOIN cte b ON a.ColumnX = b.ColumnX