What is the fastest way to find cartesian product of two lists in R? For example, I have:
x <- c(a,b,c,d) y <- c(1, 2, 3)
I need to make from them the following data.frame:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
Assuming x cross y, this would be one way:
# Tideyverse solution
library(tidyr)
x <- letters[1:4]
y <- c(1, 2, 3)
tibble(
x = x,
y = list(y)
) %>%
unnest(y)
# A tibble: 12 x 2
x y
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
# Base R solution
expand.grid(y = y, x = x)
y x
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
7 1 c
8 2 c
9 3 c
10 1 d
11 2 d
12 3 d
Given pandas multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the column cl_c cl_d cl_e into a single column with the name ax. But, please note that, the columns cl_a cl_b were maintained.
cl_a cl_b ax from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job
df = pd.DataFrame ( {'cl_a': [1,2,3,4,5], 'cl_b': ['a','b','c','d','e'],
'cl_c': [5,4,3,2,1],'cl_d': [6,7,8,9,10],
'cl_e': [20,21,22,23,24]})
df_new = pd.DataFrame()
for col_name in ['cl_c','cl_d','cl_e']:
df_new=df_new.append (df [['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
However, I am curious whether there is Pandas build-in approach that can do the trick
Edit:
Upon Quong answer, I realise of the need to include another column (i.e., from_col) beside the ax. The from_col indicate the origin of ax previous column name.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
Or equivalently set_index().stack():
(df.set_index(['cl_a','cl_b']).stack()
.reset_index(level=-1, drop=True)
.reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24
Goal: I want to split one single column by elements (not the strings cells) and, from that division, create new columns, where the element is the title of the new column and the other values from another columns compose the respective column.
There is a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative if 2 column input data:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount for counter, then reshape by DataFrame.set_index with Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If 2 columns input data:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
My dataframe looks like:
A B C D .... Y Z
0 5 12 14 4 2
3 6 15 10 1 30
2 10 20 12 5 15
I want to create another dataframe that only contains the columns with an average value greater than 10:
C D .... Z
12 14 2
15 10 30
20 12 15
Use:
df = df.loc[:, df.mean() > 10]
print (df)
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.mean())
A 1.666667
B 7.000000
C 15.666667
D 12.000000
Y 3.333333
Z 15.666667
dtype: float64
print (df.mean() > 10)
A False
B False
C True
D True
Y False
Z True
dtype: bool
Alternative:
print (df[df.columns[df.mean() > 10]])
C D Z
0 12 14 2
1 15 10 30
2 20 12 15
Detail:
print (df.columns[df.mean() > 10])
Index(['C', 'D', 'Z'], dtype='object')
The table is a big table,millions of records.
Column X represents an item etc Table, chair... ...
Column Y has values which i would like to sum up.
Column Z would place the value if the row needs to be summed.
Table 1
ID Column X Column Y Column Z
1 X1 5 1
2 x1 4 1
3 x1 Null 0
4 x1 5 1
5 x1 Null 0
6 x2 5 1
7 x2 5 1
8 x2 Null 0
9 x3 2 1
10 x3 Null 0
11 x3 2 1
12 x4 Null 0
13 x4 Null 0
14 x5 Null 0
... ...
the list goes on
Wanted Result
Table 1
ID Column X Column YY Column Z
1 X1 14 1
2 x1 14 1
3 x1 14 0
4 x1 14 1
5 x1 14 0
6 x2 10 1
7 x2 10 1
8 x2 10 0
9 x3 4 1
10 x3 4 0
11 x3 4 1
12 x4 0 0
13 x4 0 0
14 x5 0 0
I would require a select statement to get the intended result.
Try This
Select ID,X,Sum(Y)Over(Partition By X) AS YY,Z
From Table
Try:
SELECT
T1.ID,
T1.X,
(
SELECT
SUM(T2.Y*T2.Z)
FROM
Table AS T2
WHERE
T2.ID = T1.ID
) AS YY,
T1.Z
FROM Table AS T1
is it ok for million of record than #Naveen.though #Naveen query is very short
with cte
(select ColumnX,sum(ColumnY) ColumnY from table1 group by ColumnX )
select a.id, a.ColumnX,b.ColumnY,a.ColumnZ from table1 a inner join cte b on a.ColumnX=b.ColumnX