Transforming a pandas DataFrame: unpivot

I have the following DataFrame:

commune  nuance_1  votes_1  nuance_2  votes_2  nuance_3  votes_3
A        X         12       Y         20       Z         5
B        X         10       Y         5
C        Z         7        X         2
and I would like to obtain after the transformation:

commune  nuance  votes
A        X       12
A        Y       20
A        Z       5
B        X       10
B        Y       5
C        Z       7
C        X       2
Is there a way to do this (a sort of unpivot)?

You can use pd.wide_to_long here:

out = (pd.wide_to_long(df, ['nuance', 'votes'], i='commune', j='j', sep='_')
         .dropna(how='all')
         .sort_index(level=0)
         .droplevel(1)
         .reset_index())
print(out)
  commune nuance  votes
0       A      X   12.0
1       A      Y   20.0
2       A      Z    5.0
3       B      X   10.0
4       B      Y    5.0
5       C      Z    7.0
6       C      X    2.0
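For reference, a self-contained version of the above; the DataFrame construction is an assumption based on the sample shown, with NaN standing in for the cells that are empty in the question:

import numpy as np
import pandas as pd

# rebuild the sample frame from the question (empty cells become NaN)
df = pd.DataFrame({
    'commune':  ['A', 'B', 'C'],
    'nuance_1': ['X', 'X', 'Z'],
    'votes_1':  [12, 10, 7],
    'nuance_2': ['Y', 'Y', 'X'],
    'votes_2':  [20, 5, 2],
    'nuance_3': ['Z', np.nan, np.nan],
    'votes_3':  [5, np.nan, np.nan],
})

# wide_to_long gathers each (nuance_i, votes_i) pair into one row per suffix i,
# and dropna removes the pairs that were empty in the wide frame
out = (pd.wide_to_long(df, ['nuance', 'votes'], i='commune', j='j', sep='_')
         .dropna(how='all')
         .sort_index(level=0)
         .droplevel(1)
         .reset_index())
print(out)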

Related

Cartesian product in R

What is the fastest way to find the Cartesian product of two lists in R? For example, I have:

x <- c("a", "b", "c", "d")
y <- c(1, 2, 3)
I need to make from them the following data.frame:

   x y
1  a 1
2  a 2
3  a 3
4  b 1
5  b 2
6  b 3
7  c 1
8  c 2
9  c 3
10 d 1
11 d 2
12 d 3
Assuming x cross y, this would be one way:
# Tidyverse solution
library(tidyr)

x <- letters[1:4]
y <- c(1, 2, 3)

tibble(
  x = x,
  y = list(y)
) %>%
  unnest(y)
# A tibble: 12 x 2
   x         y
   <chr> <dbl>
 1 a         1
 2 a         2
 3 a         3
 4 b         1
 5 b         2
 6 b         3
 7 c         1
 8 c         2
 9 c         3
10 d         1
11 d         2
12 d         3
# Base R solution. expand.grid() varies its first argument fastest, which is
# why y is listed first; append [, c("x", "y")] if the columns must come out as x, y.
expand.grid(y = y, x = x)

   y x
1  1 a
2  2 a
3  3 a
4  1 b
5  2 b
6  3 b
7  1 c
8  2 c
9  3 c
10 1 d
11 2 d
12 3 d

Pandas Merge(): Appending data from merged columns and replacing null values

I'd like to merge two tables, replacing the null values in one table with the non-null values from another.
The code below is an example of the tables to be merged:
import numpy as np
import pandas as pd

# Table 1 (has rows with missing values)
a = ['x', 'x', 'x', 'y', 'y', 'y']
b = ['z', 'z', 'z', 'w', 'w', 'w']
c = [1, 1, 1, np.nan, np.nan, np.nan]
table_1 = pd.DataFrame({'a': a, 'b': b, 'c': c})
table_1

   a  b    c
0  x  z  1.0
1  x  z  1.0
2  x  z  1.0
3  y  w  NaN
4  y  w  NaN
5  y  w  NaN
# Table 2 (to be appended to table_1; its 'c' values should replace
# the missing 'c' values in table_1)
a = ['y', 'y', 'y']
b = ['w', 'w', 'w']
c = [2, 2, 2]
table_2 = pd.DataFrame({'a': a, 'b': b, 'c': c})
table_2

   a  b  c
0  y  w  2
1  y  w  2
2  y  w  2
This is the code I use for merging the two tables, and the output I get:

merged_table = pd.merge(table_1, table_2, on=['a', 'b'], how='left')
merged_table
Current output (I don't understand why the number of rows increased):

    a  b  c_x  c_y
0   x  z  1.0  NaN
1   x  z  1.0  NaN
2   x  z  1.0  NaN
3   y  w  NaN  2.0
4   y  w  NaN  2.0
5   y  w  NaN  2.0
6   y  w  NaN  2.0
7   y  w  NaN  2.0
8   y  w  NaN  2.0
9   y  w  NaN  2.0
10  y  w  NaN  2.0
11  y  w  NaN  2.0
Desired output (the null values in table_1's 'c' column replaced by the numeric values from table_2):

   a  b    c
0  x  z  1.0
1  x  z  1.0
2  x  z  1.0
3  y  w  2.0
4  y  w  2.0
5  y  w  2.0
The merge multiplies rows because the join keys are not unique: each of the three 'y w' rows in table_1 matches all three rows of table_2, giving 3 × 3 = 9 rows for that key. Since table_2 simply supplies replacement rows here, concatenate and drop the NaN rows instead:

out = table_1.append(table_2).dropna(subset=['c']).reset_index(drop=True)
# OR (DataFrame.append was removed in pandas 2.0, so prefer concat)
out = pd.concat([table_1, table_2]).dropna(subset=['c']).reset_index(drop=True)

Output of out:

   a  b    c
0  x  z  1.0
1  x  z  1.0
2  x  z  1.0
3  y  w  2.0
4  y  w  2.0
5  y  w  2.0
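If you'd rather keep the merge and only fill the missing values (preserving table_1's row count even when the other columns differ between the tables), here is one sketch, assuming each (a, b) key maps to a single 'c' value in table_2:

# deduplicate the lookup so each key matches exactly one row (avoids the row blow-up)
lookup = table_2.drop_duplicates(subset=['a', 'b'])
merged = table_1.merge(lookup, on=['a', 'b'], how='left', suffixes=('', '_fill'))
# take table_2's value only where table_1's 'c' is missing
merged['c'] = merged['c'].fillna(merged['c_fill'])
out = merged.drop(columns='c_fill')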

Pandas Merge(): Appending data from merged columns and replacing null values (extension of the question above, https://stackoverflow.com/questions/68471939)

I'd like to merge two tables, replacing the null values in one column of one table with the non-null values from the same-labelled column of the other.
The code below is an example of the tables to be merged:
# Table 1 (has rows with missing values)
a = ['x', 'x', 'x', 'y', 'y', 'y']
b = ['z', 'z', 'z', 'w', 'w', 'w']
c = [1 for x in a]
d = [2 for x in a]
e = [3 for x in a]
f = [4 for x in a]
g = [1, 1, 1, np.nan, np.nan, np.nan]
table_1 = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f, 'g': g})
table_1

   a  b  c  d  e  f    g
0  x  z  1  2  3  4  1.0
1  x  z  1  2  3  4  1.0
2  x  z  1  2  3  4  1.0
3  y  w  1  2  3  4  NaN
4  y  w  1  2  3  4  NaN
5  y  w  1  2  3  4  NaN
# Table 2 (to be merged into table_1; its 'g' values should replace the missing
# 'g' values in table_1, while the values in the other rows are kept)
a = ['y', 'y', 'y']
b = ['w', 'w', 'w']
g = [2, 2, 2]
table_2 = pd.DataFrame({'a': a, 'b': b, 'g': g})
table_2

   a  b  g
0  y  w  2
1  y  w  2
2  y  w  2
This is the code I use for merging the two tables, and the output I get:

merged_table = pd.merge(table_1, table_2, on=['a', 'b'], how='left')
merged_table
Current output:

    a  b  c  d  e  f  g_x  g_y
0   x  z  1  2  3  4  1.0  NaN
1   x  z  1  2  3  4  1.0  NaN
2   x  z  1  2  3  4  1.0  NaN
3   y  w  1  2  3  4  NaN  2.0
4   y  w  1  2  3  4  NaN  2.0
5   y  w  1  2  3  4  NaN  2.0
6   y  w  1  2  3  4  NaN  2.0
7   y  w  1  2  3  4  NaN  2.0
8   y  w  1  2  3  4  NaN  2.0
9   y  w  1  2  3  4  NaN  2.0
10  y  w  1  2  3  4  NaN  2.0
11  y  w  1  2  3  4  NaN  2.0
Desired output:

   a  b  c  d  e  f    g
0  x  z  1  2  3  4  1.0
1  x  z  1  2  3  4  1.0
2  x  z  1  2  3  4  1.0
3  y  w  1  2  3  4  2.0
4  y  w  1  2  3  4  2.0
5  y  w  1  2  3  4  2.0
There are two details to handle:
1. Dtypes: cast the 'g' column to float in both tables with DataFrame.astype({'column_name': 'type'}) so the integer values from table_2 slot cleanly into table_1's float column.
2. Indexes: assigning by index is safe here because the affected table_1 rows all carry the same other data ('y w 1 2 3 4'). Select the rows where 'g' is NaN, ind = table_1[pd.isnull(table_1['g'])], then build a Series of table_2's 'g' values carrying those row labels: pd.Series(table_2['g'].to_list(), index=ind.index).

Try this solution:

table_1 = table_1.astype({'a': 'str', 'b': 'str', 'g': 'float'})
table_2 = table_2.astype({'a': 'str', 'b': 'str', 'g': 'float'})

ind = table_1[pd.isnull(table_1['g'])]    # rows where 'g' is missing
table_1.loc[ind.index, 'g'] = pd.Series(table_2['g'].to_list(), index=ind.index)

The result matches the desired output above.
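An alternative sketch that avoids the manual index bookkeeping: align both tables on the key columns and let fillna do the matching. This again assumes each (a, b) pair maps to a single 'g' value in table_2:

# index both sides by the join keys, then fill 'g' by key alignment
keyed = table_1.set_index(['a', 'b'])
fill = table_2.drop_duplicates(['a', 'b']).set_index(['a', 'b'])['g']
keyed['g'] = keyed['g'].fillna(fill)   # only the NaN rows pick up table_2's 'g'
out = keyed.reset_index()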

How can I append a Series to an existing DataFrame row with pandas?

I want to take a Series and append it to an existing DataFrame row. For example:

df
   A  B  C
0  2  3  4
1  5  6  7
2  7  8  9

series
0    x
1    y
2    z

-->

   A  B  C  D    E    F
0  2  3  4  x    y    z
1  5  6  7  ...
2  7  8  9  ...
I want to do this using a for loop, appending a different series to each row of the dataframe. The series may have different lengths. Is there an easy way to accomplish this?
Use .loc with the Series's index as the column names:
import pandas as pd

lst = [
    [2, 3, 4],
    [5, 6, 7],
    [7, 8, 9],
]
df = pd.DataFrame(lst, columns=list("ABC"))
print(df)
###
   A  B  C
0  2  3  4
1  5  6  7
2  7  8  9

s1 = pd.Series(list("xyz"), index=list("DEF"))
print(s1)
###
D    x
E    y
F    z
dtype: object

s2 = pd.Series(list("abcd"), index=list("GHIJ"))
print(s2)
###
G    a
H    b
I    c
J    d
dtype: object

# each Series lands in the row given by its position; missing columns are created as NaN
for idx, s in enumerate([s1, s2]):
    df.loc[idx, s.index] = s.values
print(df)
###
   A  B  C    D    E    F    G    H    I    J
0  2  3  4    x    y    z  NaN  NaN  NaN  NaN
1  5  6  7  NaN  NaN  NaN    a    b    c    d
2  7  8  9  NaN  NaN  NaN  NaN  NaN  NaN  NaN
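An equivalent sketch without the loop, starting from the original three-column df: stack the row Series into their own DataFrame and join it on, so the ragged lengths turn into NaN automatically (s1 and s2 as defined above):

extra = pd.DataFrame([s1, s2])   # one row per Series; columns D..J, index 0 and 1
out = df.join(extra)             # aligns on the row index; row 2 gets all NaN
print(out)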
Try this (s being the three-element Series from the question; s.tolist() unpacks into three scalars, and each scalar is broadcast down its whole column):

df['D'], df['E'], df['F'] = s.tolist()

And now:

print(df)

Gives:

   A  B  C  D  E  F
0  2  3  4  x  y  z
1  5  6  7  x  y  z
2  7  8  9  x  y  z
Edit: if you are not sure how many extra values there are, generate the new column names from the Series length (series being the Series to append):

from string import ascii_uppercase as letters

new_cols = [letters[i + len(df.columns)] for i, v in enumerate(series)]
df = df.assign(**dict(zip(new_cols, series.tolist())))
print(df)

Output:

   A  B  C  D  E  F
0  2  3  4  x  y  z
1  5  6  7  x  y  z
2  7  8  9  x  y  z

Pandas duplicates when grouped

x = df.groupby(["Customer ID", "Category"]).sum().sort_values(by="Value", ascending=False)

I want to group by Customer ID, but when I use the code above it duplicates customers: the same customer shows up in several places of the sorted result...
Here is the result:

Source DF:

   Customer ID Category  Value
0            A        x      5
1            B        y      5
2            B        z      6
3            C        x      7
4            A        z      2
5            B        x      5
6            A        x      1
new: https://ufile.io/dpruz
I think you are looking for something like this: compute each customer's total Value, then reindex level 0 of the grouped frame by that total so each customer's categories stay together:

df_out = df.groupby(['Customer ID', 'Category']).sum()
totals = df_out.groupby(level=0).sum()   # DataFrame.sum(level=0) is deprecated; group on the level instead
df_out = df_out.reindex(totals.sort_values('Value', ascending=False).index, level=0)
Output:

                      Value
Customer ID Category
B           x             5
            y             5
            z             6
A           x             6
            z             2
C           x             7
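The same ordering can also be produced with an explicit sort key instead of reindex; a sketch, assuming df is the source frame from the question:

g = df.groupby(['Customer ID', 'Category'], as_index=False)['Value'].sum()
# per-customer total, used only as a sort key and then dropped
g['total'] = g.groupby('Customer ID')['Value'].transform('sum')
out = (g.sort_values(['total', 'Customer ID'], ascending=[False, True], kind='stable')
         .drop(columns='total')
         .set_index(['Customer ID', 'Category']))
print(out)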