Pandas duplicates when grouped - pandas

x = df.groupby(["Customer ID", "Category"]).sum().sort_values(by="Value", ascending=False)
I want to group by Customer ID, but when I use the code above, customers are duplicated in the result.
Here is the result:
Source DF:
Customer ID Category Value
0 A x 5
1 B y 5
2 B z 6
3 C x 7
4 A z 2
5 B x 5
6 A x 1
new: https://ufile.io/dpruz

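I think you are looking for something like this: group first, then reorder the customers by their total value so that each customer's rows stay together.
df_out = df.groupby(['Customer ID','Category']).sum()
df_out.reindex(df_out.groupby(level=0).sum().sort_values('Value', ascending=False).index, level=0)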
Output:
Value
Customer ID Category
B x 5
y 5
z 6
A x 6
z 2
C x 7
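For reference, the approach above can be reproduced end to end with the sample data from the question. This is a minimal sketch, assuming a recent pandas version; the intermediate name `totals` is introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer ID": ["A", "B", "B", "C", "A", "B", "A"],
    "Category": ["x", "y", "z", "x", "z", "x", "x"],
    "Value": [5, 5, 6, 7, 2, 5, 1],
})

# One row per (customer, category) pair.
df_out = df.groupby(["Customer ID", "Category"]).sum()

# Total per customer, largest first (hypothetical helper name).
totals = df_out.groupby(level=0)["Value"].sum().sort_values(ascending=False)

# Reorder only the outer index level; each customer's rows stay together.
result = df_out.reindex(totals.index, level=0)
print(result)
```

Sorting the grouped frame directly by `Value` would scatter each customer's rows, which is why the reordering is done on the outer index level instead.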

Related

dataframe sorting by sum of values

I have the following df:
df = pd.DataFrame({'from':['A','A','A','B','B','C','C','C'],'to':['J','C','F','C','M','Q','C','J'],'amount':[1,1,2,12,13,5,5,1]})
df
and I wish to sort it in such a way that the 'from' value with the highest total amount comes first. In this example, 'from' B has 12+13 = 25, so B is first in the list. Then comes C with 11 and then A with 4.
One way to do it is like this:
df['temp'] = df.groupby(['from'])['amount'].transform('sum')
df.sort_values(by=['temp'], ascending=False)
but I'm just adding another column. I wonder if there's a better way?
I think your method is good and explicit.
A variant without the temporary column could be:
df.sort_values(by='from', ascending=False,
               key=lambda x: df['amount'].groupby(x).transform('sum'))
output:
from to amount
3 B C 12
4 B M 13
5 C Q 5
6 C C 5
7 C J 1
0 A J 1
1 A C 1
2 A F 2
In your case, you can do it with argsort:
out = df.iloc[(-df.groupby(['from'])['amount'].transform('sum')).argsort()]
Out[53]:
from to amount
3 B C 12
4 B M 13
5 C Q 5
6 C C 5
7 C J 1
0 A J 1
1 A C 1
2 A F 2
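Both answers rely on rows within a group keeping their original relative order. A minimal sketch that makes this explicit by requesting a stable sort from NumPy is shown below; the names `group_total` and `order` are introduced here for illustration and are not part of the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "from": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "to": ["J", "C", "F", "C", "M", "Q", "C", "J"],
    "amount": [1, 1, 2, 12, 13, 5, 5, 1],
})

# Per-row total of the group the row belongs to (A=4, B=25, C=11).
group_total = df.groupby("from")["amount"].transform("sum")

# Negate so the largest total comes first; a stable sort preserves the
# original row order among rows that share the same total.
order = np.argsort(-group_total.to_numpy(), kind="stable")

out = df.iloc[order]
print(out)
```

Note that two different groups with identical totals would be interleaved in their original row order with this approach, so it fits best when group totals are distinct or the groups are already contiguous.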

Cartesian product in R

What is the fastest way to find cartesian product of two lists in R? For example, I have:
x <- c("a", "b", "c", "d")
y <- c(1, 2, 3)
I need to make from them the following data.frame:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
Assuming x cross y, this would be one way:
# Tidyverse solution
library(tidyr)
x <- letters[1:4]
y <- c(1, 2, 3)
tibble(
  x = x,
  y = list(y)
) %>%
  unnest(y)
# A tibble: 12 x 2
x y
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
# Base R solution
expand.grid(y = y, x = x)
y x
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
7 1 c
8 2 c
9 3 c
10 1 d
11 2 d
12 3 d

Merge two matrices (dataframes) into one with interleaved columns

I have two dataframes like these:
df1 a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2 x y z
0 T T F
1 F T T
2 F T F
I want to merge them by interleaving their columns, one from each in turn, like this:
df a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What's your idea? How can we merge, append, or concatenate them?
I used this code, which works dynamically:
df = pd.DataFrame()
for i in range(0, 6):
    if i % 2 == 0:
        # even positions take the next column of df1
        df.loc[:, i] = df1.iloc[:, i // 2]
    else:
        # odd positions take the next column of df2
        df.loc[:, i] = df2.iloc[:, (i - 1) // 2]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
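The column list in the answer above is written out by hand. When the two frames have the same number of columns, the interleaved order can also be built programmatically. This is a minimal sketch under that assumption; the name `interleaved` is introduced here for illustration.

```python
from itertools import chain

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5]})
df2 = pd.DataFrame({"x": ["T", "F", "F"], "y": ["T", "T", "T"], "z": ["F", "T", "F"]})

# Pair up columns position by position, then flatten: a, x, b, y, c, z.
interleaved = list(chain.from_iterable(zip(df1.columns, df2.columns)))

# Concatenate side by side, then select the columns in interleaved order.
df = pd.concat([df1, df2], axis=1)[interleaved]
print(df)
```

Because `zip` stops at the shorter input, any extra columns in the wider frame would be dropped from the selection, so frames of unequal width need separate handling.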

Transforming a pandas dataframe: unpivot

I have the following dataframe:
commune nuance_1 votes_1 nuance_2 votes_2 nuance_3 votes_3
A X 12 Y 20 Z 5
B X 10 Y 5
C Z 7 X 2
and I would like to obtain this after the transformation:
commune nuance votes
A X 12
A Y 20
A Z 5
B X 10
B Y 5
C Z 7
C X 2
Is there a way to do this (a sort of unpivot)?
You can use pd.wide_to_long here:
out = (pd.wide_to_long(df,['nuance','votes'],'commune','j',sep='_')
       .dropna(how='all').sort_index().droplevel(1).reset_index())
print(out)
commune nuance votes
0 A X 12.0
1 A Y 20.0
2 A Z 5.0
3 B X 10.0
4 B Y 5.0
5 C Z 7.0
6 C X 2.0
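To try the answer above without the original data file, the frame from the question can be rebuilt by hand. The sketch below assumes the blank cells in the question are missing values (`NaN`) and uses keyword arguments so that each step is easier to follow.

```python
import numpy as np
import pandas as pd

# Frame from the question; blank cells are assumed to be NaN.
df = pd.DataFrame({
    "commune": ["A", "B", "C"],
    "nuance_1": ["X", "X", "Z"],
    "votes_1": [12, 10, 7],
    "nuance_2": ["Y", "Y", "X"],
    "votes_2": [20, 5, 2],
    "nuance_3": ["Z", np.nan, np.nan],
    "votes_3": [5, np.nan, np.nan],
})

out = (
    pd.wide_to_long(df, stubnames=["nuance", "votes"], i="commune", j="j", sep="_")
    .dropna(how="all")   # drop the (commune, j) pairs that had no data
    .sort_index()        # order by commune, then by the numeric suffix j
    .droplevel("j")      # the suffix is no longer needed
    .reset_index()
)
print(out)
```

The `votes` column comes out as floating point because the wide columns contained missing values before the empty rows were dropped.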

SQL max value Y for each distinct X

Is there a way to query a table so as to get the max value for EACH x value? Say there are two columns in a table, x and y. Is there a way to get the MAX(Y) for EACH X? So if x repeats:
X Y
1 6
1 7
1 8
1 8
1 8
1 9
2 5
2 5
2 5
2 4
2 5
3 3
3 4
3 6
4 2
4 4
4 5
5 2
5 1
5 5
the query would get the highest y value for x=1, the highest y value for x=2, and so on?
Just group by the column that should be distinct. All aggregate functions, such as max(), are then applied to each group:
select x, max(y) as max_y
from your_table
group by x
Try this:
select X, MAX(Y)
from my_table
group by X
order by X;
This gets the MAX Y for each X value.