Keep all columns after sum and groupby, including empty values - pandas

I have the following dataframe:
source name cost other_c other_b
a a 7 dd 33
b a 6 gg 44
c c 3 ee 55
b a 2
d b 21 qw 21
e a 16 aq
c c 10 55
I am summing cost per source and name with:
new_df = df.groupby(['source', 'name'], as_index=False)['cost'].sum()
but it drops the remaining 6 columns in my dataframe. Is there a way to keep the rest of the columns? I'm not looking to add a new column, just to carry over the columns from the original dataframe.
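One possible approach, sketched under assumptions: aggregate cost with a sum and keep the first non-null value of every other column in each group. The sample frame below is a reconstruction, and the placement of the empty cells is a guess; whether "first" is the right rule for the carried-over columns depends on the data.

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample frame; where the blanks fall is an assumption
df = pd.DataFrame({
    "source": ["a", "b", "c", "b", "d", "e", "c"],
    "name": ["a", "a", "c", "a", "b", "a", "c"],
    "cost": [7, 6, 3, 2, 21, 16, 10],
    "other_c": ["dd", "gg", "ee", np.nan, "qw", "aq", np.nan],
    "other_b": [33, 44, 55, np.nan, 21, np.nan, 55],
})

# Sum cost; for every other column keep the first non-null value in the group
agg_map = {col: "first" for col in df.columns.difference(["source", "name", "cost"])}
agg_map["cost"] = "sum"

new_df = df.groupby(["source", "name"], as_index=False).agg(agg_map)
new_df = new_df[df.columns]  # restore the original column order
print(new_df)
```

Groups in which a carried-over column is entirely empty keep NaN there, so no rows or columns are lost.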

Pandas create new column with specific row values from dict

I have a dataframe:
ID val
1 a
2 b
3 c
4 d
5 a
7 d
6 v
8 j
9 k
10 a
I have a dictionary as follows:
{'aa': 3, 'bb': 3, 'cc': 4}
In the dictionary, each numerical value indicates a number of records. The values sum to the number of rows in the data frame: in this example 3 + 3 + 4 = 10, and the data frame has 10 rows.
I am trying to split the data frame into blocks of rows whose sizes are given by the dictionary values, and to fill a new column with the corresponding key. The desired output is as follows:
ID val new_col
1 a aa
2 b aa
3 c aa
4 d bb
5 a bb
6 v bb
7 d cc
8 j cc
9 k cc
10 a cc
The order of the fill is not important as long as the count of records matches the count given in the dict. I am trying to solve this by iterating through the dict, but I am not able to isolate the right number of rows of the data frame for each new key-value pair.
I have also tried using pd.cut by splitting the dict values to bins and keys as column values. However I am getting the error ValueError: bins must increase monotonically.
d = {'aa':3, 'bb': 3,'cc':4}
df['new_col'] = pd.Series([np.repeat(i, j) for i, j in d.items()]).explode().to_numpy()
df
Out[64]:
ID val new_col
0 1 a aa
1 2 b aa
2 3 c aa
3 4 d bb
4 5 a bb
5 7 d bb
6 6 v cc
7 8 j cc
8 9 k cc
9 10 a cc
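A slightly shorter variant of the same idea, as a sketch: np.repeat accepts a sequence of repeat counts, so the keys can be repeated directly without building a Series of arrays and exploding it. The frame below is rebuilt from the question's data, and the assert is only a sanity check that the counts match the row count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 7, 6, 8, 9, 10],
    "val": list("abcdadvjka"),
})
d = {"aa": 3, "bb": 3, "cc": 4}

# The counts must add up to the number of rows, otherwise the assignment fails
assert sum(d.values()) == len(df)

# Repeat each key as many times as its count, in dict order
df["new_col"] = np.repeat(list(d.keys()), list(d.values()))
print(df)
```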

Pandas groupby nlargest slice

There are similarly named questions, but they do not cover my use case. I have a dataframe with groups and values, and I want to select values sliced by their rank within each group (the example below should make this clearer).
This is my data:
group value
a 20
a 16
a 14
a 13
a 12
b 19
b 17
b 16
b 14
b 13
b 12
b 12
b 11
I want to group by group and slice [a:b] with nlargest logic. In other words, if a = 2 and b = 7, I want the 3rd, 4th, 5th, 6th and 7th largest values in each group. I could not find a question here on this use case, nor anything in the pandas-dev GitHub.
If a group has fewer than b elements, then b = len(of that group) should be applied. If two or more elements have the same value, they should all be selected if they fall within the [a:b] slice.
My desired result looks like this:
group value
a 14
a 13
a 12
b 16
b 14
b 13
b 12
b 12
Here, group a has 5 elements, which is fewer than b in the example, so the 3rd to 5th largest elements are returned. In group b the 6th and 7th largest values are the same, so both are returned.
The closest question to mine is this question about slice, but it does not use nlargest logic; it just slices the groups.
If you could guide me on this, I would appreciate it!
You could try the following:
import pandas as pd
gbg = df.groupby('group')
a=2
b=7
res = gbg['value'].agg(lambda x: x.sort_values(ascending=False).to_list()[a:b]).to_frame().explode('value').reset_index()
# .agg will "aggregate" the groups: here it sorts each group in descending order and takes the slice
# .to_frame will convert results from pd.Series to pd.DataFrame
# .explode() will write the list values in rows again
# .reset_index() will restore the column 'group'
The intermediate result after .agg():
group
a [14, 13, 12]
b [16, 14, 13, 12, 12]
Name: value, dtype: object
And the full result:
group value
0 a 14
1 a 13
2 a 12
3 b 16
4 b 14
5 b 13
6 b 12
7 b 12
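Sorting the dataframe first and then slicing by position within each group gives me the result I expected. pandas GroupBy objects do not appear to expose a public slice method, so the position within each group is taken from cumcount here:
ordered = df.sort_values(["group", "value"], ascending=[True, False])
ordered[ordered.groupby("group").cumcount().between(2, 6)]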
Output is
group value
a 14
a 13
a 12
b 16
b 14
b 13
b 12
b 12
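The sort-then-slice idea can be written out as a self-contained sketch. It assumes the positional reading of the slice (rows a to b-1 of each group after sorting in descending order), which is what the desired output in the question shows; the names ordered and position are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("aaaaa") + list("bbbbbbbb"),
    "value": [20, 16, 14, 13, 12, 19, 17, 16, 14, 13, 12, 12, 11],
})
a, b = 2, 7

# Groups ascending, values descending, so position 0 is the largest value per group
ordered = df.sort_values(["group", "value"], ascending=[True, False])

# 0-based position of each row inside its group after sorting
position = ordered.groupby("group").cumcount()

# Keep positions a..b-1; groups shorter than b simply contribute fewer rows
result = ordered[(position >= a) & (position < b)].reset_index(drop=True)
print(result)
```

Because the filter is a plain boolean mask, a group with fewer than b rows needs no special handling.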

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like a new dataframe produced with the result of multiplying each row of the second dataframe against each row of the first dataframe, where it multiplies the value of Name1 against the value in the column of dataframe1 which matches the Name4 value of dataframe2.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First I changed the data in df1 at this point so this new example will turn out better.
Okay, so from those two dataframes, the data frame I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding factors from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    repeated_rows = pd.concat([row]*len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
6 12 2 data L
7 10 2 data L
8 8 2 data L
9 6 2 data L
10 4 2 data L
11 2 2 data L
12 7 2 data D
13 14 2 data D
14 21 2 data D
15 28 2 data D
16 35 2 data D
17 42 2 data D
18 8 2 data W
19 8 2 data W
20 8 2 data W
21 8 2 data W
22 8 2 data W
23 8 2 data W
...
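A vectorized alternative, offered only as a sketch: repeat each row of df2 once per row of df1, then multiply col1 by the df1 column named in col4. It uses the fake data from the question; the names out and factors are illustrative.

```python
import pandas as pd

d1 = {'D': [1, 2, 3, 4, 5, 6], 'W': [2, 2, 2, 2, 2, 2],
      'L': [6, 5, 4, 3, 2, 1], 'S': [1, 2, 3, 4, 5, 6]}
d2 = {'col1': [3, 2, 7, 4, 5, 6], 'col2': [2, 2, 2, 2, 3, 4],
      'col3': ['data'] * 6, 'col4': ['D', 'L', 'D', 'W', 'S', 'S']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)

# Repeat every row of df2 len(df1) times, keeping the rows in df2 order
out = df2.loc[df2.index.repeat(len(df1))].reset_index(drop=True)

# Select the df1 column named by each df2 row (duplicate labels are allowed),
# then flatten so that each df2 row's factors come out consecutively
factors = df1[df2['col4'].tolist()].to_numpy().T.ravel()

out['col1'] = out['col1'].to_numpy() * factors
print(out.head(12))
```

This avoids iterrows and the per-row concat, which may matter for larger frames.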

Percentage calculation from pivot table pandas

I have a set of data that I imported from an Excel xlsx file. I want to find the percentage of the total profit that comes from each customer segment. I managed to use pivot_table to summarize the total profit of each customer segment, but I would also like to know the percentage. How do I do that?
Pivot_table
profit = df.pivot_table(index = ['Customer Segment'], values = ['Profit'], aggfunc=sum)
Result So far
Customer Segment Profit
A a
B b
C c
D d
Maybe adding the percentage column to the pivot table would be an ideal way. But how can I do that?
How about
df['percent'] = df['Profit']/sum(df['Profit'])
For example you have this data frame:
Customer Segment Customer Profit
0 A AAA 12
1 B BBB 43
2 C CCC 45
3 D DDD 23
4 D EEE 67
5 C FFF 21
6 B GGG 45
7 A JJJ 67
8 A KKK 32
9 B LLL 13
10 C MMM 43
11 D NNN 13
From the above data frame you want to make a pivot table.
import pandas as pd
# aggfunc='sum' avoids the FutureWarning that recent pandas versions raise for np.sum
tableframe = pd.pivot_table(df, values='Profit', index=['Customer Segment'], aggfunc='sum')
Here is your pivot table:
Profit
Customer Segment
A 111
B 101
C 109
D 103
Now you want to add another column to tableframe then compute the percentage.
tableframe['percentage'] = ((tableframe.Profit / tableframe.Profit.sum()) * 100)
Here is your final tableframe:
Profit percentage
Customer Segment
A 111 26.179245
B 101 23.820755
C 109 25.707547
D 103 24.292453

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the column L, pandas will automatically align on that index with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
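Both variants asked about in the question (a new dataframe klmn11, and an in-place update) can also be sketched with Series.map, which looks up each row's L value in m0 and keeps the original row order and index. The frames below are rebuilt from the printed values, so the numbers are rounded to six decimals.

```python
import pandas as pd

klmn1 = pd.DataFrame(
    {"K": [1, 1, 2, 2],
     "L": ["a", "b", "a", "b"],
     "M": [0.469924, 1.231064, -0.979462, 0.322454],
     "N": [44, 68, 73, 97]},
    index=[8, 9, 10, 11],
)
m0 = pd.Series({"a": -0.307671, "b": 0.451144}, name="M")

# Per-row offset: the group mean that matches each row's L value
offset = klmn1["L"].map(m0)

# Variant 1: new dataframe, klmn1 is left unchanged
klmn11 = klmn1.assign(M=klmn1["M"] - offset)

# Variant 2: overwrite the M column of klmn1 itself
klmn1["M"] = klmn1["M"] - offset
print(klmn11)
```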
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
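The difference between the two options shows up only where a label exists in one frame but not the other. A small sketch with made-up frames (df1 and df2 here are illustrative, not the klmn data):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1.0, 2.0]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10.0]}, index=["a"])  # no row "b"

# fill_value=0: a value missing on one side is treated as 0 before subtracting
filled = df1.subtract(df2, fill_value=0)

# fill_value=None (the default): a value missing on either side yields NaN
strict = df1.subtract(df2, fill_value=None)

print(filled)
print(strict)
```

With fill_value=0, row "b" keeps its original value; with the default it becomes NaN.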