Change groupby dimension from 2 to 1 in pandas

I performed a groupby:
df=pd.DataFrame({'grp':['a','a','b','b'],'value':[1,2,1,10]})
df.groupby('grp').agg({'value':['mean','median']})
and got a DataFrame with MultiIndex columns:
    value
     mean median
grp
a     1.5    1.5
b     5.5    5.5
How do I change this to a normal df that I can manipulate and access?

Change your code a bit: select the column for aggregation after the groupby and pass a list of functions:
df1 = df.groupby('grp')['value'].agg(['mean','median'])
print (df1)
     mean  median
grp
a     1.5     1.5
b     5.5     5.5
Another idea is to remove the first level of the MultiIndex, but if there are more aggregated columns you can get duplicated column names:
df1 = df.groupby('grp').agg({'value':['mean','median']})
df1.columns = df1.columns.droplevel(0)
print (df1)
     mean  median
grp
a     1.5     1.5
b     5.5     5.5
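For instance, a small sketch (using a hypothetical second column value2) of how the duplicates arise:
df1 = df.assign(value2=df['value'] * 2).groupby('grp').agg({'value': ['mean', 'median'],
                                                            'value2': ['mean', 'median']})
df1.columns = df1.columns.droplevel(0)
print (df1)
     mean  median  mean  median
grp
a     1.5     1.5   3.0     3.0
b     5.5     5.5  11.0    11.0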
To avoid duplicated column names, it is better to flatten the MultiIndex with map and join:
df1 = df.groupby('grp').agg({'value':['mean','median']})
df1.columns = df1.columns.map('_'.join)
print (df1)
     value_mean  value_median
grp
a           1.5           1.5
b           5.5           5.5
Or, for pandas 0.25+, use named aggregation:
df2 = df.groupby("grp").agg(a=pd.NamedAgg(column='value', aggfunc='mean'),
                            b=pd.NamedAgg(column='value', aggfunc='median'))
print (df2)
       a    b
grp
a    1.5  1.5
b    5.5  5.5
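As a side note, when aggregating a single column you can also pass the new names directly as keyword arguments on the selected column, which is shorthand for NamedAgg (pandas 0.25+):
df2 = df.groupby('grp')['value'].agg(a='mean', b='median')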

You can simply drop a level of the columns and reset the index of the DataFrame like this:
df=pd.DataFrame({'grp':['a','a','b','b'],'value':[1,2,1,10]})
df1 = df.groupby('grp').agg({'value':['mean','median']})
df1.columns = df1.columns.droplevel(0)
df1.reset_index()
Also, if you want combined column names, you can use:
df1.columns = df1.columns.map('_'.join)
instead of:
df1.columns = df1.columns.droplevel(0)
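Putting this answer together, a minimal end-to-end sketch:
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'value': [1, 2, 1, 10]})
df1 = df.groupby('grp').agg({'value': ['mean', 'median']})
df1.columns = df1.columns.map('_'.join)  # flatten the MultiIndex columns
df1 = df1.reset_index()                  # make 'grp' a regular column again
print (df1)
  grp  value_mean  value_median
0   a         1.5           1.5
1   b         5.5           5.5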

Related

Concatenate single row dataframe with multiple row dataframe

I have a dataframe, df1, with a large number of columns but a single row:
Col1  Col2  Price  Qty
A     B     16     5
I have another dataframe, df2, as follows:
Price  Qty
8      2.5
16     5
6      1.5
I want to achieve the following:
Col1  Col2  Price  Qty
A     B     8      2.5
A     B     16     5
A     B     6      1.5
Essentially, I want to repeat the single row of df1 for every row of df2, taking the Price and Qty columns from df2 in place of the ones originally present in df1.
I am not sure how to proceed with above.
I believe the following approach will work:
import numpy as np
import pandas as pd

# first, let's repeat the single row of df1 as many times as there are rows in df2
df1 = pd.DataFrame(np.repeat(df1.values, len(df2.index), axis=0), columns=df1.columns)
# let's reset the indexes of both DataFrames just to be safe
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# now, let's merge the two DataFrames on the index,
# after dropping the Price and Qty columns from df1
df3 = pd.merge(df1.drop(['Price', 'Qty'], axis=1), df2, left_index=True, right_index=True)
# finally, let's drop the leftover index columns
df3.drop(['index_x', 'index_y'], inplace=True, axis=1)
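Alternatively, on pandas 1.2+ a cross join achieves the same repetition without the manual np.repeat; a sketch using the same column names:
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A'], 'Col2': ['B'], 'Price': [16], 'Qty': [5]})
df2 = pd.DataFrame({'Price': [8, 16, 6], 'Qty': [2.5, 5, 1.5]})

# cross-join the non-overlapping columns of df1 with df2
df3 = df1.drop(['Price', 'Qty'], axis=1).merge(df2, how='cross')
print (df3)
  Col1 Col2  Price  Qty
0    A    B      8  2.5
1    A    B     16  5.0
2    A    B      6  1.5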

How do I offset a dataframe with values in another dataframe?

I have two dataframes. One holds the base values (df) and the other holds an offset (df2).
How do I create a third dataframe that is the first dataframe offset by matching values (the ID) in the second dataframe?
This post doesn't seem to do the offset... Update only some values in a dataframe using another dataframe
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
         ID   Price
0      1092   10.02
1  18723754   15.76     # no offset to affect it
2     28635  147.87
>>> print (df2)
      ID   Offset
0   1092   100.00
1  28635  1000.00
2  88273    10.00    # < no match
This is what I want to produce: the Price has been offset where the IDs match:
         ID    Price
0      1092   110.02
1  18723754    15.76
2     28635  1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don't want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I've come is df_add = df.reindex_like(df2) + df2, but the problem is that it sums all columns, even the ID column.
Try this:
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)
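For reference, a minimal sketch of an equivalent approach that maps the offsets onto the ID column and treats missing matches as a zero offset (assuming the ID columns share a dtype and df2 has unique IDs):
# build a Series mapping ID -> Offset, look it up per row, default missing IDs to 0
offset = df['ID'].map(df2.set_index('ID')['Offset']).fillna(0)
df['Price'] = df['Price'] + offset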

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [100, 300, 500],
    'C': list('abc')
})
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data, then add the column back):
df_new = df.drop('C', axis=1)
df_c = df[['C']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_c
However, I am sure there is an easier way of doing this in pandas (given the column names that do not need normalising, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to select the numeric columns, normalise them by subtracting the minimum and dividing by the range, and assign back only the normalised columns:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print (df)
     A    B  C
0  0.0  0.0  a
1  0.5  0.5  b
2  1.0  1.0  c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
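For instance, a sketch of a hypothetical z-score normalisation applied via apply to the numeric columns only:
import numpy as np

num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].apply(lambda s: (s - s.mean()) / s.std())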

filter df based on index condition

I have a df with lots of rows:
13790226       0.320  0.001976    
9895d5dis 182.600  0.040450     
105066007     18.890  0.006432     
109067019     52.500  0.034011     
111845014     16.400  0.023974     
11668574e      7.180  0.070714     
113307021      4.110  0.017514     
113679I37      8.180  0.010837     
I would like to filter this df to obtain the rows where the last character of the index is not a digit.
Desired df:
9895d5dis 182.600 0.040450
11668574e 7.180 0.070714
How can I do it?
df['is_digit'] = [i[-1].isdigit() for i in df.index.values]
df[~df['is_digit']]
But I like regex better:
df[df.index.str.contains('[A-Za-z]$')]
Is the column on which you are filtering the index or a regular column? If it's a column:
df1 = df[df[0].str.contains('[A-Za-z]')]
Returns
           0       1         2
1  9895d5dis  182.60  0.040450
5  11668574e    7.18  0.070714
7  113679I37    8.18  0.010837    # looks like read_clipboard read the 1 in 113679137 as I
If it's the index, first do:
df = df.reset_index()
Here's a concise way without creating a new temp column:
df
                b         c
a
9895d5dis  182.60  0.040450
105066007   18.89  0.006432
109067019   52.50  0.034011
111845014   16.40  0.023974
11668574e    7.18  0.070714
113307021    4.11  0.017514
113679I37    8.18  0.010837
df[~df.index.str[-1].str.isnumeric()]
                b         c
a
9895d5dis  182.60  0.040450
11668574e    7.18  0.070714
Throwing this into the mix:
df.loc[[x for x in df.index if x[-1].isalpha()]]

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that I then want to add data to one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In [1]: df = pd.DataFrame(columns=['customer_ID', 'a', 'b', 'c'], index=['customer_ID'])
In [2]: df
Out[2]:
            customer_ID    a    b    c
customer_ID         NaN  NaN  NaN  NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In [1]: df = pd.DataFrame(columns=['customer_ID', 'a', 'b', 'c'])
In [2]: df.set_index('customer_ID', inplace=True, drop=False)
In [3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In [4]: id = 'x123'
In [5]: df.loc[id] = [id, 4, 5, 6]
In [6]: df
Out[6]:
     customer_ID    a    b    c
x123        x123  4.0  5.0  6.0
Yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That's because your dataframe doesn't have any rows when you first create it.
df = pd.DataFrame({'customer_ID': ['2'], 'a': ['1'], 'b': ['A'], 'c': ['1']})
df = df.set_index('customer_ID', drop=False)
df
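As a usage note, growing a DataFrame one row at a time with .loc is slow for large data; a common pattern (sketched here with made-up rows) is to collect the rows in a list and build the indexed frame once:
import pandas as pd

rows = [{'customer_ID': 'x123', 'a': 4, 'b': 5, 'c': 6},
        {'customer_ID': 'y456', 'a': 7, 'b': 8, 'c': 9}]
df = pd.DataFrame(rows).set_index('customer_ID')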