How to order a dataframe using a list in pandas

I have a pandas dataframe as follows.
import pandas as pd
data = [['Alex',10, 175],['Bob',12, 178],['Clarke',13, 179]]
df = pd.DataFrame(data,columns=['Name','Age', 'Height'])
print(df)
I also have a list as follows.
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']
I want to order the rows of my dataframe in the order of mynames list. In other words, my output should be as follows.
Name Age Height
0 Bob 12 178
1 Alex 10 175
2 Clarke 13 179
I was trying to do this as follows. I am wondering if there is an easier way to do this in pandas than converting the dataframe to a list.
I am happy to provide more details if needed.

You can use pd.Categorical + argsort (argsort returns integer positions; since the index here is the default RangeIndex, .loc happens to work, but .iloc is safer in general):
df = df.loc[pd.Categorical(df.Name, mynames).argsort()]
Name Age Height
1 Bob 12 178
0 Alex 10 175
2 Clarke 13 179
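An alternative sketch (assuming pandas >= 1.1, which added the key argument of sort_values) maps each name to its position in mynames and sorts on that rank; names in mynames that are absent from the dataframe are simply never looked up:

```python
import pandas as pd

data = [['Alex', 10, 175], ['Bob', 12, 178], ['Clarke', 13, 179]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'])
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']

# rank each name by its position in mynames, then sort on that rank
order = {name: i for i, name in enumerate(mynames)}
out = df.sort_values('Name', key=lambda s: s.map(order)).reset_index(drop=True)
print(out['Name'].tolist())  # ['Bob', 'Alex', 'Clarke']
```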

Related

Kronecker product over the rows of a pandas dataframe

So I have these two dataframes and I would like to get a new dataframe which consists of the Kronecker product of the rows of the two dataframes. What is the correct way to do this?
As an example:
DataFrame1
c1 c2
0 10 100
1 11 110
2 12 120
and
DataFrame2
a1 a2
0 5 7
1 1 10
2 2 4
Then I would like to have the following matrix:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
I hope my question is clear.
PS. I saw this question was posted here before: kronecker product pandas dataframes. However, the answer given there is not correct (possibly not for the original question either, but definitely not for mine): it gives the Kronecker product of both dataframes as a whole, whereas I only want it over the rows.
Create a MultiIndex with MultiIndex.from_product, expand both DataFrames to that MultiIndex with DataFrame.reindex, multiply them, and finally flatten the MultiIndex:
c = pd.MultiIndex.from_product([df1, df2])
df = df1.reindex(c, axis=1, level=0).mul(df2.reindex(c, axis=1, level=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
print(df)
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
Use numpy for efficiency:
import numpy as np
pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
             columns=pd.MultiIndex.from_product([df1, df2]).map(''.join))
Output:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
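As a quick sanity check, the same row-wise Kronecker product can be written with plain NumPy broadcasting instead of einsum; this is a sketch over the example data above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df2 = pd.DataFrame({'a1': [5, 1, 2], 'a2': [7, 10, 4]})

# row-wise outer product via broadcasting: (n, k, 1) * (n, 1, l) -> (n, k, l)
a = df1.to_numpy()[:, :, None] * df2.to_numpy()[:, None, :]
res = pd.DataFrame(a.reshape(len(df1), -1),
                   columns=[f'{c}{d}' for c in df1 for d in df2])
print(res)
```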

Improper broadcasting(?) on dataframe multiply() operation on two multiindex slices

I'm trying to multiply() two multilevel slices of a dataframe, however I'm unable to coerce the multiply operation to broadcast properly, so I just end up with lots of NaNs. It's like somehow I'm not specifying the indexing properly.
I've tried all variations of both axis and level, but it either throws an exception or gives me a 6x6 grid of NaNs.
import numpy as np
import pandas as pd
np.random.seed(0)
idx = pd.IndexSlice
df_a = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['weight'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.randint(70, high=120, size=(6, 3), dtype=int))
df_a.index.name = "m"
df_b = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['coef'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3))
df_b.index.name = "m"
df_c = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['extraneous'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3))
df = df_a.join([df_b, df_c])
# What I'm wanting:
# new column = coef*weight
#measure NewCol
#person alice bob sue
#m
#0 30.2 48.1 88.9
#...
#5 18.3 32.2 103
#all of these variations generate a 6x6 grid of NaNs
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="rows")
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="columns")
Here is an approach using pandas.concat:
df = pd.concat([df,
                pd.concat({'NewCol': df['coef'].mul(df['weight'])},
                          axis=1)],
               axis=1)
output:
measure weight coef extraneous NewCol
person alice bob sue alice bob sue alice bob sue alice bob sue
m
0 107 98 89 0.906243 0.761173 0.754762 0.889252 0.140435 0.708203 96.968045 74.594927 67.173827
1 106 77 117 0.193279 0.138338 0.699014 0.826331 0.087769 0.242337 20.487623 10.652021 81.784634
2 104 77 101 0.340416 0.131111 0.394653 0.465670 0.825667 0.624923 35.403258 10.095575 39.859948
3 80 92 116 0.329999 0.144878 0.794014 0.539082 0.968411 0.588952 26.399889 13.328731 92.105674
4 75 76 100 0.024841 0.083313 0.113684 0.160948 0.003354 0.246954 1.863067 6.331802 11.368357
5 115 99 71 0.662492 0.755795 0.123242 0.144265 0.993883 0.513367 76.186541 74.823720 8.750217
If you want to assign the changes back to the DataFrame, you can go via to_numpy():
df.loc[:,idx['weight',:]]=df.loc[:,idx['weight',:]].to_numpy()*df.loc[:,idx['coef',:]].to_numpy()
#you can also use values attribute
Or, if you want to create a new MultiIndexed column, use concat() + join():
df=df.join(pd.concat([df['coef'].mul(df['weight'])],keys=['NewCol'],axis=1))
#OR
#df=df.join(pd.concat({'NewCol': df['coef'].mul(df['weight'])},axis=1))
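One more way to see the underlying problem: the sliced frames keep their ('weight', person) and ('coef', person) column labels, which never align, hence the NaN grid. Selecting a single outer level with df['weight'] drops that level so the person labels line up. A self-contained sketch (the data is regenerated here with the same seed but only two measures, so the values may differ from the printouts above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
people = ['alice', 'bob', 'sue']
cols = lambda measure: pd.MultiIndex.from_product(
    [[measure], people], names=['measure', 'person'])

df = pd.DataFrame(np.random.randint(70, 120, size=(6, 3)),
                  columns=cols('weight')).join(
     pd.DataFrame(np.random.rand(6, 3), columns=cols('coef')))

# df.loc[:, idx['weight', :]] keeps the ('weight', person) labels, which
# never match ('coef', person), so every product is NaN. Selecting with
# df['weight'] drops the outer level, letting the 'person' labels align:
new = df['weight'] * df['coef']
print(new)
```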

Transform a dataframe with comma thousand separator to space separator Pandas

I have a formatting issue in Pandas. A column in my DataFrame contains numbers with a comma thousands separator, like 200,000, and I would like to transform this into 200 000.
The easy way is to use the replace function, but I also want to convert the type to integer, and that fails because of the spaces in between.
In the end, I just want to do a ranking with descending sorted values, like this:
Id  Villas  Price_nospace
 3  Peace        35000000
 3  Peace        35000000
 2  Rosa         27000000
 1  Beach        25000000
 0  Palm         22000000
As you can see, the price is not easy to read without a separator, so I would like to make it more readable. But with a space separator I can't convert to int, and if I don't convert to integer, I can't use the sort_values function. So I am stuck.
Thank you for your help.
I modified the sample input a bit so the descending sort actually shows in the output.
The solution below sorts the dataframe by Price_nospace in descending order and replaces the comma with a space. Note that Price_nospace will be of object (string) type in the output.
Sample Input
Id Villas Price_nospace
0 3 Peace 220,000
1 3 Peace 350,000
2 2 Rosa 270,000
3 1 Beach 250,000
4 0 Palm 230,000
Code
df['Price_new'] = df['Price_nospace'].str.replace(',','',regex=True).astype(int)
df = df.sort_values(by='Price_new', ascending=False)
df['Price_nospace'] = df['Price_nospace'].str.replace(',',' ',regex=True)
df = df.drop(columns='Price_new').reset_index(drop=True) ## reset_index, if required
df
Output
Id Villas Price_nospace
0 3 Peace 350 000
1 2 Rosa 270 000
2 1 Beach 250 000
3 0 Palm 230 000
4 3 Peace 220 000
Explanation
A temporary column Price_new is introduced holding the Price_nospace values as int, and the dataframe is sorted on it.
Once df is sorted, the comma is replaced with a space in Price_nospace and the temporary column Price_new is dropped.
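If the temporary column is to be avoided entirely, sort_values can sort on a derived key directly (assuming pandas >= 1.1, which added the key argument); a sketch on the same sample input:

```python
import pandas as pd

df = pd.DataFrame({'Id': [3, 3, 2, 1, 0],
                   'Villas': ['Peace', 'Peace', 'Rosa', 'Beach', 'Palm'],
                   'Price_nospace': ['220,000', '350,000', '270,000',
                                     '250,000', '230,000']})

# sort on the numeric value without keeping a helper column
df = df.sort_values('Price_nospace',
                    key=lambda s: s.str.replace(',', '', regex=False).astype(int),
                    ascending=False).reset_index(drop=True)
# then swap the comma for a space for display
df['Price_nospace'] = df['Price_nospace'].str.replace(',', ' ', regex=False)
print(df)
```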
Another option is to alter how the data is displayed but not affect the underlying type.
Use pd.options.display.float_format after converting the str prices to float prices:
import pandas as pd
def my_float_format(x):
    '''
    Number formatting with custom thousands separator
    '''
    return f'{x:,.0f}'.replace(',', ' ')
# set display float_format
pd.options.display.float_format = my_float_format
df = pd.DataFrame({
    'Id': [3, 3, 2, 1, 0],
    'Villas': ['Peace', 'Peace', 'Rosa', 'Beach', 'Palm'],
    'Price_nospace': ['35,000,000', '35,000,000', '27,000,000',
                      '25,000,000', '22,000,000']
})
# Convert str prices to float
df['Price_nospace'] = (
    df['Price_nospace'].str.replace(',', '', regex=True).astype(float)
)
Output:
print(df)
Id Villas Price_nospace
0 3 Peace 35 000 000
1 3 Peace 35 000 000
2 2 Rosa 27 000 000
3 1 Beach 25 000 000
4 0 Palm 22 000 000
print(df.dtypes)
Id int64
Villas object
Price_nospace float64
dtype: object
Since the type is float64 any numeric operations will function as normal.
The same my_float_format function can be used on export as well:
df.to_csv(float_format=my_float_format)
,Id,Villas,Price_nospace
0,3,Peace,35 000 000
1,3,Peace,35 000 000
2,2,Rosa,27 000 000
3,1,Beach,25 000 000
4,0,Palm,22 000 000

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name and, where the name matches, add the count and time values from the second dataframe to the first.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index on Name, add the frames together, and update the result back into df1:
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume the time columns in both df1 and df2 are already in a timedelta format. If they are strings, convert them before running the commands above:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
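A variant that avoids update and keeps only the names already present in df1: align df2 on df1's names with reindex, fill the gaps with zeros, and add. This is a sketch over small stand-ins for the 85- and 105-row frames:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Alice', 'Sergei'],
                    'count': [123, 99, 78],
                    'time': pd.to_timedelta(['4:12:10', '1:01:12', '0:18:01'])})
df2 = pd.DataFrame({'Name': ['Rick', 'Jone', 'Bob'],
                    'count': [9, 7, 10],
                    'time': pd.to_timedelta(['0:13:00', '0:24:21', '0:15:13'])})

# align df2 on df1's names; names with no match contribute nothing
extra = df2.set_index('Name').reindex(df1['Name'])
extra['count'] = extra['count'].fillna(0)
extra['time'] = extra['time'].fillna(pd.Timedelta(0))

out = (df1.set_index('Name') + extra).reset_index()
print(out)
```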

Issue looping through dataframes in Pandas

I have a dict 'd' containing dataframes, e.g.:
d["DataFrame1"]
Will return that dataframe with all its columns:
ID Name
0 123 John
1 548 Eric
2 184 Sam
3 175 Andy
Each dataframe has a column in it called 'Names'. I want to extract this column from each dataframe in the dict and to create a new dataframe consisting of these columns.
df_All_Names = pd.DataFrame()
for df in d:
    df_All_Names[df] = df['Names']
Returns the error:
TypeError: string indices must be integers
Unsure where I'm going wrong here.
For example, say you have dataframes as follows:
df=pd.DataFrame({'Name':['X', 'Y']})
df1=pd.DataFrame({'Name':['X1', 'Y1']})
And we create a dict
d=dict()
d['df']=df
d['df1']=df1
Then preset an empty dataframe:
yourdf=pd.DataFrame()
Using .items() in a for loop:
for key, val in d.items():
    yourdf[key] = val['Name']
yields:
yourdf
Out[98]:
df df1
0 X X1
1 Y Y1
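The original loop can also be fixed with a one-line change: iterating a dict yields its keys (strings), which is what caused the TypeError, so index back into the dict to reach each dataframe. A sketch with the same toy dict:

```python
import pandas as pd

d = {'df': pd.DataFrame({'Name': ['X', 'Y']}),
     'df1': pd.DataFrame({'Name': ['X1', 'Y1']})}

df_All_Names = pd.DataFrame()
# iterating a dict yields its *keys* (strings), hence the original
# TypeError; index back into d to get each DataFrame
for key in d:
    df_All_Names[key] = d[key]['Name']
print(df_All_Names)
```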
You can use reduce to concatenate all of the columns named 'Name' in your dictionary of dataframes.
Sample Data
from functools import reduce
d = {'df1':pd.DataFrame({'ID':[0,1,2],'Name':['John','Sam','Andy']}),'df2':pd.DataFrame({'ID':[3,4,5],'Name':['Jen','Cara','Jess']})}
You can stack the data side by side using axis=1:
reduce(lambda x, y: pd.concat([x.Name, y.Name], axis=1), d.values())
Name Name
0 John Jen
1 Sam Cara
2 Andy Jess
Or stack them on top of one another using axis=0:
reduce(lambda x, y: pd.concat([x.Name, y.Name], axis=0), d.values())
0 John
1 Sam
2 Andy
0 Jen
1 Cara
2 Jess
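With more than two frames, a dict comprehension fed to pd.concat is arguably simpler than reduce, and it labels each column by its dict key; a sketch using the same sample data:

```python
import pandas as pd

d = {'df1': pd.DataFrame({'ID': [0, 1, 2], 'Name': ['John', 'Sam', 'Andy']}),
     'df2': pd.DataFrame({'ID': [3, 4, 5], 'Name': ['Jen', 'Cara', 'Jess']})}

# concatenate every 'Name' column side by side, keyed by dict key
out = pd.concat({k: v['Name'] for k, v in d.items()}, axis=1)
print(out)
```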