Improper broadcasting(?) on DataFrame multiply() operation on two MultiIndex slices - pandas

I'm trying to multiply() two multilevel slices of a DataFrame, but I can't get the multiply operation to broadcast properly, so I just end up with lots of NaNs. It seems I'm not specifying the indexing properly.
I've tried all variations of both axis and level, but it either throws an exception or gives me a 6x6 grid of NaN.
import numpy as np
import pandas as pd
np.random.seed(0)
idx = pd.IndexSlice
df_a = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['weight'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.randint(70, high=120, size=(6, 3), dtype=int)
                    )
df_a.index.name = "m"
df_b = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['coef'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3)
                    )
df_b.index.name = "m"
df_c = pd.DataFrame(index=range(6),
                    columns=pd.MultiIndex.from_product([['extraneous'], ['alice', 'bob', 'sue']],
                                                       names=['measure', 'person']),
                    data=np.random.rand(6, 3)
                    )
df = df_a.join([df_b, df_c])
# What I'm wanting:
# new column = coef*weight
# measure  NewCol
# person    alice   bob    sue
# m
# 0          30.2  48.1   88.9
# ...
# 5          18.3  32.2  103
# all of these variations generate a 6x6 grid of NaNs
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="rows")
df.loc[:, idx['weight', :]].multiply(df.loc[:, idx['coef', :]], axis="columns")

Here is an approach using pandas.concat:
df = pd.concat([df,
                pd.concat({'NewCol': df['coef'].mul(df['weight'])},
                          axis=1)],
               axis=1)
Output:
measure weight                  coef                    extraneous                         NewCol
person   alice   bob   sue     alice       bob       sue     alice       bob       sue      alice        bob        sue
m
0          107    98    89  0.906243  0.761173  0.754762  0.889252  0.140435  0.708203  96.968045  74.594927  67.173827
1          106    77   117  0.193279  0.138338  0.699014  0.826331  0.087769  0.242337  20.487623  10.652021  81.784634
2          104    77   101  0.340416  0.131111  0.394653  0.465670  0.825667  0.624923  35.403258  10.095575  39.859948
3           80    92   116  0.329999  0.144878  0.794014  0.539082  0.968411  0.588952  26.399889  13.328731  92.105674
4           75    76   100  0.024841  0.083313  0.113684  0.160948  0.003354  0.246954   1.863067   6.331802  11.368357
5          115    99    71  0.662492  0.755795  0.123242  0.144265  0.993883  0.513367  76.186541  74.823720   8.750217

You can go through to_numpy() if you want to assign the result back to the DataFrame:
df.loc[:, idx['weight', :]] = df.loc[:, idx['weight', :]].to_numpy() * df.loc[:, idx['coef', :]].to_numpy()
# you can also use the values attribute
Or, if you want to create a new MultiIndexed column, use concat() + join():
df = df.join(pd.concat([df['coef'].mul(df['weight'])], keys=['NewCol'], axis=1))
# or
# df = df.join(pd.concat({'NewCol': df['coef'].mul(df['weight'])}, axis=1))
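For context on why the attempts in the question return NaN, here is a minimal sketch on a small made-up frame (demo and its values are illustrative assumptions, not data from the question). Slicing with idx['weight', :] keeps the outer column level, so the two slices share no column labels and pandas aligns them into an all-NaN result. Selecting demo['weight'] drops that level, so the person labels line up.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
people = ['alice', 'bob', 'sue']
cols = pd.MultiIndex.from_product([['coef', 'weight'], people],
                                  names=['measure', 'person'])
# Hypothetical data: row 0 holds 1..6, row 1 holds 7..12
demo = pd.DataFrame(np.arange(1, 13, dtype=float).reshape(2, 6), columns=cols)

# Both slices keep the ('measure', 'person') labels, so nothing matches:
w = demo.loc[:, idx['weight', :]]
c = demo.loc[:, idx['coef', :]]
misaligned = w.multiply(c)  # union of the 6 column labels, every cell NaN

# Selecting by the outer key drops that level; columns are now just 'person'
aligned = demo['weight'].mul(demo['coef'])

print(misaligned)
print(aligned)
```

The to_numpy() route works for the same reason: it discards the labels entirely, so the multiplication is purely positional.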

Related

index compatibility of dataframe with multiindex result from apply on group

We have to apply an algorithm to columns in a dataframe, the data has to be grouped by a key and the result shall form a new column in the dataframe. Since it is a common use-case we wonder if we have chosen a correct approach or not.
Following code reflects our approach to the problem in a simplified manner.
import numpy as np
import pandas as pd
np.random.seed(42)
N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)
df = pd.DataFrame(data)
This generates a DataFrame as follows.
key x
0 0 0.969585
1 1 0.775133
2 1 0.939499
3 1 0.894827
4 1 0.597900
.. ... ...
95 53 0.036887
96 54 0.609564
97 55 0.502679
98 56 0.051479
99 56 0.278646
Application of exemplary methods on the DataFrame groups.
def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)
def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index  # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const)  # perform some pandas incompatible magic
    return pd.Series(y, index=index)  # reconstruct index
g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))
Assigning the result as a new column with df['y'] = y_per_g throws a TypeError.
TypeError: incompatible index of inserted column with frame index
Thus a compatible multiindex needs to be introduced first.
df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)
Which yields the intended result.
key x y
index
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
... ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
Now we wonder if there is a more straightforward way of dealing with the index, and whether we have chosen a favorable approach in general.
Use Series.droplevel to remove the first level of the MultiIndex so that the result has the same index as df; then the assignment works:
g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print (df)
key x y
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
.. ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
[100 rows x 3 columns]
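As a possible alternative to droplevel, and assuming the computation only needs the x column, SeriesGroupBy.transform returns a result that is already aligned with the original index. The sketch below uses a small made-up frame, and center is a hypothetical, deterministic stand-in for magic (it subtracts the group mean), so none of it comes from the question.

```python
import pandas as pd

# Hypothetical data, much smaller than the frame in the question
df = pd.DataFrame({'key': [0, 1, 1, 2, 2, 2],
                   'x': [1.0, 2.0, 4.0, 3.0, 6.0, 9.0]})

def center(values):
    # Stand-in for the "pandas incompatible" step: work on a plain array,
    # then rebuild a Series on the group's own index
    arr = values.to_numpy()
    return pd.Series(arr - arr.mean(), index=values.index)

# transform keeps the shape and index of df['x'], so no index juggling is needed
df['y'] = df.groupby('key')['x'].transform(center)
print(df)
```

This only fits when the function returns one value per input row; for anything that changes the number of rows, apply plus droplevel remains the more general route.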

Kronecker product over the rows of a pandas dataframe

So I have these two dataframes and I would like to get a new dataframe which consists of the Kronecker product of the rows of the two dataframes. What is the correct way to do this?
As an example:
DataFrame1
c1 c2
0 10 100
1 11 110
2 12 120
and
DataFrame2
a1 a2
0 5 7
1 1 10
2 2 4
Then I would like to have the following matrix:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
I hope my question is clear.
PS. I saw a similar question posted here: kronecker product pandas dataframes. However, the answer given there is not correct (I believe for neither the original question nor mine, but definitely not for mine). It gives the Kronecker product of both dataframes as a whole, but I only want it over the rows.
Create a MultiIndex with MultiIndex.from_product, reindex the columns of both DataFrames to that MultiIndex with DataFrame.reindex, multiply the DataFrames, and finally flatten the MultiIndex:
c = pd.MultiIndex.from_product([df1, df2])
df = df1.reindex(c, axis=1, level=0).mul(df2.reindex(c, axis=1, level=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
print (df)
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
Use numpy for efficiency:
import numpy as np
pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
             columns=pd.MultiIndex.from_product([df1, df2]).map(''.join)
             )
Output:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
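For readers who want to check the result independently, here is a minimal sketch of the same row-wise product using plain NumPy broadcasting instead of einsum or reindex. It rebuilds the two frames from the question; the variable names a, b, prod and names are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df2 = pd.DataFrame({'a1': [5, 1, 2], 'a2': [7, 10, 4]})

a = df1.to_numpy()  # shape (n, k)
b = df2.to_numpy()  # shape (n, l)

# For each row n, form the outer product of a[n] and b[n], then flatten it
prod = (a[:, :, None] * b[:, None, :]).reshape(len(df1), -1)

# Column names follow the same order as the flattened outer product
names = [f'{c}{d}' for c in df1.columns for d in df2.columns]
result = pd.DataFrame(prod, index=df1.index, columns=names)
print(result)
```

Both frames are assumed to have the same number of rows in the same order, since the product is taken position by position.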

Python keep rows if a specific column contains a particular value or string

I am very green in python. I have not found a specific answer to my problem searching for online resources. With that said it would be great if you could give some hints.
I have an example of df as below:
import pandas as pd
df = pd.DataFrame({'names':['Alex','Joseph','Kate'],'exam1': [90, 68, 70], 'exam2': [100, 98, 88]})
print(df)
names exam1 exam2
0 Alex 90 100
1 Joseph 68 98
2 Kate 70 88
I would like to make a for loop to iterate over the rows and, if the names column is equal to Joseph or Kate, keep the row, to get a new df as below:
names exam1 exam2
0 Joseph 68 98
1 Kate 70 88
I know there is a way like below but I would like to do it via for loop.
list=['Joseph','Kate']
new_df=df[df['names'].isin(list)]
Thank you in Advance.
Not sure why you'd want to use loops, but this is how you'd do it:
rows = []
for index, row in df.iterrows():
    if row['names'] == 'Kate' or row['names'] == 'Joseph':
        rows.append(row)
new_df = pd.DataFrame(rows)
print(new_df)
names exam1 exam2
1 Joseph 68 98
2 Kate 70 88
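Note that the loop keeps the original row labels (1 and 2), while the desired output in the question starts at 0. A minimal sketch of the non-loop version with a fresh index, assuming the same example frame (wanted is just an illustrative name that avoids shadowing the built-in list):

```python
import pandas as pd

df = pd.DataFrame({'names': ['Alex', 'Joseph', 'Kate'],
                   'exam1': [90, 68, 70],
                   'exam2': [100, 98, 88]})

wanted = ['Joseph', 'Kate']
# Boolean mask via isin, then renumber the rows from 0
new_df = df[df['names'].isin(wanted)].reset_index(drop=True)
print(new_df)
```

The same reset_index(drop=True) call can be appended to the loop-based result if the loop is required.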

How to order dataframe using a list in pandas

I have a pandas dataframe as follows.
import pandas as pd
data = [['Alex',10, 175],['Bob',12, 178],['Clarke',13, 179]]
df = pd.DataFrame(data,columns=['Name','Age', 'Height'])
print(df)
I also have a list as follows.
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']
I want to order the rows of my dataframe in the order of mynames list. In other words, my output should be as follows.
Name Age Height
0 Bob 12 178
1 Alex 10 175
2 Clarke 13 179
I was trying to do this as follows. I am wondering if there is an easier way to do this in pandas than converting the dataframe to a list.
I am happy to provide more details if needed.
You can use pd.Categorical + argsort:
df=df.loc[pd.Categorical(df.Name,mynames).argsort()]
Name Age Height
1 Bob 12 178
0 Alex 10 175
2 Clarke 13 179
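To make the one-liner less opaque, here is a minimal sketch of the steps it performs, using the frame from the question. The Categorical assigns each name the position it has in mynames, and argsort turns those positions into a row order. Using iloc and reset_index here is a choice made for this sketch (it does not rely on the default integer index and yields the 0, 1, 2 labels shown in the question).

```python
import pandas as pd

data = [['Alex', 10, 175], ['Bob', 12, 178], ['Clarke', 13, 179]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'])
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']

# Each name is encoded as its position in mynames
cat = pd.Categorical(df['Name'], categories=mynames, ordered=True)
print(cat.codes)

# argsort gives the row positions that sort those codes
order = cat.argsort()
ordered = df.iloc[order].reset_index(drop=True)
print(ordered)
```

Names in mynames that do not occur in the frame (Emj, Jenne) are simply unused categories and do not produce extra rows.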

How to reset pandas data reader index? [duplicate]

This seems rather obvious, but I can't figure out how to convert the index of a data frame to a column.
For example:
df=
gi ptt_loc
0 384444683 593
1 384444684 594
2 384444686 596
To,
df=
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
either:
df['index1'] = df.index
or, .reset_index:
df = df.reset_index(level=0)
so, if you have a multi-index frame with 3 levels of index, like:
>>> df
                       val
tick       tag obs
2016-02-26 C   2    0.0139
2016-02-27 A   2    0.5577
2016-02-28 C   6    0.0303
and you want to convert the 1st (tick) and 3rd (obs) levels in the index into columns, you would do:
>>> df.reset_index(level=['tick', 'obs'])
           tick  obs     val
tag
C    2016-02-26    2  0.0139
A    2016-02-27    2  0.5577
C    2016-02-28    6  0.0303
rename_axis + reset_index
You can first rename your index to a desired label, then promote it to a column:
df = df.rename_axis('index1').reset_index()
print(df)
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
This works also for MultiIndex dataframes:
print(df)
#                        val
# tick       tag obs
# 2016-02-26 C   2    0.0139
# 2016-02-27 A   2    0.5577
# 2016-02-28 C   6    0.0303
df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()
print(df)
index1 index2 index3 val
0 2016-02-26 C 2 0.0139
1 2016-02-27 A 2 0.5577
2 2016-02-28 C 6 0.0303
To provide a bit more clarity, let's look at a DataFrame with two levels in its index (a MultiIndex).
index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'],
['North', 'South']],
names=['State', 'Direction'])
df = pd.DataFrame(index=index,
data=np.random.randint(0, 10, (6,4)),
columns=list('abcd'))
The reset_index method, called with the default parameters, converts all index levels to columns and uses a simple RangeIndex as new index.
df.reset_index()
Use the level parameter to control which index levels are converted into columns. If possible, use the level name, which is more explicit. If there are no level names, you can refer to each level by its integer location, which begins at 0 from the outside. You can use a scalar value here or a list of all the levels you would like to reset.
df.reset_index(level='State') # same as df.reset_index(level=0)
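A minimal sketch of the difference between the two calls, assuming the State/Direction index from above but with deterministic data (np.arange instead of random integers) so the result is reproducible. The names all_reset and state_only are only illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'], ['North', 'South']],
                                   names=['State', 'Direction'])
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=list('abcd'))

# Default: every level becomes a column and the index is a fresh RangeIndex
all_reset = df.reset_index()
print(all_reset.columns.tolist())

# Only 'State' becomes a column; 'Direction' stays as the index
state_only = df.reset_index(level='State')
print(state_only.columns.tolist())
print(state_only.index.name)
```

Neither call modifies df itself; each returns a new DataFrame.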
In the rare event that you want to keep the index and also copy it into a column, you can do the following:
# for a single level
df.assign(State=df.index.get_level_values('State'))
# for all levels
df.assign(**df.index.to_frame())
For a MultiIndex you can extract one of its levels using
df['si_name'] = df.index.get_level_values('si_name')
where si_name is the name of the level.
If you want to use the reset_index method and also preserve your existing index you should use:
df.reset_index().set_index('index', drop=False)
or to change it in place:
df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)
For example:
print(df)
gi ptt_loc
0 384444683 593
4 384444684 594
9 384444686 596
print(df.reset_index())
index gi ptt_loc
0 0 384444683 593
1 4 384444684 594
2 9 384444686 596
print(df.reset_index().set_index('index', drop=False))
       index         gi  ptt_loc
index
0          0  384444683      593
4          4  384444684      594
9          9  384444686      596
And if you want to get rid of the index label you can do:
df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
index gi ptt_loc
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596
This should do the trick (if the index is not a MultiIndex):
df.reset_index().rename({'index':'index1'}, axis = 'columns')
You can also pass inplace=True to rename if you do not want to assign the result to a new variable, but only when calling it on a named DataFrame; on a chained temporary like the one above, the change would be lost.
df1 = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
p = df1.index.values
df1.insert( 0, column="new",value = p)
df1
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
Since pandas 1.5.0, reset_index accepts a names argument that specifies the names to give the index columns. Here is a reproducible example with one index column:
import pandas as pd
df = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
gi ptt
0 232 342
1 66 56
2 34 662
3 43 123
df.reset_index(names=['new'])
Output:
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
This can also easily be applied with MultiIndex. Just create a list of the names you want.
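A minimal sketch of the MultiIndex case, assuming pandas 1.5.0 or later and reusing the tick/tag/obs frame shown earlier in this thread. The new names index1, index2 and index3 are arbitrary, and the list needs one name per index level.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('2016-02-26', 'C', 2), ('2016-02-27', 'A', 2), ('2016-02-28', 'C', 6)],
    names=['tick', 'tag', 'obs'])
df = pd.DataFrame({'val': [0.0139, 0.5577, 0.0303]}, index=idx)

# One new column name per index level (requires pandas >= 1.5.0)
flat = df.reset_index(names=['index1', 'index2', 'index3'])
print(flat)
```

On older pandas versions the rename_axis + reset_index combination gives the same column names.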
I usually do it this way:
df = df.assign(index1=df.index)