How to use indexing by matching strings in a data frame in pandas

I am trying to solve the following problem. I have two data sets, say df1 and df2:
df1
    NameSP       Val Char1 BVA
0   'ACCR'  0.091941     A  Y'
1   'SDRE'  0.001395     S  Y'
2   'ACUZ'  0.121183     A  N'
3   'SRRE'  0.001512     S  N'
4   'FFTR'  0.035609     F  N'
5   'STZE'  0.000637     S  N'
6   'AHZR'  0.001418     A  Y'
7   'DEES'  0.000876     D  N'
8   'UURR'  0.023878     U  Y'
9   'LLOH'  0.004371     L  Y'
10  'IUUT'  0.049102     I  N'
df2
   NameSP   Val1   Glob
0  'ACCR'  0.234  20000
1  'FFTR'  0.222  10000
2  'STZE'  0.001   5000
3  'DEES'  0.006   2000
4  'UURR'  0.134  20000
5  'LLOH'  0.034  10000
I would like to perform indexing of df2 in df1, and then use the indexing vector for various matrix operations. This would be something similar to strmatch(A,B,'exact') in Matlab. I can get the indexing properly by using .iloc and then .isin, as in the following code:
import pandas as pd
import numpy as np
df1 = pd.read_excel(r'C:\PYTHONCODES\LINEAROPT\TEST_DATA1.xlsx')
df2 = pd.read_excel(r'C:\PYTHONCODES\LINEAROPT\TEST_DATA2.xlsx')
print(df1)
print(df2)
ddf1 = df1.iloc[:,0]
ddf2 = df2.iloc[:,0]
pindex = ddf1[ddf1.isin(ddf2)]
print(pindex.index)
which gives me:
Int64Index([0, 4, 5, 7, 8, 9], dtype='int64')
But I cannot find a way to use this index for mapping and building my arrays. As an example, I would like to have a vector with the same number of elements as df1, but with the Val1 values from df2 at the indexed positions and zeros everywhere else. So it should look like this:
0.234
0
0
0
0.222
0.001
0
0.006
0.134
0.034
0
Or another mapping problem: how do I use such indexing to map the values from column "Val" in df1 into a vector that contains Val from df1 at the indexed rows and zeros everywhere else? So this time it should look like:
0.091941
0.0
0.0
0.0
0.035609
0.000637
0.0
0.000876
0.023878
0.004371
0.0
Any idea of how to do that in an efficient and elegant way?
Thanks for the help!

First problem:
df2.set_index('NameSP')['Val1'].reindex(df1['NameSP']).fillna(0)
Second problem (note that the column in df1 is 'Val', not 'Val1'):
df1['Val'].where(df1['NameSP'].isin(df2['NameSP']), 0)
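A self-contained sketch to check both lines, rebuilding small versions of df1 and df2 inline (quotes around the names dropped for simplicity) instead of reading from Excel:
import pandas as pd

df1 = pd.DataFrame({
    'NameSP': ['ACCR', 'SDRE', 'ACUZ', 'SRRE', 'FFTR', 'STZE',
               'AHZR', 'DEES', 'UURR', 'LLOH', 'IUUT'],
    'Val': [0.091941, 0.001395, 0.121183, 0.001512, 0.035609, 0.000637,
            0.001418, 0.000876, 0.023878, 0.004371, 0.049102],
})
df2 = pd.DataFrame({
    'NameSP': ['ACCR', 'FFTR', 'STZE', 'DEES', 'UURR', 'LLOH'],
    'Val1': [0.234, 0.222, 0.001, 0.006, 0.134, 0.034],
})

# First problem: align df2's Val1 to df1's rows; unmatched names become 0.
v1 = df2.set_index('NameSP')['Val1'].reindex(df1['NameSP']).fillna(0)
print(v1.to_numpy())
# [0.234 0.    0.    0.    0.222 0.001 0.    0.006 0.134 0.034 0.   ]

# Second problem: keep df1's own Val only on rows whose NameSP appears in df2.
v2 = df1['Val'].where(df1['NameSP'].isin(df2['NameSP']), 0)
print(v2.to_numpy())
# [0.091941 0.       0.       0.       0.035609 0.000637 0.       0.000876
#  0.023878 0.004371 0.      ]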

Related

Multiplying two data frames in pandas

I have two data frames, df1 and df2, as shown below. I want to create a third dataframe df, also shown below. What would be the appropriate way?
df1={'id':['a','b','c'],
     'val':[1,2,3]}
df1=pd.DataFrame(df1)
df1
  id  val
0  a    1
1  b    2
2  c    3
df2={'yr':['2010','2011','2012'],
     'val':[4,5,6]}
df2=pd.DataFrame(df2)
df2
     yr  val
0  2010    4
1  2011    5
2  2012    6
df={'id':['a','b','c'],
    'val':[1,2,3],
    '2010':[4,8,12],
    '2011':[5,10,15],
    '2012':[6,12,18]}
df=pd.DataFrame(df)
df
  id  val  2010  2011  2012
0  a    1     4     5     6
1  b    2     8    10    12
2  c    3    12    15    18
I can basically convert df1 and df2 to 1-by-n matrices, compute the n-by-n result, and assign it back to df1. But is there an easy pandas way?
TL;DR
We can do it in one line like this:
df1.join(df1.val.apply(lambda x: x * df2.set_index('yr').val))
or like this:
df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')
Done.
The long story
Let's see what's going on here.
To find the result of multiplying each df1.val by the values in df2.val we use apply:
df1['val'].apply(lambda x: x * df2.val)
The function inside receives the df1.val values one by one and multiplies each by df2.val element-wise (see broadcasting for details if needed). Since df2.val is a pandas Series, the output is a data frame with index df1.val.index and columns df2.val.index. With df2.set_index('yr') we force the years to become the index before the multiplication, so they end up as column names in the output.
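A small demo of that intermediate step (a sketch, using the df1/df2 from the question):
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'val': [1, 2, 3]})
df2 = pd.DataFrame({'yr': ['2010', '2011', '2012'], 'val': [4, 5, 6]})

# Each scalar from df1.val is broadcast against the year-indexed Series,
# so every input row becomes a row of year-labeled products.
inner = df1['val'].apply(lambda x: x * df2.set_index('yr')['val'])
print(inner)
# yr  2010  2011  2012
# 0      4     5     6
# 1      8    10    12
# 2     12    15    18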
DataFrame.join joins frames index-on-index by default. So, since the indexes of df1 and of the multiplication output are identical, we can apply df1.join( <the output of multiplication> ) as is.
At the end we get the desired matrix with index df1.index and columns id, val, *df2['yr'].
The second variant with the @ operator is actually the same. The main difference is that we multiply 2-dimensional frames instead of Series. These are the vertical and horizontal vectors, respectively, so the matrix multiplication produces a frame with index df1.id and columns df2.yr, holding the element-wise products as values. At the end we join df1 with the output on the identical id column and index, respectively.
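Continuing the sketch above, the matrix-product route looks like this: a 3-by-1 frame times a 1-by-3 frame gives the 3-by-3 product, indexed by id and labeled by year.
# The inner dimension is the shared 'val' label, so @ aligns and contracts over it.
outer = df1.set_index('id') @ df2.set_index('yr').T
print(outer)
# yr  2010  2011  2012
# id
# a      4     5     6
# b      8    10    12
# c     12    15    18
print(df1.join(outer, on='id'))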
This works for me:
import numpy as np

df2 = df2.T
new_df = pd.DataFrame(np.outer(df1['val'], df2.iloc[1:]))
df = pd.concat([df1, new_df], axis=1)
df.columns = ['id', 'val', '2010', '2011', '2012']
df
The output I get:
  id  val  2010  2011  2012
0  a    1     4     5     6
1  b    2     8    10    12
2  c    3    12    15    18
Your question is a bit vague, but I suppose you want to do something like this:
df = pd.concat([df1, df2], axis=1)

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas series using a user-defined function and to write the output into a new column. I figured out the individual steps, but when I put them together I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x='d7e'
y='8e0d'
s=0
for i in y:
    b=str(i)
    if b not in x:
        s+=0
    else:
        s+=1
print(s)
the right result for these particular strings is 2
Note: when I wrap this in def func(x,y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
import numpy as np

df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
   Code  R1SB
0   d7e   NaN
1  8e0d   2.0
2   ft1   0.0
3   176   1.0
4   trk   0.0
5  tr71   2.0
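For reference, here is a sketch of the asker's original func idea as a named function (count_common_chars is a hypothetical name). The counter is a local variable, so it starts from 0 on every call, which is the reset the question was missing:
import numpy as np
import pandas as pd

def count_common_chars(prev, cur):
    # Count how many characters of `cur` also appear in `prev`.
    if pd.isna(prev):
        return np.nan
    return sum(ch in str(prev) for ch in str(cur))

df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
df["R1SB"] = [count_common_chars(prev, cur)
              for prev, cur in zip(df["Code"].shift(1), df["Code"])]
print(df)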

Binarize a continuous feature with NaNs Python

I have a pandas dataframe of 4000 rows and 35 features, in which some of the continuous features contain missing values (NaNs). For example, one of them (with 46 missing values) has a very left-skewed distribution and I would like to binarize it with a threshold of 1.5: values below 1.5 become class 0, and values greater than or equal to 1.5 become class 1.
Like: X_original = [0.01,2.80,-1.74,1.34,1.55], X_bin = [0, 1, 0, 0, 1].
I tried doing: dataframe["bin"] = (dataframe["original"] > 1.5).astype(int).
However, I noticed that the missing values (NaNs) disappeared and they are encoded in the 0 class.
How could I solve this problem?
To the best of my knowledge there is no way to keep the missing values after a comparison, but you can do the following:
import pandas as pd
import numpy as np
X_original = pd.Series([0.01,2.80,-1.74, np.nan,1.55])
X_bin = X_original > 1.5
X_bin[X_original.isna()] = np.nan
print(X_bin)
Output
0    0.0
1    1.0
2    0.0
3    NaN
4    1.0
dtype: float64
To keep the column as Integer (and also nullable), do:
X_bin = X_bin.astype(pd.Int8Dtype())
print(X_bin)
Output
0       0
1       1
2       0
3    <NA>
4       1
dtype: Int8
The best way I found to handle this issue was to use a list comprehension:
dataframe["Bin"] = [0 if el < 1.5 else 1 if el >= 1.5 else np.nan for el in dataframe["Original"]]
Then I convert the floats to strings, leaving the np.nan values untouched:
dataframe["Bin"] = dataframe["Bin"].replace([0.0, 1.0], ["0", "1"])

Random Choice loop through groups of samples

I have a df containing the columns "Income_group", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd
df={'Income_Groups':['1','1','1','2','2','2','3','3','3'],
    'Rate':[1.23,1.25,1.56,2.11,2.32,2.36,3.12,3.45,3.55],
    'Probability':[0.25,0.50,0.25,0.50,0.25,0.25,0.10,0.70,0.20]}
df2=pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1    1.25
2    2.36
3    3.45
dtype: float64
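If reproducible draws are wanted (an assumption, not something the question asks for), a seeded generator can replace the global np.random state; Generator.choice takes the same a/p arguments:
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, so the draws repeat across runs
picks = (df2.groupby('Income_Groups')
            .apply(lambda x: rng.choice(x['Rate'], p=x['Probability'])))
print(picks)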
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1    [1.23, 1.25, 1.25]
2    [2.36, 2.11, 2.11]
3    [3.12, 3.12, 3.45]
dtype: object
Use GroupBy.apply because of the per-group weights.
import numpy as np
(df2.groupby('Income_Groups')
    .apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1    1.23
#2    2.11
#3    3.45
#dtype: float64
Another silly way, because your weights seem to have a precision of 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat((s.index.get_level_values('Probability')*100).astype(int))  # weight
  .sample(frac=1)                                 # shuffle
  .reset_index()                                  # randomly select
  .drop_duplicates(subset=['Income_Groups'])      # keep the first pick per group
  .drop(columns='Probability'))
#   Income_Groups  Rate
#0              2  2.32
#1              1  1.25
#3              3  3.45

How to use the diff() function in pandas but enter the difference values in a new column?

I have a dataframe df:
df
       x-value
frame
1           15
2           20
3           19
How can I get:
df
       x-value  delta-x
frame
1           15        0
2           20        5
3           19       -1
Not to say there is anything wrong with what @Wen posted as a comment, but I want to post a more complete answer.
The Problem
There are 3 things going on that need to be addressed:
Calculating the values that are the differences from one row to the next.
Handling the fact that the "difference" will have one fewer value than the original length of the dataframe, so we have to fill in a value for the missing bit.
Assigning the result to a new column.
Option #1
The most natural way to do the diff would be to use pd.Series.diff (as @Wen suggested). But in order to produce the stated results, which are integers, I recommend using the pd.Series.fillna parameter downcast='infer'. Finally, I don't like editing the dataframe unless there is a need for it, so I use pd.DataFrame.assign to produce a new dataframe that is a copy of the old one with a new column attached.
df.assign(**{'delta-x': df['x-value'].diff().fillna(0, downcast='infer')})
       x-value  delta-x
frame
1           15        0
2           20        5
3           19       -1
Option #2
Similar to #1 but I'll use numpy.diff to preserve int type in addition to picking up some performance.
df.assign(**{'delta-x': np.append(0, np.diff(df['x-value'].values))})
       x-value  delta-x
frame
1           15        0
2           20        5
3           19       -1
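A third variant (a sketch, assuming pandas >= 0.24, where Series.shift gained fill_value) that also keeps the int dtype by subtracting a shifted copy of the column:
# Subtracting the column shifted down one row yields row-to-row differences;
# seeding the shift with the first value makes the first delta 0.
df.assign(**{'delta-x': df['x-value'] - df['x-value'].shift(1, fill_value=df['x-value'].iloc[0])})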
Testing
from timeit import timeit

pir1 = lambda d: d.assign(**{'delta-x': d['x-value'].diff().fillna(0, downcast='infer')})
pir2 = lambda d: d.assign(**{'delta-x': np.append(0, np.diff(d['x-value'].values))})

res = pd.DataFrame(
    index=[10, 300, 1000, 3000, 10000, 30000],
    columns=['pir1', 'pir2'], dtype=float)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=1000)
res.plot(loglog=True)
res.div(res.min(1), 0)

           pir1  pir2
10     2.069498   1.0
300    2.123017   1.0
1000   2.397373   1.0
3000   2.804214   1.0
10000  4.559525   1.0
30000  7.058344   1.0