Random Choice loop through groups of samples - pandas

I have a DataFrame containing the columns "Income_Groups", "Rate", and "Probability". I need to randomly select a rate for each income group, weighted by the probabilities. How can I write a loop (or another approach) and print the result for each income bin?
The pandas data frame table looks like this:
import pandas as pd
df = {'Income_Groups': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
      'Rate': [1.23, 1.25, 1.56, 2.11, 2.32, 2.36, 3.12, 3.45, 3.55],
      'Probability': [0.25, 0.50, 0.25, 0.50, 0.25, 0.25, 0.10, 0.70, 0.20]}
df2 = pd.DataFrame(data=df)
df2

Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
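If you need reproducible draws, you can use a seeded NumPy Generator instead of the global random state; a minimal sketch (not part of the original answer):

import numpy as np

rng = np.random.default_rng(0)  # seeded Generator: repeated runs give the same picks

(df2.groupby('Income_Groups')
    .apply(lambda x: rng.choice(x['Rate'], p=x['Probability']))
)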

You need GroupBy.apply here because of the per-group weights:
import numpy as np
(df2.groupby('Income_Groups')
    .apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another (somewhat silly) way, which works because your weights seem to have a precision of 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate

(s.repeat((s.index.get_level_values('Probability') * 100).astype(int))  # weight: repeat each row per its probability
  .sample(frac=1)                             # shuffle
  .reset_index()
  .drop_duplicates(subset=['Income_Groups'])  # keep the first (random) row per group
  .drop(columns='Probability'))
# Income_Groups Rate
#0 2 2.32
#1 1 1.25
#3 3 3.45
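A pandas-native alternative not shown above is weighted sampling with DataFrame.sample; a sketch:

(df2.groupby('Income_Groups', group_keys=False)
    .apply(lambda g: g.sample(n=1, weights=g['Probability'])))
# one full row per income group, drawn with probability proportional to 'Probability'

Because whole rows are returned, you keep Rate together with any other columns you need.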


How to use indexing by matching strings in data frame in pandas

I am trying to solve the following problem. I have two data sets, say df1 and df2:
df1
   NameSP       Val Char1  BVA
0  'ACCR'  0.091941     A  'Y'
1  'SDRE'  0.001395     S  'Y'
2  'ACUZ'  0.121183     A  'N'
3  'SRRE'  0.001512     S  'N'
4  'FFTR'  0.035609     F  'N'
5  'STZE'  0.000637     S  'N'
6  'AHZR'  0.001418     A  'Y'
7  'DEES'  0.000876     D  'N'
8  'UURR'  0.023878     U  'Y'
9  'LLOH'  0.004371     L  'Y'
10 'IUUT'  0.049102     I  'N'
df2
  NameSP   Val1   Glob
0 'ACCR'  0.234  20000
1 'FFTR'  0.222  10000
2 'STZE'  0.001   5000
3 'DEES'  0.006   2000
4 'UURR'  0.134  20000
5 'LLOH'  0.034  10000
I would like to perform indexing of df2 in df1, and then use the indexing vector for various matrix operations. This would be something similar to strmatch(A,B,'exact') in Matlab. I can get the indexing properly by using .iloc and then .isin, as in the following code:
import pandas as pd
import numpy as np
df1 = pd.read_excel(r'C:\PYTHONCODES\LINEAROPT\TEST_DATA1.xlsx')
df2 = pd.read_excel(r'C:\PYTHONCODES\LINEAROPT\TEST_DATA2.xlsx')
print(df1)
print(df2)
ddf1 = df1.iloc[:,0]
ddf2 = df2.iloc[:,0]
pindex = ddf1[ddf1.isin(ddf2)]
print(pindex.index)
which gives me:
Int64Index([0, 4, 5, 7, 8, 9], dtype='int64')
But I cannot find a way to use this index for mapping and building my arrays. As an example, I would like a vector with the same number of elements as df1, but with the Val1 values from df2 at the indexed positions and zeros everywhere else. So it should look like this:
0.234
0
0
0
0.222
0.001
0
0.006
0.134
0.034
0
Here is another mapping problem: how can I use this indexing to map the values from column "Val" of df1 into a vector that contains Val from df1 at the indexed rows and zeros everywhere else? This time it should look like:
0.091941
0.0
0.0
0.0
0.035609
0.000637
0.0
0.000876
0.023878
0.004371
0.0
Any idea how to do that in an efficient and elegant way?
Thanks for the help!
First problem:
df2.set_index('NameSP')['Val1'].reindex(df1['NameSP']).fillna(0)
Second problem (note that the column in df1 is Val, not Val1):
df1['Val'].where(df1['NameSP'].isin(df2['NameSP']), 0)
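For completeness, a minimal sketch wiring both one-liners to small stand-ins for the question's frames (values taken from the first few matching rows):

import pandas as pd

df1 = pd.DataFrame({'NameSP': ['ACCR', 'SDRE', 'ACUZ', 'FFTR'],
                    'Val':    [0.091941, 0.001395, 0.121183, 0.035609]})
df2 = pd.DataFrame({'NameSP': ['ACCR', 'FFTR'],
                    'Val1':   [0.234, 0.222]})

# first problem: Val1 from df2 aligned to df1's row order, 0 where there is no match
vec1 = df2.set_index('NameSP')['Val1'].reindex(df1['NameSP']).fillna(0).to_numpy()
print(vec1)   # [0.234 0.    0.    0.222]

# second problem: df1's own Val, zeroed where NameSP is absent from df2
vec2 = df1['Val'].where(df1['NameSP'].isin(df2['NameSP']), 0).to_numpy()
print(vec2)   # [0.091941 0.       0.       0.035609]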

Pandas groupby in combination with sklearn preprocessing continued

Continue from this post:
Pandas groupby in combination with sklearn preprocessing
I need to preprocess the data by scaling it within groups defined by two columns, but I somehow get an error with the second method:
import pandas as pd
import numpy as np
from sklearn.preprocessing import robust_scale, minmax_scale

df = pd.DataFrame(dict(id=list('AAAAABBBBB'),
                       loc=(10, 20, 10, 20, 10, 20, 10, 20, 10, 20),
                       value=(0, 10, 10, 20, 100, 100, 200, 30, 40, 100)))

df['new'] = df.groupby(['id', 'loc']).value.transform(lambda x: minmax_scale(x.astype(float)))
df['new'] = df.groupby(['id', 'loc']).value.transform(lambda x: robust_scale(x))
The second one gives me an error like this:
ValueError: Expected 2D array, got 1D array instead:
array=[  0.  10. 100.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
If I use reshape, I get an error like this:
Exception: Data must be 1-dimensional
If I print out the grouped data, g['value'] is a pandas Series:
for n, g in df.groupby(['id', 'loc']):
    print(type(g['value']))
Do you know what might cause it?
Thanks.
Based on the error message, you should reshape the input to 2-D and then concatenate the result back to 1-D:
df.groupby(['id','loc']).value.transform(lambda x:np.concatenate(robust_scale(x.values.reshape(-1,1))))
Out[606]:
0 -0.2
1 -1.0
2 0.0
3 1.0
4 1.8
5 0.0
6 1.0
7 -2.0
8 -1.0
9 0.0
Name: value, dtype: float64
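Equivalently (my variant, not from the original answer), you can flatten the 2-D output of robust_scale with .ravel(), which transform accepts just as well; a sketch:

df['new'] = (df.groupby(['id', 'loc'])['value']
               .transform(lambda x: robust_scale(x.values.reshape(-1, 1)).ravel()))  # scale per group, then back to 1-D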

Unexpected Result Updating a Copy of a DF when using iterrows

When I ran this code, I expected df2 to update correctly, but it does not. Here is the code:
import pandas as pd
import numpy as np
exam_data = [{'name': 'Anastasia', 'score': 12.5},
             {'name': 'Dima', 'score': 9},
             {'name': 'Katherine', 'score': 16.5}]
df = pd.DataFrame(exam_data)
df2 = df.copy()

for index, row in df.iterrows():
    df2['score'] = row['score'] * 2
    print(row['name'], row['score'])

print(df2)
As you can see from the output below, the scores did not double; they were all set to 33.0.
Anastasia 12.5
Dima 9.0
Katherine 16.5
name score
0 Anastasia 33.0
1 Dima 33.0
2 Katherine 33.0
What is going on? Why am I seeing this unexpected result?
Because you assign to df2['score'] on every iteration: row['score'] is a scalar, so the whole column is overwritten each time, and after the last iteration every row holds the last score doubled (16.5 * 2 = 33.0). If you want to keep the loop, update only the current row:
df2.loc[index, 'score'] = row['score'] * 2
Pandas works column-wise; instead of iterating over the rows (which is slow), you can just use
df2['score'] = df['score'] * 2
That will update the entire column at once.
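For reference, a quick sketch of the vectorized version with its output (the numbers follow from the question's data):

df2 = df.copy()
df2['score'] = df['score'] * 2   # vectorized: doubles every score at once
print(df2)
#         name  score
# 0  Anastasia   25.0
# 1       Dima   18.0
# 2  Katherine   33.0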

Pandas .loc without KeyError

>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['1','2']] # Succeeds, as in the answer below.
I'd like something that doesn't fail in either of
>>> pd.DataFrame([1], index=['1']).loc['2'] # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']] # KeyError
Is there a function like loc which gracefully handles this, or some other way of expressing this query?
Update for @AlexLenail's comment
It's a fair point that this will be slow for large lists. I did a little more digging and found that the intersection method is available on Indexes and columns. I'm not sure about the algorithmic complexity, but it's much faster empirically.
You can do something like this.
good_keys = df.index.intersection(all_keys)
df.loc[good_keys]
Or like your example
df = pd.DataFrame([1], index=['1'])
df.loc[df.index.intersection(['2'])]
Here is a little experiment below
n = 100000

# Create random values and random string indexes;
# the "bad" index list contains extra values not present in the DataFrame index
rand_val = np.random.rand(n)

rand_idx = []
for x in range(n):
    rand_idx.append(str(x))

bad_idx = []
for x in range(n * 2):
    bad_idx.append(str(x))

df = pd.DataFrame(rand_val, index=rand_idx)
df.head()
def get_valid_keys_list_comp():
    # Return filtered DataFrame using a list comprehension to filter keys
    vkeys = [key for key in bad_idx if key in df.index.values]
    return df.loc[vkeys]

def get_valid_keys_intersection():
    # Return filtered DataFrame using Index.intersection() to filter keys
    vkeys = df.index.intersection(bad_idx)
    return df.loc[vkeys]
%%timeit
get_valid_keys_intersection()
# 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
get_valid_keys_list_comp()
# 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original answer
I'm not sure if pandas has a built-in function to handle this but you can use Python list comprehension to filter to valid indexes with something like this.
Given a DataFrame df2
A B C D F
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
test 1.0 2013-01-02 1.0 3 foo
train 1.0 2013-01-02 1.0 3 foo
You can filter your index query with this
keys = ['test', 'train', 'try', 'fake', 'broken']
valid_keys = [key for key in keys if key in df2.index.values]
df2.loc[valid_keys]
This will also work for columns if you use df2.columns instead of df2.index.values
I found an alternative (provided a check for df.empty is made beforehand). You could do something like this:
df[df.index == '2']
This returns either a DataFrame with the matched rows or an empty DataFrame.
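The same boolean-mask idea extends to several keys with Index.isin (my addition, not part of the answer above); a sketch:

df = pd.DataFrame([1], index=['1'])
df[df.index.isin(['1', '2'])]   # keeps rows whose label is in the list; unknown labels are simply ignored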
It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
# Create a subset of the dataframe.
df.loc[keys]
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
Virginia NaN NaN
Or if you want to exclude the NaN row:
df.loc[keys].dropna()
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
This page https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike has the solution:
In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
Out[8]:
0
2 NaN
Using the sample dataframe from @binjip's answer:
import numpy as np
import pandas as pd
# Create dataframe
data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!
df.reindex(keys)
distance population
Alabama 0.0 4.8
Alaska 300.0 0.7
Arizona 600.0 6.4
Virginia NaN NaN
If you want to omit missing keys:
df.reindex(df.index.intersection(keys))
distance population
Alabama 0 4.8
Alaska 300 0.7
Arizona 600 6.4
df.loc uses labels (values from df.index), not the position of the row. Did you mean to use .iloc instead?
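A tiny illustration of the difference, using the question's frame (my example):

df = pd.DataFrame([1], index=['1'])
df.iloc[0]     # positional: the first row, regardless of its label
df.loc['1']    # label-based: works because '1' is a label in df.index
# df.loc['2']  # label-based: raises KeyError because '2' is not a label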

Convert strings with time suffixes to numbers in numpy

I have a numpy array with values like "1.0s", "100ms", etc. I can't plot this (with pandas, after putting the array into a Series), because pandas doesn't recognize that these strings are numbers. How can I have numpy or pandas convert them into numbers while paying attention to the suffixes?
See the question "how do I get at the pandas.offsets object given an offset string". The steps:
1. use pandas.tseries.frequencies.to_offset
2. convert to timedeltas
3. get total seconds
import pandas as pd
from pandas.tseries.frequencies import to_offset

s = pd.Series(['1.0s', '100ms', '10s', '0.5T'])
pd.to_timedelta(s.apply(to_offset)).dt.total_seconds()
0 0.0
1 0.1
2 10.0
3 30.0
dtype: float64
This code could solve your problem.
import pandas as pd

# Test data
se = pd.Series(['10s', '100ms', '1.0s'])

# Pattern to match an integer or float followed by its unit (ms or s)
pat = r"([0-9]*\.?[0-9]+)(ms|s)"

# Extract value and unit into separate columns
df = se.str.extract(pat, flags=0, expand=True)
df.columns = ['value', 'unit']

# Convert the value column to numbers
df['value'] = pd.to_numeric(df['value'])

# Convert everything to the same unit (ms)
mask = df['unit'] == 's'
df.loc[mask, 'value'] = df.loc[mask, 'value'] * 1000
df.loc[mask, 'unit'] = 'ms'

# Now you are ready to plot!
print(df['value'])
# 0    10000.0
# 1      100.0
# 2     1000.0
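If every suffix is a unit that pandas' timedelta parser understands (such as 's' and 'ms'), you could also skip the regex entirely; a minimal sketch under that assumption:

import pandas as pd

se = pd.Series(['10s', '100ms', '1.0s'])
seconds = pd.to_timedelta(se).dt.total_seconds()   # parses the unit suffix directly
print(seconds)
# 0    10.0
# 1     0.1
# 2     1.0
# dtype: float64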