I am trying to implement a weighted random selection in a dataframe. I used the code below to build the dataframe:
import pandas as pd
from numpy import exp
import random
moves = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4)]
data = {'moves': list(map(lambda i: moves[i] if divmod(i, len(moves))[0] != 1 else moves[divmod(i, len(moves))[1]],
[i for i in range(2 * len(moves))])),
'player': list(map(lambda i: 1 if i >= len(moves) else 2,
[i for i in range(2 * len(moves))])),
'wins': [random.randint(0, 2) for i in range(2 * len(moves))],
'playout_number': [random.randint(0,1) for i in range(2 * len(moves))]
}
frame = pd.DataFrame(data)
and then I created a list and inserted it as the new column 'weight':
total = sum(map(lambda a, b: exp(a/b) if b != 0 else 0, frame['wins'], frame['playout_number']))
weights = list(map(lambda a, b: exp(a/b) / total if b != 0 else 0, frame['wins'], frame['playout_number']))
frame = frame.assign(weight=weights)
Now I want to select a random row based on each row's weight in the newly inserted column.
The problem is that I want to use pandas.DataFrame.sample(weights=weight), but I don't know how. I can do it with numpy.random.choice(p=weights), but I'd prefer to keep using pandas library functions.
I would appreciate any help.
You can combine the weights parameter of sample with either n or frac.
The weights parameter accepts array-like input, so you can pass the list directly:
df = frame.sample(n=1, weights=weights)
Or pass a column of the DataFrame (a Series):
#select 1 row - n=1
df = frame.sample(n=1, weights=frame.weight)
print (df)
moves player playout_number wins weight
6 (1, 2) 1 1 2 0.258325
#select 20% rows - frac=0.2
df = frame.sample(frac=0.2, weights=frame.weight)
print (df)
moves player playout_number wins weight
5 (2, 4) 2 1 2 0.221747
4 (2, 3) 2 1 1 0.081576
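As a minimal sketch of how the weights are interpreted (the small frame below is illustrative, not the asker's data): sample also accepts a column name for weights, and weights that do not sum to 1 are rescaled by pandas. Passing random_state makes the draw reproducible.

```python
import pandas as pd

# Illustrative frame; 'weight' holds unnormalized, non-negative weights.
frame = pd.DataFrame({
    'moves': [(1, 2), (1, 3), (1, 4)],
    'weight': [0.0, 5.0, 0.0],
})

# A column name can be passed directly; pandas rescales the weights to sum to 1.
picked = frame.sample(n=1, weights='weight', random_state=0)

# Only the second row has a non-zero weight, so it is the only row that can be drawn.
print(picked)
```

Rows with a weight of 0 are never selected, which is why the rows with playout_number == 0 in the question cannot be drawn.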
I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
skiprows = 0,
header = [0,1,2,3],
index_col = 0,
parse_dates = True
)
I would like to remove the columns 01090BL and 01100MS. The idea, in the main program, is to keep a list of the columns that I want to remove and then drop them. Consequently, I did the following:
2bremoved = ['01090BL', '01100MS']
dfr = dfr.drop(2bremoved, axis=1, inplace=True)
but I get the following error:
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
/usr/lib/python3/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I have thus done the following:
aa = dfr.drop(2bremoved, axis=1, inplace=True,level = 0)
but I get an empty dataframe. What am I missing?
thanks
Don't use inplace=True when assigning the output: with inplace=True, drop modifies the DataFrame it is called on and returns None. Also, a variable name cannot start with a digit in Python:
to_remove = ['01090BL', '01100MS']
aa = dfr.drop(to_remove, axis=1, level=0)
Output:
name 00590BL 02200MS
lat 613297 616720
long 5185127 5181393
elv 1833 1499
1956-01-01 1 -2
1956-01-02 2 -1
1956-01-03 3 0
1956-01-04 4 1
1956-01-05 5 2
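The difference between the two call styles can be checked on a small frame. This is a sketch with a two-level header only (name and lat), built by hand rather than read from the CSV, so the values are illustrative:

```python
import pandas as pd

# Illustrative two-level column header (station name, latitude).
columns = pd.MultiIndex.from_tuples(
    [('00590BL', 613297), ('01090BL', 626278),
     ('01100MS', 626323), ('02200MS', 616720)],
    names=['name', 'lat'],
)
dfr = pd.DataFrame([[1, 2, 2, -2], [2, 3, 3, -1]], columns=columns)

to_remove = ['01090BL', '01100MS']

# Without inplace=True, drop returns a new frame and leaves dfr untouched.
kept = dfr.drop(to_remove, axis=1, level=0)

# With inplace=True, drop modifies the frame it is called on and returns None.
scratch = dfr.copy()
returned = scratch.drop(to_remove, axis=1, level=0, inplace=True)

print(kept.columns.get_level_values('name').tolist())
print(returned)
```

Assigning the result of an inplace=True call, as in the question, therefore stores None rather than a DataFrame.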
I have a table of values with two columns, say x and y. If a value in the y column is 0, I need to apply a multiplier to the x column, and vice versa. How would I go about doing this?
Thanks in advance.
I would use boolean row selection with .loc to modify each column:
import pandas as pd
df = pd.DataFrame({'x':[1,0,2,0], 'y':[1,3,0,4]})
df.loc[df['x'] == 0, 'y'] = df.loc[df['x'] == 0, 'y'] * 2
df.loc[df['y'] == 0, 'x'] = df.loc[df['y'] == 0, 'x'] * 2
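A variant of the same idea, as a sketch: computing both masks before changing anything means the second update does not depend on values written by the first. The multiplier name and its value of 2 are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 0, 2, 0], 'y': [1, 3, 0, 4]})
multiplier = 2  # hypothetical factor

# Evaluate both conditions on the original values.
x_is_zero = df['x'] == 0
y_is_zero = df['y'] == 0

# Scale y where x is 0, and x where y is 0.
df.loc[x_is_zero, 'y'] *= multiplier
df.loc[y_is_zero, 'x'] *= multiplier
print(df)
```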
I am trying to map a column of my df with a dictionary, where the dictionary contains tuples as values. I want to be able to only return the first value of the tuple in the output column. Is there a way to do that?
The situation:
d = {'key1': (1, 2, 3)}
df['lookup_column'] = 'key1'
df['return_column'] = df['lookup_column'].map(d)
Output:
df['return_column'] = (1, 2, 3)
Adding this returns an error:
df['return_column'] = df['return_column'][0]
Running this instead also returns an error:
df['return_column'] = df['lookup_column'].map(d[0])
The desired outcome:
df['return_column'] = 1
Thank you!
Use the .str accessor to take the first element of an iterable (here a tuple). It returns NaN if there is no match:
df['return_column'] = df['return_column'].str[0]
All together:
import numpy as np
import pandas as pd
df = pd.DataFrame({'lookup_column':['key1','key2']})
d = {'key1': (1, 2, 3)}
df['return_column1'] = df['lookup_column'].map(d)
df['return_column2'] = df['lookup_column'].map(d).str[0]
A second alternative uses dict.get to supply a default value when there is no match. Because the mapped values are tuples, the default can be the tuple (np.nan,):
df['return_column4'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,)))
df['return_column5'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,))[0])
print (df)
lookup_column return_column1 return_column2 return_column4 return_column5
0 key1 (1, 2, 3) 1.0 (1, 2, 3) 1.0
1 key2 NaN NaN (nan,) NaN
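To see why indexing the mapped column with [0] does not do the same thing, here is a small sketch (the column name first_value is illustrative). Plain [0] on a Series looks up the row labelled 0, whereas .str[0] indexes into every element.

```python
import pandas as pd

d = {'key1': (1, 2, 3)}
df = pd.DataFrame({'lookup_column': ['key1', 'key2']})

# map returns the whole tuple for known keys and NaN for unknown keys.
mapped = df['lookup_column'].map(d)

# mapped[0] is the value stored in the row labelled 0, i.e. one whole tuple.
first_row = mapped[0]

# .str[0] takes the first item of each tuple and leaves NaN where there is none.
df['first_value'] = mapped.str[0]
print(first_row)
print(df)
```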
I am trying to apply a function to every column in a dataframe. It works when I do it on a single fixed column name, but when I try to pass the column name as an argument to the function, I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row,c):
    if row[c] >=0 and row[c] <=1:
        return 'c'
    elif row[c] >1 and row[c] <=2:
        return 'b'
    else:
        return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2
You don't need to pass the column name to apply. Since you only want to check whether each value in a column falls in a certain range and return a, b or c accordingly, you can make the following changes.
def result(val):
    if 0<=val<=1:
        return 'c'
    elif 1<val<=2:
        return 'b'
    return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
A faster way is np.select:
import numpy as np
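values = ['c', 'b']
for col in df.columns:
    # Comparisons on a Series cannot be chained, so combine two masks with &
    conditions = [(df[col] >= 0) & (df[col] <= 1), (df[col] > 1) & (df[col] <= 2)]
    df[col] = np.select(conditions, values, default='a')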
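On the literal question of passing arguments: the error in the question comes from args = (c), which is just the string c in parentheses, so each character is passed as a separate argument. args expects a tuple, and a one-element tuple needs a trailing comma. A minimal sketch using the question's input frame (the labelled name is illustrative):

```python
import pandas as pd

def result(row, c):
    """Label the value in column c of a single row."""
    value = row[c]
    if 0 <= value <= 1:
        return 'c'
    elif 1 < value <= 2:
        return 'b'
    return 'a'

df = pd.DataFrame({'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]})

# Write into a copy so every call still sees the original numbers.
labelled = df.copy()
for c in df.columns:
    # (c,) is a one-element tuple; (c) would be the bare string.
    labelled[c] = df.apply(result, args=(c,), axis=1)
print(labelled)
```

Row-wise apply is usually slower than the column-wise versions above, so this is mainly useful when the function really needs the whole row.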
How to use the values of one column to access values in another
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I use 'bleh' to access the value for each row?
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DataFrame constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
Value
bleh
0 -1.085631
1 0.997345
2 0.282978
1 -1.506295
4 -0.578600
0 1.651437
0 -2.426679
4 -0.428913
1 1.265936
7 -0.866740
If so, your dataframe constructor has an extra set of [] in it.
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter of DataFrame takes a list, not a list of lists.
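A short sketch of what the extra brackets change, using small made-up values instead of the random data: a list of lists is read as the levels of a MultiIndex, while a flat list gives ordinary column labels. With flat columns, one column can then index another by position; the picked column name and the positions in bleh are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0, 40.0])

# A list of lists is interpreted as MultiIndex levels.
nested = pd.DataFrame(values, columns=[['Value']])
# A flat list gives ordinary column labels.
flat = pd.DataFrame(values, columns=['Value'])

print(type(nested.columns).__name__)
print(type(flat.columns).__name__)

# With flat columns, use the integers in 'bleh' as positions into 'Value'.
flat['bleh'] = [0, 0, 1, 2]  # illustrative positions
flat['picked'] = flat['Value'].to_numpy()[flat['bleh'].to_numpy()]
print(flat)
```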