How to assign a column to dataframe as weights for each row and then sample the dataframe according to those weights?

I am trying to implement a weighted random selection in a dataframe. I used the code below to build the dataframe:
import pandas as pd
from numpy import exp
import random
moves = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4)]
data = {'moves': list(map(lambda i: moves[i] if divmod(i, len(moves))[0] != 1 else moves[divmod(i, len(moves))[1]],
[i for i in range(2 * len(moves))])),
'player': list(map(lambda i: 1 if i >= len(moves) else 2,
[i for i in range(2 * len(moves))])),
'wins': [random.randint(0, 2) for i in range(2 * len(moves))],
'playout_number': [random.randint(0,1) for i in range(2 * len(moves))]
}
frame = pd.DataFrame(data)
and then I created a list and inserted it as the new column 'weight':
total = sum(map(lambda a, b: exp(a/b) if b != 0 else 0, frame['wins'], frame['playout_number']))
weights = list(map(lambda a, b: exp(a/b) / total if b != 0 else 0, frame['wins'], frame['playout_number']))
frame = frame.assign(weight=weights)
Now I want to select a random row based on each row's weight in the new column.
The problem is that I want to use pandas.DataFrame.sample(weights=...), but I don't know how. I can do it with numpy.random.choice(..., p=weights), but I'd prefer to keep using pandas functions.
Thanks in advance for any help.

You can combine the n or frac parameter with weights in sample.
The weights parameter accepts an array-like, so you can pass the list directly:
df = frame.sample(n=1, weights=weights)
Or pass a column of the DataFrame (a Series):
#select 1 row - n=1
df = frame.sample(n=1, weights=frame.weight)
print (df)
    moves  player  playout_number  wins    weight
6  (1, 2)       1               1     2  0.258325
#select 20% rows - frac=0.2
df = frame.sample(frac=0.2, weights=frame.weight)
print (df)
    moves  player  playout_number  wins    weight
5  (2, 4)       2               1     2  0.221747
4  (2, 3)       2               1     1  0.081576
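
As a self-contained sketch of the same idea: sample accepts unnormalized weights (pandas rescales them so they sum to 1), and a row whose weight is 0 is never drawn. The toy frame, its column names and the random_state value below are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frame; the weights do not need to sum to 1.
toy = pd.DataFrame({'move': ['a', 'b', 'c'], 'weight': [0.0, 1.0, 3.0]})

# Pass the column name (or the Series itself) as weights.
picked = toy.sample(n=1, weights='weight', random_state=0)
print(picked)

# Draw many rows with replacement to see the proportions emerge.
many = toy.sample(n=1000, weights='weight', replace=True, random_state=0)
counts = many['move'].value_counts()
print(counts)
```

Row 'a' has weight 0, so it should not appear in counts, and 'c' should be drawn roughly three times as often as 'b'.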

Related

drop columns according to header value

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
skiprows = 0,
header = [0,1,2,3],
index_col = 0,
parse_dates = True
)
I would like to remove the columns 01090BL and 01100MS. The idea, in the main program, is to keep a list of the columns that I want to remove and then drop them. I have therefore done the following:
2bremoved = ['01090BL', '01100MS']
dfr = dfr.drop(2bremoved, axis=1, inplace=True)
but I get the following error:
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
/usr/lib/python3/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I have thus done the following:
aa = dfr.drop(2bremoved, axis=1, inplace=True,level = 0)
but I get an empty dataframe. What am I missing?
thanks

Don't use inplace=True when assigning the output: with inplace=True, drop returns None. Also, a variable name cannot start with a digit in Python:
to_remove = ['01090BL', '01100MS']
aa = dfr.drop(to_remove, axis=1, level=0)
Output:
name       00590BL 02200MS
lat         613297  616720
long       5185127 5181393
elv           1833    1499
1956-01-01       1      -2
1956-01-02       2      -1
1956-01-03       3       0
1956-01-04       4       1
1956-01-05       5       2
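
The difference between the two calls can be checked on a small frame. This is only a sketch: the two-level header below is a cut-down, hypothetical stand-in for the four-level one in the question, and the names kept, scratch and result are illustrative.

```python
import pandas as pd

# Hypothetical two-level header standing in for the name/lat rows of the question.
columns = pd.MultiIndex.from_tuples(
    [('00590BL', 613297), ('01090BL', 626278),
     ('01100MS', 626323), ('02200MS', 616720)],
    names=['name', 'lat'],
)
dfr = pd.DataFrame([[1, 2, 2, -2], [2, 3, 3, -1]], columns=columns)

to_remove = ['01090BL', '01100MS']

# Without inplace=True, drop returns a new frame and leaves dfr untouched.
kept = dfr.drop(to_remove, axis=1, level=0)
print(kept.columns.get_level_values('name').tolist())

# With inplace=True, drop modifies the frame and returns None,
# so assigning its return value loses the frame.
scratch = dfr.copy()
result = scratch.drop(to_remove, axis=1, level=0, inplace=True)
print(result)
```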

How to apply a multiplier to particular searched values in a dataframe

I have a table of values with two columns, say x and y. If a value in the y column is 0, I need to apply a multiplier to the x column, and vice versa. How would I go about doing this?
Thanks in advance.

I would use boolean indexing on rows with .loc to modify each column:
import pandas as pd
df = pd.DataFrame({'x':[1,0,2,0], 'y':[1,3,0,4]})
df.loc[df['x'] == 0, 'y'] = df.loc[df['x'] == 0, 'y'] * 2
df.loc[df['y'] == 0, 'x'] = df.loc[df['y'] == 0, 'x'] * 2
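
A variant of the same approach, sketched under the assumption that both masks should be evaluated on the original values: compute them first, then update in place. The multiplier value of 2 and the mask names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 0, 2, 0], 'y': [1, 3, 0, 4]})
multiplier = 2  # hypothetical value, as in the answer above

# Compute both masks before modifying anything,
# so the second update does not depend on the first.
x_is_zero = df['x'] == 0
y_is_zero = df['y'] == 0

df.loc[x_is_zero, 'y'] *= multiplier
df.loc[y_is_zero, 'x'] *= multiplier
print(df)
```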

Retrieving only one element of a tuple when the tuple is the value of a dictionary

I am trying to map a column of my df with a dictionary, where the dictionary contains tuples as values. I want to be able to only return the first value of the tuple in the output column. Is there a way to do that?
The situation:
d = {'key1': (1, 2, 3)}
df['lookup_column'] = 'key1'
df['return_column'] = df['lookup_column'].map(d)
Output:
df['return_column'] = (1, 2, 3)
Adding this returns an error:
df['return_column'] = df['return_column'][0]
Running this instead also returns an error:
df['return_column'] = df['lookup_column'].map(d[0])
The desired outcome:
df['return_column'] = 1
Thank you!

Use the .str accessor to get the first element of an iterable (here a tuple); it returns NaN if there is no match:
df['return_column'] = df['return_column'].str[0]
All together:
import numpy as np
import pandas as pd
df = pd.DataFrame({'lookup_column':['key1','key2']})
d = {'key1': (1, 2, 3)}
df['return_column1'] = df['lookup_column'].map(d)
df['return_column2'] = df['lookup_column'].map(d).str[0]
A second alternative uses dict.get with a default value for keys with no match. Since the mapped values are tuples, the default can be the tuple (np.nan,):
df['return_column4'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,)))
df['return_column5'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,))[0])
print (df)
  lookup_column return_column1  return_column2 return_column4  return_column5
0          key1      (1, 2, 3)             1.0      (1, 2, 3)             1.0
1          key2            NaN             NaN         (nan,)             NaN
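
Another possible route, not taken from the answer above, is to reduce the dictionary to first elements before mapping. This sketch assumes every value in d is a non-empty tuple; first_items is an illustrative name.

```python
import pandas as pd

d = {'key1': (1, 2, 3)}
df = pd.DataFrame({'lookup_column': ['key1', 'key2']})

# Keep only the first element of each tuple, then map as usual.
# Keys that are missing from the dictionary become NaN.
first_items = {key: value[0] for key, value in d.items()}
df['return_column'] = df['lookup_column'].map(first_items)
print(df)
```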

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. It works when I do it on a single fixed column name, but when I try passing the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row,c):
    if row[c] >=0 and row[c] <=1:
        return 'c'
    elif row[c] >1 and row[c] <=2:
        return 'b'
    else:
        return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2

You don't need to pass the column name to apply. Since you only want to check whether each value falls in a certain range and return a, b or c accordingly, you can make the following changes:
def result(val):
    if 0<=val<=1:
        return 'c'
    elif 1<val<=2:
        return 'b'
    return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
A faster way is np.select:
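import numpy as np
values = ['c', 'b']
for col in df.columns:
    # Build the conditions element-wise: a chained comparison such as
    # 0 <= df[col] <= 1 raises a ValueError on a Series.
    conditions = [df[col].between(0, 1), (df[col] > 1) & (df[col] <= 2)]
    df[col] = np.select(conditions, values, default='a')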
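
As for the TypeError in the question: args is expected to be a tuple, and (c) is just the string c in parentheses, so its characters end up as separate positional arguments, which is consistent with the "21 given" message. A minimal sketch of the original row-wise approach with a one-element tuple (c,), writing to a separate frame (called labels here for illustration) instead of overwriting df:

```python
import pandas as pd

def result(row, c):
    # Classify the value found in column c of this row.
    if 0 <= row[c] <= 1:
        return 'c'
    elif 1 < row[c] <= 2:
        return 'b'
    return 'a'

df = pd.DataFrame({'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]})

labels = pd.DataFrame(index=df.index)
for c in df.columns:
    # Note the trailing comma: (c,) is a tuple, (c) is not.
    labels[c] = df.apply(result, args=(c,), axis=1)
print(labels)
```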

How to use the values of one column to access values in another column?

How can I use the values of one column to access values in another column?
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I access the value given by 'bleh' for each row?
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DataFrame constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values

Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
         Value
bleh
0    -1.085631
1     0.997345
2     0.282978
1    -1.506295
4    -0.578600
0     1.651437
0    -2.426679
4    -0.428913
1     1.265936
7    -0.866740
If so, your dataframe constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter of DataFrame takes a list, not a list of lists.
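
The effect of the extra brackets can be seen by inspecting the column index. A small sketch with made-up values (the names values, nested and flat are illustrative):

```python
import numpy as np
import pandas as pd

values = np.arange(3.0)  # made-up data

# A list of lists is interpreted as levels of a MultiIndex.
nested = pd.DataFrame(values, columns=[['Value']])
# A plain list gives an ordinary column index.
flat = pd.DataFrame(values, columns=['Value'])

print(type(nested.columns).__name__)
print(type(flat.columns).__name__)

# With the flat index, selecting the column yields a Series.
print(flat['Value'].tolist())
```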
