Retrieving only one element of a tuple when the tuple is the value of a dictionary - pandas

I am trying to map a column of my df with a dictionary, where the dictionary contains tuples as values. I want to be able to only return the first value of the tuple in the output column. Is there a way to do that?
The situation:
d = {'key1': (1, 2, 3)}
df['lookup_column'] = 'key1'
df['return_column'] = df['lookup_column'].map(d)
Output:
df['return_column'] = (1, 2, 3)
Adding this returns an error:
df['return_column'] = df['return_column'][0]
Running this instead also returns an error:
df['return_column'] = df['lookup_column'].map(d[0])
The desired outcome:
df['return_column'] = 1
Thank you!

Use str to take the first element of each iterable (here a tuple) - it returns NaN if there is no match:
df['return_column'] = df['return_column'].str[0]
All together:
df = pd.DataFrame({'lookup_column':['key1','key2']})
d = {'key1': (1, 2, 3)}
df['return_column1'] = df['lookup_column'].map(d)
df['return_column2'] = df['lookup_column'].map(d).str[0]
A second alternative uses dict.get with a default value for missing keys; because the output is a tuple, the default can also be a tuple (np.nan,):
df['return_column4'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,)))
df['return_column5'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,))[0])
print (df)
lookup_column return_column1 return_column2 return_column4 return_column5
0 key1 (1, 2, 3) 1.0 (1, 2, 3) 1.0
1 key2 NaN NaN (nan,) NaN
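If only the first element is ever needed, another option (a sketch, not from the original answer) is to extract it in the mapping dictionary itself with a comprehension; missing keys still map to NaN:

```python
import pandas as pd

d = {'key1': (1, 2, 3)}
df = pd.DataFrame({'lookup_column': ['key1', 'key2']})

# Build a key -> first-element dictionary once, then map as usual.
first = {k: v[0] for k, v in d.items()}
df['return_column'] = df['lookup_column'].map(first)
print(df)
```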

Related

Drop columns according to header value

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
                  skiprows=0,
                  header=[0, 1, 2, 3],
                  index_col=0,
                  parse_dates=True)
I would like to remove the columns 01090BL and 01100MS. The idea, in the main program, is to have a list of the columns that I want to remove and then drop them. I have, consequently, done as follows:
2bremoved = ['01090BL', '01100MS']
dfr = dfr.drop(2bremoved, axis=1, inplace=True)
but I get the following error:
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
/usr/lib/python3/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I have thus done the following:
aa = dfr.drop(2bremoved, axis=1, inplace=True,level = 0)
but I get an empty dataframe. What am I missing?
thanks
Don't use inplace=True when assigning the output; also, a variable name cannot start with a digit in Python:
to_remove = ['01090BL', '01100MS']
aa = dfr.drop(to_remove, axis=1, level=0)
Output:
name 00590BL 02200MS
lat 613297 616720
long 5185127 5181393
elv 1833 1499
1956-01-01 1 -2
1956-01-02 2 -1
1956-01-03 3 0
1956-01-04 4 1
1956-01-05 5 2
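A self-contained sketch of the same fix, with the four header rows built inline as a column MultiIndex instead of read from a file (values abridged from the question):

```python
import pandas as pd

# Four header rows -> a 4-level column MultiIndex, as read_csv(header=[0,1,2,3]) produces.
cols = pd.MultiIndex.from_tuples([
    ('00590BL', '613297', '5185127', '1833'),
    ('01090BL', '626278', '5188418', '1915'),
    ('01100MS', '626323', '5188431', '1915'),
    ('02200MS', '616720', '5181393', '1499'),
], names=['name', 'lat', 'long', 'elv'])
dfr = pd.DataFrame([[1, 3, 3, -1], [2, 4, 4, 0]], columns=cols)

to_remove = ['01090BL', '01100MS']
# level=0 targets the top header row; assign the result rather than using inplace=True.
aa = dfr.drop(to_remove, axis=1, level=0)
print(aa.columns.get_level_values(0).tolist())
```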

Pandas: select multiple rows or default with new API

I need to retrieve multiples rows (which could be duplicated) and if the index does not exist get a default value. An example with Series:
s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
labels = ['a', 'd', 'f']
result = s.loc[labels]
result = result.fillna(my_default_value)
Now, I'm using DataFrame, an equivalent with names is:
df = pd.DataFrame({
"Person": {
"name_1": "Genarito",
"name_2": "Donald Trump",
"name_3": "Joe Biden",
"name_4": "Pablo Escobar",
"name_5": "Dalai Lama"
}
})
default_value = 'No name'
names_to_retrieve = ['name_1', 'name_2', 'name_8', 'name_3']
result = df.loc[names_to_retrieve]
result = result.fillna(default_value)
With both examples it's throwing a warning saying:
FutureWarning: Passing list-likes to .loc or [] with any missing
label will raise KeyError in the future, you can use .reindex() as an
alternative.
The documentation for that warning says to use reindex, but reindex doesn't work with duplicates...
Is there any way to work without warnings and duplicated indexes?
Thanks in advance
Let's try merge:
result = (pd.DataFrame({'label': labels})
            .merge(s.to_frame(name='x'), left_on='label',
                   right_index=True, how='left')
            .set_index('label')['x'])
Output:
label
a 0.0
a 1.0
d NaN
f NaN
Name: x, dtype: float64
How about:
on_values = s.loc[s.index.intersection(labels).unique()]
off_values = pd.Series(default_value, index=pd.Index(labels).difference(s.index))
result = pd.concat([on_values, off_values])
Check isin, then add the missing labels (Series.append was removed in pandas 2.0, so pd.concat is used here):
out = s[s.index.isin(labels)]
out = pd.concat([out, pd.Series(0.0, index=list(set(labels) - set(s.index)))])
out
out
Out[341]:
a 0.0
a 1.0
d 0.0
f 0.0
dtype: float64
You can write a simple function to handle the rows in labels and missing from labels separately, then join. When True the in_order argument will ensure that if you specify labels = ['d', 'a', 'f'], the output is ordered ['d', 'a', 'f'].
def reindex_with_dup(s, labels, fill_value=np.nan, in_order=True):
    labels = pd.Series(labels)
    s1 = s.loc[labels[labels.isin(s.index)]]
    if isinstance(s, pd.Series):
        s2 = pd.Series(fill_value, index=labels[~labels.isin(s.index)])
    if isinstance(s, pd.DataFrame):
        s2 = pd.DataFrame(fill_value, index=labels[~labels.isin(s.index)],
                          columns=s.columns)
    s = pd.concat([s1, s2])
    if in_order:
        s = s.loc[labels.drop_duplicates()]
    return s
reindex_with_dup(s, ['d', 'a', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#f foo
#dtype: object
This retains the .loc behavior that if your index is duplicated and your labels are duplicated it duplicates the selection:
reindex_with_dup(s, ['d', 'a', 'a', 'f', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#a 0
#a 1
#f foo
#f foo
#dtype: object

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. When I run it on a single fixed column name it works, but when I try to pass the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row, c):
    if row[c] >= 0 and row[c] <= 1:
        return 'c'
    elif row[c] > 1 and row[c] <= 2:
        return 'b'
    else:
        return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2
You don't need to pass the column name to apply here. Since you only want to check whether each value falls in a certain range and return 'a', 'b' or 'c', you can apply the function element-wise, column by column:
def result(val):
    if 0 <= val <= 1:
        return 'c'
    elif 1 < val <= 2:
        return 'b'
    return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
A faster way is np.select; note that a chained comparison such as 0 <= df[col] <= 1 does not work on a Series, so the conditions are written with & and parentheses:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    df[col] = np.select([(df[col] >= 0) & (df[col] <= 1),
                         (df[col] > 1) & (df[col] <= 2)],
                        values, default='a')
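As a compact end-to-end check using the sample frame from the question, the element-wise version can also run over all columns in a single call rather than a loop:

```python
import pandas as pd

def result(val):
    # Same thresholds as in the question: [0, 1] -> 'c', (1, 2] -> 'b', else 'a'.
    if 0 <= val <= 1:
        return 'c'
    elif 1 < val <= 2:
        return 'b'
    return 'a'

df = pd.DataFrame({'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]})

# DataFrame.apply hands each column to the lambda; Series.map applies result element-wise.
out = df.apply(lambda col: col.map(result))
print(out)
```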

How to use the values of one column to access values in another column?

How to use the values of one column to access values in another
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
so how to access the value 'bleh' for each row?
df.Value.iloc[df['bleh']]
Edit:
Thanks to #ScottBoston. My DF constructor had one layer of [] too much.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
Value
bleh
0 -1.085631
1 0.997345
2 0.282978
1 -1.506295
4 -0.578600
0 1.651437
0 -2.426679
4 -0.428913
1 1.265936
7 -0.866740
If so, your dataframe constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter of DataFrame takes a list, not a list of lists.
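If the goal is simply "for each row, use bleh as a positional index into Value", a direct NumPy lookup (a sketch, not from the original answers) avoids the intermediate helper columns:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.normal(0, 1, 10), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])

# Row i receives the Value stored at position df['bleh'][i].
df['looked_up'] = df['Value'].to_numpy()[df['bleh'].to_numpy()]
print(df[['Value', 'bleh', 'looked_up']])
```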

How to assign a column to dataframe as weights for each row and then sample the dataframe according to those weights?

I am trying to implement a weighted random selection in a dataframe. I used the code below to build the dataframe:
import pandas as pd
from numpy import exp
import random
moves = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4)]
data = {'moves': list(map(lambda i: moves[i] if divmod(i, len(moves))[0] != 1 else moves[divmod(i, len(moves))[1]],
[i for i in range(2 * len(moves))])),
'player': list(map(lambda i: 1 if i >= len(moves) else 2,
[i for i in range(2 * len(moves))])),
'wins': [random.randint(0, 2) for i in range(2 * len(moves))],
'playout_number': [random.randint(0,1) for i in range(2 * len(moves))]
}
frame = pd.DataFrame(data)
and then I created a list and inserted it as the new column 'weight':
total = sum(map(lambda a, b: exp(a/b) if b != 0 else 0, frame['wins'], frame['playout_number']))
weights = list(map(lambda a, b: exp(a/b) / total if b != 0 else 0, frame['wins'], frame['playout_number']))
frame = frame.assign(weight=weights)
Now I want to select a random row based on each row's weight in the new column inserted.
The problem is that I want to use pandas.DataFrame.sample(weights=weight), But I don't know how. I can do that with numpy.random.choice(weights=weights), But I'd prefer keep using pandas library functions.
I appreciate any help in advance.
You can use the parameters n or frac together with weights in sample.
The weights parameter accepts an array-like, so it is possible to pass the list directly:
df = frame.sample(n=1, weights=weights)
Or a column of the DataFrame (a Series):
#select 1 row - n=1
df = frame.sample(n=1, weights=frame.weight)
print (df)
moves player playout_number wins weight
6 (1, 2) 1 1 2 0.258325
#select 20% rows - frac=0.2
df = frame.sample(frac=0.2, weights=frame.weight)
print (df)
moves player playout_number wins weight
5 (2, 4) 2 1 2 0.221747
4 (2, 3) 2 1 1 0.081576
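DataFrame.sample also accepts the weights column by name when sampling rows, so after frame = frame.assign(weight=weights) there is no need to pass the Series explicitly (a small sketch with made-up weights):

```python
import pandas as pd

frame = pd.DataFrame({'moves': [(1, 2), (1, 3), (1, 4)],
                      'weight': [0.2, 0.3, 0.5]})

# Passing the column name as a string is equivalent to weights=frame['weight'].
picked = frame.sample(n=1, weights='weight', random_state=0)
print(picked)
```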