Pandas: select multiple rows or default with new API - pandas

I need to retrieve multiple rows (which could be duplicated) and, if an index label does not exist, get a default value. An example with a Series:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
labels = ['a', 'd', 'f']
result = s.loc[labels]
result = result.fillna(my_default_value)
Now I'm using a DataFrame; an equivalent example with names is:
df = pd.DataFrame({
    "Person": {
        "name_1": "Genarito",
        "name_2": "Donald Trump",
        "name_3": "Joe Biden",
        "name_4": "Pablo Escobar",
        "name_5": "Dalai Lama"
    }
})
default_value = 'No name'
names_to_retrieve = ['name_1', 'name_2', 'name_8', 'name_3']
result = df.loc[names_to_retrieve]
result = result.fillna(default_value)
Both examples throw a warning saying:
FutureWarning: Passing list-likes to .loc or [] with any missing
label will raise KeyError in the future, you can use .reindex() as an
alternative.
The documentation for the issue says you should use reindex, but it also says that it won't work with duplicates...
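For illustration, a plain reindex on the duplicated index above fails (the exact error message may vary by pandas version):
s.reindex(labels)
# ValueError: cannot reindex from a duplicate axis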
Is there any way to do this without warnings when the index has duplicates?
Thanks in advance

Let's try merge:
result = (pd.DataFrame({'label': labels})
            .merge(s.to_frame(name='x'), left_on='label',
                   right_index=True, how='left')
            .set_index('label')['x'])
Output:
label
a 0.0
a 1.0
d NaN
f NaN
Name: x, dtype: float64
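Then, as in the question, the missing labels can be filled with the default:
result = result.fillna(my_default_value)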

How about:
on_values = s.loc[s.index.intersection(labels).unique()]
off_values = pd.Series(default_value, index=pd.Index(labels).difference(s.index))
result = pd.concat([on_values, off_values])
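With the s and labels from the question, and default_value = 0 assumed for illustration, the result should look roughly like:
a    0
a    1
d    0
f    0
dtype: int64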

Check isin with append
out = s[s.index.isin(labels)]
out = out.append(pd.Series(index=set(labels)-set(s.index),dtype='float').fillna(0))
out
Out[341]:
a 0.0
a 1.0
d 0.0
f 0.0
dtype: float64
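Note that Series.append was deprecated and later removed in newer pandas; the same idea can be written with pd.concat (a minimal sketch):
out = pd.concat([
    s[s.index.isin(labels)],
    pd.Series(0, index=list(set(labels) - set(s.index)), dtype='float')
])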

You can write a simple function that handles the labels present in the index and the labels missing from it separately, then concatenates. When True, the in_order argument ensures that if you specify labels = ['d', 'a', 'f'], the output is ordered ['d', 'a', 'f'].
def reindex_with_dup(s: pd.Series or pd.DataFrame, labels, fill_value=np.nan, in_order=True):
    labels = pd.Series(labels)
    s1 = s.loc[labels[labels.isin(s.index)]]
    if isinstance(s, pd.Series):
        s2 = pd.Series(fill_value, index=labels[~labels.isin(s.index)])
    if isinstance(s, pd.DataFrame):
        s2 = pd.DataFrame(fill_value, index=labels[~labels.isin(s.index)],
                          columns=s.columns)
    s = pd.concat([s1, s2])
    if in_order:
        s = s.loc[labels.drop_duplicates()]
    return s
reindex_with_dup(s, ['d', 'a', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#f foo
#dtype: object
This retains the .loc behavior: if both your index and your labels contain duplicates, the selection is duplicated accordingly:
reindex_with_dup(s, ['d', 'a', 'a', 'f', 'f'], fill_value='foo')
#d foo
#a 0
#a 1
#a 0
#a 1
#f foo
#f foo
#dtype: object
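The same function should also work for the DataFrame example from the question (a sketch, output not shown here):
result = reindex_with_dup(df, names_to_retrieve, fill_value=default_value)
# 'name_8' gets the 'No name' fill value in the Person column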

Related

Retrieving only one element of a tuple when the tuple is the value of a dictionary

I am trying to map a column of my df with a dictionary, where the dictionary contains tuples as values. I want to be able to only return the first value of the tuple in the output column. Is there a way to do that?
The situation:
d = {'key1': (1, 2, 3)}
df['lookup_column'] = 'key1'
df['return_column'] = df['lookup_column'].map(d)
Output:
df['return_column'] = (1, 2, 3)
Adding this returns an error:
df['return_column'] = df['return_column'][0]
Running this instead also returns an error:
df['return_column'] = df['lookup_column'].map(d[0])
The desired outcome:
df['return_column'] = 1
Thank you!
Use .str to take the first element of an iterable (here a tuple) - it returns NaN if there is no match:
df['return_column'] = df['return_column'].str[0]
All together:
df = pd.DataFrame({'lookup_column':['key1','key2']})
d = {'key1': (1, 2, 3)}
df['return_column1'] = df['lookup_column'].map(d)
df['return_column2'] = df['lookup_column'].map(d).str[0]
A second alternative uses dict.get with a default value for when there is no match; since the output is a tuple, it is possible to use the tuple (np.nan,) as that default:
df['return_column4'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,)))
df['return_column5'] = df['lookup_column'].map(lambda x: d.get(x, (np.nan,))[0])
print (df)
lookup_column return_column1 return_column2 return_column4 return_column5
0 key1 (1, 2, 3) 1.0 (1, 2, 3) 1.0
1 key2 NaN NaN (nan,) NaN
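A further variant (a sketch, not from the answer above) is to map with a dict that already holds only the first elements:
first = {k: v[0] for k, v in d.items()}
df['return_column3'] = df['lookup_column'].map(first)  # NaN where there is no match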

New dataframe creation within loop and append of the results to the existing dataframe

I am trying to create conditional subsets of rows and columns from a DataFrame and append them to the existing dataframes that match the structure of the subsets. New subsets of data would need to be stored in the smaller dataframes and names of these smaller dataframes would need to be dynamic. Below is an example.
#Sample Data
df = pd.DataFrame({'a': [1,2,3,4,5,6,7], 'b': [4,5,6,4,3,4,6], 'c': [1,2,2,4,2,1,7], 'd': [4,4,2,2,3,5,6], 'e': [1,3,3,4,2,1,7], 'f': [1,1,2,2,1,5,6]})

#Function to apply to create the subsets of data - I would need to apply a
#function like this to many combinations of columns
def f1(df, input_col1, input_col2):
    #Subset of rows
    t = df[df[input_col1] >= 3]
    #Subset of columns
    t = t[[input_col1, input_col2]]
    t = t.sort_values([input_col1], ascending=False)
    return t

#I want to create 3 different dataframes t1, t2, and t3, but I would like to
#create them in the loop - not via individual function calls.
#These individual calls are just examples of what I am trying to achieve via the loop:
#t1=f1(df, 'a', 'b')
#t2=f1(df, 'c', 'd')
#t3=f1(df, 'e', 'f')

#These are empty dataframes to which I would like to append the resulting
#subsets of data
column_names = ['col1', 'col2']
g1 = pd.DataFrame(np.empty(0, dtype=[('col1', 'f8'), ('col2', 'f8')]))
g2 = pd.DataFrame(np.empty(0, dtype=[('col1', 'f8'), ('col2', 'f8')]))
g3 = pd.DataFrame(np.empty(0, dtype=[('col1', 'f8'), ('col2', 'f8')]))
list1 = ['a', 'c', 'e']
list2 = ['b', 'd', 'f']
t = {}
g = {}

#This is what I want in the end - I would like to call the function inside of
#the loop, create new dataframes dynamically and then append them to the
#existing dataframes, but I am getting errors. Is it possible to do?
for c in range(1, 4, 1):
    for i, j in zip(list1, list2):
        t['t' + str(c)] = f1(df, i, j)
        g['g' + str(c)] = g['g' + str(c)].append(t['t' + str(c)], ignore_index=True)
I guess you want to create t1,t2,t3 dynamically.
You can use globals().
g1 = pd.DataFrame(np.empty(0, dtype=[('a', 'f8'), ('b', 'f8')]))
g2 = pd.DataFrame(np.empty(0, dtype=[('c', 'f8'), ('d', 'f8')]))
g3 = pd.DataFrame(np.empty(0, dtype=[('e', 'f8'), ('f', 'f8')]))
list1 = ['a', 'c', 'e']
list2 = ['b', 'd', 'f']
for c in range(1, 4, 1):
    globals()['t' + str(c)] = f1(df, list1[c-1], list2[c-1])
    globals()['g' + str(c)] = globals()['g' + str(c)].append(globals()['t' + str(c)])
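As a side note (not part of the answer above), a dict keyed by name avoids globals() entirely; a minimal sketch reusing f1, df, list1 and list2 from the question:
results = {}
for c, (i, j) in enumerate(zip(list1, list2), start=1):
    results['t' + str(c)] = f1(df, i, j)
# results['t1'], results['t2'], results['t3'] now hold the three subsets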

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. When I do it on just a single fixed column name it works, but when I try passing the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row, c):
    if row[c] >= 0 and row[c] <= 1:
        return 'c'
    elif row[c] > 1 and row[c] <= 2:
        return 'b'
    else:
        return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args=(c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2
You don't need to pass the column name to apply, since you only want to check whether the values of each column fall in a certain range and return 'a', 'b' or 'c'. You can make the following changes.
def result(val):
    if 0 <= val <= 1:
        return 'c'
    elif 1 < val <= 2:
        return 'b'
    return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
A faster way is np.select:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    df[col] = np.select([(df[col] >= 0) & (df[col] <= 1), (df[col] > 1) & (df[col] <= 2)], values, default='a')
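With the sample df above, both approaches should give roughly:
  c1 c2
0  c  a
1  b  c
2  c  c
3  c  b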

How can I select rows from one DataFrame, where a part of the row's index is in another DataFrame's index and meets certain criteria?

I have two DataFrames. df provides a lot of data. test_df describes whether certain tests have passed or not. I need to select from df only the rows where the tests have not failed by looking up this info in test_df. So far, I'm able to reduce my test_df to passed_tests. So, what's left is to select only the rows from df where the relevant part of the row index is in passed_tests. How can I do that?
Updates:
test_df doesn't have unique rows. Where there are duplicate rows (and there may be more than one duplicate), the test that was the most positive takes priority, i.e. True > Ok > False.
My code:
import pandas as pd
import numpy as np
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Ok', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
print test_df
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux', 'qux']),
np.array(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']),
np.array(['1', '2', '1', '2', '1', '2', '1', '2'])]
data = np.random.randn(8, 2)
columns = ["X", "Y"]
df = pd.DataFrame(data, index=index, columns=columns)
print df
passed_tests = test_df.loc[test_df['Passed?'].isin(['True', 'Ok'])]
print passed_tests
df
X Y
foo a 1 0.589776 -0.234717
2 0.105161 1.937174
b 1 -0.092252 0.143451
2 0.939052 -0.239052
qux a 1 0.757239 2.836032
2 -0.445335 1.352374
b 1 2.175553 -0.700816
2 1.082709 -0.923095
test_df
Passed?
foo a False
a True
b False
b False
qux a False
b Ok
b False
passed_tests
Passed?
foo a True
qux b Ok
required solution
X Y
foo a 1 0.589776 -0.234717
2 0.105161 1.937174
qux b 1 2.175553 -0.700816
2 1.082709 -0.923095
You need reindex with method='ffill', then check the values with isin, and finally use boolean indexing:
print (test_df.reindex(df.index, method='ffill'))
Passed?
foo a 1 True
2 True
b 1 False
2 False
qux a 1 False
2 False
b 1 Ok
2 Ok
mask = test_df.reindex(df.index, method='ffill').isin(['True', 'Ok'])['Passed?']
print (mask)
foo a 1 True
2 True
b 1 False
2 False
qux a 1 False
2 False
b 1 True
2 True
Name: Passed?, dtype: bool
print (df[mask])
X Y
foo a 1 -0.580448 -0.168951
2 -0.875165 1.304745
qux b 1 -0.147014 -0.787483
2 0.188989 -1.159533
EDIT:
To remove the duplicates it is easier to:
get the columns out of the MultiIndex with reset_index
sort_values - the Passed? column descending, the first and second columns ascending
drop_duplicates - keep only the first value
set_index to restore the MultiIndex
rename_axis to remove the index names
test_df = (test_df.reset_index()
                  .sort_values(['level_0', 'level_1', 'Passed?'], ascending=[1, 1, 0])
                  .drop_duplicates(['level_0', 'level_1'])
                  .set_index(['level_0', 'level_1'])
                  .rename_axis([None, None]))
print (test_df)
Passed?
foo a True
b False
qux a False
b Ok
Another solution is simpler - sort first and then groupby with first:
test_df = (test_df.sort_values('Passed?', ascending=False)
                  .groupby(level=[0, 1])
                  .first())
print (test_df)
Passed?
foo a True
b False
qux a False
b Ok
EDIT1:
Convert values to ordered Categorical.
index = [np.array(['foo', 'foo', 'foo', 'foo', 'qux', 'qux', 'qux']), np.array(['a', 'a', 'b', 'b', 'a', 'b', 'b'])]
data = np.array(['False', 'True', 'False', 'False', 'False', 'Acceptable', 'False'])
columns = ["Passed?"]
test_df = pd.DataFrame(data, index=index, columns=columns)
#print (test_df)
cat = ['False', 'Acceptable','True']
test_df["Passed?"] = test_df["Passed?"].astype('category', categories=cat, ordered=True)
print (test_df["Passed?"])
foo a False
a True
b False
b False
qux a False
b Acceptable
b False
Name: Passed?, dtype: category
Categories (3, object): [False < Acceptable < True]
test_df = test_df.sort_values('Passed?', ascending=False).groupby(level=[0,1]).first()
print (test_df)
Passed?
foo a True
b False
qux a False
b Acceptable
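Side note: in newer pandas the categories/ordered keywords of astype were removed, so the Categorical step above can be written with pd.CategoricalDtype (a sketch reusing the cat list defined above):
cat_dtype = pd.CategoricalDtype(categories=cat, ordered=True)
test_df["Passed?"] = test_df["Passed?"].astype(cat_dtype)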

Aggregate/Remove duplicate rows in DataFrame based on swapped index levels

Sample input
import pandas as pd
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
Which looks like this:
value
from to type
A B 1 5
B C 2 2
A 1 1
C B 1 3
Goal
I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (from, to, type), if there exists a row (to, from, type), the value of the second row should be added to the first row and the second row be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and dropped, leading to the following desired result.
Sample result
value
from to type
A B 1 6
B C 2 2
C B 1 3
This is my best try so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from': 'to', 'to': 'from'})
df_both = df.join(df3.set_index(['from', 'to', 'type']),
                  rsuffix='_b').sum(axis=1)

# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a, b, t in df_both.index:
    if (b, a, t) in df_both.index and not (b, a, t) in rows_to_keep:
        rows_to_keep.append((a, b, t))
        rows_to_remove.append((b, a, t))

df_final = df_both.drop(rows_to_remove)
df_final
Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?
Not sure how much better this is, but it's certainly different:
import pandas as pd
from collections import Counter
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
ls = df.to_records()
ls = list(ls)
ls2 = []
for l in ls:
    i = 0
    while i <= l[3]:
        ls2.append(list(l)[:3])
        i += 1

counted = Counter(tuple(sorted(entry)) for entry in ls2)
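As an aside, here is a minimal sketch of a different way to fold the swapped pairs together (assuming the same sample df as above; not part of the answer above): group on an order-independent (from, to) key and keep the orientation of the first occurrence.
import pandas as pd

df = pd.DataFrame([
    ['A', 'B', 1, 5],
    ['B', 'C', 2, 2],
    ['B', 'A', 1, 1],
    ['C', 'B', 1, 3]],
    columns=['from', 'to', 'type', 'value'])

# Order-independent key: identical for (from, to) and (to, from).
key = df[['from', 'to']].apply(lambda r: tuple(sorted(r)), axis=1)

# Sum values within each (key, type) group, then keep only the first
# occurrence of each group so the original row's orientation survives.
out = (df.assign(key=key,
                 value=df.groupby([key, df['type']])['value'].transform('sum'))
         .drop_duplicates(subset=['key', 'type'])
         .drop(columns='key')
         .set_index(['from', 'to', 'type']))
print(out)
#               value
# from to type
# A    B  1         6
# B    C  2         2
# C    B  1         3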