Get a True/False boolean list from a pandas DataFrame row based on a condition

I am working with several Pandas DataFrames and I need the following filtering:
Suppose I get a list like
L=['EP6','EP3','EP2']
I need to get the following vector for a row:
for the row 'concept 1': True where the column index is in L, False where it is not.
I am trying:
# D being the DataFrame
L=['EP6', 'EP3','EP2']
[True for ind in D.columns if ind in L ]
But I only get [True, True, True]
I need the complete list like:
desired_result = [0,0,0,0,1,0,0,1,1,0]
Note: the 1s in the desired result have nothing to do with the 1s the DataFrame is populated with.
Thanks

Pandas has isin for this:
D.columns.isin(L)

Here you built a filter: you yield True whenever ind is in L, and otherwise you do not yield an element at all.
What you want is a mapping. You can still use a list comprehension, but the condition should go in the expression part, not the filter part:
[ind in L for ind in D.columns]
or if you want integers:
[int(ind in L) for ind in D.columns]
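A minimal sketch of both variants, using a hypothetical DataFrame D whose columns only partially overlap with L:

import pandas as pd

# Hypothetical DataFrame; only the column labels matter here.
D = pd.DataFrame(columns=['EP1', 'EP2', 'EP3', 'EP4', 'EP5', 'EP6'])
L = ['EP6', 'EP3', 'EP2']

mask = D.columns.isin(L)                      # array([False, True, True, False, False, True])
as_ints = [int(c in L) for c in D.columns]    # [0, 1, 1, 0, 0, 1]
print(mask)
print(as_ints)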

Related

I tried to add a filter for the countries but it gives me an error [duplicate]

I have a dataframe with a lot of columns in it. Now I want to select only certain columns. I have saved all the names of the columns that I want to select into a Python list and now I want to filter my dataframe according to this list.
I've been trying to do:
df_new = df[[list]]
where list includes all the column names that I want to select.
However I get the error:
TypeError: unhashable type: 'list'
Any help on this one?
You can remove one []:
df_new = df[list]
It is also better to use a name other than list (which shadows the built-in), e.g. L:
df_new = df[L]
It looks like it works; I would only simplify it:
L = []
for x in df.columns:
    if "_" not in x[-3:]:
        L.append(x)
print(L)
List comprehension:
print([x for x in df.columns if "_" not in x[-3:]])
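A minimal sketch of the fix, assuming a small made-up DataFrame and list of column names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
L = ['a', 'c']

# df[[L]] raises TypeError: unhashable type: 'list',
# because the inner brackets ask pandas to look up the list object itself as a single label.
df_new = df[L]      # equivalent to df[['a', 'c']]
print(df_new)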

Advanced condition lookup in pandas (numpy)

Given:
a list of elements ls and a big DataFrame df; every element of ls occurs somewhere in df.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
['a0','b0','c1'],
['a0','b0','c2'],
...
['a_i','b_j','c_k']]
Goal:
I want to collect the rows of df that contain the most elements of ls; for example, ['a0','b0','c0'] would be the best one. But some rows contain at most only 2 matching elements.
Tried:
I tried enumerating 3 or 2 elements of ls, but it was too expensive, and it may return nothing since some rows contain only 2 matching elements.
I tried to use a dictionary to count, but it didn't work either.
I've been puzzling over this problem all day, any help will be greatly appreciated.
I would go like this:
counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)
row_id = counts.idxmax()
This gives you, for each row, the number of entries that appear in the list; idxmax then returns the index of the row with the most matches.
The desired row can be obtained with:
df.loc[row_id, :]
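A small end-to-end sketch of this approach, with a toy df and ls made up for illustration; keeping every row tied for the maximum also covers the "rows set" part of the question:

import pandas as pd

df = pd.DataFrame([['a0', 'b0', 'c0'],
                   ['a0', 'b0', 'c1'],
                   ['a9', 'b9', 'c2']], columns=['A', 'B', 'C'])
ls = ['a0', 'b0', 'c0', 'c1']

counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)  # matches per row: [3, 3, 0]
best = df.loc[counts == counts.max()]                  # all rows tied for the maximum count
print(best)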

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a': [['1234', 'abc', '444'],
                        ['5678'],
                        ['2468', 'def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
And I gave an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
                  a second
0  [1234, abc, 444]    abc
1            [5678]    NaN
2       [2468, def]    def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
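If you do want the np.where route from the question, the second argument must be an array-like of values, not a callable; the str accessor from the accepted answer can supply it. A sketch, assuming the same t as above:

import numpy as np

# t['a'].str[1] already yields the second element where it exists and NaN otherwise,
# so np.where here only makes the length check and the NaN fallback explicit.
t['element_two'] = np.where(t['a'].str.len() > 1, t['a'].str[1], np.nan)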

pandas merge multiple dataframes

For example: I have multiple dataframes. Each data frame has columns: variable_code, variable_description, year.
df1:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
df2:
variable_code, variable_description
N1, Number of returns
NUMDEP, # of dependent
I want to merge these two dataframes to get all variable_codes in both df1 and df2.
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
NUMDEP, # of dependent
There is documentation for merge in the pandas docs.
Since your columns you want to merge on are both called "variable_code" then you can use on='variable_code'
so the whole thing would be:
df1.merge(df2, on='variable_code')
You can specify how='outer' if you want to keep rows that appear in only one of the tables (with blanks for the missing values); use how='inner' if you want only data that is in both tables (no blanks). Since you want all variable_codes from both frames, how='outer' is what you need here.
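A minimal sketch of that merge for the frames in the question; merging on both columns with how='outer' keeps every code and avoids duplicated description columns:

import pandas as pd

df1 = pd.DataFrame({'variable_code': ['N1', 'N2'],
                    'variable_description': ['Number of returns', 'Number of Exemptions']})
df2 = pd.DataFrame({'variable_code': ['N1', 'NUMDEP'],
                    'variable_description': ['Number of returns', '# of dependent']})

merged = df1.merge(df2, on=['variable_code', 'variable_description'], how='outer')
print(merged)
# roughly:
#   variable_code  variable_description
# 0            N1     Number of returns
# 1            N2  Number of Exemptions
# 2        NUMDEP        # of dependent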
To achieve what you want, try this:
import pandas as pd
from functools import reduce  # needed for the reduce call below

# Create the first dataframe through a dictionary - several other possibilities exist.
data1 = {'variable_code': ['N1', 'N2'], 'variable_description': ['Number of returns', 'Number of Exemptions']}
df1 = pd.DataFrame(data=data1)
# Create the second dataframe.
data2 = {'variable_code': ['N1', 'NUMDEP'], 'variable_description': ['Number of returns', '# of dependent']}
df2 = pd.DataFrame(data=data2)
# Place the dataframes in a list.
dfs = [df1, df2]  # additional dfs can be added here
# You could loop over the list, merging the dfs, but here reduce and a lambda are used.
resultant_df = reduce(lambda left, right: pd.merge(left, right, on=['variable_code', 'variable_description'], how='outer'), dfs)
This gives:
>>> resultant_df
  variable_code  variable_description
0            N1     Number of returns
1            N2  Number of Exemptions
2        NUMDEP        # of dependent
There are several options available for how, each catering to different needs. outer, used here, also keeps the rows that appear in only one of the frames. See the docs for a detailed explanation of the other options.
First, concatenate df1 and df2 with
final_df = pd.concat([df1, df2])
Then convert the columns variable_code and variable_description into a dictionary, with variable_code as keys and variable_description as values:
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
Then convert d back into a dataframe:
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
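A quick sketch of that concat-and-dict route end to end, reusing the df1 and df2 built in the answer above; the dict keyed on variable_code is what drops the duplicate N1 row:

final_df = pd.concat([df1, df2])
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
print(d_df)
# roughly:
#   variable_code  variable_description
# 0            N1     Number of returns
# 1            N2  Number of Exemptions
# 2        NUMDEP        # of dependent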

pandas: Assign values to a slice a MultiIndex by range of secondary index

I have a problem with assigning a series-like object to a slice of a pandas DataFrame.
Maybe I'm not using the DataFrame the way it is intended, so some enlightenment will be greatly appreciated.
I've already read the following articles:
pandas: slice a MultiIndex by range of secondary index
Returning a view versus a copy
As far as I understand, invoking the slice with a single .loc call ensures I am not getting a copy of the data. The original dataframe does indeed get altered, but instead of the expected data I get NaN values.
See the appended code snippet.
Do I have to iterate over the desired section of the dataframe and call .set_value(row_idx, col_idx, val) for each single value I want to change?
kind regards and thanks in advance
Markus
In [1]: import pandas as pd
In [2]: mindex = pd.MultiIndex.from_product([['one','two'],['first','second']])
In [3]: dfmi = pd.DataFrame([list('abcd'),list('efgh'),list('ijkl'),list('mnop')],
...: index = mindex, columns=(['X','Y','Z','Q']))
In [4]: print(dfmi)
            X  Y  Z  Q
one first   a  b  c  d
    second  e  f  g  h
two first   i  j  k  l
    second  m  n  o  p
In [5]: dfmi.loc[('two',slice('first','second')),'X']
Out[5]:
two first i
second m
Name: X, dtype: object
In [6]: substitute = pd.Series(data=["ab","cd"], index= mindex.levels[1])
...: print(substitute)
first ab
second cd
dtype: object
In [7]: dfmi.loc[('two',slice('first','second')),'X'] = substitute
In [8]: print(dfmi)
              X  Y  Z  Q
one first     a  b  c  d
    second    e  f  g  h
two first   NaN  j  k  l
    second  NaN  n  o  p
What's happening is that substitute has an index, which determines where the values go, and dfmi.loc[('two',slice('first','second')),'X'] is also specifying such a location.
During the assignment pandas tries to align both indexes, and since they do not match (they would if substitute also had a matching MultiIndex), the result of the alignment is all NaNs, which is what gets inserted.
A solution could be to get rid of the index of substitute since the location of where you want to insert the values is already specified in the loc:
dfmi.loc[('two',slice('first','second')),'X'] = substitute.values
or even simpler, insert the values directly:
dfmi.loc[('two',slice('first','second')),'X'] = ["ab","cd"]
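For reference, a short sketch continuing the session above; after either assignment the frame ends up as expected:

dfmi.loc[('two', slice('first', 'second')), 'X'] = substitute.values
print(dfmi)
# roughly:
#              X  Y  Z  Q
# one first    a  b  c  d
#     second   e  f  g  h
# two first   ab  j  k  l
#     second  cd  n  o  p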
You can also try this, though note it uses chained indexing, which may raise SettingWithCopyWarning and is not guaranteed to modify the original frame:
dfmi.loc['two']['X'] = substitute