filter dataframe rows based on length of column values - pandas

I have a pandas dataframe as follows:
df = pd.DataFrame([ [1,2], [np.NaN,1], ['test string1', 5]], columns=['A','B'] )
df
              A  B
0             1  2
1           NaN  1
2  test string1  5
I am using pandas 0.20. What is the most efficient way to remove any row where any of its column values has a length > 10?
len('test string1')
12
So for the above e.g., I am expecting an output as follows:
df
     A  B
0    1  2
1  NaN  1

If based on column A
In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
     A  B
0    1  2
1  NaN  1
If based on all columns
In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
     A  B
0    1  2
1  NaN  1

I had to cast to a string for Diego's answer to work:
df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]
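If you need the same string-cast trick across every column, here is a minimal sketch (my own variant, not from the answers above; it treats NaN as the three-character string 'nan', which is harmless for a length filter):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])
# cast every cell to str, measure per-column string lengths, drop rows where any exceeds 10
too_long = df.astype(str).apply(lambda col: col.str.len().gt(10))
df = df[~too_long.any(axis=1)]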

In [42]: df
Out[42]:
              A  B                         C           D
0             1  2                         2  2017-01-01
1           NaN  1                       NaN  2017-01-02
2  test string1  5  test string1test string1  2017-01-03
In [43]: df.dtypes
Out[43]:
A            object
B             int64
C            object
D    datetime64[ns]
dtype: object
In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[44]:
     A  B    C           D
0    1  2    2  2017-01-01
1  NaN  1  NaN  2017-01-02
Explanation:
df.select_dtypes(['object']) selects only columns of object (str) dtype:
In [45]: df.select_dtypes(['object'])
Out[45]:
              A                         C
0             1                         2
1           NaN                       NaN
2  test string1  test string1test string1
In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
       A      C
0  False  False
1  False  False
2   True   True
now we can "aggregate" it as follows:
In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0    False
1    False
2     True
dtype: bool
Finally, we can select only those rows where the value is False:
In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
     A  B    C           D
0    1  2    2  2017-01-01
1  NaN  1  NaN  2017-01-02

Use the apply function of the Series to keep the rows you want:
df = df[df['A'].apply(lambda x: len(x) <= 10)]

Related

pandas change all rows with Type X if 1 Type X Result = 1

Here is a simple pandas df:
>>> df
   Type  Var1  Result
0     A     1     NaN
1     A     2     NaN
2     A     3     NaN
3     B     4     NaN
4     B     5     NaN
5     B     6     NaN
6     C     1     NaN
7     C     2     NaN
8     C     3     NaN
9     D     4     NaN
10    D     5     NaN
11    D     6     NaN
The object of the exercise is: if column Var1 = 3, set Result = 1 for all rows of that Type.
This finds the rows with 3 in Var1 and sets Result to 1 for them (and 0 elsewhere),
df['Result'] = df['Var1'].apply(lambda x: 1 if x == 3 else 0)
but I can't figure out how to then catch all the same Type and make them 1. In this case it should be all the As and all the Cs. Doesn't have to be a one-liner.
Any tips please?
Create a boolean mask and, to map True/False to 1/0, convert the values to integers:
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
#alternative
df['Result'] = np.where(df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']), 1, 0)
print (df)
   Type  Var1  Result
0     A     1       1
1     A     2       1
2     A     3       1
3     B     4       0
4     B     5       0
5     B     6       0
6     C     1       1
7     C     2       1
8     C     3       1
9     D     4       0
10    D     5       0
11    D     6       0
Details:
Get all Type values that match the condition:
print (df.loc[df['Var1'].eq(3), 'Type'])
2    A
8    C
Name: Type, dtype: object
Test the original Type column against the filtered types:
print (df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']))
0      True
1      True
2      True
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11    False
Name: Type, dtype: bool
Or use GroupBy.transform with 'any' to test whether at least one value in each group matches; this solution is slower for larger DataFrames:
df['Result'] = df['Var1'].eq(3).groupby(df['Type']).transform('any').astype(int)
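For clarity, the intermediate transform result is a boolean Series aligned to the original index, which is why a plain astype(int) turns it into the final column (illustration on the sample df above):
print (df['Var1'].eq(3).groupby(df['Type']).transform('any'))
0      True
1      True
2      True
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11    False
Name: Var1, dtype: bool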

splitting string columns while the first part before the splitting pattern is missing

I'm trying to split a string column into different columns and tried
How to split a column into two columns?
The pattern of the strings look like the following:
import pandas as pd
import numpy as np
>>> data = {'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']}
>>> df = pd.DataFrame(data=data)
         ab
0     a - b
1     a - b
2         b
3         c
4  whatever
>>> df['a'], df['b'] = df['ab'].str.split('-', n=1).str
         ab         a    b
0     a - b         a    b
1     a - b         a    b
2         b         b  NaN
3         c         c  NaN
4  whatever  whatever  NaN
The expected result is
         ab         a         b
0     a - b         a         b
1     a - b         a         b
2         b       NaN         b
3         c       NaN         c
4  whatever       NaN  whatever
The method I came up with is
df.loc[~ df.ab.str.contains(' - '), 'b'] = df['ab']
df.loc[~ df.ab.str.contains(' - '), 'a'] = np.nan
Is there a more generic/efficient way to do this task?
We can use str.extract as long as we know the specific strings to extract:
df.ab.str.extract(r"(a)?(?:\s-\s)?(b)?")
Out[47]:
     0    1
0    a    b
1    a    b
2  NaN    b
3    a  NaN
data used:
data = {'ab': ['a - b', 'a - b', 'b','a']}
df = pd.DataFrame(data=data)
With your edit, it seems your aim is to put anything that stands by itself in the second column. You could do:
df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")
Out[59]:
   0         1
0  a         b
1  a         b
2            b
3            c
4     whatever
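If NaN is preferred over the empty strings that the first capture group leaves behind, a small add-on (assuming numpy is imported as np, as in the question):
df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)").replace('', np.nan)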
I will be using get_dummies:
s=df['ab'].str.get_dummies(' - ')
s=s.mask(s.eq(1),s.columns.tolist()).mask(s.eq(0))
s
Out[7]:
     a  b
0    a  b
1    a  b
2  NaN  b
Update
df.ab.str.split(' - ',expand=True).apply(lambda x : pd.Series(sorted(x,key=pd.notnull)),axis=1)
Out[22]:
      0         1
0     a         b
1     a         b
2  None         b
3  None         c
4  None  whatever
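Another sketch without apply (my own variant, not from the answers above): split once, then swap the two parts on the rows where the second half is missing.
import numpy as np

parts = df['ab'].str.split(' - ', expand=True)   # column 0: left part, column 1: right part
lone = parts[1].isna()                           # rows that had no ' - ' separator
parts.loc[lone, 1] = parts.loc[lone, 0]          # move the lone value to the right
parts.loc[lone, 0] = np.nan
df['a'] = parts[0]
df['b'] = parts[1]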

Construct DataFrame from list of dicts

Trying to construct pandas DataFrame from list of dicts
List of dicts:
a = [{'1': 'A'},
     {'2': 'B'},
     {'3': 'C'}]
Pass list of dicts into pd.DataFrame():
df = pd.DataFrame(a)
Actual results:
     1    2    3
0    A  NaN  NaN
1  NaN    B  NaN
2  NaN  NaN    C
pd.DataFrame(a, columns=['Key', 'Value'])
Actual results:
   Key Value
0  NaN   NaN
1  NaN   NaN
2  NaN   NaN
Expected results:
  Key Value
0   1     A
1   2     B
2   3     C
try this,
from collections import ChainMap
data = dict(ChainMap(*a))
pd.DataFrame(data.items(), columns= ['Key','Value'])
O/P:
  Key Value
0   1     A
1   2     B
2   3     C
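Depending on the pandas version, passing the dict_items view directly may not be accepted; materializing it as a list is equivalent and always works:
pd.DataFrame(list(data.items()), columns=['Key', 'Value'])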
Something like this with a list comprehension:
pd.DataFrame(([(x, y) for i in a for x, y in i.items()]),columns=['Key','Value'])
  Key Value
0   1     A
1   2     B
2   3     C

Sum columns in pandas having string and number

I need to sum column a and column b, which contain strings in the first row.
>>> df
   a  b
0  c  d
1  1  2
2  3  4
>>> df['sum'] = df.sum(1)
>>> df
   a  b sum
0  c  d  cd
1  1  2   3
2  3  4   7
I only need to add numeric values and get an output like
>>> df
   a  b                sum
0  c  d  "dummyString/NaN"
1  1  2                  3
2  3  4                  7
I need to add only some columns
df['sum']=df['a']+df['b']
Solution if the data are mixed (numeric with strings):
I think the simplest approach is to convert the non-numeric values left after the sum to NaN with to_numeric:
df['sum'] = pd.to_numeric(df[['a','b']].sum(1), errors='coerce')
Or:
df['sum'] = pd.to_numeric(df['a']+df['b'], errors='coerce')
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
EDIT:
Solution if the numbers are string representations: first convert to numeric and then sum:
df['sum'] = pd.to_numeric(df['a'], errors='coerce') + pd.to_numeric(df['b'], errors='coerce')
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
Or:
df['sum'] = (df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
.sum(axis=1, min_count=1))
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
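Side note on min_count=1 (my illustration, not part of the original answer): without it, a row where both values coerce to NaN would sum to 0.0 instead of staying NaN.
nums = df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
print (nums.sum(axis=1))               # row 0 -> 0.0, the all-NaN row collapses to zero
print (nums.sum(axis=1, min_count=1))  # row 0 -> NaN, at least one real number is required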

Warning with loc function with pandas dataframe

While working on an SO question I came across a warning when using loc; the precise details are as below:
DataFrame Samples:
First DataFrame df1:
>>> data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'],
... 'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
... 'Year':[2012, 2014, 2015]}
>>> df1 = pd.DataFrame(data1)
>>> df1.set_index('Sample')
            Location  Year
Sample
Sample_A  Bangladesh  2012
Sample_D     Myanmar  2014
Sample_E    Thailand  2015
Second dataframe df2:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num')
         Sample_A  Sample_B  Sample_C  Sample_D
Num
Value_1         0         0         1         0
Value_2         1         0         0         0
Value_3         0         1         0         1
Value_4         0         0         0         1
Value_5         1         0         1         0
>>> samples
['Sample_A', 'Sample_D', 'Sample_E']
When I use samples to select those columns from df2 as follows, it works, but at the same time it produces a warning:
>>> df3 = df2.loc[:, samples]
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
Warnings:
indexing.py:1472: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
I would like to know how to handle this in a better way.
Use reindex like:
df3 = df2.reindex(columns=samples)
print (df3)
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
Or if you want only the intersecting columns, use Index.intersection:
df3 = df2[df2.columns.intersection(samples)]
#alternative
#df3 = df2[np.intersect1d(df2.columns, samples)]
print (df3)
   Sample_A  Sample_D
0         0         0
1         1         0
2         0         1
3         0         1
4         1         0
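If the missing columns should be filled with something other than NaN, reindex also accepts a fill_value (a side note; assuming zeros make sense for these indicator columns):
df3 = df2.reindex(columns=samples, fill_value=0)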