Reading a Pandas DataFrame where a condition occurs in any row/column

I am reading a data frame whose values appear in a random row/column order, and I would like to get a new column where all the values containing 'that' are collected.
Input:
0 1 2 3 4
0 this=1 that=2 who=2 was=3 where=5
1 that=4 who=5 this=1 was=3 where=2
2 was=2 who=7 this=7 that=3 where=7
3 was=3 who=4 this=7 that=1 where=8
4 that=1 who=3 this=4 was=1 where=3
Output:
0
0 that=2
1 that=4
2 that=3
3 that=1
4 that=1
I have been able to get the correct result with the following code, but with larger data frames it takes a long time to complete:
import pandas as pd

df1 = pd.DataFrame([['this=1', 'that=2', 'who=2', 'was=3', 'where=5'],
                    ['that=4', 'who=5', 'this=1', 'was=3', 'where=2'],
                    ['was=2', 'who=7', 'this=7', 'that=3', 'where=7'],
                    ['was=3', 'who=4', 'this=7', 'that=1', 'where=8'],
                    ['that=1', 'who=3', 'this=4', 'was=1', 'where=3']],
                   columns=[0, 1, 2, 3, 4])
df2 = pd.DataFrame()
for i in df1.index:
    data = [name for name in df1[i] if name[0:4] == 'that']
    df2 = df2.append(pd.DataFrame(data))  # DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent

df1[df1.apply(lambda x: x.str.contains('that'))].stack()
Let's break this down:
df1.apply(lambda x: x.str.contains('that')) applies our lambda function to the entire dataframe. In English it reads: if 'that' is in the value, return True.
0 1 2 3 4
0 False True False False False
1 True False False False False
2 False False False True False
3 False False False True False
4 True False False False False
Putting df1[...] around that boolean mask returns the values instead of True/False:
0 1 2 3 4
0 NaN that=2 NaN NaN NaN
1 that=4 NaN NaN NaN NaN
2 NaN NaN NaN that=3 NaN
3 NaN NaN NaN that=1 NaN
4 that=1 NaN NaN NaN NaN
stack() will stack all the values into one Series. stack() gets rid of NA by default, which is what you needed.
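For illustration, here is the stacked result before any index reset; note the two-level (row, column) index it carries:
>>> df1[df1.apply(lambda x: x.str.contains('that'))].stack()
0  1    that=2
1  0    that=4
2  3    that=3
3  3    that=1
4  0    that=1
dtype: object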
If the extra index is tripping you up, you can also reset the index for a single Series:
df1[df1.apply(lambda x: x.str.contains('that'))].stack().reset_index(drop=True)
0 that=2
1 that=4
2 that=3
3 that=1
4 that=1
dtype: object

Related

Drop rows in pandas dataframe depending on order and NaN

I am using pandas to import a dataframe, and want to drop certain rows before grouping the information.
How do I go from the following (example):
Name1 Name2 Name3
0 A1 B1 1
1 NaN NaN 2
2 NaN NaN 3
3 NaN B2 4
4 NaN NaN 5
5 NaN NaN 6
6 NaN B3 7
7 NaN NaN 8
8 NaN NaN 9
9 A2 B4 1
10 NaN NaN 2
11 NaN NaN 3
12 NaN B5 4
13 NaN NaN 5
14 NaN NaN 6
15 NaN B6 7
16 NaN NaN 8
17 NaN NaN 9
to:
Name1 Name2 Name3
0 A1 B1 1
3 NaN B2 4
6 NaN B3 7
8 NaN NaN 9
9 A2 B4 1
12 NaN B5 4
15 NaN B6 7
17 NaN NaN 9
(My actual case consists of several thousand lines with the same structure as the example)
I have tried removing rows with NaN in Name2 using df=df[df['Name2'].notna()], but then I get this:
Name1 Name2 Name3
0 A1 B1 1
3 NaN B2 4
6 NaN B3 7
9 A2 B4 1
12 NaN B5 4
15 NaN B6 7
I also need to keep lines 8 and 17 in the example above.
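For reference, the example frame can be rebuilt from the printed table (a reconstruction):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name1': ['A1'] + [np.nan] * 8 + ['A2'] + [np.nan] * 8,
    'Name2': ['B1', np.nan, np.nan, 'B2', np.nan, np.nan, 'B3', np.nan, np.nan,
              'B4', np.nan, np.nan, 'B5', np.nan, np.nan, 'B6', np.nan, np.nan],
    'Name3': list(range(1, 10)) * 2,
})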
Assuming you want to keep the rows that are either:
not NA in column "Name2"
or the last row before a non-NA "Name1" or the end of data
You can use boolean indexing:
# is the row not-NA in Name2?
m1 = df['Name2'].notna()
# is this the last row of a group?
m2 = df['Name1'].notna().shift(-1, fill_value=True)
# keep the row if either of the above conditions is True
out = df[m1|m2]
Output:
Name1 Name2 Name3
0 A1 B1 1
3 NaN B2 4
6 NaN B3 7
8 NaN NaN 9
9 A2 B4 1
12 NaN B5 4
15 NaN B6 7
17 NaN NaN 9
Intermediates:
Name1 Name2 Name3 m1 m2 m1|m2
0 A1 B1 1 True False True
1 NaN NaN 2 False False False
2 NaN NaN 3 False False False
3 NaN B2 4 True False True
4 NaN NaN 5 False False False
5 NaN NaN 6 False False False
6 NaN B3 7 True False True
7 NaN NaN 8 False False False
8 NaN NaN 9 False True True
9 A2 B4 1 True False True
10 NaN NaN 2 False False False
11 NaN NaN 3 False False False
12 NaN B5 4 True False True
13 NaN NaN 5 False False False
14 NaN NaN 6 False False False
15 NaN B6 7 True False True
16 NaN NaN 8 False False False
17 NaN NaN 9 False True True
You can use the thresh argument in df.dropna.
# toy data
data = {'name1': [np.nan, np.nan, np.nan, np.nan], 'name2': [np.nan, 1, 2, np.nan], 'name3': [1, 2, 3, 4]}
df = pd.DataFrame(data)
name1 name2 name3
0 NaN NaN 1
1 NaN 1.0 2
2 NaN 2.0 3
3 NaN NaN 4
To remove rows with 2+ NaN, just do this:
df.dropna(thresh = 2)
name1 name2 name3
1 NaN 1.0 2
2 NaN 2.0 3
If you want to keep lines 8 and 17, you may want to first save them separately in another variable and add them back afterwards using pd.concat (DataFrame.append was removed in pandas 2.0), then re-sort by index.
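A minimal sketch of that suggestion, assuming the row labels to keep (8 and 17 in the example) are known up front:
import pandas as pd

# rows to preserve, identified beforehand
keep = df.loc[[8, 17]]
# filter on Name2, add the saved rows back, and restore the original order
out = pd.concat([df[df['Name2'].notna()], keep]).sort_index()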

pandas change all rows with Type X if 1 Type X Result = 1

Here is a simple pandas df:
>>> df
Type Var1 Result
0 A 1 NaN
1 A 2 NaN
2 A 3 NaN
3 B 4 NaN
4 B 5 NaN
5 B 6 NaN
6 C 1 NaN
7 C 2 NaN
8 C 3 NaN
9 D 4 NaN
10 D 5 NaN
11 D 6 NaN
The object of the exercise is: if column Var1 = 3, set Result = 1 for all that Type.
This finds the rows with 3 in Var1 and sets Result to 1,
df['Result'] = df['Var1'].apply(lambda x: 1 if x == 3 else 0)
but I can't figure out how to then catch all the same Type and make them 1. In this case it should be all the As and all the Cs. Doesn't have to be a one-liner.
Any tips please?
Create a boolean mask, then convert the True/False values to 1/0 by casting to integers:
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
#alternative
df['Result'] = np.where(df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']), 1, 0)
print (df)
Type Var1 Result
0 A 1 1
1 A 2 1
2 A 3 1
3 B 4 0
4 B 5 0
5 B 6 0
6 C 1 1
7 C 2 1
8 C 3 1
9 D 4 0
10 D 5 0
11 D 6 0
Details:
Get all Type values if match condition:
print (df.loc[df['Var1'].eq(3), 'Type'])
2 A
8 C
Name: Type, dtype: object
Test the original Type column against the filtered types:
print (df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
11 False
Name: Type, dtype: bool
Or use GroupBy.transform with any to test whether at least one value per group matches; this solution is slower for larger DataFrames:
df['Result'] = df['Var1'].eq(3).groupby(df['Type']).transform('any').astype(int)
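For completeness, a runnable sketch of the isin approach, with the frame reconstructed from the question's printed output (the reconstruction is an assumption):
import numpy as np
import pandas as pd

# reconstruct the question's frame from its printed output
df = pd.DataFrame({'Type': list('AAABBBCCCDDD'),
                   'Var1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   'Result': np.nan})

# flag every row whose Type group contains a Var1 == 3
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
print(df)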

Find how many common missing (nan) values are in DataFrame columns

I have a DataFrame with about 20 columns and 50,000 rows.
I am looking for some way to count how many missing values are in the same positions (rows) in a few columns.
When number of the columns is known simply code like these:
(df['HomeRemote'].isnull() & df['CompanySize'].isnull()).sum()
probably is the answer, but unfortunately the number of columns to compare could be more than 2. I don't know it in advance because it depends on the situation, and that's why I am looking for a "universal" solution (one that works for any number of columns).
My idea was to find a way to "push" every df[col].isnull() into a for loop (where col is the name of a column), but I had a problem putting '&' between every df[col].isnull().
Maybe anyone here has some other possibility to consider?
If something is not clear enough, please let me know.
Try:
Sample input:
>>> df
A B C D E
0 NaN 1.0 1.0 1.0 1.0
1 NaN NaN NaN 1.0 1.0
2 1.0 1.0 1.0 NaN 1.0
3 1.0 1.0 1.0 1.0 NaN
4 1.0 1.0 NaN 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 NaN 1.0
7 1.0 NaN 1.0 NaN 1.0
8 1.0 1.0 1.0 1.0 1.0
9 1.0 1.0 1.0 1.0 1.0
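The sample frame can be rebuilt like this (a reconstruction; 1.0 is just a non-missing placeholder):
import numpy as np
import pandas as pd

df = pd.DataFrame(1.0, index=range(10), columns=list('ABCDE'))
df.loc[[0, 1, 5], 'A'] = np.nan
df.loc[[1, 7], 'B'] = np.nan
df.loc[[1, 4], 'C'] = np.nan
df.loc[[2, 6, 7], 'D'] = np.nan
df.loc[3, 'E'] = np.nan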
How many missing values are in the same position in columns A, B, C:
>>> df[['A', 'B', 'C']].isnull().all(axis=1).sum()
1
Step by step:
# Find missing values
>>> df[['A', 'B', 'C']].isnull()
A B C
0 True False False
1 True True True # <- HERE
2 False False False
3 False False False
4 False False True
5 True False False
6 False False False
7 False True False
8 False False False
9 False False False
# Reduce
>>> df[['A', 'B', 'C']].isnull().all(axis=1)
0 False
1 True # <- HERE
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
# Reduce again
>>> df[['A', 'B', 'C']].isnull().all(axis=1).sum()
1
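The loop-and-'&' idea from the question can also be expressed with functools.reduce; the .all(axis=1) form above is the idiomatic shortcut for exactly this chain:
from functools import reduce

# AND together the isnull() masks of an arbitrary list of columns
cols = ['A', 'B', 'C']
mask = reduce(lambda a, b: a & b, (df[c].isnull() for c in cols))
print(mask.sum())  # 1, same as the .all(axis=1) approach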

Python Pandas: Number of times the same value appears in different columns within the same row

I have a wide dataset and am trying to figure out how many times the value in column 2019_5 appears for that same member, and whether the appearances are continuous. The best I've managed to come up with is dataframe['number_yrs'] = 5 - dataframe.isnull().sum(axis=1), which gives the following. The problem is that it only looks at whether there is a NaN, not whether the value equals the one in 2019_5:
member 2015_5 2016_5 2017_5 2018_5 2019_5 number_yrs
0 aaa NaN NaN NaN NaN 12 1
1 bbb 7.0 7.0 7.0 7.0 7 5
2 ccc 10.0 10.0 NaN NaN 10 3
3 ddd 12.0 NaN NaN 7.0 7 3
4 eee 12.0 NaN 10.0 NaN 10 3
What I want is for number_yrs to be 2 for member ddd and 2 for member eee.
I'd also like to add a continuous column (y/n) that indicates whether the number_yrs run is continuous. I expect it would look something like this when all is said and done:
member number_yrs continuous
0 aaa 1 n
1 bbb 5 y
2 ccc 3 n
3 ddd 2 y
4 eee 2 n
Try:
df["number_yrs"] = df.filter(regex="_5$").apply(
lambda x: x.eq(x["2019_5"]).sum(), axis=1
)
df["continuous"] = np.where(
df.filter(regex="_5$").apply(
lambda x: sorted(m := x.eq(x["2019_5"])) == m.tolist() and m.sum() > 1,
axis=1,
),
"y",
"n",
)
print(df)
Prints:
member 2015_5 2016_5 2017_5 2018_5 2019_5 number_yrs continuous
0 aaa NaN NaN NaN NaN 12 1 n
1 bbb 7.0 7.0 7.0 7.0 7 5 y
2 ccc 10.0 10.0 NaN NaN 10 3 n
3 ddd 12.0 NaN NaN 7.0 7 2 y
4 eee 12.0 NaN 10.0 NaN 10 2 n
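To reproduce this, the question's frame can be rebuilt like so (a reconstruction from the printed output; the exact dtypes are an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'member': ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
    '2015_5': [np.nan, 7.0, 10.0, 12.0, 12.0],
    '2016_5': [np.nan, 7.0, 10.0, np.nan, np.nan],
    '2017_5': [np.nan, 7.0, np.nan, np.nan, 10.0],
    '2018_5': [np.nan, 7.0, np.nan, 7.0, np.nan],
    '2019_5': [12, 7, 10, 7, 10],
})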
For number_yrs, you can compare the 2019_5 column against the whole dataframe, transposed to ensure alignment (since the comparand is a Series):
# keep `member` column aside from calculations
>>> df = df.set_index("member")
>>> df.T == df["2019_5"]
member aaa bbb ccc ddd eee
2015_5 False True True False False
2016_5 False True True False False
2017_5 False True False False True
2018_5 False True False True False
2019_5 True True True True True
As you can see, this returns a boolean frame where each entry marks whether it equals that row's 2019_5 value (and the last row is all True, as expected). Now we can sum the Trues per member to get number_yrs:
>>> (df.T == df["2019_5"]).sum(axis=0)
member
aaa 1
bbb 5
ccc 3
ddd 2
eee 2
As for continuity, we can look at the 2018_5 value and see if it's the same as that of 2019_5 for each row:
>>> df["2018_5"] == df["2019_5"]
member
aaa False
bbb True
ccc False
ddd True
eee False
We're almost there; we can map True/False values to "y"/"n" with map:
>>> (df["2018_5"] == df["2019_5"]).map({True: "y", False: "n"})
member
aaa n
bbb y
ccc n
ddd y
eee n
So we have the needed new columns. Putting together and assigning those to the dataframe:
df["number_yrs"] = (df.T == df["2019_5"]).sum(axis=0)
df["continuous"] = (df["2018_5"] == df["2019_5"]).map({True: "y", False: "n"})
gives
>>> df
2015_5 2016_5 2017_5 2018_5 2019_5 number_yrs continuous
member
aaa NaN NaN NaN NaN 12 1 n
bbb 7.0 7.0 7.0 7.0 7 5 y
ccc 10.0 10.0 NaN NaN 10 3 n
ddd 12.0 NaN NaN 7.0 7 2 y
eee 12.0 NaN 10.0 NaN 10 2 n

Slicing an entire pandas DataFrame instead of a Series changes the data types and assigns NaN to the first field: what is happening?

I was trying to do some cleaning on a dataset where, instead of applying a condition to a pandas Series:
head_only[head_only.BasePay > 70000]
I applied the condition to the whole data frame:
head_only[head_only > 70000]
I attached images of my observation; could anyone help me understand what is happening here?
Your second solution raises an error if the frame has both numeric and string columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2.0, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df[df > 5])
TypeError: '>' not supported between instances of 'str' and 'int'
If you compare only the numeric columns, it keeps values greater than 4 and converts all the other numbers to missing values:
df1 = df.select_dtypes(np.number)
print (df1[df1 > 4])
B C D E
0 NaN 7.0 NaN 5.0
1 5.0 8.0 NaN NaN
2 NaN 9.0 5.0 6.0
3 5.0 NaN 7.0 9.0
4 5.0 NaN NaN NaN
5 NaN NaN NaN NaN
Because at least one value in each column was replaced, the integer columns are converted to floats (NaN is a float):
print (df1[df1 > 4].dtypes)
B float64
C float64
D float64
E float64
dtype: object
If you need to filter rows where at least one numeric column matches the condition, use DataFrame.any to test whether at least one value per row is True:
#returned boolean DataFrame
print ((df1 > 7))
B C D E
0 False False False False
1 False True False False
2 False True False False
3 False False False True
4 False False False False
5 False False False False
print ((df1 > 7).any(axis=1))
0 False
1 True
2 True
3 True
4 False
5 False
dtype: bool
print (df1[(df1 > 7).any(axis=1)])
B C D E
1 5 8.0 3 3
2 4 9.0 5 6
3 5 4.0 7 9
Or, if you need to filter the original DataFrame with all its columns, you can apply the condition only to the numeric columns via DataFrame.select_dtypes:
print (df[(df.select_dtypes(np.number) > 7).any(axis=1)])
A B C D E F
1 b 5 8.0 3 3 a
2 c 4 9.0 5 6 a
3 d 5 4.0 7 9 b
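Tying this back to the question: a minimal sketch with an invented head_only frame (only the BasePay column name comes from the question):
import pandas as pd

# invented example data; BasePay is the column from the question
head_only = pd.DataFrame({'BasePay': [65000.0, 80000.0, 72000.0],
                          'OvertimePay': [5000.0, 0.0, 1200.0]})

# Series condition: a boolean row mask, so whole rows are kept or dropped
print(head_only[head_only.BasePay > 70000])

# DataFrame condition: an elementwise mask, so the shape is kept and every
# non-matching cell (including all of OvertimePay here) becomes NaN
print(head_only[head_only > 70000])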