How to count the number of rows containing exactly one NaN (XOR operation) in a pandas dataframe? - pandas

For the table below, the result should be 2, counting the rows with indices 2 and 3.
0 NaN NaN
1 NaN NaN
2 Apple NaN
3 NaN Mango
4 Banana Grape
To elaborate, row 2 contains exactly one non-NaN element, so a variable tracking the count would get count += 1 when we iterate through the rows and reach row 2. Similarly, row 3 gives count += 1, for a total of 2. Row 4 contains two non-NaN elements, so the count is not incremented there. Rows 0 and 1 contain two NaNs each, so the count is not updated for them either.

You can do:
df.isnull().sum(axis=1).eq(1).sum()
# 2
.isnull()      check for nulls
.sum(axis=1)   count nulls per row
.eq(1)         flag rows whose null count equals exactly 1
.sum()         count the flagged rows, i.e. rows with exactly one null
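For completeness, a self-contained reproduction of the example above (the column names fruit_a and fruit_b are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fruit_a': [np.nan, np.nan, 'Apple', np.nan, 'Banana'],
    'fruit_b': [np.nan, np.nan, np.nan, 'Mango', 'Grape'],
})

# Rows 2 and 3 each contain exactly one NaN, so the result is 2
print(df.isnull().sum(axis=1).eq(1).sum())   # 2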

Related

Sum positive and negative numbers in column separately then recalculate column based on net total

I have a dataframe column that has both positive and negative numbers.
I would like to sum/total only the negative numbers in this column, then add this negative total to each positive number in this column.
For the first number in the column that is greater than 0, I want the row to show the net of the negative total and this number in a new column "Q1_new". All other numbers below this row should stay the same.
Here is my data:
data = {'Q1_old': [5, -4, 3, 2, -2]}
df = pd.DataFrame(data)
Q1_old
0 5
1 -4
2 3
3 2
4 -2
My desired result:
Q1 Q1_new
2 3 2
3 2 2
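A minimal sketch of one way to get there, assuming the intent is to offset the positive values by the running negative total, drop the fully consumed rows, and rename Q1_old to Q1 in the output (that interpretation and the renaming are assumptions based on the desired result):
import pandas as pd

data = {'Q1_old': [5, -4, 3, 2, -2]}
df = pd.DataFrame(data)

# Total of the negative values: -4 + -2 = -6
neg_total = df.loc[df['Q1_old'] < 0, 'Q1_old'].sum()

# Running total of the positive values, offset by the negative total
pos = df[df['Q1_old'] > 0]
running = pos['Q1_old'].cumsum() + neg_total

# Keep only the rows where the running total is positive; the first such row
# shows the net value, the rows below it keep their original value
out = pos[running > 0].rename(columns={'Q1_old': 'Q1'})
out['Q1_new'] = out['Q1']
out.iloc[0, out.columns.get_loc('Q1_new')] = running[running > 0].iloc[0]

print(out)
#    Q1  Q1_new
# 2   3       2
# 3   2       2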

Drop rows not equal to values for unique items - pandas

I've got a df that contains various strings that are associated with unique values. For these unique values, I want to drop the rows that are not equal to a separate list, except for the last row.
Using below, the various string values in Label are associated with Item. So for each unique Item, there could be multiple rows in Label with various strings. I only want to keep the strings that are in label_list, except for the last row.
I'm not sure I can do this another way, as there are too many strings not in label_list to account for. The ordering can also vary. So for each unique value in Item, I really only want the last row plus whatever rows are in label_list.
label_list = ['A','B','C','D']
df = pd.DataFrame({
'Item' : [10,10,10,10,10,20,20,20],
'Label' : ['A','X','C','D','Y','A','B','X'],
'Count' : [80.0,80.0,200.0,210.0,260.0,260.0,300.0,310.0],
})
df = df[df['Label'].isin(label_list)]
Intended output:
Item Label Count
0 10 A 80.0
1 10 C 200.0
2 10 D 210.0
3 10 Y 260.0
4 20 A 260.0
5 20 B 300.0
6 20 X 310.0
This comes to mind as a quick and dirty solution:
df = pd.concat([df[df['Label'].isin(label_list)],df.drop_duplicates('Item',keep='last')]).drop_duplicates(keep='first')
We are appending the last row of each Item group, but in case that last row is duplicated because it is also in label_list, we call drop_duplicates on the concatenated output as well.
check if 'Label' isin label_list
check if rows are duplicated
boolean slice the dataframe
isin_ = df['Label'].isin(label_list)
duped = df.duplicated('Item', keep='last')
df[isin_ | ~duped]
Item Label Count
0 10 A 80.0
2 10 C 200.0
3 10 D 210.0
4 10 Y 260.0
5 20 A 260.0
6 20 B 300.0
7 20 X 310.0
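As an alternative, equivalent way to express "keep the last row of each Item", the same mask can be built with a reverse cumulative count per group (a sketch, not from the answers above):
is_last = df.groupby('Item').cumcount(ascending=False).eq(0)
df[df['Label'].isin(label_list) | is_last]
This produces the same output as the duplicated() approach.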

pandas - how to vectorize group-by calculations instead of iterating

Here is a code snippet to simulate the problem I am facing. I am iterating over large datasets:
import numpy as np
import pandas as pd

df = pd.DataFrame({'grp': np.random.choice([1, 2, 3, 4, 5], 500),
                   'col1': np.arange(0, 500),
                   'col2': np.random.randint(0, 10, 500),
                   'col3': np.nan})

for index, row in df.iterrows():
    # based on the group label, get the last 3 values before this row to calculate the mean
    d = df.iloc[0:index].groupby('grp')
    try:
        dgrp_sum = d.get_group(row.grp).col2.tail(3).mean()
    except KeyError:
        # no earlier rows for this group yet
        dgrp_sum = 999
    # multiply the group mean by the current row's col1 and col2
    df.at[index, 'col3'] = dgrp_sum * row.col1 * row.col2
If I want to speed this up with vectorized operations, how do I convert this code?
You are basically calculating a moving average within every group.
That means you can group the dataframe by "grp" and compute a rolling mean.
At the end you multiply the columns within each row, because that part does not depend on the group.
df["col3"] = df.groupby("grp").col2.rolling(3, min_periods=1).mean().reset_index(0,drop=True)
df["col3"] = df[["col1", "col2", "col3"]].product(axis=1)
Note: In your code, each calculated mean only uses the rows before the current one, so it is effectively shifted down by one row; that is probably why you have the try block.
# Skipping last product gives only mean
# np.random.seed(1234)
# print(df[df["grp"] == 2])
grp col1 col2 iter mask
4 2 4 6 999.000000 6.000000
5 2 5 0 6.000000 3.000000
6 2 6 9 3.000000 5.000000
17 2 17 1 5.000000 3.333333
27 2 27 9 3.333333 6.333333
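If you need to reproduce the loop's numbers exactly (its mean only uses rows before the current one), a sketch that shifts col2 within each group before taking the rolling mean should match, with 999 as the fallback for rows that have no earlier rows in their group:
# Exclude the current row by shifting within each group, then take the rolling mean
prev = df.groupby("grp")["col2"].shift(1)
roll = (prev.groupby(df["grp"])
            .rolling(3, min_periods=1)
            .mean()
            .reset_index(level=0, drop=True))   # drop the group level added by rolling

# Rows with no earlier rows in their group get the loop's 999 fallback
df["col3"] = roll.fillna(999) * df["col1"] * df["col2"]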

What is the difference between total = df.isnull().sum(), percent1 = df.count(), and percent = df.isnull().count()?

Can anyone tell me the difference between total = df.isnull().sum(), percent1 = df.count(), and percent = df.isnull().count()? Ideally df.isnull().count() should give the count of only the null values, but it is giving the count of all the values. Can anyone help me understand this?
Below is the code, where the variable total comes out as the count of only null values, percent1 as the count of only non-null values, and percent as the count of all values regardless of whether they are null or not:
total= df.isnull().sum().sort_values(ascending=False)
percent1= df.count()#helps to get all the non null values count
percent= df.isnull().count()
print(total)
print(percent1)
print(percent)
The definition of count according to the doc is:
Count non-NA cells for each column or row.
And using isnull (or isna) turns your dataframe df, whatever types it contains, into a boolean dataframe, with True wherever df originally held NaN and False otherwise. There are no NaN left in this boolean dataframe, so count on df.isnull() returns the number of rows of df, because nothing is missing in it. With an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': [1, np.nan, 3, np.nan]})
print (df)
a b
0 0 1.0
1 1 NaN
2 2 3.0
3 3 NaN
if you use count on this dataframe you get:
print (df.count())
a 4
b 2 # only 2 non-NaN values remain in column b, because the two NaN defined above are not counted
dtype: int64
but if you use isnull on it you get
print (df.isnull())
a b
0 False False
1 False True #was nan in column b in df
2 False False
3 False True
Here you don't have nan anymore, so the result of count will be the number of rows for both columns
print (df.isnull().count())
a 4
b 4 #no more nan in df.isnull()
dtype: int64
But because True is equal to 1 and False equal to 0, using the sum method adds one for each True in df.isnull(), i.e. for each NaN originally in df:
print (df.isnull().sum())
a 0 # because only False in column a of df.isnull()
b 2 # because you have two True in df.isnull() in column b
dtype: int64
Finally, you can see the relation like this:
(df.count()+df.isnull().sum())==df.isnull().count()
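For the example above, this identity holds for both columns:
print((df.count() + df.isnull().sum()) == df.isnull().count())
# a    True
# b    True
# dtype: bool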

Get the second largest value per row across selected columns of a pandas dataframe

I have a dataframe, a subset of which is shown below. There are more columns to the right and left of the ones shown here.
M_cols  10D_MA     30D_MA     50D_MA     100D_MA    200D_MA   Max     Min     2nd smallest
        68.58      70.89      69.37      **68.24**  64.41     70.89   64.41   68.24
        **68.32**  71.00      69.47      68.50      64.49     71.00   64.49   68.32
        68.57      **68.40**  69.57      71.07      64.57     71.07   64.57   68.40
(the value wrapped in ** in each row is its second smallest among the five MA columns)
I can get the min (and max is easy as well) with the following code
df2['MIN'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].min(axis=1)
But how do I get the 2nd smallest. I tried this and got the following error
df2['2nd SMALLEST'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].nsmallest(2)
TypeError: nsmallest() missing 1 required positional argument: 'columns'
Seems like this should be a simple answer but I am stuck
For example, say you have the following df:
df=pd.DataFrame({'V1':[1,2,3],'V2':[3,2,1],'V3':[3,4,9]})
After picking the values we need to compare, we just need to sort them within each row (np.sort sorts along the last axis by default):
sortdf=pd.DataFrame(np.sort(df[['V1','V2','V3']].values))
sortdf
Out[419]:
0 1 2
0 1 3 3
1 2 2 4
2 1 3 9
1st max:
sortdf.iloc[:,-1]
Out[421]:
0 3
1 4
2 9
Name: 2, dtype: int64
2nd max
sortdf.iloc[:,-2]
Out[422]:
0 3
1 2
2 3
Name: 1, dtype: int64
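Applying the same idea back to the frame from the question (df2 and the MA column names are taken from the question; the 'MAX' output column here is just for illustration):
import numpy as np

ma_cols = ['10D_MA', '30D_MA', '50D_MA', '100D_MA', '200D_MA']
sorted_vals = np.sort(df2[ma_cols].values, axis=1)   # ascending within each row

df2['MIN'] = sorted_vals[:, 0]             # smallest value per row
df2['2nd SMALLEST'] = sorted_vals[:, 1]    # second smallest value per row
df2['MAX'] = sorted_vals[:, -1]            # largest value per row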