Sum positive and negative numbers in column separately then recalculate column based on net total - dataframe

I have a dataframe column that has both positive and negative numbers.
I would like to sum/total only the negative numbers in this column, then add this negative total to each positive number in this column.
For the first number in the column that is greater than 0, I want the number in this row to reflect this net between the negative sum/total and this number in a new column "Q1_new". All other numbers below this row should stay the same.
Here is my data:
data = {'Q1_old': [5, -4, 3, 2, -2]}
df = pd.DataFrame(data)
Q1_old
0 5
1 -4
2 3
3 2
4 -2
My desired result:
Q1 Q1_new
2 3 2
3 2 2

Related

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i,j) in zip(range(0, 5999),range(1,len(df.columns))):
if j==1:
df2.values[i,j] = df.values[i,j] + df.values[0,1]
elif j>1:
df2.iloc[i,j] = df.iloc[i,j] - df.iloc[0,j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
0
1
2
3
4
0
0
1
2
3
4
1
5
6
7
8
9
2
10
11
12
13
14
3
15
16
17
18
19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (column names of the dataframe), and the second one - the actual values of the first row of the dataframe.
We can convert it to a list to better see its values
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, we are subtracting each value from the whole column where it has come from.

How to split pandas dataframe into multiple dataframes (holding together rows) based upon a column's value

My problem is similar to split a dataframe into chunks of N rows problem, expect that the number of rows in each chunk will be different. I have a datafame as such:
A
B
C
1
2
0
1
2
1
1
2
2
1
2
0
1
2
1
1
2
2
1
2
3
1
2
4
1
2
0
A and B are just whatever don't pay attention. Column C though starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe included the first 3 rows are a new dataframe, then the next 5 are a second new dataframe, and this continues as my dataframe adds more and more rows.
To finish off the question,
df = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
allows me to group all the subgroups and then with this groupby I can select each subgroups as a separate dataframe.

How to count number of rows containing exactly one NaN (XOR operation) on a pandas dataframe?

In the table below, I would return 2, by the sum of the rows with indices 2 and 3.
0 NaN NaN
1 NaN NaN
2 Apple NaN
3 NaN Mango
4 Banana Grape
To elaborate, row 2 contains one non-NaN element, so for a variable tracking the count, count += 1 when we iterate through each row and encounter row 2. Similarly, for row 3, we would have count += 1, leading to the count being a total of 2. Since row 4 contains two non-NaN elements, we do not increment the count. Rows 0 and 1 contain two NaN's, so the count is also not updated.
You can do:
df.isnull().sum(axis=1).eq(1).sum()
# 2
.isnull() Check nulls
.sum(axis=1) Count nulls by row
.eq(1) Check number of nulls equal to 1
.sum() Count rows with exactly one null

Sequence of numbers per category given first entry (Python, Pandas)

Suppose I have 5 categories {A, B, C, D, E} and Several date entries of PURCHASES with distinct dates (for instance, A may range from 01/01/1900 to 31/01/1901 and B from 02/02/1930 to 03/03/1933.
I want to create a new column 'day of occurrence' where I have sequence of number 1...N from the point I find the first date in which number of purchases >= 5.
I want this to compare how categories are similar from the day they've achieved 5 purchases (dates are irrelevant here, but product lifetime is)
Thanks!
Here is how you can label rows from 1 to N depending on column value.
import pandas as pd
df = pd.DataFrame(data=[3, 6, 9, 3, 6], columns=['data'])
df['day of occurrence'] = 0
values_count = df.loc[df['data'] > 5].shape[0]
df.loc[df['data'] > 5, 'day of occurrence'] = range(1, values_count + 1)
The initial DataFrame:
data
0 3
1 6
2 9
3 3
4 6
Output DataFrame:
data day of occurrence
0 3 0
1 6 1
2 9 2
3 3 0
4 6 3
Your data should be sorted by date, for example, df = df.sort_values(by='your-datetime-column')

Looking up values from one dataframe in specific row of another dataframe

I'm struggling with a bit of a complex (to me) lookup-type problem.
I have a dataframe df1 that looks like this:
Grade Value
0 A 16
1 B 12
2 C 5
And another dataframe (df2) where the values in one of the columns('Grade') from df1 forms the index:
Tier 1 Tier 2 Tier 3
A 20 17 10
B 16 11 3
C 7 6 2
I've been trying to write a bit of code that for each row in df1, look ups the row corresponding with 'Grade' in df2, finds the smallest value in df2 greater than 'Value', and returns the name of that column.
E.g. for the second row of df1, it looks up the row with index 'B' in df2: 16 is the smallest value greater than 12, so it returns 'Tier 1'. Ideal output would be:
Grade Value Tier
0 A 16 Tier 2
1 B 12 Tier 1
2 C 5 Tier 2
My novice, downloaded-Python-last-week attempt so far has been as follows, which is throwing up all manner of errors and doesn't even try returning the column name. Sorry also about the micro-ness of the question: any help appreciated!
for i, row in input_df1.iterrows():
Tier = np.argmin(df1['Value']<df2.loc[row,0:df2.shape[1]])
df2.loc[df1.Grade].eq(df1.Value, 0).idxmax(1)