Add positive elements (and negative) in each row? - pandas

For each row of my data I want to sum the positive values together and the negative values together:
c1 c2 c3 c4 c5
1 2 3 -1 -2
3 2 -1 2 -9
3 -5 1 2 4
Output
c1 c2 c3 c4 c5 sum_positive sum_negative
1 2 3 -1 -2 6 -3
3 2 -1 2 -9 7 -10
3 -5 1 2 4 10 -5
I was trying to use a for loop like the following (G is my df), adding the positive and negative elements to two lists and summing them, but I thought there might be a better way to do that:
g = []
for i in range(G.shape[0]):
    for j in range(G.shape[1]):
        if G.iloc[i, j] >= 0:
            g.append(G.iloc[i, j])
    g.append('skill_next')  # marker separating rows

Loops or .apply will be pretty slow, so your best bet is to just .clip the values and take the sum directly:
In [58]: df['sum_positive'] = df.clip(lower=0).sum(axis=1)
In [59]: df['sum_negative'] = df.clip(upper=0).sum(axis=1)
In [60]: df
Out[60]:
c1 c2 c3 c4 c5 sum_positive sum_negative
0 1 2 3 -1 -2 6 -3
1 3 2 -1 2 -9 7 -10
2 3 -5 1 2 4 10 -5

Or you can use where:
df['sum_negative'] = df.where(df < 0).sum(axis=1)
df['sum_positive'] = df.where(df > 0).sum(axis=1)
RESULT:
c1 c2 c3 c4 c5 sum_negative sum_positive
0 1 2 3 -1 -2 -3.0 6.0
1 3 2 -1 2 -9 -10.0 7.0
2 3 -5 1 2 4 -5.0 10.0
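For reference, a minimal runnable sketch of the clip approach, with the DataFrame reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 3, 3],
    "c2": [2, 2, -5],
    "c3": [3, -1, 1],
    "c4": [-1, 2, 2],
    "c5": [-2, -9, 4],
})

# compute both sums before assigning, so neither new column
# leaks into the other's row sum
pos = df.clip(lower=0).sum(axis=1)
neg = df.clip(upper=0).sum(axis=1)
df["sum_positive"] = pos
df["sum_negative"] = neg
print(df)
```

Note that the `where` variant produces float results because `where` replaces the masked-out values with NaN, while `clip` keeps the integer dtype.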

Related

Pandas pivot columns based on column name prefix

I have a dataframe:
df = AG_Speed AG_wolt AB_Speed AB_wolt C1 C2 C3
1 2 3 4 6 7 8
1 9 2 6 4 1 8
And I want to pivot it based on prefix to get:
df = Speed Wolt C1 C2 C3 Category
1 2 6 7 8 AG
3 4 6 7 8 AB
1 9 4 1 8 AG
2 6 4 1 8 AB
What is the best way to do it?
We can use pd.wide_to_long for this. But since it expects the column names to start with the stubnames, we have to reverse the column format:
df.columns = ["_".join(col.split("_")[::-1]) for col in df.columns]
res = pd.wide_to_long(
    df,
    stubnames=["Speed", "wolt"],
    i=["C1", "C2", "C3"],
    j="Category",
    sep="_",
    suffix="[A-Za-z]+"
).reset_index()
C1 C2 C3 Category Speed wolt
0 6 7 8 AG 1 2
1 6 7 8 AB 3 4
2 4 1 8 AG 1 9
3 4 1 8 AB 2 6
If you want the columns in a specific order, use DataFrame.reindex:
res.reindex(columns=["Speed", "wolt", "C1", "C2", "C3", "Category"])
Speed wolt C1 C2 C3 Category
0 1 2 6 7 8 AG
1 3 4 6 7 8 AB
2 1 9 4 1 8 AG
3 2 6 4 1 8 AB
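A self-contained sketch of the wide_to_long approach, with the data reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "AG_Speed": [1, 1], "AG_wolt": [2, 9],
    "AB_Speed": [3, 2], "AB_wolt": [4, 6],
    "C1": [6, 4], "C2": [7, 1], "C3": [8, 8],
})

# wide_to_long expects column names to start with the stubnames,
# so flip "AG_Speed" -> "Speed_AG" first
df.columns = ["_".join(col.split("_")[::-1]) for col in df.columns]

res = pd.wide_to_long(
    df, stubnames=["Speed", "wolt"],
    i=["C1", "C2", "C3"], j="Category",
    sep="_", suffix="[A-Za-z]+",
).reset_index()
print(res)
```

The `i` columns must uniquely identify each original row, which holds here since the (C1, C2, C3) combinations are distinct.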
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index=['C1', 'C2', 'C3'],
    names_to=('Category', '.value'),
    names_sep='_'
)
C1 C2 C3 Category Speed wolt
0 6 7 8 AG 1 2
1 4 1 8 AG 1 9
2 6 7 8 AB 3 4
3 4 1 8 AB 2 6
In the above solution, the .value determines which parts of the column labels remain as headers; the labels are split apart on names_sep.

How to sum up pandas columns only if the last digit of a contract is less than or equal to the last digit of this one

So I have a pandas dataframe like this
contract  account  number1  number2
A1        A        3        1
B1        B        2        2
A2        A        1        3
A3        A        5        5
B2        B        3        4
C1        C        3        3
and I want to add two columns that sum up the number columns so far. For example, for A1 it would just be the numbers for A1; for A2 it would be the numbers for A1 and A2; for A3 it would be sum1 = number1 for A1 + number1 for A2 + number1 for A3, and so on. Basically, sum the columns of everything else with the same account whose last number of contract is less than or equal to the current one.
contract  account  number1  number2  sum1  sum2
A1        A        3        1        3     1
B1        B        2        2        2     2
A2        A        1        2        4     3
A3        A        5        5        9     8
B2        B        3        4        5     6
C1        C        3        3        3     3
Use assign with groupby and cumsum:
df = df.assign(
    sum1=df.groupby('account')['number1'].cumsum(),
    sum2=df.groupby('account')['number2'].cumsum(),
)
contract account number1 number2 sum1 sum2
0 A1 A 3 1 3 1
1 B1 B 2 2 2 2
2 A2 A 1 3 4 4
3 A3 A 5 5 9 9
4 B2 B 3 4 5 6
5 C1 C 3 3 3 3
IIUC, sort the values and groupby+agg:
df.join(
    df.sort_values(by='contract')
      .groupby('account')
      .agg(sum1=('number1', 'cumsum'),
           sum2=('number2', 'cumsum'))
)
Output:
contract account number1 number2 sum1 sum2
0 A1 A 3 1 3 1
1 B1 B 2 2 2 2
2 A2 A 1 3 4 4
3 A3 A 5 5 9 9
4 B2 B 3 4 5 6
5 C1 C 3 3 3 3
Use groupby + cumsum:
df[['sum1', 'sum2']] = df.groupby('account')[['number1', 'number2']].cumsum()
Output:
>>> df
contract account number1 number2 sum1 sum2
0 A1 A 3 1 3 1
1 B1 B 2 2 2 2
2 A2 A 1 3 4 4
3 A3 A 5 5 9 9
4 B2 B 3 4 5 6
5 C1 C 3 3 3 3
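A runnable sketch of the groupby + cumsum answer, with the frame reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "contract": ["A1", "B1", "A2", "A3", "B2", "C1"],
    "account":  ["A", "B", "A", "A", "B", "C"],
    "number1":  [3, 2, 1, 5, 3, 3],
    "number2":  [1, 2, 3, 5, 4, 3],
})

# running per-account totals; this assumes rows already appear in
# contract order within each account, otherwise sort first
df[["sum1", "sum2"]] = df.groupby("account")[["number1", "number2"]].cumsum()
print(df)
```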

How to sum values in dataframes based on index match

I have about 16 dataframes representing weekly users' clickstream data; the samples below are for weeks 0-3. I want to make new cumulative dataframes this way: for example, if the new df is w=2, then w2 = w0 + w1 + w2; for w3, w3 = w0 + w1 + w2 + w3. As you can see, the datasets do not have identical id_user values, since a user may not show up in a certain week. All dataframes have the same columns, but the indexes are not exactly the same. So how do I add them where the indexes match?
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
43284 1 8 0 8 5 0 0 0 2 3 1
45664 0 16 0 4 0 0 0 0 5 16 2
52014 0 0 0 5 4 0 0 0 0 2 2
53488 1 37 0 19 0 0 3 0 3 23 6
60135 0 124 0 87 3 0 24 0 8 19 14
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0
Use concat, then groupby and sum:
out = pd.concat([df1, df2]).groupby('id_user', as_index=False).sum()
Out[147]:
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
0 40419 0 8 0 3 4 0 6 0 1 6 0
1 43284 2 12 0 22 31 2 0 0 4 7 3
2 45664 0 25 0 19 11 0 0 0 6 22 16
3 52014 0 0 0 13 13 0 8 0 2 4 3
4 53488 1 39 0 23 0 0 7 0 3 23 6
5 60135 0 124 0 87 3 0 24 0 8 19 14
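To generalize to the cumulative pattern in the question (w2 = w0 + w1 + w2, and so on), you can concat a growing prefix of the list of weekly frames. A sketch with small hypothetical frames week0..week2 standing in for the 16 real ones:

```python
import pandas as pd

week0 = pd.DataFrame({"id_user": [1, 2], "c1": [5, 3]})
week1 = pd.DataFrame({"id_user": [2, 3], "c1": [1, 7]})
week2 = pd.DataFrame({"id_user": [1, 3], "c1": [2, 2]})

weeks = [week0, week1, week2]

# cumulative[k] sums weeks 0..k, aligning rows on id_user;
# users missing in a week simply contribute nothing that week
cumulative = [
    pd.concat(weeks[: k + 1]).groupby("id_user", as_index=False).sum()
    for k in range(len(weeks))
]
print(cumulative[2])
```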

How to perform set-like operations in pandas?

I need to fill a column with values, that are present in a set and not present in any other columns.
initial df
c0 c1 c2 c3 c4 c5
0 4 5 6 3 2 1
1 1 5 4 0 2 3
2 5 6 4 0 1 3
3 5 4 6 2 0 1
4 5 6 4 0 1 3
5 0 1 4 5 6 2
I need a df['c6'] column that is the result of a set-difference operation between the set {0, 1, 2, 3, 4, 5, 6} and each row of df,
so that the result df is
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
Thank you!
Slightly different approach:
df['c6'] = sum(range(7)) - df.sum(axis=1)
or if you want to be more verbose:
df['c6'] = sum([0,1,2,3,4,5,6]) - df.sum(axis=1)
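The arithmetic trick works because each row holds six distinct values from 0-6, so the one missing value is the total 0+1+...+6 = 21 minus the row sum. A quick sketch with the first three rows from the question:

```python
import pandas as pd

df = pd.DataFrame([
    [4, 5, 6, 3, 2, 1],
    [1, 5, 4, 0, 2, 3],
    [5, 6, 4, 0, 1, 3],
], columns=["c0", "c1", "c2", "c3", "c4", "c5"])

# each row is missing exactly one value from 0..6,
# so the missing value is 21 minus the row sum
df["c6"] = sum(range(7)) - df.sum(axis=1)
print(df["c6"].tolist())
```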
Use numpy's setdiff1d to find the difference between the two arrays and assign the output to column c6:
import numpy as np

ck = np.array([0, 1, 2, 3, 4, 5, 6])
M = df.to_numpy()
df['c6'] = [np.setdiff1d(ck, i)[0] for i in M]
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
A simple way I could think of is using a list comprehension and set difference:
s = {0, 1, 2, 3, 4, 5, 6}
s
{0, 1, 2, 3, 4, 5, 6}
df['c6'] = [tuple(s.difference(vals))[0] for vals in df.values]
df
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3

Display rows only if the group of rows' sum is greater than 0

I have a table like the one below. I would like to get this data into SSRS (grouped by LineID and Product, with Hour as a column) and show only those LineID/Product groups whose total HourCount is greater than 0.
LineID Product Hour HourCount
3 A 0 0
3 A 1 0
3 A 2 0
3 A 3 0
3 A 4 0
3 A 5 0
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
4 B 0 0
4 B 1 0
4 B 2 0
4 B 3 0
4 B 4 0
4 B 5 0
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
Basically I would like this table to look like this before it's in SSRS:
LineID Product Hour HourCount
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
So display Product for the line only if any of the Hours have HourCount higher than 0.
Is there any query that could give me these results, or should I play with display settings in SSRS?
Something like this should work:
with NonZero as
(
    select *
         , GroupHourCount = sum(HourCount) over (partition by LineID, Product)
    from HourTable
)
select LineID
     , Product
     , [Hour]
     , HourCount
from NonZero
where GroupHourCount > 0
SQL Fiddle with demo.
You could certainly do something similar in SSRS, but it's much easier and more intuitive to apply at the T-SQL level.
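Since the rest of this page deals with pandas, here is the analogue of that windowed-sum filter in pandas, using groupby().transform; the frame below is an abbreviated, hypothetical slice of the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "LineID": [3, 3, 4, 4],
    "Product": ["A", "A", "B", "B"],
    "Hour": [0, 1, 0, 1],
    "HourCount": [0, 0, 5, 3],
})

# keep only the LineID/Product groups whose total HourCount is > 0,
# the pandas equivalent of SUM(...) OVER (PARTITION BY ...) filtering
out = df[df.groupby(["LineID", "Product"])["HourCount"].transform("sum") > 0]
print(out)
```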
I think you are looking for
SELECT LineID, Product, Hour, COUNT(Hour) AS HourCount
FROM abc
GROUP BY LineID, Product, Hour
HAVING COUNT(Hour) > 0