How to perform set-like operations in pandas?

I need to fill a column with the values that are present in a given set but not present in any of the other columns.
initial df
c0 c1 c2 c3 c4 c5
0 4 5 6 3 2 1
1 1 5 4 0 2 3
2 5 6 4 0 1 3
3 5 4 6 2 0 1
4 5 6 4 0 1 3
5 0 1 4 5 6 2
I need a df['c6'] column that is the set difference between set([0,1,2,3,4,5,6]) and the values in each row of df,
so that the result df is
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
Thank you!

Slightly different approach:
df['c6'] = sum(range(7)) - df.sum(axis=1)
or if you want to be more verbose:
df['c6'] = sum([0,1,2,3,4,5,6]) - df.sum(axis=1)
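The subtraction works because each row holds six distinct values out of 0..6, so the one missing value equals the total of the full set (21) minus the row sum. A minimal sketch on the question's data, assuming exactly one value is missing per row and no value repeats; the `cols` list is only there so that re-running the line does not add c6 into the sum:

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 5, 6, 3, 2, 1],
     [1, 5, 4, 0, 2, 3],
     [5, 6, 4, 0, 1, 3],
     [5, 4, 6, 2, 0, 1],
     [5, 6, 4, 0, 1, 3],
     [0, 1, 4, 5, 6, 2]],
    columns=[f"c{i}" for i in range(6)],
)

full = range(7)                      # the reference set 0..6
cols = [f"c{i}" for i in range(6)]   # only the original columns

# total of the full set minus the row sum leaves the single missing value
df["c6"] = sum(full) - df[cols].sum(axis=1)
print(df["c6"].tolist())  # [0, 6, 2, 3, 2, 3]
```

If a row can miss more than one value, or repeat a value, the arithmetic no longer identifies the missing element and a real set difference is needed.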

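Use numpy's setdiff1d to find the difference between the full set of values and each row, then assign the result to column c6:
import numpy as np
ck = np.array([0,1,2,3,4,5,6])
M = df.to_numpy()
# setdiff1d returns the sorted values of ck that are missing from the row; take the first (only) one
df['c6'] = [np.setdiff1d(ck,i)[0] for i in M]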
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3

A simple way I could think of is using a list comprehension and set difference:
s = {0, 1, 2, 3, 4, 5, 6}
s
{0, 1, 2, 3, 4, 5, 6}
df['c6'] = [tuple(s.difference(vals))[0] for vals in df.values]
df
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
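Taking element [0] of the difference assumes every row misses exactly one value. As a sketch of the more general case, the whole difference can be kept per row; the `universe` and `missing` names and the second data row (which repeats 2) are made up for illustration:

```python
import pandas as pd

universe = {0, 1, 2, 3, 4, 5, 6}
df = pd.DataFrame(
    [[4, 5, 6, 3, 2, 1],
     [1, 5, 4, 0, 2, 2]],   # 2 is repeated here, so both 3 and 6 are missing
    columns=[f"c{i}" for i in range(6)],
)

# one sorted list of missing values per row
missing = [sorted(universe.difference(row)) for row in df.to_numpy().tolist()]
df["missing"] = pd.Series(missing, index=df.index)
print(df["missing"].tolist())  # [[0], [3, 6]]
```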

Related

Pandas pivot columns based on column name prefix

I have a dataframe:
df = AG_Speed AG_wolt AB_Speed AB_wolt C1 C2 C3
1 2 3 4 6 7 8
1 9 2 6 4 1 8
And I want to pivot it based on prefix to get:
df = Speed Wolt C1 C2 C3 Category
1 2 6 7 8 AG
3 4 6 7 8 AB
1 9 4 1 8 AG
2 6 4 1 8 AB
What is the best way to do it?
We can use pd.wide_to_long for this. But since it expects the column names to start with the stubnames, we have to reverse the column format:
df.columns = ["_".join(col.split("_")[::-1]) for col in df.columns]
res = pd.wide_to_long(
    df,
    stubnames=["Speed", "wolt"],
    i=["C1", "C2", "C3"],
    j="Category",
    sep="_",
    suffix="[A-Za-z]+"
).reset_index()
C1 C2 C3 Category Speed wolt
0 6 7 8 AG 1 2
1 6 7 8 AB 3 4
2 4 1 8 AG 1 9
3 4 1 8 AB 2 6
If you want the columns in a specific order, use DataFrame.reindex:
res.reindex(columns=["Speed", "wolt", "C1", "C2", "C3", "Category"])
Speed wolt C1 C2 C3 Category
0 1 2 6 7 8 AG
1 3 4 6 7 8 AB
2 1 9 4 1 8 AG
3 2 6 4 1 8 AB
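For anyone who wants to run this end to end, here is a self-contained sketch that builds the question's frame first. The construction of df is my reading of the table in the question; note that wide_to_long needs the i columns (C1, C2, C3) to identify each row uniquely, which holds for this data.

```python
import pandas as pd

df = pd.DataFrame({
    "AG_Speed": [1, 1], "AG_wolt": [2, 9],
    "AB_Speed": [3, 2], "AB_wolt": [4, 6],
    "C1": [6, 4], "C2": [7, 1], "C3": [8, 8],
})

# AG_Speed -> Speed_AG, so that each name starts with its stubname;
# names without an underscore (C1, C2, C3) come out unchanged
df.columns = ["_".join(col.split("_")[::-1]) for col in df.columns]

res = pd.wide_to_long(
    df,
    stubnames=["Speed", "wolt"],
    i=["C1", "C2", "C3"],
    j="Category",
    sep="_",
    suffix="[A-Za-z]+",
).reset_index()

res = res.reindex(columns=["Speed", "wolt", "C1", "C2", "C3", "Category"])
print(res)
```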
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index = ['C1', 'C2', 'C3'],
                names_to = ('Category', '.value'),
                names_sep='_')
C1 C2 C3 Category Speed wolt
0 6 7 8 AG 1 2
1 4 1 8 AG 1 9
2 6 7 8 AB 3 4
3 4 1 8 AB 2 6
In the above solution, the .value determines which parts of the column labels remain as headers - the labels are split apart with the names_sep.
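To make the role of the split more concrete, roughly the same reshaping can be sketched in plain pandas: split the labels on "_" into a two-level column index, then stack the prefix level into rows while the second part stays as the header. This is an illustration of the idea, not how pyjanitor implements it; the `wide` and `long_df` names are made up, and the row order may differ from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "AG_Speed": [1, 1], "AG_wolt": [2, 9],
    "AB_Speed": [3, 2], "AB_wolt": [4, 6],
    "C1": [6, 4], "C2": [7, 1], "C3": [8, 8],
})

wide = df.set_index(["C1", "C2", "C3"])
# "AG_Speed" -> ("AG", "Speed"): first level is the category, second the header
wide.columns = wide.columns.str.split("_", expand=True)
wide.columns.names = ["Category", None]

# move the category level into the rows; Speed and wolt remain as columns
long_df = wide.stack("Category").reset_index()
print(long_df)
```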

How to compute column sum on the basis of other column value in pandas dataframe?

P T1 T2 T3
0 1 2 3
1 1 2 0
2 3 1 2
3 1 0 2
In the above pandas dataframe df,
I want to add columns on the basis of the value of column 'P'.
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to add from T1 to TN if df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values according to the P column, select each group of columns with DataFrame.filter, build a mask by broadcasting np.arange (one entry per filtered column) against the P values, pass the mask to DataFrame.where so the excluded columns become 0, and finally sum across each row:
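import numpy as np
import pandas as pd

np.random.seed(20)
c = [f'{x}{i + 1}' for x in ['T','U','V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10,10)), columns=['P'] + c)
# column vector of P so it broadcasts against the column positions
arrP = df['P'].to_numpy()[:, None]
for c in ['T','U','V']:
    df1 = df.filter(regex=rf'^{c}')
    # keep the first P columns of each group, replace the rest with 0
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print (df)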
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
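The same masking idea applied to the small frame from the question may be easier to follow. A minimal sketch, assuming the T columns appear in the frame in the order T1..TN; the `t_cols` and `mask` names are just for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "P":  [0, 1, 2, 3],
    "T1": [1, 1, 3, 1],
    "T2": [2, 2, 1, 0],
    "T3": [3, 0, 2, 2],
})

t_cols = df.filter(regex=r"^T\d+$")

# mask[i, j] is True when column position j is below P in row i,
# i.e. when T(j+1) should be included in that row's sum
mask = np.arange(t_cols.shape[1]) < df["P"].to_numpy()[:, None]

df["T_SUM"] = t_cols.where(mask, 0).sum(axis=1)
print(df["T_SUM"].tolist())  # [0, 1, 4, 3]
```

The results match the values worked out by hand in the question (0, 1, 3+1=4, 1+0+2=3).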

Dataframe within a Dataframe - to create new column

For the following dataframe:
import pandas as pd
df=pd.DataFrame({'list_A':[3,3,3,3,3,\
2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 2 1
5 2 1
6 2 0
7 2 0
8 4 1
9 4 1
10 4 1
11 4 1
12 4 0
13 4 0
14 4 0
15 4 0
16 4 0
As you can see, if list_A holds the number 3, then the first 3 values of list_B are 1 and the remaining values are 0, until list_A changes value again.
Use GroupBy.cumcount to number the rows within each group and compare that counter with list_A:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
EDIT: if the same value can reappear in a later run, group by consecutive blocks instead:
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
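The difference between the two versions only shows up when a value comes back after another value has intervened. A small sketch with made-up data (the `by_value` and `by_block` names are illustrative):

```python
import pandas as pd

# 2 appears in two separate runs
df = pd.DataFrame({"list_A": [2, 2, 2, 3, 3, 3, 3, 2, 2, 2]})

# counter per value: the second run of 2s continues counting at 3
by_value = df["list_A"].gt(df.groupby("list_A").cumcount()).astype(int)

# counter per consecutive run: a new block starts whenever the value changes
blocks = df["list_A"].ne(df["list_A"].shift()).cumsum()
by_block = df["list_A"].gt(df.groupby(blocks).cumcount()).astype(int)

print(by_value.tolist())  # [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
print(by_block.tolist())  # [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
```

With grouping by value, the second run of 2s gets only zeros; with grouping by block, the counter restarts and the first two rows of that run are 1 again.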

pandas: bin data into specific number of bins of specific size

I would like to bin a dataframe by the values in a single column into bins of a specific size and number.
Here is an example df:
df= pd.DataFrame(np.random.randint(0,10000,size=(10000, 4)), columns=list('ABCD'))
Say I want to bin by column D, I will first sort the data:
df.sort_values('D')
Now I would like to bin the rows so that, if the bin size is 50 and the number of bins is 100, the first 50 values go into bin 1, the next 50 into bin 2, and so on. Any values remaining after those bins are filled should all go into the final bin. Is there any way of doing this?
EDIT:
Here is a sample input:
x = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
And here is the expected output:
A B C D bin
0 6 8 6 5 3
1 5 4 9 1 1
2 5 1 7 4 3
3 6 3 3 3 2
4 2 5 9 3 2
5 2 5 1 3 2
6 0 1 1 0 1
7 3 9 5 8 3
8 2 4 0 1 1
9 6 4 5 6 3
As an extra aside, is it also possible to bin any equal values in the same bin? So for example, say I have bin 1 which contains values, 0,1,1 and then bin 2 contains 1,1,2. Is there any way of putting those two 1 values in bin 2 into bin 1? This will create very uneven bin sizes but this is not an issue.
It seems you need to floor-divide np.arange by the bin size and then assign the result to a new column:
idx = df['D'].sort_values().index
df['b'] = pd.Series(np.arange(len(df)) // 3 + 1, index = idx)
print (df)
A B C D bin b
0 6 8 6 5 3 3
1 5 4 9 1 1 1
2 5 1 7 4 3 3
3 6 3 3 3 2 2
4 2 5 9 3 2 2
5 2 5 1 3 2 2
6 0 1 1 0 1 1
7 3 9 5 8 3 4
8 2 4 0 1 1 1
9 6 4 5 6 3 3
Detail:
print (np.arange(len(df)) // 3 + 1)
[1 1 1 2 2 2 3 3 3 4]
EDIT:
I posted another question about the problem with the last values; here is one solution from it:
N = 3
#one possible solution, thanks divakar
def replace_irregular_groupings(a, N):
    n = len(a)
    m = N*(n//N)
    #fold a trailing incomplete group into the last full one
    if m!=n:
        a[m:] = a[m-1]
    return a
idx = df['D'].sort_values().index
arr = replace_irregular_groupings(np.arange(len(df)) // N + 1, N)
df['b'] = pd.Series(arr, index = idx)
print (df)
A B C D bin b
0 6 8 6 5 3 3
1 5 4 9 1 1 1
2 5 1 7 4 3 3
3 6 3 3 3 2 2
4 2 5 9 3 2 2
5 2 5 1 3 2 2
6 0 1 1 0 1 1
7 3 9 5 8 3 3
8 2 4 0 1 1 1
9 6 4 5 6 3 3
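Since the question asks for both a bin size and a number of bins, another way to get the "everything left over goes into the final bin" behaviour is to cap the label at the number of bins. A hedged sketch: `assign_bins` is a hypothetical helper, it assumes the frame has a unique index, and it uses a stable sort so that ties keep their original order.

```python
import numpy as np
import pandas as pd

def assign_bins(df, column, bin_size, n_bins):
    """Label rows 1..n_bins by sorted position; overflow goes into the last bin."""
    order = df[column].sort_values(kind="mergesort").index
    # position // bin_size + 1 is the uncapped bin; cap it at n_bins
    labels = np.minimum(np.arange(len(df)) // bin_size + 1, n_bins)
    return pd.Series(labels, index=order).reindex(df.index)

# column D from the sample output in the question
x = pd.DataFrame({"D": [5, 1, 4, 3, 3, 3, 0, 8, 1, 6]})
x["bin"] = assign_bins(x, "D", bin_size=3, n_bins=3)
print(x["bin"].tolist())  # [3, 1, 3, 2, 2, 2, 1, 3, 1, 3]
```

For this sample the labels agree with the bin column in the expected output of the question.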

Display Rows only if group of rows' sum is greater than 0

I have a table like the one below. I would like to get this data to SSRS (Grouped by LineID and Product and Column as Hour) to show only those rows where HourCount > 0 for every LineID and Product.
LineID Product Hour HourCount
3 A 0 0
3 A 1 0
3 A 2 0
3 A 3 0
3 A 4 0
3 A 5 0
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
4 B 0 0
4 B 1 0
4 B 2 0
4 B 3 0
4 B 4 0
4 B 5 0
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
Basically I would like this table to look like this before it's in SSRS:
LineID Product Hour HourCount
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
So display a Product for the line only if any of its Hours has a HourCount higher than 0.
Is there a query that could give me these results, or should I play with the display settings in SSRS?
Something like this should work:
with NonZero as
(
  select *
    , GroupZeroCount = sum(HourCount) over (partition by LineID, Product)
  from HourTable
)
select LineID
  , Product
  , [Hour]
  , HourCount
from NonZero
where GroupZeroCount > 0
You could certainly do something similar in SSRS, but it's much easier and more intuitive to apply at the T-SQL level.
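To try the windowed-sum idea without a SQL Server instance, here is a rough sketch that runs an equivalent query against an in-memory SQLite database from Python. This is an assumption-laden stand-in, not the T-SQL above: SQLite needs the `as` alias form instead of `GroupZeroCount = ...`, window functions require SQLite 3.25 or newer, and only a few rows of the sample table are loaded. The `Totals` and `GroupTotal` names are made up.

```python
import sqlite3

rows = [
    (3, "A", 0, 0), (3, "A", 1, 0),     # group total 0  -> dropped
    (3, "B", 0, 65), (3, "B", 1, 56),   # group total > 0 -> kept
    (5, "B", 0, 0), (5, "B", 1, 45),    # one non-zero hour keeps the whole group
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table HourTable (LineID int, Product text, Hour int, HourCount int)"
)
con.executemany("insert into HourTable values (?, ?, ?, ?)", rows)

query = """
with Totals as (
    select *
         , sum(HourCount) over (partition by LineID, Product) as GroupTotal
    from HourTable
)
select LineID, Product, Hour, HourCount
from Totals
where GroupTotal > 0
order by LineID, Product, Hour
"""
kept = con.execute(query).fetchall()
print(kept)
```

Rows with a zero HourCount survive as long as some other hour in the same LineID/Product group is non-zero, which is the behaviour the question asks for.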
I think you are looking for:
SELECT LineID,Product,Hour,Count(Hour) AS HourCount
FROM abc
GROUP BY LineID,Product,Hour HAVING Count(Hour) > 0