Aggregate features row-wise in dataframe - pandas

I am trying to create features from a sample that looks like this:
index  user  product  sub_product  status
0      u1    p1       sp1          NA
1      u1    p1       sp2          NA
2      u1    p1       sp3          CANCELED
3      u1    p1       sp4          AVAIL
4      u2    p3       sp2          AVAIL
5      u2    p3       sp3          CANCELED
6      u2    p3       sp7          NA
First, I created dummies:
pd.get_dummies(x, columns=['product', 'sub_product', 'status'])
But I also need to group the rows so that there is one row per user. What is the best way to do it?
If I just group it:
pd.get_dummies(x, columns=['product', 'sub_product', 'status']).groupby('user').max()
user  product_p1  product_p3  sub_product_sp1  sub_product_sp2  sub_product_sp3  sub_product_sp4  sub_product_sp7  status_AVAIL  status_CANCELED  status_NA
u1             1           0                1                1                1                1                0             1                1          1
u2             0           1                0                1                1                0                1             1                1          1
I will lose information, for example that for u1 the sp3 status is CANCELED. So it looks like I have to create dummies for every column combination?

Update: You are basically looking for pivot:
out = (df.astype(str)
         .assign(value=1)
         .pivot_table(index=['user'],
                      columns=['product', 'sub_product', 'status'],
                      values='value', fill_value=0, aggfunc='max')
      )
out.columns = ['_'.join(x) for x in out.columns]
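
For completeness, here is a minimal, self-contained sketch of that approach, assuming the sample frame from the question (status is entered here as the literal string 'NA'; real NaN values would simply become the string 'nan' after astype(str)):

import pandas as pd

# Rebuild the sample from the question.
df = pd.DataFrame({
    'user':        ['u1', 'u1', 'u1', 'u1', 'u2', 'u2', 'u2'],
    'product':     ['p1', 'p1', 'p1', 'p1', 'p3', 'p3', 'p3'],
    'sub_product': ['sp1', 'sp2', 'sp3', 'sp4', 'sp2', 'sp3', 'sp7'],
    'status':      ['NA', 'NA', 'CANCELED', 'AVAIL', 'AVAIL', 'CANCELED', 'NA'],
})

out = (df.astype(str)
         .assign(value=1)
         .pivot_table(index='user',
                      columns=['product', 'sub_product', 'status'],
                      values='value', fill_value=0, aggfunc='max'))
out.columns = ['_'.join(col) for col in out.columns]

# Columns now look like 'p1_sp3_CANCELED', so the fact that u1's sp3 was
# canceled is kept instead of being split into unrelated dummy columns.
print(out)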

Related

SQL Server- How to count number of values in multiple columns and insert into another table with row values as columns and columns as row values

I have a table in SQL Server with values like below:
Item  A    B    C    D    Date
P1    Yes  No   Yes  NA   20210801
P2    Yes  Yes  Yes  NA   20210801
P3    Yes  Yes  No   No   20210801
P4    Yes  No   NA   No   20210801
P5    No   NA   No   Yes  20210801
P6    NA   NA   Yes  No   20210801
P1    Yes  No   Yes  NA   20210901
P2    Yes  Yes  Yes  NA   20210901
P3    No   No   Yes  NA   20210901
P4    Yes  No   NA   No   20210901
P5    No   NA   No   Yes  20210901
P6    NA   NA   Yes  No   20210901
I want the count of each of the values (Yes, No, NA) for every column, like below. The column names will be the row values.
Source  Yes  No  NA  Date
A       3    2   1   20210901
B       1    3   2   20210901
C       4    1   1   20210901
D       1    2   3   20210901
The query will run with a specific value in the WHERE clause for the Date column (e.g. WHERE Date = '20210901').
Please help me to achieve this.
One method is to unpivot using apply and aggregate:
select v.source, t.date,
       sum(case when v.val = 'Yes' then 1 else 0 end) as yes,
       sum(case when v.val = 'No'  then 1 else 0 end) as no,
       sum(case when v.val = 'NA'  then 1 else 0 end) as na
from t cross apply
     (values ('A', t.A), ('B', t.B), ('C', t.C), ('D', t.D)) v(source, val)
where t.date = '20210901'
group by v.source, t.date;

How to find difference between rows in a pandas multiIndex, by level 1

Suppose we have a DataFrame like this, only with many, many more index A values:
df = pd.DataFrame([[1, 2, 1, 2],
                   [1, 1, 2, 2],
                   [2, 2, 1, 0],
                   [1, 2, 1, 2],
                   [2, 1, 1, 2]], columns=['A', 'B', 'c1', 'c2'])
df.groupby(['A','B']).sum()
## result
     c1  c2
A B
1 1   2   2
  2   2   4
2 1   1   2
  2   1   0
How can I get a data frame that consists of the difference between rows, by the second level of the index, level B?
The output here would be
A c1 c2
1 0 -2
2 0 2
Note: in my particular use case, I have a lot of column A values, so I can't write out the values for A explicitly.
Check diff and dropna
g = df.groupby(['A','B'])[['c1','c2']].sum()
g = g.groupby(level=0).diff().dropna()
g
Out[25]:
c1 c2
A B
1 2 0.0 2.0
2 2 0.0 -2.0
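If you need the frame keyed by A only, as in the desired output, a possible follow-up (just a sketch, assuming the g computed above; note that diff gives last-minus-first, i.e. the opposite sign of the example output in the question):

# Drop the now-redundant B level and cast back to integers
# (assumes `g` from the snippet above).
g = g.droplevel('B').astype(int)
print(g)
#    c1  c2
# A
# 1   0   2
# 2   0  -2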
Assigning the first grouping to a result variable:
result = df.groupby(['A','B']).sum()
You could use a pipe operation with nth:
result.groupby('A').pipe(lambda df: df.nth(0) - df.nth(-1))
c1 c2
A
1 0 -2
2 0 2
A simpler option, in my opinion, would be to use agg combined with numpy's ufunc reduce, as this covers scenarios where you have more than two rows:
result.groupby('A').agg(np.subtract.reduce)
c1 c2
A
1 0 -2
2 0 2
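
To illustrate the "more than two rows" point, a small sketch with made-up data (not from the question): np.subtract.reduce folds left to right, i.e. first - second - third - ...

import numpy as np
import pandas as pd

# Hypothetical group with three B levels to show the left-to-right fold.
demo = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 2, 3], 'c1': [10, 3, 2]})
print(demo.groupby('A')['c1'].agg(np.subtract.reduce))
# A
# 1    5
# Name: c1, dtype: int64   (10 - 3 - 2)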

Effectively removing rows from a Pandas DataFrame with groupby and temporal conditions?

I have a dataframe with tens of millions of rows:
| userId | pageId | bannerId | timestap |
|--------+--------+----------+---------------------|
| A | P1 | B1 | 2020-10-10 01:00:00 |
| A | P1 | B1 | 2020-10-10 01:00:10 |
| B | P1 | B1 | 2020-10-10 01:00:00 |
| B | P2 | B2 | 2020-10-10 02:00:00 |
What I'd like to do is remove all rows where, for the same userId, pageId, bannerId combination, the timestamp is within n minutes of the previous occurrence of that same combination.
What I'm doing now:
# Get all instances of `userId, pageId, bannerId` that repeat,
# although not all of them will have repeated within the `n` minute
# threshold I'm interested in.
groups = df.groupby(['userId', 'pageId', 'bannerId']).userId.count()

# Iterate through each group, and manually check whether the repetition
# was within `n` minutes. Keep track of all IDs to be removed.
to_remove = []
for user_id, page_id, banner_id in groups.index:
    sub = df.loc[
        (df.userId == user_id) &
        (df.pageId == page_id) &
        (df.bannerId == banner_id)
    ].sort_values('timestamp')
    # Now that each occurrence is listed chronologically,
    # check the time diff.
    sub = sub.loc[
        ((sub.timestamp.shift(1) - sub.timestamp) / pd.Timedelta(minutes=1)).abs() <= n
    ]
    if sub.shape[0] > 0:
        to_remove += sub.index.tolist()
This does work as I'd like. The only issue is that, with the large amount of data I have, it takes hours to complete.
To get a more instructive result, I took a slightly longer
source DataFrame:
userId pageId bannerId timestap
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:04:10
2 A P1 B1 2020-10-10 01:05:00
3 A P1 B1 2020-10-10 01:08:20
4 A P1 B1 2020-10-10 01:09:30
5 A P1 B1 2020-10-10 01:11:00
6 B P1 B1 2020-10-10 01:00:00
7 B P2 B2 2020-10-10 02:00:00
Note: timestap column is of datetime type.
Start by defining a "filtering" function for a group of timestap values
(for some combination of userId, pageId and bannerId):
def myFilter(grp, nMin):
    prevTs = np.nan
    grp = grp.sort_values()
    res = []
    for ts in grp:
        if pd.isna(prevTs) or (ts - prevTs) / pd.Timedelta(1, 'm') >= nMin:
            prevTs = ts
            res.append(ts)
    return res
Then set the time threshold (the number of minutes):
nMin = 5
And the last thing is to generate the result:
result = df.groupby(['userId', 'pageId', 'bannerId'])\
           .timestap.apply(myFilter, nMin).explode().reset_index()
For my data sample, the result is:
userId pageId bannerId timestap
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:05:00
2 A P1 B1 2020-10-10 01:11:00
3 B P1 B1 2020-10-10 01:00:00
4 B P2 B2 2020-10-10 02:00:00
Note that "ordinary" diff is not enough, because eg. starting from the
row with timestamp 01:05:00, two following rows (01:08:20 and 01:09:30)
should be dropped, as they are within 5 minutes limit from 01:05:00.
So it is not enough to look at the previous row only.
Starting from some row you should "mark for drop" all following rows until
you find a row with the timestamp more or at least equally distant from the
"start row" than the limit.
And in this case just this rows becomes the starting row for analysis of
following rows (within the current group).
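
One small follow-up worth noting (this is an assumption about the result frame built above, not something stated in the answer): after apply + explode the timestap column typically comes back with object dtype, so it can be worth converting it again:

# Restore a datetime dtype if explode() left `timestap` as object
# (assumes the `result` frame produced above).
result['timestap'] = pd.to_datetime(result['timestap'])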

How to change a CRUD-matrix into one using multi-index on columns?

I have a data structure like the one below, which mimics a CRUD matrix for processes and data objects:
import pandas as pd

df = pd.DataFrame({'ObjA': 'crud ru ru cu r'.split(),
                   'ObjB': 'r r ru rud crud'.split(),
                   'ObjC': 'd crud crud ru r'.split(),
                   }, index='P1 P2 P3 P4 P5'.split())
df
This results in:
ObjA ObjB ObjC
P1 crud r d
P2 ru r crud
P3 ru ru crud
P4 cu rud ru
P5 r crud r
I need to change this structure into one which uses indicator variables for c, r, u and d, so that the data objects (columns of the initial data structure) appear as level 0 and the crud-indicator variables as level 1 of a column multi-index, like shown here:
df_dict = {}
for col in df.columns:
    df_dict[col] = df[col].str.get_dummies('').reindex(columns='c r u d'.split())
pd.concat(df_dict, axis=1)
yielding:
ObjA ObjB ObjC
c r u d c r u d c r u d
P1 1 1 1 1 0 1 0 0 0 0 0 1
P2 0 1 1 0 0 1 0 0 1 1 1 1
P3 0 1 1 0 0 1 1 0 1 1 1 1
P4 1 0 1 0 0 1 1 1 0 1 1 0
P5 0 1 0 0 1 1 1 1 0 1 0 0
Is there a more elegant way to achieve the desired outcome apart from ugly iteration, building separate dataframes and then concatenating everything back into the final structure? I know there must be a way to do it with df.apply(some_clever_func, axis=...), but my experiments so far have all failed.
Ok, I could not stop tinkering with it and finally came up with an even better solution, I believe:
(df
.apply(lambda x: x.str.get_dummies(sep='').stack(), axis='columns')
.reindex([*'crud'], level=1, axis='columns')
)
Although this is the slowest solution, it is the most readable one. Funnily enough, the solution in the question, using iteration, performs best.
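
If you want to check that claim on your own data, a rough timing sketch (assuming the df and the two constructions shown above; on a frame this small the numbers mostly reflect per-call overhead):

import timeit

# Loop-based construction from the question, wrapped in a callable.
loop_time = timeit.timeit(
    lambda: pd.concat(
        {col: df[col].str.get_dummies('').reindex(columns='c r u d'.split())
         for col in df.columns},
        axis=1),
    number=100)

# apply/stack construction from this answer.
apply_time = timeit.timeit(
    lambda: (df
             .apply(lambda x: x.str.get_dummies(sep='').stack(), axis='columns')
             .reindex([*'crud'], level=1, axis='columns')),
    number=100)

print(f'loop: {loop_time:.3f}s   apply: {apply_time:.3f}s')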

Record filtration issue

I have 2 tables in the database: 1. Agents table and 2. UserPermissions table.
Table 1, Agents, contains the list of agents as given below; here CompanyID and AgentID are unique values.
AgentID AgentCode CompanyID RailEnabled
1 A1 1 1
2 A2 2 0
3 A3 3 1
4 A4 4 0
Table 2, UserPermissions, contains the list of all users who come under a particular agent. It has two relevant fields: ModuleCode, whose values are codes like 'RAL' (short for Rail), and Enabled.
UserID UserName ModuleCode Enabled CompanyID
1 U1 RAL 1 1
2 U2 RAL 0 1
3 U3 RAL 1 1
13 U4 BUS 1 1
4 U1 RAL 0 2
5 U2 RAL 0 2
6 U3 RAL 0 2
14 U4 HTL 1 2
7 U1 RAL 0 3
8 U2 RAL 0 3
9 U3 RAL 0 3
15 U4 FLT 1 3
10 U1 RAL 0 4
11 U2 RAL 0 4
12 U3 RAL 0 4
16 U4 BUS 1 4
Now I need to filter out only the agents for which RailEnabled is true but none of the users of that agent is enabled for the Rail service (module code: RAL).
Thanks for your help in advance.
Use NOT EXISTS.
SQLFiddle
select *
from agents a
where railenabled = 1  -- assuming 1 means enabled
  and not exists (     -- not exists to remove all agents whose users have RAL enabled
        select 1
        from userpermissions b
        where a.companyid = b.companyid
          and b.modulecode = 'RAL'
          and b.enabled = 1
      );