Pandas df row count

Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
I have a string column 'Date' and I would like to create the 'Ct' column, as represented above, to maintain a count of the rows for each date. 'Date' needs to stay a string in my application, there will not always be an equal number of rows for each date, and 'Ct' must always count in the order of the numerical index. An answer or a nudge in the right direction would be greatly appreciated.

OK, this is a little weird, but you can add a temp column and set its value to 1:
df['temp'] = 1
You can then group by 'Date' and call transform on the 'temp' column to perform a cumulative count:
In [80]:
df['Ct'] = df.groupby('Date')['temp'].transform(pd.Series.cumsum)
df
Out[80]:
Date temp Ct
0 2015-04-01 1 1
1 2015-04-01 1 2
2 2015-04-01 1 3
3 2015-04-01 1 4
4 2015-04-02 1 1
5 2015-04-02 1 2
6 2015-04-02 1 3
7 2015-04-02 1 4
8 2015-04-03 1 1
9 2015-04-03 1 2
10 2015-04-03 1 3
11 2015-04-03 1 4
12 2015-04-04 1 1
13 2015-04-04 1 2
14 2015-04-04 1 3
15 2015-04-04 1 4
In [81]:
df.drop('temp',axis=1,inplace=True)
df
Out[81]:
Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
The reason we can't just perform the cumsum on the 'Date' column directly is that, for a string column, cumsum concatenates the date strings with each other, which is not what you want.
EDIT
Thanks to @Jeff for pointing out that the temp column is unnecessary and you can just use cumcount:
In [150]:
df['Ct'] = df.groupby('Date').cumcount() + 1
df
Out[150]:
Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
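As a self-contained sketch (using a cut-down version of the sample data, not the asker's actual frame), the cumcount approach looks like this:

```python
import pandas as pd

# Cut-down sample: four rows per date, as in the question
df = pd.DataFrame({
    'Date': ['2015-04-01'] * 4 + ['2015-04-02'] * 4
})

# cumcount() numbers rows within each group starting at 0, so add 1
df['Ct'] = df.groupby('Date').cumcount() + 1
print(df)
```

Because cumcount follows the existing row order, 'Ct' counts in numerical-index order within each date, exactly as the question requires.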

Related

How can I add rows iteratively to a select result set in pl sql?

In the work_order table there is wo_no. When I query the work_order table, I want two additional columns (task_no, task_step_no) in the result set, as follows:
This should iterate over all the wo_no values in the work_order table; task_no should go up to 5 and task_step_no should go up to 2000 (please have a look at the attached image to see the result set if this is not clear).
Any idea how to get such a result set in PL/SQL?
One option is to use two row generators cross-joined to your current table.
SQL> with
  work_order (wo_no) as
    (select 1 from dual union all
     select 2 from dual
    ),
  task (task_no) as
    (select level from dual connect by level <= 5),
  step (task_step_no) as
    (select level from dual connect by level <= 20) --> you'd have 2000 here
  select y.wo_no, t.task_no, s.task_step_no
  from work_order y cross join task t cross join step s
  order by 1, 2, 3;
Result:
WO_NO TASK_NO TASK_STEP_NO
---------- ---------- ------------
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
1 1 6
1 1 7
1 1 8
1 1 9
1 1 10
1 1 11
1 1 12
1 1 13
1 1 14
1 1 15
1 1 16
1 1 17
1 1 18
1 1 19
1 1 20
1 2 1
1 2 2
1 2 3
1 2 4
1 2 5
1 2 6
1 2 7
1 2 8
1 2 9
1 2 10
1 2 11
1 2 12
1 2 13
1 2 14
1 2 15
1 2 16
1 2 17
1 2 18
1 2 19
1 2 20
1 3 1
1 3 2
1 3 3
1 3 4
1 3 5
1 3 6
1 3 7
1 3 8
1 3 9
1 3 10
1 3 11
1 3 12
1 3 13
1 3 14
1 3 15
1 3 16
1 3 17
1 3 18
1 3 19
1 3 20
1 4 1
1 4 2
1 4 3
1 4 4
1 4 5
1 4 6
1 4 7
1 4 8
1 4 9
1 4 10
1 4 11
1 4 12
1 4 13
1 4 14
1 4 15
1 4 16
1 4 17
1 4 18
1 4 19
1 4 20
1 5 1
1 5 2
1 5 3
1 5 4
1 5 5
1 5 6
1 5 7
1 5 8
1 5 9
1 5 10
1 5 11
1 5 12
1 5 13
1 5 14
1 5 15
1 5 16
1 5 17
1 5 18
1 5 19
1 5 20
2 1 1
2 1 2
2 1 3
2 1 4
2 1 5
2 1 6
2 1 7
2 1 8
2 1 9
2 1 10
2 1 11
2 1 12
2 1 13
2 1 14
2 1 15
2 1 16
2 1 17
2 1 18
2 1 19
2 1 20
2 2 1
2 2 2
2 2 3
2 2 4
2 2 5
2 2 6
2 2 7
2 2 8
2 2 9
2 2 10
2 2 11
2 2 12
2 2 13
2 2 14
2 2 15
2 2 16
2 2 17
2 2 18
2 2 19
2 2 20
2 3 1
2 3 2
2 3 3
2 3 4
2 3 5
2 3 6
2 3 7
2 3 8
2 3 9
2 3 10
2 3 11
2 3 12
2 3 13
2 3 14
2 3 15
2 3 16
2 3 17
2 3 18
2 3 19
2 3 20
2 4 1
2 4 2
2 4 3
2 4 4
2 4 5
2 4 6
2 4 7
2 4 8
2 4 9
2 4 10
2 4 11
2 4 12
2 4 13
2 4 14
2 4 15
2 4 16
2 4 17
2 4 18
2 4 19
2 4 20
2 5 1
2 5 2
2 5 3
2 5 4
2 5 5
2 5 6
2 5 7
2 5 8
2 5 9
2 5 10
2 5 11
2 5 12
2 5 13
2 5 14
2 5 15
2 5 16
2 5 17
2 5 18
2 5 19
2 5 20
200 rows selected.
SQL>
As you already have the work_order table, you'd just use it in FROM clause (not as a CTE):
with
task (task_no) as
(select level from dual connect by level <= 5),
step (task_step_no) as
(select level from dual connect by level <= 20)
select y.wo_no, t.task_no, s.task_step_no
from work_order y cross join task t cross join step s
order by 1, 2, 3;
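For portability, the same row-generation idea can be sketched outside Oracle; here SQLite's WITH RECURSIVE stands in for CONNECT BY LEVEL (the in-memory table and two-row sample are illustrative, not the asker's real work_order data):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE work_order (wo_no INTEGER);
    INSERT INTO work_order VALUES (1), (2);
""")

# Recursive CTEs generate the task and step numbers,
# then a double CROSS JOIN expands every work order
rows = conn.execute("""
    WITH RECURSIVE
    task(task_no) AS (
        SELECT 1 UNION ALL SELECT task_no + 1 FROM task WHERE task_no < 5
    ),
    step(task_step_no) AS (
        SELECT 1 UNION ALL SELECT task_step_no + 1 FROM step WHERE task_step_no < 20
    )
    SELECT y.wo_no, t.task_no, s.task_step_no
    FROM work_order y CROSS JOIN task t CROSS JOIN step s
    ORDER BY 1, 2, 3
""").fetchall()

print(len(rows))  # 2 work orders x 5 tasks x 20 steps = 200
```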

pandas retain values on different index dataframes

I need to merge two dataframes with different frequencies (daily and weekly), but I would like to retain the weekly values when merging into the daily dataframe.
There is a grouping variable in the data, group.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
daily={'date':[datetime.date(2022,1,1)+relativedelta(day=i) for i in range(1,10)]*2,
       'group':['A' for x in range(1,10)]+['B' for x in range(1,10)],
       'daily_value':[x for x in range(1,10)]*2}
weekly={'date':[datetime.date(2022,1,1),datetime.date(2022,1,7)]*2,
        'group':['A','A']+['B','B'],
        'weekly_value':[100,200,300,400]}
daily_data=pd.DataFrame(daily)
weekly_data=pd.DataFrame(weekly)
daily_data output:
date group daily_value
0 2022-01-01 A 1
1 2022-01-02 A 2
2 2022-01-03 A 3
3 2022-01-04 A 4
4 2022-01-05 A 5
5 2022-01-06 A 6
6 2022-01-07 A 7
7 2022-01-08 A 8
8 2022-01-09 A 9
9 2022-01-01 B 1
10 2022-01-02 B 2
11 2022-01-03 B 3
12 2022-01-04 B 4
13 2022-01-05 B 5
14 2022-01-06 B 6
15 2022-01-07 B 7
16 2022-01-08 B 8
17 2022-01-09 B 9
weekly_data output:
date group weekly_value
0 2022-01-01 A 100
1 2022-01-07 A 200
2 2022-01-01 B 300
3 2022-01-07 B 400
The desired output
desired={'date':[datetime.date(2022,1,1)+relativedelta(day=i) for i in range(1,10)]*2,
'group':['A' for x in range(1,10)]+['B' for x in range(1,10)],
'daily_value':[x for x in range(1,10)]*2,
'weekly_value':[100]*6+[200]*3+[300]*6+[400]*3}
desired_data=pd.DataFrame(desired)
desired_data output:
date group daily_value weekly_value
0 2022-01-01 A 1 100
1 2022-01-02 A 2 100
2 2022-01-03 A 3 100
3 2022-01-04 A 4 100
4 2022-01-05 A 5 100
5 2022-01-06 A 6 100
6 2022-01-07 A 7 200
7 2022-01-08 A 8 200
8 2022-01-09 A 9 200
9 2022-01-01 B 1 300
10 2022-01-02 B 2 300
11 2022-01-03 B 3 300
12 2022-01-04 B 4 300
13 2022-01-05 B 5 300
14 2022-01-06 B 6 300
15 2022-01-07 B 7 400
16 2022-01-08 B 8 400
17 2022-01-09 B 9 400
Use merge_asof, sorting both frames by the datetime column first, then sort the result back to the original order by both columns:
daily_data['date'] = pd.to_datetime(daily_data['date'])
weekly_data['date'] = pd.to_datetime(weekly_data['date'])
df = (pd.merge_asof(daily_data.sort_values('date'),
                    weekly_data.sort_values('date'),
                    on='date',
                    by='group')
        .sort_values(['group','date'], ignore_index=True))
print (df)
date group daily_value weekly_value
0 2022-01-01 A 1 100
1 2022-01-02 A 2 100
2 2022-01-03 A 3 100
3 2022-01-04 A 4 100
4 2022-01-05 A 5 100
5 2022-01-06 A 6 100
6 2022-01-07 A 7 200
7 2022-01-08 A 8 200
8 2022-01-09 A 9 200
9 2022-01-01 B 1 300
10 2022-01-02 B 2 300
11 2022-01-03 B 3 300
12 2022-01-04 B 4 300
13 2022-01-05 B 5 300
14 2022-01-06 B 6 300
15 2022-01-07 B 7 400
16 2022-01-08 B 8 400
17 2022-01-09 B 9 400
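One point worth making explicit: merge_asof matches with direction='backward' by default, taking the latest weekly row whose date is at or before each daily date, which is why both frames must be sorted on 'date' first. A minimal sketch on made-up data:

```python
import pandas as pd

daily = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-01', '2022-01-03', '2022-01-07', '2022-01-08']),
    'group': ['A'] * 4,
})
weekly = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-01', '2022-01-07']),
    'group': ['A', 'A'],
    'weekly_value': [100, 200],
})

# direction='backward' (the default) takes the latest weekly row whose
# date is <= each daily date; exact matches are included by default
out = pd.merge_asof(daily.sort_values('date'),
                    weekly.sort_values('date'),
                    on='date', by='group')
print(out['weekly_value'].tolist())  # [100, 100, 200, 200]
```

Note that 2022-01-07 picks up the new weekly value (200) because allow_exact_matches defaults to True, which is exactly the boundary behaviour the desired output shows.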

Counting groups in columns in dataframe

I have a dataframe df:
prds
0 E01
1 E02
2 E03
3 E04
4 E01
5 E02
6 E03
7 E04
8 F01
9 F02
10 F03
11 F04
12 F05
I would like a count within each group in the 'prds' column, stored in another column 'match', hence:
prds match
0 E01 1
1 E02 2
2 E03 3
3 E04 4
4 E01 1
5 E02 2
6 E03 3
7 E04 4
8 F01 1
9 F02 2
10 F03 3
11 F04 4
12 F05 5
Any help would be greatly appreciated. Thank you in advance.
If each group is defined as starting at a value that ends in 1, you can use Series.str.endswith with Series.cumsum to build a group key, then pass it to GroupBy.cumcount:
df['match'] = df.groupby(df['prds'].str.endswith('1').cumsum()).cumcount() + 1
print (df)
prds match
0 E01 1
1 E02 2
2 E03 3
3 E04 4
4 E01 1
5 E02 2
6 E03 3
7 E04 4
8 F01 1
9 F02 2
10 F03 3
11 F04 4
12 F05 5
You can simply extract the digits (note the raw string for the regex):
df['match'] = df['prds'].str.extract(r'(\d+)').astype('int')
Output:
prds match
0 E01 1
1 E02 2
2 E03 3
3 E04 4
4 E01 1
5 E02 2
6 E03 3
7 E04 4
8 F01 1
9 F02 2
10 F03 3
11 F04 4
12 F05 5
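Both answers can be checked side by side on a tiny made-up frame; note that they only agree under the stated assumptions (codes end in digits, and each group restarts at a value ending in '1'):

```python
import pandas as pd

df = pd.DataFrame({'prds': ['E01', 'E02', 'E01', 'E02', 'F01']})

# Approach 1: start a new group wherever the code ends in '1',
# then count rows within each run
grp = df['prds'].str.endswith('1').cumsum()
df['match_cumcount'] = df.groupby(grp).cumcount() + 1

# Approach 2: just parse the trailing digits (raw string avoids
# the invalid-escape warning for \d)
df['match_extract'] = df['prds'].str.extract(r'(\d+)', expand=False).astype(int)

print(df)
```

If the numbers inside the codes ever skip or repeat out of order, the two approaches diverge: cumcount counts positions, extract reads the label.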

Groupby filter based on count, calculate duration, penultimate status

I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
28 5 M 2018-10-10 200
29 5 F 2019-06-10 500
30 6 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
where
F = Failure
M = Maintenance
P = Planned
Step 1 - Select the data of IDs which have at least two statuses (F, M, or P) before the last Failure.
Step 2 - Ignore the rows if the last row per ID is not F; the expected output after this is shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
Now, for each ID the last status is Failure.
From the above df I would like to prepare the DataFrame below:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
7 2 1 1 M 518 151
SLS = Second Last Status
LS = Last Status
I tried the following code to calculate the duration.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
We can create a mask with groupby + bfill that allows us to perform both selections (np here is numpy, imported as usual):
m = df.Status.eq('F').replace(False, np.NaN).groupby(df.ID).bfill()
df = df.loc[m.groupby(df.ID).transform('sum').gt(2) & m]
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
The second part is a bit more annoying. There's almost certainly a smarter way to do this, but here's the straightforward way:
s = df.Date.diff().dt.days
res = pd.concat([df.groupby('ID').Status.value_counts().unstack().add_prefix('No_of_'),
                 df.groupby('ID').Status.apply(lambda x: x.iloc[-2]).to_frame('SLS'),
                 (s.where(s.gt(0)).groupby(df.ID).apply(lambda x: x.cumsum().iloc[-2])
                   .to_frame('NoDays_to_SLS')),
                 s.groupby(df.ID).apply(lambda x: x.iloc[-1]).to_frame('NoDays_SLS_to_LS')],
                axis=1)
Output:
No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487.0 151.0
2 3 3 1 M 487.0 61.0
3 3 2 2 P 640.0 90.0
4 3 2 1 M 518.0 151.0
7 2 2 1 M 518.0 151.0
Here's my attempt (note: I am using pandas 0.25):
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
df_1 = df.groupby('ID', group_keys=False)\
         .apply(lambda x: x[(x['Status']=='F')[::-1].cumsum().astype(bool)])
df_2 = df_1[df_1.groupby('ID')['Status'].transform('count') > 2]
g = df_2.groupby('ID')
df_Counts = g['Status'].value_counts().unstack().add_prefix('No_of_')
df_SLS = g['Status'].agg(lambda x: x.iloc[-2]).rename('SLS')
df_dates = g['Date'].agg(NoDays_to_SLS=lambda x: x.iloc[-2]-x.iloc[0],
                         NoDays_to_SLS_LS=lambda x: x.iloc[-1]-x.iloc[-2])
pd.concat([df_Counts, df_SLS, df_dates], axis=1).reset_index()
Output:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_to_SLS_LS
0 1 3 2 2 P 487 days 151 days
1 2 3 3 1 M 487 days 61 days
2 3 3 2 2 P 640 days 90 days
3 4 3 2 1 M 518 days 151 days
4 7 2 2 1 M 518 days 151 days
There are some enhancements in 0.25 that this code uses.
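The core moves shared by both answers (value_counts + unstack for the per-status counts, iloc[-2] for the second-last status) can be sketched on a minimal made-up frame:

```python
import pandas as pd

# Tiny illustrative frame: two IDs, each ending in F
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Status': ['F', 'M', 'F', 'M', 'P', 'F'],
    'Date': pd.to_datetime(['2017-06-22', '2017-07-22', '2018-06-22',
                            '2017-06-29', '2018-08-29', '2018-12-29']),
})

g = df.groupby('ID')

# Per-ID status counts as columns (fill_value avoids NaN for missing statuses)
counts = g['Status'].value_counts().unstack(fill_value=0).add_prefix('No_of_')

# Second-last status, and the gap in days from it to the last row
sls = g['Status'].agg(lambda s: s.iloc[-2]).rename('SLS')
gap = g['Date'].agg(lambda d: (d.iloc[-1] - d.iloc[-2]).days).rename('NoDays_SLS_to_LS')

summary = pd.concat([counts, sls, gap], axis=1).reset_index()
print(summary)
```

This is only the summarisation step; the filtering steps (keeping IDs with enough history and dropping trailing non-F rows) are handled by the masks in the answers above.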

SQL Existing Column Conditional Update Query

I have this data
AnsID QuesID AnsOrder
-----------------------
1 5 NULL
2 5 NULL
3 5 NULL
4 5 NULL
5 5 NULL
6 3 NULL
7 3 NULL
8 3 NULL
9 3 NULL
10 3 NULL
11 4 NULL
12 4 NULL
13 4 NULL
14 4 NULL
15 4 NULL
16 7 NULL
17 9 NULL
18 9 NULL
19 9 NULL
20 9 NULL
21 8 NULL
22 8 NULL
23 8 NULL
24 8 NULL
Want to UPDATE it into this format
AnsID QuesID AnsOrder
-----------------------
1 5 1
2 5 2
3 5 3
4 5 4
5 5 5
6 3 1
7 3 2
8 3 3
9 3 4
10 3 5
11 4 1
12 4 2
13 4 3
14 4 4
15 4 5
16 7 1
17 9 1
18 9 2
19 9 3
20 9 4
21 8 1
22 8 2
23 8 3
24 8 4
Basically, I want to update the AnsOrder column with an ascending sequence within each QuesID group.
You might generate row numbers partitioned by QuesID and assign them to AnsOrder like this:
; with ord as (
select *,
row_number() over (partition by quesID
order by AnsID) rn
from table1
)
update ord
set ansorder = rn
I've ordered by AnsID for consistency.
Check this demo at SQL Fiddle.
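For engines that don't support updating a CTE directly (SQLite, for example), a correlated COUNT(*) subquery gives the same per-group sequence; this sketch uses an illustrative table name and sample rows, not the asker's schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE answers (AnsID INTEGER, QuesID INTEGER, AnsOrder INTEGER);
    INSERT INTO answers (AnsID, QuesID) VALUES
        (1, 5), (2, 5), (3, 5), (6, 3), (7, 3);
""")

# Each row's order = number of rows in the same QuesID group with an
# AnsID at or below its own, i.e. ROW_NUMBER() ordered by AnsID
conn.execute("""
    UPDATE answers
    SET AnsOrder = (SELECT COUNT(*) FROM answers a2
                    WHERE a2.QuesID = answers.QuesID
                      AND a2.AnsID <= answers.AnsID)
""")

rows = conn.execute(
    "SELECT AnsID, QuesID, AnsOrder FROM answers ORDER BY AnsID").fetchall()
print(rows)
```

The updatable-CTE form in the answer above is tidier on SQL Server; the correlated subquery is just the lowest-common-denominator version of the same idea.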