Change dataframe by group - pandas

I have a pandas DataFrame that looks something like this:
activity time date
0 Phone 04:00 20210810
1 Phone 08:30 20210810
2 Coffee 10:30 20210810
3 Lunch 04:00 20210810
4 Phone 10:30 20210810
5 Phone 04:00 20210810
6 Lunch 08:30 20210810
7 Lunch 10:30 20210810
0 Phone 08:45 20210811
1 Pooping 08:50 20210811
2 Coffee 10:30 20210811
3 Lunch 04:00 20210811
4 Phone 10:30 20210811
5 Meeting 04:00 20210811
6 Lunch 08:30 20210811
7 Lunch 10:30 20210811
and I need to change it to:
date activity time
20210810 Phone 04:00
08:30
10:30
04:00
Coffee 10:30
Lunch 04:00
08:30
10:30
20210811 Phone 08:45
10:30
Pooping 08:50
Coffee 10:30
Meeting 04:00
Lunch 04:00
08:30
10:30
Basically, sort by date and activity, then show a blank instead of repeating the same date or activity on consecutive rows.

Set as index and sort:
df.set_index(['date', 'activity']).sort_index()
Or, if the times need to be sorted within each group as well:
df.sort_values(['date', 'activity', 'time']).set_index(['date', 'activity'])
By default, Jupyter/IPython displays only the first value of each run of repeated index values, which gives the blank cells you describe. If you need another format, please update your question.
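As a minimal sketch of the approach above, the snippet below builds a small frame in the same shape as the question's data (the rows are illustrative, not the full sample) and moves date and activity into a sorted MultiIndex. Note that the sort is alphabetical by activity, which may differ from the order of first appearance shown in the question.

```python
import pandas as pd

# Small sample in the same shape as the question's frame (values are illustrative).
df = pd.DataFrame({
    "activity": ["Phone", "Coffee", "Phone", "Lunch", "Phone", "Lunch"],
    "time": ["08:30", "10:30", "04:00", "04:00", "08:45", "08:30"],
    "date": [20210810, 20210810, 20210810, 20210810, 20210811, 20210811],
})

# Sort by every key first, then move date and activity into the index.
out = df.sort_values(["date", "activity", "time"]).set_index(["date", "activity"])

# Repeated index values are shown only once per run when the frame is printed.
print(out)
```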


SQL BETWEEN Time doesn't show results?

I want to filter rows with BETWEEN on a time column, but the query returns no results. My table is shown below; any ideas would be appreciated.
SQL code:
SELECT *
FROM ci_time_slot
WHERE type like '%B%'
and sloat_name BETWEEN '8:00 AM' and '12:00 PM'
Table:
id sloat_name type
1 8:00 AM A,B,C,D
2 8:15 AM A
3 8:30 AM A,B
4 8:45 AM A,C
5 9:00 AM A,B,D
6 9:15 AM A
7 9:30 AM A,B,C
8 9:45 AM A
9 10:00 AM A,B,D
10 10:15 AM A,C
11 10:30 AM A,B
12 10:45 AM A
13 11:00 AM A,B,C,D
14 11:15 AM A
15 11:30 AM A,B
16 11:45 AM A,C
17 12:00 PM A,B,D
Expected result, filtered by time:
id sloat_name type
1 8:00 AM A,B,C,D
2 8:15 AM A
3 8:30 AM A,B
4 8:45 AM A,C
5 9:00 AM A,B,D
6 9:15 AM A
7 9:30 AM A,B,C
8 9:45 AM A
9 10:00 AM A,B,D
10 10:15 AM A,C
11 10:30 AM A,B
12 10:45 AM A
13 11:00 AM A,B,C,D
14 11:15 AM A
15 11:30 AM A,B
16 11:45 AM A,C
17 12:00 PM A,B,D
You haven't said which DBMS you are using, so I'm using MySQL for demonstration. BETWEEN on sloat_name compares the values as strings, so you have to cast sloat_name to a time first.
Try:
SELECT c.*
FROM ci_time_slot c
WHERE type like '%B%'
and cast(sloat_name as time) between '08:00:00' and '12:00:00';
Result:
id sloat_name type
1 8:00 AM A,B,C,D
3 8:30 AM A,B
5 9:00 AM A,B,D
7 9:30 AM A,B,C
9 10:00 AM A,B,D
11 10:30 AM A,B
13 11:00 AM A,B,C,D
15 11:30 AM A,B
17 12:00 PM A,B,D
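To illustrate why the cast matters, here is a small Python sketch (not the asker's database, and the slot list is a shortened sample). It compares the labels first as plain text and then as parsed times; the helper name to_time is hypothetical.

```python
from datetime import datetime

slots = ["8:00 AM", "8:30 AM", "10:00 AM", "12:00 PM"]

# Compared as text, '8:00 AM' sorts after '12:00 PM' because '8' > '1',
# so the lower bound is above the upper bound and nothing matches.
as_text = [s for s in slots if "8:00 AM" <= s <= "12:00 PM"]

def to_time(label):
    # %I is the 12-hour clock, %p the AM/PM marker.
    return datetime.strptime(label, "%I:%M %p").time()

low, high = to_time("8:00 AM"), to_time("12:00 PM")
as_time = [s for s in slots if low <= to_time(s) <= high]

print(as_text)
print(as_time)
```

The same reasoning presumably explains the empty result of the original query: the bounds were compared character by character rather than as times.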

What's the difference between changing datetime string to datetime by pd.to_datetime & datetime.strptime()

I have a df that looks similar to this (shortened version, with fewer rows):
Time (EDT) Open High Low Close
0 02.01.2006 19:00:00 0.85224 0.85498 0.85224 0.85498
1 02.01.2006 20:00:00 0.85498 0.85577 0.85423 0.85481
2 02.01.2006 21:00:00 0.85481 0.85646 0.85434 0.85646
3 02.01.2006 22:00:00 0.85646 0.85705 0.85623 0.85651
4 02.01.2006 23:00:00 0.85643 0.85691 0.85505 0.85653
5 03.01.2006 00:00:00 0.85653 0.8569 0.85601 0.85626
6 03.01.2006 01:00:00 0.85626 0.85653 0.85524 0.8557
7 03.01.2006 02:00:00 0.85558 0.85597 0.85486 0.85597
8 03.01.2006 03:00:00 0.85597 0.85616 0.85397 0.8548
9 03.01.2006 04:00:00 0.85469 0.85495 0.8529 0.85328
10 03.01.2006 05:00:00 0.85316 0.85429 0.85222 0.85401
11 03.01.2006 06:00:00 0.85401 0.8552 0.853 0.8552
12 03.01.2006 07:00:00 0.8552 0.8555 0.85319 0.85463
13 03.01.2006 08:00:00 0.85477 0.85834 0.8545 0.85788
14 03.01.2006 09:00:00 0.85788 0.85838 0.85341 0.85416
15 03.01.2006 10:00:00 0.8542 0.8542 0.85006 0.85111
16 03.01.2006 11:00:00 0.85115 0.85411 0.85 0.85345
17 03.01.2006 12:00:00 0.85337 0.85432 0.8526 0.85413
18 03.01.2006 13:00:00 0.85413 0.85521 0.85363 0.85363
19 03.01.2006 14:00:00 0.85325 0.8561 0.85305 0.85606
20 03.01.2006 15:00:00 0.8561 0.85675 0.85578 0.85599
I need to convert the date strings to datetime, set the date column as the index, and then resample. When I use method 1, the resampling is wrong and it creates extra future dates. For example, if my last date is in 2018-11, I will see rows for 2018-12.
method 1:
# This also takes a long time, because there are 90,000 rows
df['Time (EDT)'] = pd.to_datetime(df['Time (EDT)'])
df.set_index('Time (EDT)', inplace=True)
ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
df = df.resample('4H', base=17, closed='left', label='left').agg(ohlc_dict)
result:
Time (EDT) Open High Low Close
1/1/2006 21:00 0.86332 0.86332 0.86268 0.86321
1/2/2006 1:00 0.86321 0.86438 0.86111 0.86164
1/2/2006 5:00 0.86164 0.86222 0.8585 0.86134
1/2/2006 9:00 0.86149 0.86297 0.85695 0.85793
1/2/2006 13:00 0.85801 0.85947 0.85759 0.8591
1/2/2006 17:00 0.8591 0.86034 0.85757 0.85825
1/2/2006 21:00 0.85825 0.85969 0.84377 0.84412
1/3/2006 1:00 0.84445 0.8468 0.84286 0.84642
1/3/2006 5:00 0.84659 0.8488 0.84494 0.84872
1/3/2006 9:00 0.84829 0.84915 0.84271 0.84416
1/3/2006 13:00 0.84372 0.8453 0.84346 0.84423
1/3/2006 17:00 0.84426 0.84693 0.84426 0.84516
1/3/2006 21:00 0.84523 0.8458 0.84442 0.84579
When I use method 2, it resamples properly.
method 2:
from datetime import datetime

def to_datetime_obj(date_string):
    # The source strings are day-first, e.g. '02.01.2006 19:00:00'
    return datetime.strptime(date_string, '%d.%m.%Y %H:%M:%S')

date_list = df['Time (EDT)'].tolist()
datetime_objs = list(map(to_datetime_obj, date_list))  # this is also faster
df['Time (EDT)'] = datetime_objs
df.set_index('Time (EDT)', inplace=True)
ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
df = df.resample('4H', base=17, closed='left', label='left').agg(ohlc_dict)
result:
Time (EDT) Open High Low Close
1/2/2006 17:00 0.85224 0.85577 0.85224 0.85481
1/2/2006 21:00 0.85481 0.85705 0.85434 0.85626
1/3/2006 1:00 0.85626 0.85653 0.8529 0.85328
1/3/2006 5:00 0.85316 0.85834 0.85222 0.85788
1/3/2006 9:00 0.85788 0.85838 0.85 0.85413
1/3/2006 13:00 0.85413 0.85675 0.85305 0.85525
1/3/2006 17:00 0.85525 0.85842 0.85502 0.85783
1/3/2006 21:00 0.85783 0.85898 0.85736 0.85774
1/4/2006 1:00 0.85774 0.85825 0.8558 0.85595
1/4/2006 5:00 0.85595 0.85867 0.85577 0.85839
1/4/2006 9:00 0.85847 0.85981 0.85586 0.8578
1/4/2006 13:00 0.85773 0.85886 0.85597 0.85653
1/4/2006 17:00 0.85653 0.85892 0.85642 0.8584
1/4/2006 21:00 0.8584 0.85863 0.85658 0.85715
1/5/2006 1:00 0.85715 0.8588 0.85641 0.85791
1/5/2006 5:00 0.85803 0.86169 0.85673 0.86065
The df.index values of methods 1 and 2 look the same before resampling.
They are both pandas.core.indexes.datetimes.DatetimeIndex.
But when I compare them, they are actually different: method1_df.index != method2_df.index
Why is that, and how can I fix it? Thanks.
It's surprising that a vectorized method written in Cython (pd.to_datetime) is slower than a pure Python method (datetime.strptime).
You can pass the format to pd.to_datetime, which speeds it up a lot:
pd.to_datetime(df['Time (EDT)'], format='%d.%m.%Y %H:%M:%S')
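For your second problem, I think it has to do with the order of day and month in your string data. Without a format, pd.to_datetime reads an ambiguous string such as 02.01.2006 month-first (1 February), whereas your strptime format reads it day-first (2 January). Have you verified that the two methods actually give you the same datetimes?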
s1 = pd.to_datetime(df['Time (EDT)'])
s2 = pd.Series(map(to_datetime_obj, date_list))
(s1 == s2).all()
For me, datetime.strptime was 3 times faster than pd.to_datetime for 2 operations per row on a DataFrame with 880,000+ rows.
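The day/month ambiguity discussed above can be checked on a single value. This is a minimal sketch using one string from the question's sample; the variable names are only for illustration.

```python
from datetime import datetime
import pandas as pd

raw = "02.01.2006 19:00:00"  # 2 January 2006 in the source data
fmt = "%d.%m.%Y %H:%M:%S"

guessed = pd.to_datetime(raw)               # no format given: pandas has to guess the field order
explicit = pd.to_datetime(raw, format=fmt)  # day-first, as the data intends
via_strptime = datetime.strptime(raw, fmt)

print(guessed, explicit, via_strptime)
print(explicit == pd.Timestamp(via_strptime))
```

If the first printed value differs from the other two, the two indexes in the question differ for the same reason, and passing format= to pd.to_datetime should make method 1 agree with method 2.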

Generate rows with time intervals between 2 dates in Oracle

I have a table in which each doctor's start time and end time are given for Sunday to Saturday.
I want to create time slots of 15 minutes.
On that basis, when the patient clicks a date and time on the calendar, it shows the slots that have already been booked.
The following example shows how to split time into slices of 15 minutes. It uses hierarchical query. A little bit of explanation:
line 2: trunc function, applied to a date value, returns "beginning" of that day (at midnight). Adding 15 / (24*60) adds 15 minutes (as there are 24 hours in a day and 60 minutes in an hour). Multiplying 15 by level works as a "loop", i.e. adds 15-by-15-by-15 ... minutes to previous value.
line 4: similar to line 2, but it makes sure that a day (24 hours * 60 minutes) is divided to 15-minutes parts
line 6: start time is trivial
line 7: end time just adds 15 minutes to start_time
line 9: return only slots whose start hour is between 10 and 15, i.e. from 10:00 up to the slot ending at 16:00 (you don't have patients at 02:15 AM, right?)
SQL> with fifteen as
2 (select trunc(sysdate) + (level * 15)/(24*60) c_time
3 from dual
4 connect by level <= (24*60) / 15
5 )
6 select to_char(c_time, 'hh24:mi') start_time,
7 to_char(c_time + 15 / (24 * 60), 'hh24:mi') end_time
8 from fifteen
9 where extract(hour from cast (c_time as timestamp)) between 10 and 15;
START_TIME END_TIME
---------- ----------
10:00 10:15
10:15 10:30
10:30 10:45
10:45 11:00
11:00 11:15
11:15 11:30
11:30 11:45
11:45 12:00
12:00 12:15
12:15 12:30
12:30 12:45
12:45 13:00
13:00 13:15
13:15 13:30
13:30 13:45
13:45 14:00
14:00 14:15
14:15 14:30
14:30 14:45
14:45 15:00
15:00 15:15
15:15 15:30
15:30 15:45
15:45 16:00
24 rows selected.
SQL>
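For comparison, the same slicing can be sketched outside the database. The Python snippet below is an illustration only (the function name time_slots and the fixed date are assumptions, not part of the Oracle answer); it produces the same 24 slots between 10:00 and 16:00.

```python
from datetime import datetime, timedelta

def time_slots(start_hour, end_hour, minutes=15):
    """Return (start, end) 'HH:MM' pairs covering start_hour up to end_hour."""
    step = timedelta(minutes=minutes)
    current = datetime(2000, 1, 1, start_hour)  # the date itself is arbitrary
    stop = datetime(2000, 1, 1, end_hour)
    slots = []
    while current < stop:
        slots.append((current.strftime("%H:%M"), (current + step).strftime("%H:%M")))
        current += step
    return slots

slots = time_slots(10, 16)
print(len(slots), slots[0], slots[-1])
```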

SQL - Working Out Time 'Period'

I am compiling a report from 2 different data sources. The request is that I provide an hourly breakdown of sales, quotes and calls so I can provide the call to quote rate etc.
The telephone data comes in the format of 'hourly' like below:
Date Start Offered Answered
----------------------- ------------- ------------------------------
2016-05-09 00:00:00.000 08:00 0 0
2016-05-09 00:00:00.000 09:00 7 5
2016-05-09 00:00:00.000 10:00 7 7
2016-05-09 00:00:00.000 11:00 7 6
2016-05-09 00:00:00.000 12:00 10 10
2016-05-09 00:00:00.000 13:00 5 5
2016-05-09 00:00:00.000 14:00 2 2
2016-05-09 00:00:00.000 15:00 2 2
2016-05-09 00:00:00.000 16:00 7 7
2016-05-09 00:00:00.000 17:00 7 7
2016-05-09 00:00:00.000 18:00 0 0
2016-05-09 00:00:00.000 19:00 0 0
This suits me down to the ground, as I can show the data hourly, i.e. between 08:00 and 09:00 there were no calls offered and consequently none answered.
Now the difference here is the Quote/Sales system provides the time as such 0932, 1001 etc.
What would be the best way to put these Quotes/Sales into the appropriate 'hourly' pots.
I.e. 0932 should essentially be 09:00 as a time below and 1001 should be 10:00.
Is the best way to accomplish this just CASE statements based on the first 2 digits of the time, i.e.
CASE WHEN SUBSTRING([QuoteTime],1,2) = '09' THEN '0900' END
I figure I could do it this way, but it would be cumbersome and quite a performance hit on a large number of rows.
Thoughts?
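One way to avoid a CASE branch per hour is to derive the bucket directly from the first two digits. The Python sketch below illustrates the idea, assuming the quote time is always a four-character 'HHMM' string; the function name hour_bucket and the sample values are hypothetical.

```python
def hour_bucket(quote_time):
    """Map an 'HHMM' string such as '0932' to its hourly bucket '09:00'."""
    return quote_time[:2] + ":00"

quotes = ["0932", "1001", "1059", "0905"]

# Count how many quotes fall into each hourly bucket.
counts = {}
for q in quotes:
    bucket = hour_bucket(q)
    counts[bucket] = counts.get(bucket, 0) + 1

print(counts)
```

In SQL the same expression could presumably be written once with a string function over [QuoteTime] rather than as one CASE branch per hour, so the result can be joined to the Start column of the telephone data.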

SQL Server 2008 R2: split time intervals given by start time and end time at a selected point

I have a table with the following definition:
CREATE TABLE #TempStudentSchedulingRecord(
seq_id int identity(1,1),
Student_ID int ,
PeriodNumber int ,
CPStartTime datetime ,
CPEndTime datetime ,
DateItem datetime ,
FirstSegmentEndTime datetime,
SecondSegmentEndTime datetime
)
Sample rows:
Insert Into #TempStudentSchedulingRecord
values(2730,1,'1900-01-01 07:25:00.000','1900-01-01 08:20:00.000','2010-10-05 00:00:00.000','2015-05-27 09:45:00.000','2015-05-27 16:00:00.000')
Insert Into #TempStudentSchedulingRecord values(2730,1,'1900-01-01 08:25:00.000','1900-01-01 10:00:00.000','2010-10-05 00:00:00.000','2015-05-27 09:45:00.000','2015-05-27 16:00:00.000')
Insert Into #TempStudentSchedulingRecord values(2730,1,'1900-01-01 10:05:00.000','1900-01-01 11:35:00.000','2010-10-05 00:00:00.000','2015-05-27 09:45:00.000','2015-05-27 16:00:00.000')
Here FirstSegmentEndTime is the same in all rows. I want to find the total minutes scheduled before the first segment ends, and the classes included in the first segment.
Similarly, I want to find the total minutes scheduled between the first segment end time and the second segment end time, and the classes included in the first segment.
Period 2 spans both segments. How can I calculate the total minutes scheduled in the first segment and in the second segment?
Desired Output is
Student Period CPStartTime CPEndTime DateItem FSET SSET
2730 1 01/01/00 08:25 AM 01/01/00 10:00 AM 10/05/10 12:00 AM 05/27/15 09:45 AM 05/27/15 04:00 PM
2730 2 01/01/00 08:25 AM 05/27/15 09:45 AM 10/05/10 12:00 AM 05/27/15 09:45 AM 05/27/15 04:00 PM
2730 2 05/27/15 09:45 AM 01/01/00 10:00 AM 10/05/10 12:00 AM 05/27/15 09:45 AM 05/27/15 04:00 PM
2730 3 01/01/00 10:05 AM 01/01/00 11:35 AM 10/05/10 12:00 AM 05/27/15 09:45 AM 05/27/15 04:00 PM
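The splitting logic itself can be sketched independently of SQL. The Python snippet below is a rough illustration that compares times of day only (an assumption, since the class times are stored on 1900-01-01 and the segment end on 2015-05-27); the names split_at and minutes are hypothetical, and the periods are the three start/end pairs from the INSERT statements above.

```python
from datetime import time

def split_at(start, end, cut):
    """Split the interval (start, end) at cut if cut falls strictly inside it."""
    if start < cut < end:
        return [(start, cut), (cut, end)]
    return [(start, end)]

def minutes(start, end):
    """Length of an interval in minutes, using times of day only."""
    return (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)

first_segment_end = time(9, 45)
periods = [
    (time(7, 25), time(8, 20)),
    (time(8, 25), time(10, 0)),
    (time(10, 5), time(11, 35)),
]

# Cut every period at the segment boundary, then total each side.
pieces = [p for s, e in periods for p in split_at(s, e, first_segment_end)]
before = sum(minutes(s, e) for s, e in pieces if e <= first_segment_end)
after = sum(minutes(s, e) for s, e in pieces if s >= first_segment_end)

print(pieces)
print(before, after)
```

The period from 08:25 to 10:00 is the one that gets cut in two at 09:45, which corresponds to the pair of rows for the split period in the desired output.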