I have a dataframe with three columns: Text, Start time, and End time. All three of them are strings.
What I need to do is convert the elements of the Start and End columns into a number of seconds: 00:00:26 becomes 26, 00:01:27 becomes 87, and so on.
The output elements need to be ints.
I have already figured out how to convert the time-log string into a proper timestamp:
from datetime import datetime

datetime_str = '00:00:26'
datetime_object = datetime.strptime(datetime_str, '%H:%M:%S').time()
print(datetime_object)
print(type(datetime_object))
Output:
00:00:26
<class 'datetime.time'>
But how do I convert this 00:00:26 into the integer 26?
Since you're manipulating a DataFrame, you can simply use pd.to_timedelta and total_seconds from pandas:
df["Start_2"] = pd.to_timedelta(df["Start"]).dt.total_seconds().astype(int)
df["End_2"] = pd.to_timedelta(df["End"]).dt.total_seconds().astype(int)
Output:
print(df)
Start End Start_2 End_2
0 00:00:05 00:00:13 5 13
1 00:00:13 00:00:21 13 21
2 00:00:21 00:00:27 21 27
3 00:00:27 00:00:36 27 36
4 00:00:36 00:00:42 36 42
5 00:00:42 00:00:47 42 47
6 00:00:47 00:00:54 47 54
7 00:00:54 00:00:59 54 59
8 00:00:59 00:01:07 59 67
9 00:01:07 00:01:13 67 73
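If you prefer to stay with the datetime.time objects from the question, a minimal plain-Python sketch (no pandas) of the same conversion would be:

from datetime import datetime

def to_seconds(time_str):
    # Parse 'HH:MM:SS' and reduce it to a total number of seconds.
    t = datetime.strptime(time_str, '%H:%M:%S').time()
    return t.hour * 3600 + t.minute * 60 + t.second

print(to_seconds('00:00:26'))  # 26
print(to_seconds('00:01:27'))  # 87

This could be applied per element with df["Start"].map(to_seconds), but the vectorised pd.to_timedelta call above is the more idiomatic pandas route.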
I have the following SQL that I would like to use to plot a cumulative distribution, but I can't seem to get the data right.
Sample Data:
token_Length  Frequency
1             6436
2             7489
3             3724
4             2440
5             667
6             396
7             264
8             215
9             117
10            90
11            61
12            29
13            69
15            40
18            45
How do I prepare this data to create a CDF plot in Looker, so that it looks like this:
token_Length  Frequency  cume_dist
1             6436       0.291459107
2             7489       0.630604112
3             3724       0.799248256
4             2440       0.909745494
5             667        0.939951091
6             396        0.95788425
7             264        0.969839688
8             215        0.979576125
9             117        0.984874558
10            90         0.988950276
11            61         0.991712707
12            29         0.993025994
13            69         0.996150711
15            40         0.997962141
18            45         1
I have tried a measure as follows:
measure: cume_dist {
  type: number
  sql: cume_dist() over (order by ${token_length} ASC);;
}
This generates SQL as:
SELECT
token_length,
COUNT(*) AS "count",
cume_dist() over (order by (token_length) ASC) AS "cume_dist"
FROM string_facts
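For reference, the cume_dist values in the target table are a running sum of Frequency divided by the total frequency. Separate from the Looker measure itself, a minimal pandas sketch that reproduces those numbers from the sample data is:

import pandas as pd

df = pd.DataFrame({
    "token_Length": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 18],
    "Frequency": [6436, 7489, 3724, 2440, 667, 396, 264, 215, 117, 90, 61, 29, 69, 40, 45],
})

# Cumulative share of observations up to and including each token length.
df["cume_dist"] = df["Frequency"].cumsum() / df["Frequency"].sum()
print(df)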
I have data that looks like this:
$ Time : int 0 1 5 8 10 11 15 17 18 20 ...
$ NumOfFlights: int 1 6 144 91 504 15 1256 1 1 578 ...
The Time column is just 24-hour time, running from 0 all the way up to 2400.
What I hope to get is:
hour | number of flight
-------------------------------------
1st | 240
2nd | 223
... | ...
24th | 122
Where the 1st hour is from midnight to 1am, the 2nd is from 1am to 2am, and so on until the 24th, which is from 11pm to midnight. The number of flights is just the total of NumOfFlights within each range.
I've tried:
dbGetQuery(conn,"
SELECT
flights.CRSDepTime AS Time,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/60
")
But I realise it can't be done this way; the results I get have 40 distinct values for Time.
> head
Time NumOnTimeFlights
1 50 6055
2 105 2383
3 133 674
4 200 446
5 245 266
6 310 34
> tail
Time NumOnTimeFlights
35 2045 48136
36 2120 103229
37 2215 15737
38 2245 36416
39 2300 15322
40 2355 8018
If your CRSDepTime column is an integer-encoded time like HHMM, then the integer division CRSDepTime/100 will extract the hour.
SELECT
CRSDepTime/100 AS hh,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/100
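To see why the division works, here is a quick sketch of the same arithmetic in Python, using the HHMM-encoded departure times from the sample output above:

# Sample HHMM-encoded departure times taken from the head/tail output.
times = [50, 105, 133, 200, 245, 310, 2045, 2120, 2215, 2245, 2300, 2355]

# Integer division by 100 drops the minutes and keeps only the hour.
hours = [t // 100 for t in times]
print(hours)  # [0, 1, 1, 2, 2, 3, 20, 21, 22, 22, 23, 23]

Grouping on that value collapses the 40 distinct departure times into hourly buckets.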
I have a dataset that has a lot of entries for a single location. I am trying to find a way to sum up all of those entries without affecting any of the other columns. So, in case I'm not explaining it well enough, I want to take a dataset like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 10 12 14 17 27
Bedford 11 40 34 9 1
Bedford 7 1 2 3 3
Leeds 1 1 2 0 0
Leeds 20 13 6 1 1
Bath 101 20 33 41 3
Bath 11 2 3 1 0
And turn it into something like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 28 53 50 29 31
Leeds 21 14 8 1 1
Bath 112 22 36 42 3
Now, I have read that groupby should work here, but from my understanding a groupby would give me a two-column result, and I don't particularly want to build hundreds of two-column frames and then merge them all. Surely there's a simpler way to do this?
IIUC, groupby+sum will work for you:
df.groupby('Locations',as_index=False,sort=False).sum()
Output:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
0 Bedford 28 53 50 29 31
1 Leeds 21 14 8 1 1
2 Bath 112 22 36 42 3
A pivot table should also work for you:
import numpy as np

new_df = pd.pivot_table(df, values=['Cyclists', 'maleRunners', 'femaleRunners',
                                    'maleCyclists', 'femaleCyclists'],
                        index='Locations', aggfunc=np.sum)
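Both approaches give the same totals; the main difference is that pivot_table leaves Locations as the (alphabetically sorted) index, so you may want a reset_index() afterwards. A self-contained sketch with the sample data from the question:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Locations': ['Bedford', 'Bedford', 'Bedford', 'Leeds', 'Leeds', 'Bath', 'Bath'],
    'Cyclists': [10, 11, 7, 1, 20, 101, 11],
    'maleRunners': [12, 40, 1, 1, 13, 20, 2],
    'femaleRunners': [14, 34, 2, 2, 6, 33, 3],
    'maleCyclists': [17, 9, 3, 0, 1, 41, 1],
    'femaleCyclists': [27, 1, 3, 0, 1, 3, 0],
})

grouped = df.groupby('Locations', as_index=False, sort=False).sum()
pivoted = pd.pivot_table(df, index='Locations', aggfunc=np.sum).reset_index()

# Same totals either way, only the row and column ordering differs.
print(grouped)
print(pivoted)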
Sample dataframe (df) with the following columns:
id created_time faid
0 21 2019-06-17 07:06:45 FF1854155
1 54 2019-04-12 08:06:03 FF30232
2 88 2019-04-20 05:36:03 FF1855531251
3 154 2019-04-26 07:09:22 FF8145292
4 218 2019-07-25 13:20:51 FF0143154
5 219 2019-04-30 18:50:24 FF04211
6 235 2019-04-30 20:37:37 FF0671380
7 266 2019-05-02 08:38:56 FF08070
8 268 2019-05-02 11:08:21 FF591087
How can I produce a new dataframe like this:
hour count
07 2
08 2
. .
. .
Try calculating the hour from created_time, then group by hour and count:
df['hour'] = pd.to_datetime(df['created_time']).dt.hour
res = df.groupby(['hour'],as_index=False)['faid'].count().rename(columns={"faid":"count"})
hour count
07 2
08 2
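Note that dt.hour yields integers (7, 8, ...); if you want the zero-padded hour strings shown in the sample output, a small variation that formats the hour as a string first is:

df['hour'] = pd.to_datetime(df['created_time']).dt.strftime('%H')
res = (df.groupby('hour', as_index=False)['faid']
         .count()
         .rename(columns={'faid': 'count'}))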
I have the below dataframe in a messy shape, and I need to combine rows 0 and 1 to make the column names, keeping the remaining data rows as they are:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
Change it to the dataframe below:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
df.columns = df.loc[0] + '_' + df.loc[1]
df = df.loc[[2]]
df
Out[429]:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2 23 45 67 78 89 9000
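If the real frame has more than one data row, a variation of the same idea (assuming the same layout, with the header pieces in rows 0 and 1) keeps everything from row 2 onward and renumbers the index:

df.columns = df.loc[0] + '_' + df.loc[1]   # combine the first two rows into column names
df = df.loc[2:].reset_index(drop=True)     # keep all remaining data rows, reindex from 0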