Pandas: add date field to parsed timestamp - pandas

I have several date-specific text files (for example, 20150211.txt) that look like this:
TopOfBook 0x21 60 07:15:00.862 101 85 5 109 500 24 +
TopOfBook 0x21 60 07:15:00.882 101 91 400 109 500 18 +
TopOfBook 0x21 60 07:15:00.890 101 91 400 105 80 14 +
TopOfBook 0x21 60 07:15:00.914 101 93.3 400 105 80 11.7 +
where the 4th column contains the timestamp.
If I read this into pandas with automatic parsing
df_top = pd.read_csv('TOP_20150210.txt', sep='\t', names=hdr_top, parse_dates=[3])
I get:
0 TopOfBook 0x21 60 2015-05-17 07:15:00.862000 101 85.0 5 109.0 500 24.0 +
1 TopOfBook 0x21 60 2015-05-17 07:15:00.882000 101 91.0 400 109.0 500 18.0 +
2 TopOfBook 0x21 60 2015-05-17 07:15:00.890000 101 91.0 400 105.0 80 14.0 +
The time part is correct, of course, but how do I set the correct date part of the timestamp (2015-02-11)? Thank you.

After parsing the dates, column 3 (the fourth column) has dtype <M8[ns]. This is the NumPy datetime64 dtype with nanosecond resolution. You can do fast date arithmetic by adding or subtracting NumPy timedelta64 values.
So, for example, subtracting 6 days from df[3] yields
In [139]: df[3] - np.array([6], dtype='<m8[D]')
Out[139]:
0 2015-05-11 07:15:00.862000
1 2015-05-11 07:15:00.882000
2 2015-05-11 07:15:00.890000
3 2015-05-11 07:15:00.914000
Name: 3, dtype: datetime64[ns]
To find the correct number of days to subtract, you could parse the date from the filename and compare it with the date pandas filled in:
today = df.iloc[0,3]
date = pd.Timestamp(re.search(r'\d+', filename).group())
n = (today-date).days
The complete script:
import numpy as np
import pandas as pd
import re
filename = '20150211.txt'
df = pd.read_csv(filename, sep='\t', header=None, parse_dates=[3])
today = df.iloc[0,3]
date = pd.Timestamp(re.search(r'\d+', filename).group())
n = (today-date).days
df[3] -= np.array([n], dtype='<m8[D]')
print(df)
yields
0 1 2 3 4 5 6 7 8 \
0 TopOfBook 0x21 60 2015-02-11 07:15:00.862000 101 85.0 5 109 500
1 TopOfBook 0x21 60 2015-02-11 07:15:00.882000 101 91.0 400 109 500
2 TopOfBook 0x21 60 2015-02-11 07:15:00.890000 101 91.0 400 105 80
3 TopOfBook 0x21 60 2015-02-11 07:15:00.914000 101 93.3 400 105 80
9
0 24.0
1 18.0
2 14.0
3 11.7
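An alternative sketch that avoids computing a day difference: leave the time column unparsed, convert it to a timedelta, and add it to the date taken from the filename. The inline sample data and the trimmed column count are assumptions made so that the example is self-contained.

```python
import io
import re

import pandas as pd

filename = "20150211.txt"  # the date is encoded in the file name

# A trimmed, tab-separated stand-in for the real file (assumed layout).
raw = (
    "TopOfBook\t0x21\t60\t07:15:00.862\t101\n"
    "TopOfBook\t0x21\t60\t07:15:00.882\t101\n"
)

# Without parse_dates, column 3 stays as plain "HH:MM:SS.fff" strings.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# Extract the eight digits from the filename and parse them as a date.
file_date = pd.Timestamp(re.search(r"\d{8}", filename).group())

# Time of day as a timedelta, added to midnight of the file's date.
df[3] = file_date + pd.to_timedelta(df[3])
print(df)
```

Because the column is never parsed with today's date, there is nothing to correct afterwards.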

You could use apply to construct each datetime from your desired date values, copying the time portion of the parsed value into the constructor:
In [9]:
import datetime as dt
df[3] = df[3].apply(lambda x: dt.datetime(2015,2,11,x.hour,x.minute,x.second,x.microsecond))
df
Out[9]:
0 1 2 3 4 5 6 7 8 \
0 TopOfBook 0x21 60 2015-02-11 07:15:00.862000 101 85.0 5 109 500
1 TopOfBook 0x21 60 2015-02-11 07:15:00.882000 101 91.0 400 109 500
2 TopOfBook 0x21 60 2015-02-11 07:15:00.890000 101 91.0 400 105 80
3 TopOfBook 0x21 60 2015-02-11 07:15:00.914000 101 93.3 400 105 80
9 10
0 24.0 +
1 18.0 +
2 14.0 +
3 11.7 +
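The same idea can be written without apply, which may matter for large files. This is a minimal sketch, assuming the column has already been parsed to datetime64 with the wrong date; the Series `s` and the name `target` are made up for illustration.

```python
import pandas as pd

# Timestamps as read_csv would produce them: correct time, wrong date.
s = pd.Series(pd.to_datetime([
    "2015-05-17 07:15:00.862",
    "2015-05-17 07:15:00.882",
]))

target = pd.Timestamp("2015-02-11")  # the date the rows really belong to

# s - s.dt.normalize() is the time of day as a timedelta;
# adding it to the target date swaps only the date part.
shifted = target + (s - s.dt.normalize())
print(shifted)
```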

Related

how to transpose m x n into k x 2 form dataframe in pandas

I have an m x n dataframe such as the one below.
date1 amt1 date2 amt2
2021-01-02 120 1991-01-02 90
2021-01-03 100 1991-01-03 95
2021-01-04 110 1991-01-04 95
....
Is there any way to transpose it into a k x 2 dataframe like this?
date amt
2021-01-02 120
2021-01-03 100
2021-01-04 110
...
1991-01-02 90
1991-01-03 95
1991-01-04 95
...
This can be done easily with reshape, although the rows come out in a different order:
pd.DataFrame(df.to_numpy().reshape(-1, 2), columns=['date', 'amt'])
Output:
date amt
0 2021-01-02 120
1 1991-01-02 90
2 2021-01-03 100
3 1991-01-03 95
4 2021-01-04 110
5 1991-01-04 95
Alternatively, call reset_index and then use pd.wide_to_long:
df.reset_index(inplace=True)
pd.wide_to_long(df, stubnames=['date', 'amt'], i=['index'], j='id').reset_index(drop=True)
date amt
0 2021-01-02 120
1 2021-01-03 100
2 2021-01-04 110
3 1991-01-02 90
4 1991-01-03 95
5 1991-01-04 95
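A third option, sketched here under the assumption that the columns follow the date1/amt1, date2/amt2 naming from the question, is to rename each pair of columns and stack the pieces with pd.concat. It keeps the block order requested in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "date1": ["2021-01-02", "2021-01-03", "2021-01-04"],
    "amt1": [120, 100, 110],
    "date2": ["1991-01-02", "1991-01-03", "1991-01-04"],
    "amt2": [90, 95, 95],
})

# Take each (dateN, amtN) pair, give it the common column names,
# and stack the pairs on top of each other.
pairs = [
    df[[f"date{i}", f"amt{i}"]].set_axis(["date", "amt"], axis=1)
    for i in (1, 2)
]
long_df = pd.concat(pairs, ignore_index=True)
print(long_df)
```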

Convert decimal Day-of-year dataframe to datetime with HH:MM

Is it possible to convert an entire column of decimal Day-Of-Year into datetime format YYYY-mm-dd HH:MM ? I tried counting the amount of seconds and minutes in a day, but decimal DOY is different from decimal hours.
Example:
DOY = 181.82015046296297
Converted to:
Timestamp('2021-06-05 14:00:00')
Here the date would be a datetime object appearing only as 2021-06-05 14:00:00 in my dataframe. And the year I am interested in is 2021.
Use Timedelta to create an offset from the first day of the year.
Input data:
>>> df
DayOfYear
0 254
1 156
2 303
3 32
4 100
5 8
6 329
7 82
8 218
9 293
df['Date'] = pd.to_datetime('2021') \
+ df['DayOfYear'].sub(1).apply(pd.Timedelta, unit='D')
Output result:
>>> df
DayOfYear Date
0 254 2021-09-11
1 156 2021-06-05
2 303 2021-10-30
3 32 2021-02-01
4 100 2021-04-10
5 8 2021-01-08
6 329 2021-11-25
7 82 2021-03-23
8 218 2021-08-06
9 293 2021-10-20
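The example above uses whole days. For decimal day-of-year values like the one in the question, the same offset idea should carry over with pd.to_timedelta, which accepts fractional days. This is a sketch under the assumption that DOY 1.0 means January 1 at 00:00; the column name DOY and the sample values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"DOY": [1.5, 32.25, 181.82015046296297]})

year_start = pd.Timestamp("2021-01-01")

# Subtract 1 so that DOY 1.0 maps to January 1 at midnight, convert the
# fractional days to a timedelta, then drop seconds to keep HH:MM only.
df["Date"] = (year_start + pd.to_timedelta(df["DOY"] - 1, unit="D")).dt.floor("min")
print(df)
```

If the data uses a different convention (for example, DOY 0.0 as the start of the year), drop the subtraction.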

How to select data for specific time intervals after using Pandas’ resample function?

I used Pandas’ resample function to calculate the sales of a list of products every 6 months.
I resampled with '6M' and then used apply({"column-name": "sum"}).
Now I’d like to create a table with the sum of the sales for the first six months.
How can I extract the sum of the first 6 months, given that all products have records for more than 3 years, and none of them have the same start date?
Thanks in advance for any suggestions.
Here is an example of the data:
Product Date sales
Product 1 6/30/2017 20
12/31/2017 60
6/30/2018 50
12/31/2018 100
Product 2 1/31/2017 30
7/31/2017 150
1/31/2018 200
7/31/2018 300
1/31/2019 100
While waiting for your data, I worked on this. See if this is something that will be helpful for you.
import pandas as pd
df = pd.DataFrame({'Date':['2018-01-10','2018-02-15','2018-03-18',
'2018-07-10','2018-09-12','2018-10-14',
'2018-11-16','2018-12-20','2019-01-10',
'2019-04-15','2019-06-12','2019-10-18',
'2019-12-02','2020-01-05','2020-02-25',
'2020-03-15','2020-04-11','2020-07-22'],
'Sales':[200,300,100,250,150,350,150,200,250,
200,300,100,250,150,350,150,200,250]})
#first breakdown the data by Yearly Quarters
df['YQtr'] = pd.PeriodIndex(pd.to_datetime(df.Date), freq='Q')
#next create a column to identify Half Yearly - H1 for Jan-Jun & H2 for Jul-Dec
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q1','Q2']),'HYear'] = df['YQtr'].astype(str).str[:-2]+'H1'
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q3','Q4']),'HYear'] = df['YQtr'].astype(str).str[:-2]+'H2'
#Do a cumulative sum on Half Year to get sales by H1 & H2 for each year
df['HYear_cumsum'] = df.groupby('HYear')['Sales'].cumsum()
#Now filter out only the rows with the max value. That's the H1 & H2 sales figure
df1 = df[df.groupby('HYear')['HYear_cumsum'].transform('max')== df['HYear_cumsum']]
print (df)
print (df1)
The output of this will be:
Source Data + Half Year cumulative sum:
Date Sales YQtr HYear HYear_cumsum
0 2018-01-10 200 2018Q1 2018H1 200
1 2018-02-15 300 2018Q1 2018H1 500
2 2018-03-18 100 2018Q1 2018H1 600
3 2018-07-10 250 2018Q3 2018H2 250
4 2018-09-12 150 2018Q3 2018H2 400
5 2018-10-14 350 2018Q4 2018H2 750
6 2018-11-16 150 2018Q4 2018H2 900
7 2018-12-20 200 2018Q4 2018H2 1100
8 2019-01-10 250 2019Q1 2019H1 250
9 2019-04-15 200 2019Q2 2019H1 450
10 2019-06-12 300 2019Q2 2019H1 750
11 2019-10-18 100 2019Q4 2019H2 100
12 2019-12-02 250 2019Q4 2019H2 350
13 2020-01-05 150 2020Q1 2020H1 150
14 2020-02-25 350 2020Q1 2020H1 500
15 2020-03-15 150 2020Q1 2020H1 650
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
The half year cumulative sum for each half year.
Date Sales YQtr HYear HYear_cumsum
2 2018-03-18 100 2018Q1 2018H1 600
7 2018-12-20 200 2018Q4 2018H2 1100
10 2019-06-12 300 2019Q2 2019H1 750
12 2019-12-02 250 2019Q4 2019H2 350
16 2020-04-11 200 2020Q2 2020H1 850
17 2020-07-22 250 2020Q3 2020H2 250
I will look at your sample data and work on it later tonight.
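For the per-product part of the question (each product has its own start date), one possible sketch is to measure the six-month window from each product's first record instead of from calendar half-years. The data below is invented for illustration and assumes raw, un-resampled records with the column names Product, Date and sales.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Product 1"] * 3 + ["Product 2"] * 3,
    "Date": pd.to_datetime([
        "2017-01-15", "2017-03-01", "2017-08-01",
        "2018-05-10", "2018-10-31", "2018-11-10",
    ]),
    "sales": [10, 20, 30, 5, 7, 9],
})

# First recorded date of each product, aligned with the original rows.
first_date = df.groupby("Product")["Date"].transform("min")

# Keep rows that fall strictly inside the first six months per product.
cutoff = first_date + pd.DateOffset(months=6)
first_half = df[df["Date"] < cutoff].groupby("Product")["sales"].sum()
print(first_half)
```

Whether the cutoff should be inclusive or exclusive depends on how the six-month period is defined; the strict comparison here is one choice.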

Groupby sum in years in pandas

I have a data frame, shown below, that contains sales data for two health care products from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above, I would like to prepare the dataframe below and draw it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit
Use groupby with dt.year, and use .agg to name the aggregated column.
df1 = df.groupby(df['sale_date'].dt.year).agg(total_profit=('profit','sum'))\
.reset_index().rename(columns={'sale_date': 'bought_year'})
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot(kind='bar')

Pandas - Slice between two indexes

I need to process data with tensorflow for classification. Therefore I need to create a DataFrame for each unit that was processed in my machine. The machine continuously writes process data and also records when a unit enters and leaves the machine.
A value in 'uid_in' means the unit with the logged number entered the machine; 'uid_out' means the unit left the machine.
I need to create a DataFrame like this for each unit processed by the machine.
[...]
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN #Unit1 enters the machine
6 08:06:00 201 200 99 101 2.0 NaN
[...]
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0 #Unit1 leaves the machine
[...]
How can I create the DataFrame df.loc[enter:leave] for each unit automatically?
When I try to pass a filtered DataFrame to df.loc, it does not work:
start = df[df.uid_in.isin([123])]
end = df[df.uid_out.isin([123])]
unit1_df = df.loc[start:end]
Your code almost worked out!
Original DataFrame:
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
0 08:00:00 201 199 100 100 NaN NaN
1 08:01:00 199 199 100 99 NaN NaN
[...]
5 08:05:00 201 200 101 100 1.0 NaN
[...]
55 08:55:00 241 241 140 140 NaN 41.0
[...]
58 08:58:00 244 244 143 143 NaN NaN
59 08:59:00 245 245 144 144 NaN NaN
New code:
start = df[df.uid_in.eq(1.0)].index[0]
end = df[df.uid_out.eq(1.0)].index[0]
unit1_df = df.loc[start:end]
print(unit1_df)
Output
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN
6 08:06:00 201 200 99 101 2.0 NaN
[...]
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0
I think you were pretty close. I modified your statements and picked out the start and end indices of start and end, as Ian indicated.
""" time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN
6 08:06:00 201 200 99 101 2.0 NaN
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0
"""
import pandas as pd
df = pd.read_clipboard()
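# Filter the index with the boolean mask; taking .index[0] of the mask itself would always return the first row label
start = df.index[df.uid_in.eq(1.0)][0]
end = df.index[df.uid_out.eq(1.0)][0]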
unit1_df = df.loc[start:end]
unit1_df
Output:
time uhz1 uhz2 lhz1 lh2 uid_in uid_out
5 08:05:00 201 200 101 100 1.0 NaN
6 08:06:00 201 200 99 101 2.0 NaN
14 08:14:00 199 199 99 101 10.0 NaN
15 08:15:00 201 201 100 100 11.0 1.0
One-liner:
unit1_df = df.loc[df.index[df.uid_in.eq(1.0)][0]:df.index[df.uid_out.eq(1.0)][0]]
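To build the slices for every unit automatically, as the question asks, the same lookup can be placed in a loop. This is a minimal sketch with a small invented frame; the dictionary name `unit_frames` is hypothetical, and units that have not left the machine yet are simply skipped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "time": ["08:05:00", "08:06:00", "08:07:00", "08:08:00"],
        "uid_in": [1.0, 2.0, np.nan, np.nan],
        "uid_out": [np.nan, np.nan, 1.0, 2.0],
    },
    index=[5, 6, 7, 8],
)

unit_frames = {}
for uid in df["uid_in"].dropna().unique():
    # Row label where this unit entered the machine.
    start = df.index[df["uid_in"].eq(uid)][0]
    # Row labels where it left; empty if it is still inside.
    left = df.index[df["uid_out"].eq(uid)]
    if len(left) == 0:
        continue
    # .loc slicing by label includes both endpoints.
    unit_frames[int(uid)] = df.loc[start:left[0]]

print(unit_frames[1])
```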