Resampling a DataFrame to hourly, 15 min and 5 min periods in Julia

I'm quite new to Julia, but I'm giving it a try since the benchmarks claim it to be much faster than Python.
I'm trying to work with some stock tick data in the format ["unixtime", "price", "amount"].
I managed to load the data and convert the unixtime to a date in Julia, but now I need to resample the data, computing OHLC (open, high, low, close) for the price and the sum for the amount, for a given period (hourly, 15 min, 5 min, etc.):
julia> head(btc_raw_data)
6x3 DataFrame:
date price amount
[1,] 2011-09-13T13:53:36 UTC 5.8 1.0
[2,] 2011-09-13T13:53:44 UTC 5.83 3.0
[3,] 2011-09-13T13:53:49 UTC 5.9 1.0
[4,] 2011-09-13T13:53:54 UTC 6.0 20.0
[5,] 2011-09-13T14:32:53 UTC 5.95 12.4521
[6,] 2011-09-13T14:35:04 UTC 5.88 7.458
I see there is a package called Resampling, but it doesn't seem to accept a time period, only the number of rows I want the output data to have.
Are there any other alternatives?

You can convert a DataFrame (from DataFrames.jl) to a TimeArray (from TimeSeries.jl) using https://github.com/femtotrader/TimeSeriesIO.jl:
using TimeSeriesIO: TimeArray
ta = TimeArray(df, colnames=[:price], timestamp=:date)
You can resample a time series (a TimeArray from TimeSeries.jl) using TimeSeriesResampler https://github.com/femtotrader/TimeSeriesResampler.jl and TimeFrames https://github.com/femtotrader/TimeFrames.jl:
using TimeSeriesResampler: resample, mean, ohlc, sum, TimeFrame
# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
println("ta=")
println(ta)
# Define how datetime should be grouped (timeframe)
tf = TimeFrame(dt -> floor(dt, Dates.Minute(15)))
# resample using OHLC values
ta_ohlc = ohlc(resample(ta, tf))
println("ta_ohlc=")
println(ta_ohlc)
# resample using mean values
ta_mean = mean(resample(ta, tf))
println("ta_mean=")
println(ta_mean)
# Define another sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
println("ta_vol=")
println(ta_vol)
# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
println("ta_vol_sum=")
println(ta_vol_sum)
You should get:
julia> ta
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
y
2010-01-01T00:00:00 | 1000.16
2010-01-01T00:01:00 | 1000.1
2010-01-01T00:02:00 | 1000.98
2010-01-01T00:03:00 | 1001.38
⋮
2010-12-31T23:56:00 | 972.3
2010-12-31T23:57:00 | 972.85
2010-12-31T23:58:00 | 973.74
2010-12-31T23:59:00 | 972.8
julia> ta_ohlc
35040x4 TimeSeries.TimeArray{Float64,2,DateTime,Array{Float64,2}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
Open High Low Close
2010-01-01T00:00:00 | 1000.16 1002.5 1000.1 1001.54
2010-01-01T00:15:00 | 1001.57 1002.64 999.38 999.38
2010-01-01T00:30:00 | 999.13 1000.91 998.91 1000.91
2010-01-01T00:45:00 | 1001.0 1006.42 1001.0 1006.42
⋮
2010-12-31T23:00:00 | 980.84 981.56 976.53 976.53
2010-12-31T23:15:00 | 975.74 977.46 974.71 975.31
2010-12-31T23:30:00 | 974.72 974.9 971.73 972.07
2010-12-31T23:45:00 | 972.33 973.74 971.49 972.8
julia> ta_mean
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
y
2010-01-01T00:00:00 | 1001.1047
2010-01-01T00:15:00 | 1001.686
2010-01-01T00:30:00 | 999.628
2010-01-01T00:45:00 | 1003.5267
⋮
2010-12-31T23:00:00 | 979.1773
2010-12-31T23:15:00 | 975.746
2010-12-31T23:30:00 | 973.482
2010-12-31T23:45:00 | 972.3427
julia> ta_vol
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
vol
2010-01-01T00:00:00 | 0.37
2010-01-01T00:01:00 | 0.67
2010-01-01T00:02:00 | 0.29
2010-01-01T00:03:00 | 0.28
⋮
2010-12-31T23:56:00 | 0.74
2010-12-31T23:57:00 | 0.66
2010-12-31T23:58:00 | 0.22
2010-12-31T23:59:00 | 0.47
julia> ta_vol_sum
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
vol
2010-01-01T00:00:00 | 7.13
2010-01-01T00:15:00 | 6.99
2010-01-01T00:30:00 | 8.73
2010-01-01T00:45:00 | 8.27
⋮
2010-12-31T23:00:00 | 6.11
2010-12-31T23:15:00 | 7.49
2010-12-31T23:30:00 | 5.75
2010-12-31T23:45:00 | 8.36

Related

Finding the maximum value in a moving period in a data frame [duplicate]

I have a table df with columns "timestamp" and "Y". I want to add another column "MaxY" which contains the largest Y value at most 24 hours in the future. That is:
df.MaxY.iloc[i] = df[(df.timestamp > df.timestamp.iloc[i]) &
                     (df.timestamp < df.timestamp.iloc[i] + timedelta(hours=24))].Y.max()
Obviously, computing it like that is very slow. Is there a better way?
In a similar case of computing "SumY" I can do it using a trick with cumsum(). However, similar tricks don't seem to work here.
As requested, an example table (MaxY is the output; the input is the first two columns only):
---------------------------------
| timestamp        | Y | MaxY |
---------------------------------
| 2016-03-29 12:00 | 1 | 3    |  rows 2 and 3 fall within 24 hours, so MaxY = max(2, 3)
| 2016-03-29 13:00 | 2 | 4    |  rows 3 and 4 fall in the time interval, so MaxY = max(3, 4)
| 2016-03-30 11:00 | 3 | 4    |  rows 4, 5, 6 all fall in the interval, so MaxY = max(4, 3, 2)
| 2016-03-30 12:30 | 4 | 3    |  max(3, 2)
| 2016-03-30 13:30 | 3 | 2    |  row 6 is the only row in the interval
| 2016-03-30 14:00 | 2 | nan? |  there are no rows in the time interval; any value will do
---------------------------------
Here's a way with resample/rolling. I get a weird warning using pandas 0.18.0 and Python 3.5; I don't think it's a concern, but I'm not sure why it is generated.
This assumes the index is 'timestamp'; if not, precede the following with df = df.set_index('timestamp'):
>>> df2 = df.resample('30min').max().sort_index(ascending=False)
>>> df2 = df2.rolling(48, min_periods=1).max()
>>> df.join(df2, rsuffix='2')
Y Y2
timestamp
2016-03-29 12:00:00 1 3.0
2016-03-29 13:00:00 2 4.0
2016-03-30 11:00:00 3 4.0
2016-03-30 12:30:00 4 4.0
2016-03-30 13:30:00 3 3.0
2016-03-30 14:00:00 2 2.0
On this tiny dataframe it seems to be about twice as fast, but you'd have to test it on a larger dataframe to get a reasonable idea of the relative speed.
Hopefully this is somewhat self-explanatory. The descending sort is necessary because rolling only allows a backward or centered window, as far as I can tell.
Consider an apply() solution, which may run faster. The function returns the max of a time-conditional series for each row.
import pandas as pd
from datetime import timedelta

def daymax(row):
    ser = df.Y[(df.timestamp > row) &
               (df.timestamp <= row + timedelta(hours=24))]
    return ser.max()

df['MaxY'] = df.timestamp.apply(daymax)
print(df)
# timestamp Y MaxY
#0 2016-03-29 12:00:00 1 3.0
#1 2016-03-29 13:00:00 2 4.0
#2 2016-03-30 11:00:00 3 4.0
#3 2016-03-30 12:30:00 4 3.0
#4 2016-03-30 13:30:00 3 2.0
#5 2016-03-30 14:00:00 2 NaN
What's wrong with
df['MaxY'] = df[::-1].Y.shift(-1).rolling('24H').max()
df[::-1] reverses the df (you want it "backwards") and shift(-1) takes care of the "in the future".

String column conversion to float in a Pandas DataFrame

I want to get the left (LD) value from the pipe-separated values in the DataFrame column 'CA Distance Nominal (LD | au)'; here is the code.
When I convert the string to float I get all the values as NaN.
cneos = pd.read_csv('cneos.csv')
print(cneos['CA Distance Nominal (LD | au)'].head())
cneos['Distance']=pd.to_numeric(cneos['CA Distance Nominal (LD | au)'], errors='coerce')
print(cneos['Distance'].head())
Result
0 2.02 | 0.00520
1 0.39 | 0.00100
2 8.98 | 0.02307
3 3.88 | 0.00996
4 4.84 | 0.01244
Name: CA Distance Nominal (LD | au), dtype: object
After to_numeric()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: Distance, dtype: float64
How can I get both values, LD and au, separated as floats?
I'm not sure this is the best way to solve your problem, but it works:
separated_data_frame = pd.DataFrame(cneos['CA Distance Nominal (LD | au)'].apply(lambda x: x.split('|')).to_list())
separated_data_frame.columns = ['LD', 'AU']
separated_data_frame.LD = separated_data_frame.LD.astype(float)
separated_data_frame.AU = separated_data_frame.AU.astype(float)
cneos = cneos.join(separated_data_frame).drop('CA Distance Nominal (LD | au)', axis=1)
The result is:
LD AU
0 2.02 0.00520
1 0.39 0.00100
2 8.98 0.02307
3 3.88 0.00996
4 4.84 0.01244
Is it what you wanted?
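If you only need the two numeric columns, a slightly shorter sketch (assuming the same CSV and column name as above) splits straight into two columns with expand=True and converts both at once:
import pandas as pd

cneos = pd.read_csv('cneos.csv')
# Split "2.02 | 0.00520" into two string columns, then cast both to float;
# float() tolerates the whitespace left around the split values.
split_cols = (cneos['CA Distance Nominal (LD | au)']
              .str.split('|', expand=True)
              .astype(float))
split_cols.columns = ['LD', 'AU']
cneos = cneos.join(split_cols)
print(cneos[['LD', 'AU']].head())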

Dealing with multiple values in Pandas Dataframe Cell

The columns describe the data and the rows hold the values. However, some columns contain multiple values (a table on the website). The rows of those tables get merged into one cell, separated by hashtags. Since they are only part of that table, they refer to other columns whose cells are also hashtag-separated.
solution_id | type labour | labour_unit   | est_labour_quantity | est_labour_costs | est_labour_total_costs
10          | WorkA#WorkB | Person#Person | 2.0#2.0             | 300.0#300.0.     | 600.0#600.0
11          | WorkC#WorkD | Person#Person | 3.0#2.0             | 300.0#300.0.     | 900.0#600.0
My questions are twofold:
What would be a good way to transform the data to work on it more efficiently, e.g. create as many new columns as there are entries in one cell? So, for example, separate it like this:
solution_id | type labour_1 | labour_unit_1 | est_labour_quantity_1 | est_labour_costs_1 | est_labour_total_costs_1 | type labour_2 | labour_unit_2 | est_labour_quantity_2 | est_labour_costs_2 | est_labour_total_costs_2
10 | WorkA | Person | 2.0 | 300.0 | 600.0 | WorkB | Person | 2.0 | 300.0 | 600.0
11 | WorkC | Person | 3.0 | 300.0 | 900.0 | WorkD | Person | 2.0 | 300.0 | 600.0
This makes it more readable, but it doubles the number of columns, and I have some cells with up to 5 entries, so it would be 5x more columns. What I also don't like about this idea is that the new column names are not really meaningful and will be hard to interpret.
How can I make this separation in pandas, so that I have WorkA and then the associated values, then WorkB, etc.?
If there is another, better way to work with this tabular form, maybe bringing it all into one cell, please let me know!
Use:
#unpivot by melt
df = df.melt('solution_id')
#create lists by splitting on #
df['value'] = df['value'].str.split('#')
#repeat rows by value column
df = df.explode('value')
#counter for new column names
df['g'] = df.groupby(['solution_id', 'variable']).cumcount().add(1)
#pivoting and sorting MultiIndex
df = (df.pivot(index='solution_id', columns=['variable', 'g'], values='value')
        .sort_index(level=1, axis=1, sort_remaining=False))
#flatten MultiIndex
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df)
type_labour_1 labour_unit_1 est_labour_quantity_1 \
solution_id
10 WorkA Person 2.0
11 WorkC Person 3.0
est_labour_costs_1 est_labour_total_costs_1 type_labour_2 \
solution_id
10 300.0 600.0 WorkB
11 300.0 900.0 WorkD
labour_unit_2 est_labour_quantity_2 est_labour_costs_2 \
solution_id
10 Person 2.0 300.0.
11 Person 2.0 300.0.
est_labour_total_costs_2
solution_id
10 600.0
11 600.0
You can split your strings, explode and reshape:
df2 = (df
       .set_index('solution_id')
       .apply(lambda c: c.str.split('#'))
       .explode(list(df.columns[1:]))
       .assign(idx=lambda d: d.groupby(level=0).cumcount().add(1))
       .set_index('idx', append=True)
       .unstack('idx')
       .sort_index(axis=1, level='idx', sort_remaining=False)
)
df2.columns = [f'{a}_{b}' for a, b in df2.columns]
output:
type labour_1 labour_unit_1 est_labour_quantity_1 est_labour_costs_1 est_labour_total_costs_1 type labour_2 labour_unit_2 est_labour_quantity_2 est_labour_costs_2 est_labour_total_costs_2
solution_id
10 WorkA Person 2.0 300.0 600.0 WorkB Person 2.0 300.0. 600.0
11 WorkC Person 3.0 300.0 900.0 WorkD Person 2.0 300.0. 600.0
Or, shorter code using the same initial split followed by slicing and concatenation:
df2 = (df
       .set_index('solution_id')
       .apply(lambda c: c.str.split('#'))
)
pd.concat([df2.apply(lambda c: c.str[i]).add_suffix(f'_{i+1}')
           for i in range(len(df2.iat[0, 0]))], axis=1)
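For anyone who wants to reproduce the answers above, here is a small sketch of the sample frame, with the column names taken from the question and the values typed in by hand:
import pandas as pd

df = pd.DataFrame({
    'solution_id': [10, 11],
    'type labour': ['WorkA#WorkB', 'WorkC#WorkD'],
    'labour_unit': ['Person#Person', 'Person#Person'],
    'est_labour_quantity': ['2.0#2.0', '3.0#2.0'],
    'est_labour_costs': ['300.0#300.0', '300.0#300.0'],
    'est_labour_total_costs': ['600.0#600.0', '900.0#600.0'],
})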

Store count as variable and use it for calculations in PySpark

I have dataframe df1:
+------+-----------+----------+----------+-----+
| sid|acc_term_id|first_name| last_name|major|
+------+-----------+----------+----------+-----+
|106454| 2014B| Doris| Marshall| BIO|
|106685| 2015A| Sara|Richardson| CHM|
|106971| 2015B| Rose| Butler| CHM|
|107298| 2015B| Kayla| Barnes| CSC|
|107555| 2016A| Carolyn| Ford| PHY|
|107624| 2016B| Marie| Webb| BIO|
+------+-----------+----------+----------+-----+
I want to store the count of sid from this dataframe
c_value = current.agg({"sid": "count"}).collect()[0][0]
and use it for creating a prop column as shown in code below:
c_value = current.agg({"sid": "count"}).collect()[0][0]
stud_major = (
    current
    .groupBy('major')
    .agg(
        expr('COUNT(*) AS n_students')
    )
    .select('major', 'n_students', expr('ROUND(n_students/c_value, 4) AS prop'))
)
stud_major.show(16)
When I run the code I get an error:
cannot resolve '`c_value`' given input columns: [major, n_students]; line 1 pos 17;
If I put the numeric value 2055 instead of c_value, everything is OK, like below:
+-----+----------+------+
|major|n_students| prop|
+-----+----------+------+
| MTH| 320|0.1557|
| CHM| 405|0.1971|
| CSC| 508|0.2472|
| BIO| 615|0.2993|
| PHY| 207|0.1007|
+-----+----------+------+
Probably there are other ways to calculate this, but I need to do it by storing the count as a variable.
Any ideas?
In Jupyter, use pandas agg:
j = df['sid'].count()  # scalar count of sid
df.groupby("major")['sid'].agg(n_students=lambda x: x.count(), prop=lambda x: x.count() / j).reset_index()
major n_students prop
0 BIO 2 0.333333
1 CHM 2 0.333333
2 CSC 1 0.166667
3 PHY 1 0.166667
And in PySpark:
from pyspark.sql.functions import *
df.groupby('major').agg(count('sid').alias('n_students')).withColumn('prop', round((col('n_students')/c_value),2)).show()
Alternatively, you could:
c_value = df.agg({"sid": "count"}).collect()[0][0]
df.groupBy('major').agg(expr('COUNT(*) AS n_students')).selectExpr('major',"n_students", f"ROUND(n_students/{c_value},2) AS prop").show()
+-----+----------+----+
|major|n_students|prop|
+-----+----------+----+
| BIO| 2|0.33|
| CHM| 2|0.33|
| CSC| 1|0.17|
| PHY| 1|0.17|
+-----+----------+----+
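Another equivalent sketch (same df and c_value as above) avoids string interpolation entirely by wrapping the Python variable in lit(), which turns it into a literal column:
from pyspark.sql import functions as F

c_value = df.agg({"sid": "count"}).collect()[0][0]
(df.groupBy('major')
   .agg(F.count('*').alias('n_students'))
   # lit() injects the Python-side count as a constant column, so no f-string is needed
   .withColumn('prop', F.round(F.col('n_students') / F.lit(c_value), 4))
   .show())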

KeyError after resampling

I have two dataframes: df, which is indexed by Datetime, and df2, which has a column Date (a Series).
Before resampling I can run:
>>> df[df2['Date'][0]]
and obtain all rows corresponding to the day df2['Date'][0], which is 2013-08-07 in this example. However, after resampling by day I can no longer obtain the row corresponding to that day:
>>> df.resample('D', how=np.max)[df2['Date'][0]]
KeyError: u'no item named 2013-08-07'
although that day is in the dataset:
>>> df.resample('D', how=np.max).head()
Date       | Temp | etc
-----------+------+----
2013-08-07 | 26.1 |
2013-08-08 | 28.2 |
...
I am not sure whether this is a bug or it is designed to be like this (and if the latter, why), but you can do the following to get the desired result:
In [168]:
df1 = pd.DataFrame(np.random.random(100), columns=['Temp'])
df1.index = pd.date_range('2013-08-07', periods=100, freq='5H')
df1.index.name = 'Date'
In [169]:
df2 = pd.DataFrame(pd.date_range('2013-08-07', periods=23, freq='D'), columns=['Date'])
In [170]:
# You can do this
df3 = df1.resample('D', how=np.max)
print(df3[df3.index == df2['Date'][0]])
Temp
Date
2013-08-07 0.8128
[1 rows x 1 columns]
In [171]:
df3[df2['Date'][0]]
#Error
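For reference, in current pandas the how= argument is gone, so the same lookup would be written roughly like this (a sketch: resample('D').max() replaces resample('D', how=np.max), and a label-based .loc lookup replaces the plain []):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random(100), columns=['Temp'])
df1.index = pd.date_range('2013-08-07', periods=100, freq='5H')
df1.index.name = 'Date'
df2 = pd.DataFrame(pd.date_range('2013-08-07', periods=23, freq='D'), columns=['Date'])

df3 = df1.resample('D').max()
print(df3.loc[[df2['Date'][0]]])   # .loc by label works; df3[df2['Date'][0]] still raises a KeyError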