Calculate the integral of a Pandas DataFrame column for specific time intervals (e.g. per day) using time index - pandas

I have a dataframe (df) that includes power sensor data for a year. The data are sampled at irregular intervals. My df is similar to this:
rng = pd.date_range('2020-07-30 12:00:00', periods=24, freq='6H')
df = pd.DataFrame(np.array([1, 4, 5, 2, 1, 6, 1, 4, 5, 2, 1, 6, 1, 4, 5, 2, 1, 6, 1, 4, 5, 2, 1, 6]), rng, columns=['power'])
df.index.name = 'Date'
df["month"] = df.index.month
df["week"] = df.index.week
What I want to do is to calculate the integral for each day and then be able to sum up these integrals over different durations, e.g. weekly, monthly, etc.
For the whole dataframe the following give correct answers (they take the time index as the x-axis):
np.trapz(df["power"], df.index, axis=0)/np.timedelta64(1, 'h')
or
df.apply(integrate.trapz, args=(df.index,))/np.timedelta64(1, 'h')
When I try to integrate per day I have tried:
df.groupby(df.index.date)["power"].apply(np.trapz)
It has two problems:
it assumes that the "power" measurements are equally spaced and are per 1 unit of time
it does not consider the contribution from the first time interval when the day changes (e.g. on 31/7/2020 the value should have been 13 but now it calculates 8.5)
I also tried:
df.groupby(df.index.date)["power"].apply(integrate.trapz, args=(df.index,))
but I get: TypeError: trapz() got an unexpected keyword argument 'args'
I would like my results to look like:
Date Energy(kWh)
2020-07-30 15
2020-07-31 78
2020-08-01 84
2020-08-02 66
2020-08-03 78
2020-08-04 84
2020-08-05 30
and then to be able to groupby e.g.
df = df.groupby(["month", "week"])["power"].sum()
and the result looks like:
month week Energy(kWh)
7 31 93
8 31 150
32 192
So how can I use in the integration, the index of my initial dataframe?

Try this:
from scipy import integrate

dp = df.copy()  # 'Date' is already the DatetimeIndex of the example df
dp['Energy(kWh)'] = dp["power"].rolling('1D').apply(integrate.trapz)  # '1D' is a 1-day window; you can also choose '1H' or '1S'
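If you need the per-day table from the question and then the weekly/monthly sums, a groupby that reuses the question's own np.trapz / timedelta conversion is another possible sketch (names like daily and per_month_week are just illustrative). Note that each day is integrated only over its own samples, so the slice that crosses midnight is still not split between days; upsampling first (e.g. df.resample('1T').interpolate()) would address that:
import numpy as np
import pandas as pd

# Trapezoidal integral per calendar day, using each group's own DatetimeIndex as the x-axis
daily = (
    df.groupby(df.index.date)["power"]
      .apply(lambda g: np.trapz(g.values, g.index.values) / np.timedelta64(1, "h"))
      .rename("Energy(kWh)")
)

# Sum the daily energies per month and ISO week (isocalendar() needs a fairly recent pandas)
daily.index = pd.to_datetime(daily.index)
per_month_week = daily.groupby([daily.index.month, daily.index.isocalendar().week]).sum()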

Related

calculating within an interval Pandas DataFrame

I have multiple DataFrames with pricing data which I've stored in two dicts (dicti1, dicti2). Each dict contains DataFrames with prices in the interval -5.0 to 0.0 or 0.0 to 5.0 and corresponding volumes for each hour of a day.
Here is a simplified example:
import numpy as np
import pandas as pd

Hour = [1, 1, 1, 1, 1,
2, 2, 2,
3, 3, 3, 3, 3,
4, 4, 4,
5, 5, 5]
Price = [0.0, 1.05, 1.45, 3.67, 4.0,
0.0, 4.5, 5.0,
0.0, 1.45, 1.89, 3.23, 5.0,
0.0, 3.23, 5.0,
0.0, 3.97, 4.2]
Volume = np.random.uniform(low=547.6, high=11452.3, size=19)
Data = {'Date': '2018-01-01', 'Hour' : Hour ,'Price' : Price, 'Volume' : Volume, 'Delta' : 'NaN'}
df = pd.DataFrame(Data, index=list(np.arange(0,19,1)))
print(df)
The hours have price series of different lengths, as you can see if you plot it.
I now want to compute the delta for each Price interval [0.0, 5.0], so for example the volume at index 5 minus the volume at index 7, or the volume at index 8 minus the volume at index 12.
I think an iteration/loop is needed for that, but I'm new to pandas and haven't properly understood yet how it works.
I've tried on a single df to groupby like
group = dicti1[1].groupby(["Price"])
a = group.get_group(-5.0)
a.Delta = a.Volume - list(group.get_group(0.0).Volume)
a
But you can't do that for a dict. Also, I think it would be better if a function counted the rows and took only the rows that fit the interval, so here indices 5-7, 8-12, 13-16.
I've tried to apply groupby to the dicts like
for i in dicti1:
dicti1[i].groupby(["Price"])
but it doesn't work on dicts; I get AttributeError: 'DataFrame' object has no attribute 'get_group'.
So maybe someone could help here?
The output should look like this table:
Index  Date        Hour  Delta
0      2018-01-01  2     Volume p(0.0) - Volume p(5.0)
1      2018-01-01  3     Volume p(0.0) - Volume p(5.0)
2      2018-01-01  4     Volume p(0.0) - Volume p(5.0)
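A possible direction (just a sketch against the simplified example df above, not a tested answer for the original dicts): within each Hour the rows at price 0.0 and price 5.0 mark the interval endpoints, so a groupby on Hour can subtract the volume at 5.0 from the volume at 0.0; hours whose prices never reach 5.0 (like hours 1 and 5 here) simply produce no Delta. The function name hour_delta is made up for the example:
import numpy as np
import pandas as pd

def hour_delta(g):
    # Only hours whose prices actually span 0.0 .. 5.0 get a delta
    if 0.0 in g["Price"].values and 5.0 in g["Price"].values:
        v0 = g.loc[g["Price"] == 0.0, "Volume"].iloc[0]
        v5 = g.loc[g["Price"] == 5.0, "Volume"].iloc[0]
        return v0 - v5
    return np.nan

deltas = (
    df.groupby(["Date", "Hour"])
      .apply(hour_delta)
      .dropna()
      .reset_index(name="Delta")
)
print(deltas)
For the dicts, the same function could be applied to each stored DataFrame in a loop, e.g. for key, d in dicti1.items(): ...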

forecasting in time series in R with dplyr and ggplot

Hope all goes well.
I have a data set that I can share a small piece of it:
date=c("2022-08-01","2022-08-02","2022-08-03","2022-08-04",
"2022-08-05","2022-08-6")
sold_items=c(12,18,9,31,19,10)
df <- data.frame(date=as.Date(date),sold_items)
df %>% sample_n(5)
date sold_items
1 2022-08-04 31
2 2022-08-03 9
3 2022-08-01 12
4 2022-08-06 10
5 2022-08-02 18
I need to forecast the number of sold items in the next two weeks (14 days after the last available date in the data).
I also need to show the forecast along with the current data on one graph using ggplot.
I have been looking into the forecast package to use ARIMA, but I am lost and could not convert this data to a time series object.
I wonder if someone can provide a solution with dplyr to my problem.
Thank you very much.
# first create df
library(fpp3)   # attaches tibble/dplyr and provides as_tsibble(), ARIMA() and forecast()

df =
  tibble(
    sold = c(12, 18, 9, 31, 19, 10),
    date = seq(as.Date("2022-08-01"),
               by = "day",
               length = length(sold))) %>%
  relocate(date)

# then coerce to a tsibble object and model:
df %>%
  as_tsibble(index = date) %>%
  model(ARIMA(sold)) %>%
  forecast(h = 14)

Plotting by groupby and average

I have a dataframe with multiple columns and rows. One column, say 'name', has names, with the same name used multiple times. Other columns, say 'x', 'y', 'z', 'zz', have values. I want to group by name, get the mean of each column (x, y, z, zz) for each name, and then plot the result on a bar chart.
pandas.DataFrame.groupby is an important data-wrangling tool. Let's first make a dummy Pandas DataFrame.
df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
"x": [2, 3, 4, 5, 6, 7],
"y": [5, -3, 10, 34, 1, 54],
"z": [10.6, 99.9, 546.23, 34.12, 65.04, -74.29]})
>>>
name x y z
0 John 2 5 10.60
1 Sansa 3 -3 99.90
2 Bran 4 10 546.23
3 John 5 34 34.12
4 Sansa 6 1 65.04
5 Bran 7 54 -74.29
We can use the label of the column to group the data (here the label is "name"). Explicitly writing the by parameter can be omitted (cf. df.groupby("name")).
df.groupby(by = "name").mean().plot(kind = "bar")
which gives us a nice bar graph.
Transposing the groupby result with .T (as also suggested by anky) yields a different visualization. We can also pass a dictionary as the by parameter to determine the groups; the by parameter can also be a function, a Pandas Series, or an ndarray (see the sketch after the next snippet).
df.groupby(by = {1: "Sansa", 2: "Bran"}).mean().T.plot(kind = "bar")
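As a small illustration of the function form of by (a sketch; the even/odd grouping rule here is made up purely for demonstration), the callable receives each index label and its return value becomes the group key:
import pandas as pd

df = pd.DataFrame({"name": ["John", "Sansa", "Bran", "John", "Sansa", "Bran"],
                   "x": [2, 3, 4, 5, 6, 7]})

# Group rows by whether their integer index label is even or odd
df.groupby(by=lambda idx: "even" if idx % 2 == 0 else "odd")["x"].mean()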

Rolling means in Pandas dataframe

I am trying to run some computations on DataFrames. I want to compute the average difference between two sets of rolling means. To be more specific, the average of the differences between a set of long-term rolling means (lst) and a set of shorter ones (lst_2). I am trying to combine the calculation with a double for loop as follows:
import pandas as pd
import numpy as np

def main(df):
    df = df.pct_change()
    lst = [100, 150, 200, 250, 300]
    lst_2 = [5, 10, 15, 20]
    result = pd.DataFrame(np.sum([calc(df, T, t) for T in lst for t in lst_2])) / (len(lst) + len(lst_2))
    return result

def calc(df, T, t):
    roll = pd.DataFrame(np.sign(df.rolling(t).mean() - df.rolling(T).mean()))
    return roll
Overall I should have 20 differences (5 and 100, 10 and 100, 15 and 100 ... 20 and 300); I take the sign of each difference and I want the average of these differences at each point in time. Ideally the result would be a single DataFrame, result.
I got the error: cannot copy sequence with size 3951 to array axis with dimension 1056 when it runs the double for loops. Obviously I understand that due to rolling of different T and t, the dimensions of the dataframes are not equal when it comes to the array conversion (with np.sum), but I thought it would put "NaN" to align the dimensions.
Hope I have been clear enough. Thank you.
As requested in the comments, here is an example. Let's suppose the following
dataframe:
df = pd.DataFrame({'A': [100,101.4636,104.9477,106.7089,109.2701,111.522,113.3832,113.8672,115.0718,114.6945,111.7446,108.8154]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
df=df.pct_change()
and I have the following 2 sets of mean I need to compute:
lst=[8,10]
lst_1=[3,4]
Then I follow these steps:
1/
I want to compute the rolling mean(3) - rolling mean(8), and get the sign of it:
roll=np.sign(df.rolling(3).mean()-df.rolling(8).mean())
This should return the following:
roll = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1,-1,-1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
2/
I redo step 1 with the combinations of differences 3-10, 4-8 and 4-10, so I get 4 roll dataframes overall.
roll_3_8 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1,-1,-1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
roll_3_10 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
roll_4_8 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1,-1,-1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
roll_4_10 = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
3/
Now that I have all the diffs, I simply want their average, so I sum all 4 rolling dataframes and divide by 4 (the number of differences computed). The result should be (before dropping all N/A values):
result = pd.DataFrame({'A': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN',-1,-1]}, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
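One way to avoid the size error (a sketch, not the only possible fix): instead of np.sum over a plain Python list, concatenate the individual roll DataFrames along the columns so pandas aligns them on the index, then take the row-wise mean. The names calc, lst and lst_2 follow the question; skipna=False keeps the NaN rows NaN, which should reproduce the result DataFrame shown above:
import numpy as np
import pandas as pd

def calc(df, T, t):
    # Sign of (short rolling mean - long rolling mean); NaN where either window is incomplete
    return np.sign(df.rolling(t).mean() - df.rolling(T).mean())

def main(df):
    df = df.pct_change()
    lst, lst_2 = [8, 10], [3, 4]  # the small example values from the question
    rolls = [calc(df, T, t) for T in lst for t in lst_2]
    combined = pd.concat(rolls, axis=1)  # aligned on the index, one column per (T, t) pair
    return combined.mean(axis=1, skipna=False).to_frame("A")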

numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, for id==2 it is 21, and for id==3 it is 18.
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default, the result would be sorted by the group keys; therefore I use the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but will clearly have trouble if you have a very large number of unique ids to go along with large overall size of the data table.
If you're looking only for the sum you probably want to go with bincount. If you also need other grouping operations like product, mean, std etc., have a look at https://github.com/ml31415/numpy-groupies . It's among the fastest python/numpy grouping implementations around; see the speed comparison there.
Your sum operation there would look like:
from numpy_groupies import aggregate
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the grouping key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
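To get the per-id sums the question actually asks for with this approach (a short sketch using the same tuple layout, where the score is the third element of each row and the data is already sorted by id):
import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 8), (2, 56, 9), (3, 6, 18)]
sums = {key: sum(row[2] for row in rows)
        for key, rows in itertools.groupby(data, key=lambda x: x[0])}
print(sums)  # {1: 50, 2: 21, 3: 18}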