What algorithm can I use to compute number of say positive or negative postings seen until a certain timepoint? - data-science

I wish to check if my understanding and proposed algorithm below would be correct.
to calculate the number of positive postings I have seen until time point ti, I am proposing a loop as below:
sumofPi = 0
for x = 0 until x = ti
sumofPi = sumofPi + Pi-1
I am not sure if this will work but the idea is to be able to sum up the positive postings that comes in within a certain timepoint in a data stream.
Thanks

The sequence seems fine as long as the events are indexed in order and you are comfortable loosing events that happened at the same time but indexed differently as a result of that limitation. You may also want to address posting type filtering.
Your algorithm in Python:
# Sample data
postingevents=[1,0,1,1,0,1]
# Algorithm:
sumofPi = 0
ti=4
for i in range(0,ti):
sumofPi += postingevents[i]
print(sumofPi)
3
Looks like you are dealing with time series.
For time series, I would suggest rolling sum or rolling weighted averages, there's an example here
Below are some Python code samples using loops and recursion with a data sample (Event indicator & epoch time stamp)
# Data sample:
postingevents=[1,0,1,1,0,1]
postingti=[1497634668,1497634669,1497634697,1497634697,1497634714,1497634718]
postings=([postingevents,postingti])
# All events preceeding time stamp T. Events do not need to be ordered by time.
def sumpi_notordered(X,t):
return sum([xv if yv<=t else 0 for (xv,yv) in zip(X[0],X[1])])
# Sum ordered events indexed by T, using recursion.
def sumpi_ordered(X,t):
if t>=1:
return X[t]+sumpi_ordered(X,t-1)
else:
return(X[t])
print(sumpi_notordered(postings,1497634697))
3
print(sumpi_ordered(postingevents,3))
3

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model by using pandas rolling window. the problem with this method is that pandas only allows you to create a window from t=0-x until t=0 for you rolling window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is were the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but these 2 weeks are not the 2 weeks before the corresponding shipment, but of 2 weeks, starting from t=-4weeks , and ending on t=-2weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other type of denotation of this specific window does not seem to work.
It seems like pandas does not offer a solution to this problem, so we made a work around it with the following solution:
def time_shift_week(df):
def _avg_score_interval_func(series):
current_time = series.index[-1]
result = series[(series.index > ( current_time- pd.Timedelta(value=4, unit='w')))
& (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
return result.mean() if len(result)>0 else 0.0
temp_df = df.groupby(by=["supplier", "timestamp"], as_index=False).aggregate({"score": np.mean}).set_index('timestamp')
temp_df["w-42"] = (
temp_df
.groupby(["supplier"])
.ag_score
.apply(lambda x:
x
.rolling(window='30D', closed='both')
.apply(_avg_score_interval_func)
))
return temp_df.reset_index()
This results in a new df in which we find the average score score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Eventhough we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
[list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()
# Now we gonna do rolling avg on score with the custom window.
# closed=left mean the current row will be excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())

How to fix "Solution Not Found" Error in Gekko Optimization with rolling principle

My program is optimizing the charging and decharging of a home battery to minimize the cost of electricity at the end of the year. In this case there also is a PV, which means that sometimes you're injecting electricity into the grid and receive money. The net offtake is the result of the usage of the home and the PV installation. So these are the possible situations:
Net offtake > 0 => usage home > PV => discharge from battery or take from grid
Net offtake < 0 => usage home < PV => charge battery or injection into grid
The electricity usage of homes is measured each 15 minutes, so I have 96 measurement point in 1 day. I want to optimilize the charging and decharging of the battery for 2 days, so that day 1 takes the usage of day 2 into account.
I wrote a controller that reads the data and gives each time the input values for 2 days to the optimization. With a rolling principle, it goes to the next 2 days and so on. Below you can see the code from my controller.
from gekko import GEKKO
from simulationModel2_2d_1 import getSimulation2
from exportModel2 import exportToExcelModel2
import numpy as np
#import matplotlib.pyplot as plt
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'Data Sim 2.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Netto afname (kWh)','Prijs afname (€/kWh)',
'Prijs injectie (€/kWh)','Capaciteit batterij (kW)',
'Capaciteit batterij (kWh)','Rendement (%)',
'Verbruikersprofiel','Capaciteit PV (kWp)','Aantal dagen'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
net_offtake = dataRead['Netto afname (kWh)'].to_numpy()
price_offtake = dataRead['Prijs afname (€/kWh)'].to_numpy()
price_injection = dataRead['Prijs injectie (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
days = dataRead['Aantal dagen'].iloc[0]
pv = dataRead['Capaciteit PV (kWp)'].iloc[0]
# ------------- Optimization model & Rolling principle (2 days) --------------
# Initialise model
m = GEKKO()
# Output data
ts = []
charging = [] # Amount to charge/decharge batterij
e_batt = [] # Amount of energy in the battery
usage_net = [] # Usage after home, battery and pv
p_paid = [] # Price paid for energy of 15min
# Energy in battery to pass
energy = 0
# Iterate each day for one year
for d in range(int(days)-1):
d1_timestep = []
d1_net_offtake = []
d1_price_offtake = []
d1_price_injection = []
d2_timestep = []
d2_net_offtake = []
d2_price_offtake = []
d2_price_injection = []
# Iterate timesteps
for i in range(96):
d1_timestep.append(timestep[d*96+i])
d2_timestep.append(timestep[d*96+i+96])
d1_net_offtake.append(net_offtake[d*96+i])
d2_net_offtake.append(net_offtake[d*96+i+96])
d1_price_offtake.append(price_offtake[d*96+i])
d2_price_offtake.append(price_offtake[d*96+i+96])
d1_price_injection.append(price_injection[d*96+i])
d2_price_injection.append(price_injection[d*96+i+96])
# Input data simulation of 2 days
ts_temp = np.concatenate((d1_timestep, d2_timestep))
net_offtake_temp = np.concatenate((d1_net_offtake, d2_net_offtake))
price_offtake_temp = np.concatenate((d1_price_offtake, d2_price_offtake))
price_injection_temp = np.concatenate((d1_price_injection, d2_price_injection))
if(d == 7):
print(ts_temp)
print(energy)
# Simulatie uitvoeren
charging_temp, e_batt_temp, usage_net_temp, p_paid_temp, energy_temp = getSimulation2(ts_temp, net_offtake_temp, price_offtake_temp, price_injection_temp, cap_batt_kW, cap_batt_kWh, efficiency, energy, pv)
# Take over output first day, unless last 2 days
energy = energy_temp
if(d == (days-2)):
for t in range(1,len(ts_temp)):
ts.append(ts_temp[t])
charging.append(charging_temp[t])
e_batt.append(e_batt_temp[t])
usage_net.append(usage_net_temp[t])
p_paid.append(p_paid_temp[t])
elif(d == 0):
for t in range(int(len(ts_temp)/2)+1):
ts.append(ts_temp[t])
charging.append(charging_temp[t])
e_batt.append(e_batt_temp[t])
usage_net.append(usage_net_temp[t])
p_paid.append(p_paid_temp[t])
else:
for t in range(1,int(len(ts_temp)/2)+1):
ts.append(ts_temp[t])
charging.append(charging_temp[t])
e_batt.append(e_batt_temp[t])
usage_net.append(usage_net_temp[t])
p_paid.append(p_paid_temp[t])
print('Simulation day '+str(d+1)+' complete.')
# ------------------------ Export output data to Excel -----------------------
a = exportToExcelModel2(ts, usage_home, net_offtake, price_offtake, price_injection, charging, e_batt, usage_net, p_paid, cap_batt_kW, cap_batt_kWh, efficiency, usersprofile, pv)
print(a)
The optimization with Gekko happens in the following code:
from gekko import GEKKO
def getSimulation2(timestep, net_offtake, price_offtake, price_injection,
cap_batt_kW, cap_batt_kWh, efficiency, start_energy, pv):
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO(remote = False)
# Global options
m.options.SOLVER = 1
m.options.IMODE = 6
# Constants
speed_charging = cap_batt_kW/4
m.time = timestep
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = speed_charging) # max battery can charge in 15min
max_decharge = m.Const(value = -speed_charging) # max battery can decharge in 15min
# Parameters
usage_home = m.Param(net_offtake)
price_offtake = m.Param(price_offtake)
price_injection = m.Param(price_injection)
# Variables
e_batt = m.Var(value=start_energy, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
price_paid = m.Var() # price paid each 15min
charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
usage_net = m.Var(lb=min_cap_batt)
# Equations
m.Equation(e_batt==(m.integral(charging)+start_energy)*efficiency)
m.Equation(-charging <= e_batt)
m.Equation(usage_net==usage_home + charging)
price = m.Intermediate(m.if2(usage_net*1e6, price_injection, price_offtake))
price_paid = m.Intermediate(usage_net * price / 100)
# Objective
m.Minimize(price_paid)
# Solve problem
m.options.COLDSTART=2
m.solve()
m.options.TIME_SHIFT=0
m.options.COLDSTART=0
m.solve()
# Energy to pass
energy_left = e_batt[95]
#m.cleanup()
return charging, e_batt, usage_net, price_paid, energy_left
The data you need for input can be found in this Excel document:
https://docs.google.com/spreadsheets/d/1S40Ut9-eN_PrftPCNPoWl8WDDQtu54f0/edit?usp=sharing&ouid=104786612700360067470&rtpof=true&sd=true
With this code, it always ends at day 17 with the "Solution Not Found" Error.
I already tried extending the default iteration limit to 500 but it didn't work.
I also tried with other solvers but also no improvement.
By presolving with COLDSTART it already reached day 17, without this it ends at day 8.
I solved the days were my optimization ends individually and then the solution was always found immediately with the same code.
Someone who can explain this to me and maybe find a solution? Thanks in advance!
This is kind of big to troubleshoot, but here are some general ideas that might help. This assumes, as you said, that the model solves fine for day 1-2, and day 3-4, and day 5-6, etc. And that those results pass inspection (aka the basic model is working as you say).
Then something is (obviously) amiss around day 17. Some things to look at and try:
Start the model at day 16-17, see if it works in isolation
gather your results as you go and do a time series plot of the key variables, maybe one of them is on an obvious bad trend towards a limit causing an infeasibility... Perhaps the e_batt variable is slowly declining because not enough PV energy is available and hits minimum on Day 17
Radically change the upper/lower bounds on your variables to test them to see if they might be involved in the infeasibility (assuming there is one)
Make a different (fake) dataset where the solution is assured and kind of obvious... all constants or a pattern that you know will produce some known result and test the model outputs against it.
It might also be useful to pursue the tips in this excellent post on troubleshooting gekko, and edit your question with the results of that effort.
edit: couple ideas from your comment...
You didn't say what the result was from troubleshooting and whether the error is infeasible or max iterations, or ???. But...
If the model seems to crunch after about 15 days, I'm betting that it is becoming infeasible. Did you plot the battery level over course of days?
Also, I'm suspicious of your equation for the e_batt level... Why are you multiplying the prior battery state by the efficiency factor? That seems incorrect. That is charge that is already in the battery. Odds are you are (incorrectly) hitting the battery charge level every day with the efficiency tax and that the max charge level isn't sufficient to keep up with demand.
In addition to tips above try:
fix your efficiency formula to not multiply the efficiency times the previous state
change the efficiency to 100%
make the upper limit on charge huge
As an aside: I don't really see the connection to PV energy available here. What you are basically modeling is some "mystery battery" that you can charge and discharge anytime you want. I would get this debugged first and then look at the energy available by time of day...you aren't going to be generating charge at midnight. :).

How to set Custom Business Day End Frequency in Pandas

I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
'1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
'1985-01-14', '1985-01-15',
...
'1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
'1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
'1990-12-28', '1990-12-31'],
dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column such that a value that is at the last day of a month (which could e.g. in my DatetimeIndex be '1985-05-30') is shifted to the last day of the next (which could e.g. my DatetimeIndex be '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over Offset Aliases provided by pandas.tseries.offsets. It can be observed that there are the aliases custom business day frequency (C) and custom business month end frequency (CBM). When looking at an example, it seems like that this could provide exactly what I need:
mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq = mth_us) # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq = day_us) # shifted by 1 day
The problem is that my DatetimeIndex is not equal to USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, has any of you an idea how to tackle this issue in a different way?
Thanks a lot for your help!

Why even though I sliced my original DataFrame and assigned it to another variable, my original DataFrame still changed values?

I am trying to calculate a portfolio's daily total price, by multiplying weights of each asset with the daily price of the assets.
Currently I have a DataFrame tw which is all zeros except for the dates that I want to re-balance, which holds my assets weights. What I would like to do is for each month, populate the zeros with the weights I am trying to re-balance with, till the next re-balancing date, and so on and so forth.
My code:
df_of_weights = tw.loc[dates_to_rebalance[13]:]
temp_date = dates_to_rebalance[13]
counter = 0
for date in df_of_weights.index:
if date.year == temp_date.year and date.month == temp_date.month:
if date.day == temp_date.day:
pass
else:
df_of_weights.loc[date] = df_of_weights.loc[temp_date].values
counter += 1
temp_date = dates_to_rebalance[13+counter]
I understand that if you slice your DataFrame and assign it to a variable (df_of_weights), changing the values of said variable would not affect the original DataFrame. However, the values in tw changed. Have been searching for an answer online for a while now and am really confused.
You should use copy in order to fix the problem such that:
df_of_weights = tw.loc[dates_to_rebalance[13]:].copy()
The problem is slicing provides view instead of copy. The issue is still open.
https://github.com/pandas-dev/pandas/issues/15631

Moving Average of time series using a sliding window over an array

I need to write a function below that can compute the moving average of time series using a sliding window over an array. This function should take an array of date strings (say arr_date), an array of numbers (say arr_record), and a sliding window (default value 50). It should:
Return a list of dictionaries for all windows.
Each dictionary should include the date, average value, min, max, standard deviation at each window.
Able to handle missing data in time series by replacing missing data with the most recent available data.
(b) Download SPY daily data (Dec. 31, 2017 to Dec. 31, 2018) from Yahoo! as your test data in a .csv file. Read reading .csv file example and write a test programming for calling your function.
Does anyone have any thoughts? Extremely new to python and struggling.
So something following this logic should probably be a good starting point. Hope this is a helpful start, and welcome to the cs community.
def sliding_window( dates, numbers, sliding_window_value):
# list of dictionaries
return_dicts =[{}]
# if window size is greater than length of dates, there's only one window
if sliding_window_value >= len(dates):
return_dicts += [create_window(dates, numbers)]
return return_dicts
# gather all our windows into one list
for i in range (0, len(dates) - sliding_window_value ):
# get our window subsets
dates_subset = dates[i:(sliding_window_value+1)]
numbers_subset = numbers[i:(sliding_window_value+1)]
# get our window stats dictionary
window_stats = create_window(dates_subset,numbers_subset)
# add these stats to our return list
return_dicts += [window_stats]
return return_dicts
def create_window(dates_subset, numbers_subset):
window_min = 1000000 # some high minimum to start
window_max = -1000000 # some low maximuim to start
window_total = 0
for i in range ( 0, len(dates_subset)):
# calculate total
window_total += numbers_subset[i]
# calculate max
if numbers_subset[i] > window_max:
window_max = numbers_subset[i]
# calculate min
if numbers_subset[i] < window_min:
window_min = numbers_subset[i]
# other calculations....
return_dict = {
"min" : window_min,
"max" : window_max,
"average" : window_total / len(dates_subset),
# other calculations....
}
return return_dict
Good luck bud, the work is worth it.