Zarr slow read speed of 43.82 GB file with Xarray


I want to look up all 8760 hourly values for a single lat/lon combination in less than a second from a 43.82 GB file of wind data containing:
8760 times (every hour in a year)
721 latitudes (every 0.25° from -90.0° to 90.0°)
1440 longitudes (every 0.25° from -180.0° to 179.75°)
The best time we achieved for a single-year look-up was 16 seconds for both the u100 and v100 components of the wind vector at 100 m. We want a sub-second look-up for the whole year, because this file read will need to happen on every user request in our API.
import time
import xarray as xr

if __name__ == '__main__':
    start_time = time.time()
    # Open lazily; the data is only read when .values is accessed below
    ds = xr.open_dataset("2021.zarr", engine="zarr", chunks={"time": 50})
    print(f"Took {round((time.time() - start_time) * 1000, 2)}ms")
    # Select the grid point nearest to the requested coordinates
    location = ds.sel(indexers={"latitude": 53.494, "longitude": 9.979}, method='nearest')
    wind_speed = (location.u100.values ** 2 + location.v100.values ** 2) ** 0.5
    print(f"Wind Speed: {wind_speed} m/s")
    print(f"Took {round((time.time() - start_time) * 1000, 2)}ms")
Output:
Took 94.28ms
Wind Speed: [5.8021994 5.504477 5.4270387 ... 9.563195 8.701231 9.133655 ] m/s
Took 16299.59ms
I would be very thankful for any help!
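One thing that may be worth checking is how the store is chunked on disk, since reading one grid point for the whole year has to load every chunk that contains that point. The sketch below only does the arithmetic for two hypothetical chunk shapes; the shapes and the float32 assumption are illustrative and are not read from the actual 2021.zarr store.

```python
import math

# Grid from the question: hourly steps on a 0.25 degree lat/lon grid.
SHAPE = (8760, 721, 1440)   # (time, latitude, longitude)
BYTES_PER_VALUE = 4         # assumption: values are stored as float32


def point_read_cost(chunk_shape):
    """Return (chunks touched, uncompressed bytes) for one grid point over all times.

    A single lat/lon cell lies in exactly one spatial chunk, so only the
    number of chunks along the time axis matters. The byte count is an
    upper bound because the last chunk along time may be partial.
    """
    n_chunks = math.ceil(SHAPE[0] / chunk_shape[0])
    bytes_per_chunk = BYTES_PER_VALUE * math.prod(chunk_shape)
    return n_chunks, n_chunks * bytes_per_chunk


# Hypothetical layouts for one variable (for example u100).
map_oriented = point_read_cost((50, 721, 1440))    # whole globe, 50 hours per chunk
series_oriented = point_read_cost((8760, 10, 10))  # whole year, 10 x 10 cells per chunk

print(map_oriented)
print(series_oriented)
```

If the real layout turns out to be closer to the first shape, rewriting the store with chunks that are long in time and small in space might bring the look-up closer to the target, at the cost of slower whole-map reads.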

Related

How to fix "Solution Not Found" Error in Gekko Optimization with rolling principle

My program optimizes the charging and discharging of a home battery to minimize the cost of electricity at the end of the year. In this case there is also a PV installation, which means that sometimes you inject electricity into the grid and receive money. The net offtake is the result of the usage of the home and the PV installation. So these are the possible situations:
Net offtake > 0 => usage home > PV => discharge from battery or take from grid
Net offtake < 0 => usage home < PV => charge battery or injection into grid
The electricity usage of homes is measured every 15 minutes, so I have 96 measurement points in 1 day. I want to optimize the charging and discharging of the battery over 2 days, so that day 1 takes the usage of day 2 into account.
I wrote a controller that reads the data and passes the input values for 2 days at a time to the optimization. With a rolling principle, it then goes on to the next 2 days and so on. Below you can see the code from my controller.
from gekko import GEKKO
from simulationModel2_2d_1 import getSimulation2
from exportModel2 import exportToExcelModel2
import numpy as np
#import matplotlib.pyplot as plt
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'Data Sim 2.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Netto afname (kWh)','Prijs afname (€/kWh)',
'Prijs injectie (€/kWh)','Capaciteit batterij (kW)',
'Capaciteit batterij (kWh)','Rendement (%)',
'Verbruikersprofiel','Capaciteit PV (kWp)','Aantal dagen'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
net_offtake = dataRead['Netto afname (kWh)'].to_numpy()
price_offtake = dataRead['Prijs afname (€/kWh)'].to_numpy()
price_injection = dataRead['Prijs injectie (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
days = dataRead['Aantal dagen'].iloc[0]
pv = dataRead['Capaciteit PV (kWp)'].iloc[0]
# ------------- Optimization model & Rolling principle (2 days) --------------
# Initialise model
m = GEKKO()
# Output data
ts = []
charging = [] # Amount to charge/discharge battery
e_batt = [] # Amount of energy in the battery
usage_net = [] # Usage after home, battery and pv
p_paid = [] # Price paid for energy of 15min
# Energy in battery to pass
energy = 0
# Iterate each day for one year
for d in range(int(days)-1):
    d1_timestep = []
    d1_net_offtake = []
    d1_price_offtake = []
    d1_price_injection = []
    d2_timestep = []
    d2_net_offtake = []
    d2_price_offtake = []
    d2_price_injection = []
    # Iterate timesteps
    for i in range(96):
        d1_timestep.append(timestep[d*96+i])
        d2_timestep.append(timestep[d*96+i+96])
        d1_net_offtake.append(net_offtake[d*96+i])
        d2_net_offtake.append(net_offtake[d*96+i+96])
        d1_price_offtake.append(price_offtake[d*96+i])
        d2_price_offtake.append(price_offtake[d*96+i+96])
        d1_price_injection.append(price_injection[d*96+i])
        d2_price_injection.append(price_injection[d*96+i+96])
    # Input data simulation of 2 days
    ts_temp = np.concatenate((d1_timestep, d2_timestep))
    net_offtake_temp = np.concatenate((d1_net_offtake, d2_net_offtake))
    price_offtake_temp = np.concatenate((d1_price_offtake, d2_price_offtake))
    price_injection_temp = np.concatenate((d1_price_injection, d2_price_injection))
    if(d == 7):
        print(ts_temp)
        print(energy)
    # Run the simulation
    charging_temp, e_batt_temp, usage_net_temp, p_paid_temp, energy_temp = getSimulation2(
        ts_temp, net_offtake_temp, price_offtake_temp, price_injection_temp,
        cap_batt_kW, cap_batt_kWh, efficiency, energy, pv)
    # Take over output first day, unless last 2 days
    energy = energy_temp
    if(d == (days-2)):
        for t in range(1,len(ts_temp)):
            ts.append(ts_temp[t])
            charging.append(charging_temp[t])
            e_batt.append(e_batt_temp[t])
            usage_net.append(usage_net_temp[t])
            p_paid.append(p_paid_temp[t])
    elif(d == 0):
        for t in range(int(len(ts_temp)/2)+1):
            ts.append(ts_temp[t])
            charging.append(charging_temp[t])
            e_batt.append(e_batt_temp[t])
            usage_net.append(usage_net_temp[t])
            p_paid.append(p_paid_temp[t])
    else:
        for t in range(1,int(len(ts_temp)/2)+1):
            ts.append(ts_temp[t])
            charging.append(charging_temp[t])
            e_batt.append(e_batt_temp[t])
            usage_net.append(usage_net_temp[t])
            p_paid.append(p_paid_temp[t])
    print('Simulation day '+str(d+1)+' complete.')
# ------------------------ Export output data to Excel -----------------------
a = exportToExcelModel2(ts, usage_home, net_offtake, price_offtake, price_injection, charging, e_batt, usage_net, p_paid, cap_batt_kW, cap_batt_kWh, efficiency, usersprofile, pv)
print(a)
The optimization with Gekko happens in the following code:
from gekko import GEKKO
def getSimulation2(timestep, net_offtake, price_offtake, price_injection,
                   cap_batt_kW, cap_batt_kWh, efficiency, start_energy, pv):
    # ---------------------------- Optimization model ----------------------------
    # Initialise model
    m = GEKKO(remote = False)
    # Global options
    m.options.SOLVER = 1
    m.options.IMODE = 6
    # Constants
    speed_charging = cap_batt_kW/4
    m.time = timestep
    max_cap_batt = m.Const(value = cap_batt_kWh)
    min_cap_batt = m.Const(value = 0)
    max_charge = m.Const(value = speed_charging) # max battery can charge in 15min
    max_decharge = m.Const(value = -speed_charging) # max battery can decharge in 15min
    # Parameters
    usage_home = m.Param(net_offtake)
    price_offtake = m.Param(price_offtake)
    price_injection = m.Param(price_injection)
    # Variables
    e_batt = m.Var(value=start_energy, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
    price_paid = m.Var() # price paid each 15min
    charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
    usage_net = m.Var(lb=min_cap_batt)
    # Equations
    m.Equation(e_batt==(m.integral(charging)+start_energy)*efficiency)
    m.Equation(-charging <= e_batt)
    m.Equation(usage_net==usage_home + charging)
    price = m.Intermediate(m.if2(usage_net*1e6, price_injection, price_offtake))
    price_paid = m.Intermediate(usage_net * price / 100)
    # Objective
    m.Minimize(price_paid)
    # Solve problem
    m.options.COLDSTART=2
    m.solve()
    m.options.TIME_SHIFT=0
    m.options.COLDSTART=0
    m.solve()
    # Energy to pass
    energy_left = e_batt[95]
    #m.cleanup()
    return charging, e_batt, usage_net, price_paid, energy_left
The data you need for input can be found in this Excel document:
https://docs.google.com/spreadsheets/d/1S40Ut9-eN_PrftPCNPoWl8WDDQtu54f0/edit?usp=sharing&ouid=104786612700360067470&rtpof=true&sd=true
With this code, it always ends at day 17 with the "Solution Not Found" error.
I already tried extending the default iteration limit to 500, but it didn't work.
I also tried other solvers, but that brought no improvement either.
By presolving with COLDSTART it already reaches day 17; without this it ends at day 8.
I solved the days where my optimization ends individually, and then the solution was always found immediately with the same code.
Can someone explain this to me and maybe find a solution? Thanks in advance!

This is kind of big to troubleshoot, but here are some general ideas that might help. This assumes, as you said, that the model solves fine for day 1-2, and day 3-4, and day 5-6, etc. And that those results pass inspection (aka the basic model is working as you say).
Then something is (obviously) amiss around day 17. Some things to look at and try:
Start the model at day 16-17, see if it works in isolation
Gather your results as you go and do a time series plot of the key variables. Maybe one of them is on an obvious bad trend towards a limit, causing an infeasibility. Perhaps the e_batt variable is slowly declining because not enough PV energy is available and hits its minimum on day 17
Radically change the upper/lower bounds on your variables to test them to see if they might be involved in the infeasibility (assuming there is one)
Make a different (fake) dataset where the solution is assured and kind of obvious... all constants or a pattern that you know will produce some known result and test the model outputs against it.
It might also be useful to pursue the tips in this excellent post on troubleshooting gekko, and edit your question with the results of that effort.
edit: couple ideas from your comment...
You didn't say what the result was from troubleshooting and whether the error is infeasible or max iterations, or ???. But...
If the model seems to crunch after about 15 days, I'm betting that it is becoming infeasible. Did you plot the battery level over course of days?
Also, I'm suspicious of your equation for the e_batt level... Why are you multiplying the prior battery state by the efficiency factor? That seems incorrect. That is charge that is already in the battery. Odds are you are (incorrectly) hitting the battery charge level every day with the efficiency tax and that the max charge level isn't sufficient to keep up with demand.
In addition to tips above try:
fix your efficiency formula to not multiply the efficiency times the previous state
change the efficiency to 100%
make the upper limit on charge huge
As an aside: I don't really see the connection to PV energy available here. What you are basically modeling is some "mystery battery" that you can charge and discharge anytime you want. I would get this debugged first and then look at the energy available by time of day...you aren't going to be generating charge at midnight. :).
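To make the "efficiency tax" point concrete, here is a minimal sketch in plain Python (no Gekko). It compares carrying the battery state from one rolling window to the next under the question's formula with a variant that applies efficiency only to newly charged energy. The function names, the starting energy of 5 kWh and the efficiency of 0.9 (as a fraction) are made-up assumptions for illustration.

```python
def carry_over_question_formula(start_energy, net_charge_per_day, efficiency, days):
    """Mimics e_batt = (integral(charging) + start_energy) * efficiency per window."""
    energy = start_energy
    for _ in range(days):
        # The stored energy itself is multiplied by the efficiency every window.
        energy = (net_charge_per_day + energy) * efficiency
    return energy


def carry_over_charge_only(start_energy, net_charge_per_day, efficiency, days):
    """Hypothetical alternative: only newly charged energy pays the efficiency loss."""
    energy = start_energy
    for _ in range(days):
        energy = energy + efficiency * net_charge_per_day
    return energy


# With no net charging at all, the first version still drains the battery.
taxed = carry_over_question_formula(5.0, 0.0, 0.9, 17)
untaxed = carry_over_charge_only(5.0, 0.0, 0.9, 17)
print(taxed, untaxed)
```

In the first version the stored energy shrinks by the efficiency factor in every window even when nothing is charged or discharged, which is the kind of slow drift that could push a later window into infeasibility.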

Is there an equivalent to numpy.digitize that works on a pandas.IntervalIndex?

I need to match each hour of the month to a monthly total for the month that the hour falls in.
I am passed a DataFrame (monthly_totals) with a time-based pandas.IntervalIndex, and a second DataFrame (hours) with a pandas.DatetimeIndex. More generally, I need to match the index of one DataFrame, to the interval of another DataFrame that each entry falls into.
I have a working solution, using pandas.Series.apply, but it is quite slow. I see that numpy.digitize exists, and it taunts me, because the bins parameter must be an array, not an IntervalIndex.
My first attempt, which works, but takes about 1 second to process a DataFrame of length 8760, is as follows:
def get_mock_montly_totals(self):
    start = '2018-07-01'
    end = '2019-07-01'
    hourly_rng = pd.date_range(start, end, freq='H')
    monthly_rng = pd.date_range(start, end, freq='MS')
    mock_series = pd.Series(1, index=hourly_rng)
    bins = (monthly_rng + pd.offsets.Day(pd.Timestamp(start).day - 1))
    cuts = pd.cut(mock_series.index, bins, right=False)
    groups = mock_series.groupby(cuts)
    monthly_totals = groups.sum()
    return monthly_totals
def get_interval_value(self, frame, key):
    try:
        return frame.iloc[frame.index.get_loc(key)]
    except KeyError:
        return np.nan
result = api.get_secret_data().resample('H').asfreq()
hours = result.index.to_series()
monthly_totals = self.get_mock_montly_totals()
# This line takes over a second to run, which is too slow.
result['monthly_totals'] = hours.apply(
    lambda h: self.get_interval_value(monthly_totals, h))
Where monthly_totals looks like:
[2018-07-01, 2018-08-01) 744
[2018-08-01, 2018-09-01) 744
[2018-09-01, 2018-10-01) 720
[2018-10-01, 2018-11-01) 744
[2018-11-01, 2018-12-01) 720
[2018-12-01, 2019-01-01) 744
[2019-01-01, 2019-02-01) 744
[2019-02-01, 2019-03-01) 672
[2019-03-01, 2019-04-01) 744
[2019-04-01, 2019-05-01) 720
[2019-05-01, 2019-06-01) 744
[2019-06-01, 2019-07-01) 720
dtype: int64
hours looks like:
time
2018-06-27 00:00:00-10:00 2018-06-27 10:00:00
...
2019-06-24 21:00:00-10:00 2019-06-25 07:00:00
And the output, result['monthly_totals'] should look like:
time
2018-06-27 00:00:00-10:00 NaN
...
2019-06-24 20:00:00-10:00 720
2019-06-24 21:00:00-10:00 720
Again, my solution works, but the call to apply seems to make it hella slow. So I really want some help getting towards a cleaner solution that ditches that. Thank you!
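One vectorized direction that might replace the apply call is IntervalIndex.get_indexer, which returns the position of the interval containing each value, or -1 when no interval matches. The sketch below rebuilds the monthly totals shown above by hand and uses timezone-naive hourly timestamps; the variable names and the mock hourly range are assumptions, not the real API data.

```python
import numpy as np
import pandas as pd

# Mock monthly totals on a left-closed IntervalIndex, values copied from the question.
breaks = pd.date_range("2018-07-01", "2019-07-01", freq="MS")
monthly_totals = pd.Series(
    [744, 744, 720, 744, 720, 744, 744, 672, 744, 720, 744, 720],
    index=pd.IntervalIndex.from_breaks(breaks, closed="left"),
)

# Hypothetical hourly index that starts a few days before the first interval.
hours = pd.date_range("2018-06-27", periods=8760, freq=pd.Timedelta(hours=1))

# Position of the containing interval for every hour, -1 where none matches.
positions = monthly_totals.index.get_indexer(hours)

# Look up all totals at once and blank out the unmatched hours.
values = monthly_totals.to_numpy(dtype=float)
matched = np.where(positions >= 0, values[positions], np.nan)
result = pd.Series(matched, index=hours, name="monthly_totals")

print(result.head())
print(result.tail())
```

This relies on the intervals not overlapping. With timezone-aware data the hours would need to be brought into the same timezone convention as the interval bounds before the lookup.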

Solar energy conversion W/m^2 to MJ/m^2

I am new here. I am using MERRA monthly solar radiation data and want to convert W/m^2 to MJ/m^2.
I am a bit confused about how to convert monthly average solar radiation data from W/m^2 to MJ/m^2.
So far, from reading different sources, I understood that
I first have to convert W/m^2 to kW/m^2,
and then kW/m^2 to MJ/m^2.
Am I doing this correctly?
Taking one instance:
For the month of May I have a value of 294 W/m^2.
So 294 * 0.001 = 0.294 kW/m^2
0.294 * 24 (kW to kWh per day) = 7.056 kWh/m^2/day
7.056 * 3.6 (kWh to MJ) = 25.40 MJ/m^2/day
I am not sure whether I am doing this right or wrong.

Not sure why you would take the kWh step in between.
Your panels do 294 Watt per m², i.e. 294 Joule per sec per m². So that's 24*60*60 * 294 = 25401600 Joule per m² per day, or 25.4016 MJ per m² per day.
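The same arithmetic as a small Python sketch, in case it helps to apply it to a whole series of monthly means. The function name and the optional days argument are illustrative assumptions; the 31-day figure simply scales the daily value to the month of May.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400 s


def mean_flux_to_mj_per_m2(watts_per_m2, days=1):
    """Convert a mean flux in W/m^2 to accumulated energy in MJ/m^2 over `days` days."""
    joules_per_m2 = watts_per_m2 * SECONDS_PER_DAY * days
    return joules_per_m2 / 1e6


daily_mj = mean_flux_to_mj_per_m2(294)          # one day at a mean of 294 W/m^2
may_mj = mean_flux_to_mj_per_m2(294, days=31)   # the whole month of May

print(daily_mj, may_mj)
```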

So if:
1 W/m2 = 1 J/(m2 s)
Then:
294 W/m2 = 294 J/(m2 s)
If you want it per day, then:
1 day = 60 s/min * 60 min/h * 24 h = 86400 s
294 J/(m2 s) x 86400 s/day = 25401600 J/(m2 day)
25401600 J/(m2 day) x 1 MJ/1000000 J = 25.4016 MJ/(m2 day)
All together:
294 W/m2 = 294/(1000000/86400) = 25.4016 MJ/(m2 day)

A watt is the unit of power and the joule is the unit of energy; they are related by time. 1 watt is 1 joule per second: 1 W = 1 J/s. The extension of that equation is that 1 J = 1 W x 1 second, so 1 J = 1 Ws. A loose analogy: the litre is a unit of volume and L/s is a unit of flow. So your calculation needs to consider how long you are gathering the solar energy. If the sunlight shines at 90 degrees to the solar panel for 1 hour, the number of joules is 294 W/m2 x 3600 s, which gives about 1.06 x 10^6 joules per square metre. Of course, as the inclination (the angle of the light) varies away from 90 degrees, the effective power and hence the energy absorbed will drop, as a function of the sine of the angle to the sun. 90 degrees gives a sine of 1 and is full power.
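A short sketch of that relation, assuming the angle is measured between the incoming light and the panel surface, so that 90 degrees means the light hits the panel head-on. The function name and the 30 degree example are illustrative choices.

```python
import math


def energy_on_panel(power_w_per_m2, seconds, angle_deg):
    """Energy in J/m^2 received over `seconds` at a fixed angle between light and panel."""
    return power_w_per_m2 * seconds * math.sin(math.radians(angle_deg))


head_on = energy_on_panel(294, 3600, 90)   # full power for one hour
slanted = energy_on_panel(294, 3600, 30)   # sine of 30 degrees is 0.5

print(head_on, slanted)
```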

Using DAX for Production Planning

My question is based on building a ramp up for planning production lines.
I have a WIP where a ramp-up category is selected for each MSO (Master Sew Order). The ramp-up is based on hour fences (for example 1-6 hours, 6-12 hours, etc.).
On the WIP, an MSO will have units (for example 1,920 units), divided by capacity per hour (80 pcs/hr), to give the time needed: 24 hours. This then needs to be recalculated based on the ramp-up, for hours 1-6, 6-12, 12-18, and 18-24, with each hour fence multiplied by its related efficiency.
For example:
Hours 1-6: 20% efficiency * 80 units = 16 units/hr (6 x 16 = 96 units produced)
Hours 6-12: 40% efficiency * 80 units = 32 units/hr (192 units)
Hours 12-18: 60% efficiency * 80 Units = 48 units/hr (288 units)
Hours 18-24: 80% efficiency * 80 units = 64 units/hr (384 units)
Hours 24+: 100% efficiency * 80 units = 80 units/hr ((1920-960)/80)= 12 hours remaining
TOTAL TIME = 36 hours to produce
How would Power BI know to divide up the original 24 hour estimate into parts, multiply by respective efficiency, and return a new result of 36 hours?
Thank you so much in advance!
Kurt

I'm not sure how to do this in DAX but you tagged PowerQuery so here's a custom query that computes 36 based on your parameters:
let
    MSO = 1920,
    Capacity = 80,
    Efficiency = {
        {6, 0.2},
        {12, 0.4},
        {18, 0.6},
        {24, 0.8},
        {#infinity, 1.0}
    },
    Accumulated = List.Accumulate(Efficiency, [
        Remaining = MSO,
        RunningHours = 0
    ], (state, current) =>
        let
            until = current{0},
            eff = current{1},
            currentCapacity = eff * Capacity,
            RemainingHours = state[Remaining] / currentCapacity,
            CappedHours = List.Min({RemainingHours, until - state[RunningHours]})
        in [
            Remaining = state[Remaining] - currentCapacity * CappedHours,
            RunningHours = state[RunningHours] + CappedHours
        ]),
    Result = if Accumulated[Remaining] = 0
        then Accumulated[RunningHours]
        else error "Not enough time to finish!"
in
    Result
The inner lists for Efficiency are of the form time-efficiency-ends,efficiency-value. Plug in infinity to mean the last efficiency never stops.
In a normal iterative programming language you could update state with a for-loop, but in M you need to use List.Accumulate and package all your state into one value.
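For comparison, this is roughly what the for-loop version could look like in Python. It is a sketch of the same accumulation, not a replacement for the M query; the function name and the small tolerance on the remaining units are my own choices.

```python
import math


def ramped_hours(units, capacity_per_hour, fences):
    """Total hours to produce `units`, given (fence end hour, efficiency) pairs."""
    remaining = units
    running_hours = 0.0
    for until, efficiency in fences:
        rate = efficiency * capacity_per_hour
        # Work until the order is done or this fence ends, whichever comes first.
        hours = min(remaining / rate, until - running_hours)
        remaining -= rate * hours
        running_hours += hours
    if remaining > 1e-9:
        raise ValueError("Not enough time to finish!")
    return running_hours


# Same parameters as the query above; math.inf plays the role of #infinity.
fences = [(6, 0.2), (12, 0.4), (18, 0.6), (24, 0.8), (math.inf, 1.0)]
total = ramped_hours(1920, 80, fences)
print(total)
```

The loop variables remaining and running_hours correspond to the Remaining and RunningHours fields that List.Accumulate carries in its state record.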

In your data model you may have MSO in one table containing 2 fields, [Units] and [UnitsPerHour], and another table called EffTable which may store the efficiencies broken out by the hour fences.
Create 4 new calculated columns in your MSO table, one for each hour fence, e.g. [1--6Hours]:
=
6 * LOOKUPVALUE ( EffTable[Efficiency], EffTable[Hours], "1--6" )
* [UnitsPerHour]
These are fields that hold how many units you would produce in the 4 time slots. Create a new calculated field for the total, [RampUpUnits]:
=
[1--6Hours] + [6--12Hours] + [12--18Hours] + [18--24Hours]
Finally calculate the total time as:
=
24
+ ( [Units] - [RampUpUnits] )
/ [UnitsPerHour]
This calculates the number of hours required for the remaining units and adds it to 24 for the ramp up time.
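A quick check of these calculated columns with the numbers from the question, written in Python only to verify the arithmetic. The dictionary stands in for the hypothetical EffTable, and each fence is assumed to be 6 hours long.

```python
units = 1920
units_per_hour = 80

# Stand-in for EffTable: efficiency per 6-hour fence.
efficiency_by_fence = {"1--6": 0.2, "6--12": 0.4, "12--18": 0.6, "18--24": 0.8}

# Units produced in each fence, then summed like [RampUpUnits].
units_per_fence = {
    fence: 6 * efficiency * units_per_hour
    for fence, efficiency in efficiency_by_fence.items()
}
ramp_up_units = sum(units_per_fence.values())

# Remaining units run at full capacity after the 24 ramp-up hours.
total_hours = 24 + (units - ramp_up_units) / units_per_hour
print(ramp_up_units, total_hours)
```

Note that the final formula only holds when the order is larger than the ramp-up output (960 units here); for a smaller order it would return fewer than 24 hours without accounting for the reduced rates.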

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data, with timestamps and the observations at every timestamp, in pandas. Irregular basically means that the timestamps are uneven, for instance the gap between two successive timestamps is not even.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we could have data at the same timestamp, 6 secs in this case. The underlying timestamps are strictly different; the one-second resolution just cannot measure the difference.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use as much as possible any inbuilt pandas feature. If the timeseries were regular, or evenly gapped, I could've just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from Pandas experts would be welcome. I feel that this would be a commonly encountered problem. Many thanks!

Edit: added a second, more elegant, way to do it. I don't know what will happen if you had a timestamp at 1 and two timestamps of 61. I think it will choose the first 61 timestamp but not sure.
new_stamps = pd.Series(range(df['Timestamp'].max()+1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df,shifted,on='Timestamp',how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
merged = merged.sort_values(by='Timestamp').bfill()
results = pd.merge(df,merged, on = 'Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which is I guess unlikely. How about:
lookup_dict = {}
def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']
df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())
df['Property_Shifted'] = None
def get_shifted_property(row, shift_amt):
    # Take the first timestamp at or after the shifted time, then stop
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            break
    return row
df = df.apply(get_shifted_property, shift_amt=60, axis=1)
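Another option that may be closer to a built-in is pandas.merge_asof with direction="forward", which matches each row to the next row at or after a key. The sketch below uses the sample rows from the question without the elided middle part, and the helper column names are made up. It sets allow_exact_matches=False because the expected output in the question maps timestamp 1 to 64 rather than to 61, which suggests a strictly-later match; the loop above uses >= and would pick 61 instead.

```python
import pandas as pd

# Sample rows from the question; the rows hidden behind "....." are left out.
df = pd.DataFrame({
    "Timestamp": [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64],
    "Property": [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800],
})

shift_amt = 60

# Left side: each row asks for the first observation after Timestamp + shift.
left = df.assign(Target=df["Timestamp"] + shift_amt)
# Right side: the same data under different column names to avoid clashes.
right = df.rename(columns={"Timestamp": "MatchedTimestamp",
                           "Property": "Property_Shifted"})

shifted = pd.merge_asof(
    left, right,
    left_on="Target", right_on="MatchedTimestamp",
    direction="forward",          # look ahead in time
    allow_exact_matches=False,    # strictly later than Timestamp + shift
)

# Rows with no later observation get NaN; drop them for the final view.
result = shifted.dropna(subset=["Property_Shifted"])[["Timestamp", "Property_Shifted"]]
print(result)
```

Both frames must already be sorted by their key columns, which holds here because the timestamps are in ascending order.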
