Unable to load large pandas dataframe to pyspark - pandas

I've been trying to join two large pandas dataframes using pyspark using the following code. I'm trying to vary executor cores allocated for the application and measure scalability of pyspark (strong scaling).
r = 1000000000 # 1Bn rows
it = 10
w = 256
unique = 0.9
TOTAL_MEM = 240
TOTAL_NODES = 14
max_val = r * unique
rng = default_rng()
frame_data = rng.integers(0, max_val, size=(r, 2))
frame_data1 = rng.integers(0, max_val, size=(r, 2))
print(f"data generated", flush=True)
df_l = pd.DataFrame(frame_data).add_prefix("col")
df_r = pd.DataFrame(frame_data1).add_prefix("col")
print(f"data loaded", flush=True)
procs = int(math.ceil(w / TOTAL_NODES))
mem = int(TOTAL_MEM*0.9)
print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", flush=True)
spark = SparkSession\
.builder\
.appName(f'join {r} {w}')\
.master('spark://node:7077')\
.config('spark.executor.memory', f'{int(mem*0.6)}g')\
.config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
.config('spark.cores.max', w)\
.config('spark.driver.memory', '100g')\
.config('sspark.sql.execution.arrow.pyspark.enabled', 'true')\
.getOrCreate()
sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
print(f"data loaded to spark", flush=True)
try:
for i in range(it):
t1 = time.time()
out = sdf0.join(sdf1, on='col0', how='inner')
count = out.count()
t2 = time.time()
print(f"timings {r} {w} {i} {(t2 - t1) * 1000:.0f} ms, {count}", flush=True)
del out
del count
gc.collect()
finally:
spark.stop()
Cluster:
I am using standalone spark cluster in a 15 node cluster with 48 cores and 240GB RAM each. I've spawned master and the driver code in node1, while other 14 nodes have spawned workers allocating maximum memory.
In the spark context, I am reserving 90% of total memory to executor, splitting 60% to jvm and 40% to pyspark.
Issue:
When I run the above program, I can see that the executors are being assigned to the app. But it doesn't move forward, even after 60 mins. For smaller row count (10M), this was working without a problem.
Driver output
world sz 256 procs per worker 19 mem 216 iter 8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/N/u/d/dnperera/.conda/envs/cylonflow/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Negative initial size: -589934400
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warn(msg)
Any help on this is much appreciated.

Related

How to fix "Solution Not Found" Error in Gekko Optimization with rolling principle

My program is optimizing the charging and decharging of a home battery to minimize the cost of electricity at the end of the year. In this case there also is a PV, which means that sometimes you're injecting electricity into the grid and receive money. The net offtake is the result of the usage of the home and the PV installation. So these are the possible situations:
Net offtake > 0 => usage home > PV => discharge from battery or take from grid
Net offtake < 0 => usage home < PV => charge battery or injection into grid
The electricity usage of homes is measured each 15 minutes, so I have 96 measurement point in 1 day. I want to optimilize the charging and decharging of the battery for 2 days, so that day 1 takes the usage of day 2 into account.
I wrote a controller that reads the data and gives each time the input values for 2 days to the optimization. With a rolling principle, it goes to the next 2 days and so on. Below you can see the code from my controller.
from gekko import GEKKO
from simulationModel2_2d_1 import getSimulation2
from exportModel2 import exportToExcelModel2
import numpy as np
#import matplotlib.pyplot as plt
import pandas as pd
import time
import math
# ------------------------ Import and read input data ------------------------
file = r'Data Sim 2.xlsx'
data = pd.read_excel(file, sheet_name='Input', na_values='NaN')
dataRead = pd.DataFrame(data, columns= ['Timestep','Verbruik woning (kWh)','Netto afname (kWh)','Prijs afname (€/kWh)',
'Prijs injectie (€/kWh)','Capaciteit batterij (kW)',
'Capaciteit batterij (kWh)','Rendement (%)',
'Verbruikersprofiel','Capaciteit PV (kWp)','Aantal dagen'])
timestep = dataRead['Timestep'].to_numpy()
usage_home = dataRead['Verbruik woning (kWh)'].to_numpy()
net_offtake = dataRead['Netto afname (kWh)'].to_numpy()
price_offtake = dataRead['Prijs afname (€/kWh)'].to_numpy()
price_injection = dataRead['Prijs injectie (€/kWh)'].to_numpy()
cap_batt_kW = dataRead['Capaciteit batterij (kW)'].iloc[0]
cap_batt_kWh = dataRead['Capaciteit batterij (kWh)'].iloc[0]
efficiency = dataRead['Rendement (%)'].iloc[0]
usersprofile = dataRead['Verbruikersprofiel'].iloc[0]
days = dataRead['Aantal dagen'].iloc[0]
pv = dataRead['Capaciteit PV (kWp)'].iloc[0]
# ------------- Optimization model & Rolling principle (2 days) --------------
# Initialise model
m = GEKKO()
# Output data
ts = []
charging = [] # Amount to charge/decharge batterij
e_batt = [] # Amount of energy in the battery
usage_net = [] # Usage after home, battery and pv
p_paid = [] # Price paid for energy of 15min
# Energy in battery to pass
energy = 0
# Iterate each day for one year
for d in range(int(days)-1):
d1_timestep = []
d1_net_offtake = []
d1_price_offtake = []
d1_price_injection = []
d2_timestep = []
d2_net_offtake = []
d2_price_offtake = []
d2_price_injection = []
# Iterate timesteps
for i in range(96):
d1_timestep.append(timestep[d*96+i])
d2_timestep.append(timestep[d*96+i+96])
d1_net_offtake.append(net_offtake[d*96+i])
d2_net_offtake.append(net_offtake[d*96+i+96])
d1_price_offtake.append(price_offtake[d*96+i])
d2_price_offtake.append(price_offtake[d*96+i+96])
d1_price_injection.append(price_injection[d*96+i])
d2_price_injection.append(price_injection[d*96+i+96])
# Input data simulation of 2 days
ts_temp = np.concatenate((d1_timestep, d2_timestep))
net_offtake_temp = np.concatenate((d1_net_offtake, d2_net_offtake))
price_offtake_temp = np.concatenate((d1_price_offtake, d2_price_offtake))
price_injection_temp = np.concatenate((d1_price_injection, d2_price_injection))
if(d == 7):
print(ts_temp)
print(energy)
# Simulatie uitvoeren
charging_temp, e_batt_temp, usage_net_temp, p_paid_temp, energy_temp = getSimulation2(ts_temp, net_offtake_temp, price_offtake_temp, price_injection_temp, cap_batt_kW, cap_batt_kWh, efficiency, energy, pv)
# Take over output first day, unless last 2 days
energy = energy_temp
if(d == (days-2)):
for t in range(1,len(ts_temp)):
ts.append(ts_temp[t])
charging.append(charging_temp[t])
e_batt.append(e_batt_temp[t])
usage_net.append(usage_net_temp[t])
p_paid.append(p_paid_temp[t])
elif(d == 0):
for t in range(int(len(ts_temp)/2)+1):
ts.append(ts_temp[t])
charging.append(charging_temp[t])
e_batt.append(e_batt_temp[t])
usage_net.append(usage_net_temp[t])
p_paid.append(p_paid_temp[t])
else:
for t in range(1,int(len(ts_temp)/2)+1):
ts.append(ts_temp[t])
charging.append(charging_temp[t])
e_batt.append(e_batt_temp[t])
usage_net.append(usage_net_temp[t])
p_paid.append(p_paid_temp[t])
print('Simulation day '+str(d+1)+' complete.')
# ------------------------ Export output data to Excel -----------------------
a = exportToExcelModel2(ts, usage_home, net_offtake, price_offtake, price_injection, charging, e_batt, usage_net, p_paid, cap_batt_kW, cap_batt_kWh, efficiency, usersprofile, pv)
print(a)
The optimization with Gekko happens in the following code:
from gekko import GEKKO
def getSimulation2(timestep, net_offtake, price_offtake, price_injection,
cap_batt_kW, cap_batt_kWh, efficiency, start_energy, pv):
# ---------------------------- Optimization model ----------------------------
# Initialise model
m = GEKKO(remote = False)
# Global options
m.options.SOLVER = 1
m.options.IMODE = 6
# Constants
speed_charging = cap_batt_kW/4
m.time = timestep
max_cap_batt = m.Const(value = cap_batt_kWh)
min_cap_batt = m.Const(value = 0)
max_charge = m.Const(value = speed_charging) # max battery can charge in 15min
max_decharge = m.Const(value = -speed_charging) # max battery can decharge in 15min
# Parameters
usage_home = m.Param(net_offtake)
price_offtake = m.Param(price_offtake)
price_injection = m.Param(price_injection)
# Variables
e_batt = m.Var(value=start_energy, lb = min_cap_batt, ub = max_cap_batt) # energy in battery
price_paid = m.Var() # price paid each 15min
charging = m.Var(lb = max_decharge, ub = max_charge) # amount charge/decharge each 15min
usage_net = m.Var(lb=min_cap_batt)
# Equations
m.Equation(e_batt==(m.integral(charging)+start_energy)*efficiency)
m.Equation(-charging <= e_batt)
m.Equation(usage_net==usage_home + charging)
price = m.Intermediate(m.if2(usage_net*1e6, price_injection, price_offtake))
price_paid = m.Intermediate(usage_net * price / 100)
# Objective
m.Minimize(price_paid)
# Solve problem
m.options.COLDSTART=2
m.solve()
m.options.TIME_SHIFT=0
m.options.COLDSTART=0
m.solve()
# Energy to pass
energy_left = e_batt[95]
#m.cleanup()
return charging, e_batt, usage_net, price_paid, energy_left
The data you need for input can be found in this Excel document:
https://docs.google.com/spreadsheets/d/1S40Ut9-eN_PrftPCNPoWl8WDDQtu54f0/edit?usp=sharing&ouid=104786612700360067470&rtpof=true&sd=true
With this code, it always ends at day 17 with the "Solution Not Found" Error.
I already tried extending the default iteration limit to 500 but it didn't work.
I also tried with other solvers but also no improvement.
By presolving with COLDSTART it already reached day 17, without this it ends at day 8.
I solved the days were my optimization ends individually and then the solution was always found immediately with the same code.
Someone who can explain this to me and maybe find a solution? Thanks in advance!
This is kind of big to troubleshoot, but here are some general ideas that might help. This assumes, as you said, that the model solves fine for day 1-2, and day 3-4, and day 5-6, etc. And that those results pass inspection (aka the basic model is working as you say).
Then something is (obviously) amiss around day 17. Some things to look at and try:
Start the model at day 16-17, see if it works in isolation
gather your results as you go and do a time series plot of the key variables, maybe one of them is on an obvious bad trend towards a limit causing an infeasibility... Perhaps the e_batt variable is slowly declining because not enough PV energy is available and hits minimum on Day 17
Radically change the upper/lower bounds on your variables to test them to see if they might be involved in the infeasibility (assuming there is one)
Make a different (fake) dataset where the solution is assured and kind of obvious... all constants or a pattern that you know will produce some known result and test the model outputs against it.
It might also be useful to pursue the tips in this excellent post on troubleshooting gekko, and edit your question with the results of that effort.
edit: couple ideas from your comment...
You didn't say what the result was from troubleshooting and whether the error is infeasible or max iterations, or ???. But...
If the model seems to crunch after about 15 days, I'm betting that it is becoming infeasible. Did you plot the battery level over course of days?
Also, I'm suspicious of your equation for the e_batt level... Why are you multiplying the prior battery state by the efficiency factor? That seems incorrect. That is charge that is already in the battery. Odds are you are (incorrectly) hitting the battery charge level every day with the efficiency tax and that the max charge level isn't sufficient to keep up with demand.
In addition to tips above try:
fix your efficiency formula to not multiply the efficiency times the previous state
change the efficiency to 100%
make the upper limit on charge huge
As an aside: I don't really see the connection to PV energy available here. What you are basically modeling is some "mystery battery" that you can charge and discharge anytime you want. I would get this debugged first and then look at the energy available by time of day...you aren't going to be generating charge at midnight. :).

Why do my arrays display missing values when identifying a bandwidth? (geopandas)

I'm trying to identify a suitable bandwidth to use for a geographically weighted regression but every time I search for the bandwidth it displays that there are missing (NaN) values within the arrays of the dataset. Although, each row features all values.
g_y = df_ct2008xy['2008 HP'].values.reshape((-1,1))
g_X = df_ct2008xy[['2008 AF', '2008 MI', '2008 MP', '2008 EB']].values
u = df_ct2008xy['X']
v = df_ct2008xy['Y']
g_coords = list(zip(u,v))
g_X = (g_X - g_X.mean(axis=0)) / g_X.std(axis=0)
g_y = g_y.reshape((-1,1))
g_y = (g_y - g_y.mean(axis=0)) / g_y.std(axis=0)
bw = mgwr.sel_bw.Sel_BW(g_coords,
g_y, # Independent variable
g_X, # Dependent variable
fixed=True, # True for fixed bandwidth and false for adaptive bandwidth
spherical=True) # Spherical coordinates (long-lat) or projected coordinates
I searched using numpy to identify if these were individual values using
np.isnan(g_y).any()
and
np.isnan(g_X)
but apparently every value is 'missing' and returning 'True'

Does Simpy support optimized dynamic resource distributon across multiple nodes?

I have 2 nodes 0 and 1 and in total there are 12 resources which will server in the nodes 0 and 1. Is there a method in Simpy to schedule the 12 resources across nodes 0 and 1 so that the average total processing time of an item through node 0 followed by node 1 is minimized. From time to time resources can move from one node to another for serving. Attached is the code where I have come up with a static distribution of 5 resources in node 0 and 7 resources in node 1. How to make it dynamic with time ?
import numpy as np
import simpy
def interarrival():
return(np.random.exponential(20))
def servicetime():
return(np.random.exponential(60))
def servicing(env, servers_1):
i = 0
while(True):
i = i+1
yield env.timeout(interarrival())
print("Customer "+str(i)+ " arrived in the process at "+str(env.now))
state = 0
env.process(items(env, i, servers_array, state))
def items(env, customer_id, servers_array, state):
with servers_array[state].request() as request:
yield request
t_arrival = env.now
print("Customer "+str(customer_id)+ " arrived in "+str(state)+ " at "+str(t_arrival))
yield env.timeout(servicetime())
t_depart = env.now
print("Customer "+str(customer_id)+ " departed from "+str(state)+ " at "+str(t_depart))
if (state == 1):
print("Customer exits")
else:
state = 1
env.process(items(env, customer_id, servers_array, state))
env = simpy.Environment()
servers_array = []
servers_array.append(simpy.Resource(env, capacity = 5))
servers_array.append(simpy.Resource(env, capacity = 7))
env.process(servicing(env, servers_array))
env.run(until=2880)
If you use the resources, start each node with a capacity of 12 and use the delay from your last question to delay some of the resources from each node so the total active resources is the total you want. Otherwise you may want to start looking at containers and stores which will allow you to move a resource from one node to another.

How to vectorize to speed up Dataframe apply pandas

I have a tXn (5000 X 100) dataframe wts_df,
wts_df.tail().iloc[:, 0:6]
Out[71]:
B C H L R T
2020-09-25 0.038746 0.033689 -0.047835 -0.002641 0.009501 -0.030689
2020-09-28 0.038483 0.033189 -0.061742 0.001199 0.009490 -0.028370
2020-09-29 0.038620 0.034957 -0.031341 0.006179 0.007815 -0.027317
2020-09-30 0.038610 0.034902 -0.014271 0.004512 0.007836 -0.024672
2020-10-01 0.038790 0.029937 -0.044198 -0.008415 0.008347 -0.030980
and two similar txn dataframes, vol_df and rx_df (same index and columns). For now we can use,
rx_df = wts_df.applymap(lambda x: np.random.rand())
vol_df = wts_df.applymap(lambda x: np.random.rand())
I need to do this (simplified):
for date in wts_df.index:
wts = wts_df.loc[date] # is a vector now 1Xn
# mutliply all entries of rx_df and vol_df until this date by these wts, and sum across columns
rx = rx_df.truncate(after=date) # still a dataframe but truncated at a given date, kXn
vol = vol_df_df.truncate(after=date)
wtd_rx = (wts * rx).sum(1) # so a vector kX1
wtd_vol = (wts * vol).sum(1)
# take ratio
rx_vol = rx / vol
rate[date] = rx_vol.tail(20).std()
So rate looks like this
pd.Series(rate).tail()
Out[71]:
rate
2020-09-25 0.0546
2020-09-28 0.0383
2020-09-29 0.0920
2020-09-30 0.0510
2020-10-01 0.0890
The above loop is slow, so i tried this:
def rate_calc(wts, date, rx_df=rx_df, vol_df=vol_df):
wtd_rx = (rx_df * wts).sum(1)
wtd_vol = (vol_df * wts).sum(1)
rx_vol = wtd_rx / wtd_vol
rate = rx_vol.truncate(after=date).tail(20).std()
return rate
rates = wts_df.apply(lambda x: rate_calc(x, x.name), axis=1)
This is still very slow. Moreover I need to do this for multiple wts_df contained in a dict so the total operations takes a lot time.
rates = {key: val.apply(lambda x: rate_calc(x, x.name), axis=1) for key, val in wts_df_dict.iteritems()}
Any ideas how to speed such operations?
Your question falls under the category of 'optimization' so allow me to share with you few pointers to solve your problem.
First, when it comes to speed, always use %timeit to ensure you get better results with a new stratgegy.
Second, there are few ways to iterate a data:
with iterrows() -- use it only when the data sample is small (or better yet, try not to use it as it's too slow).
With apply --better alternative to iterrows and much more efficient but when the data set is large (like in your example) it may present a delay problem.
Vectorizing --simply put, you execute the operation on the entire column/array and its significantly fast. Winner!
So, in order to solve your speed problem your strategy should be in the form of vectorizing. So here's how it should work; (mind the .values):
df['new_column'] = my_function(df['column_1'].values, df['column_2'].values...) and you will note a super fast result.

Apache Pig: OutOfMemoryError with FOREACH aggregation

I'm getting an OutOfMemoryError with running a FOREACH operation in Apache Pig.
16/06/24 15:14:17 INFO util.SpillableMemoryManager: first memory
handler call- Usage threshold init = 164102144(160256K) used =
556137816(543103K) committed = 698875904(682496K) max =
698875904(682496K)
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 4095"... Killed
My Pig script:
A = LOAD 'PageCountTest/' USING PigStorage(' ') AS (Project:chararray,
Title:chararray, count:int , size:int);
B = GROUP A BY (Project,Title);
C = FOREACH B
generate group, SUM(A.count) AS COUNT; D = ORDER C BY COUNT DESC;
STORE C INTO '/user/hadoop/wikistats';
Sample Data:
aa.b Main_Page 1 14335
aa.d India 1 4075
aa.d Main_Page 1 13190
aa.d Special:RecentChanges 1 200
aa.d Talk:Main_Page/ 1 14147
aa.d w/w/index.php 9 137502
aa Main_Page 6 9872
aa Special:Statistics 1 324
Can anyone please help?
I do suspect that a memory issue arise when you order by as it is the most heavy weighted. The easiest way is to play with the heap size parameters while launching your pig job.
pig -Dmapred.child.java.opts=-Xms2096M yourjob.pig
Also you can declare heap size directly inside the script.
export PIG_HEAPSIZE=2096