I have read a lot about Spark's memory usage when calling collect() or toPandas() (like here). The common wisdom is to use them only on small datasets. The question is: how small does a dataset have to be for Spark to handle it?
I run locally (for testing) with pyspark, with the driver memory set to 20g (I have 32g on my 16-core Mac), but toPandas() crashes even with a dataset as small as 20K rows! That cannot be right, so I suspect I have some setting wrong. This is the simplified code to reproduce the error:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# setting the number of rows for the CSV file
N = 20000
ncols = 7
# creating a pandas dataframe (df)
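# c_name was not defined in the original snippet; generic column names are assumed here
c_name = [f'c{i}' for i in range(ncols)]
df = pd.DataFrame(np.random.randint(999,999999,size=(N, ncols)), columns=list(c_name[:ncols]))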
file_name = 'random.csv'
# export the dataframe to csv using comma delimiting
df.to_csv(file_name, index=False)
## Load the csv in spark
df = spark.read.format('csv').option('header', 'true').load(file_name)#.limit(5000)#.coalesce(2)
## some checks
n_parts = df.rdd.getNumPartitions()
print('Number of partitions:', n_parts)
print('Number of rows:', df.count())
## convert spark df -> pandas df
df_p = df.toPandas()
print('With pandas:',len(df_p))
I run this within jupyter, and get errors like:
ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
My spark local setting is (everything else default):
('spark.driver.host', '')
('spark.driver.memory', '20g')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.master', 'local[*]')
('spark.executor.id', 'driver')
('spark.submit.deployMode', 'client')
('spark.app.id', 'local-1618499935279')
('spark.driver.port', '55115')
('spark.ui.showConsoleProgress', 'true')
('spark.app.name', 'pyspark-shell')
('spark.driver.maxResultSize', '4g')
Is my setup wrong, or is it expected that even 20g of driver memory can't handle a small dataframe with 20K rows and 7 columns? Will repartitioning help?
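For scale, a rough back-of-the-envelope estimate suggests the data itself is tiny compared with the driver memory. The sketch below assumes 8 bytes per value for integers and, since the CSV is loaded without a schema and the columns arrive as strings, a hypothetical 60 bytes per value for the string case. Both per-value figures are illustrative assumptions, not measurements.

```python
# Rough size estimate for the DataFrame in the question (20K rows, 7 columns).
N_ROWS = 20_000
N_COLS = 7
BYTES_PER_INT64 = 8   # size of one 64-bit integer
BYTES_PER_STR = 60    # assumed per-value cost of a short Python string (illustrative)


def estimate_bytes(rows, cols, bytes_per_value):
    """Return the approximate payload size of a rows x cols table."""
    return rows * cols * bytes_per_value


as_ints = estimate_bytes(N_ROWS, N_COLS, BYTES_PER_INT64)
as_strs = estimate_bytes(N_ROWS, N_COLS, BYTES_PER_STR)
driver_bytes = 20 * 1024**3  # the 20g driver memory from the question

print(f"as int64:   {as_ints / 1024**2:.2f} MiB")
print(f"as strings: {as_strs / 1024**2:.2f} MiB")
print(f"share of driver memory (string case): {as_strs / driver_bytes:.6%}")
```

Under these assumptions the table occupies only a few MiB, so driver memory is unlikely to be the limiting factor. The logged exception is a connection failure (Failed to connect to /), which may point toward network settings such as spark.driver.host rather than memory.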
I have many CSV files saved in AWS S3 with the same first set of columns and a lot of optional columns. I don't want to download them one by one and then use pd.concat to read them, since this takes a lot of time and everything has to fit into the computer's memory. Instead, I'm trying to use Dask to load and sum up all of these files, where missing optional columns should be treated as zeros.
If all columns were the same I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
but it doesn't work with files that don't have the same number of columns, since Dask assumes it can use the columns of the first file for all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all headers in advance (for example by writing them out as I produce and save all of the small CSVs) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder whether this solution is actually faster or better than using pandas directly. In general, I'd like to know what the best (mainly fastest) way to tackle this issue is.
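Reading the headers in advance only requires the first line of each file. Here is a minimal sketch of that idea, using in-memory strings as stand-ins for the S3 objects. The file contents and the helper name union_of_headers are hypothetical.

```python
import csv
import io


def union_of_headers(file_objs):
    """Return every column name seen across the files, in first-seen order."""
    columns = []
    seen = set()
    for f in file_objs:
        # Only the header row is read; the data rows are never parsed.
        header = next(csv.reader(f), [])
        for name in header:
            if name not in seen:
                seen.add(name)
                columns.append(name)
    return columns


# Hypothetical stand-ins for two CSV files with different optional columns
files = [
    io.StringIO("id,a,b\n1,2,3\n"),
    io.StringIO("id,a,c\n4,5,6\n"),
]
all_columns = union_of_headers(files)
print(all_columns)
```

The resulting list can then serve as the common column set that every file is aligned to before the frames are combined.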
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import pandas as pd
import dask.dataframe as dd
from dask import delayed
list_files = [...] # create a list of files inside s3 bucket
list_cols_to_keep = ['col1', 'col2']
@delayed  # make the function lazy so that from_delayed receives delayed objects
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[list_cols_to_keep]
    # add any other standardization routines, e.g. dtype conversion
    return df
ddf = dd.from_delayed([standard_csv(f) for f in list_files])
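If the optional columns should be kept and treated as zeros rather than dropped, the standardization step could reindex each frame to a common column set. The following is a pandas-only sketch of that step; the column names, the sample data and the helper name standardize are hypothetical, and the same function body could be used inside a delayed function.

```python
import io

import pandas as pd

# Hypothetical full column set (for example, collected from the headers in advance)
ALL_COLUMNS = ["id", "a", "b", "c"]


def standardize(df, columns=ALL_COLUMNS):
    """Align df to the common columns; columns missing from df are filled with 0."""
    return df.reindex(columns=columns, fill_value=0)


# Hypothetical stand-ins for two CSV files with different optional columns
frames = [
    pd.read_csv(io.StringIO("id,a,b\n1,2,3\n")),
    pd.read_csv(io.StringIO("id,a,c\n4,5,6\n")),
]

combined = pd.concat([standardize(f) for f in frames], ignore_index=True)
totals = combined[["a", "b", "c"]].sum()
print(combined)
print(totals)
```

Because every frame has identical columns after standardize, the concatenation no longer fails on a length mismatch, and absent optional columns contribute zero to the sums.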
I ended up giving up on Dask since it was too slow. Instead I used aws s3 sync to download the data and multiprocessing.Pool to read and concatenate the files:
import glob
import os
import subprocess
from multiprocessing import Pool

import pandas as pd
from tqdm import tqdm

# download:
def sync_outputs(out_path):
    local_dir_path = "/tmp/outputs/"
    cmd = f'aws s3 sync {out_path} {local_dir_path} > /tmp/null' # the last part is to avoid prints
    subprocess.run(cmd, shell=True, check=True)  # actually run the sync
    # return the downloaded CSV files, which is what the reader below expects
    return sorted(glob.glob(os.path.join(local_dir_path, '*.csv')))

# concat:
def read_csv(path):
    return pd.read_csv(path, index_col=0)

def read_csvs_parallel(local_paths):
    with Pool(os.cpu_count()) as p:
        csvs = list(tqdm(p.imap(read_csv, local_paths), desc='reading csvs', total=len(local_paths)))
    return csvs

# all together:
def concat_csvs_parallel(out_path):
    local_paths = sync_outputs(out_path)
    csvs = read_csvs_parallel(local_paths)
    df = pd.concat(csvs)
    return df
aws s3 sync downloaded about 1000 files (~1KB each) in about 30 seconds, and reading them with multiprocessing (8 cores) took 3 seconds. This was much faster than also downloading the files using multiprocessing (almost 2 minutes for 1000 files).
In an attempt to get some outlier plots on large datasets I need to convert a Spark DataFrame to pandas. Turning to Apache Arrow, a simple run crashes my pyspark console when casting id to string (it works fine without the cast). Why?
Using Python version 3.8.9 (default, Apr 10 2021 15:47:22)
Spark context Web UI available at http://6d0b1018a45a:4040
Spark context available as 'sc' (master = local[*], app id = local-1621164597906).
SparkSession available as 'spark'.
>>> import time
>>> from pyspark.sql.functions import rand
>>> from pyspark.sql import functions as F
>>> spark = SparkSession.builder.appName("Console_Test").getOrCreate()
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/05/16 11:31:03 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> a_df = spark.range(1 << 25).toDF("id").withColumn("x", rand())
>>> a_df = a_df.withColumn("id", F.col("id").cast("string"))
>>> start_t = time.time()
>>> a_pd = a_df.toPandas()
Additionally, I noticed that options such as spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000") are seemingly without effect, as the web UI shows significantly more than 5000 records being assigned to the tasks.
Any indication of how to resolve the pyspark console crash, or how to render large scatter plots more directly, would be highly appreciated. I have (unsuccessfully) tried to find a way to apply Table.to_pandas(split_blocks=True, self_destruct=True) but could not get the Table structure from a Spark DataFrame.
You are trying to convert 33.5 million (2^25) rows into a pandas dataframe. This will lead to an OutOfMemoryError, as all data will be transferred to the Spark driver.
A way to find outliers would be to calculate the histogram for the column x and then filter down a_df to the relevant bins in Spark before creating the Pandas dataframe:
hist = a_df.select("x").rdd.flatMap(lambda x: x).histogram(10) #create 10 bins
hist is a tuple of two arrays: the first array contains the boundaries of the bins and the second array contains the number of elements in each bin.
rand creates uniformly distributed random numbers, so the histogram in this case is not very interesting. But for real-world distributions, the histogram will be useful.
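To make the "filter down to the relevant bins" step concrete, here is a small sketch that picks the sparsely populated bins from a histogram result. It is plain Python so it runs without Spark; the helper name sparse_bin_ranges, the threshold and the sample histogram values are hypothetical.

```python
def sparse_bin_ranges(boundaries, counts, max_count):
    """Return (low, high) ranges of the bins that hold at most max_count elements."""
    if len(boundaries) != len(counts) + 1:
        raise ValueError("expected exactly one more boundary than counts")
    return [
        (boundaries[i], boundaries[i + 1])
        for i, n in enumerate(counts)
        if n <= max_count
    ]


# Hypothetical histogram result: four bins, dense in the middle, sparse at the edges
boundaries = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [3, 5000, 7000, 2]

ranges = sparse_bin_ranges(boundaries, counts, max_count=10)
print(ranges)
```

Each returned range could then be turned into a Spark condition such as (F.col("x") >= low) & (F.col("x") < high), with the conditions OR-ed together and passed to a_df.filter(...) before calling toPandas() on the much smaller result. Note that RDD.histogram treats the last bin as closed on the right, so the final range may need <= for its upper bound.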
Does anyone have an idea how to run a pandas program on a Spark standalone cluster machine (Windows)? The program was developed using PyCharm and pandas.
The issue is that I can run it from the command prompt using spark-submit --master spark://sparkcas1:7077 project.py and get results, but I do not see any activity in the Spark web UI (:7077): no workers, no running application status and no completed application status.
In the pandas program I included only one Spark-related statement, from pyspark import SparkContext:
import pandas as pd
from pyspark import SparkContext
# reading the Excel file from a local path
workbook_loc = r"c:\2020\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
# converting to dict
What could be the issue?
Pandas code runs only on the driver, and no workers are involved. So there is no point in using plain pandas code inside Spark.
If you are using Spark 3.0, you can run pandas-style code in a distributed way by converting the Spark DataFrame to a Koalas DataFrame.
I have a big numpy array. Its shape is (800,224,224,3), which means that there are 800 images (224 * 224) with 3 channels. For distributed deep learning in Spark, I want to convert the numpy array to a Spark dataframe.
My method is:
1. Convert the numpy array to csv
2. Load the csv and make a Spark dataframe with 150528 columns (224*224*3)
3. Use VectorAssembler to create a vector of all columns (features)
4. Reshape the output of step 3
However, the third step failed, probably because the computation is too expensive.
In order to make a vector from this:
|col_1 | col_2|
to this:
| feature |
But there are a very large number of columns...
I also tried to convert the numpy array to an RDD directly, but I got an 'out of memory' error. On a single machine, my job works well with this numpy array.
You should be able to convert the numpy array directly to a Spark dataframe, without going through a csv file. You could try something like the below code:
from pyspark.ml.linalg import Vectors
num_rows = 800
arr = map(lambda x: (Vectors.dense(x), ), numpy_arr.reshape(num_rows, -1))
df = spark.createDataFrame(arr, ["features"])
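The key step in the snippet above is the reshape, which flattens each image into a single row of 224*224*3 = 150528 values. A small numpy-only sketch of that step follows, using a stand-in array of 4 images instead of 800 (the array contents are arbitrary and only for illustration):

```python
import numpy as np

# Small stand-in for the (800, 224, 224, 3) array from the question
num_rows = 4
numpy_arr = np.arange(num_rows * 224 * 224 * 3, dtype=np.uint32).reshape(num_rows, 224, 224, 3)

# Flatten every image into one row; -1 lets numpy infer 224 * 224 * 3
flat = numpy_arr.reshape(num_rows, -1)
print(flat.shape)

# The flattening is reversible, so an image can be restored after processing
restored = flat[0].reshape(224, 224, 3)
print(np.array_equal(restored, numpy_arr[0]))
```

Each row of flat is what gets wrapped in Vectors.dense, so the DataFrame ends up with one vector column instead of 150528 scalar columns.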
You can also do this, which I find most convenient:
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)
array = np.linspace(0, 10)
df_spark = sqlContext.createDataFrame(pd.DataFrame(array))
The only downside is that pandas needs to be installed.
To resolve an out-of-memory error on a worker node, increase the worker memory from its default of 1 GB using the spark.executor.memory setting. If the error occurs in the driver instead, try increasing the driver memory as suggested by @pissall. Also, try to identify a suitable fraction of memory (spark.memory.fraction) to be used for keeping RDDs in memory.
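As a sketch of how these settings might be passed, the helper below assembles a spark-submit command line. In client mode the driver memory has to be set before the driver JVM starts, which is why it is passed on the command line here rather than set inside the application. The script name job.py, the memory values and the helper name spark_submit_args are hypothetical; --driver-memory and --conf key=value are standard spark-submit options.

```python
def spark_submit_args(script, driver_memory=None, conf=None):
    """Build a spark-submit argument list from memory settings."""
    args = ["spark-submit"]
    if driver_memory:
        args += ["--driver-memory", driver_memory]
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


# Hypothetical values; tune them to the machine and the workload
args = spark_submit_args(
    "job.py",
    driver_memory="8g",
    conf={
        "spark.executor.memory": "4g",
        "spark.memory.fraction": "0.6",
    },
)
print(" ".join(args))
```

The resulting list can be printed for copy and paste, or handed to subprocess.run to launch the job.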
I am trying to import ~12 million records with 8 columns into Python. Because of the data's size, my laptop's memory is not sufficient for this. I'm now trying to import the SQL data into the HDF5 file format. It would be very helpful if someone could share a snippet of code that queries data from SQL and saves it in HDF5 format in chunks. I am open to using any other file format that would be easier to use.
I plan to do some basic exploratory analysis and later on might create some decision trees or linear regression models using pandas.
import pyodbc
import numpy as np
import pandas as pd
con = pyodbc.connect('Trusted_Connection=yes',
                     driver = '{ODBC Driver 13 for SQL Server}',
                     server = 'SQL_ServerName')
df = pd.read_sql("select * from table_a",con,index_col=['Accountid'],chunksize=1000)
Try this:
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)
hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
cols_to_index = [...] # list of the columns to index in the HDF5 file
store = pd.HDFStore(hdf_fn)
for chunk in sql_reader:
    # append each chunk without indexing; the index is built once at the end
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)
# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
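Since the question is open to other file formats, the same chunked pattern can be sketched with the standard library alone: fetch a fixed number of rows at a time from the cursor and append them to a CSV file. This sketch uses an in-memory SQLite database as a stand-in for the SQL Server connection; the table contents, the chunk size and the helper name export_in_chunks are hypothetical. pyodbc cursors also provide fetchmany, so the same loop should carry over, but that is an assumption to verify against your driver.

```python
import csv
import io
import sqlite3


def export_in_chunks(cursor, out_file, chunk_size=1000):
    """Write the cursor's result set to out_file as CSV, chunk_size rows at a time."""
    writer = csv.writer(out_file)
    # cursor.description holds one entry per column; the first item is the name
    writer.writerow([col[0] for col in cursor.description])
    total = 0
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Stand-in database with a hypothetical table_a of 2500 rows
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_a (Accountid INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(i, i * 1.5) for i in range(2500)],
)

cur = con.execute("SELECT * FROM table_a")
buf = io.StringIO()  # stand-in for an open file on disk
n_written = export_in_chunks(cur, buf, chunk_size=1000)
con.close()
print(n_written)
```

Only chunk_size rows are held in memory at any time, and the resulting CSV can later be read back in chunks with pd.read_csv(..., chunksize=...).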