Apache Spark - sqlContext.sql to pandas

Hi,
I have a Spark DataFrame and I made some transformations using the SQL context, for example selecting only two columns from all the data:
df_oraAS = sqlContext.sql("SELECT ENT_EMAIL,MES_ART_ID FROM df_oraAS LIMIT 5 ")
Now I want to convert this Spark DataFrame into a pandas DataFrame, so I'm using
pddf = df_oraAS.toPandas()
but the output stops here and I need to restart the IDE (Spyder):
16/01/22 16:04:01 INFO DAGScheduler: Got job 0 (toPandas at <stdin>:1) with 3 output partitions
16/01/22 16:04:01 INFO DAGScheduler: Final stage: ResultStage 0 (toPandas at <stdin>:1)
16/01/22 16:04:01 INFO DAGScheduler: Parents of final stage: List()
16/01/22 16:04:01 INFO DAGScheduler: Missing parents: List()
16/01/22 16:04:01 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at toPandas at <stdin>:1), which has no missing parents
16/01/22 16:04:01 INFO SparkContext: Starting job: toPandas at <stdin>:1
16/01/22 16:04:01 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.4 KB, free 9.4 KB)
16/01/22 16:04:01 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.9 KB, free 14.3 KB)
16/01/22 16:04:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:50877 (size: 4.9 KB, free: 511.1 MB)
16/01/22 16:04:01 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/01/22 16:04:01 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at toPandas at <stdin>:1)
16/01/22 16:04:01 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
16/01/22 16:04:02 WARN TaskSetManager: Stage 0 contains a task of very large size (116722 KB). The maximum recommended task size is 100 KB.
16/01/22 16:04:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 119523958 bytes)
16/01/22 16:04:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 117876401 bytes)
Exception in thread "dispatcher-event-loop-3" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source)
at java.io.ByteArrayOutputStream.grow(Unknown Source)
at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source)
at java.io.ByteArrayOutputStream.write(Unknown Source)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(Unknown Source)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(Unknown Source)
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.writeObject(Unknown Source)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:200)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:462)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:252)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:247)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$8.apply(TaskSchedulerImpl.scala:317)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$8.apply(TaskSchedulerImpl.scala:315)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:315)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:315)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:315)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalBackend.scala:84)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalBackend.scala:63)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
What did I do wrong?
Thanks.
EDIT: to be more complete: I load the data from an Oracle database (via cx_Oracle) and put it into a pandas DataFrame:
df_ora = pd.read_sql('SELECT * FROM DEC_CLIENTES', con=connection)
Next I created an SQLContext to manipulate the DataFrame:
sqlContext = SQLContext(sc)
df_oraAS = sqlContext.createDataFrame(df_ora)
df_oraAS.registerTempTable("df_oraAS")
df_oraAS = sqlContext.sql("SELECT ENT_EMAIL,MES_ART_ID FROM df_oraAS LIMIT 5 ")
and then I want to convert the result back into a pandas DataFrame:
pddf = df_oraAS.toPandas()

toPandas is basically collect in disguise. The output is a local pandas DataFrame. If the data doesn't fit into driver memory, it will simply fail, hence the error you see.
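For intuition, both calls below pull every row of df_oraAS into driver memory; the only difference is the type of the result:
rows = df_oraAS.collect()   # list of Row objects held on the driver
pddf = df_oraAS.toPandas()  # the same data as a local pandas DataFrame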

Your pd.read_sql call reads the full table into a pandas DataFrame. This is local to the driver. When you call createDataFrame, Spark then creates a Spark DataFrame from your local pandas DataFrame, which results in a very large task size (see the log line below):
16/01/22 16:04:02 WARN TaskSetManager: Stage 0 contains a task of very large size (116722 KB). The maximum recommended task size is 100 KB.
Even though you are selecting only 5 rows, you first load the full table into memory with that pd.read_sql call. If you're reading from an Oracle database, why not use the Spark JDBC data source, perform your select filters in Spark, and only then call toPandas?
What your code is doing now is reading the whole table into pandas, writing it to Spark, filtering, and reading it back into pandas.
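As a rough sketch of that approach (the JDBC URL, credentials and options below are placeholders, and it assumes the Oracle JDBC driver jar is on the Spark classpath):
df_clientes = (sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")  # placeholder connection URL
    .option("dbtable", "DEC_CLIENTES")
    .option("user", "my_user")          # placeholder credentials
    .option("password", "my_password")
    .load())

# The projection and LIMIT now run in Spark (and can be pushed down to Oracle),
# so only 5 rows ever reach the driver before toPandas.
pddf = df_clientes.select("ENT_EMAIL", "MES_ART_ID").limit(5).toPandas()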

Related

toPandas() fails even on small dataset in local pyspark

I have read a lot about Spark's memory usage when doing things like collect() or toPandas() (for example here). The common wisdom is to use them only on a small dataset. The question is: how small is small enough for Spark to handle?
I run locally (for testing) with pyspark, with the driver memory set to 20g (I have 32g on my 16-core Mac), but toPandas() crashes even with a dataset as small as 20K rows! That can't be right, so I suspect I have some setting wrong. This is the simplified code to reproduce the error:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# setting the number of rows for the CSV file
N = 20000
ncols = 7
c_name = 'ABCDEFGHIJKLMNOPQRSTUVXYWZ'
# creating a pandas dataframe (df)
df = pd.DataFrame(np.random.randint(999,999999,size=(N, ncols)), columns=list(c_name[:ncols]))
file_name = 'random.csv'
# export the dataframe to csv using comma delimiting
df.to_csv(file_name, index=False)
## Load the csv in spark
df = spark.read.format('csv').option('header', 'true').load(file_name)#.limit(5000)#.coalesce(2)
## some checks
n_parts = df.rdd.getNumPartitions()
print('Number of partitions:', n_parts)
print('Number of rows:', df.count())
## convert spark df -> pandas
df_p = df.toPandas()
print('With pandas:',len(df_p))
I run this within Jupyter and get errors like:
ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.0.104:61536
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
...
My spark local setting is (everything else default):
('spark.driver.host', '192.168.0.104')
('spark.driver.memory', '20g')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.master', 'local[*]')
('spark.executor.id', 'driver')
('spark.submit.deployMode', 'client')
('spark.app.id', 'local-1618499935279')
('spark.driver.port', '55115')
('spark.ui.showConsoleProgress', 'true')
('spark.app.name', 'pyspark-shell')
('spark.driver.maxResultSize', '4g')
Is my setup wrong, or is it expected that even 20g of driver memory can't handle a small DataFrame with 20K rows and 7 columns? Would repartitioning help?

Create Dataframe in Pandas - Out of memory error while reading Parquet files

I have a Windows 10 machine with 8 GB RAM and 5 cores.
I have created a parquet file compressed with gzip. The size of the file after compression is 137 MB.
When I try to read the parquet file through pandas, Dask and Vaex, I get memory issues:
Pandas :
df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Dask:
import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed
Vaex:
df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Since pandas/Python is meant to be efficient and a 137 MB file is a modest size, are there any recommended ways to create DataFrames efficiently? Libraries like Vaex and Dask claim to be very efficient.
For a single machine, I would recommend Vaex with the HDF5 file format. The data stays on disk, so you can work with bigger datasets. There is a built-in function in Vaex that reads and converts a big CSV file into the HDF5 format:
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
Dask is optimized for distributed systems: you read the big file in chunks and then scatter it among worker machines.
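A minimal sketch of that Dask pattern (the file path, blocksize and grouping column are placeholders; it assumes a local dask.distributed cluster):
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local workers; on a real cluster, pass the scheduler address

# Each ~64 MB block of the CSV becomes one partition that a worker processes independently.
df = dd.read_csv('./my_data/my_big_file.csv', blocksize='64MB')

# Keep the work lazy and only materialize small results on the client.
print(df.groupby('some_column').size().compute())  # 'some_column' is a placeholder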
It is entirely possible that a 137 MB parquet file expands to ~4 GB in memory, due to efficient compression and encoding in parquet. You may have some options at load time; please show your schema. Are you using fastparquet or pyarrow?
Since all of the engines you are trying to use are capable of loading one "row group" at a time, I suppose you only have one row group, so splitting won't work. You could load only a selection of columns to save memory, if that accomplishes your task (all of the loaders support this), as in the sketch below.
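A hedged sketch of the column-selection idea (the column names are placeholders for ones in your schema):
import pandas as pd
import pyarrow.parquet as pq

# Only the listed columns are decoded and materialized in memory.
df = pd.read_parquet("C:\\files\\test.parquet", columns=["col_a", "col_b"])

# You can also inspect the layout first to confirm how many row groups the file has.
pf = pq.ParquetFile("C:\\files\\test.parquet")
print(pf.num_row_groups)
print(pf.schema)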
Check that you are using the latest version of pyarrow. A few times updating has helped me.
pip install -U pyarrow
pip install pyarrow==0.15.0 worked for me.

What is the fastest way to return one row from a big pyspark dataframe or koalas dataframe in databricks?

I have a big DataFrame (20 million rows, 35 columns) in Koalas on a Databricks notebook. I have performed some transform and join (merge) operations on it using Python, such as:
mdf.path_info = mdf.path_info.transform(modify_path_info)
x = mdf[['providerid','domain_name']].groupby(['providerid']).apply(domain_features)
mdf = ks.merge( mdf, x[['domain_namex','domain_name_grouped']], left_index=True, right_index=True)
x = mdf.groupby(['providerid','uid']).apply(userspecificdetails)
mmdf = mdf.merge(x[['providerid','uid',"date_last_purch","lifetime_value","age"]], how="left", on=['providerid','uid'])
After these operations, I want to display a few rows of the DataFrame to verify the result. I am trying to print/display as little as 1-5 rows of this big DataFrame, but because of Spark's lazy evaluation, every one of the print commands starts 6-12 Spark jobs and runs forever, after which the cluster goes into an unusable state and nothing happens:
mdf.head()
display(mdf)
mdf.take([1])
mdf.iloc[0]
I also tried converting it into a Spark DataFrame and then trying:
df = mdf.to_spark()
df.show(1)
df.rdd.takeSample(False, 1, seed=0)
df.first()
The cluster configuration I am using is 8worker_4core_8gb, meaning each worker and driver node has 8.0 GB of memory and 4 cores (0.5 DBU), on Databricks Runtime Version 7.0 (includes Apache Spark 3.0.0, Scala 2.12).
Can someone please suggest a fast (ideally the fastest) way to get/print one row of the big DataFrame, one that does not have to process all 20 million rows?
As you write, because of lazy evaluation Spark will perform your transformations first and only then show the one line. What you can do is reduce the size of your input data and run the transformations on a much smaller dataset, e.g.:
https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
df.sample(False, 0.1, seed=0)
You could cache the computation result after you convert to a Spark DataFrame and then call the action:
df = mdf.to_spark()
# caches the result so the action called after this will use this cached
# result instead of re-computing the DAG
df.cache()
df.show(1)
You may want to free up the memory used for caching with:
df.unpersist()

Is there a pyarrow equivalent of the chunksize argument in pandas.read_csv?

I am looking to process a large file (5 GB) in RAM but am getting an out-of-memory error. Is there a way to process a parquet file in chunks, like there is in pandas.read_csv?
import pyarrow.parquet as pq

def main():
    df = pq.read_table('./data/train.parquet').to_pandas()

main()
There is not yet, but there are issues open about adding this option (see https://issues.apache.org/jira/browse/ARROW-3771, among others). Note that memory use will be significantly improved in the upcoming 0.12 release.
In the meantime, you can use pyarrow.parquet.ParquetFile and its read_row_group method to read one row group at a time, for example:
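A small sketch of that row-group approach (the path is taken from the question; process() is a placeholder for your own per-chunk logic):
import pyarrow.parquet as pq

pf = pq.ParquetFile('./data/train.parquet')

for i in range(pf.num_row_groups):
    # Each row group is read and converted independently, so peak memory is
    # roughly one row group's worth of data rather than the whole file.
    chunk = pf.read_row_group(i).to_pandas()
    process(chunk)  # placeholder for whatever you need to do with each chunk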

How to collect spark dataframe at each executor node?

My application reads a large parquet file and performs some data extractions to arrive at a smallish Spark DataFrame object. All the contents of this DataFrame must be present at each executor node for the next phase of the computation. I know that I can do this by collect-and-broadcast, as in this pyspark snippet:
import pyspark
from pyspark.sql import HiveContext

sc = pyspark.SparkContext()
sqlc = HiveContext(sc)
# --- register hive tables and generate spark dataframe
spark_df = sqlc.sql('sql statement')
# collect spark dataframe contents into a Pandas dataframe at the driver
global_df = spark_df.toPandas()
# broadcast Pandas dataframe to all the executor nodes
sc.broadcast(global_df)
I was just wondering: is there a more efficient method for doing this? It would seem that this pattern makes the driver node into a bottleneck.
It depends on what you need to do with your small DataFrame. If you need to join it with the large one, then Spark can optimize such joins by broadcasting the small DataFrame automatically. The maximum size of a DataFrame that can be broadcast is configured by the spark.sql.autoBroadcastJoinThreshold option, as described in the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
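As a hedged sketch of that broadcast-join pattern (the large table name and join key are placeholders; spark_df is the smallish DataFrame from the question):
from pyspark.sql.functions import broadcast

large_df = sqlc.table('some_large_table')   # placeholder for the large side of the join

# Explicit hint; Spark would also broadcast spark_df automatically if its size
# is below spark.sql.autoBroadcastJoinThreshold.
joined = large_df.join(broadcast(spark_df), on='join_key')  # 'join_key' is a placeholder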