Convert pyspark dataframe to pandas dataframe - pandas

I have pyspark dataframe where its dimension is (28002528,21) and tried to convert it to pandas dataframe by using the following code line :
pd_df=spark_df.toPandas()
I got this error:
first Part
Py4JJavaError: An error occurred while calling o170.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 39.0 failed 1 times, most recent failure: Lost task 3.0 in stage 39.0 (TID 89, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
...
...
Second Part
Exception happened during processing of request from ('127.0.0.1', 56842)
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:56657)
Traceback (most recent call last):
...
...
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
...
...
and I tried also to take sample of the original pyspark dataframe
smaple_pd_df=spark_df.sample(0.05).toPandas()
I got an error looks like the first part only of the previous error

You get
java.lang.OutOfMemoryError which probably means that you are trying to load all data into a single node which doesn't have enough RAM to handle the entire DataFrame. If you are using a cloud solution provider such as Databricks, try increasing the size of cluster RAM.

What toPandas() does is collect the whole dataframe into a single node (as explained in #ulmefors's answer).
More specifically, it collects it to the driver. The specific option you should be fine-tuning is spark.driver.memory, increase it accordingly.
Otherwise, if you're planning on doing further transformations on this (rather large) pandas dataframe, you could consider doing them in pyspark first and then collecting the (smaller) result into the driver, hopefully that will fit in memory.
More details are available in the Spark configuration documentation, here.

Related

Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta

We are using Structured Streaming in Databricks environment, Every time while we run this program - kAFKA - Structured Streaming (DBR6.6, Spark 2.4.5) - Writing to CosmosDB, we are getting the same exception as below just before we do the final joins to save the data to Cosmos DB. We haven't modified any spark specific settings and leveraging the default spark /DBR configurations.
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 174 in stage 9353.0 failed 4 times, most recent failure:
Lost task 174.3 in stage 9353.0 (TID 60863, 10.139.64.9, executor 1):
java.lang.IllegalStateException:
Error reading delta file dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta of HDFSStateStoreProvider[id = (op=8,part=174),dir = dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues]:
dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta does not exist
Caused by: java.io.FileNotFoundException:
/6455647419774311/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta

Randomly getting java.lang.ClassCastException in snappy job

Snappy job written in Scala aborts with exception:
java.lang.ClassCastException: com.....$Class1 cannot be cast to com.....$Class1.
Class1 is custom class that is stored in RDD. Interesting thing is this error is thrown while casting same class. So far, no patterns are found.
In the job, we fetch data from hbase, enrich data with analytical metadata using Dataframes and push it to a table in SnappyData. We are using Snappydata 1.2.0.1.
Not sure why is this happening.
Below is Stack Trace:
Job aborted due to stage failure: Task 76 in stage 42.0 failed 4 times, most recent failure: Lost task 76.3 in stage 42.0 (TID 3550, HostName, executor XX.XX.x.xxx(10360):7872): java.lang.ClassCastException: cannot be cast to
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:86)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$2.hasNext(WholeStageCodegenExec.scala:571)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.hasNext(WholeStageCodegenExec.scala:514)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:233)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1006)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:997)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.computeInternal(WholeStageCodegenExec.scala:557)
at org.apache.spark.sql.execution.WholeStageCodegenRDD$$anon$1.(WholeStageCodegenExec.scala:504)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:503)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:103)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:126)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:326)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.spark.executor.SnappyExecutor$$anon$2$$anon$3.run(SnappyExecutor.scala:57)
at java.lang.Thread.run(Thread.java:748)
Classes are not unique by name. They're unique by name + classloader.
ClassCastException of the kind you're seeing happens when you pass data between parts of the app where one or both parts are loaded in a separate classloader.
You might need to clean up your classpath, you might need to resolve the classes from the same classloader, or you might have to serialize the data (especially if you have features that rely on reloading code at runtime).

Issue in saving the content of a dataframe to table

I have a data source (hive external tables) which refresh the data in adhoc manner. To avoid any discrepancies in the execution i'm trying to save the data as a table in my location.
Initially, i have loaded the data from data source to a dataframe
source = hqlContext.table("datasourcedb.table1") // this is working fine
Then, trying to save it the my application location -
source.write.mode('overwrite').saveAsTable("appdb.table1") //No read/write operations on appdb.table1 while doing this action
Above actions throwing exceptions:
java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: BLOCK
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/02 04:31:32 ERROR TaskSetManager: Task 9 in stage 1.0 failed 4 times; aborting job
18/03/02 04:31:32 ERROR InsertIntoHadoopFsRelation: Aborting job.
**Note: The size of the source is abot 6GB. Hence, no persist action is planned **

HSQLDB throws Asset failed exception and file io error on db.script.new file during Checkpoint

Our application is a Java based desktop application which will download the binary data from the source, parses it and add it to HSQLDB database. When downloading from the sources individually, application works perfectly. But when doing the same from multiple sources simultaneously with each source in an individual thread, I am getting an error of
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 23 in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
or sometimes,
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 1016 in statement [CHECKPOINT]
followed by
java.sql.SQLException: File input/output error: C:\ProgramData\test\data\database\db.script.new in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
Java: 1.8;
HSQL version: 1.8.10
We are not in the position to migrate the HSQLDB to latest version because of various reasons.
HSQL Properties:
hsqldb.script_format=0
runtime.gc_interval=0
sql.enforce_strict_size=false
hsqldb.cache_size_scale=8
readonly=false
hsqldb.nio_data_file=true
hsqldb.cache_scale=14
version=1.8.0
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
hsqldb.log_size=200
modified=yes
hsqldb.cache_version=1.7.0
hsqldb.original_version=1.8.0
hsqldb.compatible_version=1.8.0
Any help or hint will be appreciated.
This is an 7 year old version which is not ideal for multi-threaded usage.
The simple solution is to perform the database updates with a single thread. You can retrofit your multi-threaded application with a synchronized block over a singleton object around the code that performs the database update.

Hive with Tez, No input paths specified in job

I have used hadoop-0.20.x.x, hive-0.11.0. I would talk about hive queries: with the specified configuration every thing is good and working fine.
Now, we have upgraded to hadoop-2.6.x (hadoop2)and hive-0.14.x. Also using Apache Tez.
The problem is, hadoop works as is. But hive sql queries doesn't.
The below query works fine in the older version's. But throw errors in the upgraded version's:
QUERY : SELECT abc.property_name, xyz.date, xyz.time, xyz.value_as_number, xyz.value_units FROM dbname.xyz JOIN dbname.abc ON (xyz.id = abc.src_id) WHERE xyz.person_id=138312;
EXCEPTION:
INFO : Session is already open
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1435524970199_0035)
INFO : Map 1: -/- Map 2: -/-
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1435524970199_0035_1_00, diagnostics=[Vertex vertex_1435524970199_0035_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: concept initializer failed, vertex=vertex_1435524970199_0035_1_00 [Map 1], java.io.IOException: No input paths specified in job
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputPaths(HiveInputFormat.java:318)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:328)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:130)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
]
ERROR : Vertex failed, vertexName=Map 2, vertexId=vertex_1435524970199_0035_1_01, diagnostics=[Vertex vertex_1435524970199_0035_1_01 [Map 2] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: observation initializer failed, vertex=vertex_1435524970199_0035_1_01 [Map 2], java.io.IOException: No input paths specified in job
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputPaths(HiveInputFormat.java:318)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:328)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:130)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
]
ERROR : DAG failed due to vertex failure. failedVertices:2 killedVertices:0
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask (state=08S01,code=2)
Exception says, No input path specified. Well, i understand and know how to do solve in haodop-mapreduce program. But, how do we do it using hive query. Anyway, i don't think this is the same.
To make out, i have used hive shell and beeline shell, hive returned expected output but, beeline returned the same exception as above.
The beauty of the problem is query on individual table works fine. But, when i try to work on the JOIN, it throws the above exception.
But, i have understood that, there's an impact of Apache Tez on my query. Can some one suggest the solution or pin point tez reference, so i could read and rewrite the query accordingly. Thanks
It worked by disabling apache tez.
Look's like apache tez isn't stable yet.