Issue in saving the content of a dataframe to table - apache-spark-sql

I have a data source (Hive external tables) whose data is refreshed on an ad hoc basis. To avoid discrepancies during execution, I'm trying to save the data as a table in my own location.
First, I load the data from the data source into a DataFrame:
source = hqlContext.table("datasourcedb.table1") // this is working fine
Then I try to save it to my application's location:
source.write.mode('overwrite').saveAsTable("appdb.table1") //No read/write operations on appdb.table1 while doing this action
The save throws the following exception:
java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: BLOCK
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/02 04:31:32 ERROR TaskSetManager: Task 9 in stage 1.0 failed 4 times; aborting job
18/03/02 04:31:32 ERROR InsertIntoHadoopFsRelation: Aborting job.
**Note: The source is about 6 GB in size, so no persist action is planned.**

Related

How to find batch element in Websphere commerce error

When I run buildindex in my WebSphere application, I get this error in the buildindex log:
[2021/05/10 15:41:57:590 GMT] I Data import pre-processing completed in 0.389 seconds for table TI_CAT_EXTENDED_41060.
[2021/05/10 15:41:57:591 GMT] I /opt/IBM/WebSphere/CommerceServer80/instances/auth/search/pre-processConfig/MC_41060/DB2/wc-dataimport-preprocess-catentry-metainf.xml
[2021/05/10 15:41:57:591 GMT] I
Table name: TI_X_CATENT_META_INF_410600
Fetch size: 500
Batch size: 500
[2021/05/10 15:41:58:048 GMT] I Error for batch element #415: DB2 SQL Error: SQLCODE=-302, SQLSTATE=22001, SQLERRMC=null, DRIVER=4.19.77
[2021/05/10 15:41:58:048 GMT] I SQL: SELECT CATENTRY_ID, TITLE, TITLE_KEYWORDS, SHORT_DESC, SHORT_DESC_KEYWORDS, LONG_DESC, LONG_DESC_KEYWORDS, LOCALE FROM X_CATENT_META_INF WHERE STORE_ID = 41006
[2021/05/10 15:41:58:087 GMT] I
The program exiting with exit code: 1.
Data import pre-processing was unsuccessful. An unrecoverable error has occurred.
[2021/05/10 15:41:58:091 GMT] E com.ibm.commerce.foundation.dataimport.preprocess.DataImportPreProcessorMain:handleExecutionException Exception message: CWFDIH0002: An SQL exception was caught. The following error occurred: [jcc][t4][102][10040][4.19.77] Batch failure. The batch was submitted, but at least one exception occurred on an individual member of the batch.
Use getNextException() to retrieve the exceptions for specific batched elements. ERRORCODE=-4229, SQLSTATE=null., stack trace: com.ibm.commerce.foundation.dataimport.exception.DataImportSystemException: CWFDIH0002: An SQL exception was caught. The following error occurred: [jcc][t4][102][10040][4.19.77] Batch failure. The batch was submitted, but at least one exception occurred on an individual member of the batch.
Use getNextException() to retrieve the exceptions for specific batched elements. ERRORCODE=-4229, SQLSTATE=null.
at com.ibm.commerce.foundation.dataimport.preprocess.DataImportPreProcessorMain.processDataConfig(DataImportPreProcessorMain.java:1515)
at com.ibm.commerce.foundation.dataimport.preprocess.DataImportPreProcessorMain.execute(DataImportPreProcessorMain.java:1331)
at com.ibm.commerce.foundation.dataimport.preprocess.DataImportPreProcessorMain.main(DataImportPreProcessorMain.java:534)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:56)
at java.lang.reflect.Method.invoke(Method.java:620)
at com.ibm.ws.bootstrap.WSLauncher.main(WSLauncher.java:280)
Caused by: com.ibm.db2.jcc.am.BatchUpdateException: [jcc][t4][102][10040][4.19.77] Batch failure. The batch was submitted, but at least one exception occurred on an individual member of the batch.
Use getNextException() to retrieve the exceptions for specific batched elements. ERRORCODE=-4229, SQLSTATE=null
at com.ibm.db2.jcc.am.b4.a(b4.java:475)
at com.ibm.db2.jcc.am.Agent.endBatchedReadChain(Agent.java:414)
at com.ibm.db2.jcc.am.ki.a(ki.java:5342)
at com.ibm.db2.jcc.am.ki.c(ki.java:4929)
at com.ibm.db2.jcc.am.ki.executeBatch(ki.java:3045)
at com.ibm.commerce.foundation.dataimport.preprocess.AbstractDataPreProcessor.populateTable(AbstractDataPreProcessor.java:373)
at com.ibm.commerce.foundation.dataimport.preprocess.StaticAttributeDataPreProcessor.process(StaticAttributeDataPreProcessor.java:461)
at com.ibm.commerce.foundation.dataimport.preprocess.DataImportPreProcessorMain.processDataConfig(DataImportPreProcessorMain.java:1482)
... 7 more
The exception seems clear, but I can't identify which element is #415 in the batch. The log doesn't help either, because it doesn't point to a more detailed log. Do you have any suggestions for finding it?
Thanks to a comment from user @mao, I followed this link:
The failing table must first be identified. Enable more detailed tracing for di-preprocess:
Navigate to:
WC_installdir/instances/instance_name/xml/config/dataimport
and open the logging.properties file. Find all instances of INFO and change them to FINEST. Optionally, increase the size of the log file and the number of historical log files while editing this file.
Following this suggestion, I re-ran the buildindex process and found that Solr was wrongly grouping fields from the original table. This produced a value too long for the destination column, which caused the error.
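The driver message in the trace points at getNextException(). The buildindex preprocessor itself is IBM code, so the following is only a sketch for JDBC batches in code you control, using the standard java.sql API. The names BatchErrorDump, describeChain and failedIndexes are hypothetical. Whether a driver marks failed elements with Statement.EXECUTE_FAILED or stops at the first failure is driver-dependent, so treat that part as an assumption.

```java
import java.sql.BatchUpdateException;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of WebSphere Commerce or the DB2 driver.
final class BatchErrorDump {

    // Walks the chain exposed by SQLException.getNextException() and
    // returns one descriptive line per chained exception.
    static List<String> describeChain(SQLException first) {
        List<String> lines = new ArrayList<>();
        for (SQLException e = first; e != null; e = e.getNextException()) {
            lines.add("SQLSTATE=" + e.getSQLState()
                    + " code=" + e.getErrorCode()
                    + " message=" + e.getMessage());
        }
        return lines;
    }

    // Returns the zero-based indexes of batch elements whose update count
    // is Statement.EXECUTE_FAILED (assumes the driver reports them this way).
    static List<Integer> failedIndexes(BatchUpdateException e) {
        List<Integer> failed = new ArrayList<>();
        int[] counts = e.getUpdateCounts();
        if (counts == null) {
            return failed;
        }
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == Statement.EXECUTE_FAILED) {
                failed.add(i);
            }
        }
        return failed;
    }
}
```

In a catch block around executeBatch(), logging the lines from describeChain would surface the per-element SQLCODE and SQLSTATE (such as the -302 / 22001 pair seen above) next to the generic batch failure.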

Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta

We are using Structured Streaming in a Databricks environment. Every time we run this program (Kafka to Structured Streaming on DBR 6.6, Spark 2.4.5, writing to Cosmos DB), we get the exception below just before the final joins that save the data to Cosmos DB. We haven't modified any Spark-specific settings and are using the default Spark/DBR configuration.
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 174 in stage 9353.0 failed 4 times, most recent failure:
Lost task 174.3 in stage 9353.0 (TID 60863, 10.139.64.9, executor 1):
java.lang.IllegalStateException:
Error reading delta file dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta of HDFSStateStoreProvider[id = (op=8,part=174),dir = dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues]:
dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta does not exist
Caused by: java.io.FileNotFoundException:
/6455647419774311/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta

Datastax: Block not found error from DSEFS

A Spark Streaming job is running in DSE and uses DSEFS for its checkpoint directory. I see this error in the debug log file. How can I resolve it?
ERROR [dsefs-netty-worker-5] 2017-12-01 05:23:02,679 DSE-FS RestServerHandler.scala:126 - [id: 0x9964e082, /<>:58874 :> 0.0.0.0/0.0.0.0:5598] Streaming data to remote end failed.
java.io.IOException: Block not found a3859f30-aa23-11e7-80b9-4b8bdaf197cd
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$33$1.apply(BlockService.scala:706) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$33$1.apply(BlockService.scala:703) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [scala-library-2.10.6.jar:na]
at com.datastax.bdp.fs.exec.SameThreadExecutionContext$class.executeInSameThread(SameThreadExecutionContext.scala:24) ~[dsefs-common_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.exec.SameThreadExecutionContext$class.execute(SameThreadExecutionContext.scala:33) ~[dsefs-common_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.exec.SerialExecutionContextProvider$$anon$5$$anon$2.execute(SerialExecutionContextProvider.scala:24) ~[dsefs-common_2.10-5.0.19.jar:5.0.19]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) [scala-library-2.10.6.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) ~[scala-library-2.10.6.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) ~[scala-library-2.10.6.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) ~[scala-library-2.10.6.jar:na]
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$1$1.apply(BlockService.scala:60) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at com.datastax.bdp.fs.server.blocks.BlockService$stateMachine$1$1.apply(BlockService.scala:60) ~[dsefs-server_2.10-5.0.19.jar:5.0.19]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [scala-library-2.10.6.jar:na]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) [netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) [netty-all-4.0.34.Final.jar:4.0.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) [netty-all-4.0.34.Final.jar:4.0.34.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_112]
This error means the DSEFS server failed to find the metadata of a data block in the dsefs.blocks Cassandra table. The ids of a file's blocks are stored in the dsefs.block_offsets table, and they reference blocks stored in dsefs.blocks. If a row exists in dsefs.block_offsets and points to a block id that is absent from dsefs.blocks, you get this error when reading the file.
This error should not happen under normal circumstances; it means the filesystem metadata somehow got into an inconsistent state. The cause may be a bug in the DSEFS implementation, data loss from setting up the dsefs keyspace with an insufficient replication factor, or a write operation that did not finish successfully and was only partially applied.
Please make sure you set the dsefs keyspace RF to at least 3 and run nodetool repair to avoid accidental data loss or unavailability of some DSEFS metadata.
If this doesn't help, please contact me directly or through DataStax technical support and provide more details, including logs from the time before the error and more context on what the job was doing when the failure occurred.

BigQuery loads manually but not through the Java SDK

I have a Dataflow pipeline running locally. The objective is to read a JSON file using TextIO, build sessions, and load the result into BigQuery. Given the structure, I have to create a temp directory in GCS and then load the data into BigQuery from there. Previously I had a data schema error that prevented me from loading the data, see here. That issue is resolved.
Now when I run the pipeline locally, it ends by dumping a temporary newline-delimited JSON file into GCS. The SDK then gives me the following:
Starting BigQuery load job beam_job_xxxx_00001-1: try 1/3
INFO [main] (BigQueryIO.java:2191) - BigQuery load job failed: beam_job_xxxx_00001-1
...
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:187)
at pedesys.Dataflow.main(Dataflow.java:148)
Caused by: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.load(BigQueryIO.java:2198)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.processElement(BigQueryIO.java:2146)
The errors are not very descriptive, and the data is still not loaded into BigQuery. What is puzzling is that if I go to the BigQuery UI and manually load the same temporary file that the pipeline dumped into GCS, into the same table, it works beautifully.
The relevant code parts are as follows:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
    .setTempLocation("gs://test/temp");
Pipeline p = Pipeline.create(options);
...
...
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
    .apply(BigQueryIO.Write
        .named("loadJob")
        .to("myproject:db.table")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    );
The SDK is swallowing the error and not reporting it to the user. It's most likely a schema problem. To see the actual error, fetch the job details in one of two ways:
CLI: bq show -j beam_job_<xxxx>_00001-1
Browser/Web: use "try it" at the bottom of the page here.
@jkff has raised an issue here to improve the error reporting.

HSQLDB throws Assert failed exception and file I/O error on db.script.new file during Checkpoint

Our application is a Java-based desktop application that downloads binary data from a source, parses it, and adds it to an HSQLDB database. When downloading from the sources one at a time, the application works perfectly. But when downloading from multiple sources simultaneously, with each source in its own thread, I get the error
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 23 in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
or sometimes,
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 1016 in statement [CHECKPOINT]
followed by
java.sql.SQLException: File input/output error: C:\ProgramData\test\data\database\db.script.new in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
Java: 1.8;
HSQL version: 1.8.10
We are not in a position to migrate HSQLDB to the latest version, for various reasons.
HSQL Properties:
hsqldb.script_format=0
runtime.gc_interval=0
sql.enforce_strict_size=false
hsqldb.cache_size_scale=8
readonly=false
hsqldb.nio_data_file=true
hsqldb.cache_scale=14
version=1.8.0
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
hsqldb.log_size=200
modified=yes
hsqldb.cache_version=1.7.0
hsqldb.original_version=1.8.0
hsqldb.compatible_version=1.8.0
Any help or hint will be appreciated.
This is a 7-year-old version, which is not ideal for multi-threaded use.
The simple solution is to perform the database updates with a single thread. You can retrofit your multi-threaded application with a synchronized block over a singleton object around the code that performs the database update.
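A minimal sketch of that suggestion, assuming every database update in the application can be routed through one shared object. DbWriteGate and runExclusive are hypothetical names, not HSQLDB API; the Callable stands in for whatever JDBC code performs the inserts or the CHECKPOINT.

```java
import java.util.concurrent.Callable;

// Hypothetical singleton used only as a shared lock around database updates.
final class DbWriteGate {
    private static final DbWriteGate INSTANCE = new DbWriteGate();

    private DbWriteGate() {
    }

    static DbWriteGate getInstance() {
        return INSTANCE;
    }

    // Runs the given work while holding the singleton's monitor, so at most
    // one thread at a time executes database update code.
    <T> T runExclusive(Callable<T> work) throws Exception {
        synchronized (this) {
            return work.call();
        }
    }
}
```

Each download thread would wrap only its database code in DbWriteGate.getInstance().runExclusive(...). Downloading and parsing stay outside the lock, so they can still run in parallel.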