Issue with DataStax graph loader edge loading with property key - datastax

DSE version 6.7 and DSE Graph Loader version 6.7.
We have two vertex labels, "x" and "y", and a connecting edge labeled "z" from x to y.
That edge also has properties, so we need to load those as well. All property data types are Text (string), except the vertex partition keys, which are UUID.
As mentioned in the DataStax documentation, to load edges with properties we need to gzip the CSV:
https://docs.datastax.com/en/dse/6.7/dse-dev/datastax_enterprise/graph/dgl/dglMapScript.html
Now we are loading data using the graph loader to load the edges between x and y.
Here is our Groovy script:
// CONFIGURATION
// Configures the data loader to create the schema
//config create_schema: false, preparation: true, load_new: true, load_vertex_threads: 0, dryrun: true, schema_output: '/tmp/loader_output.txt'
config create_schema: false, load_new: true

inputfiledir = '/root/'
waInput = File.csv(inputfiledir + "test.csv.gz").gzip().delimiter('|')

// Load edges
load(waInput).asEdges {
    label "z"
    outV "x", {
        label "x"
        key "xId"
    }
    inV "y", {
        label "y"
        key "yId"
    }
}
Here is our CSV file format:
xId|yId|newCase|totalActiveCases|totalRecoveries|totalDeaths|allCasesTotal|totalTestConducted
8c49304d-71e9-4e9d-93b3-47ebfd0ce073|e31f0a23-64c0-44ea-add8-c67eb52c0187|0|0|0|0|0|0
d451b2b2-a4b5-4ed6-bb4e-128945795e57|e31f0a23-64c0-44ea-add8-c67eb52c0187|0|0|0|0|0|0
xId and yId are UUID; apart from them, all columns are Text (string) in the database.
When we run that Groovy script with the DSE Graph Loader, we get this error:
2020-04-21 05:25:14 INFO DataLoaderImpl:213 - Scheduling [test.csv] for reading
2020-04-21 05:25:14 DEBUG Reporter:69 - Input queue [test.csv] throughput is [111.74281809353118] items/s
2020-04-21 05:25:14 DEBUG Reporter:120 - query times 9: p50 4515.0µs, p80 4515.0µs, p90 4515.0µs, p95 4515.0µs, p99 4515.0µs, p99.9 4515.0µs, p99.99 4515.0µs
2020-04-21 05:25:14 ERROR DataLoaderImpl:720 - Failed when finalizing loader, some items that were in the queue may not have been written
com.datastax.dsegraphloader.exception.LoadingException: com.datastax.driver.core.exceptions.InvalidQueryException: Null not allowed
at com.datastax.dsegraphloader.impl.loader.driver.DseGraphDriverImpl.executeGraphQuery(DseGraphDriverImpl.java:149)
at com.datastax.dsegraphloader.impl.loader.driver.DseGraphDriverImpl.getOrCreateVertices(DseGraphDriverImpl.java:366)
at com.datastax.dsegraphloader.impl.loader.driver.SafeGraphDriver.lambda$tryGetOrCreateVerticies$28(SafeGraphDriver.java:137)
at com.datastax.dsegraphloader.impl.loader.driver.SafeGraphDriver.tryDriverCall(SafeGraphDriver.java:72)
at com.datastax.dsegraphloader.impl.loader.driver.SafeGraphDriver.tryGetOrCreateVerticies(SafeGraphDriver.java:125)
at com.datastax.dsegraphloader.impl.loader.vertex.VertexLoader.completeLoad(VertexLoader.java:100)
at com.datastax.dsegraphloader.impl.loader.DataLoaderImpl$LoaderCallable.call(DataLoaderImpl.java:645)
at com.datastax.dsegraphloader.impl.loader.DataLoaderImpl$DelegatingLoaderCallable.run(DataLoaderImpl.java:530)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Null not allowed
Any solutions for this?

Related

Mapping Data Flows - Cannot retrieve value from cached sink

I am trying to look up a value from a cached sink. The Data Flow looks like the following.
I have created a hash value in my cached sink and want to reference that in my main pipeline.
My key for the cached sink is an array of columns. When I preview the data I get results.
My derived column is then trying to do a lookup against the cached data and running into an error.
When debugging I get the following error. What am I missing or getting wrong in this statement?
Spark job failed: {
"text/plain": "{"runId":"98c9bae9-210e-4791-9b0d-60bc557ff416","sessionId":"02bc59a8-ac6f-4eeb-952c-2e9bdda49691","status":"Failed","payload":{"statusCode":400,"shortMessage":"DF-SYS-01 at Derive 'GenerateHashKey': java.util.NoSuchElementException: key not found: Id","detailedMessage":"Failure 2022-04-26 04:07:47.375 failed DebugManager.processJob, run=98c9bae9-210e-4791-9b0d-60bc557ff416, errorMessage=DF-SYS-01 at Derive 'GenerateHashKey': java.util.NoSuchElementException: key not found: Id"}}\n"
} - RunId: 98c9bae9-210e-4791-9b0d-60bc557ff416
Thanks

Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta

We are using Structured Streaming in a Databricks environment. Every time we run this program - Kafka - Structured Streaming (DBR 6.6, Spark 2.4.5) - writing to Cosmos DB - we get the same exception below, just before we do the final joins to save the data to Cosmos DB. We haven't modified any Spark-specific settings and are leveraging the default Spark/DBR configurations.
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 174 in stage 9353.0 failed 4 times, most recent failure:
Lost task 174.3 in stage 9353.0 (TID 60863, 10.139.64.9, executor 1):
java.lang.IllegalStateException:
Error reading delta file dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta of HDFSStateStoreProvider[id = (op=8,part=174),dir = dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues]:
dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta does not exist
Caused by: java.io.FileNotFoundException:
/6455647419774311/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta

Convert pyspark dataframe to pandas dataframe

I have a PySpark dataframe whose dimensions are (28002528, 21) and tried to convert it to a pandas dataframe using the following line of code:
pd_df = spark_df.toPandas()
I got this error:
First part:
Py4JJavaError: An error occurred while calling o170.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 39.0 failed 1 times, most recent failure: Lost task 3.0 in stage 39.0 (TID 89, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:552)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:256)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
...
...
Second part:
Exception happened during processing of request from ('127.0.0.1', 56842)
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:56657)
Traceback (most recent call last):
...
...
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
...
...
I also tried to take a sample of the original PySpark dataframe:
sample_pd_df = spark_df.sample(0.05).toPandas()
I got an error that looks like only the first part of the previous error.
You get java.lang.OutOfMemoryError, which probably means that you are trying to load all the data into a single node which doesn't have enough RAM to handle the entire DataFrame. If you are using a cloud solution provider such as Databricks, try increasing the size of the cluster's RAM.
What toPandas() does is collect the whole dataframe into a single node (as explained in #ulmefors's answer).
More specifically, it collects it to the driver. The specific option you should be fine-tuning is spark.driver.memory; increase it accordingly.
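As a minimal sketch (assuming your script creates its own SparkSession; on Databricks the equivalent is normally set through the cluster's Spark configuration, and the 8g value here is only an illustration):
from pyspark.sql import SparkSession

# Driver memory has to be in place before the driver JVM starts, so set it when the
# session is first created (or via spark-submit --driver-memory / cluster config).
spark = (
    SparkSession.builder
    .appName("to-pandas-example")          # hypothetical app name
    .config("spark.driver.memory", "8g")   # illustrative value; size it to the collected data
    .getOrCreate()
)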
Otherwise, if you're planning on doing further transformations on this (rather large) pandas dataframe, you could consider doing them in PySpark first and then collecting the (smaller) result into the driver, which will hopefully fit in memory.
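A rough sketch of that pattern, reusing the question's spark_df; the column names below are purely hypothetical placeholders:
from pyspark.sql import functions as F

# Reduce in Spark first, then collect only the small aggregated result.
small_df = (
    spark_df
    .groupBy("some_key_column")                       # hypothetical column
    .agg(F.sum("some_value_column").alias("total"))   # hypothetical column
)
pd_df = small_df.toPandas()  # only the reduced result is brought to the driver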
More details are available in the Spark configuration documentation, here.

Issue in saving the content of a dataframe to table

I have a data source (Hive external tables) which refreshes the data in an ad hoc manner. To avoid any discrepancies in the execution, I'm trying to save the data as a table in my own location.
Initially, I loaded the data from the data source into a dataframe:
source = hqlContext.table("datasourcedb.table1") // this is working fine
Then I try to save it to my application location:
source.write.mode('overwrite').saveAsTable("appdb.table1") //No read/write operations on appdb.table1 while doing this action
The above action throws this exception:
java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: BLOCK
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/02 04:31:32 ERROR TaskSetManager: Task 9 in stage 1.0 failed 4 times; aborting job
18/03/02 04:31:32 ERROR InsertIntoHadoopFsRelation: Aborting job.
Note: The size of the source is about 6 GB. Hence, no persist action is planned.

Configuration values for hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode in HIVE

I am trying to add data to an external table using Apache Hive. I am getting the following error in the Hive logs:
2015-06-15 17:27:44,614 ERROR [LocalJobRunner Map Task Executor #0]: mr.ExecMapper (ExecMapper.java:map(171)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"transactiondate":"05-01-2015 08:26:21","transactiontype":"CASHOUT","transactionid":144590889,"sourcenumber":null,"destnumber":null,"amount":19000,"assumedfield1":880,"customerid":33394093,"transactionstatus":"COMPLETED","assumedfield2":325,"assumedfield3":175870}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 256
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:933)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:709)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
... 10 more
I googled this error and came across this link, which says that we must change the values of the hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode variables to higher values. What are the optimum configurations for these variables on a single-node Hadoop installation? None of these configuration values are working for me. Please help.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=250;
Please do not try to increase the Hive partitions to a very high value.
It may cause a NameNode crash. If possible, try to change the partition column and apply new logic over it.