java.util.ConcurrentModificationException when adding a key to metadata in StormCrawler - serialization

I added a field to the metadata so that it is transferred and persisted in the status index. The field is a List of Strings named input_keywords. After running the topology on the Storm cluster, the topology halted with the following logs:
java.lang.RuntimeException: com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
Serialization trace:
md (com.digitalpebble.stormcrawler.Metadata)
at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:522) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:487) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:74) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.disruptor$consume_loop_STAR_$fn__4132.invoke(disruptor.clj:84) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.util$async_loop$fn__1221.invoke(util.clj:484) [storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
Caused by: com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
Serialization trace:
md (com.digitalpebble.stormcrawler.Metadata)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) ~[kryo-3.0.3.jar:?]
at org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:44) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:44) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.daemon.worker$mk_transfer_fn$transfer_fn__10378.invoke(worker.clj:203) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.daemon.executor$start_batch_transfer_GT_worker_handler_BANG$fn__10056.invoke(executor.clj:314) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.disruptor$clojure_handler$reify__4115.onEvent(disruptor.clj:41) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
... 6 more
Caused by: java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437) ~[?:1.8.0_112]
at java.util.HashMap$EntryIterator.next(HashMap.java:1471) ~[?:1.8.0_112]
at java.util.HashMap$EntryIterator.next(HashMap.java:1469) ~[?:1.8.0_112]
at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:99) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:39) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) ~[kryo-3.0.3.jar:?]
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) ~[kryo-3.0.3.jar:?]
at org.apache.storm.serialization.KryoValuesSerializer.serializeInto(KryoValuesSerializer.java:44) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:44) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.daemon.worker$mk_transfer_fn$transfer_fn__10378.invoke(worker.clj:203) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.daemon.executor$start_batch_transfer_GT_worker_handler_BANG$fn__10056.invoke(executor.clj:314) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.disruptor$clojure_handler$reify__4115.onEvent(disruptor.clj:41) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509) ~[storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
... 6 more
2021-02-27 08:03:34.276 o.a.s.d.executor Thread-20-disruptor-executor[45 45]-send-queue [ERROR]
(the same KryoException / ConcurrentModificationException stack trace as above, repeated)
2021-02-27 08:03:34.327 o.a.s.util Thread-20-disruptor-executor[45 45]-send-queue [ERROR] Halting process: ("Worker died")
java.lang.RuntimeException: ("Worker died")
at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341) [storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.7.0.jar:?]
at org.apache.storm.daemon.worker$fn_10827$fn_10828.invoke(worker.clj:781) [storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.daemon.executor$mk_executor_data$fn_10034$fn_10035.invoke(executor.clj:281) [storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at org.apache.storm.util$async_loop$fn__1221.invoke(util.clj:494) [storm-core-1.2.1.3.1.4.0-315.jar:1.2.1.3.1.4.0-315]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
We use different parallelism hints for each component of the topology. After adding input_keywords to the metadata, we got this error. What is the main cause of the error?

You are modifying a Metadata instance while it is being serialized. You can't do that; see the Storm troubleshooting page.
As explained in the release notes of 1.16, you can lock the metadata. This won't fix the issue, but it will tell you where in your code you are writing into the metadata.
In our topology, we emitted the same metadata to multiple bolts at the same time.
Mystery explained: don't do that.
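This failure mode is not specific to Storm: any plain hash map that one thread iterates for serialization while another thread writes to it fails the same way. Below is a minimal Python sketch of the race, purely as an analogy for what Kryo's MapSerializer hits inside Metadata's HashMap (all names are hypothetical):

import threading
import time

# Stand-in for the Metadata map shared between a serializing thread
# and a bolt that keeps writing into it.
shared_md = {"key%d" % i: i for i in range(200000)}

def serialize():
    # Walk the entries, as Kryo's MapSerializer does when the tuple
    # is serialized on the send queue.
    return sum(v for v in shared_md.values())

def mutate():
    # Another component writing into the same map mid-serialization.
    for i in range(1000):
        shared_md["extra%d" % i] = i
        time.sleep(0.0001)

writer = threading.Thread(target=mutate)
writer.start()
try:
    serialize()  # usually raises: dictionary changed size during iteration
except RuntimeError as exc:
    print(exc)   # Python's analogue of ConcurrentModificationException
writer.join()

The equivalent fix in the topology, per the answer above, is not to share one mutable Metadata instance across concurrently processing bolts; in the sketch, giving the serializer its own snapshot (for example, dict(shared_md)) removes the race.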

Related

Unable to label axis in Jupyter: TypeError

This is the code I used to plot, except that I have removed the API key.
import datetime
import pandas as pd
import numpy as np
from pandas_datareader import data as web
import matplotlib.pyplot as plt
from pandas import DataFrame
from alpha_vantage.foreignexchange import ForeignExchange
cc = ForeignExchange(key=' ',output_format='pandas')
data, meta_data = cc.get_currency_exchange_daily(from_symbol='USD',to_symbol='EUR', outputsize='full')
print(data)
data['4. close'].plot()
plt.tight_layout()
plt.title('Intraday USD/Eur')
If I insert ylabel('label'), I get the following error: TypeError: 'str' object is not callable.
This problem was outlined here; thus, if I restart my Jupyter kernel, the ylabel will show up once, but if I rerun the same code I get the same error again. Is there a bug, or is there a problem on my end?
I am not sure if it is relevant, but the dataframe looks like this:
Open High Low Close
date
2019-11-15 0.9072 0.9076 0.9041 0.9043
2019-11-14 0.9081 0.9097 0.9065 0.9070
2019-11-13 0.9079 0.9092 0.9071 0.9082
2019-11-12 0.9062 0.9085 0.9056 0.9079
2019-11-11 0.9071 0.9074 0.9052 0.9062
... ... ... ... ...
2014-11-28 0.8023 0.8044 0.8004 0.8028
2014-11-27 0.7993 0.8024 0.7983 0.8022
2014-11-26 0.8014 0.8034 0.7980 0.7993
2014-11-25 0.8037 0.8059 0.8007 0.8014
2014-11-24 0.8081 0.8085 0.8032 0.8036
1570 rows × 4 columns
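The traceback means that by the time the call runs, plt.ylabel is bound to a string, not a function. A likely cause, though only an assumption since the rest of the notebook isn't shown, is a stray assignment to plt.ylabel in some earlier cell; a minimal sketch of the mistake and a way to recover:

import matplotlib.pyplot as plt

# The suspected mistake: assigning rebinds the module attribute to a str,
# so every later call raises TypeError: 'str' object is not callable.
plt.ylabel = 'Intraday USD/EUR'   # wrong: shadows the function

# The intended usage is a call, not an assignment:
# plt.ylabel('Intraday USD/EUR')

# To recover without restarting the Jupyter kernel, reload the module so
# the real function is rebound (restarting the kernel also works):
import importlib
importlib.reload(plt)
plt.ylabel('Intraday USD/EUR')    # callable again

This would also explain why the label appears once after a kernel restart and then fails on rerun: the plot renders before the shadowing assignment executes again.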

Hbase-Spark Connector issue in cloudera: java.lang.AbstractMethodError

I am trying to write a Spark dataframe into HBase, but when I perform any action or a write/save method on the dataframe, it throws the following exception:
java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.log(HBaseFilter.scala:121)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.buildFilters(HBaseFilter.scala:124)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:60)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
----------
Here is my code:
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}

def catalog = s"""{
|"table":{"namespace":"default", "name":"Contacts"},
|"rowkey":"key",
|"columns":{
|"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
|"officeAddress":{"cf":"Office", "col":"Address", "type":"string"},
|"officePhone":{"cf":"Office", "col":"Phone", "type":"string"},
|"personalName":{"cf":"Personal", "col":"Name", "type":"string"},
|"personalPhone":{"cf":"Personal", "col":"Phone", "type":"string"}
|}
|}""".stripMargin

def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}

val df = withCatalog(catalog)
I was able to create the dataframe, but when I perform
df.show()
it gives me the error:
java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.log(HBaseFilter.scala:121)
at org.apache.spark.sql.execution.datasources.hbase.HBaseFilter$.buildFilters(HBaseFilter.scala:124)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:60)
Please give some suggestions. I am importing a table from HBase, creating a catalog, and creating a dataframe on top of it, using:
Spark 1.6
HBase 1.2.0-cdh5.13.3
Cloudera
I encountered the same problem; I'm using hbase-spark 1.2.0-cdh5.8.4.
After recompiling it against version 1.2.0-cdh5.13.0, the error no longer occurred. An AbstractMethodError generally means the connector was built against different Spark/HBase binaries than the ones it runs against, so try recompiling the connector's source code against your cluster's versions or use a newer build.

Error when read data from oracle with spark

There is a field of type NUMBER(2,3) in a table in an Oracle database.
When I read this table with Spark, the following exception occurs:
org.apache.spark.sql.AnalysisException:
Decimal scale (3) cannot be greater than precision (2).
How can I solve this problem?
The code I use to read the table data is as follows:
val df = spark.read
.format("jdbc")
.option("driver", "oracle.jdbc.driver.OracleDriver")
.option("url", "jdbc:oracle:thin:#**:1521:**")
.option("user", "**")
.option("password", "**")
.option("dbtable", "table_name")
.load()
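One common workaround is to push a CAST into the dbtable subquery so that Oracle reports a decimal type whose precision is at least its scale, which Spark then accepts. This is a sketch with hypothetical column and connection names, not a confirmed fix from this thread; it is shown in PySpark, though the same options apply in Scala:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-read").getOrCreate()

# 'AMOUNT' is a hypothetical name for the NUMBER(2,3) column; the
# subquery makes Oracle report a precision/scale that Spark accepts.
query = "(SELECT CAST(AMOUNT AS NUMBER(10,3)) AS AMOUNT FROM table_name) t"

df = (spark.read
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("url", "jdbc:oracle:thin:@host:1521:sid")
      .option("user", "user")
      .option("password", "password")
      .option("dbtable", query)
      .load())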

Remove null rows in pyspark dataframe [duplicate]

This question already has answers here:
Filter Pyspark dataframe column with None value
(10 answers)
Closed 4 years ago.
When I loaded a fairly large dataset (Wikipedia's archives) into a Spark dataframe, I received the error below:
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
What is the best way to remove null values from a PySpark dataframe?
You can use na.drop() to remove all rows that contain null values:
df.na.drop()
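For context, the NullPointerException in the trace above comes from Tokenizer hitting a null in its input column, so dropping rows that are null in that column before the transform avoids it. A small self-contained sketch (toy data, hypothetical column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-nulls").getOrCreate()

# Toy rows standing in for the Wikipedia archive data.
df = spark.createDataFrame(
    [("alpha", 1), (None, 2), ("gamma", None)],
    ["text", "count"],
)

df.na.drop().show()                 # drops rows with a null in ANY column
df.na.drop(subset=["text"]).show()  # drops only rows where 'text' is null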

Duplicate column Pandas dataframe slice issue

I have a dataframe df with duplicate columns. (I need the duplicate-column dataframe, which will be passed as a parameter to matplotlib to plot, so the columns' names and contents might be the same or different.)
>>> df
PE RT Ttl_mkv PE
STK_ID RPT_Date
11_STK79 20130115 41.932 2.744 3629.155 41.932
21_STK58 20130115 14.223 0.048 30302.324 14.223
22_STK229 20130115 22.436 0.350 15968.313 22.436
23_STK34 20130115 -63.252 0.663 4168.189 -63.252
I can get the second column with df[df.columns[1]]:
>>> df[df.columns[1]]
STK_ID RPT_Date
11_STK79 20130115 2.744
21_STK58 20130115 0.048
22_STK229 20130115 0.350
23_STK34 20130115 0.663
but if I try to get the first column with df[df.columns[0]], it gives:
>>> df[df.columns[0]]
PE PE
STK_ID RPT_Date
11_STK79 20130115 41.932 41.932
21_STK58 20130115 14.223 14.223
22_STK229 20130115 22.436 22.436
23_STK34 20130115 -63.252 -63.252
The result has two columns! That brings my application down, because it only wants the first column, but pandas returns both the 1st and 4th columns. Is it a bug, or is it designed this way on purpose? How can I bypass this issue?
My pandas version is 0.8.1.
I don't really understand why you need two columns with the same name; avoiding it would probably be best.
But to answer your question, this would return only one of the 'PE' columns:
df.T.drop_duplicates().T.PE
STK_ID RPT_Date
11_STK79 20130115 41.932
21_STK58 20130115 14.223
22_STK229 20130115 22.436
23_STK34 20130115 -63.252
Name: PE
or:
df.T.ix[0].T
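On modern pandas, where .ix has long been removed, purely positional selection sidesteps the duplicate-label lookup entirely; a quick sketch mirroring the frame above:

import pandas as pd

# Toy frame with a duplicated 'PE' label, as in the question.
df = pd.DataFrame(
    [[41.932, 2.744, 3629.155, 41.932],
     [14.223, 0.048, 30302.324, 14.223]],
    index=["11_STK79", "21_STK58"],
    columns=["PE", "RT", "Ttl_mkv", "PE"],
)

first_pe = df.iloc[:, 0]  # strictly positional: only column 0
print(first_pe)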