I am trying to read a CSV file from HDFS using TensorFlow, but I am encountering the error below.
I have already tried the fix from OOZIE: JA009: RPC response exceeds maximum data length, but it is not working for me.
Here is my TensorFlow code:
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["hdfs://192.168.60.41:50070/DLTest-3k.csv"])

reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    id, val = sess.run([key, value])
    for v in val.splitlines():
        print(v.decode())

    coord.request_stop()
    coord.join(threads)
hadoop version
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Hadoop 3.1.1
Source code repository Unknown -r Unknown
Compiled by root on 2019-07-19T06:28Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /root/hadoop-3.1.1-src/hadoop-dist/target/hadoop-3.1.1/share/hadoop/common/hadoop-common-3.1.1.jar
Python version: 3.5.2
TensorFlow version: 1.14.0
This is the error message I am receiving.
2019-07-22 09:54:50,474 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-07-22 09:54:51,932 WARN net.NetUtils: Unable to wrap exception of type class org.apache.hadoop.ipc.RpcException: it has no (String) constructor
java.lang.NoSuchMethodException: org.apache.hadoop.ipc.RpcException.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:830)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
hdfsOpenFile(/DLTest-3k.csv): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
RpcException: RPC response exceeds maximum data lengthjava.io.IOException: Failed on local exception: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length; Host Details : local host is: "f35daeba55f7/172.17.0.3"; destination host is: "bigbraindev.razorthink.net":50070;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:816)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
Caused by: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length
at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1816)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1167)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1063)
2019-07-22 09:54:51.969178: W tensorflow/core/kernels/queue_base.cc:277] _0_input_producer: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: hdfs://192.168.60.41:50070/DLTest-3k.csv; Unknown error 255
[[{{node ReaderReadV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 14, in <module>
id, val = sess.run([key, value])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: hdfs://192.168.60.41:50070/DLTest-3k.csv; Unknown error 255
[[node ReaderReadV2 (defined at test.py:6) ]]
Errors may have originated from an input operation.
Input Source operations connected to node ReaderReadV2:
WholeFileReaderV2 (defined at test.py:5)
input_producer (defined at test.py:3)
Original stack trace for 'ReaderReadV2':
File "test.py", line 6, in <module>
key, value = reader.read(filename_queue)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/io_ops.py", line 166, in read
return gen_io_ops.reader_read_v2(self._reader_ref, queue_ref, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1105, in reader_read_v2
queue_handle=queue_handle, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
I am not able to convert an RDD to a DataFrame using a custom schema. The details, along with the code, are below.
It works when I use a customSchema as below:
>>> customSchema = StructType([
... StructField("EID",StringType()),\
... StructField("Name",StringType()),\
... StructField("email",StringType()),\
... StructField("Salary",StringType()),\
... StructField("PlaceName",StringType()),\
... StructField("County",StringType()),\
... StructField("City",StringType()),\
... StructField("Gender",StringType())\
... ])
>>>
>>> myDF = spark.createDataFrame(emp1,customSchema)
>>> myDF1 = myDF.withColumn("EID",col("EID").cast("integer")).withColumn("Salary",col("Salary").cast("integer"))
>>> myDF1.show()
+------+--------------------+--------------------+------+--------------+--------------------+--------------+------+
| EID| Name| email|Salary| PlaceName| County| City|Gender|
+------+--------------------+--------------------+------+--------------+--------------------+--------------+------+
|111135| Darell T Grizzle|darell.grizzle#ya...|196416| Tallahassee| Leon| Tallahassee| M|
|111159| Deanna Z Nestor|deanna.nestor#gma...|184760| Collegeport| Matagorda| Collegeport| F|
|111160| Marion G Mcqueary|marion.mcqueary#y...|189506| Flensburg| Morrison| Flensburg| M|
|111175| Monserrate D Bentz|monserrate.bentz#...|184412|South Freeport| Cumberland|South Freeport| F|
|111214| Jamie E Spataro|jamie.spataro#gma...|189926| Gilliam| Saline| Gilliam| M|
|111228| Ernest J Woolbright|ernest.woolbright...|194929| Tacoma| Tacoma| Tacoma| M|
|111243| Ivette F Manzanares|ivette.manzanares...|189834| Lemasters| Franklin| Lemasters| F|
|111274| Erwin F Bouchard|erwin.bouchard#ao...|184390| Bessemer City| Gaston| Bessemer City| M|
|111293| Walton E Garza|walton.garza#comc...|198280| Suncook| Merrimack| Suncook| M|
|111316| Jospeh E Holle|jospeh.holle#gmai...|181878| Wagon Mound| Mora| Wagon Mound| M|
|111327| Angelo S Fizer|angelo.fizer#ibm.com|199654| Zelienople| Butler| Zelienople| M|
|111350| Numbers H Luo| numbers.luo#aol.com|198095| Eva| Benton| Eva| M|
|111359| Jim Z Jewett|jim.jewett#gmail.com|198956| Hatchechubbee| Russell| Hatchechubbee| M|
|111396| Edward M Pentecost|edward.pentecost#...|194979| Dayhoit| Harlan| Dayhoit| M|
|111403| Henry F Lawyer|henry.lawyer#appl...|198515| Washington|District of Columbia| Washington| M|
|111442| Manual X Meany|manual.meany#yaho...|196608| Hunter| Cass| Hunter| M|
|111446| Ethan V Folmar|ethan.folmar#yaho...|188581| Ridgeview| Boone| Ridgeview| M|
|111449| Tanja J Sparrow|tanja.sparrow#yah...|195398| Tower City| Cass| Tower City| F|
|111478|Leigha K Courtema...|leigha.courtemanc...|195306| Sun Valley| Blaine| Sun Valley| F|
|111514| Rob F Struck|rob.struck#gmail.com|198750| Centertown| Cole| Centertown| M|
+------+--------------------+--------------------+------+--------------+--------------------+--------------+------+
only showing top 20 rows
But it fails when I use a schema where I directly define EID and Salary as IntegerType, as below:
>>> customSchema = StructType([
... StructField("EID",IntegerType()),\
... StructField("Name",StringType()),\
... StructField("email",StringType()),\
... StructField("Salary",IntegerType()),\
... StructField("PlaceName",StringType()),\
... StructField("County",StringType()),\
... StructField("City",StringType()),\
... StructField("Gender",StringType())\
... ])
Full code below:
>>> rdd = sc.textFile("C:/sparkCourse/filetext/part-00000-646a1d36-8f75-4eee-b937-135e933ede7f-c000.csv").map(lambda row: row.split(','))
>>> rdd.take(1)
[['EID', 'Name', 'email', 'Salary', 'PlaceName', 'County', 'City', 'Gender']]
>>> header = rdd.first()
>>> emp = rdd.filter(lambda row: row != header)
>>> emp.take(1)
[['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M']]
>>> emp1 = emp.map(lambda fields:[fields[0],fields[1],fields[2],fields[3],fields[4],fields[5],fields[6],fields[7]])
>>> emp1.take(1)
[['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M']]
>>>
>>> customSchema = StructType([
... StructField("EID",IntegerType()),\
... StructField("Name",StringType()),\
... StructField("email",StringType()),\
... StructField("Salary",IntegerType()),\
... StructField("PlaceName",StringType()),\
... StructField("County",StringType()),\
... StructField("City",StringType()),\
... StructField("Gender",StringType())\
... ])
>>> myDF = spark.createDataFrame(emp1,customSchema)
I get the below error:
IntegerType can not accept object '111135' in type <class 'str'>
But why does it allow the column to be cast to an integer later, and not at the time of defining the schema?
Where am I going wrong?
>>> myDF.show()
[Stage 47:> (0 + 1) / 1]19/02/08 19:54:21 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 55)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/02/08 19:54:21 WARN TaskSetManager: Lost task 0.0 in stage 47.0 (TID 55, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/02/08 19:54:21 ERROR TaskSetManager: Task 0 in stage 47.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\sql\dataframe.py", line 350, in show
print(self._jdf.showString(n, 20, vertical))
File "C:\spark\python\lib\py4j-0.10.6-src.zip\py4j\java_gateway.py", line 1160, in __call__
File "C:\spark\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark\python\lib\py4j-0.10.6-src.zip\py4j\protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1148.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 1 times, most recent failure: Lost task 0.0 in stage 47.0 (TID 55, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
>>>
It would work if you don't define the schema at the beginning: just read the CSV with spark.read.csv(....) and then transform the columns with cast.
So if you just want to convert these columns from string to integer, you could use the following code:
from pyspark.sql.functions import *
df1= sqlContext.createDataFrame([('111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M'),\
('111136', 'Darell X Xrizzle', 'darell.Xrizzle#yahoo.ca', '206416', 'Example', 'Leroy', 'Example', 'W')],\
['EID', 'Name', 'email', 'Salary', 'PlaceName', 'County', 'City', 'Gender'])
#the code above is only used to create a dataframe with a similar format,
#and the functions import gives access to the columns via col()
df1 = df1.withColumn("EID", col("EID").cast("int")).withColumn("Salary", col("Salary").cast("int"))
#this line casts the string columns to integer
df1.printSchema()
df1.show(truncate=False)
Output:
root
|-- EID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- email: string (nullable = true)
|-- Salary: integer (nullable = true)
|-- PlaceName: string (nullable = true)
|-- County: string (nullable = true)
|-- City: string (nullable = true)
|-- Gender: string (nullable = true)
+------+----------------+-----------------------+------+-----------+------+-----------+------+
|EID |Name |email |Salary|PlaceName |County|City |Gender|
+------+----------------+-----------------------+------+-----------+------+-----------+------+
|111135|Darell T Grizzle|darell.grizzle#yahoo.ca|196416|Tallahassee|Leon |Tallahassee|M |
|111136|Darell X Xrizzle|darell.Xrizzle#yahoo.ca|206416|Example |Leroy |Example |W |
+------+----------------+-----------------------+------+-----------+------+-----------+------+
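The same pattern works when reading straight from the CSV file: read it without a schema (so everything arrives as a string) and cast afterwards. A minimal sketch, assuming a Spark 2.x SparkSession named spark and reusing the file path from the question:
from pyspark.sql.functions import col

# Read the CSV as plain strings (the header row is handled by the reader),
# then cast the numeric columns afterwards.
df = spark.read.option("header", "true") \
    .csv("C:/sparkCourse/filetext/part-00000-646a1d36-8f75-4eee-b937-135e933ede7f-c000.csv")

df = df.withColumn("EID", col("EID").cast("integer")) \
       .withColumn("Salary", col("Salary").cast("integer"))

df.printSchema()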
If you want to work with an RDD, you can use the following code and apply a map function that converts the respective columns:
x = sc.parallelize([['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M']])
customSchema = StructType([
StructField("EID",IntegerType()),\
StructField("Name",StringType()),\
StructField("email",StringType()),\
StructField("Salary",IntegerType()),\
StructField("PlaceName",StringType()),\
StructField("County",StringType()),\
StructField("City",StringType()),\
StructField("Gender",StringType())\
])
x = x.map(lambda fields: [int(fields[0]),fields[1],fields[2],int(fields[3]),fields[4],fields[5],fields[6],fields[7]]).collect()
myDF = spark.createDataFrame(x,customSchema)
myDF.show()
Output:
+------+----------------+--------------------+------+-----------+------+-----------+------+
| EID| Name| email|Salary| PlaceName|County| City|Gender|
+------+----------------+--------------------+------+-----------+------+-----------+------+
|111135|Darell T Grizzle|darell.grizzle#ya...|196416|Tallahassee| Leon|Tallahassee| M|
+------+----------------+--------------------+------+-----------+------+-----------+------+
In case anyone wants to do the same task using SparkSession, below is the code:
df = spark.read.option("header","true").schema(customSchema).csv("C:/sparkCourse/filetext/part-00000-646a1d36-8f75-4eee-b937-135e933ede7f-c000.csv")
But any help using the sparkContext would really be appreciated.
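A sketch of the sc.textFile route, combining the original pipeline from the question with the int conversion shown in the RDD example above (same file path and IntegerType customSchema as before):
rdd = sc.textFile("C:/sparkCourse/filetext/part-00000-646a1d36-8f75-4eee-b937-135e933ede7f-c000.csv") \
        .map(lambda row: row.split(','))
header = rdd.first()
emp = rdd.filter(lambda row: row != header)

# Convert EID and Salary to int inside the map so the rows satisfy the
# IntegerType fields of customSchema.
emp1 = emp.map(lambda f: [int(f[0]), f[1], f[2], int(f[3]), f[4], f[5], f[6], f[7]])

myDF = spark.createDataFrame(emp1, customSchema)
myDF.show()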
I am trying to access a file stored in an S3 bucket from PySpark code, and it is giving me the error message below.
The program works fine with files stored locally.
I tried using s3://, s3a://, and s3n://, but none of them seems to work.
Code:
ACCESS_KEY = "*********"
SECRET_KEY = "**********"
EncodedSecretKey = SECRET_KEY.replace("/", "%2F")
s3url="s3n://"+ACCESS_KEY+":"+EncodedSecretKey+"#"+bucket_name+"/"+file_name
sqlContext.read.option("delimiter",delimiter).load(s3url,
format='com.databricks.spark.csv',
header='true',
inferSchema='true')
Error Message
Traceback (most recent call last):
File "C:\Users\sachari\AppData\Local\Temp\zeppelin_pyspark-5481670497409059953.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "C:\Users\sachari\AppData\Local\Temp\zeppelin_pyspark-5481670497409059953.py", line 355, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 14, in <module>
File "<stdin>", line 10, in get_df
File "C:\zeppelin\interpreter\spark\pyspark\pyspark.zip\pyspark\sql\readwriter.py", line 149, in load
return self._df(self._jreader.load(path))
File "C:\zeppelin\interpreter\spark\pyspark\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\zeppelin\interpreter\spark\pyspark\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\zeppelin\interpreter\spark\pyspark\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o537.load.
: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Unknown Source)
You have to load the AWS package. For the pyspark shell you can load it as shown below; the same --packages option also works with the spark-submit command.
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
Alternatively, set the credentials as shown in the link below:
https://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html
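With the hadoop-aws package loaded, a minimal sketch of passing the credentials through the Hadoop configuration instead of embedding them in the URL (fs.s3a.access.key and fs.s3a.secret.key are the property names from the documentation linked above; the other variables are the ones from the question):
# Assumes a SparkContext named sc and the variables from the question.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)

s3url = "s3a://" + bucket_name + "/" + file_name
df = sqlContext.read.option("delimiter", delimiter).load(s3url,
        format='com.databricks.spark.csv',
        header='true',
        inferSchema='true')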
I am trying to print an RDD using Spark in IPython, and when I do that I get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-77015cd18335> in <module>()
---> 24 print inputData.collect()
25
26
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/pyspark/rdd.pyc in collect(self)
771 """
772 with SCCallSiteSync(self.context) as css:
--> 773 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
774 return list(_load_from_socket(port, self._jrdd_deserializer))
775
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
34 def deco(*a, **kw):
35 try:
---> 36 return f(*a, **kw)
37 except py4j.protocol.Py4JJavaError as e:
38 s = e.java_exception.toString()
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 56, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 421, in loads
return pickle.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 421, in loads
return pickle.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
My current code is:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import os.path
import numpy as np
print np.version.version
def extract(line):
    return (line[1])

inputPath = os.path.join('file1.csv')
fileName = os.path.join(inputPath)
Data = sc.textFile(fileName).zipWithIndex().filter(lambda (line,rownum): rownum>0).map(lambda (line, rownum): line)

inputData = (Data
             .map(lambda line: line.split(";"))
             .filter(lambda line: len(line) > 1)
             .map(extract))  # map to tuples

# the error comes at this line
print inputData.collect()
I already have numpy installed (sudo apt-get install python-numpy) and can print the numpy version in IPython using numpy.version.version.
Why does this error occur, and how can I resolve it?
NOTE 1: My current bash_profile:
# Set the Spark Home as an environment variable.
export SPARK_HOME="$HOME/spark-1.5.0-bin-hadoop2.6"
# Define your Spark arguments for when running Spark.
export PYSPARK_SUBMIT_ARGS="--master local[2]"
# IPython alias for the use with SPARK.
alias IPYSPARK='PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --profile=pyspark --ip=0.0.0.0" $SPARK_HOME/bin/pyspark'
I have also added the following to my spark-env.sh.template file:
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython
NOTE 2: I am launching the IPython notebook from inside a virtual environment.
NOTE 3: I have Spark 1.5.0 and numpy 1.8.2
UPDATE: output from sc.parallelize([],1).mapPartitions(lambda _: [(sys.executable, sys.path)]).first()
('/home/vagrant/pyEnv/bin/python2.7', ['', u'/tmp/spark-dbbcfd0b-413e-4406-8bd5-37de29d3fcc5/userFiles-6296ba2d-4ec5-4956-9904-828bda0c6424', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/python', '/home/vagrant', '/home/vagrant/pyEnv/lib/python2.7', '/home/vagrant/pyEnv/lib/python2.7/plat-x86_64-linux-gnu', '/home/vagrant/pyEnv/lib/python2.7/lib-tk', '/home/vagrant/pyEnv/lib/python2.7/lib-old', '/home/vagrant/pyEnv/lib/python2.7/lib-dynload', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/home/vagrant/pyEnv/local/lib/python2.7/site-packages', '/home/vagrant/pyEnv/lib/python2.7/site-packages'])
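For completeness, a similar probe that attempts the numpy import on an executor itself might look like this (a sketch following the same pattern as the check above):
def probe(_):
    # Try importing numpy on the worker and report the result.
    try:
        import numpy
        return [numpy.__version__]
    except ImportError as e:
        return [str(e)]

print sc.parallelize([0], 1).mapPartitions(probe).first()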