Accessing files in S3 bucket from pyspark - amazon-s3

I am trying to access a file stored in an S3 bucket from PySpark code, but I get the error message attached below.
The program works fine with files stored locally.
I tried the s3://, s3a:// and s3n:// schemes, but none of them works.
Code:
ACCESS_KEY = "*********"
SECRET_KEY = "**********"
EncodedSecretKey = SECRET_KEY.replace("/", "%2F")
s3url="s3n://"+ACCESS_KEY+":"+EncodedSecretKey+"#"+bucket_name+"/"+file_name
sqlContext.read.option("delimiter", delimiter).load(s3url,
    format='com.databricks.spark.csv',
    header='true',
    inferSchema='true')
Error Message
Traceback (most recent call last):
File "C:\Users\sachari\AppData\Local\Temp\zeppelin_pyspark-5481670497409059953.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "C:\Users\sachari\AppData\Local\Temp\zeppelin_pyspark-5481670497409059953.py", line 355, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 14, in <module>
File "<stdin>", line 10, in get_df
File "C:\zeppelin\interpreter\spark\pyspark\pyspark.zip\pyspark\sql\readwriter.py", line 149, in load
return self._df(self._jreader.load(path))
File "C:\zeppelin\interpreter\spark\pyspark\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\zeppelin\interpreter\spark\pyspark\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\zeppelin\interpreter\spark\pyspark\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o537.load.
: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Unknown Source)

You have to load the AWS package. For the pyspark shell, load it as shown below; the same --packages option also works with spark-submit:
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
Alternatively, set the credentials as described in the Hadoop-AWS documentation:
https://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html
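As an illustration only (not part of the original answer), a minimal sketch of setting the credentials in code with the s3a connector, assuming the hadoop-aws package above is on the classpath and reusing the sc/sqlContext and placeholder variables from the question:

# Minimal sketch, assuming hadoop-aws is loaded via --packages; bucket_name,
# file_name and delimiter are the placeholders from the question.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)

df = sqlContext.read.option("delimiter", delimiter).load(
    "s3a://" + bucket_name + "/" + file_name,
    format='com.databricks.spark.csv',
    header='true',
    inferSchema='true')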

Related

use AWS_PROFILE in pandas.read_parquet

I'm testing this locally, where I have a ~/.aws/config file.
~/.aws/config looks something like:
[profile a]
...
[profile b]
...
I also have an AWS_PROFILE environment variable set to "a".
I would like to read a file that is accessible with profile b using pandas.
I am able to access it through s3fs by doing:
import s3fs
fs = s3fs.S3FileSystem(profile="b")
fs.get("BUCKET/FILE.parquet", "FILE.parquet")
pd.read_parquet("FILE.parquet")
However, if I try to pass this to pd.read_parquet using storage_options, I get a PermissionError: Forbidden.
pd.read_parquet(
    "s3://BUCKET/FILE.parquet",
    storage_options={"profile": "b"},
)
Full traceback below:
Traceback (most recent call last):
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 233, in _call_s3
out = await method(**additional_kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/aiobotocore/client.py", line 154, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
return impl.read(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
return self.api.parquet.read_table(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1672, in read_table
dataset = _ParquetDatasetV2(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1504, in __init__
if filesystem.get_file_info(path_or_paths).is_file:
File "pyarrow/_fs.pyx", line 438, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/_fs.pyx", line 1004, in pyarrow._fs._cb_get_file_info
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/fs.py", line 226, in get_file_info
info = self.fs.info(path)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
raise result[0]
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
result[0] = await coro
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 911, in _info
out = await self._call_s3(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 252, in _call_s3
raise translate_boto_error(err)
PermissionError: Forbidden
Note: there is an old question somewhat related to this but it didn't help: How to read parquet file from s3 using dask with specific AWS profile
You just need to add the following argument to the function:
storage_options=dict(profile='your_profile_name')
Hence the read statement is:
pd.read_parquet("s3://your_bucket",storage_options=dict(profile='your_profile_name'))

How to fix "RPC response exceeds maximum data length" error while trying to read from hdfs using tensorflow?

I am trying to read a CSV file from HDFS using TensorFlow, but I am encountering the error below.
I have already tried the fix from OOZIE: JA009: RPC response exceeds maximum data length, but it does not work for me.
Here is my TensorFlow code:
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["hdfs://192.168.60.41:50070/DLTest-3k.csv"])
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    id, val = sess.run([key, value])
    for v in val.splitlines():
        print(v.decode())

    coord.request_stop()
    coord.join(threads)
Output of hadoop version:
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Hadoop 3.1.1
Source code repository Unknown -r Unknown
Compiled by root on 2019-07-19T06:28Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /root/hadoop-3.1.1-src/hadoop-dist/target/hadoop-3.1.1/share/hadoop/common/hadoop-common-3.1.1.jar
Python version: 3.5.2
Tensorflow version: 1.14.0
This is the error message I am receiving.
2019-07-22 09:54:50,474 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-07-22 09:54:51,932 WARN net.NetUtils: Unable to wrap exception of type class org.apache.hadoop.ipc.RpcException: it has no (String) constructor
java.lang.NoSuchMethodException: org.apache.hadoop.ipc.RpcException.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:830)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
hdfsOpenFile(/DLTest-3k.csv): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
RpcException: RPC response exceeds maximum data lengthjava.io.IOException: Failed on local exception: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length; Host Details : local host is: "f35daeba55f7/172.17.0.3"; destination host is: "bigbraindev.razorthink.net":50070;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:816)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
Caused by: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length
at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1816)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1167)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1063)
2019-07-22 09:54:51.969178: W tensorflow/core/kernels/queue_base.cc:277] _0_input_producer: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: hdfs://192.168.60.41:50070/DLTest-3k.csv; Unknown error 255
[[{{node ReaderReadV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 14, in <module>
id, val = sess.run([key, value])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: hdfs://192.168.60.41:50070/DLTest-3k.csv; Unknown error 255
[[node ReaderReadV2 (defined at test.py:6) ]]
Errors may have originated from an input operation.
Input Source operations connected to node ReaderReadV2:
WholeFileReaderV2 (defined at test.py:5)
input_producer (defined at test.py:3)
Original stack trace for 'ReaderReadV2':
File "test.py", line 6, in <module>
key, value = reader.read(filename_queue)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/io_ops.py", line 166, in read
return gen_io_ops.reader_read_v2(self._reader_ref, queue_ref, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1105, in reader_read_v2
queue_handle=queue_handle, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()

SSL: certificate_verify_failed using Google Assistant SDK

I'm trying to get the Google Assistant SDK sample running on a Raspberry Pi. After requesting the authorization URL with
google-oauthlib-tool --client-secrets /home/pi/client_secret_<my-id>.json --scope https://www.googleapis.com/auth/assistant-sdk-prototype --save --headless
I enter the code from the provided URL and get this dump:
Traceback (most recent call last):
File "/home/pi/env/lib/python3.4/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/pi/env/lib/python3.4/site-packages/urllib3/connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "/home/pi/env/lib/python3.4/site-packages/urllib3/connectionpool.py", line 844, in _validate_conn
conn.connect()
File "/home/pi/env/lib/python3.4/site-packages/urllib3/connection.py", line 326, in connect
ssl_context=context)
File "/home/pi/env/lib/python3.4/site-packages/urllib3/util/ssl_.py", line 325, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/lib/python3.4/ssl.py", line 364, in wrap_socket
_context=self)
File "/usr/lib/python3.4/ssl.py", line 577, in __init__
self.do_handshake()
File "/usr/lib/python3.4/ssl.py", line 804, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/pi/env/lib/python3.4/site-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/home/pi/env/lib/python3.4/site-packages/urllib3/connectionpool.py", line 630, in urlopen
raise SSLError(e)
urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/pi/env/bin/google-oauthlib-tool", line 11, in <module>
sys.exit(main())
File "/home/pi/env/lib/python3.4/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/pi/env/lib/python3.4/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/pi/env/lib/python3.4/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/pi/env/lib/python3.4/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/pi/env/lib/python3.4/site-packages/google_auth_oauthlib/tool/__main__.py", line 106, in main
creds = flow.run_console()
File "/home/pi/env/lib/python3.4/site-packages/google_auth_oauthlib/flow.py", line 358, in run_console
self.fetch_token(code=code)
File "/home/pi/env/lib/python3.4/site-packages/google_auth_oauthlib/flow.py", line 235, in fetch_token
**kwargs)
File "/home/pi/env/lib/python3.4/site-packages/requests_oauthlib/oauth2_session.py", line 221, in fetch_token
verify=verify, proxies=proxies)
File "/home/pi/env/lib/python3.4/site-packages/requests/sessions.py", line 560, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/home/pi/env/lib/python3.4/site-packages/requests_oauthlib/oauth2_session.py", line 360, in request
headers=headers, data=data, **kwargs)
File "/home/pi/env/lib/python3.4/site-packages/requests/sessions.py", line 513, in request
resp = self.send(prep, **send_kwargs)
File "/home/pi/env/lib/python3.4/site-packages/requests/sessions.py", line 623, in send
r = adapter.send(request, **kwargs)
File "/home/pi/env/lib/python3.4/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
I checked this post, which had a similar issue, and updated my clock accordingly, but that did not seem to be the problem. I have re-downloaded all the libraries and tools and made sure Python is up to date, but I still can't get this to work. Is there any solution to this issue?
It is also worth noting that the same trace occurs regardless of what I enter as the authorization code; even entering "x" produces the same output.
I was following the instructions at https://developers.google.com/assistant/sdk/develop/python/run-sample, but I think those instructions omit the step:
sudo apt-get install portaudio19-dev libffi-dev libssl-dev
After doing that, re-running google-oauthlib-tool worked.
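As a hypothetical quick check (not part of the original answer), you can confirm from the same Python environment that certificate verification now succeeds before re-running the tool:

import requests

# Any HTTPS endpoint will do; if no SSLError is raised, verification works.
r = requests.get("https://www.googleapis.com/")
print(r.status_code)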

Numpy error when printing an RDD in Spark with IPython

I am trying to print an RDD using Spark in IPython, and when I do I get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-77015cd18335> in <module>()
---> 24 print inputData.collect()
25
26
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/pyspark/rdd.pyc in collect(self)
771 """
772 with SCCallSiteSync(self.context) as css:
--> 773 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
774 return list(_load_from_socket(port, self._jrdd_deserializer))
775
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
34 def deco(*a, **kw):
35 try:
---> 36 return f(*a, **kw)
37 except py4j.protocol.Py4JJavaError as e:
38 s = e.java_exception.toString()
/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 56, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 421, in loads
return pickle.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 421, in loads
return pickle.loads(obj)
File "/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
My current code is:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import os.path
import numpy as np

print np.version.version

def extract(line):
    return line[1]

inputPath = os.path.join('file1.csv')
fileName = os.path.join(inputPath)

Data = sc.textFile(fileName).zipWithIndex().filter(lambda (line, rownum): rownum > 0).map(lambda (line, rownum): line)

inputData = (Data
             .map(lambda line: line.split(";"))
             .filter(lambda line: len(line) > 1)
             .map(extract))  # Map to tuples

# error comes at this line
print inputData.collect()
I already have numpy installed (sudo apt-get install python-numpy) and can print the numpy version in IPython using numpy.version.version.
Why does this error occur, and how can I resolve it?
NOTE 1: My current bash_profile:
# Set the Spark Home as an environment variable.
export SPARK_HOME="$HOME/spark-1.5.0-bin-hadoop2.6"
# Define your Spark arguments for when running Spark.
export PYSPARK_SUBMIT_ARGS="--master local[2]"
# IPython alias for the use with SPARK.
alias IPYSPARK='PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --profile=pyspark --ip=0.0.0.0" $SPARK_HOME/bin/pyspark'
I have also added following to my spark-env.sh.template file:
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython
NOTE 2: I am launching the Ipython notebook from inside a virtual environment.
NOTE 3: I have Spark 1.5.0 and numpy 1.8.2
UPDATE: output from sc.parallelize([],1).mapPartitions(lambda _: [(sys.executable, sys.path)]).first()
('/home/vagrant/pyEnv/bin/python2.7', ['', u'/tmp/spark-dbbcfd0b-413e-4406-8bd5-37de29d3fcc5/userFiles-6296ba2d-4ec5-4956-9904-828bda0c6424', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar', '/home/vagrant/spark-1.5.0-bin-hadoop2.6/python', '/home/vagrant', '/home/vagrant/pyEnv/lib/python2.7', '/home/vagrant/pyEnv/lib/python2.7/plat-x86_64-linux-gnu', '/home/vagrant/pyEnv/lib/python2.7/lib-tk', '/home/vagrant/pyEnv/lib/python2.7/lib-old', '/home/vagrant/pyEnv/lib/python2.7/lib-dynload', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/home/vagrant/pyEnv/local/lib/python2.7/site-packages', '/home/vagrant/pyEnv/lib/python2.7/site-packages'])
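A hypothetical follow-up diagnostic in the same spirit (not from the original post) is to check whether that executor Python can import numpy at all:

# Mirrors the sys.executable check above: runs on an executor and reports
# either the numpy version found there or the ImportError it raises.
def has_numpy(_):
    try:
        import numpy
        return [numpy.__version__]
    except ImportError as e:
        return [str(e)]

print sc.parallelize([], 1).mapPartitions(has_numpy).first()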

What does Jython need for SimpleXMLRPCServer to work?

I tried the first example from the Python documentation for SimpleXMLRPCServer.
It works perfectly fine in Python, but when I try it using Jython I get the following
stack trace:
Traceback (most recent call last):
File "sampleclient.py", line 4, in <module>
print s.pow(2,3) # Returns 2**3 = 8
File "/usr/share/jython/Lib/xmlrpclib.py", line 1147, in __call__
return self.__send(self.__name, args)
File "/usr/share/jython/Lib/xmlrpclib.py", line 1433, in _ServerProxy__request
response = self.__transport.request(
File "/usr/share/jython/Lib/xmlrpclib.py", line 1201, in request
return self._parse_response(h.getfile(), sock)
File "/usr/share/jython/Lib/xmlrpclib.py", line 1324, in _parse_response
p, u = self.getparser()
File "/usr/share/jython/Lib/xmlrpclib.py", line 1210, in getparser
return getparser(use_datetime=self._use_datetime)
File "/usr/share/jython/Lib/xmlrpclib.py", line 1023, in getparser
parser = ExpatParser(target)
File "/usr/share/jython/Lib/xmlrpclib.py", line 536, in __init__
self._parser = parser = expat.ParserCreate(None, None)
File "/usr/share/jython/Lib/xml/parsers/expat.py", line 63, in ParserCreate
return XMLParser(encoding, namespace_separator)
File "/usr/share/jython/Lib/xml/parsers/expat.py", line 91, in __init__
self._reader = XMLReaderFactory.createXMLReader(_xerces_parser)
java.lang.ClassNotFoundException: org.apache.xerces.parsers.SAXParser
at org.xml.sax.helpers.XMLReaderFactory.loadClass(XMLReaderFactory.java:229)
at org.xml.sax.helpers.XMLReaderFactory.createXMLReader(XMLReaderFactory.java:220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
org.xml.sax.SAXException: org.xml.sax.SAXException: SAX2 driver class org.apache.xerces.parsers.SAXParser not found
java.lang.ClassNotFoundException: org.apache.xerces.parsers.SAXParser
I'm much more experienced with Python than Java, so I'm not sure how to fix this. Is a dependency missing from my system, or do I need to fiddle with classpaths?
EDIT: I re-tried this with the stand-alone versions from Jython's website and they worked, so it seems something might be broken with Ubuntu's packaged version.
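For reference, the client half of the docs example being run (sampleclient.py in the trace above) is roughly the following, assuming the matching server example is listening on localhost:8000:

import xmlrpclib

# Connect to the SimpleXMLRPCServer example and call the registered pow().
s = xmlrpclib.ServerProxy('http://localhost:8000')
print s.pow(2, 3)  # Returns 2**3 = 8 -- this is the call that fails under Jython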