Error after applying a UDF on a dataframe in PySpark

PySpark version: 2.3.2
I have a dataframe (df) in PySpark with the following schema:
>>> df.printSchema
<bound method DataFrame.printSchema of
DataFrame[id: string,
F: string,
D: string,
T: string,
S: string,
P: string]>
I have the following simplified UDF:
rep = UserDefinedFunction(lambda x: x.replace(":",";"))
Then I do:
df1 = df.withColumn("occ", rep(col("D")))
But df1.show() fails with an error:
df1.show()
[Stage 9:> (0 + 1) / 1]
19/08/23 23:59:15 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 (TID 30, cluster, executor 1): java.io.IOException: Cannot run program "/opt/conda/bin/python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at .....
Caused by: java.io.IOException: error=2, No such file or directory
19/08/23 23:59:16 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 9.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 350, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o339.showString.
.......

Looks like you have something wrong with your installation.
Cannot run program "/opt/conda/bin/python":
error=2, No such file or directory

The issue was that an installed conda package had overwritten the default Python, so the interpreter path Spark tried to launch no longer existed.
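One way to guard against this is to check the interpreter path before starting Spark and to set PYSPARK_PYTHON to one that exists. The following is only a sketch: the helper name resolve_worker_python is made up for illustration, and the check runs on the local machine only, so on a cluster the chosen path must also exist on every executor node.

```python
import os
import shutil
import sys


def resolve_worker_python(configured):
    """Return `configured` if it is an existing executable, else this interpreter."""
    # shutil.which returns None when the path is missing or not executable.
    if configured and shutil.which(configured):
        return configured
    return sys.executable


# "/opt/conda/bin/python" is the path from the error message above.
requested = os.environ.get("PYSPARK_PYTHON", "/opt/conda/bin/python")
os.environ["PYSPARK_PYTHON"] = resolve_worker_python(requested)
print(os.environ["PYSPARK_PYTHON"])
```

This has to run before the SparkSession or SparkContext is created, since the variable is read when the Python workers are launched.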

The problem may not be in your code.
Check which version of the Java JDK you are using. As far as I know, the .show() method is not compatible with JDK 11. If you are using that version, downgrade to JDK 8, and don't forget to configure the environment variables for JDK 8 correctly.
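To check which JDK is in use, the banner printed by java -version can be parsed. A minimal sketch, assuming the usual banner format (a quoted version string on the first line); the function name java_major_version is hypothetical:

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("unrecognized java -version output")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_181).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


# Version string taken from the JDK mentioned elsewhere on this page.
print(java_major_version('java version "1.8.0_181"'))
```

Note that java -version writes its banner to stderr, so when calling it through subprocess the text has to be read from the stderr stream.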

Related

py4j.protocol.Py4JJavaError: An error occurred while calling o27.partitions in Cloudera CDH 5.5.0 VM, Spark 2.4.7, JDK1.8.0_181

I am learning to use Spark on a personal computer with hardware capable of running Hadoop. Here's the config:
Cloudera CDH 5.5.0 w/ Cloudera Quickstart, Spark 2.4.7, JDK1.8.0_181, Hadoop 2.6.0, Python 3.6.9.
When running a Python script (copied from a Udemy video on YouTube), I ran into and fixed several errors, but I could not find any solution for the following one:
java.io.IOException: Incomplete HDFS URI, no host: hdfs:/user/cloudera/Spark/ml-100k/u.data
Traceback (most recent call last):
File "/home/cloudera/Spark/LowestRatedMovieDataFrame.py", line 75, in <module>
movieDataset = spark.createDataFrame(movies)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 746, in createDataFrame
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 390, in _createFromRDD
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 361, in _inferSchema
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1378, in first
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1327, in take
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2517, in getNumPartitions
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.partitions.
: java.io.IOException: Incomplete HDFS URI, no host: hdfs:/user/cloudera/Spark/ml-100k/u.data
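The exception text itself points at the likely cause: the hdfs: URI has no host part. A small sketch with the standard urllib.parse module makes the difference visible. The namenode host quickstart.cloudera and port 8020 are assumptions used only for illustration and do not come from the post; has_host is a made-up helper name.

```python
from urllib.parse import urlparse


def has_host(uri):
    """Return True if the URI contains a host (netloc) component."""
    return bool(urlparse(uri).netloc)


# Path as reported in the error message, with the stray spaces removed.
print(has_host("hdfs:/user/cloudera/Spark/ml-100k/u.data"))
# Hypothetical fully qualified form; host and port are assumed values.
print(has_host("hdfs://quickstart.cloudera:8020/user/cloudera/Spark/ml-100k/u.data"))
```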

Error loading library gpuarray with Theano

I am trying to run this script to test Theano's use of my GPU and get the following error:
ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/theano/gpuarray/__init__.py", line 164, in <module>
use(config.device)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/theano/gpuarray/__init__.py", line 151, in use
init_dev(device)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/theano/gpuarray/__init__.py", line 60, in init_dev
sched=config.gpuarray.sched)
File "pygpu/gpuarray.pyx", line 614, in pygpu.gpuarray.init (pygpu/gpuarray.c:9419)
File "pygpu/gpuarray.pyx", line 566, in pygpu.gpuarray.pygpu_init (pygpu/gpuarray.c:9110)
File "pygpu/gpuarray.pyx", line 1021, in pygpu.gpuarray.GpuContext.__cinit__ (pygpu/gpuarray.c:13472)
pygpu.gpuarray.GpuArrayException: Error loading library: -1
I need to use the nvidia-381 driver since my GPU is a 1080 Ti, which is not compatible with nvidia-375. I'm not sure whether it matters, but installing nvcc overwrites 381, and reinstalling 381 after setting up nvcc causes errors, so I can't use nvcc.
I can import pygpu without errors, but if I run pygpu.test() I get the following error, and I don't know how to specify the DEVICE variable without nvcc.
======================================================================
ERROR: Failure: RuntimeError (No test device specified. Specify one using the DEVICE or GPUARRAY_TEST_DEVICE environment variables.)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/failure.py", line 39, in runTest
raise self.exc_val.with_traceback(self.tb)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/loader.py", line 418, in loadTestsFromName
addr.filename, addr.module)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/importer.py", line 47, in importFromPath
return self.importFromDir(dir_path, fqname)
File "/home/me/anaconda3/envs/py35/lib/python3.5/site-packages/nose/importer.py", line 94, in importFromDir
mod = load_module(part_fqname, fh, filename, desc)
File "/home/me/anaconda3/envs/py35/lib/python3.5/imp.py", line 234, in load_module
return load_source(name, filename, file)
File "/home/me/anaconda3/envs/py35/lib/python3.5/imp.py", line 172, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 693, in _load
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 665, in exec_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "/home/me/.local/lib/python3.5/site-packages/pygpu-0.6.2-py3.5-linux-x86_64.egg/pygpu/tests/test_tools.py", line 5, in <module>
from .support import (guard_devsup, rand, check_flags, check_meta, check_all,
File "/home/me/.local/lib/python3.5/site-packages/pygpu-0.6.2-py3.5-linux-x86_64.egg/pygpu/tests/support.py", line 32, in <module>
context = gpuarray.init(get_env_dev())
File "/home/me/.local/lib/python3.5/site-packages/pygpu-0.6.2-py3.5-linux-x86_64.egg/pygpu/tests/support.py", line 29, in get_env_dev
raise RuntimeError("No test device specified. Specify one using the DEVICE or GPUARRAY_TEST_DEVICE environment variables.")
RuntimeError: No test device specified. Specify one using the DEVICE or GPUARRAY_TEST_DEVICE environment variables.
----------------------------------------------------------------------
Ran 7 tests in 0.003s
FAILED (errors=7)
<nose.result.TextTestResult run=7 errors=7 failures=0>
Warning: it's entirely possible that this is all wrong and the actual reason for your problem is in fact, as you suspect, your GPU driver.
I had the same issue with gpuarray on Windows 10.
In the end I solved it by:
completely uninstalling Python
installing CUDA 8.0 (with cuDNN 5.1)
installing Anaconda
installing Theano through Anaconda:
conda install theano pygpu
As you are using Linux: this error message basically means "It didn't work, don't ask me why" and is mostly shown when something in your setup is wrong (e.g. different compilers used for compiling Python and Theano, or an incompatible CUDA version).
I would recommend updating to CUDA 8.0 and reinstalling your Python environment through Anaconda (just in case).
On a side note: I tested your example script from the documentation, and at least that works.
Note for Windows users: never install Anaconda in a location with spaces in the path. Everything looks fine until Theano starts having trouble finding and compiling things.
Note regarding pygpu.test():
Normally you just set the environment variable:
Windows: set DEVICE=cuda
Linux: export DEVICE=cuda
But the test has a habit of reporting that you didn't specify a device when the library couldn't be loaded.
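If the tests are started from Python rather than from a shell, the same variable can be set through os.environ. A minimal sketch, assuming the value cuda from the note above; the helper name set_test_device is made up for illustration:

```python
import os


def set_test_device(device="cuda", env=os.environ):
    """Set DEVICE unless the environment already defines it, and return its value."""
    env.setdefault("DEVICE", device)
    return env["DEVICE"]


# This has to run before pygpu.test(): the traceback in the question shows
# pygpu/tests/support.py reading the variable while the module is imported.
print(set_test_device())
```

As noted above, a correctly set variable does not help if the underlying library fails to load.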

Pandas not installing in 64 bit windows

I recently installed a new version of Anaconda3 (4.2) on a Windows laptop.
All the packages work fine, but pandas has never worked for me since day one.
So I decided to uninstall it and install a new version, pandas 0.19.
While using pip:
C:\Users\>python -m pip install --user pandas
Collecting pandas
Using cached pandas-0.19.2-cp35-cp35m-win_amd64.whl
Exception:
Traceback (most recent call last):
File "D:\Anaconda3-4.2\lib\site-packages\pip\basecommand.py", line 215, in main
status = self.run(options, args)
File "D:\Anaconda3-4.2\lib\site-packages\pip\commands\install.py", line 335, in run
wb.build(autobuilding=True)
File "D:\Anaconda3-4.2\lib\site-packages\pip\wheel.py", line 749, in build
self.requirement_set.prepare_files(self.finder)
File "D:\Anaconda3-4.2\lib\site-packages\pip\req\req_set.py", line 380, in prepare_files
ignore_dependencies=self.ignore_dependencies))
File "D:\Anaconda3-4.2\lib\site-packages\pip\req\req_set.py", line 620, in _prepare_file
session=self.session, hashes=hashes)
File "D:\Anaconda3-4.2\lib\site-packages\pip\download.py", line 821, in unpack_url
hashes=hashes
File "D:\Anaconda3-4.2\lib\site-packages\pip\download.py", line 663, in unpack_http_url
unpack_file(from_path, location, content_type, link)
File "D:\Anaconda3-4.2\lib\site-packages\pip\utils\__init__.py", line 599, in unpack_file
flatten=not filename.endswith('.whl')
File "D:\Anaconda3-4.2\lib\site-packages\pip\utils\__init__.py", line 499, in unzip_file
fp = open(fn, 'wb')
PermissionError: [Errno 13] Permission denied: C:\\Users\\AppData\\Local\\Temp\\pip-build-h5ip5q8f\\pandas\\pandas/io/tests/data/blank.xls
While using conda:
An unexpected error has occurred.
Please consider posting the following information to the
conda GitHub issue tracker at:
https://github.com/conda/conda/issues
Current conda install:
platform : win-64
conda version : 4.3.8
conda is private : False
conda-env version : 4.3.8
conda-build version : 2.0.2
python version : 3.5.2.final.0
requests version : 2.12.4
root environment : D:\Anaconda3-4.2 (writable)
default environment : D:\Anaconda3-4.2
envs directories : D:\Anaconda3-4.2\envs
package cache : D:\Anaconda3-4.2\pkgs
channel URLs : https://repo.continuum.io/pkgs/free/win-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/r/win-64
https://repo.continuum.io/pkgs/r/noarch
https://repo.continuum.io/pkgs/pro/win-64
https://repo.continuum.io/pkgs/pro/noarch
https://repo.continuum.io/pkgs/msys2/win-64
https://repo.continuum.io/pkgs/msys2/noarch
config file : None
offline mode : False
user-agent : conda/4.3.8 requests/2.12.4 CPython/3.5.2 Windows/7
Windows/6.1.7601
D:\Anaconda3-4.2\Scripts\conda-script.py install pandas
Traceback (most recent call last):
File "D:\Anaconda3-4.2\lib\site-packages\conda\exceptions.py", line 617, in conda_exception_handler
return_value = func(*args, **kwargs)
File "D:\Anaconda3-4.2\lib\site-packages\conda\cli\main.py", line 137, in _main
exit_code = args.func(args, p)
File "D:\Anaconda3-4.2\lib\site-packages\conda\cli\main_install.py", line 80, in execute
install(args, parser, 'install')
File "D:\Anaconda3-4.2\lib\site-packages\conda\cli\install.py", line 347, in install
execute_actions(actions, index, verbose=not context.quiet)
File "D:\Anaconda3-4.2\lib\site-packages\conda\plan.py", line 837, in execute_actions
execute_instructions(plan, index, verbose)
File "D:\Anaconda3-4.2\lib\site-packages\conda\instructions.py", line 258, in execute_instructions
cmd(state, arg)
File "D:\Anaconda3-4.2\lib\site-packages\conda\instructions.py", line 118, in UNLINKLINKTRANSACTION_CMD
txn = UnlinkLinkTransaction.create_from_dists(index, prefix, unlink_dists, link_dists)
File "D:\Anaconda3-4.2\lib\site-packages\conda\core\link.py", line 121, in create_from_dists
for dist, pkg_dir in zip(link_dists, pkg_dirs_to_link))
File "D:\Anaconda3-4.2\lib\site-packages\conda\core\link.py", line 121, in <genexpr>
for dist, pkg_dir in zip(link_dists, pkg_dirs_to_link))
File "D:\Anaconda3-4.2\lib\site-packages\conda\gateways\disk\read.py", line 71, in read_package_info
index_json_record = read_index_json(extracted_package_directory)
File "D:\Anaconda3-4.2\lib\site-packages\conda\gateways\disk\read.py", line 94, in read_index_json
with open(join(extracted_package_directory, 'info', 'index.json')) as fi:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Anaconda3-4.2\\pkgs\\pandas-0.19.2-np111py35_1\\info\\index.json
The problem is that you are trying to install pandas as a regular user, and therefore cannot change the contents of the Anaconda installation folder (where installed Python packages live). You need to run CMD or PowerShell (whichever you are using) as an administrator: right-click its shortcut in the Start menu, click "Run as administrator", then run the same command again.
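To see whether the current user can actually write to a given folder, one option is to try creating a temporary file in it. This is only a sketch; can_write_to is a hypothetical helper, and the directory checked at the bottom is simply the site-packages path of whichever interpreter runs the script.

```python
import sysconfig
import tempfile


def can_write_to(directory):
    """Return True if a temporary file can be created inside `directory`."""
    try:
        # The file is deleted automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers PermissionError as well as a missing directory.
        return False


if __name__ == "__main__":
    site_packages = sysconfig.get_paths()["purelib"]
    print(site_packages, can_write_to(site_packages))
```

If this prints False in a normal prompt and True in one started as administrator, the permission explanation above applies.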
Try using the Anaconda package manager, conda, instead of pip:
C:\> conda install pandas

I get "SyntaxError" for rst2pdf command after installing RST2PDF in Windows 7

After installing rst2pdf on Windows 7, running the command rst2pdf or rst2pdf -h gives:
SyntaxError: invalid syntax
I installed rst2pdf 0.93 using pip and added the Scripts folder of the Python directory to the PATH.
Here is the error:
c:>rst2pdf
Traceback (most recent call last):
File "C:\Users\IBM_ADMIN\AppData\Local\Programs\Python\Python35\Scripts\rst2pdf-script.py", line 9, in <module>
load_entry_point('rst2pdf===0.93.dev-r0', 'console_scripts', 'rst2pdf')()
File "C:\Users\IBM_ADMIN\AppData\Local\Programs\Python\Python35\lib\site-packages\pkg_resources\__init__.py", line 558, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "C:\Users\IBM_ADMIN\AppData\Local\Programs\Python\Python35\lib\site-packages\pkg_resources\__init__.py", line 2682, in load_entry_point
return ep.load()
File "C:\Users\IBM_ADMIN\AppData\Local\Programs\Python\Python35\lib\site-packages\pkg_resources\__init__.py", line 2355, in load
return self.resolve()
File "C:\Users\IBM_ADMIN\AppData\Local\Programs\Python\Python35\lib\site-packages\pkg_resources\__init__.py", line 2361, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "C:\Users\IBM_ADMIN\AppData\Local\Programs\Python\Python35\lib\site-packages\rst2pdf\createpdf.py", line 695
except ValueError, v:
^
SyntaxError: invalid syntax
The syntax for exception handling changed in Python 3: instead of except ValueError, v one must write except ValueError as v.
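For illustration, here is the Python 3 form in a small self-contained function. The function parse_int is made up for this example and has nothing to do with the rst2pdf code base.

```python
def parse_int(text):
    """Convert text to int, returning a message instead of raising on bad input."""
    try:
        return int(text)
    except ValueError as v:  # Python 2 code wrote this as: except ValueError, v:
        return "could not parse: {}".format(v)


print(parse_int("42"))
print(parse_int("forty-two"))
```

The as form also works on Python 2.6 and 2.7, which is why ported code usually switches to it everywhere.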
Apparently, this version of rst2pdf (0.93) does not support Python 3. To use it, you must have Python 2.7 installed. Some attempts have been made to port rst2pdf, but the efforts seem to have stalled.
Alternatively, you may try pandoc.

How to resolve a 'The report "report.custom" already exists!' error while launching OpenERP 7.0

On Ubuntu 12.04, I am trying to run OpenERP 7.0 from the latest tarball archive:
wget "http://nightly.openerp.com/7.0/nightly/src/openerp-7.0-latest.tar.gz"
tar -xzf openerp-7.0-latest.tar.gz
# cd extracted directory
./openerp-server —xmlrpc-port=40069 —netrpc-port=40070 —addons-path=openerp/addons,openerp/web/addons
Here is what I get:
Traceback (most recent call last):
File "./openerp-server", line 5, in <module>
openerp.cli.main()
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/cli/__init__.py", line 51, in main
__import__(m)
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/modules/module.py", line 133, in load_module
mod = imp.load_module('openerp.addons.' + module_part, f, path, descr)
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/addons/account_test/__init__.py", line 2, in <module>
import report
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/addons/account_test/report/__init__.py", line 1, in <module>
import account_test_report
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/addons/account_test/report/account_test_report.py", line 25, in <module>
from report import report_sxw
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/report/__init__.py", line 25, in <module>
import custom
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/report/custom.py", line 623, in <module>
report_custom('report.custom')
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/report/custom.py", line 58, in __init__
report_int.__init__(self, name)
File "/path/to/decrompressed/directory/openerp-7.0-20140113-001013/openerp/report/interface.py", line 45, in __init__
assert not self.exists(name), 'The report "%s" already exists!' % name
AssertionError: The report "report.custom" already exists!
I have already successfully run a 6.0 instance on this same machine, although that server is not running while I try to launch the 7.0 version. I don't know whether the problem is related to the earlier 6.0 installation. Could it be a PostgreSQL problem? I didn't drop the existing 6.0 database.
Running the server without any parameters seems to work:
./openerp-server
2014-01-13 14:26:35,270 1014 INFO ? openerp: OpenERP version 7.0-20140113-001013
…