I'm trying to install numpy in a virtual environment using pip, but building the wheel fails.
The problem only occurs inside the virtualenv; I can install and update numpy on my system Python just fine.
The environment is Windows 10, Python 3.10.5, pip 22.3.1.
Running from numpy source directory.
setup.py:67: DeprecationWarning:
`numpy.distutils` is deprecated since NumPy 1.23.0, as a result
of the deprecation of `distutils` itself. It will be removed for
Python >= 3.12. For older Python versions it will remain present.
It is recommended to use `setuptools < 60.0` for those Python versions.
For more details, see:
https://numpy.org/devdocs/reference/distutils_status_migration.html
import numpy.distutils.command.sdist
Processing numpy/random\_bounded_integers.pxd.in
Processing numpy/random\bit_generator.pyx
Processing numpy/random\mtrand.pyx
Processing numpy/random\_bounded_integers.pyx.in
Processing numpy/random\_common.pyx
Processing numpy/random\_generator.pyx
Processing numpy/random\_mt19937.pyx
Processing numpy/random\_pcg64.pyx
Processing numpy/random\_philox.pyx
Processing numpy/random\_sfc64.pyx
Cythonizing sources
INFO: blas_opt_info:
INFO: blas_armpl_info:
Looking for python310.dll
Traceback (most recent call last):
File "C:\stable_diffusion\diffusers_venv\lib\python3.10\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
main()
File "C:\stable_diffusion\diffusers_venv\lib\python3.10\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "C:\stable_diffusion\diffusers_venv\lib\python3.10\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 261, in build_wheel
return _build_backend().build_wheel(wheel_directory, config_settings,
File "C:\Users\Vitor\AppData\Local\Temp\pip-build-env-amr3d935\overlay\lib\python3.10\site-packages\setuptools\build_meta.py", line 230, in build_wheel
return self._build_with_temp_dir(['bdist_wheel'], '.whl',
File "C:\Users\Vitor\AppData\Local\Temp\pip-build-env-amr3d935\overlay\lib\python3.10\site-packages\setuptools\build_meta.py", line 215, in _build_with_temp_dir
self.run_setup()
File "C:\Users\Vitor\AppData\Local\Temp\pip-build-env-amr3d935\overlay\lib\python3.10\site-packages\setuptools\build_meta.py", line 267, in run_setup
super(_BuildMetaLegacyBackend,
File "C:\Users\Vitor\AppData\Local\Temp\pip-build-env-amr3d935\overlay\lib\python3.10\site-packages\setuptools\build_meta.py", line 158, in run_setup
exec(compile(code, __file__, 'exec'), locals())
File "setup.py", line 479, in <module>
setup_package()
File "setup.py", line 471, in setup_package
setup(**metadata)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\core.py", line 135, in setup
config = configuration()
File "setup.py", line 118, in configuration
config.add_subpackage('numpy')
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\misc_util.py", line 1050, in add_subpackage
config_list = self.get_subpackage(subpackage_name, subpackage_path,
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\misc_util.py", line 1016, in get_subpackage
config = self._get_configuration_from_setup_py(
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\misc_util.py", line 958, in _get_configuration_from_setup_py
config = setup_module.configuration(*args)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\setup.py", line 9, in configuration
config.add_subpackage('core')
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\misc_util.py", line 1050, in add_subpackage
config_list = self.get_subpackage(subpackage_name, subpackage_path,
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\misc_util.py", line 1016, in get_subpackage
config = self._get_configuration_from_setup_py(
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\misc_util.py", line 958, in _get_configuration_from_setup_py
config = setup_module.configuration(*args)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\core\setup.py", line 853, in configuration
blas_info = get_info('blas_opt', 0)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 585, in get_info
return cl().get_info(notfound_action)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 845, in get_info
self.calc_info()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 2077, in calc_info
if self._calc_info(blas):
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 2063, in _calc_info
return getattr(self, '_calc_info_{}'.format(name))()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 1979, in _calc_info_armpl
info = get_info('blas_armpl')
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 585, in get_info
return cl().get_info(notfound_action)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 845, in get_info
self.calc_info()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 1337, in calc_info
info = self.check_libs2(lib_dirs, armpl_libs)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 1001, in check_libs2
exts = self.library_extensions()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 960, in library_extensions
c = customized_ccompiler()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\system_info.py", line 216, in customized_ccompiler
global_compiler = _customized_ccompiler()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\__init__.py", line 62, in customized_ccompiler
c = ccompiler.new_compiler(plat=plat, compiler=compiler, verbose=verbose)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\ccompiler.py", line 780, in new_compiler
compiler = klass(None, dry_run, force)
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\mingw32ccompiler.py", line 64, in __init__
build_import_library()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\mingw32ccompiler.py", line 348, in build_import_library
return _build_import_library_amd64()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\mingw32ccompiler.py", line 397, in _build_import_library_amd64
dll_file = find_python_dll()
File "C:\Users\Vitor\AppData\Local\Temp\pip-install-1sr3obr4\numpy_fa19a5e39e9448e7b6553e0f5a275159\numpy\distutils\mingw32ccompiler.py", line 219, in find_python_dll
raise ValueError("%s not found in %s" % (dllname, lib_dirs))
ValueError: python310.dll not found in ['C:\\stable_diffusion\\diffusers_venv\\', 'C:\\stable_diffusion\\diffusers_venv\\lib', 'C:\\stable_diffusion\\diffusers_venv\\bin', 'C:\\Program Files\\Inkscape\\', 'C:\\Program Files\\Inkscape\\lib', 'C:\\Program Files\\Inkscape\\bin', 'C:\\WINDOWS\\System32']
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for numpy
Failed to build numpy
ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
Since I had faced a similar problem before, my instinct was to update pip, change the Python version, or try an older, specific version of numpy, but none of that works and the problem persists.
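The traceback goes through `mingw32ccompiler.py` and searches `C:\Program Files\Inkscape\` for `python310.dll`, which hints (this is an assumption, not something the log states outright) that the virtualenv was created from a MinGW-built Python rather than the python.org one. pip only falls back to building numpy from source when no prebuilt wheel matches the interpreter, so it can help to print what the venv's interpreter reports about itself. A minimal sketch, with `describe_interpreter` being a made-up helper name:

```python
import sys
import sysconfig


def describe_interpreter():
    """Collect the facts that decide whether pip can pick a prebuilt wheel."""
    return {
        # Interpreter that is actually running (the venv's python).
        "executable": sys.executable,
        # Installation the venv was created from.
        "base_prefix": sys.base_prefix,
        # Full version string, including the compiler used for the build.
        "version": sys.version,
        # Platform tag; python.org builds report "win-amd64" on 64-bit Windows.
        "platform": sysconfig.get_platform(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Run it with the venv activated. If `base_prefix` points somewhere other than the system Python that installs numpy without trouble, recreating the venv from that system Python is probably worth trying.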
I am trying to create a DataFrame with a proper schema after reading data from a text file. In the RDD all values are strings, but one of the fields is an integer, and I want it to be created as an integer. So I defined a StructType and created the DataFrame with it, but it throws the error below.
Error Message:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
in ()
----> 1 df.show()
/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc
in show(self, n, truncate, vertical)
376 """
377 if isinstance(truncate, bool) and truncate:
--> 378 print(self._jdf.showString(n, 20, vertical))
379 else:
380 print(self._jdf.showString(n, int(truncate), vertical))
/Applications/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.pyc
in __call__(self, *args)
1284 answer = self.gateway_client.send_command(command)
1285 return_value = get_return_value(
-> 1286 answer, self.gateway_client, self.target_id, self.name)
1287
1288 for temp_arg in temp_args:
/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.pyc
in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/Applications/anaconda2/lib/python2.7/site-packages/py4j/protocol.pyc
in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o64.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0
in stage 3.0 (TID 5, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent
call last): File
"/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
line 377, in main
process() File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile) File
"/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
line 393, in dump_stream
vs = list(itertools.islice(iterator, batch)) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py",
line 99, in wrapper
return f(*args, **kwargs) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/session.py",
line 730, in prepare
verify_func(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1389, in verify
verify_value(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1370, in verify_struct
verifier(v) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1389, in verify
verify_value(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1315, in verify_integer
verify_acceptable_types(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1278, in verify_acceptable_types
% (dataType, obj, type(obj)))) TypeError: field id: IntegerType can not accept object u'1' in type
at
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at
org.apache.spark.scheduler.Task.run(Task.scala:121) at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) at
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
at
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363) at
org.apache.spark.sql.Dataset.head(Dataset.scala:2544) at
org.apache.spark.sql.Dataset.take(Dataset.scala:2758) at
org.apache.spark.sql.Dataset.getRows(Dataset.scala:254) at
org.apache.spark.sql.Dataset.showString(Dataset.scala:291) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.lang.Thread.run(Thread.java:748) Caused by:
org.apache.spark.api.python.PythonException: Traceback (most recent
call last): File
"/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
line 377, in main
process() File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py",
line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile) File
"/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",
line 393, in dump_stream
vs = list(itertools.islice(iterator, batch)) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py",
line 99, in wrapper
return f(*args, **kwargs) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/session.py",
line 730, in prepare
verify_func(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1389, in verify
verify_value(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1370, in verify_struct
verifier(v) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1389, in verify
verify_value(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1315, in verify_integer
verify_acceptable_types(obj) File "/Users/nagaraju.n/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/types.py",
line 1278, in verify_acceptable_types
% (dataType, obj, type(obj)))) TypeError: field id: IntegerType can not accept object u'1' in type
at
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at
org.apache.spark.scheduler.Task.run(Task.scala:121) at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
#!/usr/bin/env python
# coding: utf-8

# In[11]:
import os
import sys
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate()

# In[12]:
# Reads data from file and creates rdd
rdd = sc.textFile('/Users/nagaraju.n/Downloads/sample_data.txt')

# In[13]:
type(rdd)

# In[14]:
rdd_data = rdd.map(lambda p: p.split(","))

# In[15]:
rdd_data.collect()

# In[16]:
print(rdd_data)

# In[17]:
orig_header = rdd_data.first()

# In[18]:
type(orig_header)

# In[19]:
rdd_withoutheader = rdd_data.filter(lambda p: p != orig_header)

# In[20]:
rdd_withoutheader.collect()

# In[21]:
# Create Schema
header = StructType([
    StructField("id", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("phone", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
])

# In[22]:
header

# In[23]:
df = spark.createDataFrame(rdd_withoutheader, header)

# In[24]:
df.show()
/// Part of your code:
header = StructType([StructField("stockticker", StringType(), True),StructField("tradedate", IntegerType(), True),StructField("openprice", FloatType(), True),StructField("highprice", FloatType(), True),StructField("lowprice", FloatType(), True),StructField("closeprice", FloatType(), True),StructField("volume", IntegerType(), True)])
df=spark.createDataFrame(rdd_data,header)
///
My answer:
A schema is mostly used to avoid a full scan of the data to infer types; it does not perform any type casting. The approach above therefore works best for JSON/Avro/Parquet input files, not for text files. For text files, the following methods work best:
Method 1: based on your code, convert the RDD to a DataFrame and name the columns as below:
rdd=sc.textFile('/Users/nagaraju.n/Downloads/sample_data.txt')
df_noType=rdd.map(lambda p: p.split(",")).toDF(["id", "first_name", "last_name", "email", "phone", "city", "country"])
Now you can cast the types in either of these ways:
Way 1:
df_typecast=df_noType.select(df_noType.id.cast('int'), df_noType.first_name, df_noType.last_name, df_noType.email, df_noType.phone, df_noType.city, df_noType.country)
Note: in the line above there is no need to cast the other fields to string, as they are strings by default.
Note: if the values contain decimals, you can use df_noType.id.cast('float') instead.
(or)
Way 2:
from pyspark.sql.types import *
df_typecast=df_noType.select(df_noType.id.cast(IntegerType()), df_noType.first_name.cast(StringType()), df_noType.last_name.cast(StringType()), df_noType.email.cast(StringType()), df_noType.phone.cast(StringType()), df_noType.city.cast(StringType()), df_noType.country.cast(StringType()))
Method 2: this is the one I usually use, as I find it the simplest:
rdd=sc.textFile('/Users/nagaraju.n/Downloads/sample_data.txt')
from pyspark.sql import Row
df=rdd.map(lambda p: Row(id= int(p.split(",")[0]), first_name= p.split(",")[1], last_name= p.split(",")[2], email= p.split(",")[3], phone= p.split(",")[4], city= p.split(",")[5], country=p.split(",")[6])).toDF()
df.printSchema()
Note: if the values contain decimals, you can use float(p.split(",")[0]) instead.
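If you would rather keep your original StructType, another option is to convert the values in Python before handing the rows to Spark, so the id field is already an int when the schema is checked. A minimal sketch in plain Python, assuming seven comma-separated fields per line with no quoted commas; `parse_line` is a hypothetical helper name and the sample row is made up:

```python
def parse_line(line):
    """Split one CSV line and convert the first field (id) to int."""
    fields = line.split(",")
    # Only the id is numeric; the remaining fields stay as strings.
    return (int(fields[0]),) + tuple(fields[1:])


# Made-up sample row in the same shape as the question's columns.
sample = "1,John,Doe,john@example.com,555-0100,Paris,France"
row = parse_line(sample)
print(row)
# (1, 'John', 'Doe', 'john@example.com', '555-0100', 'Paris', 'France')
```

With Spark you would map this over the text RDD after dropping the header line and pass the result to spark.createDataFrame together with the schema; since the id is then a real int, the IntegerType check should no longer reject it.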
I am not able to convert an RDD to a DataFrame using a custom schema. The details and the code are below.
It works when I use a customSchema as below:
>>> customSchema = StructType([
... StructField("EID",StringType()),\
... StructField("Name",StringType()),\
... StructField("email",StringType()),\
... StructField("Salary",StringType()),\
... StructField("PlaceName",StringType()),\
... StructField("County",StringType()),\
... StructField("City",StringType()),\
... StructField("Gender",StringType())\
... ])
>>>
>>> myDF = spark.createDataFrame(emp1,customSchema)
>>> myDF1 = myDF.withColumn("EID",col("EID").cast("integer")).withColumn("Salary",col("Salary").cast("integer"))
>>> myDF1.show()
+------+--------------------+--------------------+------+--------------+--------------------+--------------+------+
| EID| Name| email|Salary| PlaceName| County| City|Gender|
+------+--------------------+--------------------+------+--------------+--------------------+--------------+------+
|111135| Darell T Grizzle|darell.grizzle#ya...|196416| Tallahassee| Leon| Tallahassee| M|
|111159| Deanna Z Nestor|deanna.nestor#gma...|184760| Collegeport| Matagorda| Collegeport| F|
|111160| Marion G Mcqueary|marion.mcqueary#y...|189506| Flensburg| Morrison| Flensburg| M|
|111175| Monserrate D Bentz|monserrate.bentz#...|184412|South Freeport| Cumberland|South Freeport| F|
|111214| Jamie E Spataro|jamie.spataro#gma...|189926| Gilliam| Saline| Gilliam| M|
|111228| Ernest J Woolbright|ernest.woolbright...|194929| Tacoma| Tacoma| Tacoma| M|
|111243| Ivette F Manzanares|ivette.manzanares...|189834| Lemasters| Franklin| Lemasters| F|
|111274| Erwin F Bouchard|erwin.bouchard#ao...|184390| Bessemer City| Gaston| Bessemer City| M|
|111293| Walton E Garza|walton.garza#comc...|198280| Suncook| Merrimack| Suncook| M|
|111316| Jospeh E Holle|jospeh.holle#gmai...|181878| Wagon Mound| Mora| Wagon Mound| M|
|111327| Angelo S Fizer|angelo.fizer#ibm.com|199654| Zelienople| Butler| Zelienople| M|
|111350| Numbers H Luo| numbers.luo#aol.com|198095| Eva| Benton| Eva| M|
|111359| Jim Z Jewett|jim.jewett#gmail.com|198956| Hatchechubbee| Russell| Hatchechubbee| M|
|111396| Edward M Pentecost|edward.pentecost#...|194979| Dayhoit| Harlan| Dayhoit| M|
|111403| Henry F Lawyer|henry.lawyer#appl...|198515| Washington|District of Columbia| Washington| M|
|111442| Manual X Meany|manual.meany#yaho...|196608| Hunter| Cass| Hunter| M|
|111446| Ethan V Folmar|ethan.folmar#yaho...|188581| Ridgeview| Boone| Ridgeview| M|
|111449| Tanja J Sparrow|tanja.sparrow#yah...|195398| Tower City| Cass| Tower City| F|
|111478|Leigha K Courtema...|leigha.courtemanc...|195306| Sun Valley| Blaine| Sun Valley| F|
|111514| Rob F Struck|rob.struck#gmail.com|198750| Centertown| Cole| Centertown| M|
+------+--------------------+--------------------+------+--------------+--------------------+--------------+------+
only showing top 20 rows
But it fails when I use a Schema (where I directly define EID and Salary as IntegerType) as below:
>>> customSchema = StructType([
... StructField("EID",IntegerType()),\
... StructField("Name",StringType()),\
... StructField("email",StringType()),\
... StructField("Salary",IntegerType()),\
... StructField("PlaceName",StringType()),\
... StructField("County",StringType()),\
... StructField("City",StringType()),\
... StructField("Gender",StringType())\
... ])
Full code below:
>>> rdd = sc.textFile("C:/sparkCourse/filetext/part-00000-646a1d36-8f75-4eee-b937-135e933ede7f-c000.csv").map(lambda row: row.split(','))
>>> rdd.take(1)
[['EID', 'Name', 'email', 'Salary', 'PlaceName', 'County', 'City', 'Gender']]
>>> header = rdd.first()
>>> emp = rdd.filter(lambda row: row != header)
>>> emp.take(1)
[['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M']]
>>> emp1 = emp.map(lambda fields:[fields[0],fields[1],fields[2],fields[3],fields[4],fields[5],fields[6],fields[7]])
>>> emp1.take(1)
[['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M']]
>>>
>>> customSchema = StructType([
... StructField("EID",IntegerType()),\
... StructField("Name",StringType()),\
... StructField("email",StringType()),\
... StructField("Salary",IntegerType()),\
... StructField("PlaceName",StringType()),\
... StructField("County",StringType()),\
... StructField("City",StringType()),\
... StructField("Gender",StringType())\
... ])
>>> myDF = spark.createDataFrame(emp1,customSchema)
I get the error below:
IntegerType can not accept object '111135' in type <class 'str'>
But why does it allow the column to be cast to an integer later, and not at the time the schema is defined?
Where am I going wrong?
>>> myDF.show()
[Stage 47:> (0 + 1) / 1]
19/02/08 19:54:21 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 55)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/02/08 19:54:21 WARN TaskSetManager: Lost task 0.0 in stage 47.0 (TID 55, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/02/08 19:54:21 ERROR TaskSetManager: Task 0 in stage 47.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\sql\dataframe.py", line 350, in show
print(self._jdf.showString(n, 20, vertical))
File "C:\spark\python\lib\py4j-0.10.6-src.zip\py4j\java_gateway.py", line 1160, in __call__
File "C:\spark\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark\python\lib\py4j-0.10.6-src.zip\py4j\protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1148.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 1 times, most recent failure: Lost task 0.0 in stage 47.0 (TID 55, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\spark\python\pyspark\sql\session.py", line 671, in prepare
verify_func(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1402, in verify_struct
verifier(v)
File "C:\spark\python\pyspark\sql\types.py", line 1421, in verify
verify_value(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1347, in verify_integer
verify_acceptable_types(obj)
File "C:\spark\python\pyspark\sql\types.py", line 1310, in verify_acceptable_types
% (dataType, obj, type(obj))))
TypeError: field EID: IntegerType can not accept object '111135' in type <class 'str'>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
>>>
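This should work if you don't define the schema up front: just read the CSV with spark.read.csv(....) and then convert the columns with cast. The error occurs because schema verification does not convert values, so a field declared as IntegerType rejects the string '111135' instead of parsing it.
So if you just want to convert these columns from string to integer, you could use the following code: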
from pyspark.sql.functions import col

# Build a small DataFrame in a similar format (every column starts out as a string)
df1 = sqlContext.createDataFrame(
    [('111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M'),
     ('111136', 'Darell X Xrizzle', 'darell.Xrizzle#yahoo.ca', '206416', 'Example', 'Leroy', 'Example', 'W')],
    ['EID', 'Name', 'email', 'Salary', 'PlaceName', 'County', 'City', 'Gender'])

# col() accesses a column by name; cast("int") converts the string columns to integer
df1 = df1.withColumn("EID", col("EID").cast("int")).withColumn("Salary", col("Salary").cast("int"))
df1.printSchema()
df1.show(truncate=False)
Output:
root
|-- EID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- email: string (nullable = true)
|-- Salary: integer (nullable = true)
|-- PlaceName: string (nullable = true)
|-- County: string (nullable = true)
|-- City: string (nullable = true)
|-- Gender: string (nullable = true)
+------+----------------+-----------------------+------+-----------+------+-----------+------+
|EID |Name |email |Salary|PlaceName |County|City |Gender|
+------+----------------+-----------------------+------+-----------+------+-----------+------+
|111135|Darell T Grizzle|darell.grizzle#yahoo.ca|196416|Tallahassee|Leon |Tallahassee|M |
|111136|Darell X Xrizzle|darell.Xrizzle#yahoo.ca|206416|Example |Leroy |Example |W |
+------+----------------+-----------------------+------+-----------+------+-----------+------+
If you want to work with an RDD, you can apply a map function that converts the respective columns before creating the DataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

x = sc.parallelize([['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416', 'Tallahassee', 'Leon', 'Tallahassee', 'M']])
customSchema = StructType([
    StructField("EID", IntegerType()),
    StructField("Name", StringType()),
    StructField("email", StringType()),
    StructField("Salary", IntegerType()),
    StructField("PlaceName", StringType()),
    StructField("County", StringType()),
    StructField("City", StringType()),
    StructField("Gender", StringType())
])
# Convert EID (index 0) and Salary (index 3) to int so the rows match the schema.
# collect() returns the rows to the driver as a list; createDataFrame also accepts an RDD.
x = x.map(lambda fields: [int(fields[0]), fields[1], fields[2], int(fields[3]), fields[4], fields[5], fields[6], fields[7]]).collect()
myDF = spark.createDataFrame(x, customSchema)
myDF.show()
Output:
+------+----------------+--------------------+------+-----------+------+-----------+------+
| EID| Name| email|Salary| PlaceName|County| City|Gender|
+------+----------------+--------------------+------+-----------+------+-----------+------+
|111135|Darell T Grizzle|darell.grizzle#ya...|196416|Tallahassee| Leon|Tallahassee| M|
+------+----------------+--------------------+------+-----------+------+-----------+------+
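The lambda in the map call above could also be pulled out into a named function, which makes it easier to handle values that are not valid integers. The following is only a sketch in plain Python: convert_row and its int_positions parameter are hypothetical names that do not appear in the original code, and mapping unparseable values to None is an assumption that relies on the schema fields being nullable (the default for StructField).

```python
def convert_row(fields, int_positions=(0, 3)):
    """Return a copy of fields with the given positions converted to int.

    Values that cannot be parsed as integers become None (an assumption
    made for this sketch; the original lambda would raise instead).
    """
    converted = list(fields)
    for pos in int_positions:
        try:
            converted[pos] = int(converted[pos])
        except (TypeError, ValueError):
            converted[pos] = None
    return converted


row = ['111135', 'Darell T Grizzle', 'darell.grizzle#yahoo.ca', '196416',
       'Tallahassee', 'Leon', 'Tallahassee', 'M']
converted = convert_row(row)
```

It would then be passed to the RDD as x.map(convert_row) in place of the lambda.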
In case anyone wants to do the same task using SparkSession, the code is below:
df = spark.read.option("header","true").schema(customSchema).csv("C:/sparkCourse/filetext/part-00000-646a1d36-8f75-4eee-b937-135e933ede7f-c000.csv")
However, any help doing this with the SparkContext would be really appreciated.
I've been getting IndexError: too many indices for array at the first squares.append line. The other posts on IndexError seemed a little too confusing, so a simple explanation of why I'm getting this would be great!
import numpy as np

def check_squares(grid):
    squares=[]
    count_dic={}
    count_dic[0]=0
    for num in range(1,10):
        count_dic[num]=0
    # Flatten each of the nine 3x3 squares into a list
    squares.append(list(((np.array(grid)[:3,:3]).reshape(-1))))
    squares.append(list(((np.array(grid)[:3,3:6]).reshape(-1))))
    squares.append(list(((np.array(grid)[:3,6:9]).reshape(-1))))
    squares.append(list(((np.array(grid)[3:6,:3]).reshape(-1))))
    squares.append(list(((np.array(grid)[3:6,3:6]).reshape(-1))))
    squares.append(list(((np.array(grid)[3:6,6:9]).reshape(-1))))
    squares.append(list(((np.array(grid)[6:9,:3]).reshape(-1))))
    squares.append(list(((np.array(grid)[6:9,3:6]).reshape(-1))))
    squares.append(list(((np.array(grid)[6:9,6:9]).reshape(-1))))
    # Count the digits in each square and reject any repeat
    for lst in squares:
        for i in lst:
            count_dic[i]+=1
        count_dic[0]=0
        if (all(value <=1 for value in count_dic.values()))==True:
            for num in range(1,10):
                count_dic[num]=0
        else:
            return False
    return True
ill_formed = [[5,3,4,6,7,8,9,1,2],
[6,7,2,1,9,5,3,4,8],
[1,9,8,3,4,2,5,6,7],
[8,5,9,7,6,1,4,2,3],
[4,2,6,8,5,3,7,9], # <---
[7,1,3,9,2,4,8,5,6],
[9,6,1,5,3,7,2,8,4],
[2,8,7,4,1,9,6,3,5],
[3,4,5,2,8,6,1,7,9]]
valid = [[5,3,4,6,7,8,9,1,2],
[6,7,2,1,9,5,3,4,8],
[1,9,8,3,4,2,5,6,7],
[8,5,9,7,6,1,4,2,3],
[4,2,6,8,5,3,7,9,1],
[7,1,3,9,2,4,8,5,6],
[9,6,1,5,3,7,2,8,4],
[2,8,7,4,1,9,6,3,5],
[3,4,5,2,8,6,1,7,9]]
print(check_squares(valid))
print(check_squares(ill_formed))
Error message: in check_squares
squares.append(list(((np.array(grid)[:3,:3]).reshape(-1))))
IndexError: too many indices for array
traceback.print_stack()
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py", line 245, in <module>
main()
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py", line 241, in main
kernel.start()
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
super(ZMQIOLoop, self).start()
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2698, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2808, in run_ast_nodes
if self.run_code(code, result):
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-e4a4d877ed6a>", line 1, in <module>
traceback.print_stack()
Really appreciate any help, thanks!
Here's my solution; you can try improving on it.
import numpy as np

def check_squares(grid):
    grid = np.array(grid)
    # Each row must contain nine distinct values
    for i in grid:
        unique, counts = np.unique(i, return_counts=True)
        #print(unique, counts)
        if len(counts) < 9:
            return False
    # Split into three horizontal bands, then split each band into three 3x3 squares
    for i in np.split(grid, 3):
        for t in np.hsplit(i, 3):
            unique1, counts1 = np.unique(t, return_counts=True)
            #print(unique1, counts1)
            if len(counts1) < 9:
                return False
    return True
Learn about NumPy array splitting and slicing. There is no need to slice and reshape nine times to obtain the 3x3 sub-arrays; I have used np.split and np.hsplit instead.
Perform all your matrix operations in NumPy.
You can use np.unique with return_counts=True to get the unique numbers in your input and how many times each one occurs. You can use this to detect whether a sub-array contains repeated values.
Suggestions: currently you are not checking whether your input is valid. For example, if the grid contains 12, the code above does not handle it.
You were also not checking whether the rows and columns are valid. I have handled the rows; I'll leave checking whether each column is valid up to you.
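As for the IndexError itself: the fifth row of ill_formed has only eight entries. Older NumPy versions turn such a ragged list into a one-dimensional array of lists, so a two-dimensional slice like [:3,:3] fails with "too many indices"; newer versions reject the ragged input outright when building the array. Checking the shape before any slicing avoids both cases. The sketch below uses plain Python; is_well_formed and columns_valid are hypothetical helper names, not part of the code above, and the two sample grids are copied from the question.

```python
def is_well_formed(grid):
    """Return True if grid has 9 rows of 9 entries, each an int from 1 to 9."""
    if len(grid) != 9:
        return False
    for row in grid:
        if len(row) != 9:
            return False
        if any(not isinstance(v, int) or not 1 <= v <= 9 for v in row):
            return False
    return True


def columns_valid(grid):
    """Return True if no column repeats a digit (assumes a well-formed grid)."""
    # zip(*grid) transposes the grid, yielding one tuple per column
    return all(len(set(column)) == 9 for column in zip(*grid))


valid_grid = [[5, 3, 4, 6, 7, 8, 9, 1, 2],
              [6, 7, 2, 1, 9, 5, 3, 4, 8],
              [1, 9, 8, 3, 4, 2, 5, 6, 7],
              [8, 5, 9, 7, 6, 1, 4, 2, 3],
              [4, 2, 6, 8, 5, 3, 7, 9, 1],
              [7, 1, 3, 9, 2, 4, 8, 5, 6],
              [9, 6, 1, 5, 3, 7, 2, 8, 4],
              [2, 8, 7, 4, 1, 9, 6, 3, 5],
              [3, 4, 5, 2, 8, 6, 1, 7, 9]]

# Same grid, but the fifth row is one entry short, as in the question
ragged_grid = [row[:] for row in valid_grid]
ragged_grid[4] = [4, 2, 6, 8, 5, 3, 7, 9]
```

Calling is_well_formed first and returning False early means check_squares never tries to slice a ragged grid.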
Check it out. Happy learning!
I just downloaded Odoo, installed the dependencies and configured my IDE. After I start Odoo and am prompted to create my database and choose a password, it fails with the following error:
ParseError: "MissingError
One of the documents you are trying to access has been deleted, please try again after refreshing." while parsing /path/to/odoo/openerp/addons/base/base_data.xml:4, near
<record id="view_menu" model="ir.ui.view">
<field name="name">ir.ui.menu.tree</field>
<field name="model">ir.ui.menu</field>
<field name="arch" type="xml">
<tree string="Menu" toolbar="1">
<field icon="icon" name="name"/>
</tree>
</field>
<field name="field_parent">child_id</field>
</record>
I've tried dropping the database several times to no avail.
The current stacktrace:
2015-06-21 04:55:00,293 16487 INFO odoo-test werkzeug: 127.0.0.1 - - [21/Jun/2015 04:55:00] "GET /web/login?redirect=http%3A%2F%2F0.0.0.0%3A8069%2Fweb%3Fdb%3Dodoo-test HTTP/1.1" 500 -
2015-06-21 04:55:00,303 16487 ERROR odoo-test werkzeug: Error on request:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/werkzeug/serving.py", line 177, in run_wsgi
execute(self.server.app)
File "/usr/local/lib/python2.7/dist-packages/werkzeug/serving.py", line 165, in execute
application_iter = app(environ, start_response)
File "/path/to/proj/odoo/openerp/service/server.py", line 285, in app
return self.app(e, s)
File "/path/to/proj/odoo/openerp/service/wsgi_server.py", line 216, in application
return application_unproxied(environ, start_response)
File "/path/to/proj/odoo/openerp/service/wsgi_server.py", line 202, in application_unproxied
result = handler(environ, start_response)
File "/path/to/proj/odoo/openerp/http.py", line 1281, in __call__
return self.dispatch(environ, start_response)
File "/path/to/proj/odoo/openerp/http.py", line 1255, in __call__
return self.app(environ, start_wrapped)
File "/usr/local/lib/python2.7/dist-packages/werkzeug/wsgi.py", line 588, in __call__
return self.app(environ, start_response)
File "/path/to/proj/odoo/openerp/http.py", line 1427, in dispatch
response = self.get_response(httprequest, result, explicit_session)
File "/path/to/proj/odoo/openerp/http.py", line 1362, in get_response
result = request.registry['ir.http']._handle_exception(e)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_http.py", line 146, in _handle_exception
return request._handle_exception(exception)
File "/path/to/proj/odoo/openerp/http.py", line 659, in _handle_exception
return super(HttpRequest, self)._handle_exception(exception)
File "/path/to/proj/odoo/openerp/http.py", line 1359, in get_response
result.flatten()
File "/path/to/proj/odoo/openerp/http.py", line 1232, in flatten
self.response.append(self.render())
File "/path/to/proj/odoo/openerp/http.py", line 1226, in render
context=request.context)
File "/path/to/proj/odoo/openerp/api.py", line 241, in wrapper
return old_api(self, *args, **kwargs)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_ui_view.py", line 1029, in render
return self.pool[engine].render(cr, uid, id_or_xml_id, qcontext, loader=loader, context=context)
File "/path/to/proj/odoo/openerp/api.py", line 241, in wrapper
return old_api(self, *args, **kwargs)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_qweb.py", line 261, in render
return self.render_node(self.get_template(id_or_xml_id, qwebcontext), qwebcontext)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_qweb.py", line 192, in get_template
raise_qweb_exception(QWebTemplateNotFound, message="Loader could not find template %r" % name, template=origin_template)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_qweb.py", line 190, in get_template
xml_doc = qwebcontext.loader(name)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_ui_view.py", line 1027, in loader
return self.read_template(cr, uid, name, context=context)
File "/path/to/proj/odoo/openerp/api.py", line 241, in wrapper
return old_api(self, *args, **kwargs)
File "<string>", line 2, in read_template
File "/path/to/proj/odoo/openerp/tools/cache.py", line 122, in lookup
value = d[key] = self.method(*args, **kwargs)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_ui_view.py", line 857, in read_template
view_id = self.pool['ir.model.data'].xmlid_to_res_id(cr, uid, xml_id, raise_if_not_found=True)
File "/path/to/proj/odoo/openerp/api.py", line 241, in wrapper
return old_api(self, *args, **kwargs)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_model.py", line 944, in xmlid_to_res_id
return self.xmlid_to_res_model_res_id(cr, uid, xmlid, raise_if_not_found)[1]
File "/path/to/proj/odoo/openerp/api.py", line 241, in wrapper
return old_api(self, *args, **kwargs)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_model.py", line 936, in xmlid_to_res_model_res_id
return self.xmlid_lookup(cr, uid, xmlid)[1:3]
File "/path/to/proj/odoo/openerp/api.py", line 241, in wrapper
return old_api(self, *args, **kwargs)
File "<string>", line 2, in xmlid_lookup
File "/path/to/proj/odoo/openerp/tools/cache.py", line 74, in lookup
value = d[key] = self.method(*args, **kwargs)
File "/path/to/proj/odoo/openerp/addons/base/ir/ir_model.py", line 926, in xmlid_lookup
raise ValueError('External ID not found in the system: %s' % (xmlid))
QWebTemplateNotFound: External ID not found in the system: web.login
In my case, the error was resolved with the following steps.
First, install the missing dependencies.
According to my log, the following packages were installed for Odoo. You can probably shorten the list by trial and error.
comerr-dev
gyp
javascript-common
krb5-multidev
libc-ares2
libc-ares-dev
libexpat1-dev
libffi-dev
libfreetype6-dev
libgssrpc4
libjpeg8-dev
libjpeg-dev
libjpeg-turbo8-dev
libjs-node-uuid
libkadm5clnt-mit9
libkadm5srv-mit9
libkdb5-7
libldap2-dev
libpng12-dev
libpq-dev
libpython2.7-dev
libpython-dev
libsasl2-dev
libssl-dev
libssl-doc
libv8-3.14.5
libv8-3.14-dev
libxml2-dev
libxslt1-dev
node-abbrev
node-ansi
node-archy
node-async
node-block-stream
node-combined-stream
node-cookie-jar
node-delayed-stream
node-forever-agent
node-form-data
node-fstream
node-fstream-ignore
node-github-url-from-git
node-glob
node-graceful-fs
node-gyp
node-inherits
node-ini
nodejs
nodejs-dev
node-json-stringify-safe
node-lockfile
node-lru-cache
node-mime
node-minimatch
node-mkdirp
node-mute-stream
node-node-uuid
node-nopt
node-normalize-package-data
node-npmlog
node-once
node-osenv
node-qs
node-read
node-read-package-json
node-request
node-retry
node-rimraf
node-semver
node-sha
node-sigmund
node-slide
node-tar
node-tunnel-agent
node-which
npm
postgresql-server-dev-9.3
python2.7-dev
python-dev
zlib1g-dev
Next, there was an error coming from PIL.
I removed it and installed it again (see Installing PIL with pip).
First, install:
libjpeg-dev
libfreetype6-dev
zlib1g-dev
Uninstall python-pil:
sudo apt-get purge python-pil
Install Pillow:
pip install pillow
These steps should resolve the problem. I hope this helps.
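After reinstalling, it may be worth confirming that Pillow can be imported by the same Python interpreter that runs Odoo. The following is a small sketch using only the standard library; missing_modules is a hypothetical helper name, and PIL is the import name that Pillow provides.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed module can be imported
print(missing_modules(["PIL"]))
```

Other Python modules that Odoo needs could be added to the list in the same way.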
For more information, see these links:
[BUG] DB Creation :: ?Missing Dependency?
Odoo new dependency to Pillow