Python pandas_udf spark error - pandas

I started playing around with Spark locally and am running into this weird issue:
1) pip install pyspark==2.3.1
2) pyspark>
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, udf
df = pd.DataFrame({'x': [1,2,3], 'y':[1.0,2.0,3.0]})
sp_df = spark.createDataFrame(df)
@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1
sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()
I took this example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Any idea why I keep getting this error?
py4j.protocol.Py4JJavaError: An error occurred while calling o108.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 8, localhost, executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:333)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:322)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:177)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
... 27 more

I had the same problem. I found it to be a version mismatch between pandas and numpy.
For me, the following combination works:
numpy==1.14.5
pandas==0.23.4
pyarrow==0.10.0
Before, I had the following non-working combination:
numpy==1.15.1
pandas==0.23.4
pyarrow==0.10.0
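A quick way to verify which versions are actually installed in the Python environment Spark uses (just a diagnostic sketch):
import numpy, pandas, pyarrow
print(numpy.__version__, pandas.__version__, pyarrow.__version__)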

I found the issue to be just an incompatible version of pyarrow. Spark 2.4.0 was built with pyarrow 0.10.0 (https://issues.apache.org/jira/browse/SPARK-23874).
I reverted my pyarrow package to 0.10.0 (my current version was 0.15.x) and it worked perfectly.
The config that works for me is:
numpy==1.14.3
pandas==0.23.0
pyarrow==0.10.0
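If you need to pin your environment to this combination, something along these lines should work (assuming pip):
pip install numpy==1.14.3 pandas==0.23.0 pyarrow==0.10.0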

Related

Geoserver 2.19 ImagePyramid processing error

I have more than 500 GB of orthophoto GeoTIFF images with LZW compression, and the task is to serve them using GeoServer.
The main idea is to use pyramids for much better data mobility in the future. For my tests, I use 137 GB of GeoTIFF images with LZW compression.
Firstly, I compressed my files via the GDAL util gdal_translate, which got me down to 25 GB of GeoTIFF images:
gdal_translate -of GTiff -co COMPRESS=JPEG input_file output_file
Secondly, I used the GDAL util gdalbuildvrt to build a Virtual Dataset (VRT):
gdalbuildvrt -te xmin_vrt ymin_vrt xmax_vrt ymax_vrt -srcnodata "0 0 0" output_file.vrt input_gtiff_file.tif
Thirdly, I used the GDAL util gdal_retile to build external pyramids:
gdal_retile -of GTiff -v -r bilinear -levels 4 -ps 2048 2048 -co "TILED=YES" -co "COMPRESS=JPEG" -targetDir C:\...\out input_file.vrt
All levels 1-4 were built into subdirectories 1-4 containing the tiled GeoTIFF files.
The next step was to use GeoServer's ImagePyramid extension on the 25 GB of GeoTIFF pyramids. For this, I created a new ImagePyramid data store (ImagePyramid plugin). The zero subdirectory was created correctly with a shapefile in it.
The last step is to publish the newly generated store as a layer, but this leads to the error "An error occurred while loading the page" with "Failed to load granule file" and "java.lang.NullPointerException".
org.apache.wicket.WicketRuntimeException: Method onRequest of interface org.apache.wicket.behavior.IBehaviorListener targeted at org.apache.wicket.ajax.markup.html.AjaxLink$1#3a4a35fe on component [AjaxLink [Component id = link]] threw an exception
at org.apache.wicket.RequestListenerInterface.internalInvoke(RequestListenerInterface.java:268)
at org.apache.wicket.RequestListenerInterface.invoke(RequestListenerInterface.java:241)
at org.apache.wicket.core.request.handler.ListenerInterfaceRequestHandler.invokeListener(ListenerInterfaceRequestHandler.java:248)
at org.apache.wicket.core.request.handler.ListenerInterfaceRequestHandler.respond(ListenerInterfaceRequestHandler.java:234)
at org.apache.wicket.request.cycle.RequestCycle$HandlerExecutor.respond(RequestCycle.java:895)
at org.apache.wicket.request.RequestHandlerStack.execute(RequestHandlerStack.java:64)
at org.apache.wicket.request.cycle.RequestCycle.execute(RequestCycle.java:265)
at org.apache.wicket.request.cycle.RequestCycle.processRequest(RequestCycle.java:222)
at org.apache.wicket.request.cycle.RequestCycle.processRequestAndDetach(RequestCycle.java:293)
at org.apache.wicket.protocol.http.WicketFilter.processRequestCycle(WicketFilter.java:261)
at org.apache.wicket.protocol.http.WicketFilter.processRequest(WicketFilter.java:203)
at org.apache.wicket.protocol.http.WicketServlet.doGet(WicketServlet.java:137)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.springframework.web.servlet.mvc.ServletWrappingController.handleRequestInternal(ServletWrappingController.java:166)
at org.springframework.web.servlet.mvc.AbstractController.handleRequest(AbstractController.java:177)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:52)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1040)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1452)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:791)
at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
at org.geoserver.filters.ThreadLocalsCleanupFilter.doFilter(ThreadLocalsCleanupFilter.java:26)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.filters.SpringDelegatingFilter$Chain.doFilter(SpringDelegatingFilter.java:69)
at org.geoserver.wms.animate.AnimatorFilter.doFilter(AnimatorFilter.java:70)
at org.geoserver.filters.SpringDelegatingFilter$Chain.doFilter(SpringDelegatingFilter.java:66)
at org.geoserver.filters.SpringDelegatingFilter.doFilter(SpringDelegatingFilter.java:41)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.platform.AdvancedDispatchFilter.doFilter(AdvancedDispatchFilter.java:37)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:320)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:70)
at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:127)
at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:91)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:74)
at org.geoserver.security.filter.GeoServerCompositeFilter.doFilter(GeoServerCompositeFilter.java:91)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:70)
at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:119)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:74)
at org.geoserver.security.filter.GeoServerCompositeFilter.doFilter(GeoServerCompositeFilter.java:91)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
at org.geoserver.security.filter.GeoServerAnonymousAuthenticationFilter.doFilter(GeoServerAnonymousAuthenticationFilter.java:51)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:70)
at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:200)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:74)
at org.geoserver.security.filter.GeoServerCompositeFilter.doFilter(GeoServerCompositeFilter.java:91)
at org.geoserver.security.filter.GeoServerUserNamePasswordAuthenticationFilter.doFilter(GeoServerUserNamePasswordAuthenticationFilter.java:122)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:70)
at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:158)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:74)
at org.geoserver.security.filter.GeoServerCompositeFilter.doFilter(GeoServerCompositeFilter.java:91)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:70)
at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:105)
at org.geoserver.security.filter.GeoServerSecurityContextPersistenceFilter$1.doFilter(GeoServerSecurityContextPersistenceFilter.java:52)
at org.geoserver.security.filter.GeoServerCompositeFilter$NestedFilterChain.doFilter(GeoServerCompositeFilter.java:74)
at org.geoserver.security.filter.GeoServerCompositeFilter.doFilter(GeoServerCompositeFilter.java:91)
at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
at org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:215)
at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:178)
at org.geoserver.security.GeoServerSecurityFilterChainProxy.doFilter(GeoServerSecurityFilterChainProxy.java:142)
at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:358)
at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:271)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.filters.LoggingFilter.doFilter(LoggingFilter.java:101)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.filters.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:77)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.filters.GZIPFilter.doFilter(GZIPFilter.java:47)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.filters.SessionDebugFilter.doFilter(SessionDebugFilter.java:46)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.geoserver.filters.FlushSafeFilter.doFilter(FlushSafeFilter.java:42)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.reflect.InvocationTargetException
at jdk.internal.reflect.GeneratedMethodAccessor302.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.wicket.RequestListenerInterface.internalInvoke(RequestListenerInterface.java:258)
... 128 more
Caused by: java.lang.RuntimeException: Error occurred while building the resources for the configuration page
at org.geoserver.web.data.layer.NewLayerPage.buildLayerInfo(NewLayerPage.java:431)
at org.geoserver.web.data.layer.NewLayerPage$9.onClick(NewLayerPage.java:324)
at org.geoserver.web.wicket.SimpleAjaxLink$1.onClick(SimpleAjaxLink.java:47)
at org.apache.wicket.ajax.markup.html.AjaxLink$1.onEvent(AjaxLink.java:85)
at org.apache.wicket.ajax.AjaxEventBehavior.respond(AjaxEventBehavior.java:155)
at org.apache.wicket.ajax.AbstractDefaultAjaxBehavior.onRequest(AbstractDefaultAjaxBehavior.java:601)
... 132 more
Caused by: org.geotools.data.DataSourceException: Unable to create this mosaic
at org.geotools.gce.imagemosaic.RasterLayerResponse.prepareResponse(RasterLayerResponse.java:757)
at org.geotools.gce.imagemosaic.RasterLayerResponse.processRequest(RasterLayerResponse.java:605)
at org.geotools.gce.imagemosaic.RasterLayerResponse.createResponse(RasterLayerResponse.java:573)
at org.geotools.gce.imagemosaic.RasterManager.read(RasterManager.java:1321)
at org.geotools.gce.imagemosaic.ImageMosaicReader.read(ImageMosaicReader.java:652)
at org.geotools.gce.imagemosaic.ImageMosaicReader.read(ImageMosaicReader.java:633)
at org.geotools.gce.imagepyramid.ImagePyramidReader.loadRequestedTiles(ImagePyramidReader.java:402)
at org.geotools.gce.imagepyramid.ImagePyramidReader.read(ImagePyramidReader.java:360)
at org.geoserver.catalog.CoverageDimensionCustomizerReader.read(CoverageDimensionCustomizerReader.java:234)
at org.geoserver.catalog.SingleGridCoverage2DReader.read(SingleGridCoverage2DReader.java:126)
at org.geoserver.catalog.CatalogBuilder.getCoverageSampleDimensions(CatalogBuilder.java:1188)
at org.geoserver.catalog.CatalogBuilder.buildCoverageInternal(CatalogBuilder.java:1064)
at org.geoserver.catalog.CatalogBuilder.buildCoverage(CatalogBuilder.java:985)
at org.geoserver.catalog.CatalogBuilder.buildCoverage(CatalogBuilder.java:939)
at org.geoserver.web.data.layer.NewLayerPage.buildLayerInfo(NewLayerPage.java:418)
... 137 more
Caused by: java.io.IOException: java.util.concurrent.ExecutionException: org.geotools.gce.imagemosaic.GranuleLoadingException: Failed to load granule file:/C:/AlidadA/3_software/geoserver_2_19_0/data_dir/data/1_drn_data/pyramids/out/0/8-50-0-5_epsg_4326_01_01.tif
at org.geotools.gce.imagemosaic.granulecollector.BaseSubmosaicProducer.collectGranules(BaseSubmosaicProducer.java:225)
at org.geotools.gce.imagemosaic.granulecollector.BaseSubmosaicProducer.createMosaic(BaseSubmosaicProducer.java:398)
at org.geotools.gce.imagemosaic.RasterLayerResponse$MosaicProducer.produce(RasterLayerResponse.java:420)
at org.geotools.gce.imagemosaic.RasterLayerResponse$MosaicProducer.access$600(RasterLayerResponse.java:276)
at org.geotools.gce.imagemosaic.RasterLayerResponse.prepareResponse(RasterLayerResponse.java:676)
... 151 more
Caused by: java.util.concurrent.ExecutionException: org.geotools.gce.imagemosaic.GranuleLoadingException: Failed to load granule file:/C:/AlidadA/3_software/geoserver_2_19_0/data_dir/data/1_drn_data/pyramids/out/0/8-50-0-5_epsg_4326_01_01.tif
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.geotools.gce.imagemosaic.granulecollector.BaseSubmosaicProducer.collectGranules(BaseSubmosaicProducer.java:121)
... 155 more
Caused by: org.geotools.gce.imagemosaic.GranuleLoadingException: Failed to load granule file:/C:/AlidadA/3_software/geoserver_2_19_0/data_dir/data/1_drn_data/pyramids/out/0/8-50-0-5_epsg_4326_01_01.tif
at org.geotools.gce.imagemosaic.GranuleLoader.call(GranuleLoader.java:112)
at org.geotools.gce.imagemosaic.GranuleLoader.call(GranuleLoader.java:38)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at org.geotools.gce.imagemosaic.granulecollector.BaseSubmosaicProducer.acceptGranule(BaseSubmosaicProducer.java:445)
at org.geotools.gce.imagemosaic.granulecollector.DefaultSubmosaicProducer.accept(DefaultSubmosaicProducer.java:70)
at org.geotools.gce.imagemosaic.RasterLayerResponse$MosaicProducer.visit(RasterLayerResponse.java:360)
at org.geotools.gce.imagemosaic.catalog.CachingDataStoreGranuleCatalog.getGranuleDescriptors(CachingDataStoreGranuleCatalog.java:180)
at org.geotools.gce.imagemosaic.catalog.LockingGranuleCatalog.lambda$getGranuleDescriptors$7(LockingGranuleCatalog.java:195)
at org.geotools.gce.imagemosaic.catalog.LockingGranuleCatalog.guardIO(LockingGranuleCatalog.java:93)
at org.geotools.gce.imagemosaic.catalog.LockingGranuleCatalog.getGranuleDescriptors(LockingGranuleCatalog.java:195)
at org.geotools.gce.imagemosaic.RasterManager.getGranuleDescriptors(RasterManager.java:1330)
at org.geotools.gce.imagemosaic.RasterLayerResponse.prepareResponse(RasterLayerResponse.java:672)
... 151 more
Caused by: java.lang.NullPointerException
at org.geotools.gce.imagemosaic.GranuleDescriptor.loadRaster(GranuleDescriptor.java:1318)
at org.geotools.gce.imagemosaic.GranuleLoader.call(GranuleLoader.java:108)
... 162 more
I found myself in a similar situation, and I succeeded by using GeoTIFFs without any compression (in particular, no JPEG compression). That means I only executed your third command:
gdal_retile -of GTiff -v -r bilinear -levels 4 -ps 2048 2048 -co "TILED=YES" \
-co "COMPRESS=JPEG" -targetDir C:\...\out input_file.vrt
but without the -co "COMPRESS=JPEG" option. And it worked! I think it's a problem with the JPEG compression, but I didn't test with other compression types so I can't be sure.
I was having the same issue, and it worked after omitting the COMPRESS option. I then tried using
-co "COMPRESS=LZW"
and it worked too. That helped me almost halve the space used by the uncompressed tiles.
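For reference, this is the retile command from the question with LZW compression swapped in (paths are placeholders taken from the original command):
gdal_retile -of GTiff -v -r bilinear -levels 4 -ps 2048 2048 -co "TILED=YES" -co "COMPRESS=LZW" -targetDir C:\...\out input_file.vrt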

Plotly Apache Spark MapBox

Is there a way to plot Apache Spark DataFrames on MapBox?
I've tried Plotly, but it only takes pandas DataFrames for plotting on maps.
UPDATE:
When I convert the Spark DataFrame to a pandas DataFrame, I get the following error:
Py4JJavaError: An error occurred while calling o58.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11 tasks (1097.7 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2141)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:752)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2093)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2133)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3448)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3445)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
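The error says the collected result exceeds spark.driver.maxResultSize. A minimal sketch of two common workarounds, assuming a Spark DataFrame named sdf and placeholder column names lat/lon (either raise the limit, or shrink the data before converting):
from pyspark.sql import SparkSession

# Option 1: raise the driver-side result size limit when building the session
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())

# Option 2: reduce the data first, e.g. keep only the columns and rows needed for the map
small_pdf = sdf.select("lat", "lon").limit(100000).toPandas()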

Error writing XGBoost Classifier to pmml with sklearn2pmml

I want to save my XGBoost model as PMML using sklearn2pmml. I'm using Python 3.7.3 with scikit-learn 0.20.3 and sklearn2pmml 0.53.0. My data is mostly binary, with just 3 columns of continuous data. I'm running my notebook in Databricks and converting my Spark DataFrame to a pandas DataFrame. Code snippet below:
import xgboost as xgb
from sklearn_pandas import DataFrameMapper
from sklearn.compose import ColumnTransformer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import StandardScaler

X = pdf[continuous_features + numericCols]
y = pdf["Label"]

mapper = DataFrameMapper(
    [([cont_column], [ContinuousDomain(), StandardScaler()]) for cont_column in continuous_features] +
    [([c for c in numericCols], None)]  # no transformation
)

clf = xgb.XGBClassifier(objective='multi:softprob', eval_metric='auc', num_class=2,
                        n_jobs=6, max_delta_step=1, min_child_weight=14, gamma=1.5, subsample=0.8,
                        colsample_bytree=0.5, max_depth=10, learning_rate=0.1)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", clf)
])

pipeline.fit(X, y.values.reshape(-1,))
sklearn2pmml(pipeline, "xgb_V1.pmml", with_repr=True)
The pipeline fits the data, and generates a score and predictions with pipeline.score(X, y) and pipeline.predict(X), but when I try to write it to PMML I get the following error:
Standard output is empty
Standard error:
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 47 ms.
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Converting..
Feb 21, 2020 1:53:30 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
... 7 more
Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
I thought it might be a version incompatibility issue between scikit-learn and sklearn2pmml, as per https://github.com/jpmml/sklearn2pmml/issues/197, but I think the versions I have installed should be OK. Any ideas on what's going on here? Thanks in advance.
It is probably an XGBoost package version issue. The SkLearn2PMML package expects the label encoder (the XGBClassifier._le attribute) to be a "normal" Scikit-Learn label encoder class (sklearn.preprocessing.(label|_label).LabelEncoder), but in your case it's something different (xgboost.compat.XGBoostLabelEncoder).
In which XGBoost package version was this xgboost.compat.XGBoostLabelEncoder introduced? It's either something very old or something very new.
In any case, please open a feature request with the JPMML-SkLearn project to have this issue sorted out.
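A quick diagnostic sketch (assuming clf is the fitted XGBClassifier from the question) to check which XGBoost version is installed and what type the _le attribute actually is:
import xgboost
print(xgboost.__version__)
print(type(clf._le))  # expected: sklearn.preprocessing.LabelEncoder; in this case: xgboost.compat.XGBoostLabelEncoder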

PySpark pandas_udfs java.lang.IllegalArgumentException error

Does anyone have experience using pandas UDFs on a local PySpark session running on Windows? I've used them on Linux with good results, but I've been unsuccessful on my Windows machine.
Environment:
python==3.7
pyarrow==0.15
pyspark==2.3.4
pandas==0.24
java version "1.8.0_74"
Sample script:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

out_df = df.groupby("id").apply(subtract_mean).toPandas()
print(out_df.head())
# +---+----+
# | id| v|
# +---+----+
# | 1|-0.5|
# | 1| 0.5|
# | 2|-3.0|
# | 2|-1.0|
# | 2| 4.0|
# +---+----+
After running for a long time (it splits the toPandas stage into 200 tasks, each taking over a second), it returns an error like this:
Traceback (most recent call last):
File "C:\miniconda3\envs\pandas_udf\lib\site-packages\pyspark\sql\dataframe.py", line 1953, in toPandas
tables = self._collectAsArrow()
File "C:\miniconda3\envs\pandas_udf\lib\site-packages\pyspark\sql\dataframe.py", line 2004, in _collectAsArrow
sock_info = self._jdf.collectAsArrowToPython()
File "C:\miniconda3\envs\pandas_udf\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\miniconda3\envs\pandas_udf\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\miniconda3\envs\pandas_udf\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o62.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 69 in stage 3.0 failed 1 times, most recent failure: Lost task 69.0 in stage 3.0 (TID 201, localhost, executor driver): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(Unknown Source)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:128)
at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:161)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:290)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.hasNext(ArrowConverters.scala:96)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.foreach(ArrowConverters.scala:94)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.to(ArrowConverters.scala:94)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.toBuffer(ArrowConverters.scala:94)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.toArray(ArrowConverters.scala:94)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Your java.lang.IllegalArgumentException in pandas_udf has to do with the pyarrow version, not with the OS environment. See this issue for details.
You have two routes of action:
Downgrade pyarrow to v0.14, or
Add the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 to SPARK_HOME/conf/spark-env.sh
On Windows, you'll need to have a spark-env.cmd file in the conf directory containing set ARROW_PRE_0_15_IPC_FORMAT=1, as suggested by Jonathan Taws (see the sketch below).
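A minimal sketch of that env-file change (only one of the two files is needed, depending on the OS):
# SPARK_HOME/conf/spark-env.sh (Linux/macOS)
export ARROW_PRE_0_15_IPC_FORMAT=1
REM SPARK_HOME\conf\spark-env.cmd (Windows)
set ARROW_PRE_0_15_IPC_FORMAT=1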
Addendum to the answer of Sergey:
If you prefer to build your own SparkSession in Python rather than change your config files, you'll need to set both spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT and the executor-side environment variable spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT:
spark_session = SparkSession.builder \
    .master("yarn") \
    .config('spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT', 1) \
    .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
spark = spark_session.getOrCreate()
Hope this helps!

Apache Spark - sqlContext.sql to pandas

Hi,
I have a Spark DataFrame and I made some transformations using the SQL context, for example, selecting only two columns from all the data:
df_oraAS = sqlContext.sql("SELECT ENT_EMAIL,MES_ART_ID FROM df_oraAS LIMIT 5 ")
But now I want to transform this result into a pandas DataFrame, and I'm using
pddf = df_oraAS.toPandas()
but the output stops here and I need to restart the IDE (Spyder):
16/01/22 16:04:01 INFO DAGScheduler: Got job 0 (toPandas at <stdin>:1) with 3 output partitions
16/01/22 16:04:01 INFO DAGScheduler: Final stage: ResultStage 0 (toPandas at <stdin>:1)
16/01/22 16:04:01 INFO DAGScheduler: Parents of final stage: List()
16/01/22 16:04:01 INFO DAGScheduler: Missing parents: List()
16/01/22 16:04:01 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at toPandas at <stdin>:1), which has no missing parents
16/01/22 16:04:01 INFO SparkContext: Starting job: toPandas at <stdin>:1
16/01/22 16:04:01 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.4 KB, free 9.4 KB)
16/01/22 16:04:01 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.9 KB, free 14.3 KB)
16/01/22 16:04:01 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:50877 (size: 4.9 KB, free: 511.1 MB)
16/01/22 16:04:01 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/01/22 16:04:01 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at toPandas at <stdin>:1)
16/01/22 16:04:01 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
16/01/22 16:04:02 WARN TaskSetManager: Stage 0 contains a task of very large size (116722 KB). The maximum recommended task size is 100 KB.
16/01/22 16:04:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 119523958 bytes)
16/01/22 16:04:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 117876401 bytes)
Exception in thread "dispatcher-event-loop-3" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source)
at java.io.ByteArrayOutputStream.grow(Unknown Source)
at java.io.ByteArrayOutputStream.ensureCapacity(Unknown Source)
at java.io.ByteArrayOutputStream.write(Unknown Source)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(Unknown Source)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(Unknown Source)
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.writeObject(Unknown Source)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:200)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:462)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:252)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:247)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$8.apply(TaskSchedulerImpl.scala:317)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$8.apply(TaskSchedulerImpl.scala:315)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:315)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:315)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:315)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalBackend.scala:84)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalBackend.scala:63)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
What did I do wrong?
Thanks.
EDIT: to be more complete: I load the data from an Oracle database (cx_Oracle) and put it in a pandas DataFrame:
df_ora = pd.read_sql('SELECT* FROM DEC_CLIENTES', con=connection)
Next, I created an SQLContext to manipulate the DataFrame:
sqlContext = SQLContext(sc)
df_oraAS = sqlContext.createDataFrame(df_ora)
df_oraAS.registerTempTable("df_oraAS")
df_oraAS = sqlContext.sql("SELECT ENT_EMAIL,MES_ART_ID FROM df_oraAS LIMIT 5 ")
and I want to convert the result back to a pandas DataFrame:
pddf = df_oraAS.toPandas()
toPandas is basically collect in disguise. The output is a local pandas DataFrame. If the data doesn't fit into driver memory it will simply fail, hence the error you see.
Your pd.read_sql call reads the full database into a pandas DataFrame. This is local to the driver. When you call createDataFrame, it then creates a Spark DataFrame from your Python pandas DataFrame, which results in a really large task size (see the log line below):
16/01/22 16:04:02 WARN TaskSetManager: Stage 0 contains a task of very large size (116722 KB). The maximum recommended task size is 100 KB.
Even though you are selecting only 5 rows, you're actually first loading the full database into memory with that pd.read_sql call. If you're reading from an Oracle database, why not use the Spark JDBC data source, perform your select filters there, and then call toPandas?
What your code is currently doing is reading the whole DB into pandas, writing it to Spark, filtering, and reading it back into pandas.
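A minimal sketch of that JDBC approach (the connection URL and credentials are placeholders, and the Oracle JDBC driver jar must be on the Spark classpath):
df_ora = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host:1521/service")  # placeholder URL
          .option("dbtable", "DEC_CLIENTES")
          .option("user", "user").option("password", "password")      # placeholders
          .load())

pddf = df_ora.select("ENT_EMAIL", "MES_ART_ID").limit(5).toPandas()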