Why won't my application start with pandas_udf and PySpark+Flask?

When my Flask+PySpark application has a function with the @udf or @pandas_udf decorator, it will not start. If I simply remove the decorator, it does start.
If I try to start my application with Flask, only the first pass over the script seems to run. For example, the debugger stops at import lines such as
from pyspark.sql.functions import pandas_udf, udf, PandasUDFType
However, no other statement is executed at all, including the initial app = Flask(__name__) statement. (Could it be some kind of hidden exception?)
If I start my application without Flask, with the exact same function and the same imports, it does work.
These are the imports:
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf, PandasUDFType
import pandas as pd
This is the function:
@pandas_udf('string', PandasUDFType.SCALAR)
def pandas_not_null(s):
    return s.fillna("_NO_NA_").replace('', '_NO_E_')
This is the statement that is not executed iff @pandas_udf is there:
app = Flask(__name__)
This is how IntelliJ starts Flask:
FLASK_APP = app
FLASK_ENV = development
FLASK_DEBUG = 1
In folder /Users/vivaomengao/projects/dive-platform/cat-intel/divecatintel
/Users/vivaomengao/anaconda/bin/python /Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py --module --multiproc --qt-support=auto --client 127.0.0.1 --port 56486 --file flask run
I'm running macOS on my own computer.

I found the problem. The @pandas_udf decorator required a Spark session at the time the module was loaded (a decorator runs as soon as the module containing it is imported, not when the decorated function is first called). To solve the problem, I first called my code that creates a Spark session, and only then imported the module that has the function with the @pandas_udf decorator. I imported it right inside the caller function and not at the top of the file.
To troubleshoot, I set a breakpoint on the @pandas_udf function (in PyCharm) and stepped into the calls so I could inspect the local variables. One of the variables referred to something like "sc" or "_jvm". I knew from a past problem that this happens when the Spark session has not been initialized.
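The ordering issue described above can be reproduced without Spark. The following is only a sketch: needs_session and start_session are hypothetical stand-ins, not PySpark APIs, and they merely imitate a decorator that needs a live session at the moment it is applied.

```python
# Hypothetical stand-ins: `needs_session` imitates a decorator that, like
# @pandas_udf in the answer above, needs a live session when it is applied.
_active_session = None


def start_session():
    """Pretend to create the session (stands in for building a SparkSession)."""
    global _active_session
    _active_session = object()
    return _active_session


def needs_session(func):
    # This body runs when the decorated `def` statement executes,
    # which for a module-level function means at import time.
    if _active_session is None:
        raise RuntimeError("no active session")
    return func


def define_udfs():
    # Wrapping the decorated definition in a function delays the decorator
    # call until define_udfs() runs, much like a deferred import does.
    @needs_session
    def not_null(value):
        return "_NO_NA_" if value is None else value

    return not_null


# Decorating before the session exists fails ...
try:
    define_udfs()
    failed_early = False
except RuntimeError:
    failed_early = True

# ... and succeeds once the session has been created first.
start_session()
not_null = define_udfs()
```

In the real application the same effect comes from importing the module with the decorated function inside the caller, after the Spark session has been created.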

Related

BigQueryCheckAsyncOperator in airflow does not exist

I am trying to use async operators for bigquery; however,
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckAsyncOperator
gives the error:
ImportError: cannot import name 'BigQueryCheckOperatorAsync' from 'airflow.providers.google.cloud.operators.bigquery'
The documentation in https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html mentions that BigQueryCheckAsyncOperator exists.
I am using airflow 2.4.
How to import it?
The operator you are trying to import was never released.
It was added in PR and removed in PR; both were part of the Google provider 8.4.0 release, so the BigQueryCheckAsyncOperator class was never part of any release.
You can use deferrable mode in the existing BigQueryCheckOperator class by setting the deferrable parameter to True.

Importing functions in .py files and using them by calling function_name

I have a folder called Script_py containing a lot of .py files, for example tri_insertion.py and tri_rapide.py.
Each name.py contains just one function, also called name. My aim is to:
import all the functions (and if I add another .py file, it should be imported automatically),
execute one function with the command 'name(parameters)'.
I tried the solutions from How to load all modules in a folder? with a dynamic __all__ in __init__.py and from Script_py import *, but a function then has to be called as name.name(parameters) instead of name(parameters).
Finally, the following solution works fine. Firstly, I store the complete list of modules in a list:
import os, pkgutil
test = list(module for _, module, _ in
            pkgutil.iter_modules([os.path.dirname('absolute_path')]))
Secondly, I import all the modules with a for loop.
for script in test:
    exec("from {module} import *".format(module=script))
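As an alternative to exec, the same goal can be sketched with importlib. This is not the poster's method: load_functions is a hypothetical helper, and the temporary folder with a one-line tri_insertion is only a stand-in for the real Script_py directory.

```python
import importlib.util
import pathlib
import tempfile


def load_functions(folder):
    """Return {name: function} for every name.py in folder that defines name()."""
    functions = {}
    for path in sorted(pathlib.Path(folder).glob("*.py")):
        name = path.stem
        if name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        func = getattr(module, name, None)
        if callable(func):
            functions[name] = func
    return functions


# Demo: a throwaway folder standing in for Script_py (hypothetical content).
with tempfile.TemporaryDirectory() as folder:
    pathlib.Path(folder, "tri_insertion.py").write_text(
        "def tri_insertion(values):\n    return sorted(values)\n"
    )
    loaded = load_functions(folder)

# Publishing the functions in the module namespace allows name(parameters).
globals().update(loaded)
result = tri_insertion([3, 1, 2])
```

Keeping the functions in a dictionary first makes it easy to see what was loaded before anything is added to the global namespace.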

Flask Blueprints: RuntimeError Application not registered on db

I've been going at this for several hours, but I'm afraid I still don't grok the Flask app context and how my app should be implemented with Blueprints.
I've taken a look at this and this and have tried a few different recommendations, but there must be something wrong with my basic approach.
I have one 'main' blueprint set up under the following project structure:
project/
    app/
        main/
            __init__.py
            routes.py
            forms.py
            helper.py
        admin/
        static/
        templates/
        __init__.py
        models.py
app/__init__.py:
from flask import Flask
from config import config
from flask.ext.sqlalchemy import SQLAlchemy
from flask.ext.bootstrap import Bootstrap
db = SQLAlchemy()
bootstrap = Bootstrap()
def create_app(config_version='default'):
    app = Flask(__name__)
    app.config.from_object(config[config_version])
    bootstrap.init_app(app)
    from .main import main as main_blueprint
    app.register_blueprint(main_blueprint)
    db.init_app(app)
    return app
app/main/__init__.py:
from flask import Blueprint
main = Blueprint('main',__name__)
from . import routes, helper
app/main/helper.py:
#!/usr/bin/env python
from . import main
from ..models import SKU, Category, db
from flask import current_app
def get_categories():
    cat_list = []
    for cat in db.session.query(Category).all():
        cat_list.append((cat.id,cat.category))
    return cat_list
Everything worked fine until I created the get_categories function in helper.py to pull a dynamic list for a select form in app/main/forms.py. When I fire up WSGI, however, I get this error:
RuntimeError: application not registered on db instance and no application bound to current context
It would appear the db referenced in helper is not associated with an app context but when I try to create one within the function, it has not worked.
What am I doing wrong and is there a better way to organize helper functions when using Blueprints?
Documentation on database contexts here here.
My first thought was that you weren't calling db.init_app(app), but I was corrected in the comments.
In app/main/helper.py, you're importing the database via from ..models import SKU, Category, db. This database object will not have been initialized with the app that you've created.
The way that I've gotten around this is by having another file, a shared.py in the root directory. In that file, create the database object,
from flask.ext.sqlalchemy import SQLAlchemy
db = SQLAlchemy()
In your app/__init__.py, don't create a new db object. Instead, do
from shared import db
db.init_app(app)
In any place that you want to use the db object, import it from shared.py. This way, the object in the shared file will have been initialized with the app context, and there's no chance of circular imports (which is one of the problems that you can run into with having the db object outside of the app-creating file).
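The point about sharing a single object can be illustrated without Flask. The sketch below uses FakeDb, a hypothetical stand-in for an extension object such as SQLAlchemy(); it assumes only that an instance works once init_app has been called on that same instance, and it does not model Flask's application context.

```python
class FakeDb:
    """Hypothetical stand-in for an extension object such as SQLAlchemy()."""

    def __init__(self):
        self.app = None

    def init_app(self, app):
        # Binds this particular instance to the application.
        self.app = app

    def query_all(self):
        if self.app is None:
            raise RuntimeError("application not registered on db instance")
        return []


shared_db = FakeDb()  # plays the role of `db` in shared.py
other_db = FakeDb()   # a second instance created somewhere else by mistake

shared_db.init_app("my-app")  # what create_app() does with the shared object

rows = shared_db.query_all()  # works: this instance was initialized

try:
    other_db.query_all()
    other_failed = False
except RuntimeError:
    other_failed = True  # the unrelated instance was never initialized
```

Importing the one object from a single module everywhere guarantees that the instance being queried is the instance that was initialized.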

Why can't Jython access stdlib modules when called from Tomcat?

I can take an application with the following code in it:
PythonInterpreter interpreter = new PythonInterpreter();
interpreter.exec("import os");
interpreter.exec("import mylib");
Where the following is resources/Lib/mylib/__init__.py:
from __future__ import print_function
from . import myfriend as thing
import os
print("Yep, everything works")
and compile it using maven, producing a my-app-with-dependencies.jar
I can easily run it with java -jar my-app-with-dependencies.jar and it works just fine, hooray!
Here's where the sad part comes in. I can put this exact same code inside a Spring handler:
@RequestMapping("/doesnotwork")
public @ResponseBody String sadness() {
    PythonInterpreter interpreter = new PythonInterpreter();
    interpreter.exec("import os");
    interpreter.exec("import mylib");
    return "Quoth the Java, nevermore";
}
and magically this no longer works. Not one little bit.
I can however move my file from resources/Lib/ to webapp/WEB-INF/lib/Lib/ and import mylib works. But within mylib I can no longer import from __future__ or os. I can import sys, and my sys.path looks like this:
['/path/to/my/webapp/WEB-INF/lib/Lib', '__classpath__', '__pyclasspath__/']
My sys.path_importer_cache looks like this:
{'__classpath__': <type 'org.python.core.JavaImporter'>,
'/path/to/my/webapp/WEB-INF/lib/Lib': None,
'/path/to/my/webapp/WEB-INF/lib/Lib/mylib': None,
'__pyclasspath__/': <ClasspathPyImporter object at 0x2>}
What am I doing wrong that I can't import the stdlib? /path/to/my/webapp/WEB-INF/lib contains both jython-2.7-b1.jar and jython-standalone-2.7-b1.jar. I've even tried inserting those jar files into my path and still no dice.
I can import java classes from .jar files present in the folder, except for ones found in the jython .jars. For instance, inside jython-2.7-b1.jar are org/python/apache/xml/serialize/Serializer.class. I can import org.python but there only exists org.python.__name__.

Cherrypy web server hangs forever -- Matplotlib error

I'm creating a web-based interface for a number of different command line executables, and am using cherrypy behind apache (using mod_rewrite). I'm very new to this, and am having difficulty getting things configured properly. On my development machine, everything works reasonably well, but when I installed the code on a second machine I couldn't get anything to work properly.
The basic workflow for the applications is: 1. upload a dataset, 2. process the data (using python with some calls to executables using subprocess.call), 3. display the results on the web page.
After uploading and processing one dataset, every time I attempt to process a second dataset the system stops responding. I'm not seeing any output in the terminal from the cherrypy process, or anything in the site log that shows an error has occurred.
I'm starting cherrypy with the following conf file:
[global]
environment: 'production'
log.error_file: 'logs/site.log'
log.screen: True
tools.sessions.on: True
tools.session.storage_type: "file"
tools.session.storage_path: "sessions/"
tools.sessions.timeout: 60
tools.auth.on: True
tools.caching.on: False
server.socket_host: '0.0.0.0'
server.max_request_body_size: 0
server.socket_timeout: 60
server.thread_pool: 20
server.socket_queue_size: 10
engine.autoreload.on:True
My __init__.py file:
import cherrypy
import os
import string
from os.path import exists, join
from os import pathsep
from string import split
from mako.template import Template
from mako.lookup import TemplateLookup
from auth import AuthController, require, member_of, name_is
from twopoint import TwoPoint
current_dir = os.path.dirname(os.path.abspath(__file__))
lookup = TemplateLookup(directories=[current_dir + '/templates'])
def findInSubdirectory(filename, subdirectory=''):
    if subdirectory:
        path = subdirectory
    else:
        path = os.getcwd()
    for root, dirs, names in os.walk(path):
        if filename in names:
            return os.path.join(root, filename)
    return None
class Root:
    @cherrypy.expose
    @require()
    def index(self):
        tmpl = lookup.get_template("main.html")
        return tmpl.render(usr=WebUtils.getUserName(),source="")
if __name__=='__main__':
    conf_path = os.path.dirname(os.path.abspath(__file__))
    conf_path = os.path.join(conf_path, "prod.conf")
    cherrypy.config.update(conf_path)
    cherrypy.config.update({'server.socket_host': '127.0.0.1',
                            'server.socket_port': 8080})
    def nocache():
        cherrypy.response.headers['Cache-Control']='no-cache,no-store,must-revalidate'
        cherrypy.response.headers['Pragma']='no-cache'
        cherrypy.response.headers['Expires']='0'
    cherrypy.tools.nocache = cherrypy.Tool('before_finalize',nocache)
    cherrypy.config.update({'tools.nocache.on':'True'})
    cherrypy.tree.mount(Root(), '/')
    cherrypy.tree.mount(TwoPoint(), '/twopoint')
    cherrypy.engine.start()
    cherrypy.engine.block()
For one example where this occurs, I've got the following javascript function that calls my python code:
function compTwoPoint(dataset,orig){
    // call python code to generate images
    $.post("/twopoint/compTwoPoint/"+dataset,
        function(result){
            res=jQuery.parseJSON(result);
            if(res.success==true){
                showTwoPoint(res.path,orig);
            }
            else{
                alert(res.exception);
                $('#display_loading').html("");
            }
        });
}
This calls the python code:
def twopoint(in_matrix):
    """proprietary code, can't share"""
def twopoint_file(in_file_name,out_file_name):
    k = imread(in_file_name)
    figure()
    imshow(twopoint(k))
    colorbar()
    savefig(out_file_name,bbox_inches="tight")
    close()
class TwoPoint:
    @cherrypy.expose
    def compTwoPoint(self,dataset):
        try:
            fnames=WebUtils.dataFileNames(dataset)
            twopoint_file(fnames['filepath'],os.path.join(fnames['savebase'],"twopt.png"))
            return encoder.iterencode({"success": True})
These functions work together to give the expected result. The problem is that after processing one input file, I am unable to process a second file. I don't seem to get a response from the server.
On the machine where things are working, I'm running python 2.7.6 and cherrypy 3.2.3. On the second machine, I have python 2.7.7 and cherrypy 3.3.0. While this may explain the difference in behavior, I'd like to find a way to make my code portable enough to overcome the difference in version (going from older to newer)
I'm not sure what the problem is, or even what to search for. I would appreciate any guidance or help you can offer.
(edit: Digging a bit more, I discovered that something is happening with matplotlib. If I put print statements before and after the figure() command in twopoint_file, only the first one prints. Calling this function directly from a Python interpreter (removing cherrypy from the equation), I get the following error:
can't invoke "event" command: application has been destroyed while executing "event generate $w <<ThemeChanged>>"
(procedure "ttk::ThemeChanged" line 6) invoked from within "ttk::ThemeChanged"
end edit)
I don't understand what this error means, and haven't had much luck searching.
Old question, but I had the same problem, which I fixed by changing the Matplotlib backend:
import matplotlib
matplotlib.use("qt4agg")
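A related option, offered here as an assumption rather than as the fix used above, is the non-interactive Agg backend, which renders straight to files and does not need a GUI toolkit such as Tk or Qt. The sketch below assumes Matplotlib is installed; save_heatmap and the 2x2 input are hypothetical stand-ins for twopoint_file and its data.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # select the file-only backend before any figure exists
import matplotlib.pyplot as plt


def save_heatmap(matrix, out_file_name):
    """Render matrix as an image with a colorbar and write it to a file."""
    fig, ax = plt.subplots()
    image = ax.imshow(matrix)
    fig.colorbar(image, ax=ax)
    fig.savefig(out_file_name, bbox_inches="tight")
    plt.close(fig)  # release the figure so repeated requests do not pile up


# Demo with a tiny stand-in matrix written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "twopt.png")
save_heatmap([[0, 1], [1, 0]], out_path)
```

Using explicit figure and axes objects instead of the implicit pyplot state also avoids sharing one global "current figure" between server threads.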