Instantiate spaCy object only once in an Apache mod_wsgi environment

myapp.py
---- import statements ----
parser = None
app = Flask(__name__)

@app.route('/xxxxxx/yyy')
def markDealStatus():
    text = 'matthew honnibal created spacy library'
    parsedData = parser(text.decode("utf-8"))
    xxxxxxxxxxxxx
    xxxxxxxxxxxx

def initSpacy():
    global parser
    parser = English()

if __name__ == '__main__':
    initSpacy()
    app.run()

if __name__ == 'myapp':
    initSpacy()
When I run this app in development mode, the __main__ block executes, spaCy is instantiated only once, and I reuse that object.
For production we are using an Apache mod_wsgi configuration. Similarly, I want to instantiate it once (when the module is imported as myapp) and reuse the same object.
In my configuration it is instantiated for each request. Please suggest a solution.
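For reference, here is a minimal sketch of a mod_wsgi entry point that relies on that import-time guard; the file name (myapp.wsgi) and the project path are assumptions for illustration, not taken from the question:
# myapp.wsgi -- hypothetical entry point referenced by WSGIScriptAlias
import sys

# Make myapp.py importable inside the Apache process (path is an assumption).
sys.path.insert(0, '/path/to/project')

# Importing the module runs its top-level code once per mod_wsgi daemon
# process; the "if __name__ == 'myapp'" guard then calls initSpacy(), so
# the parser is built once and reused for every request in that process.
import myapp

# mod_wsgi serves the module-level callable named "application".
application = myapp.app
Each daemon process still loads its own copy of the model, and recycling options such as maximum-requests reload it when a process restarts, so if the parser really is being created per request it is worth checking that the WSGI script only imports the app rather than constructing it anew for each call.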
Environment
Operating System: Linux
Python Version Used: 2.7
Environment Information: apache mod_wsgi deployment

Related

tensorflow data validation tfdv fails on google cloud dataflow with "Can't get attribute 'NumExamplesStatsGenerator' "

I am following this "get started" TensorFlow tutorial on how to run TFDV with Apache Beam on Google Cloud Dataflow. My code is very similar to the one in the tutorial:
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions
PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'
# downloaded locally with: pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
# (would be great to have it on cloud storage) PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
# Create and set your PipelineOptions.
options = PipelineOptions()
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'
setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]
# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2
print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")
The code above starts a Dataflow job, but unfortunately it fails with the following error:
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
return dill.loads(s)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
return load(file, ignore, **kwds)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>
I couldn't find anything similar on the internet.
If it can help, on my local machine (MACOS) I have the following versions:
Apache Beam version: 2.34.0
Tensorflow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0
Apache Beam on the cloud runs with the Apache Beam Python 3.8 SDK 2.34.0.
BONUS QUESTION: Another question I have is around the PATH_TO_WHL_FILE. I tried to put it on a storage bucket but Beam doesn't seem to be able to pick it up. Only locally, which is actually a problem, because it would make it more difficult to distribute this code. What would be a good practice to distribute this wheel file?
Based on the name of the attribute, NumExamplesStatsGenerator looks like a generator that is not picklable.
However, I couldn't find that attribute in the module's current source.
A search indicates that this module contains the attribute in 1.4.0, so you may want to try a newer TFDV version.
PATH_TO_WHL_FILE indicates a file to stage/distribute to Dataflow for execution, so you can use a file on GCS.
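For illustration, a minimal sketch of staging the wheel from GCS instead of a local path (the bucket and object name below are assumptions, reused from the commented-out line in the question):
# Hypothetical GCS location; upload the wheel there first, for example with
#   gsutil cp tensorflow_data_validation-1.7.0-*.whl gs://my-bucket/wheels/
PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'

# Beam stages every entry listed in extra_packages to the Dataflow workers;
# per the note above, a gs:// path can be used here as well as a local file.
setup_options = options.view_as(SetupOptions)
setup_options.extra_packages = [PATH_TO_WHL_FILE]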

TensorFlow Extended Kubeflow Multiple Workers

I have an issue with TFX on the Kubeflow DAG Runner. The issue is that I only manage to start one pod per run. I don't see any configuration for "workers" except in the Apache Beam arguments, which doesn't help.
Running the CSV load on one pod results in an OOMKilled error because the file is more than 5 GB. I tried splitting the file into 100 MB parts, but that did not help either.
So my question is: how do I run a TFX job/stage on Kubeflow on multiple "worker" pods, and is that even possible?
Here is the code I've been using:
# Assumed imports for TFX 0.26 (not shown in the original snippet)
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner
from tfx.utils.dsl_utils import external_input

examples = external_input(data_root)
example_gen = CsvExampleGen(input=examples)
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

dsl_pipeline = pipeline.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    components=[
        example_gen, statistics_gen
    ],
    enable_cache=True,
    beam_pipeline_args=['--num_workers=%d' % 5]
)

if __name__ == '__main__':
    tfx_image = 'custom-aws-imgage:tfx-0.26.0'
    config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=kubeflow_dag_runner.get_default_kubeflow_metadata_config(),
        tfx_image=tfx_image)
    kfp_runner = kubeflow_dag_runner.KubeflowDagRunner(config=config)
    # KubeflowDagRunner compiles the DSL pipeline object into a KFP pipeline package.
    # By default it is named <pipeline_name>.tar.gz
    kfp_runner.run(dsl_pipeline)
Environment:
Docker image: tensorflow/tfx:0.26.0 with boto3 installed (aws related issue)
Kubernetes: AWS EKS latest
Kubeflow: 1.0.4
It seems that this is not possible at this time.
See:
https://github.com/kubeflow/kubeflow/issues/1583

Flask app runs in Flask but not from Python

I'm trying to get a web page to execute a Python script on my Raspberry Pi.
I've installed Flask as described here, Serving Raspberry Pi with Flask, and entered the sample program:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=80, debug=True)
but when I run it (sudo python hello-flask.py) I get an error about it not having an application or something.
I tried the same thing from the Flask website:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"
and that runs OK, but using a different run method:
$ FLASK_APP=hello.py flask run
which doesn't work for the first app.
What am I doing wrong?
NOTE: I tried to get Apache to execute my Python scripts, but couldn't get that working either. I got it to the point where I could use Python to generate HTML, but not to execute a Python script to do anything with the GPIO.

Can I switch between wkhtmltopdf libraries programmatically?

I have several versions of the wkhtmltopdf library installed on my server. I want to be able to switch between them programmatically when I'm about to render a report, because we have several development teams and they use different versions of wkhtmltopdf. Different wkhtmltopdf versions give totally different rendered results, which is weird. Is it possible to switch between them programmatically?
This is not the full code, but I tried this type of code; maybe it will work for you:
import os
import subprocess
import logging

from openerp import tools  # Odoo config helpers: master/openerp/tools/which.py

_logger = logging.getLogger(__name__)

def find_in_path(name):
    path = os.environ.get('PATH', os.defpath).split(os.pathsep)
    if tools.config.get('bin_path') and tools.config['bin_path'] != 'None':
        path.append(tools.config['bin_path'])
    return tools.which(name, path=os.pathsep.join(path))

def _get_wkhtmltopdf_bin():
    return find_in_path('wkhtmltopdf')

wkhtmltopdf_state = 'install'
try:
    process = subprocess.Popen(
        [_get_wkhtmltopdf_bin(), '--version'], stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    print "prrrrrrrrrrrrrrrr", process.communicate()[0]
    # here write your logic
except (OSError, IOError):
    _logger.info('You need Wkhtmltopdf to print a pdf version of the reports.')
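Building on the helper above, here is a minimal sketch of picking a specific binary per version; the install paths and the version mapping are assumptions for illustration, not part of the original answer:
import subprocess

# Hypothetical install locations -- adjust to wherever each wkhtmltopdf
# version actually lives on the server.
WKHTMLTOPDF_BINARIES = {
    '0.12.1': '/opt/wkhtmltopdf-0.12.1/bin/wkhtmltopdf',
    '0.12.5': '/opt/wkhtmltopdf-0.12.5/bin/wkhtmltopdf',
}

def render_pdf(html_path, pdf_path, version='0.12.5'):
    # Pick the binary for the requested version instead of searching PATH.
    binary = WKHTMLTOPDF_BINARIES[version]
    # Basic CLI form: wkhtmltopdf <input.html> <output.pdf>
    subprocess.check_call([binary, html_path, pdf_path])
The same idea fits the Odoo helper above: have _get_wkhtmltopdf_bin() return the mapped path for the team's version instead of the first binary found on PATH.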

import json in STAF/STAX

I have been working with STAF & STAX. My objective is to read a JSON file using STAF & STAX and return a testcase PASS or FAIL. I tried updating my STAF to the latest version, with the latest Python version.
Python Version Detail
20130408-15:38:19
Python Version : 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)]
Here is my Code:
try:
    import simplejson as json
except ImportError:
    import json

title = []
album = []
slist = []

json_data = open('d:\\Json_File.txt')
data = json.load(json_data)

for i in range(data["result"].__len__()):
    title = data["result"][i]["Title"]
    album = data["result"][i]["Album"]
    slist = data["result"][i]["Title"] + ' [' + data["result"][i]["Album"] + '] \n'
It is giving the error shown below:
20130408-11:32:26 STAXPythonEvaluationError signal raised. Terminating job.
===== XML Information =====
File: new13.xml, Machine: local://local
Line 15: Error in element type "script".
===== Python Error Information =====
com.ibm.staf.service.stax.STAXPythonEvaluationException:
Traceback (most recent call last):
File "<pyExec string>", line 1, in <module>
ImportError: No module named simplejson
===== Call Stack for STAX Thread 1 =====[
function: main (Line: 7, File: C:\STAF\services\stax\samples\new13.xml, Machine: local://local)
sequence: 1/2 (Line: 14, File: C:\STAF\services\stax\samples\new13.xml, Machine: local://local)
]
What's the process to include JSON in a STAF module?
STAX uses Jython (a version of Python written in Java), not Python, to execute code within a <script> element in a STAX job. As I said, I was using the latest version of STAX, v3.5.4, which provides an embedded Jython 2.5.2 (which implements the same set of language features as Python 2.5) to execute code within a <script> element.
Note: Jython 2.5.2 does not include simplejson; the json module is only included in Python 2.6 or later.
Appendix F: "Jython and CPython Differences" in the STAX User's Guide talks about some differences between Jython and Python (aka CPython). Installing Python 2.7 or later on the system will have no effect on the fact that STAX uses Jython 2.5.2 to execute code within a <script> element in a STAX job. However, simplejson can be run via Jython: I added the directory containing the simplejson module to sys.path within my STAX job and then imported simplejson. For example:
<script>
myPythonDir = 'C:/simplejson'
import sys
pythonpath = sys.path
# Append myPythonDir to sys.path if not already present
if myPythonDir not in pythonpath:
    sys.path.append(myPythonDir)
import simplejson as json
</script>
Or, if you want to use the Python 2.7 or later that you installed on your system (which includes the json module), you can run a Python script (that uses json) via your STAX job using a <process> element.
For example, to use Python 2.7 (if installed in C:\Python2.7) to run a Python script named YourPythonScript.py in C:\tests:
<process>
<location>'local'</location>
<command mode="'shell'">'C:/Python2.7/bin/python.exe YourPythonScript.py'</command>
<workdir>'C:/tests'</workdir>
</process>
I have little idea about STAF/STAX, but going by what the error says, it seems the simplejson module is not available. Rewrite the import line as follows:
try:
    import simplejson as json
except ImportError:
    import json
You can fall back to the json module in case the import fails (Python 2.6+).