Is it possible to specify log level with a string, not an int? - python-logging

I know I can specify a log level like this:
>>> import logging
>>> logging.log(level=50, msg='critical log')
CRITICAL:root:critical log
Is there a built in way for me to accomplish the same thing with the string 'CRITICAL', instead of the number 50? Something like
>>> import logging
>>> logging.log(level_string='CRITICAL', msg='critical log')
CRITICAL:root:critical log

You can either use the predefined logging methods or use the built-in conversion function:
import logging
# this works:
logging.critical("critical log")
# the logging module also provides predefined constants:
logging.log(level=logging.CRITICAL, msg="critical log")
# you can also use `getattr` if you have the level as a string:
logging.log(getattr(logging, "CRITICAL"), "critical log")
# or, if you really want to use the string, there's a private helper (not recommended,
# since _checkLevel is an internal API that may change between versions):
logging.log(level=logging._checkLevel("CRITICAL"), msg="critical log")

getLevelName works for me:
>>> logging.getLevelName("CRITICAL")
50
which results in
>>> logging.log(level=logging.getLevelName("CRITICAL"), msg='critical log')
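Note that getLevelName maps in both directions (number to name and name to number), so a small helper can accept either form. A minimal sketch, with a hypothetical helper name:
import logging

logging.getLevelName(50)          # 'CRITICAL'
logging.getLevelName('CRITICAL')  # 50

# hypothetical convenience wrapper that accepts a level name or a number
def log_by_level(level, msg):
    if isinstance(level, str):
        level = logging.getLevelName(level.upper())
    logging.log(level, msg)

log_by_level('CRITICAL', 'critical log')
One caveat: for an unknown name, getLevelName returns the string 'Level <name>' instead of raising, so validate the input if it comes from users.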

Related

What is the best way to send Arrow data to the browser?

I have Apache Arrow data on the server (Python) and need to use it in the browser. It appears that Arrow Flight isn't implemented in JS. What are the best options for sending the data to the browser and using it there?
I don't even need it necessarily in Arrow format in the browser. This question hasn't received any responses, so I'm adding some additional criteria for what I'm looking for:
Self-describing: don't want to maintain separate schema definitions
Minimal overhead: For example, an array of float32s should transfer as something compact like a data type indicator, length value and sequence of 4-byte float values
Cross-platform: Able to be easily sent from Python and received and used in the browser in a straightforward way
Surely this is a solved problem? If it is, I've been unable to find a solution. Please help!
Building off of the comments on your original post by David Li, you can implement a non-streaming version of what you want without too much code, using PyArrow on the server side and the Apache Arrow JS bindings on the client. The Arrow IPC format satisfies your requirements because it ships the schema with the data, is space-efficient and zero-copy, and is cross-platform.
Here's a toy example that generates a record batch on the server and receives it on the client:
Server:
from io import BytesIO

from flask import Flask, send_file
from flask_cors import CORS
import pyarrow as pa

app = Flask(__name__)
CORS(app)

@app.get("/data")
def data():
    data = [
        pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True])
    ]
    batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return send_file(BytesIO(sink.getvalue().to_pybytes()), "data.arrow")
Client:
import { tableFromIPC } from "apache-arrow";

const table = await tableFromIPC(fetch(URL));
// Do what you like with your data
Edit: I added a runnable example at https://github.com/amoeba/arrow-python-js-ipc-example.

reading from hive table and updating same table in pyspark - using checkpoint

I am using spark version 2.3 and trying to read a hive table in spark as:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.table("emp.emptable")
Here I am adding a new column with the current system date to the existing dataframe:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
and now I am facing an issue when I try to write this dataframe as a hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
so I am checkpointing the dataframe to break the lineage, since I am reading and writing from the same dataframe:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This way it works fine and the new column has been added to the hive table, but I have to delete the checkpoint files every time they get created. Is there a better way to break the lineage and write the same dataframe with the updated column details, saving it to an hdfs location or as a hive table?
Or is there a way to specify a temp location for the checkpoint directory which gets deleted once the spark session completes?
As we discussed in this post, setting the below property is the way to go.
spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
That question had a different context: we wanted to retain the checkpointed dataset, so we did not bother adding a cleanup solution.
Setting the above property works sometimes (tested in Scala, Java and Python), but it's hard to rely on it. The official documentation says that this property Controls whether to clean checkpoint files if the reference is out of scope. I don't know exactly what that means, because my understanding is that once the spark session/context is stopped it should clean them up. It would be great if someone could shed light on it.
Regarding
Is there a better way to break the lineage
Check this question: @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself though.
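For reference, a minimal untested sketch of that idea against the table from your question (assuming an active SparkSession named spark):
import pyspark.sql.functions as F

df = spark.table("emp.emptable")
newdf = df.withColumn('LOAD_DATE', F.current_date())

# Rebuilding the DataFrame from its own RDD and schema gives it a fresh
# logical plan, detached from the original table scan, so the overwrite
# no longer conflicts with the read.
newdf = spark.createDataFrame(newdf.rdd, newdf.schema)
newdf.write.mode("overwrite").saveAsTable("emp.emptable")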
Just FYI, I usually don't rely on the above property and instead delete the checkpointed directory in code itself, to be on the safe side.
We can get the checkpointed directory like below:
Scala :
// Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
// It returns a String, so we can use org.apache.hadoop.fs to delete the path
PySpark:
# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'

# notice the 'u' prefix: the call returns a unicode object, so wrap it with str(t)
# Below are the steps to get a hadoop file system object and delete the path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False
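If you want that cleanup to happen automatically, here is a small hedged sketch (using the same spark session and py4j accessors as above) that removes the checkpoint directory when the Python driver process exits:
import atexit

def cleanup_checkpoint_dir(sc):
    # Resolve the checkpoint directory chosen by this session, if any
    opt = sc._jsc.sc().getCheckpointDir()
    if opt.isDefined():
        jvm = sc._jvm
        path = jvm.org.apache.hadoop.fs.Path(str(opt.get()))
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        fs.delete(path, True)  # recursive delete

# Register the cleanup to run when the driver exits
atexit.register(cleanup_checkpoint_dir, spark.sparkContext)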

Why won't my application start with pandas_udf and PySpark+Flask?

When my Flask+PySpark application has a function with a @udf or @pandas_udf annotation, it will not start. If I simply remove the annotation, it does start.
If I try to start my application with Flask, the first pass of lexical interpretation of the script is executed. For example, the debugger stops at import lines such as
from pyspark.sql.functions import pandas_udf, udf, PandasUDFType
However, no statement is executed at all, including the initial app = Flask(__name__) statement. (Could it be some kind of hidden exception?)
If I start my application without Flask, with the same exact function and with the same imports, it does work.
These are the imports:
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf, PandasUDFType
import pandas as pd
This is the function:
@pandas_udf('string', PandasUDFType.SCALAR)
def pandas_not_null(s):
    return s.fillna("_NO_NA_").replace('', '_NO_E_')
This is the statement that is not executed when @pandas_udf is there:
app = Flask(__name__)
This is how IntelliJ starts Flask:
FLASK_APP = app
FLASK_ENV = development
FLASK_DEBUG = 1
In folder /Users/vivaomengao/projects/dive-platform/cat-intel/divecatintel
/Users/vivaomengao/anaconda/bin/python /Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py --module --multiproc --qt-support=auto --client 127.0.0.1 --port 56486 --file flask run
I'm running macOS on my own computer.
I found the problem. The @pandas_udf annotation requires a Spark session at the time the module is loaded (the decorator runs during Python's "first pass" over the module). To solve the problem, I first call the code that creates the Spark session, and only afterwards import the module containing the @pandas_udf-decorated function. I import it inside the caller function rather than at the top of the file.
To troubleshoot, I set a breakpoint on the @pandas_udf function (in PyCharm) and stepped into the functions so I could inspect the local variables. One of them referred to something like "sc" or "_jvm", and I knew from a past problem that this happens when the Spark session has not been initialized.
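A minimal sketch of that pattern, with hypothetical module, route, and app names (my_udfs.py and /clean are illustrative, not from the original code):
# my_udfs.py -- only imported after the Spark session exists
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('string', PandasUDFType.SCALAR)
def pandas_not_null(s):
    return s.fillna("_NO_NA_").replace('', '_NO_E_')

# app.py
from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)
# Create the Spark session before anything touches @pandas_udf
spark = SparkSession.builder.appName("flask-udf").getOrCreate()

@app.route("/clean")
def clean():
    # Deferred import: the decorator only executes now, with Spark already up
    from my_udfs import pandas_not_null
    df = spark.createDataFrame([("a",), (None,)], ["col"])
    return df.select(pandas_not_null("col")).toPandas().to_json()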

pygtk3 checking does signal exist

How can I check whether a signal exists before connecting anything?
Example:
def does_signal_exist(widget, signal):
    pass

button = Gtk.Button()
does_signal_exist(button, "clicked")             # returns True
does_signal_exist(button, "hello")               # returns False
does_signal_exist(button, "unboxed")             # returns False
does_signal_exist(button, "button-press-event")  # returns True
There's a wondrous tool called 'manual' or 'reference'. Look for the Gtk-3.0 manual, click on Classes, and look for Gtk.Button, then look for Signals. If a signal is not there, you can do the same for the signals inherited from parent classes.
Using help(button) or even help(Gtk.Button) (inside Python 3, either interactively or in a program), you have access to all the methods and lots of other information about the class and the instance.
Using the manual mentioned above, check out the GiRepository module - these are functions which you can use to look inside Gtk, its classes and properties.
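For example, a minimal sketch using GObject introspection; as far as I know GObject.signal_lookup returns a non-zero signal id when the named signal is defined for (or inherited by) the widget's type, and 0 otherwise:
import gi
gi.require_version("Gtk", "3.0")
from gi.repository import Gtk, GObject

def does_signal_exist(widget, signal_name):
    # signal_lookup returns 0 when the type has no signal with that name
    return GObject.signal_lookup(signal_name, type(widget)) != 0

button = Gtk.Button()
print(does_signal_exist(button, "clicked"))             # True
print(does_signal_exist(button, "hello"))               # False
print(does_signal_exist(button, "button-press-event"))  # True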
And you can just check if it works: Use try:/except: to check if you can actually connect to the signal/event you are interested in.
from traceback import format_exc

def does_signal_exist(gtkWidgetObj, sigTag):
    # note: this actually emits the signal, which may trigger connected handlers
    try:
        gtkWidgetObj.emit(sigTag)
        return True
    except:
        return "unknown signal name" not in format_exc()

Nonetype Error on Python Google Search Script - Is this a spam prevention tactic?

Fairly new to Python so apologies if this is a simple ask. I have browsed other answered questions but can't seem to get it functioning consistently.
I found the below script, which prints the top result from google for a set of defined terms. It works the first few times that I run it but displays the following error after I have searched 20 or so terms:
Traceback (most recent call last):
File "term2url.py", line 28, in <module>
results = json['responseData']['results']
TypeError: 'NoneType' object has no attribute '__getitem__'
From what I can gather, this indicates that one of the attributes does not have a defined value (potentially a result of google blocking me?). I attempted to solve the issue by adding in the else clause though I still run into the same problem.
Any help would be greatly appreciated; I have pasted the full code below.
Thanks!
#
# This is a quick and dirty script to pull the most likely url and description
# for a list of terms. Here's how you use it:
#
# python term2url.py < {a txt file with a list of terms} > {a tab delimited file of results}
#
# You'll need to install the simplejson module to use it
#
import urllib
import urllib2
import simplejson
import sys

# Read the terms we want to convert into URLs from input redirected from the command line
terms = sys.stdin.readlines()

for term in terms:
    # Define the query to pass to Google Search API
    query = urllib.urlencode({'q': term.rstrip("\n")})
    url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s" % (query)

    # Fetch the results and convert to JSON format
    search_results = urllib2.urlopen(url)
    json = simplejson.loads(search_results.read())

    # Process the results by pulling the first record, which has the best match
    results = json['responseData']['results']
    for r in results[:1]:
        if results is not None:
            url = r['url']
            desc = r['content'].encode('ascii', 'replace')
        else:
            url = "none"
            desc = "none"

    # Print the results to stdout. Use redirect to capture the output
    print "%s\t%s" % (term.rstrip("\n"), url)

    import time
    time.sleep(1)
Here are some Python details for you first:
None is a valid object in Python, of the type NoneType:
print(type(None))
Produces:
<class 'NoneType'>
And the no attribute error you got is normal when you try to access a method or attribute that an object doesn't have. In this case, you were attempting to use the __getitem__ syntax (object[item_index]), which NoneType objects don't support because they don't have a __getitem__ method.
The point of the previous explanation is that your assumption about what your error means is correct: your results object is essentially empty.
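A minimal sketch of how you might guard against that, keeping your existing variable names; these lines would sit inside your for term in terms: loop in place of the current results = ... block, skipping a term instead of crashing when responseData comes back as None:
# Guard against a None responseData before indexing into it
response_data = json.get('responseData')
if response_data is None:
    print "%s\t%s" % (term.rstrip("\n"), "none")
    continue

results = response_data.get('results') or []
for r in results[:1]:
    url = r['url']
    desc = r['content'].encode('ascii', 'replace')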
As for why you're hitting this in the first place, I believe you are running up against Google's API limits. It looks like you're using the old API that is now deprecated. The number of search results (not queries) used to be limited to around 64 per query, and there used to be no rate or per-day limit. However, since it's been deprecated for over 5 years now, there may be new undocumented limits.
I don't think it necessarily has anything to do with spam prevention, but I do believe it is an undocumented limit.