After discovering forbidden fruit, a CPython package that made me stare in sheer horror as I watched the unholy desecration of Python's sacred built-in classes, I was wondering whether such a thing can be (or has been) done in Jython too?
For instance, do something along the lines of:
>>> from evil import desecrate
>>> unleash_hell = lambda x: "Madness"
>>> desecrate(int, "__str__", unleash_hell)
>>> print(int(10))
Madness
I just tried this with success:
>>> def sayHello(self):
...     print 'hello'
...
>>> import java.lang.String as String
>>> String.sayHello = sayHello
>>> String().sayHello()
hello
I'm writing a teaching document that uses lots of examples of Python code and includes the resulting numeric output. I'm working from inside IPython and a lot of the examples use NumPy.
I want to avoid print statements, explicit formatting or type conversions. They clutter the examples and detract from the principles I'm trying to explain.
What I know:
From IPython I can use %precision to control the displayed precision of any float results.
I can use np.set_printoptions() to control the displayed precision of elements within a NumPy array.
What I'm looking for is a way to control the displayed precision of a NumPy float64 scalar which doesn't respond to either of the above. These get returned by a lot of NumPy functions.
>>> x = some_function()
Out[2]: 0.123456789
>>> type(x)
Out[3]: numpy.float64
>>> %precision 2
Out[4]: '%.2f'
>>> x
Out[5]: 0.123456789
>>> float(x) # that precision works for regular floats
Out[6]: 0.12
>>> np.set_printoptions(precision=2)
>>> x # but doesn't work for the float64
Out[8]: 0.123456789
>>> np.r_[x] # does work if it's in an array
Out[9]: array([0.12])
What I want is
>>> # some formatting command
>>> x = some_function() # that returns a float64 = 0.123456789
Out[2]: 0.12
but I'd settle for:
- a way of telling NumPy to give me Python float scalars by default, rather than float64.
- a way of telling IPython how to handle a float64, kind of like what I can do with _repr_pretty_ for my own classes.
IPython has formatters (core/formatters.py) which contain a dict that maps a type to a format method. There seems to be some knowledge of NumPy in the formatters but not for the np.float64 type.
There are a bunch of formatters, for HTML, LaTeX etc. but text/plain is the one for consoles.
We first get the IPython formatter for console text output
plain = get_ipython().display_formatter.formatters['text/plain']
and then set a formatter for the float64 type. We reuse the formatter that already exists for float, since it already knows about %precision:
plain.for_type(np.float64, plain.lookup_by_type(float))
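Putting those two lines together, here is a minimal, self-contained sketch that could be dropped into an IPython session or startup file (assuming NumPy is importable as np):
import numpy as np

# Reuse IPython's existing plain-text float formatter (which respects
# %precision) for NumPy float64 scalars as well.
plain = get_ipython().display_formatter.formatters['text/plain']
plain.for_type(np.float64, plain.lookup_by_type(float))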
Now
In [26]: a = float(1.23456789)
In [28]: b = np.float64(1.23456789)
In [29]: %precision 3
Out[29]: '%.3f'
In [30]: a
Out[30]: 1.235
In [31]: b
Out[31]: 1.235
Digging into the implementation, I also found that %precision calls np.set_printoptions() with a suitable format string. I didn't know it did this, and it is potentially problematic if the user has already set the print options themselves. Following on from the example above
In [32]: c = np.r_[a, a, a]
In [33]: c
Out[33]: array([1.235, 1.235, 1.235])
we see it is doing the right thing for array elements.
I can do this formatter initialisation explicitly in my own code, but a better fix might be to modify IPython's core/formatters.py around line 677
@default('type_printers')
def _type_printers_default(self):
    d = pretty._type_pprinters.copy()
    d[float] = lambda obj,p,cycle: p.text(self.float_format%obj)
    # suggested "fix"
    if 'numpy' in sys.modules:
        d[numpy.float64] = lambda obj,p,cycle: p.text(self.float_format%obj)
    # end suggested fix
    return d
to also handle np.float64 here if NumPy has been imported. Happy for feedback on this; if I feel brave I might submit a PR.
Keras fit_generator() has a parameter pickle_safe which defaults to False.
Can training run faster if it is pickle safe, and should I accordingly set the flag to True?
According to Keras's docs:
pickle_safe: If True, use process based threading. Note that because this implementation relies on multiprocessing, you should not pass non picklable arguments to the generator as they can't be passed easily to children processes.
I don't understand exactly what this is saying.
How can I determine whether my arguments are picklable (pickle safe) or not?
If it's relevant:
- I'm passing in a custom generator
- the generator function takes the arguments X_train, y_train, batch_size, p_keep (of type np.array, np.array, int, and float respectively)
- I'm not using a GPU
- Also, I'm using Keras 1.2.1, though I believe this argument behaves the same in Keras 2
I have no familiarity with Keras, but from a glance at the documentation, pickle_safe just means that the tuples produced by your generator must be "picklable".
pickle is a standard Python module that is used to serialize and deserialize objects. The standard multiprocessing implementation uses the pickle mechanism to share objects between different processes -- since the two processes don't share the same address space, they cannot directly see the same Python objects. So, to send objects from process A to process B, they are pickled in A (which produces a sequence of bytes in a specific, well-known format), the pickled form is then sent via an interprocess-communication mechanism to B, and unpickled in B, producing a copy of A's original object in B's address space.
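As a tiny illustration of that round trip (a hedged sketch, not Keras-specific), the tuple passed as args below is pickled in the parent and rebuilt in the child process:
import multiprocessing as mp

def child(batch):
    # 'batch' arrives here as a copy, reconstructed from the pickled bytes
    print(batch)

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # the 'spawn' start method always pickles the arguments
    p = ctx.Process(target=child, args=(([1, 2, 3], [0, 1, 0]),))
    p.start()
    p.join()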
So, to discover if your objects are picklable, just invoke, say, pickle.dumps on them.
>>> import pickle
>>> class MyObject:
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...
>>> foo = MyObject(1, 2, 3)
>>> pickle.dumps(foo)
b'\x80\x03c__main__\nMyObject\nq\x00)\x81q\x01}q\x02(X\x01\x00\x00\x00cq\x03K\x03X\x01\x00\x00\x00aq\x04K\x01X\x01\x00\x00\x00bq\x05K\x02ub.'
>>>
dumps produces a byte string. We can now reconstitute the foo object from the byte string as bar using loads:
>>> foo_pick = pickle.dumps(foo)
>>> bar = pickle.loads(foo_pick)
>>> bar
<__main__.MyObject object at 0x7f5e262ece48>
>>> bar.a, bar.b, bar.c
(1, 2, 3)
If something is not picklable, you'll get an exception. For example, lambdas can't be pickled:
>>> class MyOther:
...     def __init__(self, a, b, c):
...         self.a = a
...         self.b = b
...         self.c = c
...         self.printer = lambda: print(self.a, self.b, self.c)
...
>>> other = MyOther(1, 2, 3)
>>> other_pick = pickle.dumps(other)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: Can't pickle local object 'MyOther.__init__.<locals>.<lambda>'
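Applied to the setup in the question, a quick check could look like the following sketch (the array shapes are just placeholders for whatever your generator actually yields):
import pickle
import numpy as np

# Stand-ins for one batch produced by the custom generator
X_batch = np.zeros((32, 10), dtype=np.float32)
y_batch = np.zeros(32, dtype=np.int64)

# NumPy arrays, ints and floats are all picklable, so this should not raise;
# if it did raise, pickle_safe=True (process-based workers) would not work.
pickle.dumps((X_batch, y_batch))
print('batch is picklable')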
See the documentation for more info:
https://docs.python.org/3.5/library/pickle.html?highlight=pickle#what-can-be-pickled-and-unpickled
I have written some software that processes data for analysis and plotting. For each type of data, the data frames are produced in a module dedicated to that type.
Depending on the structure of the data, the data frame columns can be normal or MultiIndex.
I pass the data frames to a plotting function that produces plots of the columns that are numeric.
I would like to be able to "attach" a string to each of the "printable" columns that will be used as its plot label. This string will not be the same as the name of the column.
I can't seem to figure out a good way to do this purely with a pandas DataFrame, and so far I don't have any other solution either.
I have seen posts about metadata, but I don't completely understand whether this functionality is supported or not. At least I can't get it to work, and it seems that using frames with MultiIndex columns complicates things.
If it is not supported, is it still on the to-do list?
From my reading I get the impression that it has worked differently in different versions of pandas, and may even depend on whether Python 2 or 3 is used.
I would like to know whether there is a convenient way to accomplish what I need with pandas data frames. Is using _metadata for this advisable? If so, how?
I have looked around quite a bit, but the MultiIndex concern in particular does not seem to be addressed anywhere.
This one seems to indicate that metadata should be supported, but is it for data frames? I need Series within a DataFrame.
Adding meta-information/metadata to pandas DataFrame
This one seems to be a similar question, but I tried the solution and it did not help me.
Propagate pandas series metadata through joins
Here is some experimentation I have done based on my understanding of the _metadata functionality. It seems to indicate that _metadata did not make any difference and that the attribute did not survive a copy. It also shows that using MultiIndex is an even more "unsupported" case.
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from numpy.random import randn # To get values for the test frames
>>> import platform # To print python version
>>> # A function to set labels of the columns
>>> def labelSetter(aDF) :
...     DFtmp = aDF.copy() # Just to ensure it is a different dataframe
...     for column in DFtmp.columns :
...         DFtmp[column].myLab='This is '+column.__str__()
...         DFtmp[column].notMyLab='This should not persist'
...     return DFtmp
...
>>>
>>> print 'Pandas version: {}'.format(pd.version.version)
Pandas version: 0.15.2
>>>
>>> pd.Series._metadata.append('myLab');print pd.Series._metadata # now _metadata contains 'myLab'
['name', 'myLab']
>>>
>>> # Make dataframes normal columns and MultiIndex
>>> dfS=pd.DataFrame(randn(2, 6),columns=['a1','a2','a3','b1','b2','c1']);print dfS
a1 a2 a3 b1 b2 c1
0 -0.934869 -0.310979 0.362635 -0.994605 -0.880114 -1.663265
1 0.205341 -1.642080 -0.732969 -0.080109 -0.082483 -0.208360
>>>
>>> dfMI=pd.DataFrame(randn(2, 6),columns=[['a','a','a','b','b','c'],['a1','a2','a3','b1','b2','c1']]);print dfMI
a b c
a1 a2 a3 b1 b2 c1
0 -0.578399 0.478925 1.047342 -0.087225 1.905074 0.146105
1 0.640575 0.153328 -1.117847 1.043026 0.671220 -0.218550
>>>
>>> # Run the labelSetter function on the data frames
>>> dfSWlab=labelSetter(dfS)
>>> dfMIWlab=labelSetter(dfMI)
>>>
>>> print dfSWlab['a2'].myLab
This is a2
>>> # This worked
>>>
>>> print dfSWlab['a2'].notMyLab
This should not persist
>>> # 'notMyLab' has not been appended to _metadata but the label still persists.
>>>
>>> dfSWlabCopy=dfSWlab.copy() # make a copy to see if myLab persists.
>>>
>>> dfSWlabCopy['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # 'myLab' was appended to _metadata but still did not persist the copy
>>>
>>> print dfMIWlab['a']['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # For the MultiIndex data frame the 'myLab' is not accessible
I wrote a C extension (mycext.c) for Python 3.2. The extension relies on constant data stored in a C header (myconst.h). The header file is generated by a Python script. In the same script, I make use of the freshly compiled module. The workflow in the Python 3 script (not shown completely) is as follows:
configure_C_header_constants()
write_constants_to_C_header() # write myconst.h
os.system('python3 setup.py install --user') # compile mycext
import mycext
mycext.do_stuff()
This works perfectly fine in a Python session the first time. If I repeat the procedure in the same session (for example, in two different test cases of a unittest), the first compiled version of mycext is always (re)loaded.
How do I effectively reload an extension module with the latest compiled version?
You can reload modules in Python 3.x by using the imp.reload() function. (This function used to be a built-in in Python 2.x. Be sure to read the documentation -- there are a few caveats!)
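For completeness, a minimal sketch of that call (note that, as the next answer explains, this generally will not pick up a freshly recompiled C extension, because the shared library stays loaded):
import imp      # use importlib.reload() on Python 3.4+
import mycext   # the extension module from the question

mycext = imp.reload(mycext)  # for a C extension this usually just returns the already-loaded build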
Python's import mechanism will never dlclose() a shared library. Once loaded, the library will stay until the process terminates.
Your options (sorted by decreasing usefulness):
Move the module import to a subprocess, and call the subprocess again after recompiling, i.e. you have a Python script do_stuff.py that simply does
import mycext
mycext.do_stuff()
and you call this script using
subprocess.call([sys.executable, "do_stuff.py"])
Turn the compile-time constants in your header into variables that can be changed from Python, eliminating the need to reload the module.
Manually dlclose() the library after deleting all references to the module (a bit fragile since you don't hold all the references yourself).
Roll your own import mechanism.
Here is an example of how this can be done. I wrote a minimal Python C extension mini.so, only exporting an integer called version.
>>> import ctypes
>>> libdl = ctypes.CDLL("libdl.so")
>>> libdl.dlclose.argtypes = [ctypes.c_void_p]
>>> so = ctypes.PyDLL("./mini.so")
>>> so.PyInit_mini.argtypes = []
>>> so.PyInit_mini.restype = ctypes.py_object
>>> mini = so.PyInit_mini()
>>> mini.version
1
>>> del mini
>>> libdl.dlclose(so._handle)
0
>>> del so
At this point, I incremented the version number in mini.c and recompiled.
>>> so = ctypes.PyDLL("./mini.so")
>>> so.PyInit_mini.argtypes = []
>>> so.PyInit_mini.restype = ctypes.py_object
>>> mini = so.PyInit_mini()
>>> mini.version
2
You can see that the new version of the module is used.
For reference and experimenting, here's mini.c:
#include <Python.h>

static struct PyModuleDef minimodule = {
    PyModuleDef_HEAD_INIT, "mini", NULL, -1, NULL
};

PyMODINIT_FUNC
PyInit_mini()
{
    PyObject *m = PyModule_Create(&minimodule);
    PyModule_AddObject(m, "version", PyLong_FromLong(1));
    return m;
}
There is another way: build the extension under a new module name each time, import that, and change your references to point to it.
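A rough sketch of the idea (assuming you rebuild the extension under a versioned name each time, e.g. by changing the module name in setup.py so that both the .so file and its PyInit_<name> function get the new name; mycext_v2 is purely hypothetical):
import importlib

# After rebuilding and installing the extension as 'mycext_v2':
mycext = importlib.import_module('mycext_v2')
mycext.do_stuff()  # the name mycext now refers to the freshly built module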
Update: I have now created a Python library around this approach:
https://github.com/bergkvist/creload
https://pypi.org/project/creload/
Rather than using the subprocess module in Python, you can use multiprocessing. This allows the child process to inherit all of the memory from the parent (on UNIX systems).
For this reason, you also need to be careful not to import the C extension module into the parent.
If you return a value that depends on the C extension, this can also force the C extension to be imported in the parent when it receives the return value of the function.
import multiprocessing as mp
import sys

def subprocess_call(fn, *args, **kwargs):
    """Executes a function in a forked subprocess"""
    ctx = mp.get_context('fork')
    q = ctx.Queue(1)
    is_error = ctx.Value('b', False)

    def target():
        try:
            q.put(fn(*args, **kwargs))
        except BaseException as e:
            is_error.value = True
            q.put(e)

    ctx.Process(target=target).start()
    result = q.get()
    if is_error.value:
        raise result
    return result

def my_c_extension_add(x, y):
    assert 'my_c_extension' not in sys.modules.keys()
    # ^ Sanity check, to make sure you didn't import it in the parent process
    import my_c_extension
    return my_c_extension.add(x, y)

print(subprocess_call(my_c_extension_add, 3, 4))
If you want to extract this into a decorator for a more natural feel, you can do:
class subprocess:
    """Decorate a function to hint that it should be run in a forked subprocess"""
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *args, **kwargs):
        return subprocess_call(self.fn, *args, **kwargs)

@subprocess
def my_c_extension_add(x, y):
    assert 'my_c_extension' not in sys.modules.keys()
    # ^ Sanity check, to make sure you didn't import it in the parent process
    import my_c_extension
    return my_c_extension.add(x, y)

print(my_c_extension_add(3, 4))
This can be useful if you are working in a Jupyter notebook, and you want to rerun some function without rerunning all your existing cells.
Notes
This answer might only be relevant on Linux/macOS where you have a fork() system call:
Python multiprocessing linux windows difference
https://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
I am trying to migrate some code from using ElementTree to using lxml.etree and have encountered an error early on:
>>> import lxml.etree as ET
>>> main = ET.Element("main")
>>> another = ET.Element("another", foo="bar")
>>> main.attrib.update(another.attrib)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
main.attrib.update(another.attrib)
File "lxml.etree.pyx", line 2153, in lxml.etree._Attrib.update
(src/lxml/lxml.etree.c:46972)
ValueError: too many values to unpack (expected 2)
But I am able to update using the following:
>>> main.attrib.update({'foo': 'bar'})
Is this a bug in lxml (version 2.3) or am I just missing something obvious?
I'm getting the same error; I don't think it's only a 2.3 issue.
Workaround:
main.attrib.update(dict(another.attrib))
# or more efficient if it has many attributes:
main.attrib.update(another.attrib.iteritems())
UPDATE
lxml.etree._Attrib.update accepts a dict or an iterable (source). Although _Attrib has a dict-like interface, it is not a dict instance.
In [3]: type(another.attrib)
Out[3]: lxml.etree._Attrib
In [4]: isinstance(another.attrib, dict)
Out[4]: False
So update tries to iterate the argument directly, unpacking each item as key, value. Maybe it's done for performance; only the lxml author knows.
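You can see why the unpacking fails with a small self-contained check: iterating an _Attrib yields the attribute names (plain strings), not (key, value) pairs, so a three-character name like 'foo' blows up the two-element unpack (output shown as I would expect it):
import lxml.etree as ET

another = ET.Element("another", foo="bar")
print(list(another.attrib))          # ['foo'] -- iteration yields names only
print(list(another.attrib.items()))  # [('foo', 'bar')]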
Ways to change it in lxml:
- Subclass dict.
- Check for hasattr(sequence_or_dict, 'items') (see the sketch after this list).
I'm not familiar with Cython and don't know which is better.
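For illustration, here is a pure-Python sketch of what the second option could look like (the real method is written in Cython, and the names here are only illustrative):
def update(self, sequence_or_dict):
    # Accept anything with an items() method (dict, _Attrib, ...) as well as
    # a plain iterable of (key, value) pairs.
    if hasattr(sequence_or_dict, 'items'):
        sequence_or_dict = sequence_or_dict.items()
    for key, value in sequence_or_dict:
        self[key] = value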