IPython loading variables to workspace: can you think of a better solution than this?

I'm migrating from MATLAB to IPython, and before taking the leap I'm going through my minimal workflow to make sure every operation I perform daily in MATLAB for data crunching is available in IPython.
I'm currently stuck on the very basic task of saving and loading numpy arrays via a one-line command, such as MATLAB's:
>>> save('myresults.mat','a','b','c')
>>> load('myresults.mat')
In particular, what I like about MATLAB's load command is that not only does it read
the data file, it also loads the variables into the workspace; nothing else is needed to start working with them. Note that this is not the case with, for instance, numpy.load(), which requires another line to assign the loaded values to workspace variables. [ See: IPython: how to automagically load npz file and assign values to variables? ]
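For reference, this is the plain NumPy round trip being compared against (the array names here are just examples):
import numpy as np

a, b, c = np.arange(3), np.zeros(4), np.ones(5)
np.savez('myresults.npz', a=a, b=b, c=c)    # saving is a single line
data = np.load('myresults.npz')             # but loading needs a second step...
a, b, c = data['a'], data['b'], data['c']   # ...to bind the arrays to names again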
Based on the answers and comments to that question, I came up with this dirty-bad-engineering-ugly-coding-but-working solution. I know it's not pretty, and I would like to know if you can come up with the correct version of this [1].
I put this into iocustom.py:
def load(filename):
    ip = get_ipython()
    ip.ex("import numpy as np")
    ip.ex("locals().update(np.load('" + filename + "'))")
so that I can run, from the ipython session:
import iocustom
load('myresults.npz')
and the variables are dumped to the workspace.
I find it hard to believe there's no built-in equivalent to this, and it's even harder to believe that this 3-line function is the optimal solution. I would be very grateful if you could please suggest a more correct way of doing this.
Please keep in mind that:
I'm looking for a solution which would also work inside a script and a function.
I know there's "pickle" but I refuse to use more than one line of code for something as mundane as a simple 'save' and/or 'load' command.
I know there's "savemat" and "loadmat" available from scipy, but I would like to migrate completely, i.e., do not work with mat files but with numpy arrays.
Thanks in advance for all your help.
[1] BTW: how do people working with IPython save and load a set of numpy arrays easily? After hours of googling I cannot seem to find a simple and straightforward solution for this daily task.

If I save this as load_on_run.py:
import argparse
import numpy as np

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-l', '--list', help='list variables', action='store_true')
    parser.add_argument('filename')
    __args = parser.parse_args()
    data = np.load(__args.filename)
    locals().update(data)
    del parser, data, argparse, np
    if __args.list:
        print([k for k in locals() if not k.startswith('__')])
    del __args
And then in ipython I can invoke it with %run:
In [384]: %run load_on_run testarrays.npz -l
['array2', 'array3', 'array4', 'array1']
In [385]: array3
Out[385]: array([-10, -9, -8, -7, -6, -5, -4, -3, -2, -1])
It neatly loads the arrays from the file into the ipython workspace.
I'm taking advantage of the fact that magic %run runs a script, leaving all functions and variables defined by it in the main namespace. I haven't looked into how it does this.
The script just takes a few arguments, loads the file (so far only .npz), and uses the locals().update trick to put its variables into the local namespace. Then I clear out the unnecessary variables and modules, leaving only the newly loaded ones.
I could probably define an alias for %run load_on_run.
I can also imagine a script along these lines that lets you load variables with an import: from <script> import *.
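A rough sketch of that import-based variant, with the file name hard-coded purely for illustration (the module name loadvars is made up):
# loadvars.py -- hypothetical import-time loader
import numpy as np

_data = np.load('myresults.npz')
globals().update(_data)            # bind each stored array to a module-level name
__all__ = list(_data.keys())       # so `from loadvars import *` exports only the arrays
del _data, np
Then, in an IPython session, from loadvars import * pulls the arrays straight into the workspace.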

You could assign the values in the npz file to global variables:
import numpy as np

def spill(filename):
    f = np.load(filename)
    for key, val in f.items():   # .items() works on Python 2 and 3; .iteritems() is gone from NpzFile in newer NumPy
        globals()[key] = val
    f.close()
This solution works in Python 2 and Python 3, and in any flavor of interactive shell, not just IPython. Using spill is fine for interactive use, but not for scripts, because:
It gives the file the ability to rebind arbitrary names to arbitrary values. That can lead to surprising, hard-to-debug behavior, or even be a security risk.
Dynamically created variable names are hard to program with. As the Zen of Python (import this) says, "Namespaces are one honking great idea -- let's do more of those!" For a script it is better to keep the values in the NpzFile, f, and access them by indexing, such as f['x'].
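A small usage sketch of the two styles, with hypothetical file and key names:
import numpy as np

# interactive session: dump the stored arrays into the global namespace
spill('myresults.npz')

# script: keep the NpzFile and index into it explicitly
f = np.load('myresults.npz')
x = f['x']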


Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary and end results were previously stored in an SQL database using sqlalchemy, but we need to move them to Delta tables.
After lots of investigation, I've made the first part, the binary file parsing, work, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the Executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easy way to make these work?
I read in [1] that we could switch to the Pandas on Spark API, but to me that seems to be something to do within the package method itself. Is that maybe the solution, to rewrite the entire package and its parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when outputting results (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        if type(value) == pd.Series:
            result[key] = value.tolist()
        elif type(value) == pd.DataFrame:
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
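For completeness, a sketch of how the resulting UDF might be applied: withColumn and F.col are standard PySpark API, while df and the column names inputDict and spectrum are placeholders for this example.
from pyspark.sql import functions as F

# attach the spectrum struct as a new column computed from the parsed input struct
df_with_spectrum = df.withColumn("spectrum", spectrumparser(F.col("inputDict")))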

What is the difference between `matplotlib.rc` and `matplotlib.rcParams`? And which one to use?

I have been using matplotlib.rc in my scripts to preprocess my plots. But recently I have realized that using matplotlib.rcParams is much easier before doing a quick plot interactively (e.g. via IPython). This got me thinking about what the difference between the two is.
I searched the matplotlib documentation, but it provides no clear answer in this regard. Moreover, when I issue type(matplotlib.rc), the interpreter says that it is a function. On the other hand, when I issue type(matplotlib.rcParams), I am told that it is a class object. These two answers are not at all helpful, and hence I would appreciate some help differentiating the two.
Additionally, I would like to know which one to prefer over the other.
Thanks in advance.
P.S. I went through this question: What's the difference between matplotlib.rc and matplotlib.pyplot.rc? but the answers are specific to the difference between the matplotlib instance and the pyplot instance of the two types I am enquiring about and, hence, are also not that helpful.
matplotlib.rc is a function that updates matplotlib.rcParams.
matplotlib.rcParams is a dict subclass that provides a validated key-value map for Matplotlib configuration.
The docs for mpl.rc are at https://matplotlib.org/stable/api/matplotlib_configuration_api.html?highlight=rc#matplotlib.rc and the code is here.
The class definition of RcParams is here and the instance is created here.
If we look at the guts of matplotlib.rc we see:
for g in group:
    for k, v in kwargs.items():
        name = aliases.get(k) or k
        key = '%s.%s' % (g, name)
        try:
            rcParams[key] = v
        except KeyError as err:
            raise KeyError(('Unrecognized key "%s" for group "%s" and '
                            'name "%s"') % (key, g, name)) from err
where we see that matplotlib.rc does indeed update matplotlib.rcParams (after doing some string formatting).
You should use whichever one is more convenient for you. If you know exactly which key you want to update, then interacting with the dict-like rcParams is better; if you want to set a whole bunch of values in a group, then mpl.rc is likely better!
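For example, the two styles side by side (the keys shown are standard Matplotlib rcParams):
import matplotlib as mpl

# update individual keys directly on the dict-like RcParams instance
mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['lines.linestyle'] = '--'

# or set several values in the "lines" group with one call
mpl.rc('lines', linewidth=2, linestyle='--')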

VS Code - Completion is terrible, is it my setup?

Code completion and IntelliSense in VS Code are absolutely god-awful for me. In every language. I have extensions installed and updated, but it's always absolute trash.
import pandas as pd
data_all = pd.read_csv(DATA_FILE, header=None)
data_all. (press tab)
No suggestions.
Do you really not know it's a pandas DataFrame object? It's literally the line above.
I have this issue in Python, in Ruby/Rails, in pretty much every language I try to use; the completion is absolute garbage. Do I have an extension that is breaking other extensions? Is Code just this bad? Why is it so inexplicably useless?
Installed Currently:
abusaidm.html-snippets#0.2.1
alefragnani.numbered-bookmarks#8.0.2
bmewburn.vscode-intelephense-client#1.6.3
bung87.rails#0.16.11
bung87.vscode-gemfile#0.4.0
castwide.solargraph#0.21.1
CoenraadS.bracket-pair-colorizer#1.0.61
donjayamanne.python-extension-pack#1.6.0
ecmel.vscode-html-css#1.10.2
felixfbecker.php-debug#1.14.9
felixfbecker.php-intellisense#2.3.14
felixfbecker.php-pack#1.0.2
formulahendry.auto-close-tag#0.5.10
golang.go#0.23.2
groksrc.ruby#0.1.0
k--kato.intellij-idea-keybindings#1.4.0
KevinRose.vsc-python-indent#1.12.0
Leopotam.csharpfixformat#0.0.84
magicstack.MagicPython#1.1.0
miguel-savignano.ruby-symbols#0.1.8
ms-dotnettools.csharp#1.23.9
ms-mssql.mssql#1.10.1
ms-python.python#2021.2.636928669
ms-python.vscode-pylance#2021.3.1
ms-toolsai.jupyter#2021.3.619093157
ms-vscode.cpptools#1.2.2
rebornix.ruby#0.28.1
sianglim.slim#0.1.2
VisualStudioExptTeam.vscodeintellicode#1.2.11
wingrunr21.vscode-ruby#0.28.0
Zignd.html-css-class-completion#1.20.0
If you check the IntelliSense of the read_csv() method (by hovering your mouse over it), you will see that it returns a DataFrame object:
(function)
read_csv(reader: IO, sep: str = ...,
#Okay... very long definition but scroll to the end...
float_precision: str | None = ...) -> DataFrame
But if you use IntelliSense to check the variable data_all:
import pandas as pd
data_all = pd.read_csv(DATA_FILE, header=None)
It is listed as the default data type in Python: Any. That's why your compiler isn't generating the autocomplete.
So, you simply need to explicitly tell your compiler that it is, in fact, a DataFrame object as shown.
import pandas as pd
from pandas.core.frame import DataFrame
DATA_FILE = "myfile"
data_all: DataFrame = pd.read_csv(DATA_FILE, header=None)
# Now all autocomplete options on data_all are available!
It might seem strange that the compiler cannot guess the data type in this example, until you realize that the read_csv() method is overloaded with many definitions, and some of them return objects as Any type. So the compiler assumes the worst-case scenario and treats it as an Any-type object unless specified otherwise.

Accessing a .fits file and plotting its columns

I'm trying to access a .fits file and plot two of its columns (out of many!).
I used pyfits to access the file, and
plt.plotfile('3XMM_DR5.fits', delimiter=' ', cols=(0, 1), names=('x-axis','y-axis'))
but that's not working. Are there any alternatives? And is there any way to open the file using Python, so I can access the data table?
According to the docs from matplotlib for plotfile:
Note: plotfile is intended as a convenience for quickly plotting data from flat files; it is not intended as an alternative interface to general plotting with pyplot or matplotlib.
This isn't very clear. I think by "flat files" it just means CSV data or something--this function isn't used very much in my experience, and it certainly doesn't know anything about FITS files, which are seldom used outside astronomy. You mentioned in your post that you did something with PyFITS, but that isn't demonstrated anywhere in your question.
PyFITS, incidentally, has been deprecated for several years now, and its functionality is integrated into Astropy.
You can open a table from a FITS file with astropy.table.Table.read:
from astropy.table import Table
table = Table.read('3XMM_DR5.fits')
then access the columns with square bracket notation like:
plt.plot(table['whatever the x axis column is named'], table['y axis column name'])
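If you don't know the column names in advance, you can list them first. A short sketch, where the two plotted column names are placeholders to replace with real names from the file:
from astropy.table import Table
import matplotlib.pyplot as plt

table = Table.read('3XMM_DR5.fits')
print(table.colnames)                   # list the available column names

plt.plot(table['RA'], table['DEC'], '.')   # placeholder column names
plt.show()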

Is binary identical output possible with XlsxWriter?

With the same input is it possible to make the output binary identical using XlsxWriter?
I tried changing the created property to the same date, and that helped a little, but I still get a lot of differences in sharedStrings.xml.
Thanks
Yes, for identical input, if you set the created date in the workbook's document properties:
import xlsxwriter
import datetime

for filename in ('hello1.xlsx', 'hello2.xlsx'):
    workbook = xlsxwriter.Workbook(filename)
    workbook.set_properties({'created': datetime.date(2016, 4, 25)})
    worksheet = workbook.add_worksheet()
    worksheet.write('A1', 'Hello world')
    workbook.close()
Then:
$ cmp hello1.xlsx hello2.xlsx
# No output. Files are the same.
The order in which strings are added to the file will change the layout of the sharedStrings table and thus lead to non-identical files. That is generally the case with Excel as well.
Note: This requires XlsxWriter version 1.0.4 or later to work.
Even though the author of the previous answer appears to have repudiated it, it appears to be correct, but it is not the whole story. I did my own tests on Python 3.7 and XlsxWriter 1.1.2. You won't notice the creation-time issue if your files are small, because they'll be written so fast that their default creation times of "now()" will be the same.
What's missing from the first answer is that you need to make the same number of calls to the write_* methods. For example, if you call write followed by merge_range on the same cell for one of the workbooks, you need to have the same sequence of calls for the other. You can't skip the write call and just do merge_range, for instance. If you do, the sharedStrings.xml files will have different values of count even if the value of uniqueCount is the same.
If you can arrange for these things to be true, then your two workbooks should come out as equal at the binary level.
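To illustrate the second point, a minimal sketch (file names and cell contents are made up): as long as every workbook is produced by the identical sequence of calls with a fixed created date, the outputs should match byte for byte.
import datetime
import xlsxwriter

def build(filename):
    workbook = xlsxwriter.Workbook(filename)
    workbook.set_properties({'created': datetime.date(2016, 4, 25)})
    worksheet = workbook.add_worksheet()
    # the same write_*/merge_range calls, in the same order, for every file
    worksheet.write('A1', 'Hello world')
    worksheet.merge_range('A2:B2', 'Hello world')
    workbook.close()

build('report1.xlsx')
build('report2.xlsx')
# `cmp report1.xlsx report2.xlsx` should then report no differences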