TypeError on read_csv: working in pandas 0.7, error in 0.8.0rc2, possible dependency error? - pandas

I am attempting to execute the following in Python:
from pandas import *
tickdata = read_csv('/home/user/sourcefile.csv',index_col=0,parse_dates='TRUE')
The csv file has rows that look like:
2011/11/23 23:56:00.554389,1165.2500
2011/11/23 23:56:02.310943,1165.5000
2011/11/23 23:56:05.564009,1165.2500
On pandas 0.7, this executes fine. On pandas 0.8.0rc2, I get the error below. Because I have 0.7 and 0.8 installed on two different systems, I have not ruled out a dependency or Python version difference. Any ideas on how to get this to execute under 0.8 are appreciated.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 225, in read_csv
return _read(TextParser, filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 192, in _read
return parser.get_chunk()
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 728, in get_chunk
index = self._agg_index(index)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 846, in _agg_index
if try_parse_dates and self._should_parse_dates(self.index_col):
File "/usr/local/lib/python2.7/dist-packages/pandas-0.8.0rc2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 874, in _should_parse_dates
return i in to_parse or name in to_parse
TypeError: 'in <string>' requires string as left operand, not int

I fixed the parser bug shown in the stack trace you pasted. However, I'm wondering: is your date column actually named "TRUE", or did you mean to pass a boolean? I haven't dug through the pandas history, but I know that in 0.8 we support much more complex date-parsing behavior as part of the time series API, so here we interpret the string value as a column name.
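For clarity, a minimal sketch of the call if the intent was simply boolean date parsing of the index (same file as above):
from pandas import read_csv
# Passing the boolean True (not the string 'TRUE') asks the parser to parse
# the index column as dates; in 0.8 a string is interpreted as a column name.
tickdata = read_csv('/home/user/sourcefile.csv', index_col=0, parse_dates=True)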

I've reported the bug on GitHub (best place for bug reports):
https://github.com/pydata/pandas/issues/1544
Should have a resolution today or tomorrow.

Related

Python matplotlib/pandas fails with "OverflowError: value too large to convert to npy_uint32"

I am trying to run this variant call pipeline with 144 samples, so the resulting files are quite big. I managed to get it almost to the end, but the last rule (plots_stats) fails with OverflowError: value too large to convert to npy_uint32. It is a Python script that plots from a gzipped tsv file. I guess I just have too many rows in my calls.tsv.gz to be handled. The complete error log is:
Traceback (most recent call last):
File "/[PATH]/workflow_var_calling/.snakemake/scripts/tmp10j_ba31.plot-depths.py", line 16, in <module>
sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/series.py", line 2899, in unstack
return unstack(self, level, fill_value)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 501, in unstack
constructor=obj._constructor_expanddim)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
self.index = index.remove_unused_levels()
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1494, in remove_unused_levels
uniques = algos.unique(lab)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/algorithms.py", line 367, in unique
table = htable(len(values))
File "pandas/_libs/hashtable_class_helper.pxi", line 937, in pandas._libs.hashtable.Int64HashTable.__cinit__
OverflowError: value too large to convert to npy_uint32
Any ideas?
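If the row-count guess above is right, one rough way to check it (a sketch, assuming calls.tsv.gz is the input file) is to count the rows against the 32-bit limit named in the error:
import gzip
# npy_uint32 tops out at 2**32 - 1; the hash table in the traceback is sized
# with a 32-bit quantity, so an input this long (or a stack/unstack product
# this long) would plausibly overflow it.
with gzip.open('calls.tsv.gz', 'rt') as fh:
    n_rows = sum(1 for _ in fh)
print(n_rows, 'rows; exceeds uint32 range:', n_rows > 2**32 - 1)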

Running the Temporal Fusion Transformer default dataset: shape error

I ran the default Temporal Fusion Transformer code, downloaded from GitHub, in Google Colab. After cloning, when I ran step 2 the training test would not run:
python3 -m script_train_fixed_params volatility outputs yes
The problem is the shape error below.
Computing best validation loss
Computing test loss
/usr/local/lib/python3.7/dist-packages/keras/engine/training_v1.py:2079: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.
updates=self.state_updates,
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/MyDrive/tft_tf2/script_train_fixed_params.py", line 239, in <module>
use_testing_mode=True) # Change to false to use original default params
File "/content/drive/MyDrive/tft_tf2/script_train_fixed_params.py", line 156, in main
targets = data_formatter.format_predictions(output_map["targets"])
File "/content/drive/MyDrive/tft_tf2/data_formatters/volatility.py", line 183, in format_predictions
output[col] = self._target_scaler.inverse_transform(predictions[col])
File "/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_data.py", line 1022, in inverse_transform
force_all_finite="allow-nan",
File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 773, in check_array
"if it contains a single sample.".format(array)
ValueError: Expected 2D array, got 1D array instead:
array=[-1.43120418 1.58885804 0.28558148 ... -1.50945972 -0.16713021
-0.57365613].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've tried to modify the code around data_formatters/volatility.py, line 183, in format_predictions, because I guessed that's where the problem arises, but I couldn't fix it.
You have to change line 183 in volatility.py to
output[col] = self._target_scaler.inverse_transform(predictions[col].values.reshape(-1, 1))
and line 216 in electricity.py to
sliced_copy[col] = target_scaler.inverse_transform(sliced_copy[col].values.reshape(-1, 1))
Afterwards the electricity example works fine, and I'd guess the same applies to volatility.
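For context, a standalone sketch (not from the TFT repo) of why the reshape is needed: this scikit-learn version requires 2D input to a scaler's inverse_transform, with one column per feature:
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(np.array([[1.0], [2.0], [3.0]]))  # scalers are fit on 2D arrays

preds = np.array([-1.43, 1.59, 0.29])  # 1D, like predictions[col].values
# scaler.inverse_transform(preds) raises "Expected 2D array, got 1D array instead"
restored = scaler.inverse_transform(preds.reshape(-1, 1))  # (n_samples, 1)
print(restored.ravel())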

Pandas to_hdf fails on dataframes containing nullable int dtypes (e.g. Int8Dtype)

I'm trying to reduce the memory consumption of some large data that we work with, so that more data can be appended to it without throwing memory errors. Downcasting floats where possible helps a little, but the major savings I've found have been from casting float64s to Int8 and Int16 where possible. This data contains NaNs. This is unavoidable, and in context there is no value I can replace NaNs with that doesn't change the meaning of the data. The new nullable dtypes are great for this, but I get ValueError: cannot convert float NaN to integer when trying to save the resulting frames to hdf.
I've tried using to_hdf with and without specifying table format, and get different errors (without specifying table format the error is AttributeError: 'NoneType' object has no attribute 'names'):
import pandas as pd
import numpy as np

df = pd.DataFrame([1, 2, 3, np.nan, 5], columns=['A'])
df.to_hdf('Z:/test.hd5', 'data')
# This works

df['A'] = df.A.astype(pd.Int8Dtype())
df.to_hdf('Z:/test.hd5', 'data')
# This raises:
Traceback (most recent call last):
File "<ipython-input-51-6b0f3ad26286>", line 1, in <module>
df.to_hdf('Z:/test.hd5', 'data', complevel=9, complib='blosc:zlib')
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\core\generic.py", line 2377, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 274, in to_hdf
f(store)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 268, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\io\pytables.py", line 889, in put
self._write_to_group(key, value, append=append, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 1415, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3 \lib\site-packages\pandas\io\pytables.py", line 3022, in write
blk.values, items=blk_items)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\pytables.py", line 2750, in write_array
atom = _tables().Atom.from_dtype(value.dtype)
File "C:\Users\marnoch.hamilton-jon\AppData\Local\Continuum\anaconda3\lib\site-packages\tables\atom.py", line 381, in from_dtype
if basedtype.names:
AttributeError: 'NoneType' object has no attribute 'names'
Is this a bug? An intentional limitation? Or have I done something dumb?
This is a bug. See GitHub Issue #26144 for the status.
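Until that's fixed, one possible workaround (my assumption, not from the issue) is to cast the nullable columns to float just for serialization, then re-cast after reading; you lose the memory savings only around I/O:
import pandas as pd

df = pd.DataFrame({'A': pd.array([1, 2, None, 5], dtype=pd.Int8Dtype())})

# Hypothetical workaround: float64 is NaN-capable and round-trips through HDF;
# a recent pandas can cast the float-with-NaN column back to a nullable Int.
df.astype({'A': 'float64'}).to_hdf('Z:/test.hd5', 'data')
restored = pd.read_hdf('Z:/test.hd5', 'data').astype({'A': pd.Int8Dtype()})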

Pandas 0.24.0 breaks my pandas dataframe with special column identifiers

I had code that worked fine until I tried to run it on a coworker's machine, whereupon I discovered that while it worked using pandas 0.22.0, it broke on pandas 0.24.0. For the moment, we've solved this problem by downgrading their copy of pandas, but I would like to find a better solution if one exists.
The problem seems to be that I am creating a user-defined class to use as identifiers for my columns in the dataframe. When comparing two dataframes, pandas for some reason tries to call my column labels as functions, and then throws an exception because they aren't callable.
Here's some example code:
import pandas as pd
import numpy as np

class label(object):
    def __init__(self, var):
        self.var = var
    def __eq__(self, other):
        return self.var == other.var

df = pd.DataFrame(np.eye(5), columns=[label(ii) for ii in range(5)])
df == df
This produces the following stack trace:
Traceback (most recent call last):
File "<ipython-input-4-496e4ab3f9d9>", line 1, in <module>
df==df1
File "C:\...\site-packages\pandas\core\ops.py", line 2098, in f
return dispatch_to_series(self, other, func, str_rep)
File "C:\...\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
new_data = expressions.evaluate(column_op, str_rep, left, right)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
return _evaluate(op, op_str, a, b, **eval_kwargs)
File "C:\...\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
return op(a, b)
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in column_op
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1135, in <dictcomp>
for i in range(len(a.columns))}
File "C:\...\site-packages\pandas\core\ops.py", line 1739, in wrapper
name=res_name).rename(res_name)
File "C:\...\site-packages\pandas\core\series.py", line 3733, in rename
return super(Series, self).rename(index=index, **kwargs)
File "C:\...\site-packages\pandas\core\generic.py", line 1091, in rename
level=level)
File "C:\...\site-packages\pandas\core\internals\managers.py", line 171, in rename_axis
obj.set_axis(axis, _transform_index(self.axes[axis], mapper, level))
File "C:\...\site-packages\pandas\core\internals\managers.py", line 2004, in _transform_index
items = [func(x) for x in index]
TypeError: 'label' object is not callable
I've found I can fix the problem by making my class callable with a single argument and returning that argument, but that breaks .loc indexing, which will default to treating my objects as callables.
This problem only occurs when the custom objects are in the columns - the index can handle them just fine.
Is this a bug or a change in usage, and is there any way I can work around it without giving up my custom labels?

Python Matplotlib errors with savefig (newbie).

All parts of Python on my computer were recently installed from the Enthought academic package, but I use PyScripter for editing and running code. I'm very early in my learning curve, so I could very well be overlooking something obvious here.
When I try to create a plot and save it like so:
import matplotlib.pylab as pl
pl.hist(myEst, bins=20, range=(.1,.60))
pl.ylabel("Freq")
pl.xlabel("Success Probability")
pl.title('Histogram of Binomial Estimator')
pl.axis([0, 1, 0, 500])
pl.vlines(.34, 0, 500)
pl.savefig('TestHist.png')
pl.show()
I get these errors:
Traceback (most recent call last):
File "<editor selection>", line 9, in <module>
File "C:\Python27\lib\site-packages\matplotlib\figure.py", line 1172, in savefig
self.canvas.print_figure(*args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\backends\backend_wxagg.py", line 100, in print_figure
FigureCanvasAgg.print_figure(self, filename, *args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\backend_bases.py", line 2017, in print_figure
**kwargs)
File "C:\Python27\lib\site-packages\matplotlib\backends\backend_agg.py", line 450, in print_png
filename_or_obj = file(filename_or_obj, 'wb')
IOError: [Errno 13] Permission denied: 'TestHist.png'
If I take out the pl.savefig('TestHist') line everything works fine, and I can see the plot I want, but when that line is in there I get the errors.
I've checked my backend version using pl.get_backend(), it returns 'WXAgg', which according to documentation should be able to use .png format.
I've also tried including an explicit format='png' and format=png within the savefig command, but I still get errors.
Can anyone give me advice on how to proceed, or another approach for saving a plot?
There's nothing wrong with your code. I just tested it locally on my machine. The issue is this error:
IOError: [Errno 13] Permission denied: 'TestHist.png'
You are most likely trying to save the file somewhere that the Python process doesn't have permission to access. What OS are you on? Where are you trying to save the file?
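One quick way to test that hypothesis (a sketch; the temp directory is just a known-writable location, not specific to your setup):
import os
import tempfile
import matplotlib.pylab as pl

# The OS temp dir is almost always writable by the current user, so a
# successful save here points at permissions on the original target folder.
out_path = os.path.join(tempfile.gettempdir(), 'TestHist.png')
pl.hist([0.2, 0.3, 0.35, 0.4], bins=20, range=(.1, .60))
pl.savefig(out_path)
print('wrote', out_path)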
If it helps others: I made the silly mistake of not actually designating a file name, and as a result got the same error message that led me to this question.
Here is the code that was generating the error:
plt.savefig('C:\\Users\\bwarn\\Canopy', format='png')
Here is my correction that resolved the error (you'll see I designated the actual file and name):
plt.savefig('C:\\Users\\bwarn\\Canopy\\myplot.png', format='png')
The following worked for me when I was running a neural network on my Windows machine:
image_path = 'A:/DeepLearning/Padhai/MLFlow/images/%s.png' % (expt_id)
plt.savefig(image_path)
Or, alternatively, use 'r' in front of the path (a raw string), as in the sketch below.
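A minimal sketch of that raw-string alternative (reusing the example path from the earlier answer):
import matplotlib.pyplot as plt

# The r prefix keeps backslashes in Windows paths from being read as escapes.
plt.savefig(r'C:\Users\bwarn\Canopy\myplot.png', format='png')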