Improve error message for ill-formed input format? - cntk

I have a map file containing data like this:
|labels 0 0 1 0 0 0 |features 0
|labels 1 0 0 0 0 0 |features 2
|labels 0 0 0 1 0 0 |features 3
|labels 0 0 0 0 0 1 |features 7
Data is read into a minibatch with the following code:
from cntk import Trainer, StreamConfiguration, text_format_minibatch_source, learning_rate_schedule, UnitType
mb_source = text_format_minibatch_source('test_map2.txt', [
    StreamConfiguration('features', 1),
    StreamConfiguration('labels', num_classes)])
test_minibatch = mb_source.next_minibatch(2)
If the input file is ill-formed, you sometimes get quite a cryptic error message. For example, a missing line break at the end of the last row of the input file results in an error like this:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-35-2f1481ccfced> in <module>()
----> 1 test_minibatch = mb_source.next_minibatch(2)
C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\utils\swig_helper.py in wrapper(*args, **kwds)
56 @wraps(f)
57 def wrapper(*args, **kwds):
---> 58 result = f(*args, **kwds)
59 map_if_possible(result)
60 return result
C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\io\__init__.py in next_minibatch(self, minibatch_size_in_samples, input_map, device)
159
160 mb = super(MinibatchSource, self).get_next_minibatch(
--> 161 minibatch_size_in_samples, device)
162
163 if input_map:
C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\cntk_py.py in get_next_minibatch(self, *args)
1914
1915 def get_next_minibatch(self, *args):
-> 1916 return _cntk_py.MinibatchSource_get_next_minibatch(self, *args)
1917 MinibatchSource_swigregister = _cntk_py.MinibatchSource_swigregister
1918 MinibatchSource_swigregister(MinibatchSource)
RuntimeError: Invalid chunk requested.
Sometimes it can be hard to figure out where in the file the problem is. Would it be possible to emit a more specific error message? A line number in the input file would be useful.

Thanks for reporting the issue. We have created a bug and will be working on fixing the reader behavior with ill-formed input.
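In the meantime, a quick pre-validation pass over the map file can point at the offending line. This is a minimal sketch, not part of CNTK; the field names match the example file above, and the trailing-newline check covers the missing line break case:
# sketch: validate a CTF-style map file and report problem lines by number
def validate_map_file(path):
    with open(path, 'rb') as f:
        raw = f.read()
    if not raw.endswith(b'\n'):
        print('warning: file does not end with a line break')
    for lineno, line in enumerate(raw.decode('utf-8').splitlines(), start=1):
        if not line.strip():
            continue
        if '|labels' not in line or '|features' not in line:
            print('line %d: missing |labels or |features field' % lineno)

validate_map_file('test_map2.txt')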

Related

How to understand an IndexError message?

I am attempting to use the pretty-print confusion-matrix library to create a confusion matrix.
When I run pp_matrix(df, cmap=cmap), I get the following error message:
Traceback (most recent call last):
File "/Users/name/folder/subfolder/subsubfolder/prepositions.py", line 27, in <module>
pp_matrix(df, cmap=cmap)
File "/opt/anaconda3/lib/python3.8/site-packages/pretty_confusion_matrix/pretty_confusion_matrix.py", line 222, in pp_matrix
txt_res = configcell_text_and_colors(
File "/opt/anaconda3/lib/python3.8/site-packages/pretty_confusion_matrix/pretty_confusion_matrix.py", line 59, in configcell_text_and_colors
tot_rig = array_df[col][col]
IndexError: index 37 is out of bounds for axis 0 with size 37
The first few lines of my DataFrame look like this:
in auf mit zu ... an-entlang auf-entlang ohne außerhalb
into 318 8 10 9 ... 0 0 0 0
in 4325 727 681 62 ... 0 0 0 0
on 253 3197 215 46 ... 0 0 0 0
at 206 280 54 9 ... 0 0 0 0
with 384 397 2097 24 ... 0 0 0 0
by 31 12 23 0 ... 0 0 0 0
in-front-of 15 15 25 0 ... 0 0 0 0
The total size is 49 rows x 36 columns.
I think the issue has to do with zero-indexing in Python, but to be honest, I'm not even sure how to go about debugging this.
Any suggestions would be greatly appreciated. Thank you in advance.
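One thing worth checking (a hedged guess, not a confirmed answer): the failing line tot_rig = array_df[col][col] indexes the same position along both axes, which only works for a square matrix, and the frame here is 49 rows by 36 columns. A small sketch to test that hypothesis:
# sketch: check whether the frame passed to pp_matrix is square
print(df.shape)   # (49, 36) according to the question
if df.shape[0] != df.shape[1]:
    print('frame is not square, so the diagonal access array_df[col][col] '
          'can run past the end of an axis')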

Is there another way to solve this pandas set_option problem?

I'm analyzing a DataFrame and want to see a more detailed listing,
but even though I searched for solutions on Google,
the output does not change and I don't understand why.
What is the problem?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Import data
df = pd.read_csv(r"C:\Users\Administrator\Desktop\medical.txt")
pd.set_option("display.max_rows", 50)
pd.set_option('display.max_columns', 15)
print(df)
id age gender height weight ap_hi ap_lo cholesterol gluc
0 0 18393 2 168 62.0 110 80 1 1
1 1 20228 1 156 85.0 140 90 3 1
2 2 18857 1 165 64.0 130 70 3 1
3 3 17623 2 169 82.0 150 100 1 1
4 4 17474 1 156 56.0 100 60 1 1
... ... ... ... ... ... ... ... ...
69995 99993 19240 2 168 76.0 120 80 1 1
69996 99995 22601 1 158 126.0 140 90 2 2
69997 99996 19066 2 183 105.0 180 90 3 1
69998 99998 22431 1 163 72.0 135 80 1 2
69999 99999 20540 1 170 72.0 120 80 2 1
Look at the "Frequently used options" section of https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html.
You can see that if "max_rows" is lower than the total number of rows in your dataframe, the output is truncated with an ellipsis, exactly as in your result.
Below is a copy of the interesting part from the link I gave you:
If you need to display more columns, use
pd.set_option('display.width', 1000)
or
pd.set_option('display.width', None)
For rows, you may simply use
df.head(50)
or
df.tail(50)
or display everything with
pd.set_option("display.max_rows", None)
Why the call itself is not the problem: in the pandas source, set_option is defined as a CallableDynamicDoc wrapper, and the second argument there (_set_option_tmpl) is an internal docstring template, not a row limit.
The code is as follows:
set_option = CallableDynamicDoc(_set_option, _set_option_tmpl)
CallableDynamicDoc:
class CallableDynamicDoc:
    def __init__(self, func, doc_tmpl):
        self.__doc_tmpl__ = doc_tmpl
        self.__func__ = func

    def __call__(self, *args, **kwds):
        return self.__func__(*args, **kwds)

    @property
    def __doc__(self):
        opts_desc = _describe_option("all", _print_desc=False)
        opts_list = pp_options_list(list(_registered_options.keys()))
        return self.__doc_tmpl__.format(opts_desc=opts_desc, opts_list=opts_list)
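You can verify that the option really did take effect by reading it back with pd.get_option; if it returns 50, the setting was applied and the output is simply truncated because the frame has 70000 rows (a small check, not part of the pandas source above):
import pandas as pd

pd.set_option("display.max_rows", 50)
print(pd.get_option("display.max_rows"))   # -> 50, so the option was applied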

How to use a list of categories that an example belongs to as a feature when solving a classification problem?

One of the features looks like this:
1 170,169,205,174,173,246,247,249,380,377,383,38...
2 448,104,239,277,276,99,154,155,76,412,139,333,...
3 268,422,419,124,1,17,431,343,341,435,130,331,5...
4 50,53,449,106,279,420,161,74,123,364,231,18,23...
5 170,169,205,174,173,246,247,249,380,377,383,38...
It tells us which categories the example belongs to.
How should I use it when solving a classification problem?
I've tried to use dummy variables,
df = df.join(features['cat'].str.get_dummies(',').add_prefix('contains_'))
but the data seen later may contain categories that were not present in the training set, so I do not know how to preprocess all the objects consistently.
That's interesting. I didn't know str.get_dummies, but maybe I can help you with the rest.
You basically have two problems:
1. The set of categories you get later contains categories that were unknown while training the model. You have to get rid of these.
2. The set of categories you get later does not contain all categories. You have to make sure you generate dummies for them as well.
Problem 1: filtering out unknown/unwanted categories
The first problem is easy to solve:
# create a set of all categories, you want to allow
# either definie it as a fixed set, or extract it from your
# column like this (the output of the map is actually irrelevant)
# the result will be in valid_categories
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# now if you want to normalize your data before you do the
# dummy encoding, you can cleanse the data by
# splitting it, creating an intersection and then joining
# it back again to get a string on which you can work with
# str.get_dummies
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',')
Problem 2: generating dummies for all known categories
The second problem can be solved by adding a dummy row that contains all
categories (e.g. with df.append) just before you call get_dummies,
and removing it right after get_dummies.
# e.g. you can do it like this
# get a new index value to
# be able to remove the row later
# (this only works if you have
# a numeric index)
dummy_index= df.index.max()+1
# assign the categories
#
df.loc[dummy_index]= {'id':999, 'categories': ','.join(valid_categories)}
# now do the processing steps
# mentioned in the section above
# then create the dummies
# after that remove the dummy line
# again
df.drop(labels=[dummy_index], inplace=True)
Example:
import io
import pandas as pd
raw= """id categories
1 170,169,205,174,173,246,247
2 448,104,239,277,276,99,154
3 268,422,419,124,1,17,431,343
4 50,53,449,106,279,420,161,74
5 170,169,205,174,173,246,247"""
df= pd.read_fwf(io.StringIO(raw))
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# remove 154 and 170 for demonstration purposes
valid_categories.remove('170')
valid_categories.remove('154')
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',').str.get_dummies(',')
Out[622]:
1 104 106 124 161 169 17 173 174 205 239 246 247 268 276 277 279 343 419 420 422 431 448 449 50 53 74 99
0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1
2 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0
4 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
You can see that there are no columns for 154 and 170.
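As an alternative not covered in the answer above, scikit-learn's MultiLabelBinarizer lets you fix the allowed category set up front, which handles both problems in one step (a sketch; the data and category list here are made up for illustration):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'categories': ['170,169,205', '448,104,239', '170,104']})
known = ['104', '169', '170', '205', '239', '448']   # the allowed category set

# unknown categories are ignored (with a warning), known ones always get a column
mlb = MultiLabelBinarizer(classes=known)
dummies = pd.DataFrame(mlb.fit_transform(df['categories'].str.split(',')),
                       columns=mlb.classes_, index=df.index)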

Input contains infinity or a value too large for dtype('float64')

I've seen many similar questions here, but none of the answers solved my problem.
I am trying to apply a power transform to my dataset, but I still get this error.
The dataset does not contain inf or NaN values, and I have checked that no value exceeds the float64 maximum. I also tried reindexing the dataframe beforehand.
features_training = features_training.astype(np.float64)
target_training = target_training.astype(np.float64)
features_test = features_test.astype(np.float64)
target_test = target_test.astype(np.float64)
print(np.where(features_training.values >= np.finfo(np.float64).max))
print(np.where(features_test.values >= np.finfo(np.float64).max))
print(np.where(target_training.values >= np.finfo(np.float64).max))
print(np.where(target_test.values >= np.finfo(np.float64).max))
print(np.isnan(features_training.values).any())
print(np.isnan(features_test.values).any())
print(np.isnan(target_training.values).any())
print(np.isnan(target_test.values).any())
print(np.isinf(features_training.values).any())
print(np.isinf(features_test.values).any())
print(np.isinf(target_training.values).any())
print(np.isinf(target_test.values).any())
pt_X = PowerTransformer().fit(features_training)
pt_Y = PowerTransformer().fit(np.asarray(target_training).reshape(-1,1))
features_training = pt_X.transform(features_training)
target_training = pt_Y.transform(np.asarray(target_training).reshape(-1,1))
features_test = pt_X.transform(features_test)
target_test = pt_Y.transform(np.asarray(target_test).reshape(-1,1))
Using dataframe.info()
features training
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Columns: 138 entries
dtypes: float64(138)
memory usage: 545.6 KB
None
target training
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 1 columns):
506 non-null float64
dtypes: float64(1)
memory usage: 4.0 KB
None
features test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Columns: 138 entries
dtypes: float64(138)
memory usage: 519.7 KB
None
target test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 1 columns):
482 non-null float64
dtypes: float64(1)
memory usage: 3.8 KB
None
Error traceback
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-6ca93dd1855a> in <module>
21 # features already normalized. Target remains the same
22 features_training, features_test, target_training, target_test, ptX_, pt_Y = normalization(features_training, features_test,
---> 23 target_training, target_test)
24
25 model.fit(features_training, target_training)
<ipython-input-99-9199a48b9d30> in normalization(features_training, features_test, target_training, target_test)
47 target_training = pt_Y.transform(np.asarray(target_training).reshape(-1,1))
48
---> 49 features_test = pt_X.transform(features_test)
50 target_test = pt_Y.transform(np.asarray(target_test).reshape(-1,1))
51
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\preprocessing\data.py in transform(self, X)
2731
2732 if self.standardize:
-> 2733 X = self._scaler.transform(X)
2734
2735 return X
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\preprocessing\data.py in transform(self, X, copy)
756 X = check_array(X, accept_sparse='csr', copy=copy,
757 estimator=self, dtype=FLOAT_DTYPES,
--> 758 force_all_finite='allow-nan')
759
760 if sparse.issparse(X):
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
540 if force_all_finite:
541 _assert_all_finite(array,
--> 542 allow_nan=force_all_finite == 'allow-nan')
543
544 if ensure_min_samples > 0:
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57 # for object dtype data, we only check for NaNs (GH-13254)
58 elif X.dtype == np.dtype('object') and not allow_nan:
ValueError: Input contains infinity or a value too large for dtype('float64').
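A hedged observation, not from the thread: the traceback fails inside PowerTransformer.transform at the standardization step (self._scaler.transform(X)), i.e. after the power transform has already been applied, so the infinities may be produced by the transform itself (for example, overflow with extreme fitted lambdas) even though the raw input is finite. A small sketch to test that guess:
import numpy as np
from sklearn.preprocessing import PowerTransformer

# fit without the internal standardization so the raw transform output can be inspected
pt = PowerTransformer(standardize=False).fit(features_training)
transformed = pt.transform(features_test)

bad_cols = np.where(~np.isfinite(transformed).all(axis=0))[0]
print('columns producing non-finite values:', bad_cols)
print('their fitted lambdas:', pt.lambdas_[bad_cols])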

How do I read a tab-separated CSV in blaze?

I have a "CSV" data file with the following format (well, it's rather a TSV):
event pdg x y z t px py pz ekin
3383 11 -161.515 5.01938e-05 -0.000187112 0.195413 0.664065 0.126078 -0.736968 0.00723234
1694 11 -161.515 -0.000355633 0.000263174 0.195413 0.511853 -0.523429 0.681196 0.00472714
4228 11 -161.535 6.59631e-06 -3.32796e-05 0.194947 -0.713983 -0.0265468 -0.69966 0.0108681
4233 11 -161.515 -0.000524488 6.5069e-05 0.195413 0.942642 0.331324 0.0406377 0.017594
This file is interpretable as-is in pandas:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False) # Works
data = read_table("test.csv", index_col=False) # Works
However, when I try to read it in blaze (which declares that it accepts pandas keyword arguments), an exception is thrown:
from blaze import Data
Data("test.csv") # Attempt 1
Data("test.csv", sep="\t") # Attempt 2
Data("test.csv", sep="\t", index_col=False) # Attempt 3
None of these works, and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff() from the standard library (which fails).
Is there a way to read this file properly in blaze? Given that its "little brother" is a few hundred MB, I want to use blaze's sequential processing capabilities.
Thanks for any ideas.
Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327
Edit2:
Complete error:
Error                                     Traceback (most recent call last)
in ()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
54 if isinstance(data, _strtypes):
55 data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56 **kwargs)
57 if (isinstance(data, Iterator) and
58 not isinstance(data, tuple(not_an_iterator))):
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
62
63 def __call__(self, s, *args, **kwargs):
---> 64 return self.dispatch(s)(s, *args, **kwargs)
65
66 @property
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
277 def resource_csv(uri, **kwargs):
--> 278 return CSV(uri, **kwargs)
279
280
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
102 if has_header is None:
103 self.has_header = (not os.path.exists(path) or
--> 104 infer_header(path, sniff_nbytes))
105 else:
106 self.has_header = has_header
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
58 with open_file(path, 'rb') as f:
59 raw = f.read(nbytes)
---> 60 return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
61
62
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
392 # subtracting from the likelihood of the first row being a header.
393
--> 394 rdr = reader(StringIO(sample), self.sniff(sample))
395
396 header = next(rdr) # assume first row is header
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
187
188 if not delimiter:
--> 189 raise Error("Could not determine delimiter")
190
191 class dialect(Dialect):
Error: Could not determine delimiter
I am working with Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.
conda install dask
conda install blaze
Here is a way you can import the data for use with blaze: parse it first with pandas and then convert it into blaze. Perhaps this defeats the purpose, but it works without trouble.
As a side note, in order to parse the data file correctly, your pandas read statement should be:
from blaze import Data
from pandas import DataFrame, read_csv
data = read_csv("csvdata.dat", sep="\s*", index_col=False)
bdata = Data(data)
Now the data is formatted correctly with no errors, bdata:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
Here is an alternative: use dask. It can probably do the same chunking or large-scale processing you are looking for, and it makes it easy to load a TSV correctly right away.
In [17]: import dask.dataframe as dd
In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)
In [19]: df.head()
Out[19]:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
4 854 11 -161.515 0.000032 0.000418 0.195414 0.675752 0.315671
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
3 0.040638 0.017594
4 -0.666116 0.012641
In [20]:
See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask
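If you only need sequential, chunked processing and blaze keeps tripping over the sniffer, plain pandas chunked reading is another option (a sketch, not from the answers above; the chunksize is arbitrary):
from pandas import read_csv

# process the large TSV in fixed-size chunks instead of loading it all at once
for chunk in read_csv("test.csv", sep="\t", index_col=False, chunksize=100000):
    # replace this with the actual per-chunk computation
    print(chunk["ekin"].sum())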