Appending pandas data to hdf store, getting 'TypeError: object of type 'int' has no len()' error - pandas

Motivation:
I have about 30 million rows of data: one column is an index value and the other is a list of 512 int32 numbers. I only want to retrieve maybe a thousand or so rows at a time, so I want some sort of datastore that can look the data up by index while leaving the rest on disk.
Right now the data is split across 184 files, which can be opened by pandas.
This is what my dataframe looks like
df.head()
IndexID NumpyIds
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
There is the index, and then the column 'NumpyIds', whose values are numpy arrays of size 512 containing int32 integers.
I then tried this:
store = pd.HDFStore('/data2.h5')
store.put('index', df, format='table', append=True)
And got this
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-05b956667991> in <module>()
----> 1 store.put('index', df, format='table', append=True, data_columns=True)
2 store.close
4 frames
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors)
1040 data_columns=data_columns,
1041 encoding=encoding,
-> 1042 errors=errors,
1043 )
1044
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors)
1707 dropna=dropna,
1708 nan_rep=nan_rep,
-> 1709 data_columns=data_columns,
1710 )
1711
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns)
4141 min_itemsize=min_itemsize,
4142 nan_rep=nan_rep,
-> 4143 data_columns=data_columns,
4144 )
4145
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize)
3811 nan_rep=nan_rep,
3812 encoding=self.encoding,
-> 3813 errors=self.errors,
3814 )
3815 adj_name = _maybe_adjust_name(new_name, self.version)
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors)
4798 # we cannot serialize this data, so report an exception on a column
4799 # by column basis
-> 4800 for i in range(len(block.shape[0])):
4801
4802 col = block.iget(i)
TypeError: object of type 'int' has no len()
What am I trying to do?
I have 184 pandas files which I am trying to concatenate into one HDF file for fast lookup by index.
For example
store['index'][21]
would give me the 512-dimensional vector for index 21.
Edit:
I tried creating a column for every number, so
df[[str(i) for i in range(512)]] = pd.DataFrame(df.NumpyIds.to_numpy(), index=df.index)
df.drop(columns='NumpyIds', inplace=True)
store.put('index', df, format='table', append=True)
store.close()
This works, although I feel it may be a hack rather than an ideal workaround. But now the issue is that I can't seem to retrieve the values by index:
store.select(key='index', start=2163410)
returns
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
IndexID
0 rows × 512 columns
This returns the column names, but not the data in that row. This method also takes a lot of RAM; I wonder whether it loads all the data at once rather than just the specified index.
Another workaround I'm trying is opening the data directly in h5py
df = pd.read_hdf(hdf_files[0])
df.set_index('IndexID', inplace=True)
df.to_hdf('testhdf.h5', key='df')
h = h5py.File('testhdf.h5')
But I can't seem to figure out how to retrieve data by index from this store
h['df'][2163410]
/usr/local/lib/python3.6/dist-packages/h5py/_hl/base.py in _e(self, name, lcpl)
135 else:
136 try:
--> 137 name = name.encode('ascii')
138 coding = h5t.CSET_ASCII
139 except UnicodeEncodeError:
AttributeError: 'int' object has no attribute 'encode'

As far as I know, this is a bug; see #34274.
I've fixed it in #38919, so pandas now shows an appropriate error message instead.
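Until a release with that fix is available, one possible workaround (a sketch only) is to bypass HDFStore for the array column and keep the vectors in a plain h5py file, with the IndexIDs stored alongside them. This assumes IndexID is a regular column (as in the question's later set_index call) and that every NumpyIds entry has length 512; the dataset and function names here are invented:
import h5py
import numpy as np

def append_frame(h5path, df):
    # stack the per-row arrays into an (n, 512) int32 block
    vectors = np.stack(df["NumpyIds"].to_numpy()).astype(np.int32)
    ids = df["IndexID"].to_numpy(dtype=np.int64)
    with h5py.File(h5path, "a") as f:
        if "ids" not in f:
            f.create_dataset("vectors", data=vectors,
                             maxshape=(None, vectors.shape[1]), chunks=True)
            f.create_dataset("ids", data=ids, maxshape=(None,), chunks=True)
        else:
            n0 = f["ids"].shape[0]
            f["vectors"].resize(n0 + len(ids), axis=0)
            f["ids"].resize(n0 + len(ids), axis=0)
            f["vectors"][n0:] = vectors
            f["ids"][n0:] = ids

def lookup(h5path, wanted_ids):
    # load only the small ids dataset, find the row positions, then read just those rows
    with h5py.File(h5path, "r") as f:
        ids = f["ids"][:]
        positions = np.sort(np.flatnonzero(np.isin(ids, wanted_ids)))
        return ids[positions], f["vectors"][positions]
Called once per input file, append_frame builds the combined store, and lookup("data2.h5", [2163410, 1899317]) then returns just those rows without loading the other 30 million.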

Related

NumPy hstack, vstack and dstack usage

I am trying to better understand hstack, vstack, and dstack in NumPy.
import numpy as np

a = np.arange(96).reshape(2,4,4,3)
print(a)
print("dimensions of a:", np.ndim(a))
print("Shape of a:", a.shape)
b = np.arange(201,225).reshape(2,4,3)
print("b:", b)
c = np.arange(101,133).reshape(2,4,4)
print(c)
print("dimensions of c:", np.ndim(c))
print("Shape of c:", c.shape)
a is:
[[[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
[[12 13 14]
[15 16 17]
[18 19 20]
[21 22 23]]
[[24 25 26]
[27 28 29]
[30 31 32]
[33 34 35]]
[[36 37 38]
[39 40 41]
[42 43 44]
[45 46 47]]]
[[[48 49 50]
[51 52 53]
[54 55 56]
[57 58 59]]
[[60 61 62]
[63 64 65]
[66 67 68]
[69 70 71]]
[[72 73 74]
[75 76 77]
[78 79 80]
[81 82 83]]
[[84 85 86]
[87 88 89]
[90 91 92]
[93 94 95]]]]
and c is:
[[[101 102 103 104]
[105 106 107 108]
[109 110 111 112]
[113 114 115 116]]
[[117 118 119 120]
[121 122 123 124]
[125 126 127 128]
[129 130 131 132]]]
and b is:
[[[201 202 203]
[204 205 206]
[207 208 209]
[210 211 212]]
[[213 214 215]
[216 217 218]
[219 220 221]
[222 223 224]]]
How do I reshape c so that I can use hstack correctly? I want to add one column for each row in each of the dimensions.
How do I reshape b so that I can use vstack correctly? I want to add one row for each column in each of the dimensions.
I would like to come up with a general rule on the dimensions to check for the array that needs to be added to an existing array.
You can concatenate to a (2,4,4,3) array a:
(1,4,4,3) on axis 0
(2,1,4,3) on axis 1
(2,4,1,3) on axis 2
(2,4,4,1) on axis 3
Read, and reread as needed, the np.concatenate docs; a quick check of these shapes follows below.
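Here is that check (my own example; the zero-filled arrays are just placeholders with the compatible shapes):
import numpy as np

a = np.arange(96).reshape(2, 4, 4, 3)
compatible = [(1, 4, 4, 3), (2, 1, 4, 3), (2, 4, 1, 3), (2, 4, 4, 1)]
for axis, shape in enumerate(compatible):
    extra = np.zeros(shape, dtype=a.dtype)   # placeholder with a compatible shape
    print(axis, np.concatenate((a, extra), axis=axis).shape)
# 0 (3, 4, 4, 3)
# 1 (2, 5, 4, 3)
# 2 (2, 4, 5, 3)
# 3 (2, 4, 4, 4)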
edit
In previous post(s) I've summarized the code of hstack and vstack, though you can easily read it via the [source] link in the official docs.
When should I use hstack/vstack vs append vs concatenate vs column_stack?
hstack makes sure all arguments are atleast_1d and does a concatenate on axis 0 or 1. vstack makes sure all are atleast_2d, and does a concatenate on axis 0.
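A small check of that description (not from the original answer), with two 2-D arrays:
import numpy as np

x = np.arange(6).reshape(2, 3)
y = np.arange(6, 12).reshape(2, 3)
# for inputs that are already at least 2-D, hstack/vstack reduce to plain concatenate calls
print(np.array_equal(np.hstack((x, y)), np.concatenate((x, y), axis=1)))   # True
print(np.array_equal(np.vstack((x, y)), np.concatenate((x, y), axis=0)))   # True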
Maybe I should have insisted on seeing your attempts and any errors (and attempts to understand the errors).
For adding c to a:
In [58]: np.hstack((a,c))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [58], in <cell line: 1>()
----> 1 np.hstack((a,c))
File <__array_function__ internals>:5, in hstack(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\core\shape_base.py:345, in hstack(tup)
343 return _nx.concatenate(arrs, 0)
344 else:
--> 345 return _nx.concatenate(arrs, 1)
File <__array_function__ internals>:5, in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 3 dimension(s)
Notice that the error was raised by concatenate and focuses on the number of dimensions: 4d versus 3d. The hstack wrapper did not change the inputs at all.
If I add a trailing dimension to c, I get:
In [62]: c[...,None].shape
Out[62]: (2, 4, 4, 1)
In [63]: np.concatenate((a, c[...,None]),axis=3).shape
Out[63]: (2, 4, 4, 4)
Similarly for b:
In [64]: np.concatenate((a, b[...,None,:]),axis=2).shape
Out[64]: (2, 4, 5, 3)
The hstack/vstack docs specify 2nd and 1st axis concatenate. But you want to use axis 2 or 3. So those 'stack' functions don't apply, do they?

'Invalid format specifier' error when using pandas describe() or head() in Google Colab

When I use the pandas describe() or head() method on a dataframe in Google Colab,
I get an 'Invalid format specifier' error.
However, when I run the same code in a Jupyter notebook, it works fine.
What could be the problem?
All the other methods (such as info()) work fine in both the Jupyter notebook and Colab:
train = pd.read_csv('./input/train_V2.csv')
train.describe()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/IPython/core/formatters.py in __call__(self, obj)
697 type_pprinters=self.type_printers,
698 deferred_pprinters=self.deferred_printers)
--> 699 printer.pretty(obj)
700 printer.flush()
701 return stream.getvalue()
17 frames
/usr/local/lib/python3.7/dist-packages/pandas/io/formats/format.py in <listcomp>(.0)
1427 [
1428 formatter(val) if not m else na_rep
-> 1429 for val, m in zip(values.ravel(), mask.ravel())
1430 ]
1431 ).reshape(values.shape)
ValueError: Invalid format specifier
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/IPython/core/formatters.py in __call__(self, obj)
332 pass
333 else:
--> 334 return printer(obj)
335 # Finally look for special method names
336 method = get_real_method(obj, self.print_method)
17 frames
/usr/local/lib/python3.7/dist-packages/pandas/io/formats/format.py in <listcomp>(.0)
1427 [
1428 formatter(val) if not m else na_rep
-> 1429 for val, m in zip(values.ravel(), mask.ravel())
1430 ]
1431 ).reshape(values.shape)
ValueError: Invalid format specifier
All the other methods work fine in both the Jupyter notebook and Colab:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
# Column Dtype
--- ------ -----
0 Id object
1 groupId object
2 matchId object
3 assists int8
4 boosts int8
5 damageDealt float32
6 DBNOs int8
7 headshotKills int8
8 heals int8
9 killPlace int8
10 killPoints int16
11 kills int8
12 killStreaks int8
13 longestKill float32
14 matchDuration int16
15 matchType object
16 maxPlace int8
17 numGroups int8
18 rankPoints int16
19 revives int8
20 rideDistance float32
21 roadKills int8
22 swimDistance float32
23 teamKills int8
24 vehicleDestroys int8
25 walkDistance float32
26 weaponsAcquired int16
27 winPoints int16
28 winPlacePerc float32
dtypes: float32(6), int16(5), int8(14), object(4)
memory usage: 339.3+ MB
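There is no answer in this excerpt, but since the formatting code in the traceback lives inside pandas itself, a reasonable first step is to compare the pandas versions of the two environments; if they differ, pinning Colab to the version that works locally is one thing to try (a diagnostic sketch, not a confirmed fix):
import sys
import pandas as pd

print(sys.version)        # run this in both Colab and the local Jupyter notebook
print(pd.__version__)

# If the versions differ, install the locally working release in Colab and
# restart the runtime, e.g.:
#   !pip install pandas==<version that works locally>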

Pandas: How to Sort a Range of Columns in a Dataframe?

I have a Pandas dataframe that I need to sort by the data columns' maximum values. I am having trouble performing the sort because all of the sorting examples that I have found operate on all of the columns in the dataframe when performing the sort. In this case I need to sort only a subset of the columns. The first column contains a date, and the remaining 90 columns contain data. The 90 data columns are currently sorted alphabetically by their column name. I would like to sort them in decreasing order of their maximum value, which happens to be in the last row.
In the bigger scheme of things, this question is about how to perform sorting on a range of columns within a dataframe, rather than sorting all of the columns in the dataframe. There may be cases, for example, where I need to sort only columns 2 through 12 of a dataframe, while leaving the remaining columns in their existing order.
Here is a sample of the unsorted dataframe:
df.tail()
Date ADAMS ALLEN BARTHOLOMEW BENTON BLACKFORD BOONE BROWN ... WABASH WARREN WARRICK WASHINGTON WAYNE WELLS WHITE WHITLEY
65 2020-05-10 8 828 356 13 14 227 28 ... 64 12 123 48 53 11 149 22
66 2020-05-11 8 860 367 16 14 235 28 ... 67 12 126 48 56 12 161 23
67 2020-05-12 8 872 371 17 14 235 28 ... 67 12 131 49 56 12 162 23
68 2020-05-13 9 897 382 17 14 249 29 ... 68 12 140 50 58 13 164 27
69 2020-05-14 9 955 394 21 14 252 29 ... 69 12 145 50 60 15 164 28
I would like to perform the sort so that the column with the largest value in row 69 is placed after df['Date'], with the columns ordered so that the values in row 69 decrease from left to right. Once that is done, I'd like to create a series containing the column headers, to generate a rank list. Using the visible columns as an example, the desired list would be:
rank_list=[ "ALLEN", "BARTHOLOMEW", "BOONE", "WHITE", "WARRICK", ... "BLACKFORD", "WARREN", "ADAMS" ]
My biggest hurdle at present is that when I perform the sort I'm not able to exclude the Date column, and I'm receiving a type error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
I am new to Pandas, so I apologize if there is a solution to this problem that should be obvious. Thanks.
You can do it this way using sort_values, once you have selected the right row and the range of columns:
import numpy as np
import pandas as pd

# data sample
np.random.seed(86)
df = pd.DataFrame({'date': pd.date_range('2020-05-15', periods=5),
                   'a': np.random.randint(0, 50, 5),
                   'b': np.random.randint(0, 50, 5),
                   'c': np.random.randint(0, 50, 5),
                   'd': np.random.randint(0, 50, 5)})

# parameters
start_idx = 1                    # note: indexing starts at 0, so 1 is the second column
end_idx = df.shape[1]            # up to the last column
row_position = df.shape[0] - 1   # the last row

# create the new column order
new_col_order = df.columns.tolist()
new_col_order[start_idx:end_idx] = df.iloc[row_position, start_idx:end_idx]\
                                     .sort_values(ascending=False).index

# reorder
df = df[new_col_order]
print(df)
date c a d b
0 2020-05-15 30 20 44 40
1 2020-05-16 45 32 29 9
2 2020-05-17 17 44 14 27
3 2020-05-18 13 28 4 41
4 2020-05-19 41 35 14 12 #as you can see, the columns are now c, a, d, b
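The question also asked for a rank list of column names; with this approach it can simply be read off the reordered columns (this bit is my addition, not part of the answer above):
rank_list = df.columns[1:].tolist()   # skip the date column
print(rank_list)                      # ['c', 'a', 'd', 'b'] for the sample data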
I suggest the following:
import pandas as pd

# initialize the provided sample data frame
df = pd.DataFrame([['65 2020-05-10', 8, 828, 356, 13, 14, 227, 28, 64, 12, 123, 48, 53, 11, 149, 22],
                   ['66 2020-05-11', 8, 860, 367, 16, 14, 235, 28, 67, 12, 126, 48, 56, 12, 161, 23],
                   ['67 2020-05-12', 8, 872, 371, 17, 14, 235, 28, 67, 12, 131, 49, 56, 12, 162, 23],
                   ['68 2020-05-13', 9, 897, 382, 17, 14, 249, 29, 68, 12, 140, 50, 58, 13, 164, 27],
                   ['69 2020-05-14', 9, 955, 394, 21, 14, 252, 29, 69, 12, 145, 50, 60, 15, 164, 28]],
                  columns=['Date', 'ADAMS', 'ALLEN', 'BARTHOLOMEW', 'BENTON', 'BLACKFORD', 'BOONE', 'BROWN',
                           'WABASH', 'WARREN', 'WARRICK', 'WASHINGTON', 'WAYNE', 'WELLS', 'WHITE', 'WHITLEY'])
# a list of tuples in the form (column_name, max_value)
column_max_list = [(column, df[column].max()) for column in df.columns.values[1:]]
# sort the list descending by the max value
column_max_list_sorted = sorted(column_max_list, key = lambda tup: tup[1], reverse = True)
# extract only the column names
rank_list = [tup[0] for tup in column_max_list_sorted]
for i in range(len(rank_list)):
    # get the column to insert next
    col = df[rank_list[i]]
    # drop the column to be inserted back
    df.drop(columns=[rank_list[i]], inplace=True)
    # insert the column at the correct index
    df.insert(loc=i + 1, column=rank_list[i], value=col)
This yields the desired rank_list
['ALLEN', 'BARTHOLOMEW', 'BOONE', 'WHITE', 'WARRICK', 'WABASH', 'WAYNE', 'WASHINGTON', 'BROWN', 'WHITLEY', 'BENTON', 'WELLS', 'BLACKFORD', 'WARREN', 'ADAMS']
as well as the desired df:
Date ALLEN BARTHOLOMEW BOONE WHITE ...
0 65 2020-05-10 828 356 227 149 ...
1 66 2020-05-11 860 367 235 161 ...
2 67 2020-05-12 872 371 235 162 ...
3 68 2020-05-13 897 382 249 164 ...
4 69 2020-05-14 955 394 252 164 ...
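As a design note (my addition), the drop/insert loop above can also be replaced by a single column selection, which avoids mutating the frame column by column:
df = df[['Date'] + rank_list]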

Input contains infinity or a value too large for dtype('float64')

I've seen many similar questions here, but none of the answers solved my problem.
I am trying to apply a power transform to my dataset, but I still get this error.
The dataset does not contain inf or nan values, and I made sure that no value is greater than float64.max. I also tried reindexing the dataframe beforehand.
import numpy as np
from sklearn.preprocessing import PowerTransformer

features_training = features_training.astype(np.float64)
target_training = target_training.astype(np.float64)
features_test = features_test.astype(np.float64)
target_test = target_test.astype(np.float64)
print(np.where(features_training.values >= np.finfo(np.float64).max))
print(np.where(features_test.values >= np.finfo(np.float64).max))
print(np.where(target_training.values >= np.finfo(np.float64).max))
print(np.where(target_test.values >= np.finfo(np.float64).max))
print(np.isnan(features_training.values).any())
print(np.isnan(features_test.values).any())
print(np.isnan(target_training.values).any())
print(np.isnan(target_test.values).any())
print(np.isinf(features_training.values).any())
print(np.isinf(features_test.values).any())
print(np.isinf(target_training.values).any())
print(np.isinf(target_test.values).any())
pt_X = PowerTransformer().fit(features_training)
pt_Y = PowerTransformer().fit(np.asarray(target_training).reshape(-1,1))
features_training = pt_X.transform(features_training)
target_training = pt_Y.transform(np.asarray(target_training).reshape(-1,1))
features_test = pt_X.transform(features_test)
target_test = pt_Y.transform(np.asarray(target_test).reshape(-1,1))
Using dataframe.info()
features training
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Columns: 138 entries
dtypes: float64(138)
memory usage: 545.6 KB
None
target training
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 1 columns):
506 non-null float64
dtypes: float64(1)
memory usage: 4.0 KB
None
features test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Columns: 138 entries
dtypes: float64(138)
memory usage: 519.7 KB
None
target test
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 1 columns):
482 non-null float64
dtypes: float64(1)
memory usage: 3.8 KB
None
Error traceback
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-6ca93dd1855a> in <module>
21 # features already normalized. Target remains the same
22 features_training, features_test, target_training, target_test, ptX_, pt_Y = normalization(features_training, features_test,
---> 23 target_training, target_test)
24
25 model.fit(features_training, target_training)
<ipython-input-99-9199a48b9d30> in normalization(features_training, features_test, target_training, target_test)
47 target_training = pt_Y.transform(np.asarray(target_training).reshape(-1,1))
48
---> 49 features_test = pt_X.transform(features_test)
50 target_test = pt_Y.transform(np.asarray(target_test).reshape(-1,1))
51
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\preprocessing\data.py in transform(self, X)
2731
2732 if self.standardize:
-> 2733 X = self._scaler.transform(X)
2734
2735 return X
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\preprocessing\data.py in transform(self, X, copy)
756 X = check_array(X, accept_sparse='csr', copy=copy,
757 estimator=self, dtype=FLOAT_DTYPES,
--> 758 force_all_finite='allow-nan')
759
760 if sparse.issparse(X):
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
540 if force_all_finite:
541 _assert_all_finite(array,
--> 542 allow_nan=force_all_finite == 'allow-nan')
543
544 if ensure_min_samples > 0:
~\AppData\Local\Continuum\anaconda2\envs\env36\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57 # for object dtype data, we only check for NaNs (GH-13254)
58 elif X.dtype == np.dtype('object') and not allow_nan:
ValueError: Input contains infinity or a value too large for dtype('float64').
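There is no answer in this excerpt, but one way to narrow the problem down (a sketch, not a confirmed fix): the traceback fails inside PowerTransformer's internal StandardScaler step, so the power transform itself may be producing non-finite values on test rows that lie outside the training range. Fitting with standardize=False exposes the raw transformed values, reusing the question's features_training and features_test:
import numpy as np
from sklearn.preprocessing import PowerTransformer

# fit without the internal standardization so the raw transformed values can be inspected
pt_raw = PowerTransformer(standardize=False).fit(features_training)
raw_test = pt_raw.transform(features_test)

bad_cols = np.where(~np.isfinite(raw_test).all(axis=0))[0]
print("columns producing non-finite values:", bad_cols)
print("fitted lambdas for those columns:", pt_raw.lambdas_[bad_cols])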

How do I read tabulator separated CSV in blaze?

I have a "CSV" data file with the following format (well, it's rather a TSV):
event pdg x y z t px py pz ekin
3383 11 -161.515 5.01938e-05 -0.000187112 0.195413 0.664065 0.126078 -0.736968 0.00723234
1694 11 -161.515 -0.000355633 0.000263174 0.195413 0.511853 -0.523429 0.681196 0.00472714
4228 11 -161.535 6.59631e-06 -3.32796e-05 0.194947 -0.713983 -0.0265468 -0.69966 0.0108681
4233 11 -161.515 -0.000524488 6.5069e-05 0.195413 0.942642 0.331324 0.0406377 0.017594
This file is interpretable as-is in pandas:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False) # Works
data = read_table("test.csv", index_col=False) # Works
However, when I try to read it in blaze (which declares that it accepts pandas keyword arguments), an exception is thrown:
from blaze import Data
Data("test.csv") # Attempt 1
Data("test.csv", sep="\t") # Attempt 2
Data("test.csv", sep="\t", index_col=False) # Attempt 3
None of these works, and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff() from the standard library (which fails).
Is there a way to properly read this file in blaze (given that its "little brother" has a few hundred MB, I want to use blaze's sequential processing capabilities)?
Thanks for any ideas.
Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327
Edit2:
Complete error:
Error                                     Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
54 if isinstance(data, _strtypes):
55 data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56 **kwargs)
57 if (isinstance(data, Iterator) and
58 not isinstance(data, tuple(not_an_iterator))):
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
62
63 def __call__(self, s, *args, **kwargs):
---> 64 return self.dispatch(s)(s, *args, **kwargs)
65
66     @property
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
277 def resource_csv(uri, **kwargs):
--> 278 return CSV(uri, **kwargs)
279
280
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
102 if has_header is None:
103 self.has_header = (not os.path.exists(path) or
--> 104 infer_header(path, sniff_nbytes))
105 else:
106 self.has_header = has_header
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
58 with open_file(path, 'rb') as f:
59 raw = f.read(nbytes)
---> 60 return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
61
62
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
392 # subtracting from the likelihood of the first row being a header.
393
--> 394 rdr = reader(StringIO(sample), self.sniff(sample))
395
396 header = next(rdr) # assume first row is header
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
187
188 if not delimiter:
--> 189 raise Error("Could not determine delimiter")
190
191 class dialect(Dialect):
Error: Could not determine delimiter
I am working with Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.
conda install dask
conda install blaze
Here is a way you can import the data for use with blaze: parse the data first with pandas and then convert it into blaze. Perhaps this defeats the purpose, but there is no trouble this way.
As a side note, in order to parse the data file correctly, your pandas parse statement should be:
from blaze import Data
from pandas import DataFrame, read_csv
data = read_csv("csvdata.dat", sep="\s*", index_col=False)
bdata = Data(data)
Now the data is formatted correctly with no errors, bdata:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
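Since the question mentions wanting sequential processing of a file of a few hundred MB, a plain pandas chunked read is another possibility that avoids the sniffer entirely (a sketch; the file name and chunk size are placeholders):
from pandas import read_csv

for chunk in read_csv("csvdata.dat", sep="\t", index_col=False, chunksize=100000):
    # each chunk is an ordinary DataFrame; filter or aggregate it here
    print(len(chunk))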
Here is an alternative: use dask. It can probably do the same chunking or large-scale processing you are looking for, and it certainly makes it easy to load a TSV correctly.
In [17]: import dask.dataframe as dd
In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)
In [19]: df.head()
Out[19]:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
4 854 11 -161.515 0.000032 0.000418 0.195414 0.675752 0.315671
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
3 0.040638 0.017594
4 -0.666116 0.012641
See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask