WARNING: Exhausted all input expected for the current sequence while reading a floating point value at offset - cntk

I'm trying this CNTK example, but it fails when training on the data.
It seems the reader exhausted all input, and that is somehow treated as an error.
I have no idea what is going wrong. Can anyone help?
Error
WARNING: Exhausted all input expected for the current sequence while reading a floating point value at offset 790 in the input file (TrainData.txt)
[CALL STACK] > Microsoft::MSR::CNTK::IDataReader:: SupportsDistributedMBRead
- Microsoft::MSR::CNTK::IDataReader:: SupportsDistributedMBRead - Microsoft::MSR::CNTK::IDataReader:: SupportsDistributedMBRead
- Microsoft::MSR::CNTK::IDataReader:: SupportsDistributedMBRead - Microsoft::MSR::CNTK::IDataReader:: SupportsDistributedMBRead
- CreateDeserializer - CreateDeserializer
- CreateDeserializer - CreateDeserializer
- CreateDeserializer - CreateDeserializer
- CreateDeserializer - CreateDeserializer
- CreateDeserializer - CreateDeserializer
- CreateDeserializer
EXCEPTION occurred: Reached the maximum number of allowed errors while reading the input file (TrainData.txt).
TestData.txt
|features 1.0 1.0 |labels 1 0 0
|features 3.0 9.0 |labels 1 0 0
|features 8.0 8.0 |labels 1 0 0
|features 3.0 4.0 |labels 0 1 0
|features 5.0 6.0 |labels 0 1 0
|features 3.0 6.0 |labels 0 1 0
|features 8.0 3.0 |labels 0 0 1
|features 8.0 1.0 |labels 0 0 1
|features 9.0 2.0 |labels 0 0 1
TrainData.txt
|features 1.0 5.0 |labels 1 0 0
|features 1.0 2.0 |labels 1 0 0
|features 3.0 8.0 |labels 1 0 0
|features 4.0 1.0 |labels 1 0 0
|features 5.0 8.0 |labels 1 0 0
|features 6.0 3.0 |labels 1 0 0
|features 7.0 5.0 |labels 1 0 0
|features 7.0 6.0 |labels 1 0 0
|features 1.0 4.0 |labels 1 0 0
|features 2.0 7.0 |labels 1 0 0
|features 2.0 1.0 |labels 1 0 0
|features 3.0 1.0 |labels 1 0 0
|features 5.0 2.0 |labels 1 0 0
|features 6.0 7.0 |labels 1 0 0
|features 7.0 4.0 |labels 1 0 0
|features 3.0 5.0 |labels 0 1 0
|features 4.0 4.0 |labels 0 1 0
|features 5.0 5.0 |labels 0 1 0
|features 4.0 6.0 |labels 0 1 0
|features 4.0 5.0 |labels 0 1 0
|features 6.0 1.0 |labels 0 0 1
|features 7.0 1.0 |labels 0 0 1
|features 8.0 2.0 |labels 0 0 1
|features 7.0 2.0 |labels 0 0 1

Got an answer from someone offline.
Each well-formed line must end with either a line feed (\n) or a carriage return plus line feed (\r\n), including the last line of the file.
This is a requirement of the CNTK TextFormat reader.
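As a minimal sketch of that fix (the helper function name is my own, not part of CNTK), you can check a data file and append the missing trailing newline before handing it to the TextFormat reader:

```python
# Hypothetical helper: make sure a CNTK CTF input file (e.g. TrainData.txt)
# ends with a newline, since the TextFormat reader requires one on every
# line, including the last.
def ensure_trailing_newline(path):
    with open(path, "rb") as f:
        data = f.read()
    if data and not data.endswith(b"\n"):
        with open(path, "ab") as f:
            f.write(b"\n")
        return True   # file was patched
    return False      # file was already well-formed
```

Running this once over TrainData.txt and TestData.txt should make the "Exhausted all input" warning disappear, assuming the data itself is otherwise valid.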

Related

Based on some rules, how to expand data in Pandas?

Please forgive my English; I hope I can explain this clearly.
Assume we have this data:
>>> data = {'Span':[3,3.5], 'Low':[6.2,5.16], 'Medium':[4.93,4.1], 'High':[3.68,3.07], 'VeryHigh':[2.94,2.45], 'ExtraHigh':[2.48,2.06], '0.9':[4.9,3.61], '1.5':[3.23,2.38], '2':[2.51,1.85]}
>>> df = pd.DataFrame(data)
>>> df
Span Low Medium High VeryHigh ExtraHigh 0.9 1.5 2
0 3.0 6.20 4.93 3.68 2.94 2.48 4.90 3.23 2.51
1 3.5 5.16 4.10 3.07 2.45 2.06 3.61 2.38 1.85
I want to get this data:
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
The principles that apply to df:
Span expands by the combination of Wind and Snow to get the MaxSpacing.
Wind and Snow are mutually exclusive: when Wind is one of 'Low', 'Medium', 'High', 'VeryHigh', 'ExtraHigh', Snow is zero; when Snow is one of 0.9, 1.5, 2, Wind is zero.
Please help. Thank you.
Use DataFrame.melt to unpivot, then sort by the original indices. Create the Snow column with to_numeric and Series.fillna inside DataFrame.insert, and finally set Wind to 0 for the numeric rows:
df = (df.melt('Span', ignore_index=False, var_name='Wind', value_name='MaxSpacing')
.sort_index(ignore_index=True))
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print(df)
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
Alternative solution with DataFrame.set_index and DataFrame.stack:
df = df.set_index('Span').rename_axis('Wind', axis=1).stack().reset_index(name='MaxSpacing')
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print(df)
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85

Pandas - create new column with the sum of last N values of another column

I have this df:
round_id team opponent home_dummy GC GP P
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3
11 14.0 Flamengo Sport 1.0 0.0 3.0 3
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0
Now I'd like to add a column 'last_5', which consists of the sum of the last 5 'P' values, ending up with:
rodada_id clube opponent home_dummy GC GP P last_5
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1 0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3 2
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3 5
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3 8
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3 11
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0 13
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3 12
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3 12
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3 12
11 14.0 Flamengo Sport 1.0 0.0 3.0 3 12
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3 12
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1 15
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3 13
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1 11
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0 8
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1 8
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0 6
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0 5
Please note that up to index 4 (N=5), the sum has to cover only the last 1, 2, 3 or 4 rows.
I have tried:
N = 5
df = df.groupby(df.P // N).sum()
But this does not work.
Let us try a rolling sum, shifted by one row so the current row is excluded:
df['Last_5'] = df.P.rolling(5,min_periods=1).sum().shift().fillna(0)
Out[9]:
0 0.0
1 0.0
2 1.0
3 2.0
4 5.0
5 8.0
6 11.0
7 13.0
8 12.0
9 12.0
10 12.0
11 12.0
12 12.0
13 15.0
14 13.0
15 13.0
16 11.0
17 8.0
18 6.0
19 5.0
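To see why the shift is needed: rolling(5).sum() includes the current row in each window, so shifting by one turns it into the sum of the previous (up to five) values. A small demo on the first few P values:

```python
import pandas as pd

s = pd.Series([0, 1, 1, 3, 3, 3])            # first few values of P
rolled = s.rolling(5, min_periods=1).sum()   # window includes the current row
last_5 = rolled.shift().fillna(0)            # shift it out; first row has no history
print(last_5.tolist())  # [0.0, 0.0, 1.0, 2.0, 5.0, 8.0]
```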

Pandas - create new column with the sum of last N values of another column with groupby

I have this df:
round_id team opponent home_dummy GC GP P
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3
11 14.0 Flamengo Sport 1.0 0.0 3.0 3
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0
and I've added a new column to it, holding the sum of the last N values of another column, like so:
df['Last_5'] = df.P.rolling(5,min_periods=1).sum().shift().fillna(0)
which gives me:
round_id team opponent home_dummy GC GP P last_5
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0 0 0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0 1 0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0 1 1
3 6.0 Flamengo Santos 0.0 0.0 1.0 3 2
4 7.0 Flamengo Bahia 0.0 3.0 5.0 3 5
5 8.0 Flamengo Fortaleza 1.0 1.0 2.0 3 8
6 9.0 Flamengo Fluminense 0.0 1.0 2.0 3 11
7 10.0 Flamengo Ceará 0.0 2.0 0.0 0 13
8 3.0 Flamengo Coritiba 0.0 0.0 1.0 3 12
9 11.0 Flamengo Goiás 1.0 1.0 2.0 3 12
10 13.0 Flamengo Athlético-PR 1.0 1.0 3.0 3 12
11 14.0 Flamengo Sport 1.0 0.0 3.0 3 12
12 15.0 Flamengo Vasco 0.0 1.0 2.0 3 12
13 16.0 Flamengo Bragantino 1.0 1.0 1.0 1 15
14 17.0 Flamengo Corinthians 0.0 1.0 5.0 3 13
15 18.0 Flamengo Internacional 0.0 2.0 2.0 1 11
16 19.0 Flamengo São Paulo 1.0 4.0 1.0 0 8
17 12.0 Flamengo Palmeiras 0.0 1.0 1.0 1 8
18 2.0 Flamengo Atlético-GO 0.0 3.0 0.0 0 6
19 20.0 Flamengo Atlético-MG 0.0 4.0 0.0 0 5
But let's say I have many teams in the same dataframe:
round_id team opponent home_dummy GC GP /
0 1.0 Flamengo Atlético-MG 1.0 1.0 0.0
1 4.0 Flamengo Grêmio 1.0 1.0 1.0
2 5.0 Flamengo Botafogo 1.0 1.0 1.0
3 6.0 Flamengo Santos 0.0 0.0 1.0
4 7.0 Flamengo Bahia 0.0 3.0 5.0
.. ... ... ... ... ... ...
395 15.0 Atlético-GO Bragantino 1.0 1.0 2.0
396 16.0 Atlético-GO Santos 0.0 0.0 1.0
397 17.0 Atlético-GO Athlético-PR 1.0 1.0 1.0
398 9.0 Atlético-GO Vasco 0.0 1.0 2.0
399 20.0 Atlético-GO Corinthians 1.0 1.0 1.0
How do I apply the same calculation and achieve the same result per team, without the last-N windows overlapping between teams?
Add a groupby:
df['Last_5'] = df.groupby('team').P.apply(lambda x : x.rolling(5,min_periods=1).sum().shift().fillna(0))
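An equivalent formulation (my suggestion, not from the answer above) uses groupby(...).transform, which guarantees the result is aligned back to the original index; the two-team toy data below is made up for illustration:

```python
import pandas as pd

# Toy stand-in for the question's dataframe: two teams, a few P values each.
df = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "P":    [0, 1, 3, 3, 1],
})
# The rolling-sum-then-shift runs independently within each team.
df["Last_5"] = (
    df.groupby("team")["P"]
      .transform(lambda x: x.rolling(5, min_periods=1).sum().shift().fillna(0))
)
print(df["Last_5"].tolist())  # [0.0, 0.0, 1.0, 0.0, 3.0]
```

Note how the window resets at the team boundary: the first row of team B gets 0, not a carry-over from team A.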

Pandas calculate difference on two DataFrames with column and multi-indices

I have 2 DataFrames, df1 is:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 10.0 1.7 3.7 0.0
1 Jack 10.0 0.0 2.8 3.5
2 Fox 10.0 1.7 0.0 0.0
3 Rex 1.0 0.0 3.0 4.2
the second DataFrame - df2 is:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 1.7 2.0 0.0
1 Jack 6.0 0.0 0.8 3.5
2 Fox 8.0 5.0 0.0 0.0
3 Rex 1.0 0.0 1.0 4.2
4 Snack 3.1 9.0 2.8 4.4
5 Yosee 4.3 0.0 0.0 4.3
6 Petty 0.5 1.3 2.8 3.5
7 Lind 3.6 7.5 2.8 4.3
8 Korr 0.6 1.5 1.8 2.3
Result is df3:
ID Name Jan17 Jun18 Dec18 Apr19
0 Nick 5.0 0 1.7 0
1 Jack 4.0 0 2.0 0
2 Fox 2.0 -3.3 0 0
3 Rex 0 0 2.0 0
How can I calculate the differences between columns in df1 and df2, aligned on the multi-index [ID, Name] of df1, and save the result to df3?
I'd appreciate any ideas. Thanks!
Just subtract; subtraction is aligned on the index. You can reindex df2 before subtracting to avoid NaNs for rows that exist only in df2:
# df1 - df2.reindex(df1.index)
df1.sub(df2.reindex(df1.index))
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 0.0 1.7 0.0
1 Jack 4.0 0.0 2.0 0.0
2 Fox 2.0 -3.3 0.0 0.0
3 Rex 0.0 0.0 2.0 0.0
Note that the reason I went for reindex over loc is to avoid a KeyError if any of df1's index values are missing from df2.
In that case, the subtraction above would produce NaNs, so you can pass fill_value to reindex to substitute 0 for the missing rows (so df1's value is returned rather than NaN):
df2.reindex(df1.index, fill_value=0)
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 1.7 2.0 0.0
1 Jack 6.0 0.0 0.8 3.5
2 Fox 8.0 5.0 0.0 0.0
3 Rex 1.0 0.0 1.0 4.2
You can simply do
df1-df2.loc[df1.index]
Output:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 0.0 1.7 0.0
1 Jack 4.0 0.0 2.0 0.0
2 Fox 2.0 -3.3 0.0 0.0
3 Rex 0.0 0.0 2.0 0.0
Or try something different, using align:
sum(df1.align(0-df2,join='left'))
Out[282]:
Jan17 Jun18 Dec18 Apr19
ID Name
0 Nick 5.0 0.0 1.7 0.0
1 Jack 4.0 0.0 2.0 0.0
2 Fox 2.0 -3.3 0.0 0.0
3 Rex 0.0 0.0 2.0 0.0

Logistic regression with pandas and sklearn: Input contains NaN, infinity or a value too large for dtype('float64')

I want to run the following model (logistic regression) for the pandas data frame I read.
However, when it reaches the predict call, it fails with: "Input contains NaN, infinity or a value too large for dtype('float64')"
My code is (note that there should be 10 numerical and 4 categorical variables):
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
The data frame info is as follows (print(heart.info()):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
Do anyone know what I am missing here?
Thanks in advance!!
I guess the reason for this error is how you parse the data:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
NOTE: pay attention to the very last row, which is all NaNs.
Try doing it this simplified way instead:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep='\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
Also pay attention to the automatically parsed (inferred) dtypes: pd.read_csv() does all the necessary conversions for you:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
I would suspect it was the train_test_split step.
I suggest turning your X and y into NumPy arrays to avoid this problem; that usually solves it:
X = heart.loc[:, 'age':'thal'].to_numpy()  # as_matrix() was removed in newer pandas
y = heart.loc[:, 'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # pass these as keywords, not positionally
and then fit with
clf.fit(X_train, y_train)
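For what it's worth, the question's import of sklearn.cross_validation no longer works in current scikit-learn; train_test_split now lives in sklearn.model_selection. A self-contained sketch of the same pipeline on synthetic stand-in data (the shapes and features are made up, not the heart dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # cross_validation was removed

# Synthetic stand-in for the heart data: 100 rows, 5 numeric features,
# binary target derived from the first two features.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.predict(X_test).shape)  # (30,)
```

Because X and y here are plain NumPy arrays with no NaNs, the predict step runs without the dtype error from the question.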