How to understand an IndexError message? - dataframe

I am attempting to use the pretty-confusion-matrix library to create a confusion matrix.
When I run pp_matrix(df, cmap=cmap), I get the following error message:
Traceback (most recent call last):
File "/Users/name/folder/subfolder/subsubfolder/prepositions.py", line 27, in <module>
pp_matrix(df, cmap=cmap)
File "/opt/anaconda3/lib/python3.8/site-packages/pretty_confusion_matrix/pretty_confusion_matrix.py", line 222, in pp_matrix
txt_res = configcell_text_and_colors(
File "/opt/anaconda3/lib/python3.8/site-packages/pretty_confusion_matrix/pretty_confusion_matrix.py", line 59, in configcell_text_and_colors
tot_rig = array_df[col][col]
IndexError: index 37 is out of bounds for axis 0 with size 37
The first few lines of my DataFrame look like this:
in auf mit zu ... an-entlang auf-entlang ohne außerhalb
into 318 8 10 9 ... 0 0 0 0
in 4325 727 681 62 ... 0 0 0 0
on 253 3197 215 46 ... 0 0 0 0
at 206 280 54 9 ... 0 0 0 0
with 384 397 2097 24 ... 0 0 0 0
by 31 12 23 0 ... 0 0 0 0
in-front-of 15 15 25 0 ... 0 0 0 0
The total size is 49 rows x 36 columns.
I think the issue has to do with zero-indexing in Python, but to be honest, I'm not even sure how to go about debugging this.
Any suggestions would be greatly appreciated. Thank you in advance.
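Judging from the traceback, configcell_text_and_colors indexes the diagonal (array_df[col][col]), which only works if the matrix is square; with 49 rows and 36 columns the column index eventually runs past the row axis. A minimal sketch of squaring the frame before calling pp_matrix, using invented labels and counts in place of the real data:

```python
import pandas as pd

# hypothetical 3x2 frame standing in for the real 49x36 one
df = pd.DataFrame([[5, 1], [2, 7], [0, 3]],
                  index=['into', 'in', 'on'],
                  columns=['in', 'auf'])

# a confusion matrix must be square with identical row and column
# labels: take the union of both label sets and reindex both axes,
# filling the missing cells with 0
labels = df.index.union(df.columns)
square = df.reindex(index=labels, columns=labels, fill_value=0)
print(square.shape)  # (4, 4)
```

After this, every label appears on both axes and the diagonal lookup cannot go out of bounds.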

Related

How to use a list of categories that an example belongs to as a feature when solving a classification problem?

One of features looks like this:
1 170,169,205,174,173,246,247,249,380,377,383,38...
2 448,104,239,277,276,99,154,155,76,412,139,333,...
3 268,422,419,124,1,17,431,343,341,435,130,331,5...
4 50,53,449,106,279,420,161,74,123,364,231,18,23...
5 170,169,205,174,173,246,247,249,380,377,383,38...
It tells us what categories the example belongs to.
How should I use it while solving a classification problem?
I've tried to use dummy variables,
df=df.join(features['cat'].str.get_dummies(',').add_prefix('contains_'))
but we don't know whether there are some other categories that were not mentioned in the training set, so I do not know how to preprocess all the objects.
That's interesting. I didn't know str.get_dummies, but maybe I can help you with the rest.
You basically have two problems:
The set of categories you get later may contain categories that were unknown while training the model. You have to filter these out.
The set of categories you get later may not contain all the categories seen during training. You have to make sure you generate dummies for them as well.
Problem 1: filtering out unknown/unwanted categories
The first problem is easy to solve:
# create a set of all categories, you want to allow
# either define it as a fixed set, or extract it from your
# column like this (the output of the map is actually irrelevant)
# the result will be in valid_categories
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# now if you want to normalize your data before you do the
# dummy encoding, you can cleanse the data by
# splitting it, creating an intersection and then joining
# it back again to get a string on which you can work with
# str.get_dummies
df['categories']= df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',')
Problem 2: generating dummies for all known categories
The second problem can be solved by adding a dummy row that
contains all categories (e.g. with df.append) just before you
call get_dummies, and removing it right after get_dummies.
# e.g. you can do it like this
# get a new index value to
# be able to remove the row later
# (this only works if you have
# a numeric index)
dummy_index= df.index.max()+1
# assign the categories
#
df.loc[dummy_index]= {'id':999, 'categories': ','.join(valid_categories)}
# now do the processing steps
# mentioned in the section above
# then create the dummies
# after that remove the dummy line
# again
df.drop(labels=[dummy_index], inplace=True)
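Put together, a sketch of the whole dummy-row round trip on invented data (note that df.append is deprecated in newer pandas, so .loc assignment is used for the helper row):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'categories': ['a,b', 'b,c']})
valid_categories = {'a', 'b', 'c', 'd'}   # 'd' never occurs in df

# temporary row containing every known category
dummy_index = df.index.max() + 1
df.loc[dummy_index] = {'id': 999, 'categories': ','.join(valid_categories)}

# the dummies now get a column for every known category, including 'd'
dummies = df['categories'].str.get_dummies(',')

# drop the helper row from both frames again
df.drop(labels=[dummy_index], inplace=True)
dummies = dummies.drop(labels=[dummy_index])
print(sorted(dummies.columns))  # ['a', 'b', 'c', 'd']
```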
Example:
import io
import pandas as pd
raw= """id categories
1 170,169,205,174,173,246,247
2 448,104,239,277,276,99,154
3 268,422,419,124,1,17,431,343
4 50,53,449,106,279,420,161,74
5 170,169,205,174,173,246,247"""
df= pd.read_fwf(io.StringIO(raw))
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# remove 154 and 170 for demonstration purposes
valid_categories.remove('170')
valid_categories.remove('154')
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',').str.get_dummies(',')
Out[622]:
1 104 106 124 161 169 17 173 174 205 239 246 247 268 276 277 279 343 419 420 422 431 448 449 50 53 74 99
0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1
2 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0
4 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
You can see that there are no columns for 154 and 170.
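As an alternative to the dummy-row trick, both problems can be handled in one step by reindexing the dummy columns against the full category set: unknown categories are dropped and known-but-unseen ones are added as all-zero columns. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'categories': ['170,169,205', '448,104,239', '170,154,777']})
# '999' is known but unseen; '777' occurs in the data but is not valid
valid_categories = {'104', '154', '169', '170', '205', '239', '448', '999'}

# get_dummies produces one column per category seen in the data;
# reindex then drops unknown columns ('777') and adds all-zero
# columns for known-but-unseen categories ('999')
dummies = (df['categories'].str.get_dummies(',')
           .reindex(columns=sorted(valid_categories), fill_value=0))
print(list(dummies.columns))
```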

Select nearest values around a certain value in column

I have the following df and I need to find all values in column value that are equal or nearest to 660.
In detail this means I must somehow iterate through column value to find all these 660-or-nearest values. The values in column value run from 1 up to some end (the end varies), and when the end is reached they start again from 1. Finally, I must select the entire row (all other columns) wherever value == 660 (or nearest). I also have a 'helper' column helper, which keeps the same value throughout one 1-to-end range (it is always 0 or 1); it might be useful for getting the result. Here is the df example:
helper value
0 1
.
.
.
0 647
0 649
0 652
0 654
0 656
0 659
0 661
0 663
0 665
0 667
0 669
0 672
0 674
0 676
0 678
0 681
.
.
.
0 1000
1 1
.
.
.
1 647
1 649
1 652
1 654
1 656
1 659
1 661
1 663
1 665
1 667
1 669
1 672
1 674
1 676
1 678
1 681
.
.
1 1500
0 1
.
.
.
0 645
0 647
0 650
0 652
0 654
0 656
0 658
0 661
0 663
0 665
0 667
0 669
0 672
0 674
0 676
0 679
.
.
.
0 980
Thanks for any help or hints!
There is not enough info for a complete answer. What is the size of the dataframe? Also, what do you mean by nearest? The "where value == 660" part can be done by applying a boolean mask to the pandas series.
Something like df = df[df['value'] == 660] should do.
If you have a range of numbers instead, you can combine conditions for the mask (note the parentheses: & binds more tightly than the comparisons):
mask = (df.value > 650) & (df.value < 670)
df = df[mask]
Let me know if this is what you are looking for
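Assuming "nearest" means the single closest value to 660 within each 1-to-end run, one sketch (on a shortened, made-up frame) is to derive a run id from the points where value resets, then take the index of the minimal absolute difference per run:

```python
import pandas as pd

df = pd.DataFrame({
    'helper': [0, 0, 0, 1, 1, 1, 0, 0, 0],
    'value':  [1, 659, 661, 1, 656, 663, 1, 658, 661],
})
target = 660

# each run of 1..end starts where value drops, so count the resets
run_id = (df['value'].diff() < 0).cumsum()

# per run, the index of the row whose value is closest to the target
idx = df.groupby(run_id)['value'].apply(lambda s: (s - target).abs().idxmin())
nearest_rows = df.loc[idx]
print(nearest_rows['value'].tolist())  # [659, 663, 661]
```

On ties (e.g. 659 and 661 are both 1 away), idxmin keeps the first occurrence; adjust if you need both rows.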

Improve error message for ill-formed input format?

I have a map file containing data like this:
|labels 0 0 1 0 0 0 |features 0
|labels 1 0 0 0 0 0 |features 2
|labels 0 0 0 1 0 0 |features 3
|labels 0 0 0 0 0 1 |features 7
Data is read into a minibatch with the following code:
from cntk import Trainer, StreamConfiguration, text_format_minibatch_source, learning_rate_schedule, UnitType
mb_source = text_format_minibatch_source('test_map2.txt', [
    StreamConfiguration('features', 1),
    StreamConfiguration('labels', num_classes)])
test_minibatch = mb_source.next_minibatch(2)
If the input file is ill-formed, you will sometimes get a quite cryptic error message. For example a missing line break at the end of the last row in the input file will result in an error like this:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-35-2f1481ccfced> in <module>()
----> 1 test_minibatch = mb_source.next_minibatch(2)
C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\utils\swig_helper.py in wrapper(*args, **kwds)
56 #wraps(f)
57 def wrapper(*args, **kwds):
---> 58 result = f(*args, **kwds)
59 map_if_possible(result)
60 return result
C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\io\__init__.py in next_minibatch(self, minibatch_size_in_samples, input_map, device)
159
160 mb = super(MinibatchSource, self).get_next_minibatch(
--> 161 minibatch_size_in_samples, device)
162
163 if input_map:
C:\local\Anaconda3-4.1.1-Windows-x86_64\envs\cntk-py34\lib\site-packages\cntk\cntk_py.py in get_next_minibatch(self, *args)
1914
1915 def get_next_minibatch(self, *args):
-> 1916 return _cntk_py.MinibatchSource_get_next_minibatch(self, *args)
1917 MinibatchSource_swigregister = _cntk_py.MinibatchSource_swigregister
1918 MinibatchSource_swigregister(MinibatchSource)
RuntimeError: Invalid chunk requested.
Sometimes it can be hard to figure out where in the file the problem is. Would it be possible to emit a more specific error message? A line number in the input file would be useful.
Thanks for reporting the issue. We have created a bug and will be working on fixing the reader behavior with ill-formed input.
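Until the reader itself reports line numbers, a rough pre-validation pass over the map file can narrow a problem down. This is only a sketch: the regex encodes the exact |labels ... |features layout from the example above and would need adjusting for other stream configurations.

```python
import re

def check_map_lines(lines, num_labels=6):
    """Return (1-based line number, line) pairs that do not match the
    expected '|labels <num_labels bits> |features <int>' layout."""
    pat = re.compile(r'^\|labels( [01]){%d} \|features \d+\s*$' % num_labels)
    return [(n, l) for n, l in enumerate(lines, start=1) if not pat.match(l)]

good = "|labels 0 0 1 0 0 0 |features 0"
bad = "|labels 0 0 1 0 0 |features"   # too few labels, missing feature value
print(check_map_lines([good, bad]))   # [(2, '|labels 0 0 1 0 0 |features')]
```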

Creating DXF Spline programmatically

I need to create spline programmatically. I've made something like:
0
SECTION
2
HEADER
9
$ACADVER
1
AC1006
0
ENDSEC
0
SECTION
2
TABLES
0
TABLE
2
LAYER
0
LAYER
2
shape
70
64
62
250
6
CONTINUOUS
0
LAYER
2
holes
70
64
62
250
6
CONTINUOUS
0
ENDTAB
0
ENDSEC
0
SECTION
2
ENTITIES
0
SPLINE
8
shape
100
AcDbSpline
210
0
220
0
230
1
70
4
71
3
72
11
73
4
74
4
42
0.0000001
43
0.0000001
44
0.0000000001
40
0
40
0
40
0
40
0
40
1
40
1
40
1
40
2
40
2
40
2
40
2
10
0
20
0
30
0
10
100
20
50
30
0
10
40
20
40
30
0
10
15
20
23
30
0
11
0
21
0
31
0
11
200
21
200
31
0
11
80
21
80
31
0
11
432
21
234
31
0
0
ENDSEC
0
EOF
When I'm trying to open it in Autodesk TrueView, I'm getting an error:
Undefined group code 210 for object on line 54.
Invalid or incomplete DXF input -- drawing discarded.
Where is the error? When I'm copying just SPLINE section to the DXF generated by AI everything works fine. So I think I need to add something in the header section or something.
This file is DXF version AC1006, which is older than DXF R12. The SPLINE entity
requires at least DXF version AC1012 (DXF R13/R14). But with DXF version AC1012
the tag structure of DXF files changes (OBJECTS and CLASSES sections, SubClassMarkers, ...), so just editing the DXF version
does not work.
See also: http://ezdxf.readthedocs.io/en/latest/dxfinternals/filestructure.html#minimal-dxf-content
Also the SPLINE entity seems to be invalid: it has no handle (5) and no owner
tag (330), and the whole AcDbEntity subclass is missing.
Your spline is of degree 3 with 11 knots (0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2) and 4 control points ((0,0), (100,50), (40,40), (15,23)). This might be the culprit: the number of knots must equal the number of control points plus the degree plus 1, so you should have either 4 control points and 8 knots, or 7 control points and 11 knots.
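The knot-count rule above (knots = control points + degree + 1) can be sanity-checked with a small helper; the clamped (open uniform) knot vector built here is one common choice, not the only valid one:

```python
def open_uniform_knots(n_control, degree):
    """Clamped knot vector: (degree+1) repeated knots at each end,
    uniformly spaced interior knots in between."""
    n_knots = n_control + degree + 1
    interior = n_knots - 2 * (degree + 1)
    return ([0.0] * (degree + 1)
            + [float(i + 1) for i in range(interior)]
            + [float(interior + 1)] * (degree + 1))

# 4 control points at degree 3 need 8 knots, 7 need 11
print(len(open_uniform_knots(4, 3)), len(open_uniform_knots(7, 3)))  # 8 11
```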
You may also need to assign a handle to the SPLINE: once $ACADVER is raised to AC1018 (AutoCAD 2004) or later, item handles are required.
Try adding a 5-code pair right before the layer designation, like so, where AAAA is a unique hex-encoded handle:
  0
SPLINE
5 <-- add these two lines
AAAA <--
8
shape
100
AcDbSpline

SPIN output results

I have written Promela code to verify the Needham-Schroeder protocol using SPIN.
After running a random simulation of the code I receive this output:
0: proc - (:root:) creates proc 0 (:init:)
Starting PIni with pid 1
1: proc 0 (:init:) creates proc 1 (PIni)
0 :init ini run PIni(A,I,N
Starting PRes with pid 2
3: proc 0 (:init:) creates proc 2 (PRes)
0 :init ini run PRes(B,Nb)
Starting PI with pid 3
4: proc 0 (:init:) creates proc 3 (PI)
0 :init ini run PI()
1 PIni 62 else
1 PIni 63 1
1 PIni 64 ca!self,nonce,
3 PI 128 ca?,x1,x2,x3
1 PIni 64 values: 1!A,Na
3 PI 128 values: 1?0,Na
Process Statement PI(3):kNa PI(3):x1 PI(3):x2 PI(3):x3
3 PI 135 x3 = 0 1 0 0 I
3 PI 101 ca!B,gD,A,B 1 0 0 0
2 PRes 79 ca?eval(self), 1 0 0 0
3 PI 101 values: 1!B,gD 1 0 0 0
2 PRes 79 values: 1?B,gD 1 0 0 0
Process Statement PI(3):kNa PI(3):x1 PI(3):x2 PI(3):x3 PRes(2):g2 PRes(2):g3
2 PRes 80 g3==A)&&(self= 1 0 0 0 gD A
2 PRes 80 ResRunningAB = 1 0 0 0 gD A
Process Statement PI(3):kNa PI(3):x1 PI(3):x2 PI(3):x3 PRes(2):g2 PRes(2):g3 ResRunning
2 PRes 82 ca!self,g2,non 1 0 0 0 gD A 1
3 PI 128 ca?,x1,x2,x3 1 0 0 0 gD A 1
2 PRes 82 values: 1!B,gD 1 0 0 0 gD A 1
3 PI 128 values: 1?0,gD 1 0 0 0 gD A 1
3 PI 135 x3 = 0 1 0 0 A gD A 1
3 PI 113 ca!( (kNa) -> 1 0 0 0 gD A 1
1 PIni 68 ca?eval(self), 1 0 0 0 gD A 1
3 PI 113 values: 1!A,Na 1 0 0 0 gD A 1
1 PIni 68 values: 1?A,Na 1 0 0 0 gD A 1
Process Statement PI(3):kNa PI(3):x1 PI(3):x2 PI(3):x3 PIni(1):g1 PRes(2):g2 PRes(2):g3 ResRunning
1 PIni 69 else 1 0 0 0 Na gD A 1
1 PIni 69 1 1 0 0 0 Na gD A 1
1 PIni 71 cb!self,g1,par 1 0 0 0 Na gD A 1
3 PI 139 cb?,x1,x2 1 0 0 0 Na gD A 1
1 PIni 71 values: 2!A,Na 1 0 0 0 Na gD A 1
3 PI 139 values: 2?0,Na 1 0 0 0 Na gD A 1
3 PI 145 x2 = 0 1 0 I 0 Na gD A 1
timeout
#processes: 4
34: proc 3 (PI) needhamNew.pml:100 (state 81)
34: proc 2 (PRes) needhamNew.pml:86 (state 10)
34: proc 1 (PIni) needhamNew.pml:73 (state 18)
34: proc 0 (:init:) needhamNew.pml:58 (state 8)
4 processes created
I can see the processes that are created, which are the Initiator, Responder and Intruder. However, I'm finding it difficult to see exactly how this shows that the Needham-Schroeder protocol can be broken, even though I understand the theory behind it.
Can anyone make sense of this output and maybe direct me to where I should be looking?
If you would like to view my Promela code please let me know!
Any feedback is appreciated!
Near the end of the output you see timeout; this means that no further progress can be made, which usually indicates a deadlock of some sort. In the process list at the end you see the line numbers, and none of them is marked as a valid end state. Therefore either you have a true deadlock, or your model is wrong because it fails to identify its valid end state(s).