How can I change my index vector into sparse feature vector that can be used in sklearn? - pandas

I am doing a News recommendation system and I need to build a table for users and news they read. my raw data just like this :
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
the first column is userID, the second column is the newsID.newsID is a index column, for example, after transformation, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news (in sparse vector, the 12th column, 456th column and 157th column are 1, while other columns have value 0). And I want to change these data into a sparse vector format that can be used as input vector in Kmeans or DBscan algorithm of sklearn.
How can I do that?

One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
if user not in user_row_map:
user_row_map[user] = user_row_index
user_row_index+=1
for movie in movies:
I.append(user_row_map[user])
J.append(movie)
data.append(1) # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()

Use MultiLabelBinarizer from sklearn.preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0

Related

How to fix "tf.math.confusion_matrix()" error

I'm trying to find the confusion matrix of a multiclass classification problem. I'm using tf.math.confusion_matrix() to do that. The code snippet is as follows,
y_pred = model.predict(x_test)
y_pred = tf.argmax(y_pred, axis=1)
Y_test = tf.argmax(y_test, axis=1)
matrix = tf.math.confusion_matrix(Y_test, y_pred)
The output of Y_test is,
tf.Tensor(
[[0 2 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 3 ... 0 0 0]
[0 0 2 ... 0 0 0]], shape=(2124, 279), dtype=int64)
The output of y_pred is,
tf.Tensor(
[[1 2 2 ... 0 0 0]
[0 2 3 ... 0 0 0]
[3 2 0 ... 3 1 3]
...
[3 1 0 ... 2 3 2]
[1 0 3 ... 1 1 2]
[1 0 2 ... 1 1 2]], shape=(2124, 279), dtype=int64)
Y_test[1] looks like the following,
tf.Tensor(
[0 2 0 1 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(279,), dtype=int64)
y_pred[1] looks like the following,
tf.Tensor(
[0 2 3 1 3 3 2 2 2 3 2 3 2 3 3 2 1 0 0 0 0 3 1 0 2 3 1 2 0 1 0 0 1 0 0 0 0
2 0 2 1 0 0 0 0 1 0 0 0 3 2 0 0 3 2 0 0 3 3 0 3 0 0 0 0 1 0 2 1 0 2 3 0 3
3 0 2 3 1 3 2 0 3 0 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 3 3 0 0 0 0 0 3 0 0 1
0 3 0 3 3 0 1 0 3 0 0 0 0 0 0 3 0 1 0 0 0 0 0 0 0 0 0 3 0 0 3 0 0 0 0 0 0
3 0 3 3 0 0 0 3 0 0 0 0 0 0 0 0 0 0 3 0 3 0 0 0 0 3 0 3 0 0 0 0 0 0 0 2 0
0 1 0 0 0 0 2 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 3 0 3 3 0 2 3 0 3 3
3 3 3 0 3 0 3 0 0 3 0 0 0 0 3 3 3 2 0 0 0 0 0 0 0 0 2 3 0 0 3 0 0 0 3 0 2
0 0 3 0 0 0 0 0 3 1 2 0 3 2 3 0 3 0 0 0], shape=(279,), dtype=int64)
And the error I'm getting is,
InvalidArgumentError: Dimensions [0,2) of indices[shape=[2124,2,279]] must match dimensions [0,2) of updates[shape=[2124,279]] [Op:ScatterNd]
How this can be solved?

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have some sensor data from an intelligent floor which is in column "CAPACITANCE" (split by ",") and that data comes from the device indicated in column "DEVICE". Now I want to have one row with a column per sensor - each device has 8 sensors, so I want to have devices x 8 columns and in that column I want the sensor data from exactly that sensor.
But my code seems to be super slow since I have about 90.000 rows in that dataframe! Does anyone have a suggestion how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
"02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
"03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
"04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
"05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
for sensor in range(8):
floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)], ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
obj = row["DEVICE"]
sensor_data = row["CAPACITANCE"].split(',')
for idx, val in enumerate(sensor_data):
col = obj + "-" + str(idx + 1)
floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
Like commented, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
dev_name = x.DEVICE.iloc[0]
ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
return ret_df
new_df = df.groupby('DEVICE').apply(explode).fillna(0)
and then you can merge this with the old data frame:
df = df.join(new_df)

Pharo FileSystem: How do I write a binary file?

TabularResources testExcelSheet
from this project gives me a binary representation in a literal array of an Excel file.
````
testExcelSheet
^ #[80 75 3 4 20 0 6 0 8 0 0 0 33 0 199 122 151 144 120 1 0 0 32 6 0 0 19 0 8 2 91 67 111 110 116 101 110 116 95 84 121 112 101 115 93 46 120 109 108 32 162 4 2 40 160 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .....
....0 109 108 80 75 1 2 45 0 20 0 6 0 8 0 0 0 33 0 126 148 213 45 209 1 0 0 250 10 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 233 36 0 0 120 108 47 99 97 108 99 67 104 97 105 110 46 120 109 108 80 75 5 6 0 0 0 0 13 0 13 0 74 3 0 0 232 38 0 0 0 0]
````
Question
How do I write this to the disk to see which kind of file it is?
Answer
(by Esteban, edited)
./TabularATest1.xlsx' asFileReference writeStreamDo: [ :stream |
stream
binary;
nextPutAll: self testExcelSheet ]
Easiest way to do that is something like this:
'./file.bin' asFileReference writeStreamDo: [ :stream |
stream
binary;
nextPutAll: #[1 2 3 4 5 6 7 8 9 0] ]
So the trick is just telling to the stream "be a binary file" :)

in a word macro delete everything that does not start with one of two strings

I have a data file that contains a lot of extra data. I want to run a word macro that only keeps 5 lines (I could live with 6 if it makes it easier)
I found how to delete a row if it contains a string.
I want to keep the paragraphs that start with:
Record write time
Headband impedance
Headband Packets
Headband RSSI
Headband Status
I could live with keeping
Headband ID
I tried the following macro, based on a sample I saw here. But, I am getting an error.
Sub test()
'
' test Macro
Dim search1 As String
search1 = "record"
Dim search2 As String
search2 = "headb"
Dim para As Paragraph
For Each para In ActiveDocument.Paragraphs
Dim txt As String
txt = para.Range.Text
If Not InStr(LCase(txt), search1) Then
If Not InStr(LCase(txt), search2) Then
para.Range.Delete
End If
Next
End Sub
The error is: next without For.
I know that there may be a better way, and an open to any fix.
Sample data:
The data is:
ZEO Start data record
----------------
Record write time: 10/14/2014 20:32
Factory reset date: 10/14/2014 20:23
Headband ID: 01/01/1970 18:32
Headband impedance: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 241 247 190 165 154 150 156 162 177 223 202
Headband Packets: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 21 4 30 3 3 3 9 4 46 46 1
Headband RSSI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 6 254 254 250 5 255 4 3 249
Headband Status: 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 169 170 170
Hardware ID: 2
Software ID: 43
Sensor Life Reset date: Not recorded
sleep Stat Reset date: 10/14/2014 20:18
Awakenings: 0
Awakenings Average: 0
Start of night: 10/14/2014 20:28
End of night: 10/14/2014 20:32
Awakenings: 0
Awakenings Average: 0
Time in deep: 0
Time in deep average: 0
There is an End If missing. Add this immediately after the first End If - do you get the same error?
Update:
There is also an error in the If conditions. Check the InStr reference for return values.
You need to use something like If Not InStr(...) = 1 Then on both if statements.

SKlearn metrics fails with expected y object and predicted y object

In Sci-kit learn have created a few models with train and test data.
The models work fine, but when I try to compute any accuracy metrics, it fails. I assume something is wrong with either my prediction object (pred y) or expected object (true y).
For this test, I have looked at the pred y. It is an object and have 119 0/1 values.
The true y is also an object and has 119 0/1 values.
My code and the error is below, as well as an object comparison. It is the error I do not understand.
"expected" is my true y and "target_predicted" is the predicted y.
I have tried other metrics and other models- it always fails when I am at this stage.
Any assistance?
#Basic Decsion Tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(bank_train, bank_train_target)
print clf
DecisionTreeClassifier(compute_importances=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_density=None, min_samples_leaf=1, min_samples_split=2,
random_state=None, splitter='best')
#test model using test data
target_predicted = clf.predict(bank_test)
accuracy_score(expected,target_predicted)
#error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-23d1a990a192> in <module>()
1 #test model using test data
2 target_predicted = clf.predict(bank_test)
----> 3 accuracy_score(expected,target_predicted)
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in accuracy_score(y_true, y_pred, normalize, sample_weight)
1295
1296 # Compute accuracy for each possible representation
-> 1297 y_type, y_true, y_pred = _check_clf_targets(y_true, y_pred)
1298 if y_type == 'multilabel-indicator':
1299 score = (y_pred != y_true).sum(axis=1) == 0
/Users/mpgartland1/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_clf_targets(y_true, y_pred)
125 if (y_type not in ["binary", "multiclass", "multilabel-indicator",
126 "multilabel-sequences"]):
--> 127 raise ValueError("{0} is not supported".format(y_type))
128
129 if y_type in ["binary", "multiclass"]:
ValueError: unknown is not supported
Here is a comparison of the two objects.
print target_predicted.size
print expected.size
print target_predicted.dtype
print expected.dtype
print target_predicted
print expected
119
119
object
object
[1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 1 0 0 0 1]
[1 0 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0
0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1
0 1 0 0 0 1 0 1]
If also fails when I try a confusion matrix or other metric- using very cookie cutter code. So, my guess is in the object(s).
Thanks