TypeError: Expected binary or unicode string - tensorflow

I want to use tf.data to feed in my image data. I have read all the images in a folder into a np.array, and then used that np.array to create a tf.data.Dataset object. However, I got a TypeError. My code is shown below.
import os
from scipy.misc import imread
import numpy as np
import glob
import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
image = []
img_dir = 'data/ILSVRC2012_test/*'
images = np.array([np.array(imread(data)) for data in glob.glob(img_dir)])
image_data = tf.data.Dataset.from_tensor_slices(images)
The error output is shown below.
TypeError: Expected binary or unicode string, got array([[[184, 210, 225],
        [184, 210, 225],
        [184, 210, 225],
        ...,
        [ 85,  86,  20],
        [ 85,  85,  21],
        [ 75,  74,  20]]], dtype=uint8)
Any help would be appreciated!
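This error usually means from_tensor_slices received a numpy array with dtype=object, which happens when the images in the folder have different shapes, so np.array builds an array of arrays instead of a single 4-D tensor. One workaround, sketched below under the assumption that the files are JPEGs and that your TF version has the tf.data and tf.image APIs shown (the 224x224 target size is just an illustrative choice), is to build the dataset from the file names and decode/resize inside map:
import glob
import tensorflow as tf

filenames = glob.glob('data/ILSVRC2012_test/*')

def load_image(path):
    # read and decode one image file, then resize so all elements share a shape
    raw = tf.io.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=3)
    return tf.image.resize(img, [224, 224])

image_data = tf.data.Dataset.from_tensor_slices(filenames).map(load_image)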

Related

how to plot two cluster using dictionary file

I have a dictionary saved in a .npy file that contains two clusters' ids that I want to plot in a scatter plot (the id values saved under key '0' are one cluster, and the id values saved under key '1' are the other cluster).
My script:
import numpy as np
import matplotlib.pyplot as plt
data=np.load("dict.npy",allow_pickle=True)
print(data)
array({0: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 198, 199]), 1: array([ 89, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121, 122, 123, 124])}, dtype=object)
An example, as you requested:
#you will need these libraries:
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
Then generate some random 2D data, just for this example:
#the data you want to cluster
X = np.random.multivariate_normal(mean=[1,2], cov=[[.5, .25], [.25,.75]], size=1800)
plt.scatter(*X.T, alpha=.25, color="k")
Finally run the clustering and see the result:
X_cluster = KMeans(n_clusters=2).fit_predict(X)
for c in set(X_cluster):
    plt.scatter(*X[X_cluster==c].T, alpha=.25)
Then, to plot the two clusters stored in your dictionary, index X with the saved ids:
plt.figure(figsize=(7,7))
for cluster in data:
    plt.scatter(X[data[cluster],0], X[data[cluster],1])
plt.show()
where X is the dataset that you used for the clustering, with shape (N, 2) (N is the number of samples).
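One detail worth checking: np.load on a pickled dictionary returns a 0-d object array (as the print above shows), so it has to be unwrapped with .item() before you can iterate over its keys. A minimal sketch, assuming X is the (N, 2) dataset used for the clustering as in the answer:
import numpy as np
import matplotlib.pyplot as plt

# unwrap the 0-d object array produced by np.load into the actual dict
data = np.load("dict.npy", allow_pickle=True).item()

plt.figure(figsize=(7, 7))
for cluster, ids in data.items():
    # each value is an array of row indices into X
    plt.scatter(X[ids, 0], X[ids, 1], label=f"cluster {cluster}")
plt.legend()
plt.show()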

My Mac doesn't show the seaborn plot and gives no error message?

In [14]: import seaborn as sns
...: import matplotlib.pyplot as plt
...:
...: l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51, 52, 53, 53, 53, 53, 55, 55, 55,
...: 55, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58,
...: 58, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 61,
...: 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62,
...: 62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 65,
...: 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66,
...: 67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 70,
...: 70, 70, 71, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73,
...: 74, 74, 74, 74, 74, 75, 75, 75, 76, 77, 77, 78, 78, 79, 79, 79, 79, 80,
...: 80, 80, 80, 81, 81, 81, 81, 83, 84, 84, 85, 86, 86, 86, 86, 87, 87, 87,
...: 87, 87, 88, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 92,
...: 92, 93, 93, 93, 94, 95, 95, 96, 98, 98, 99, 100, 102, 104, 105, 107, 108,
...: 109, 110, 110, 113, 113, 115, 116, 118, 119, 121]
...:
...: sns.distplot(l, kde=True, rug=False)
...:
/Users/congminmin/.venv/wbkg/lib/python3.7/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[14]: <AxesSubplot:ylabel='Density'>
<Figure size 1008x720 with 1 Axes>
In [15]: plt.show()
In [16]:
It doesn't give any error, as shown above in my IPython session, but no plot appears. UMAP behaves the same way. If I put the code into a Python file and run it, it doesn't show any visualization either, again with no error.
My OS:
macOS Big Sur, 11.6
This is my first time trying the seaborn and UMAP libraries, and so far no success.
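A common cause of this symptom is matplotlib falling back to a non-interactive backend (such as Agg) inside a virtualenv, in which case plt.show() silently does nothing. A quick check, sketched below under that assumption, is to print the active backend and, if necessary, switch to an interactive one before plotting:
import matplotlib
print(matplotlib.get_backend())      # if this prints 'agg', plt.show() will never open a window

matplotlib.use("MacOSX")             # or "TkAgg" / "Qt5Agg" if that toolkit is installed
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot([41, 44, 46, 46, 47])   # histplot is the non-deprecated replacement for distplot
plt.show()                           # required when running as a script (not needed in interactive mode)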

Most efficient method to loop through a dataframe and return a filtered array of values based on multiple criteria

I have a dataset of events that includes various elements, with positional data for these elements recorded at various points in time. The total dataset is very large, covering many of these events.
For each element at each point in time, I want to find the closest other element. To start, I was going to return an array of the positional data of all other elements at a specific time period and include it in the same row of the original dataframe (to perform further calculations on later).
I made two attempts at coding this, which I have included below. Both take too long on such a large dataset. Any way to make this more efficient would be greatly appreciated.
import pandas as pd
import numpy as np

def func1(db, val, frame):
    return (db.loc[(db['val'].isin([val])) & (db['frameId'].isin([frame])),
                   ['displayName', 'x', 'y']]
              .reset_index(drop=True).values.tolist())

d = pd.DataFrame({'displayName': ['Bob', 'Jane', 'Alice',
                                  'Bob', 'Jane', 'Alice'],
                  'x': [90, 88, 86, 94, 91, 92],
                  'y': [24, 13, 18, 20, 15, 16],
                  'val': [201801, 201801, 201801, 201801, 201801, 201801],
                  'frameId': [1, 1, 1, 2, 2, 2]})
res = d.apply(lambda row: func1(d, row['val'], row['frameId']), axis=1)
Approach 2:
def func2(db, val, frame):
    return [l[[0, 1, 2]] for l in db if l[3] == val if l[4] == frame]

res = d.apply(lambda row: func2(np.array(d), row['val'], row['frameId']), axis=1)
The result (res) will thus be an array that looks like this:
[[['Bob', 90, 24], ['Jane', 88, 13], ['Alice', 86, 18]],
[['Bob', 90, 24], ['Jane', 88, 13], ['Alice', 86, 18]],
[['Bob', 90, 24], ['Jane', 88, 13], ['Alice', 86, 18]],
[['Bob', 94, 20], ['Jane', 91, 15], ['Alice', 92, 16]],
[['Bob', 94, 20], ['Jane', 91, 15], ['Alice', 92, 16]],
[['Bob', 94, 20], ['Jane', 91, 15], ['Alice', 92, 16]]]
However, over the large dataset this is very time-consuming to produce with both methods, so any way to reduce the time complexity would be welcome.
If the order of the first dimension of the 3D array does not matter, then you can use the following (if it does matter, you will have to create a series that groups by displayName or index and takes the cumcount, sort by that, and then drop it; let me know):
import pandas as pd
import numpy as np

d = pd.DataFrame({'displayName': ['Bob', 'Jane', 'Alice',
                                  'Bob', 'Jane', 'Alice'],
                  'x': [90, 88, 86, 94, 91, 92],
                  'y': [24, 13, 18, 20, 15, 16],
                  'val': [201801, 201801, 201801, 201801, 201801, 201801],
                  'frameId': [1, 1, 1, 2, 2, 2]})
n = d['frameId'].max() + 1
x = d['displayName'].nunique()
pd.concat([d.iloc[:,0:3]]*n).to_numpy().reshape(d.shape[0],x,x)
Out[1]:
array([[['Bob', 90, 24],
['Jane', 88, 13],
['Alice', 86, 18]],
[['Bob', 94, 20],
['Jane', 91, 15],
['Alice', 92, 16]],
[['Bob', 90, 24],
['Jane', 88, 13],
['Alice', 86, 18]],
[['Bob', 94, 20],
['Jane', 91, 15],
['Alice', 92, 16]],
[['Bob', 90, 24],
['Jane', 88, 13],
['Alice', 86, 18]],
[['Bob', 94, 20],
['Jane', 91, 15],
['Alice', 92, 16]]], dtype=object)
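Since both original attempts re-filter the whole dataframe for every row, a usually faster pattern is to build each (val, frameId) group's [name, x, y] list once with groupby and then look it up per row. This is only a sketch of that idea, reusing the small d defined above rather than a drop-in replacement tested on the full dataset:
import pandas as pd

# one [displayName, x, y] list per (val, frameId) group, computed a single time
frames = (d.groupby(['val', 'frameId'])[['displayName', 'x', 'y']]
            .apply(lambda g: g.to_numpy().tolist()))

# look up the precomputed list for each row's (val, frameId) pair
res = [frames[key] for key in zip(d['val'], d['frameId'])]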

Validation accuracy reaches 1 with very high loss

I have 3 very specific questions:
I am training a regression model (the dataset is small, around 300 samples) and I get a validation accuracy of 1.00 from the 4th epoch until the last one, and a training accuracy of 0.9957 at the last epoch, while the loss is actually large (about 33), so I don't understand how both the accuracy and the loss can be so high. I am using the Adam optimizer and the mean_absolute_error loss.
When scaling the inputs, some values turn negative even though I don't have any negative values; is that reasonable? I also noticed that some identical numbers are no longer identical after scaling.
When I predict, I should scale the data I am going to predict on in the same manner I scaled the training inputs, right? But how can I make sure of that, since (as I understand it) the inputs are all scaled relative to each other across the rows?
array=SData.to_numpy()
array
array([[ 6.25 , 6.25 , 6.25 , ..., 8.11521 ,
13.525349,
744.421033],
[ 6.25 , 6.25 , 6.25 , ..., 8.989118, 14.981864,
744.484697],
[ 6.25 , 6.25 , 6.25 , ..., 8.931293, 14.885489,
744.484629],
...,
[ 6.160831, 8.157965, 9.184461, ..., 6.170488, 10.284147,
938.598232],
[ 6.160831, 8.157965, 9.184461, ..., 12.417958, 20.696597,
938.291951],
[ 6.160831, 8.157965, 9.184461, ..., 6.007829, 10.013048,
938.103987]])
unscaled_inputs=array[:,:9]
targets=array[:,9:]
unscaled_inputs
array([[ 6.25 , 6.25 , 6.25 , ..., 6.25 , 6.25 ,
0. ],
[ 6.25 , 6.25 , 6.25 , ..., 6.25 , 6.25 ,
15. ],
[ 6.25 , 6.25 , 6.25 , ..., 6.25 , 6.25 ,
30. ],
...,
[ 6.160831, 8.157965, 9.184461, ..., 8.640023, 8.996907,
45. ],
[ 6.160831, 8.157965, 9.184461, ..., 8.640023, 8.996907,
60. ],
[ 6.160831, 8.157965, 9.184461, ..., 8.640023, 8.996907,
75. ]])
scaled_inputs=preprocessing.scale(unscaled_inputs)
scaled_inputs
array([[ 0.64061068, -1.55811375, -1.96681483, ..., -0.96073795,
-1.709721 , -1.46385011],
[ 0.64061068, -1.55811375, -1.96681483, ..., -0.96073795,
-1.709721 , -0.87831007],
[ 0.64061068, -1.55811375, -1.96681483, ..., -0.96073795,
-1.709721 , -0.29277002],
...,
[ 0.35930701, 1.56499191, 1.66411229, ..., 0.76559569,
0.84111767, 0.29277002],
[ 0.35930701, 1.56499191, 1.66411229, ..., 0.76559569,
0.84111767, 0.87831007],
[ 0.35930701, 1.56499191, 1.66411229, ..., 0.76559569,
0.84111767, 1.46385011]])
shuffled_indicies=np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indicies)
shuffled_indicies
array([257, 191,  37, 128,  72, 247, 161, 252, 140, 264, 258, 255, 278,
       ...,
       142, 229, 203, 152, 143, 221, 242, 171,  48,  93, 132, 183, 215,
       261,  46])
shuffled_inputs=scaled_inputs[shuffled_indicies]
shuffled_targets=targets[shuffled_indicies]
# define the number of observations
observations_count=shuffled_inputs.shape[0]
# 80 10 10 Rule
train_count=int(0.8 * observations_count)
validation_count=int(0.1 * observations_count )
test_count=observations_count-train_count-validation_count
train_inputs=shuffled_inputs[:train_count]
train_targets=shuffled_targets[:train_count]
validation_inputs=shuffled_inputs[train_count:train_count+validation_count]
validation_targets=shuffled_targets[train_count:train_count+validation_count]
test_inputs=shuffled_inputs[train_count+validation_count:]
test_targets=shuffled_targets[train_count+validation_count:]
np.savez('Sample_Data_Train',inputs=train_inputs,targets=train_targets)
np.savez('Sample_Data_Validation',inputs=validation_inputs,targets=validation_targets)
np.savez('Sample_Data_Test',inputs=test_inputs,targets=test_targets)
npz=np.load(r"C:\Users\dai_k\OneDrive\Desktop\GRASSHOPPERS\Second semester\Thesis\samplenpz\Sample_Data_Train.npz")
Processed_train_inputs=npz['inputs'].astype(np.float)
processed_train_targets=npz['targets'].astype(np.float)
npz1=np.load(r"C:\Users\dai_k\OneDrive\Desktop\GRASSHOPPERS\Second semester\Thesis\samplenpz\Sample_Data_Validation.npz")
processed_validation_inputs=npz1['inputs'].astype(np.float)
processed_validation_targets=npz1['targets'].astype(np.float)
npz2=np.load(r"C:\Users\dai_k\OneDrive\Desktop\GRASSHOPPERS\Second semester\Thesis\samplenpz\Sample_Data_Test.npz")
processed_test_inputs=npz2['inputs'].astype(np.float)
processed_test_targets=npz2['targets'].astype(np.float)
output_size=8
hidden_layer_size=100 # START WITH ANY WIDTH - this is a hyperparameter
model=tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='relu')
])
model.compile(optimizer='adam', loss='mean_absolute_error', metrics=['accuracy'])
batch_size=8
max_epochs=30
early_stopping=tf.keras.callbacks.EarlyStopping()
model.fit(Processed_train_inputs,
          processed_train_targets,
          batch_size=batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(processed_validation_inputs, processed_validation_targets),
          verbose=2)
It is a regression problem and you used the MAE loss function, so the loss will not be a small fraction like a classification loss; a value of that magnitude is plausible for a regression analysis.
You are standardizing your dataset, not normalizing it. Normalization usually means scaling a variable to values between 0 and 1, while standardization transforms data to have a mean of zero and a standard deviation of 1. So, by using scale you are converting your data to zero mean, which is why negative values appear. You can use a normalizer, or write the normalization yourself, if you want values in [0, 1].
Yes, you should standardize your data in the same way when predicting. Otherwise, performance will be poor because of the mismatch between the training and test data distributions.
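On the last point, the usual way to guarantee that prediction data is scaled in the same manner is to fit the scaler on the training inputs only and reuse that fitted scaler everywhere else, instead of calling preprocessing.scale separately on each array. A minimal sketch with scikit-learn's StandardScaler; train_inputs_unscaled and new_samples are placeholder names for your unscaled training split and for data you want to predict on:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# learn the mean and std from the training inputs only
train_inputs_scaled = scaler.fit_transform(train_inputs_unscaled)   # train_inputs_unscaled: placeholder for your unscaled training split

# at prediction time, transform (do not re-fit) with the same fitted scaler
new_inputs_scaled = scaler.transform(new_samples)                   # new_samples: placeholder for unseen rows with the same columns
predictions = model.predict(new_inputs_scaled)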

How to get initialisation point after sklearn.cluster.KMeans

How can I know the initialisation points that were used for KMeans after running KMeans from sklearn.cluster?
For each of my clusters, I need to return each feature of the initialisation points used (the original input was in a Pandas DataFrame).
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X,y)
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.items():   # .iteritems() is Python 2 only
    temp = [key, value]
    dictlist.append(temp)
print(dictlist)
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
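scikit-learn's KMeans does not expose the initial centroids it picked as a fitted attribute, so if you need them, one option is to generate the seeds yourself and pass them in via init, which means you know exactly which points were used. A sketch assuming scikit-learn >= 0.24, where kmeans_plusplus is available:
from sklearn.cluster import KMeans, kmeans_plusplus
from sklearn import datasets

X = datasets.load_iris().data

# compute the k-means++ seeds explicitly; indices are the rows of X chosen as seeds
initial_centers, indices = kmeans_plusplus(X, n_clusters=3, random_state=0)

# hand exactly those centers to KMeans (n_init=1 since the init is fixed)
clf = KMeans(n_clusters=3, init=initial_centers, n_init=1).fit(X)
print(initial_centers)      # the initialisation points, one row of features per cluster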