Problem with manual data for PyTorch's DataLoader - pandas

I have a dataset which I have to process so that it works with a convolutional neural network in PyTorch (I'm completely new to PyTorch). The data is stored in a dataframe with a column of pictures (28 x 28 ndarrays with int32 entries) and a column with their class labels. The pixels of the images only take the values +1 and -1 (since it is simulation data of a classical 2D Ising model).
I imported the following (a lot of this is not relevant for now, but I included everything for completeness. "data_loader" is a custom py file.):
import numpy as np
import matplotlib.pyplot as plt
import data_loader
import pandas as pd
import torch
import torchvision.transforms as T
from torchvision.utils import make_grid
from torch.nn import Module
from torch.nn import Conv2d
from torch.nn import Linear
from torch.nn import MaxPool2d
from torch.nn import ReLU
from torch.nn import LogSoftmax
from torch import flatten
from sklearn.metrics import classification_report
import time as time
from torch.utils.data import DataLoader, Dataset
Then I want to get this into the correct shape to make it usable with PyTorch. I do this by defining the following class:
class MetropolisDataset(Dataset):
    def __init__(self, data_frame, transform=None):
        self.data_frame = data_frame
        self.transform = transform

    def __len__(self):
        return len(self.data_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        label = self.data_frame['label'].iloc[idx]
        image = self.data_frame['image'].iloc[idx]
        image = np.array(image)
        if self.transform:
            image = self.transform(image)
        return (image, label)
I call instances of this class as:
train_set = MetropolisDataset(data_frame=df_train,
                              transform=T.Compose([
                                  T.ToPILImage(),
                                  T.ToTensor()]))
validation_set = MetropolisDataset(data_frame=df_validation,
                                   transform=T.Compose([
                                       T.ToPILImage(),
                                       T.ToTensor()]))
test_set = MetropolisDataset(data_frame=df_test,
                             transform=T.Compose([
                                 T.ToPILImage(),
                                 T.ToTensor()]))
The problem does not yet arise here, because I am able to read out and display images from these instances of the class defined above.
Then, as far as I can tell, it is necessary to pass this through PyTorch's DataLoader, which I do as follows:
batch_size = 64
train_dl = DataLoader(train_set, batch_size, shuffle=True, num_workers=3, pin_memory=True)
validation_dl = DataLoader(validation_set, batch_size, shuffle=True, num_workers=3, pin_memory=True)
test_dl = DataLoader(test_set, batch_size, shuffle=True, num_workers=3, pin_memory=True)
However, when I try to use these DataLoader instances, simply nothing happens: I get no error, but the computation never seems to get anywhere. I tried to run a CNN, but it does not appear to compute anything. Something else I tried was to show some sample images with the code provided by this article, but the same issue occurs. The sample code is:
def show_images(images, nmax=10):
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.set_xticks([]); ax.set_yticks([])
    ax.imshow(make_grid((images.detach()[:nmax]), nrow=8).permute(1, 2, 0))

def show_batch(dl, nmax=64):
    for images in dl:
        show_images(images, nmax)
        break

show_batch(test_dl)
It seems that there is some error in the implementation of my MetropolisDataset class or with the DataLoader itself. How could this problem be solved?

As mentioned in the comments, the problem was partly solved by setting num_workers to zero, since I was working in a Jupyter notebook, as answered here. However, this left one further problem open: I got errors when I used the DataLoader to run a CNN. The issue was that my data consisted of int32 numbers instead of float32. I do not include further code, because this was directly related to my data; the issue was (as so often) simply a wrong datatype.
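A minimal sketch of the two fixes together, reusing the names from the question (where exactly the float32 cast goes is my assumption; any point before the tensors reach the convolutional layers works):

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        label = self.data_frame['label'].iloc[idx]
        # Cast the +1/-1 int32 spins to float32 so the Conv2d layers get floating-point input.
        image = np.array(self.data_frame['image'].iloc[idx], dtype=np.float32)
        if self.transform:
            image = self.transform(image)
        return (image, label)

# num_workers=0 avoids the silent hang when the DataLoader runs inside a Jupyter notebook.
train_dl = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=0, pin_memory=True)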

Related

numpy ndarray error in lmfit when model is passed using sympy

I got the following error:
<lambdifygenerated-1>:2: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return numpy.array((A1*exp(-1/2*(x - xc1)**2/sigma1**2), 0, 0))
Here I have just one model, but this code is written to combine models for fitting with lmfit. Please kindly let me know what is going wrong.
import matplotlib.pyplot as plt
import numpy as np
import sympy
from sympy.parsing import sympy_parser
import lmfit
gauss_peak1 = sympy_parser.parse_expr('A1*exp(-(x-xc1)**2/(2*sigma1**2))')
gauss_peak2 = 0
exp_back = 0
model_list = sympy.Array((gauss_peak1, gauss_peak2, exp_back))
model = sum(model_list)
print(model)
model_list_func = sympy.lambdify(list(model_list.free_symbols), model_list)
model_func = sympy.lambdify(list(model.free_symbols), model)
np.random.seed(1)
x = np.linspace(0, 10, 40)
param_values = dict(x=x, A1=2, sigma1=1, xc1=2)
y = model_func(**param_values)
yi = model_list_func(**param_values)
yn = y + np.random.randn(y.size)*0.4
plt.plot(x, yn, 'o')
plt.plot(x, y)
lm_mod = lmfit.Model(model_func, independent_vars=('x'))
res = lm_mod.fit(data=yn, **param_values)
res.plot_fit()
plt.plot(x, y, label='true')
plt.legend()
plt.show()
lmfit.Model takes a model function that is a Python function. It parses the function arguments and expects those to be the Parameters for the model.
I don't think using sympy-created functions will do that. Do you need to use sympy here? I don't see why. The usage here seems designed to make the code more complex than it needs to be. It seems you want to make a model with a Gaussian-like peak, and a constant(?) background. If so, why not do
from lmfit.models import GaussianModel, ConstantModel
model = GaussianModel(prefix='p1_') + ConstantModel()
params = model.make_params(p1_amplitude=2, p1_center=2, p1_sigma=1, c=0)
That just seems way easier to me, and it is very easy to add a second Gaussian peak to that model.
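As a sketch of what the full fit could look like with this composite model (the noisy data below is synthetic, generated only to roughly match the question's parameters A1=2, xc1=2, sigma1=1, so treat it as an illustration):

import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel, ConstantModel

np.random.seed(1)
x = np.linspace(0, 10, 40)
yn = 2 * np.exp(-(x - 2)**2 / 2) + np.random.randn(x.size) * 0.4  # fake noisy peak

# Gaussian peak plus constant background
model = GaussianModel(prefix='p1_') + ConstantModel()
params = model.make_params(p1_amplitude=2, p1_center=2, p1_sigma=1, c=0)

result = model.fit(yn, params, x=x)
print(result.fit_report())
result.plot_fit()
plt.show()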
But even if you have your own preferred mathematical expression, don't use that as a sympy string, use it as Python code:
import numpy as np

def myfunction(x, A1, xc1, sigma1):
    return A1 * np.exp(-(x - xc1)**2 / (2 * sigma1**2))
and then
from lmfit import Model
mymodel = Model(myfunction)
params = mymodel.make_params(A1=2, xc1=2, sigma1=1)
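A fit then works the same way as with the built-in models (using the x and yn arrays from the question):

result = mymodel.fit(yn, params, x=x)
print(result.fit_report())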
In short: sympy is an amazing tool, but lmfit does not use it.

What is the correct way to implement a basic GLCM-Layer in Tensorflow/Keras?

I am trying to get a GLCM implementation running in a custom Keras layer in a reasonably fast time. So far I took the _glcm_loop from the skimage implementation, reduced it to what I needed, and put it into a basic layer, like this:
import numpy as np
import tensorflow as tf
from time import time
from tensorflow import keras
from tensorflow.keras.preprocessing import image
from tensorflow.keras import layers
from skimage.feature import *
from numpy import array
from math import sin, cos
from time import time
import matplotlib.pyplot as plt
class GLCMLayer(keras.layers.Layer):
    def __init__(self, greylevels=32, angles=[0], distances=[1], name=None, **kwargs):
        self.greylevels = greylevels
        self.angles = angles
        self.distances = distances
        super(GLCMLayer, self).__init__(name=name, **kwargs)

    def _glcm_loop(self, image, distances, angles, levels, out):
        rows = image.shape[0]
        cols = image.shape[1]
        for a_idx in range(len(angles)):
            angle = angles[a_idx]
            for d_idx in range(len(distances)):
                distance = distances[d_idx]
                offset_row = round(sin(angle) * distance)
                offset_col = round(cos(angle) * distance)
                start_row = max(0, -offset_row)
                end_row = min(rows, rows - offset_row)
                start_col = max(0, -offset_col)
                end_col = min(cols, cols - offset_col)
                for r in range(start_row, end_row):
                    for c in range(start_col, end_col):
                        i = image[r, c]
                        row = r + offset_row
                        col = c + offset_col
                        j = image[row, col]
                        out[i, j, d_idx, a_idx] += 1

    def call(self, inputs):
        P = np.zeros((self.greylevels, self.greylevels, len(self.distances), len(self.angles)), dtype=np.uint32, order='C')
        self._glcm_loop(inputs, self.distances, self.angles, self.greylevels, P)
        return P
    def get_config(self):
        config = {
            'angles': self.angles,
            'distances': self.distances,
            'greylevels': self.greylevels,
        }
        base_config = super(GLCMLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
My execution code looks like this:
def quant(img, greylevels):
    return array(img) // (256 // greylevels)

if __name__ == "__main__":
    source_file = "<some source file>"
    img_raw = image.load_img(source_file, target_size=(150,150), color_mode="grayscale")
    img = quant(img_raw, 32)
    layer = GLCMLayer()
    start = time()
    aug = layer(img)
    tf.print(time()-start)
This is my first step to create it as a preprocessing layer. The second step will then be to modify it so it can also run as a hidden layer inside a model. That is why I haven't put it into a complete model yet, but I feel there will be additional changes required when doing so.
For some reason, the execution time is about 15-20 seconds. Executing the code on the CPU without the layer takes about 0.0009 seconds. Obviously, something is going wrong here.
I am fairly new to tf and keras, so I fear I am missing something about how to use the framework. In order to resolve it, I read about (which doesn't mean I understood):
do not use np-functions inside tensorflow, but tf-functions instead,
use tf.Variable,
use tf.Data,
unfolding is not possible in some way (whatever that means)
I tried a little here and there, but couldn't get any of it running; instead I ran into various exceptions. So my questions are:
What is the correct way to use tf-functions in a GLCM to get the best performance on the GPU?
What do I need to take care of when using the layer in a complete model?
From that point on, I should hopefully be able to then implement the GLCM properties.
Any help is greatly appreciated.
(Disclaimer: I assume that there is a lot of other stuff not optimal yet, if anything comes to your mind just add it.)
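A minimal sketch of the first point above ("use tf-functions instead of np-functions"), restricted to a single angle of 0 and a distance of 1 and assuming the input is already quantised to integer grey levels, could count co-occurring grey-level pairs with tf.math.bincount (this is an illustrative assumption, not a tested drop-in replacement for the layer above):

import tensorflow as tf

def glcm_horizontal(image, levels=32, distance=1):
    # image: 2-D integer tensor with values in [0, levels)
    i = image[:, :-distance]                    # reference pixels
    j = image[:, distance:]                     # neighbours 'distance' pixels to the right (angle 0)
    pairs = tf.reshape(i * levels + j, [-1])    # encode each (i, j) pair as a single index
    counts = tf.math.bincount(pairs, minlength=levels * levels)
    return tf.reshape(counts, (levels, levels))

Because this uses only standard TensorFlow ops, it should be traceable inside a layer's call() instead of falling back to a Python loop over pixels.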

Using Sklearn with NumPy and Images and get this error 'setting an array element with a sequence'

I am trying to create a simple image classification tool.
I would like the code below to work for classifying images. It works fine when the input is a non-image NumPy array.
#https://e2eml.school/images_to_numbers.html
import numpy as np
from sklearn.utils import Bunch
from PIL import Image
monkey = [1]
dog = [2]
example_animals = Bunch(data = np.array([monkey,dog]),target = np.array(['monkey','dog']))
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2) #with KMeans you get to pre specify the number of Clusters
KModel = kmeans.fit(example_animals.data) #fit a model using the training data , in this case original example animal data passed through
import pandas as pd
crosstab = pd.crosstab(example_animals.target,KModel.labels_)
print(crosstab)
I have looked into how to make an image into a NumPy array at https://e2eml.school/images_to_numbers.html
The code below where I have converted images to NumPy array doesn't work.
When run, it produces the following error:
'setting an array element with a sequence'
#https://e2eml.school/images_to_numbers.html
import numpy as np
from sklearn.utils import Bunch
from PIL import Image
monkey = np.asarray(Image.open("monkey.jpg"))
dog = np.asarray(Image.open("dog.jpeg"))
example_animals = Bunch(data = np.array([monkey,dog]),target = np.array(['monkey','dog']))
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2) #with KMeans you get to pre specify the number of Clusters
KModel = kmeans.fit(example_animals.data) #fit a model using the training data , in this case original example animal data passed through
import pandas as pd
crosstab = pd.crosstab(example_animals.target,KModel.labels_)
print(crosstab)
I would appreciate any insight into how to fix the error 'setting an array element with a sequence' so that the images are compatible with the sklearn processing.
You need to make sure that your images "monkey.jpg" and "dog.jpeg" have the same number of pixels; otherwise, you will have to resize the images to the same size. Moreover, the data of your Bunch object needs to be of shape (n_samples, n_features) (you can check the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit).
You also need to be aware that you are using an unsupervised learning model (KMeans), so the output of the model is not directly "monkey" or "dog".
I found the solution to the error 'setting an array element with a sequence'.
KMeans requires that the data arrays being compared are the same size.
This means that when importing pictures, they need to be resized, converted into a NumPy array (a format compatible with KMeans), and finally flattened into a one-dimensional array.
#https://e2eml.school/images_to_numbers.html
#https://machinelearningmastery.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/
import numpy as np
from matplotlib import pyplot as plt
from sklearn.utils import Bunch
from PIL import Image
from sklearn.cluster import KMeans
import pandas as pd
monkey = Image.open("monkey.jpg")
dog = Image.open("dog.jpeg")
#resize pictures
monkey1 = monkey.resize((180,220))
dog1 = dog.resize((180,220))
#make pictures into numpy array
monkey2 = np.asarray(monkey1)
dog2 = np.asarray(dog1)
#https://www.quora.com/How-do-I-convert-image-data-from-2D-array-to-1D-using-python
#make numpy array into 1 dimensional array
monkey3 = monkey2.reshape(-1)
dog3 = dog2.reshape(-1)
example_animals = Bunch(data = np.array([monkey3,dog3]),target = np.array(['monkey','dog']))
kmeans = KMeans(n_clusters=2) #with KMeans you get to pre specify the number of Clusters
KModel = kmeans.fit(example_animals.data) #fit a model using the training data , in this case original example food data passed through
crosstab = pd.crosstab(example_animals.target,KModel.labels_)
print(crosstab)

Reshaping image and Plotting in Python

I am working on the mnist_fashion data. The images in this dataset are 28x28 pixels. For the purpose of feeding them to a neural network (a multi-layer perceptron), I transformed the data into shape (784,).
Further, I need to reshape the data back to its original size.
For this, I used the code given below:
from keras.datasets import fashion_mnist
import numpy as np
import matplotlib.pyplot as plt
(train_imgs,train_lbls), (test_imgs, test_lbls) = fashion_mnist.load_data()
plt.imshow(test_imgs[0].reshape(28,28))
no_of_test_imgs = test_imgs.shape[0]
test_imgs_trans = test_imgs.reshape(test_imgs.shape[1]*test_imgs.shape[2], no_of_test_imgs).T
plt.imshow(test_imgs_trans[0].reshape(28,28))
Unfortunately, I am not getting the same image back, and I am not able to understand why this is happening.
expected image:
received image:
Kindly help me to resolve the problem.
Pay attention to how you flatten the images in test_imgs_trans:
(train_imgs,train_lbls), (test_imgs, test_lbls) = tf.keras.datasets.fashion_mnist.load_data()
plt.imshow(test_imgs[0].reshape(28,28))
no_of_test_imgs = test_imgs.shape[0]
test_imgs_trans = test_imgs.reshape(no_of_test_imgs, test_imgs.shape[1]*test_imgs.shape[2])
plt.imshow(test_imgs_trans[0].reshape(28,28))
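To see why the order of the reshape matters, here is a small self-contained check (toy 4x4 "images" instead of 28x28, purely for illustration):

import numpy as np

imgs = np.arange(2 * 4 * 4).reshape(2, 4, 4)   # two toy 4x4 "images"

wrong = imgs.reshape(4 * 4, 2).T               # mixes pixels from both images before transposing
right = imgs.reshape(2, 4 * 4)                 # flattens each image separately

print(np.array_equal(right[0].reshape(4, 4), imgs[0]))  # True
print(np.array_equal(wrong[0].reshape(4, 4), imgs[0]))  # False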

small test_set xgb predict

I would like to ask about a problem I have had for the last couple of days.
First of all, I am a beginner in machine learning and this is my first time using the XGBoost algorithm, so excuse me for any mistakes I have made.
I trained my model to predict whether a log file is malicious or not. After I save and reload my model in a different session, I use the predict function, which seems to work normally (with a few deviations in probabilities, but that is another topic; I have seen it discussed elsewhere).
The problem is this: sometimes when I try to predict on a "small" CSV file after loading the model, the predictions seem to be broken, returning only the zero label, even for rows that were categorized correctly before.
For example, I load a dataset containing 20,000 rows and predict() works. I keep only the first 5 of these rows using pandas drop, and it still works. But if I save those 5 rows to a different CSV and reload it, it does not work. The same error happens if I just remove the other 19,995 rows by hand and save the file with only 5 remaining.
I would bet it is a file-size problem, but when I drop the rows from the dataframe through pandas, it works fine.
Also, the number 5 (of rows) is just for the sake of the example; the same happens if I delete any large portion of the dataset.
I first ran into this problem after trying to verify some completely new logs by hand, which are classified correctly when thrown into the big CSV file, but not when placed in a new file on their own.
Here is my load and predict code:
##IMPORTS
import os
import pandas as pd
from pandas.compat import StringIO
from datetime import datetime
from langid.langid import LanguageIdentifier, model
import langid
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.externals import joblib
from ggplot import ggplot, aes, geom_line
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import average_precision_score
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from collections import defaultdict
import pickle
df = pd.read_csv('big_test.csv')
df3 = pd.read_csv('small_test.csv')
#This one is necessary for the loaded_model
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        else:
            return x[self.column_list].to_dict(orient='records')
loaded_model = joblib.load('finalized_model.sav')
result = loaded_model.predict(df)
print(result)
df2=df[:5]
result2 = loaded_model.predict(df2)
print(result2)
result3 = loaded_model.predict(df3)
print(result3)
The results I get are these:
[1 0 1 ... 0 0 0]
[1 0 1 0 1]
[0 0 0 0 0]
I can provide any further code, even from training, or my dataset if necessary.
EDIT: I use a pipeline for my data. I tried to reproduce the error by fitting xgb on the iris data and I could not. Maybe there is something wrong with my pipeline? The code is below:
df = pd.read_csv('big_test.csv')
# df.info()
# Split Dataset
attributes = ['uri','code','r_size','DT_sec','Method','http_version','PenTool','has_referer', 'Lang','LangProb','GibberFlag' ]
x_train, x_test, y_train, y_test = train_test_split(df[attributes], df['Scan'], test_size=0.2,
                                                    stratify=df['Scan'], random_state=0)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.2,
                                                  stratify=y_train, random_state=0)
# print('Train:', len(y_train), 'Dev:', len(y_dev), 'Test:', len(y_test))
# set up graph function
def plot_precision_recall_curve(y_true, y_pred_scores):
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_scores)
    return ggplot(aes(x='recall', y='precision'),
                  data=pd.DataFrame({"precision": precision, "recall": recall})) + geom_line()
# XGBClassifier
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column_list):
        self.column_list = column_list

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        if len(self.column_list) == 1:
            return x[self.column_list[0]].values
        else:
            return x[self.column_list].to_dict(orient='records')
count_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2), min_df=10)
dict_vectorizer = DictVectorizer()
xgb = XGBClassifier(seed=0)
pipeline = Pipeline([
    ("feature_union", FeatureUnion([
        ('text_features', Pipeline([
            ('selector', ColumnSelector(['uri'])),
            ('count_vectorizer', count_vectorizer)
        ])),
        ('categorical_features', Pipeline([
            ('selector', ColumnSelector(['code','r_size','DT_sec','Method','http_version','PenTool','has_referer', 'Lang','LangProb','GibberFlag' ])),
            ('dict_vectorizer', dict_vectorizer)
        ]))
    ])),
    ('xgb', xgb)
])
pipeline.fit(x_train, y_train)
filename = 'finalized_model.sav'
joblib.dump(pipeline, filename)
That's due to different dtypes in the big and small files.
When you do:
df = pd.read_csv('big_test.csv')
The dtypes are these:
print(df.dtypes)
# Output
uri object
code object # <== Observe this
r_size object # <== Observe this
Scan int64
...
...
...
Now when you do:
df3 = pd.read_csv('small_test.csv')
the dtypes are changed:
print(df3.dtypes)
# Output
uri object
code int64 # <== Now this has changed
r_size int64 # <== Now this has changed
Scan int64
...
...
You see, pandas will try to determine the dtypes of the columns by itself. When you load big_test.csv, there are some values in the code and r_size columns which are of string type; because of this, the whole column dtype is changed to object (string), which does not happen with small_test.csv.
Due to this change, the DictVectorizer encodes the data in a different way than before, the features are changed, and hence the results are also changed.
If you do this:
df3[['code', 'r_size']] = df3[['code', 'r_size']].astype(str)
and then call predict(), the results are the same again.
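Alternatively, the same idea can be applied at load time by forcing the ambiguous columns to a consistent dtype directly in read_csv, so both dataframes go through the pipeline identically (this is a sketch; the column list is taken from the dtype comparison above, and you would extend it to any other column whose dtype differs between the two files):

import pandas as pd

# Read the small file with the same string dtypes that the big file ends up with.
df3 = pd.read_csv('small_test.csv', dtype={'code': str, 'r_size': str})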