Multilabel classification - pandas

I'm working on a multilabel classification problem using Keras, scikit-learn, etc.
My dataframe contains 4000 microscopic oil samples, with images and 13 different labels indicating which problems were found in those samples.
Currently I convert all images and labels to NumPy arrays.
Example of one labeled image:
[ 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0 ]
In this label, a 1 at a given position means the sample has a specific problem, such as particles in the oil. As you can see, a sample can have more than one problem.
The problem is that my dataset is imbalanced and I need to apply the class weight method. But first, looking at the labels, I think I need to use something like [ 0, 1, 0, 0, ... ], not like the example I gave above.
Note that I can run my neural network code without class weights and it works well, but I can't train the full model on such imbalanced data.
I already tried working with lists, unsuccessfully!
Of course, I also have problems with shapes: for example, the images have shape (1000, 100, 200, 3) and the labels have shape (1000, 13). That's why I can't apply class weights either.
There are a few problems I'm trying to fix.
I'll post my code, because I'm stuck and don't know what to do.
class_weight_list = compute_class_weight('balanced',np.unique(Y_train), Y_train)
class_weight = dict(zip(np.unique(Y_train), class_weight_list))
Y_train = to_categorical(Y_train,num_classes=len(np.unique(Y_train)))
main.py
dataset.py
models.py
What is the best strategy to work with labels in this case?
I'd appreciate it if someone could help me.
Thanks in advance!!
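One possible direction, offered as a sketch and not as a tested answer to the question: compute_class_weight expects a 1-D array of class labels, so it does not fit multi-hot targets of shape (n_samples, 13) directly. An alternative is to treat each label column as its own binary problem, compute "balanced"-style weights per column, and use them in a weighted binary cross-entropy. The helper names per_label_weights and weighted_bce below are hypothetical, the toy Y_train has 3 labels instead of 13, and the loss is written in plain NumPy only to show the arithmetic.

```python
import numpy as np

def per_label_weights(Y):
    """Return (w_neg, w_pos), each of shape (n_labels,), for multi-hot targets Y."""
    Y = np.asarray(Y, dtype=np.float64)
    n = Y.shape[0]
    pos = Y.sum(axis=0)          # number of samples where each label is 1
    neg = n - pos                # number of samples where each label is 0
    # Same heuristic as scikit-learn's 'balanced' mode, n / (2 * count),
    # applied to each label column separately.
    # Counts are clipped to 1 to avoid dividing by zero for labels
    # that never (or always) occur.
    w_pos = n / (2.0 * np.maximum(pos, 1.0))
    w_neg = n / (2.0 * np.maximum(neg, 1.0))
    return w_neg, w_pos

def weighted_bce(y_true, y_prob, w_neg, w_pos, eps=1e-7):
    """Mean binary cross-entropy with a separate weight per label and per class."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    weights = y_true * w_pos + (1.0 - y_true) * w_neg
    loss = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(weights * loss))

# Toy multi-hot labels: 4 samples, 3 labels (assumed data, for illustration only).
Y_train = np.array([[0, 1, 0],
                    [0, 1, 1],
                    [0, 0, 0],
                    [1, 1, 0]], dtype=np.float64)
w_neg, w_pos = per_label_weights(Y_train)
print("weights for 0:", w_neg)
print("weights for 1:", w_pos)

# A constant prediction of 0.5 for every label, just to exercise the loss.
print("weighted BCE:", weighted_bce(Y_train, np.full_like(Y_train, 0.5), w_neg, w_pos))
```

In Keras the same per-label weights would have to be wired into a custom loss function; how best to do that depends on the model code, which is not shown here.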

Related

Using GridSearchCV of xgboost with DMatrix

I ran into some problems while practicing with xgboost.
As far as I know, "DMatrix" is a special internal data structure that makes the model run faster.
Here's the problem:
To tune the model, (I guess) GridSearchCV or RandomizedSearchCV are worth considering.
With the code below:
params = {
'min_child_weight': [1, 5, 10],
'gamma': [0.5, 1, 1.5, 2, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'max_depth': [3, 4, 5]
}
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring='roc_auc', n_jobs=4, cv=skf.split(X,Y), verbose=3, random_state=1001 )
I can also do the cross validation by passing cv. That was great.
However, it really takes time (almost 40 mins with big data and colab gpu) and I really want to improve it.
After I transform my train data to DMatrix:
xgbtrain = xgb.DMatrix(train_x, train_y)
I don't know what to do next, because .fit requires X and y.
How can I do that? Or is there any other way to make it faster?
Thanks
This question is pretty old, so I suspect you may have already found an answer. The different XGBoost options can be tricky to navigate when incorporating CV or parameter tuning.
Instead of using xgb.fit() you can use xgb.train() to utilize the DMatrix object. Additionally, XGB has xgb.cv() for performing cross-validation. I myself am hoping to find an alternative to GridSearchCV, but I don't think there is one. The best method may be to create a loop of xgb.cv() calls to compare evaluation results and identify the best-performing parameters.
XGB has really helpful documentation; you may want to check out the XGB Python Intro: Training and Cross Validation Demo.
Try Optuna for hyperparameter tuning of XGBoost, which can be much faster, and use the GPU (tree_method = gpu_hist; in XGBoost 2.0 and later this is spelled tree_method = hist together with device = cuda). Kaggle offers a free GPU quota every week.

Add internal boundary or crack in PyGmsh / Gmsh

I am trying to generate a finite element mesh using PyGmsh, using the following code:
import pygmsh
geom = pygmsh.opencascade.Geometry(
characteristic_length_min=0.1,
characteristic_length_max=0.1,
)
rectangle = geom.add_rectangle([-1.0, -1.0, 0.0], 2.0, 2.0)
disk1 = geom.add_disk([-1.2, 0.0, 0.0], 0.5)
disk2 = geom.add_disk([+1.2, 0.0, 0.0], 0.5)
disk3 = geom.add_disk([0.0, -0.9, 0.0], 0.5)
disk4 = geom.add_disk([0.0, +0.9, 0.0], 0.5)
union = geom.boolean_union([rectangle, disk1, disk2])
diff = geom.boolean_difference([union], [disk3, disk4])
mesh = pygmsh.generate_mesh(geom, dim=2)
I can generate the following mesh:
However, I would like to add a crack to the mesh, something like:
The crack here is just an example, it would need to be defined before the meshing process.
I've tried creating 2 points (geom.add_point()) and a line (geom.add_line()), and then doing a geom.boolean_difference() between the final geometry and the line/crack, but this just does not work.
Any help would be greatly appreciated.
EDIT
The purpose of this type of mesh generation is to simulate a physical crack in a body. In the meshing process, the crack can be modeled by the elemental connectivity of the mesh (i.e. the elements must have different nodes to create a crack face). Example, before applying any load, the crack is closed:
After applying the load, the crack opens since the element connectivity allows this:
You can achieve this by modeling a very narrow rectangle in that region; dimensions as small as 1e-10 work without trouble. I also modeled the crack tip as a very small circle so that the nodes collapse to a single point. It works quite well.
Also there is a plugin for this now. It automatically separates the nodes at the specified crack line/surface.
This can be achieved using the "embed" functionality. Minimal working example below (in Python).
import gmsh
gmsh.initialize()
gmsh.model.add("TestModel")
ms = 1 # mesh size at point
# square (plate) points
gmsh.model.geo.addPoint(0, 0, 0, ms, 1)
gmsh.model.geo.addPoint(8, 0, 0, ms, 2)
gmsh.model.geo.addPoint(8, 8, 0, ms, 3)
gmsh.model.geo.addPoint(0, 8, 0, ms, 4)
# square (plate) lines
gmsh.model.geo.addLine(1, 2, 1)
gmsh.model.geo.addLine(2, 3, 2)
gmsh.model.geo.addLine(3, 4, 3)
gmsh.model.geo.addLine(4, 1, 4)
# square (plate) curve loop
gmsh.model.geo.addCurveLoop([1, 2, 3, 4], 1)
# square (plate) surface
s = gmsh.model.geo.addPlaneSurface([1])
# "crack" geometry
a = gmsh.model.geo.addPoint(2, 2, 0, ms)
b = gmsh.model.geo.addPoint(6, 4, 0, ms/100)
l = gmsh.model.geo.addLine(a, b)
# synchronize
gmsh.model.geo.synchronize()
# embed "crack" on plate
gmsh.model.mesh.embed(1, [l], 2, s)
# generate mesh
gmsh.model.mesh.generate(2)
gmsh.fltk.run()
gmsh.finalize()

How to generate a mesh file with extracting nodes and elements

I need to generate a mesh file from which I can extract the following information:
the X, Y and Z coordinates of each node, plus the node tags
the list of all elements, plus the element tags
I would like to give each edge of my domain (its elements and nodes) an index, so that I can use it in my code to manage BC, IC and parameters.
Is there any existing code that would help me do that?
I tried gmsh, but I can't really understand the syntax of the .msh file, which looks different from the explanation given in "9.1 MSH file format".
I've created meshio for this purpose. Here's how to write a file:
import meshio
import numpy
points = numpy.array([
[0.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
])
cells = {
"triangle": numpy.array([
[0, 1, 2]
])
}
meshio.write_points_cells(
"foo.vtk",
points,
cells,
# Optionally provide extra data on points, cells, etc.
# point_data=point_data,
# cell_data=cell_data,
# field_data=field_data
)
Many different formats are supported.

Matplotlib graphic's line smoothing

bullets' trajectory comparison
I'm a new Python user. I'm using this powerful language for scientific research and data analysis.
I'm writing my thesis in physics, and I'm trying to describe and analyze the external ballistics of bullet flight.
I'm using matplotlib to draw plots of the bullet's parabolic path and the related crossing points. Is there a way to smooth the plotted lines that follow the real experimental data, so that the plot isn't made up of many straight segments?
Thanks a lot to all of you!
Francesco
All right Francesco, thanks for uploading the image. Now, let's have some fun with coding.
First, I suggest using the NumPy function np.polyfit() to fit a polynomial curve of a given degree to a set of values. Be careful with the degree you choose, as the results can change widely. For more information, please take a look at this documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.polyfit.html
Then, to smooth the curve, increase the number of points used to draw the function with np.linspace() and evaluate the fitted polynomial on this new set using np.poly1d() (it calculates the y coordinates from the coefficients returned by polyfit).
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove the next line when running as a plain script
%matplotlib inline
x = [0, 50, 100, 150, 200, 250]
y = [-1, 0.8, 1.9, 1.6, 0, -3]
z = np.polyfit(x, y, 2)
p = np.poly1d(z)
xp = np.linspace(-2, 255)
plt.plot(x, y, '.', xp, p(xp), '-')
plt.show()
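To see how much the choice of degree matters, one option is to compare the residual sum of squares for a few degrees on the same data. This is only an illustrative sketch using the six sample points from the example above; the set of degrees tried is an arbitrary assumption, and a lower residual does not by itself mean a better physical model.

```python
import numpy as np

x = np.array([0, 50, 100, 150, 200, 250], dtype=float)
y = np.array([-1, 0.8, 1.9, 1.6, 0, -3], dtype=float)

rss = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial coefficients
    fitted = np.polyval(coeffs, x)         # fitted values at the measured x
    rss[degree] = float(np.sum((y - fitted) ** 2))
    print(f"degree {degree}: residual sum of squares = {rss[degree]:.4f}")
```

A straight line (degree 1) cannot follow the arc at all, while going beyond degree 2 only chases the scatter in the measurements, which is why a parabola is a natural choice for this kind of trajectory.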

Why does MinMaxScaler add lines to image?

I want to normalize the pixel values of an image to the range [0, 1] for each channel (R, G, B).
Minimal Example
#!/usr/bin/env python
import numpy as np
import scipy
from sklearn import preprocessing
original = scipy.misc.imread('Crocodylus-johnsoni-3.jpg')
scipy.misc.imshow(original)
transformed = np.zeros(original.shape, dtype=np.float64)
scaler = preprocessing.MinMaxScaler()
for channel in range(3):
transformed[:, :, channel] = scaler.fit_transform(original[:, :, channel])
scipy.misc.imsave("transformed.jpg", transformed)
What happens
Taking https://commons.wikimedia.org/wiki/File:Crocodylus-johnsoni-3.jpg,
I get the following "normalized" result:
As you can see there are lines from top to bottom at the right side. What happened there? It seems to me that the normalization went wrong. If so: How do I fix it?
In scikit-learn, a two-dimensional array with shape (m, n) is usually interpreted as a collection of m samples, with each sample having n features.
MinMaxScaler.fit_transform() transforms each feature, so each column of your array is transformed independently of the others. That results in the vertical "stripes" in the image.
It looks like you intended to scale each color channel independently. To do that using MinMaxScaler, reshape the input so that each channel becomes one column. That is, if the original image has shape (m, n, 3), reshape it to (m*n, 3) before passing it to the fit_transform() method, and then restore the shape of the result to create the transformed array.
For example,
ascolumns = original.reshape(-1, 3)
t = scaler.fit_transform(ascolumns)
transformed = t.reshape(original.shape)
With this, transformed looks like this:
The image looks exactly like the original, because it turns out that in the array original, the minimum and maximum are 0 and 255, respectively, in each channel:
In [41]: original.min(axis=(0, 1))
Out[41]: array([0, 0, 0], dtype=uint8)
In [42]: original.max(axis=(0, 1))
Out[42]: array([255, 255, 255], dtype=uint8)
So all fit_transform does in this case is transform all the input values to the floating point range [0.0, 1.0] uniformly. If the minimum or maximum was different in one of the channels, the transformed image would look different.
By the way, it is not difficult to perform the transform using pure numpy. (I'm using Python 3, so in the following, the division automatically casts the result to floating point. If you are using Python 2, you'll need to convert one of the arguments to floating point, or use from __future__ import division.)
In [58]: omin = original.min(axis=(0, 1), keepdims=True)
In [59]: omax = original.max(axis=(0, 1), keepdims=True)
In [60]: xformed = (original - omin)/(omax - omin)
In [61]: np.allclose(xformed, transformed)
Out[61]: True
(One potential problem with that method is that it breaks down if one of the channels is constant, because then one of the values in omax - omin will be 0. NumPy does not raise an exception in that case; it emits a runtime warning and fills that channel with NaN.)
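The constant-channel caveat can be handled by replacing a zero range with 1 before dividing, so that a constant channel simply maps to 0. The function name minmax_per_channel and the toy 2x2 image below are hypothetical and only there for illustration; mapping a constant channel to 0 is a choice made in this sketch.

```python
import numpy as np

def minmax_per_channel(image):
    """Scale each channel of an (m, n, channels) image to [0, 1] independently."""
    image = np.asarray(image, dtype=np.float64)
    omin = image.min(axis=(0, 1), keepdims=True)
    omax = image.max(axis=(0, 1), keepdims=True)
    span = omax - omin
    # Where a channel is constant the span is 0; divide by 1 instead so the
    # result is 0 for that channel and no NaN appears.
    safe_span = np.where(span == 0, 1.0, span)
    return (image - omin) / safe_span

# Toy image: channel 1 is constant on purpose.
original = np.zeros((2, 2, 3), dtype=np.uint8)
original[..., 0] = [[0, 128], [255, 64]]
original[..., 1] = 7
original[..., 2] = [[10, 20], [30, 40]]

xformed = minmax_per_channel(original)
print(xformed[..., 0])
print(xformed[..., 1])
print(xformed[..., 2])
```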