Reorder dataset to find Sum of LSE between two data sets - optimization

Assume I have two datasets, each containing 5000 samples, and each sample has three dimensions. I am looking for a way to "reorder" the samples in one (or possibly both) datasets such that the sum of squared errors (where the error is the Euclidean distance between paired points) between the two datasets is minimized.
Please see this image for clarification:
https://www.mathworks.com/matlabcentral/answers/uploaded_files/1017640/image.png
Thanks.
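This framing is not part of the original question, but pairing every sample in one set with a sample in the other so that the total squared Euclidean distance is minimal is a linear assignment problem, which SciPy can solve directly. A minimal sketch with made-up data:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

# Made-up stand-ins for the two 5000 x 3 datasets described above
rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 3))
B = rng.normal(size=(5000, 3))

# cost[i, j] = squared Euclidean distance between A[i] and B[j]
cost = cdist(A, B, metric="sqeuclidean")

# Hungarian-style solver: picks one j for every i so the total cost is minimal
row_ind, col_ind = linear_sum_assignment(cost)

B_reordered = B[col_ind]                    # B reordered to pair with A row by row
total_error = cost[row_ind, col_ind].sum()
print(total_error)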

Related

How to plot timeseries with many NaNs?

Originally I had a dataframe containing power consumption of some devices like this:
and I wanted to plot power consumption vs time for different devices, one plot per one of 6 possible dates. After grouping by date I got plots like this one (for each group = date):
Then I tried to create similar plot, but switch date and device roles so that it is grouped by device and colored by date. In order to do it I prepared this dataframe:
It is similar to the previous one, but has many NaN values due to differing measurement times. I thought it wouldn't be a problem, but after grouping by device the subplots look like this one (ex is just the name of the sub-dataframe extracted in a loop going through groups = devices):
This is the ex dataframe (the mean lag between observations is around 20 seconds):
Question: What should I do to make the plots grouped by device look like the ones grouped by date? (I'd like to use the ex dataframe but handle the NaNs somehow.)
I found a solution in an answer to a similar question: ex.interpolate(method='linear').plot(). This line fills the gaps between data points via interpolation before plotting. This is the result:
Another thing that can help is using .plot(marker='o', ms=3), which won't fill the gaps between points, but at least makes the points visible (previously some points, mainly the peaks in energy consumption, were too small at the scale of the whole plot). This is the result:
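A self-contained sketch of both options, using a made-up stand-in for the ex dataframe (datetime index, one column per date, NaNs where a device was not measured at that time):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for ex: datetime index, one column per date,
# NaNs where the device was not measured at that timestamp
idx = pd.date_range("2021-01-01 00:00", periods=200, freq="20s")
rng = np.random.default_rng(0)
ex = pd.DataFrame({"2021-01-01": rng.random(200),
                   "2021-01-02": rng.random(200)}, index=idx)
ex[ex < 0.5] = np.nan   # sprinkle NaNs to mimic differing measurement times

# Option 1: interpolate across the gaps before plotting
ex.interpolate(method="linear").plot()

# Option 2: keep the gaps but make isolated points visible with markers
ex.plot(marker="o", ms=3)
plt.show()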

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot it and just look at it, but I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot. You can also draw a 2D graph if you want to focus on 2 features. This shows how much the labelled outliers vary.
Next, there's a metric called the Z-score, which basically says how many standard deviations a point lies from the mean. The Z-score is signed, meaning that if a point is below the mean, the Z-score will be negative. This can be used to analyze all the features of the dataset. You can find the threshold value in your labelled dataset above which all the points are labelled outliers.
Lastly, we can find the interquartile range and filter based on it. The IQR is simply the difference between the 75th percentile and the 25th percentile. You can use it in the same way as the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
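A minimal sketch of the Z-score and IQR filters described above, using made-up data (the thresholds 3 and 1.5 are common defaults, not values from the question):

import numpy as np
import pandas as pd

# Made-up high-dimensional data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(5)])

# Z-score per feature: how many standard deviations each value lies from the mean
z = (df - df.mean()) / df.std()
z_outliers = (z.abs() > 3).any(axis=1)          # rows with any |z| above the threshold

# IQR per feature: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
iqr_outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any(axis=1)

# Compare these flags against the existing outlier labels to study their properties
print(z_outliers.sum(), iqr_outliers.sum())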
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on density and leaves points in low-density regions unassigned as noise, which makes it easy to study where the labelled outliers fall relative to the clusters of good data.
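A minimal scikit-learn sketch of that idea; the eps and min_samples values are placeholders that would need tuning for real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # made-up high-dimensional data
is_outlier = rng.random(1000) < 0.05            # made-up outlier labels

# DBSCAN labels dense clusters 0, 1, 2, ... and marks low-density points as -1 (noise)
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)

# Where do the labelled outliers fall: inside clusters, or in the noise set?
print(np.bincount(labels[is_outlier] + 1))      # index 0 counts noise points
print(np.bincount(labels[~is_outlier] + 1))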

How to refine the Graphcut cmex code based on a specific energy function?

I downloaded the following graph-cut code:
https://github.com/shaibagon/GCMex
I compiled the mex files and ran it on the pre-defined image in the code (which is an RGB image).
I want to improve the image segmentation results.
I have a probability map of the image, whose dimensions are (width, height, 5): five probability distributions over the image are stacked together, each relating to one of the classes.
My problem is which parts of the code I should modify to use this probability map.
I want to define the data and smoothness terms based on my application.
My questions are:
1) Has anyone adapted the code to use a different energy function? (I want to change the unary and pairwise formulations.)
2) I have a stack of images forming a 3D volume. I want to define a 6-neighborhood system: 4 neighbors in the current slice and the other two from the two adjacent slices. In which function and part of the code can I make these changes?
Thanks
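This is not GCMex-specific (it says nothing about which parts of that code to modify), but a common way to build the two terms asked about, assuming a negative-log-likelihood data term and a uniform 6-connected smoothness structure, is sketched below in NumPy with made-up dimensions:

import numpy as np
from scipy.sparse import coo_matrix

# Made-up dimensions: a stack of d slices, each h x w, with k = 5 classes
w, h, d, k = 64, 48, 3, 5
rng = np.random.default_rng(0)
prob = rng.dirichlet(np.ones(k), size=(d, h, w))   # per-voxel class probabilities, sum to 1

# Data (unary) term: negative log-likelihood, one row per voxel, one column per class
unary = -np.log(prob + 1e-10).reshape(-1, k)

# 6-neighborhood: 4 in-plane neighbours plus 1 in each of the two adjacent slices
idx = np.arange(d * h * w).reshape(d, h, w)
pairs = np.concatenate([
    np.stack([idx[:, :, :-1].ravel(), idx[:, :, 1:].ravel()], axis=1),   # left-right
    np.stack([idx[:, :-1, :].ravel(), idx[:, 1:, :].ravel()], axis=1),   # up-down
    np.stack([idx[:-1, :, :].ravel(), idx[1:, :, :].ravel()], axis=1),   # between adjacent slices
])

# Smoothness (pairwise) structure: each undirected edge stored once with weight 1
n = d * h * w
pairwise = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
print(unary.shape, pairwise.nnz)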

Is there any available DM script that can compare two images and know the difference

Is there any available DM script that can compare two images and know the difference?
I mean the script can compare two or more images and determine their similarity; for example, if 95% of the area of one image is the same as in another image, then the similarity of these two images is 95%.
The script should also be able to compare the brightness and contrast distributions of the images.
Thanks,
This question is a bit ill-defined, as "similarity" between images depends a lot on what you want.
If by "95% of the area is the same" you mean that 95% of the pixels are of identical value in images A & B, you can simply create a mask and sum() it to count the number of pixels, i.e.:
sum( abs(A-B)==0 ? 1 : 0 )
However, this will utterly fail if the images A & B are shifted with respect to each other even by a single pixel. It will also fail if A & B are of the same contrast but different absolute value.
I guess the intended question was to find the similarity of two images in a fuzzy way.
For that, one way is to do cross-correlation. DM has this function, like this:
image xcorr= CrossCorrelate(ref,img)
From xcorr, the peak position gives the x- and y-shift between the two images, and the peak intensity gives the "similarity" of the two.
If you know there is no shift between the two, you can just do the multiplication and sum:
number similarity1=sum(img1*img2)
Another way to measure similarity is to calculate the Euclidean distance between the two:
number similarity2=sqrt(sum((img1-img2)**2))
"similarity2" calculates the "pure" similarity. "similarity1" is the pure similarity plus the mean intensity of img1 and img2. The difference is essentially this,
(a-b)**2=a**2+b**2-2*a*b.
The left term is "similarity2", the last term on the right is the "crosscorrelation" or "similarity1".
I think "similarity1" is called cross-correlation, "similarity2" is called correlation coefficient.
In example comparing two diffraction patterns, if you want to compute the degree of similarity, use "similarity2". If you want to compute the degree of similarity plus a certain character of the diffraction pattern, use "similarity1".
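For reference outside DM scripting, the same measures as a NumPy sketch (the image values here are made up):

import numpy as np

rng = np.random.default_rng(0)
img1 = rng.random((256, 256))
img2 = img1.copy()
mask = rng.random((256, 256)) < 0.05
img2[mask] += 0.1                      # make ~5% of the pixels differ

# Fraction of pixels with identical values (the sum(abs(A-B)==0 ? 1 : 0) idea)
identical_fraction = np.mean(img1 == img2)      # ~0.95 here

# "similarity1": sum of products, i.e. the cross-correlation at zero shift
similarity1 = np.sum(img1 * img2)

# "similarity2": Euclidean distance (smaller means more similar)
similarity2 = np.sqrt(np.sum((img1 - img2) ** 2))

print(identical_fraction, similarity1, similarity2)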

Kinect normalize depth

I have some Kinect data of somebody standing (reasonably) still and performing sets of punches. I am given it in the format of an x, y, z co-ordinate for each joint, of which there are 20, so I have 60 data points per frame.
I'm trying to perform a classification task on the punches; however, I'm having some problems normalising my data. As you can see from the graph, there are sections with much higher 'amplitude' than the others; my belief is that this is due to how close the person was to the Kinect sensor when the readings were taken. (The graph is actually the first principal coefficient obtained by PCA for each frame; multiple sequences of the same punch are strung together in this graph.)
Looking back at the data files, it looks like those that are 'out' have a z co-ordinate (depth from sensor) of ~2.7, whereas the others tend to hover around 3.3-3.6.
How can I perform a normalization with the depth values to make the sequences closer to each other? I've already tried differentiating to get the velocity; although it helps to normalise, the output ends up too similar and becomes very hard to classify.
Edit: I should mention I am already using a normalization method by subtracting the hip position from each joint in an attempt to make the co-ordinates relative.
The Kinect can output some strange values when the tracked person is standing near the edges of its field of view. I would either completely ignore these data or replace them with an average of the previous 2 and next 2 values.
For example:
1,2,1,12,1,2,3
Replace 12 with (2 + 1 + 1 + 2) / 4 = 1.5
You can basically do this over the whole array of values; this way you get a smoother, more normalised line/graph.
You can also use the clippedEdges value to determine if one or more joints are outside the view.
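A minimal sketch of that replacement rule in Python (the spike threshold is a made-up assumption; real data would need its own criterion):

import numpy as np

def smooth_spikes(values, threshold=5.0):
    # Replace values that deviate strongly from their neighbours with the
    # average of the previous 2 and next 2 values, as in the example above.
    v = np.asarray(values, dtype=float).copy()
    for i in range(2, len(v) - 2):
        neighbours = np.concatenate([v[i - 2:i], v[i + 1:i + 3]])
        if abs(v[i] - neighbours.mean()) > threshold:
            v[i] = neighbours.mean()
    return v

print(smooth_spikes([1, 2, 1, 12, 1, 2, 3]))   # the 12 becomes (2 + 1 + 1 + 2) / 4 = 1.5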