What does executorRunTime consist of in Spark? - pandas

Currently working on Spark, I collected some performance metrics through a custom Spark listener API for analysis purposes. I tried to make a stacked bar plot that shows the percentage of the time the executor spends executing the task, shuffling, or in garbage collection pauses, for three different machine learning algorithms.
Here is a screenshot of what I found:
What caught my attention right after the plot appeared is that the rates are wrong. You can see that the stacked total goes beyond 1 for the kmeans algorithm, and stays below 0.8 for the perceptron.
Here is how I computed the rates:
execution['cpuRate'] = execution['executorCpuTime'] / execution['executorRunTime']
execution['serRate'] = execution['resultSerializationTime'] / execution['executorRunTime']
execution['gcRate'] = execution['jvmGCTime'] / execution['executorRunTime']
execution['shuffleFetchRate'] = execution['shuffleFetchWaitTime'] / execution['executorRunTime']
execution['shuffleWriteRate'] = execution['shuffleWriteTime'] / execution['executorRunTime']
execution = execution[['cpuRate', 'serRate', 'gcRate', 'shuffleFetchRate', 'shuffleWriteRate']]
execution.plot.bar(stacked=True)
I use the pandas library, and execution is the DataFrame containing the averaged metrics.
Of course, my assumption was that executorRunTime is the sum of the other metrics, but that turns out to be false.
What do those times mean, and how are they related? In other words: what does executorRunTime consist of, if not the other metrics listed above?
Thanks

According to TaskMetrics.scala:
/**
 * Time the executor spends actually running the task (including fetching shuffle data).
 */
def executorRunTime: Long = _executorRunTime.sum
Measured in milliseconds.
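For the stacked-bar use case, the key point is that these metrics are neither disjoint nor all reported in the same unit, so they are not expected to add up to executorRunTime: GC pauses, for instance, happen while the task is running. Below is a rough sketch of a unit-consistent normalization, assuming (as in recent Spark versions; check against the version in use) that executorCpuTime and shuffleWriteTime are reported in nanoseconds while the remaining metrics are in milliseconds:
# Sketch only: the ns-vs-ms unit assumptions come from the Spark TaskMetrics
# documentation for recent versions and should be verified for your version.
# Even with consistent units, the intervals overlap, so the bars need not sum to 1.
NS_PER_MS = 1e6
run_ms = execution['executorRunTime']
execution['cpuRate'] = (execution['executorCpuTime'] / NS_PER_MS) / run_ms
execution['gcRate'] = execution['jvmGCTime'] / run_ms
execution['serRate'] = execution['resultSerializationTime'] / run_ms
execution['shuffleFetchRate'] = execution['shuffleFetchWaitTime'] / run_ms
execution['shuffleWriteRate'] = (execution['shuffleWriteTime'] / NS_PER_MS) / run_ms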

Related

Importance of seed and num_runs in the KMeans clustering

New to ML, so I am trying to make sense of the following code. Specifically:
In for run in np.arange(1, num_runs+1), what is the need for this loop? Why didn't the author use the setMaxIter method of KMeans?
What is the importance of seeding in clustering?
Why did the author choose to set the seed explicitly rather than using the default one?
import time

import numpy as np
import pandas as pd
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

def optimal_k(df_in, index_col, k_min, k_max, num_runs):
    '''
    Determine the optimal number of clusters by using Silhouette Score Analysis.
    :param df_in: the input dataframe
    :param index_col: the name of the index column
    :param k_min: the minimum number of clusters
    :param k_max: the maximum number of clusters
    :param num_runs: the number of runs for each fixed number of clusters
    :return k: optimal number of clusters
    :return silh_lst: Silhouette score
    :return r_table: the running results table
    :author: Wenqiang Feng
    :email: von198#gmail.com.com
    '''
    start = time.time()
    silh_lst = []
    k_lst = np.arange(k_min, k_max + 1)
    # Keep the index column as a results table to collect per-run cluster labels.
    r_table = df_in.select(index_col).toPandas()
    r_table = r_table.set_index(index_col)
    centers = pd.DataFrame()
    for k in k_lst:
        silh_val = []
        for run in np.arange(1, num_runs + 1):
            # Train a k-means model with a fresh random seed for this run.
            kmeans = KMeans()\
                .setK(k)\
                .setSeed(int(np.random.randint(100, size=1)))
            model = kmeans.fit(df_in)
            # Make predictions
            predictions = model.transform(df_in)
            r_table['cluster_{k}_{run}'.format(k=k, run=run)] = predictions.select('prediction').toPandas()
            # Evaluate clustering by computing the Silhouette score
            evaluator = ClusteringEvaluator()
            silhouette = evaluator.evaluate(predictions)
            silh_val.append(silhouette)
        silh_array = np.asanyarray(silh_val)
        silh_lst.append(silh_array.mean())
    elapsed = time.time() - start
    silhouette = pd.DataFrame(list(zip(k_lst, silh_lst)), columns=['k', 'silhouette'])
    print('+------------------------------------------------------------+')
    print("| The finding optimal k phase took %8.0f s. |" % (elapsed))
    print('+------------------------------------------------------------+')
    return k_lst[np.argmax(silh_lst, axis=0)], silhouette, r_table
I'll try to answer your questions based on my reading of the material.
The reason for this loop is that the author sets a new seed for every loop using int(np.random.randint(100, size=1)). If the feature variables exhibit patterns that automatically group them into visible clusters, then the starting seed should not have an impact on the final cluster memberships. However, if the data is evenly distributed, then we might end up with different cluster members based on the initial random variable. I believe the author is changing these seeds for each run to test different initial distributions. Using setMaxIter would set maximum iterations for the same seed (initial distribution).
Similar to the above: the seed defines the initial placement of the k points around which you're going to cluster. Depending on your underlying data distribution, the clustering can converge to different final configurations.
The author has control over the seed, as discussed in points 1 and 2. You can see for which seeds your code converges to the desired clusters, and for which it might not. Also, if you iterate over, say, 100 different seeds and your code still converges to the same final clusters, you can drop the explicit seed, as it likely doesn't matter. Another use is from a more software-engineering perspective: setting an explicit seed is very important if you want to, for example, write tests for your code and don't want them to fail randomly.
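As an illustration of that last point, here is a minimal sketch of the kind of reproducibility check a test might make, reusing df_in (the feature DataFrame from the question) and an arbitrary seed:
from pyspark.ml.clustering import KMeans

def cluster_labels(df_in, k, seed):
    # Fit k-means with an explicit seed and return the predicted labels.
    model = KMeans().setK(k).setSeed(seed).fit(df_in)
    return model.transform(df_in).select('prediction').toPandas()

# Same data, same k, same seed -> same initialization -> identical labels.
assert cluster_labels(df_in, k=3, seed=42).equals(cluster_labels(df_in, k=3, seed=42))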

For a buckling analysis, must all forces be multiplied by the resulting eigenvalue, or only the compressive load?

I am trying to do a linear buckling analysis (sol 105) with Nastran on a cylindrical shell structure. My understanding is that the compressive load that I apply to the structure must be multiplied by the resulting eigenvalue to get the buckling load. This gives me results that I expect.
However, now I apply a single perturbation load (SPL), a small transverse force acting midway along the cylinder on a single grid point. My understanding is that the magnitude of the SPL stays as it is (unlike the compressive load, which I multiply by the eigenvalue to obtain the buckling load). The results I obtain are not what I expect: according to the theory on this topic, the buckling load should not drop as much as it does when the SPL increases.
I am wondering if anyone knows what I am doing wrong. I feel like my mistake is probably very easy, but I haven't been able to solve it yet. Here is some more information on my implementation:
Axial compressive force spread over top grid points of cylinder.
Both SPL (the transverse point load) and axial loads are added to the static analysis subcase. Then the buckling subcase uses the static subcase for its analysis. This is how I understand it should be done.
boundary conditions:
SPC1 restraining 123 (xyz) directions at bottom grid points.
SPC1 restraining 12 (xy) directions at top grid points.
I'm not a Nastran user but I've done a lot of buckling analysis with Cast3M software.
The linear buckling analysis does not need perturbation loading, but only your main axial loading (F^0).
To recap:
1. Solve the linear problem for the axial loading: solve for u^0 in [K] * u^0 = F^0, then get the linear stresses from Hooke's law: \sigma^0 = D * B * u^0.
2. Solve the eigenvalue buckling problem: [K + \lambda * Kgeo(\sigma^0)] * X = 0.
Then, if you want to perform a non-linear (large displacement) post-buckling analysis, it is recommended to introduce a small perturbation which "excites" the buckling mode.
If you introduce the perturbation loading before the linear buckling analysis, maybe Nastran is adding it to F^0 and it is then logical that the result of buckling changes.
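In equation form, and just restating the scaling rule from the question, with \lambda_1 the lowest eigenvalue of the buckling problem above:
F_buckling = \lambda_1 * F^0
So whatever is folded into the static load F^0 is what gets scaled by the eigenvalue; if the SPL is added to that same static subcase, it is scaled along with the axial load instead of staying at its fixed magnitude, which would explain the unexpectedly strong drop in the computed buckling load.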
Hope this can help you.
There is a way to scale some loads and hold others constant. Create 2 Static Subcases with 2 (different) sets of loads:
Constant loads (that are not scaled, like a preload or internal pressure)
Those that will be scaled by the eigenvalue
Order does not matter
Use the Nastran STATSUB entry to define which subcase plays which role. It looks like this:
SUBCASE 100
LOAD = 1 $ Static pre-load
SUBCASE 200
LOAD = 2 $ Varying buckling load
$ -------------
SUBCASE 1000
STATSUB(PRELOAD) = 100
STATSUB(BUCKLING) = 200
METHOD = 10
The eigensolution is modified to include the influence of both the constant (preload) and the varying loads.

How to calculate a very large correlation matrix

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:], z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1, i2, r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my MacBook), it still took a very long time (at least a day), presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py

# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()

# z-normalize the data -- first compute means and standard deviations
xave = np.average(x, axis=1)
xstd = np.std(x, axis=1)

# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd

# transpose back
z = ztrans.T

# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
arr /= z.shape[1]   # normalize by the number of observations per row (60)

# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
    hf.create_dataset("correlation", data=arr)
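To then pull out only the coordinates and values above 0.95 (the stated goal), one option is to scan the in-memory result in row blocks rather than building a full-size boolean mask. A sketch, with an arbitrary block size and the same output format as the brute-force loop:
# Sketch: write only the strongly correlated pairs, scanning row blocks so that
# no full-size boolean mask of the 102000 x 102000 array is ever created.
block = 1000
with open('high_corr.txt', 'w') as out:
    for start in range(0, arr.shape[0], block):
        rows = arr[start:start + block]
        di, j = np.where(rows > 0.95)
        i = di + start
        keep = j < i                     # lower triangle only: skips the diagonal and duplicate pairs
        for ii, jj in zip(i[keep], j[keep]):
            out.write("%6d %6d %.3f\n" % (ii, jj, arr[ii, jj]))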
From my rough calculations, you want a correlation matrix that has 100,000^2 = 10^10 elements. That takes up around 40 GB of memory, assuming 4-byte floats.
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - np.mean(z, axis=1)[:, None]
cov = np.dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation matrix by normalizing by the standard deviations:
sigma = np.std(z, axis=1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory-mapped arrays (make sure the file is opened for reading and writing) and the out parameter of np.dot (for another example, see Optimizing my large data code with little RAM):
N = z.shape[0]
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N, N))
np.dot(z0, z0.T, out=arr)
arr /= z.shape[-1]   # same normalization by the number of observations as above
arr /= sigma
arr /= sigma[:, None]
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with np.where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory.)
You can use scipy.spatial.distance.pdist with metric='correlation' to get all the correlations without the symmetric terms. Unfortunately this will still leave you with about 5e9 terms, which will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
Your best bet is probably brute-forcing blocks of data using scipy.spatial.distance.cdist(..., metric='correlation'), and then keeping only the high correlations in each block. Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
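A rough sketch of that block approach (the helper name, block size and threshold handling are illustrative, not from the answer). scipy's correlation distance is 1 - r, so keeping r > 0.95 means keeping distances below 0.05:
import numpy as np
from scipy.spatial.distance import cdist

def high_correlations(z, threshold=0.95, block=2000):
    # Yield (i, j, r) for pairs with correlation above `threshold`, one block pair at a time.
    n = z.shape[0]
    for a in range(0, n, block):
        for b in range(a, n, block):
            # correlation distance = 1 - r, so r > threshold  <=>  distance < 1 - threshold
            d = cdist(z[a:a + block], z[b:b + block], metric='correlation')
            ii, jj = np.where(d < 1.0 - threshold)
            for i, j in zip(ii + a, jj + b):
                if i < j:                # skip the diagonal and duplicate pairs
                    yield i, j, 1.0 - d[i - a, j - b]
Each yielded (i, j, r) triple can then be written to disk in the same format as the brute-force loop in the question.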
Please check out the deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried it on z.shape = (2500, 60), computing the pairwise Pearson correlations for 2500 x 2500 pairs, and it is extremely fast.
Not sure about 100000 x 100000, but it is worth trying.

Confused by random.randn()

I am a bit confused by the NumPy function random.randn(), which returns random values from the standard normal distribution in an array of the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practice.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The NumPy function randn is incredibly useful for adding a random noise element to a dataset that you create for initial testing of a machine learning model. Say, for example, that you want to create a million-point dataset that is roughly linear, for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear dataset, you follow the formula y = mx + b + noise with the following code (setting m = 0.5 and b = 5 in this example):
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
This could be used in 3D programming to generate non-overlapping random values, which can be useful when optimizing graphical effects.
Another possible statistical application is adding such noise when applying a formula, in order to test how spatial or temporal factors affect a given quantity: for example, if a formula measures some effect over a span of time, you could use this to check whether the formula is more effective over shorter intervals or longer ones.
np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the "standard normal" distribution (mu=0, stdev=1).
For random samples from N(mu, sigma^2), use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then sigma * Z + mu will have a normal distribution with expected value mu and standard deviation sigma.
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution
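As a quick numerical check of that relationship (mu and sigma chosen arbitrarily here):
import numpy as np

mu, sigma = 10.0, 2.0
samples = sigma * np.random.randn(1_000_000) + mu   # samples from N(mu, sigma^2)
print(samples.mean(), samples.std())                # should print values close to 10 and 2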

How does sklearn's standard DBSCAN run so fast?

I've been messing around with alternative implementations of DBSCAN for clustering radar data (like grid-based DBSCAN). Up to this point, I had been using sklearn's standard euclidean DBSCAN and it would run on 26,000 data points in less than a second. However, when I specify my own distance metric, like this:
X = np.column_stack((beam, gate, time_index))
num_pts = X.shape[0]
epsilons = np.array([[beam_eps]*num_pts, [gate_eps] * num_pts, [time_eps] * num_pts]).T
metric = lambda x, y, eps: np.sqrt(np.sum((x/eps - y/eps)**2))
def dist_metric(x, y, eps):
    return np.sqrt(np.sum((x - y)**2))
db = DBSCAN(eps=eps, min_samples=minPts, metric=dist_metric, metric_params={'eps': epsilons}).fit(X)
it goes from 0.36 seconds to 92 minutes to run on the same data.
What I did in that code snippet can also be accomplished with just transforming the data beforehand and running standard Euclidean DBSCAN, but I'm trying to implement a reasonably fast version of Grid-based DBSCAN, for which the horizontal epsilon varies based on distance from the radar, so I won't be able to do that.
Part of the slowness in the above distance metric is because of that division by epsilon I think, because it only takes about a minute to run if I use a 'custom metric' that's just Euclidean distance:
metric = lambda x, y: np.sqrt(np.sum((x - y)**2))
How does sklearn's euclidean DBSCAN manage to run so much faster? I've been digging through the code, but haven't made sense of it so far.
Because it uses an index.
Furthermore, it avoids the slow and memory-intensive Python interpreter and does all the work in native code (compiled from Cython). This makes a huge difference when dealing with lots of primitive data such as doubles and ints, which the Python interpreter would otherwise need to box.
Indexes make all the difference for similarity search. They can reduce the runtime from O(n²) to O(n log n).
But while the ball tree index allows custom metrics, the cost of invoking the Python interpreter for every distance computation is very high, so if you really want a custom metric, edit the Cython source code and compile sklearn yourself. Or you can use ELKI, since the Java JVM can compile extension code into native code when necessary; it does not need to fall back to slow interpreter callbacks the way sklearn does.
In your case, it will likely be much better to preprocess the data instead: scale it prior to clustering, as in the sketch below.
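Here is a sketch of that preprocessing idea for the fixed-epsilon case, reusing the variables from the question (beam, gate, time_index, minPts, and the per-axis epsilons beam_eps, gate_eps, time_eps): dividing each column by its epsilon makes the varying per-axis thresholds equivalent to a plain Euclidean eps of 1, which keeps sklearn on its fast, indexed native-code path.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.column_stack((beam, gate, time_index))             # as in the question
X_scaled = X / np.array([beam_eps, gate_eps, time_eps])   # one fixed epsilon per axis
db = DBSCAN(eps=1.0, min_samples=minPts).fit(X_scaled)    # default Euclidean metric, uses the index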