Clarification regarding the fit operation in PySpark MLlib algorithms

I have a small question regarding ML operations in PySpark: is the fit operation in PySpark a distributed operation, or does the whole fit get executed on a single node?
Details: I am trying to fit a KMeans model on a huge dataset, but it is taking a long time, so I wanted some clarity on this.
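For reference, this is roughly the kind of fit call I mean (a sketch only; the file name, feature columns, and k are placeholders for my actual setup):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("huge_dataset.parquet")  # placeholder path

    # Assemble the numeric columns into a single "features" vector column.
    assembled = VectorAssembler(inputCols=["f1", "f2", "f3"],
                                outputCol="features").transform(df)

    kmeans = KMeans(k=10, seed=1, featuresCol="features")
    model = kmeans.fit(assembled)   # this is the step that takes a long time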
PS: I am new to PySpark, so please excuse me if the question seems silly.

Related

How are histograms constructed in sklearn's HistGradientBoostingClassifier to decide on the best split point?

Both lightgbm and sklearn's HistGradientBoostingClassifier estimators use histograms to decide on the best splits for continuous features.
Is it possible to explain intuitively (or with an example) the process of histogram creation and how it helps in finding a split point at a node faster?
I have searched extensively online but could not find a simple or intuitive explanation of how the histograms are constructed.
I am not sure, but it could be related to how regression trees are constructed in XGBoost. For a continuous feature, you construct a histogram, decide on the split (e.g. weight < 70 kg), build a regression tree, and compute the similarity score as well as the gain. However, when the range of values in the continuous feature is quite large, it becomes computationally expensive to try all possible split values. In that case, XGBoost basically makes the split using quantiles, which involves dividing all the observations into equally sized sets.
I guess sklearn's HistGradientBoostingClassifier might involve a similar optimization for coming up with the best split.
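To make the intuition concrete, here is a toy sketch (not sklearn's or LightGBM's actual implementation) of binning one continuous feature into quantile bins and scanning only the bin edges as candidate splits; the gain formula is a deliberately simplified variance-reduction-style score on gradient sums:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)          # one continuous feature
    g = rng.normal(size=1000)          # per-sample gradients (placeholder values)

    n_bins = 16
    # Interior quantile-based bin edges, then map each sample to a bin.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = np.searchsorted(edges, x)

    # Histogram of gradient sums and sample counts per bin, built once per feature.
    grad_hist = np.bincount(bin_idx, weights=g, minlength=n_bins)
    count_hist = np.bincount(bin_idx, minlength=n_bins)

    # Candidate splits are the bin boundaries: a cumulative left/right scan over
    # n_bins entries replaces evaluating every unique value of x.
    best_gain, best_bin = -np.inf, None
    g_left, n_left = 0.0, 0
    g_total, n_total = grad_hist.sum(), count_hist.sum()
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        n_left += count_hist[b]
        g_right, n_right = g_total - g_left, n_total - n_left
        if n_left == 0 or n_right == 0:
            continue
        # Simplified gain score; real implementations also use hessians and regularization.
        gain = g_left**2 / n_left + g_right**2 / n_right - g_total**2 / n_total
        if gain > best_gain:
            best_gain, best_bin = gain, b

    print("best split after bin", best_bin, "with gain", best_gain)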

Implement data generator in federated training

(I have posted this question at https://github.com/tensorflow/federated/issues/793 and am also asking it here.)
I have adapted my own data and model to the federated interfaces and the training converges. But I am confused about one issue: in an image classification task, the whole dataset is extremely large, and it can neither be stored in a single federated_train_data nor loaded into memory all at once. So I need to load the dataset from disk into memory in batches at runtime and use Keras model.fit_generator instead of model.fit during training, which is the approach people usually take to deal with large data.
I suppose that in the iterative_process shown in the image classification tutorial, the model is fit on a fixed set of data. Is there any way to adjust the code so that it fits on a data generator? I have looked into the source code but am still quite confused. I would be incredibly grateful for any hints.
Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.
In fact, when writing TFF, there are generally three levels at which one may be writing:
TensorFlow defining local processing (i.e., processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only at a single placement).
Native TFF defining the way data is communicated across placements. For example, writing tff.federated_sum inside of a tff.federated_computation decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".
Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.
If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data into a federated computation becomes relatively simple; it is just done at the Python level.
One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then you can instantiate a new list of tf.data.Datasets and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop) would be here, in the code associated with this paper.
One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded in when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of tf.data.Dataset more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource for this construct; hopefully that may help the mental model on what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.
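To make this concrete, here is a minimal sketch of such a Python driver loop. It assumes an already-built iterative_process (e.g. as in the image classification tutorial), a client_data object implementing tff.simulation.ClientData, and a hypothetical preprocess() helper; NUM_ROUNDS and CLIENTS_PER_ROUND are placeholders:

    import random

    NUM_ROUNDS = 10
    CLIENTS_PER_ROUND = 5

    state = iterative_process.initialize()
    for round_num in range(NUM_ROUNDS):
        # Client selection happens in plain Python, once per round.
        sampled_ids = random.sample(client_data.client_ids, CLIENTS_PER_ROUND)

        # Each tf.data.Dataset here is a lazy "recipe"; nothing is loaded into
        # memory until the federated computation iterates over it.
        federated_train_data = [
            preprocess(client_data.create_tf_dataset_for_client(cid))
            for cid in sampled_ids
        ]

        state, metrics = iterative_process.next(state, federated_train_data)
        print(f"round {round_num}: {metrics}")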

Do we use Spark because it's faster or because it can handle large amount of data? [duplicate]

A Spark newbie here.
I recently started playing around with Spark on my local machine on two cores by using the command:
pyspark --master local[2]
I have a 393 MB text file which has almost a million rows. I wanted to perform some data manipulation operations. I am using the built-in DataFrame functions of PySpark to perform simple operations like groupBy, sum, max, and stddev.
However, when I do the exact same operations in pandas on the exact same dataset, pandas seems to defeat pyspark by a huge margin in terms of latency.
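For reference, here is roughly the kind of comparison I am running (a sketch only; the file name and the "key"/"value" column names are placeholders for my actual data):

    import time
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    t0 = time.time()
    sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
    sdf.groupBy("key").agg(F.sum("value"), F.max("value"), F.stddev("value")).collect()
    print("pyspark:", time.time() - t0, "s")

    t0 = time.time()
    pdf = pd.read_csv("data.csv")
    pdf.groupby("key")["value"].agg(["sum", "max", "std"])
    print("pandas:", time.time() - t0, "s")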
I was wondering what could be a possible reason for this. I have a couple of thoughts.
Do the built-in functions perform serialization/deserialization inefficiently? If so, what are the alternatives?
Is the dataset so small that it cannot outweigh the overhead cost of the underlying JVM on which Spark runs?
Thanks for looking. Much appreciated.
Because:
Apache Spark is a complex framework designed to distribute processing across hundreds of nodes, while ensuring correctness and fault tolerance. Each of these properties has significant cost.
Because purely in-memory in-core processing (Pandas) is orders of magnitude faster than disk and network (even local) I/O (Spark).
Because parallelism (and distributed processing) adds significant overhead, and even with an optimal (embarrassingly parallel) workload it does not guarantee any performance improvement.
Because local mode is not designed for performance. It is used for testing.
Last but not least, 2 cores running on 393 MB is not enough to see any performance improvement, and a single node doesn't provide any opportunity for distribution.
See also: Spark: Inconsistent performance number in scaling number of cores, Why is pyspark so much slower in finding the max of a column?, and Why does my Spark run slower than pure Python? Performance comparison.
You can go on like this for a long time...

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the docs but I could not find a method for it. I want to do cross validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied to the data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
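A quick sketch of both options on a toy dataset (shown in TF 2.x eager mode):

    import tensorflow as tf

    ds = tf.data.Dataset.range(10)

    # 1) shard: keep every num_shards-th element, starting at `index`.
    shard_0 = ds.shard(num_shards=2, index=0)   # 0, 2, 4, 6, 8
    shard_1 = ds.shard(num_shards=2, index=1)   # 1, 3, 5, 7, 9

    # 2) window: a dataset of sub-datasets, each of (up to) 5 elements.
    windows = ds.window(5)
    for w in windows:
        print(list(w.as_numpy_iterator()))      # [0..4], then [5..9]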
I am afraid you cannot. The dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole -- in that regard it might be a bit of a misnomer.
Also, even if you could, this would probably be a bad idea. You would rather have this train/test split done once and for all:
it lets you review those sets offline;
if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset).
See also a related question about how to split a set into training & testing in tensorflow.

Amazon EC2 vs PiCloud [closed]

Closed 8 years ago. This question is opinion-based and is not currently accepting answers.
We are students trying to handle a dataset of about 140 million records and to run a few machine learning algorithms on it. We are newbies to cloud solutions and Mahout implementations. Currently we have the data set up in a PostgreSQL database, but the current implementation doesn't scale, and read/write operations seem extremely slow even after numerous rounds of performance tuning. Hence we are planning to move to cloud-based services.
We have explored a few possible alternatives.
Amazon cloud-based services (Mahout implementation)
PiCloud with scikit-learn (we were planning to use the HDF5 format with NumPy)
Please recommend any other alternatives if any.
Here are our questions:
Which would yield better results (turnaround time) and be more cost-effective? Please mention any other alternatives.
If we set up Amazon services, what data format should we use? If we use DynamoDB, will the cost shoot up?
Thanks
It depends on the nature of the machine learning problem you want to solve. I would recommend that you first subsample your dataset to something that fits in memory (e.g. 100k samples with a few hundred non-zero features per sample, assuming a sparse representation).
Then try a couple of machine learning algorithms in scikit-learn that scale to a large number of samples:
SGDClassifier or MultinomialNB if you want to do supervised classification (if you have categorical labels to predict in your dataset)
SGDRegressor if you want to do supervised regression (if you have a continuous target variable to predict)
MiniBatchKMeans clustering to do unsupervised clustering (but then there is no objective way to quantify the quality of the resulting clusters by default).
...
Perform grid search to find the optimal values of the hyperparameters of the model (e.g. the regularizer alpha and the number of passes n_iter for SGDClassifier) and evaluate the performance using cross-validation.
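For example, a minimal sketch of that grid search on synthetic stand-in data (note that the n_iter parameter mentioned above is called max_iter in recent scikit-learn releases):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    # Synthetic data standing in for the subsampled dataset.
    X_sub, y_sub = make_classification(n_samples=10_000, n_features=50, random_state=0)

    param_grid = {"alpha": [1e-5, 1e-4, 1e-3], "max_iter": [5, 20, 50]}
    search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_sub, y_sub)
    print(search.best_params_, search.best_score_)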
Once done, retry with a 2x larger dataset (still fitting in memory) and see if it improves your predictive accuracy significantly. If it does not, then don't waste your time trying to parallelize this on a cluster to run on the full dataset, as it won't yield any better results.
If it does, what you could do is shard the data into pieces, put a slice of the data on each node, learn an SGDClassifier or SGDRegressor model on each node independently with PiCloud, collect back the weights (coef_ and intercept_), then compute the average weights to build the final linear model and evaluate it on some held-out slice of your dataset.
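A rough sketch of that averaging scheme on synthetic stand-in data (treat it as an illustration only; whether simple weight averaging is adequate depends on your problem):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    # Synthetic data standing in for the real sharded dataset.
    X, y = make_classification(n_samples=30_000, n_features=50, random_state=0)
    X_holdout, y_holdout = X[-5_000:], y[-5_000:]
    shards = np.array_split(np.arange(25_000), 5)   # 5 shards of row indices

    # Train one model per shard (in practice, one per node via PiCloud).
    models = [SGDClassifier(alpha=1e-4, random_state=0).fit(X[idx], y[idx])
              for idx in shards]

    # Build the final linear model from the averaged coef_ and intercept_.
    final = SGDClassifier(alpha=1e-4)
    final.coef_ = np.mean([m.coef_ for m in models], axis=0)
    final.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    final.classes_ = models[0].classes_

    print("held-out accuracy:", final.score(X_holdout, y_holdout))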
To learn more about the error analysis, have a look at how to plot learning curves:
http://digitheadslabnotebook.blogspot.fr/2011/12/practical-advice-for-applying-machine.html
https://gist.github.com/1540431
http://jakevdp.github.com/tutorial/astronomy/practical.html#bias-variance-over-fitting-and-under-fitting
PiCloud is built on top of AWS, so either way you're going to be using Amazon at the end of the day. The question is how much infrastructure you'll have to write yourself to get everything wired together. PiCloud gives you some free usage to put it through its paces, so you might give it a shot initially. I haven't used it myself, but clearly they're trying to provide ease of deployment for machine-learning-type applications.
It seems like you are after results rather than building a cloud project for its own sake, so I would either look into using one of Amazon's other services besides straight EC2, or some other offering like PiCloud or Heroku that can take care of the bootstrapping.
AWS has a program in place for supporting educational users, so you might want to do some research into that program.
You should take a look at numba if you are looking for some NumPy speed-ups:
https://github.com/numba/numba
It doesn't solve your cloud scaling issue, but it may reduce compute time.
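For example, a minimal sketch of the kind of speed-up meant here, using a hypothetical toy function; the @njit decorator compiles the Python loop to machine code on first call:

    import numpy as np
    from numba import njit

    @njit
    def pairwise_sq_dist_sum(x):
        # Plain Python loops over a NumPy array, JIT-compiled by numba.
        total = 0.0
        for i in range(x.shape[0]):
            for j in range(i + 1, x.shape[0]):
                d = x[i] - x[j]
                total += d * d
        return total

    x = np.random.rand(1000)
    print(pairwise_sq_dist_sum(x))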
I just made a comparison between PiCloud and Amazon EC2 that might be helpful.