Amazon EC2 vs PiCloud [closed] - numpy

We are students trying to handle a dataset of about 140 million records and to run a few machine learning algorithms on it. We are newbies to cloud solutions and Mahout implementations. We currently have the data set up in a PostgreSQL database, but the implementation doesn't scale, and read/write operations seem extremely slow even after numerous rounds of performance tuning. Hence we are planning to move to cloud-based services.
We have explored a few possible alternatives.
Amazon cloud-based services (Mahout implementation)
PiCloud with scikit-learn (we were planning to use the HDF5 format with NumPy)
Please recommend any other alternatives if any.
Our questions:
Which would yield better results (turnaround time) and be more cost-effective?
If we set up Amazon services, how should we format the data? If we use DynamoDB, will the cost shoot up?
Thanks

It depends on the nature of the machine learning problem you want to solve. I would recommend first subsampling your dataset to something that fits in memory (e.g. 100k samples with a few hundred non-zero features per sample, assuming a sparse representation).
Then try a couple of machine learning algorithms that scale to a large number of samples in scikit-learn:
SGDClassifier or MultinomialNB if you want to do supervised classification (if you have categorical labels to predict in your dataset)
SGDRegressor if you want to do supervised regression (if you have continuous target variable to predict)
MiniBatchKMeans if you want to do unsupervised clustering (but then there is no objective way to quantify the quality of the resulting clusters by default).
...
Perform a grid search to find the optimal values of the hyperparameters of the model (e.g. the regularizer alpha and the number of passes n_iter for SGDClassifier) and evaluate the performance using cross-validation.
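As a rough illustration (not from the original answer), such a grid search could look like the sketch below. Note that in recent scikit-learn releases the number of passes is controlled by max_iter rather than the older n_iter, and the synthetic dataset is only a stand-in for your real data:

```python
# Sketch: grid search over SGDClassifier hyperparameters with
# cross-validation, on a made-up dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10000, n_features=100, random_state=0)

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2],  # regularization strength
    "max_iter": [5, 10, 50],            # number of passes over the data
}
grid = GridSearchCV(SGDClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```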
Once done, retry with a 2x larger dataset (still fitting in memory) and see whether it improves your predictive accuracy significantly. If it doesn't, then don't waste your time trying to parallelize this on a cluster to run on the full dataset, as it won't yield any better results.
If it does, what you could do is shard the data into pieces, put a slice of the data on each node, learn an SGDClassifier or SGDRegressor model on each node independently with PiCloud, collect back the weights (coef_ and intercept_), and then compute the average weights to build the final linear model and evaluate it on some held-out slice of your dataset.
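A minimal single-machine sketch of that averaging scheme; the shard loop below would become one PiCloud job per node in practice, and the dataset and shard count are made up:

```python
# Sketch: train one SGDClassifier per shard, then average the weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Made-up data standing in for the real 140M-record dataset.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_train, X_held = X[:16000], X[16000:]
y_train, y_held = y[:16000], y[16000:]

# Train one model per shard; with PiCloud each iteration would run
# on a separate node instead of in this loop.
shards = np.array_split(np.arange(len(y_train)), 4)
models = [SGDClassifier(max_iter=10).fit(X_train[i], y_train[i])
          for i in shards]

# Average coef_ and intercept_ into a final linear model.
final = models[0]
final.coef_ = np.mean([m.coef_ for m in models], axis=0)
final.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
print(final.score(X_held, y_held))  # evaluate on the held-out slice
```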
To learn more about error analysis, have a look at how to plot learning curves:
http://digitheadslabnotebook.blogspot.fr/2011/12/practical-advice-for-applying-machine.html
https://gist.github.com/1540431
http://jakevdp.github.com/tutorial/astronomy/practical.html#bias-variance-over-fitting-and-under-fitting

PiCloud is built on top of AWS, so either way you're going to be using Amazon at the end of the day. The question is how much infrastructure you'll have to write yourself to get everything wired together. PiCloud gives you some free usage to put it through its paces, so you might give it a shot initially. I haven't used it myself, but clearly they're trying to provide ease of deployment for machine-learning-type applications.
It sounds like you're after results rather than building a cloud project for its own sake, so I would either look into one of Amazon's other services besides straight EC2, or into something like PiCloud or Heroku or another service that can take care of the bootstrapping.

AWS has a program in place for supporting educational users, so you might want to do some research into it.

You should take a look at numba if you are looking for some NumPy speed-ups:
https://github.com/numba/numba
Doesn't solve your cloud scaling issue, but may reduce time to compute.
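As a quick, made-up illustration of the kind of speed-up numba targets (the function below is not from the question, just an example of JIT-compiling an explicit NumPy loop):

```python
# Sketch: numba JIT-compiles explicit loops over NumPy arrays.
import numpy as np
from numba import jit

@jit(nopython=True)
def row_norms(X):
    # Explicit loops are fast under numba's nopython mode.
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        s = 0.0
        for j in range(X.shape[1]):
            s += X[i, j] * X[i, j]
        out[i] = s ** 0.5
    return out

X = np.random.rand(100000, 50)
print(row_norms(X)[:3])
```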

I just made a comparison between PiCloud & Amazon EC2 that might be helpful.

Related

Experiments showing that using a Feature Store with Feast is worthwhile

I am currently working on a research project about Feature Stores with Feast and I am looking for examples of experiments or case studies that demonstrate the value of using Feature Stores in data science projects.
Currently I’m preparing some experiments with Feast. Specifically, I’m focusing on:
· Trying to show that using Feast improves the speed of retrieving features from data stored either locally or on GCP, with many rows, tables, and data sources.
· Secondly, I hope Feast will also help with feature engineering by calculating features on demand, and will reduce the time it takes to get predictions from the model.
· I also hope that Feast improves ML quality by reducing training and retraining time, and improves model scores through data validation.
· Finally, I’ll try to show that Feast reduces costs in the GCP cloud.
If any of you have experience or knowledge in this area, I would greatly appreciate any proposals for new experiments.

Neural network hyperparameter tuning - is setting random seed a good idea? [closed]

I am trying to tune a basic neural network as practice (based on an example from a Coursera course: Neural Networks and Deep Learning - DeepLearning.AI).
I face the issue of random weight initialization. Let's say I try to tune the number of layers in the network.
I have two options:
1.: set the random seed to a fixed value
2.: run my experiments multiple times without setting the seed
Both versions have pros and cons.
My biggest concern is that if I use a fixed random seed (e.g. tf.random.set_seed(1)), the tuned values can be "over-fitted" to that seed and may not work well without the seed or if the value is changed (e.g. tf.random.set_seed(1) -> tf.random.set_seed(2)). On the other hand, if I run my experiments several times without a random seed, I can inspect fewer options (due to limited computing capacity) and still only cover a subset of the possible random weight initializations.
In both cases I feel that luck is a strong factor in the process.
Is there a best practice for handling this?
Does TensorFlow have built-in tools for this purpose? I appreciate any source of descriptions or tutorials. Thanks in advance!
Tuning hyperparameters in deep learning (and in machine learning generally) is a common issue. Setting the random seed to a fixed number ensures reproducibility and fair comparison: repeating the same experiment will lead to the same outcomes. As you probably know, best practice to avoid over-fitting is to do a train-test split of your data and then use k-fold cross-validation to select the optimal hyperparameters. If you test multiple values for a hyperparameter, you want to make sure the other circumstances that might influence the performance of your model (e.g. the train-test split or the weight initialization) are the same for each candidate value, in order to have a fair comparison of performance. Therefore I would always recommend fixing the seed.
Now, the problem with this, as you already pointed out, is that the performance of each model will still depend on the random seed, i.e. on the particular data split or weight initialization in your case. To avoid this, one can do repeated k-fold cross-validation: you repeat the k-fold cross-validation multiple times, each time with a different seed, select the best parameters of that run, test on the test data, and average the final results to get a good estimate of performance plus variance, thereby eliminating the influence the seed has in the validation process.
Alternatively, you can perform k-fold cross-validation a single time and train each split n times with a different random seed (eliminating the effect of weight initialization, but still keeping the effect of the train-test split).
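A minimal sketch of repeated k-fold cross-validation using scikit-learn utilities (a Keras/TensorFlow model could be wrapped and scored the same way; the dataset and estimator here are placeholders):

```python
# Sketch: each repetition reshuffles the folds with a different seed,
# so the final score averages out the influence of any single split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())  # performance estimate + variance
```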
Finally, TensorFlow has no built-in tool for this purpose. You as a practitioner have to take care of it yourself.
There is no absolute right or wrong answer to your question; you have almost answered it yourself already. In what follows, however, I will try to expand on a few points:
The purpose of random initialization is to break the symmetry that makes neural networks fail to learn:
... the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units. If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way...
— Deep Learning (Adaptive Computation and Machine Learning series)
Hence, we need the neural network components (especially the weights) to be initialized with different values. There are some rules of thumb for how to choose those values, such as Xavier initialization, which samples from a normal distribution with mean 0 and a variance chosen based on the size of the network layer. This is a very interesting article to read.
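For illustration, a minimal Keras sketch of Xavier (Glorot) initialization; the layer sizes here are arbitrary:

```python
# Sketch: GlorotNormal draws weights from a normal distribution whose
# variance depends on the layer's fan-in/fan-out.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_initializer=tf.keras.initializers.GlorotNormal(seed=1)),
    tf.keras.layers.Dense(1),
])
```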
Having said so, the initial values are important but not extremely critical "if" proper rules are followed, as mentioned in point 2. They are important because large or improper ones may lead to vanishing or exploding gradient problems. On the other hand, different "proper" weights shall not hugely change the final results, unless they cause the aforementioned problems or get the neural network stuck at some local optimum. Please note, however, that the latter also depends on many other aspects, such as the learning rate, the activation functions used (some explode/vanish more than others: this is a great comparison), the architecture of the neural network (e.g. fully connected, convolutional, etc.: this is a cool paper) and the optimizer.
In addition to point 2, bringing a good optimizer into the bargain, beyond the standard stochastic one, should in theory keep the initial values from noticeably affecting the quality of the final results. A good example is Adam, which provides a very adaptive learning technique.
If you still get noticeably different results with different "proper" initialized weights, there are some approaches that "might help" make the neural network more stable, for example: use a train-test split, use GridSearchCV for the best parameters, and use k-fold cross-validation, etc.
In the end, obviously the best scenario is to train the same network with different random initial weights many times and then take the average results and variance, for a more specific judgement of the overall performance. How many times? Well, if you can do it hundreds of times, all the better, yet that is clearly almost impractical (unless you have some Google-ish hardware capability and capacity). As a result, we come to the same conclusion that you had in your question: there should be a trade-off between time and space complexity and reliability when using a seed, taking into consideration some of the rules of thumb mentioned in the previous points. Personally, I am okay with using the seed because I believe that "It’s not who has the best algorithm that wins. It’s who has the most data." (Banko and Brill, 2001). Hence, using a seed with enough data samples (define enough: it is subjective, but the more the better) shall not cause any concerns.

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the doc but I could not find a method for it. I want to do cross-validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied to data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
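A minimal sketch of both transformations on a toy dataset (TF 2.x eager style):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# (1) shard: keep every num_shards-th element, starting at index.
shard0 = ds.shard(num_shards=2, index=0)  # 0, 2, 4, 6, 8
shard1 = ds.shard(num_shards=2, index=1)  # 1, 3, 5, 7, 9

# (2) window: a dataset of sub-datasets of size 5.
for w in ds.window(5):
    print(list(w.as_numpy_iterator()))
```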
I am afraid you cannot. The Dataset API is a way to efficiently stream inputs to your net at run time; it is not a set of tools to manipulate datasets as a whole, and in that regard it might be a bit of a misnomer.
Also, even if you could, this would probably be a bad idea. You would rather have the train/test split done once and for all:
it lets you review those sets offline
if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset)
See also a related question about how to split a set into training & testing in TensorFlow.

Tensorflow requirements [closed]

I would like to deploy TensorFlow on a production server. I expect about 200 concurrent users. I will be using the parser and a few of my own deep learning neural network models. I would like to know the peak memory and CPU usage for this setup.
Appreciate your help.
Trying a simple (but very variable) guess:
If you are talking about deep learning, I infer you have at least 3 or more layers, including some CNNs and probably RNNs.
If you are using simple 2D or 3D inputs but a complex architecture, it can safely be said that your bottleneck will be the CPU, and thus running the algorithms on GPU will be needed.
You also need to be prepared to scale to any number of clients, so a scaling mechanism will be useful from the start.
You also need to know how the workload will be handled: will you have to serve in real time, or is a batch queue acceptable? This changes the requirements enormously.
Once you can figure out this and maybe other details, you can refine your estimation.
Best Regards.
For the memory it depends on 3 factors:
graph size
batch size for the graph
the queue size with the incoming request
Among the 3 factors, the batch size probably has the most impact, as the memory taken by the graph is roughly graph size × batch size.
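As a made-up illustration of that rule of thumb: if one copy of the graph's activations needs roughly 100 MB at batch size 1, then batch size 32 puts you in the neighborhood of 3.2 GB, before counting the request queue.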
About the CPU, I suggest you use a GPU for the graph. You can run some tests and count the number of inferences per second you can do with your graph and the selected batch size. TensorFlow Serving, well implemented, handles concurrency very nicely, and your bound is going to be the graph speed.
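A minimal sketch of such a throughput test (the model and shapes below are made up for illustration):

```python
# Sketch: measure inferences per second for a given model and batch size.
import time
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
batch = tf.random.uniform((32, 128))  # batch size 32, 128 features

model(batch)  # warm-up call builds the model
start = time.time()
n = 100
for _ in range(n):
    model(batch)
print(n * 32 / (time.time() - start), "inferences/sec")
```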

What are the types of problems TensorFlow can help solve? [closed]

The TensorFlow home page describes its purpose as 'a software library for numerical computation'. Looking through the sample problems, it looks like a problem is always formulated as follows:
Input
Model parameters
Desired output
Given some training data for 1) and 3), 2) can be computed.
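As a minimal, made-up illustration of that formulation in TensorFlow: given inputs (1) and desired outputs (3), gradient descent recovers the model parameters (2); here the "model" is just y = w*x + b:

```python
import tensorflow as tf

x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([1.0, 3.0, 5.0, 7.0])  # generated by y = 2x + 1

w = tf.Variable(0.0)
b = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for _ in range(500):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x + b - y) ** 2)
    opt.apply_gradients(zip(tape.gradient(loss, [w, b]), [w, b]))

print(w.numpy(), b.numpy())  # should approach 2.0 and 1.0
```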
I can see how this can be used to create bots, self-driving cars, image classifiers etc.
Given the broad definition of 'numerical computation', am I missing a class of other problems this can be used for? Can this be used for, say, more classical numerical computations such as the airflow around an aircraft or deformation of a structure under stress? Do you have any examples of how these classical problems would have to be formulated to fit the form above?
A nice discussion on what artificial neural networks could do: the fact that our brain is a neural network might imply that eventually an artificial neural network will be able to do the same tasks.
Some more examples of artificial neural networks in use today: music creation, image-based location, PageRank, Google Voice, stock trade prediction, NASA star classification, traffic management.
Some fields I know of but do not have a good reference for:
optical quantum mechanics test set-up generator
medical diagnosis, reference only about safety
The Sharp LogiCook microwave oven, wiki, NASA mention
I think there are many millions of "problems" that can be solved with an ANN; deciding on the data representation (input, output) will be a challenge for some of these. Some useful and useless examples I have been thinking about:
home thermostat that learns your wishes with certain weather types.
bakery production prediction
recognize go-stones on a board and map their locations
personal activity guesser that turns on the appropriate device.
recognize person based on mouse movement
Given the right data and network these examples will work.
Dad has a PC controlling the heating system back home; I trained a network based on his 10 years of heating data (outside temp, inside temp, humidity, etc.). Unfortunately I am not allowed to hook it up.
My aunt and uncle have a bakery; based on 6 years of sales data I trained a network predicting how many breads and buns they should make. It showed me how important the correct inputs are: at first I used the day of the year, but when I switched to the day of the week I saw a 15% increase in accuracy.
Currently I am working on a network that will detect a go board in a given image and map all 361 locations, telling me whether a black stone, a white stone, or no stone is present.
Two examples that showed me how much information can be stored in a single neuron and of different ways to represent data:
Image example, neuron example (unfortunately you have to train both examples yourself so give them a little time.)
On to your example: airflow around an aircraft.
I know next to nothing about airflow calculations, and my try would be a really huge 3D input layer where you can "draw" an airplane and the direction and speed of the airflow.
It might work but it will require a tremendous amount of computation power, somebody knowing more about this specific topic probably knows a more abstract way of representing the data resulting in a more manageable network.
This NASA paper talks about a neural network for calculating airflow around a wing. Unfortunately I do not understand what kind of input they use; maybe it is clearer to you.