pandas: finding relationships between data in a large dataset

I am new to data science and I want to explore the relationships in my data. I have a very large dataset of 556784 rows and 60 columns. There are some unwanted variables that I want to ignore before feeding the data to a neural network. Linear regression and multiple regression can help find the relationship between the X variables and the Y label, but does running a regression technique on such a huge dataset really help? Or are there other ways to find which data is really important to the problem and which is not?
I know this is a theory question, but it really helps me to proceed further.
Thanks!

I'm also a noob in DS, but I think I can give you some ideas:
The way you treat your data depends on what kind of data you are working with (numbers, text, or some kind of time series).
It is a good idea to explore it yourself by making some plots.
You can use a reasonably small part of your data to reduce computation time.
Do you really need a NN? It gives results which are quite hard to interpret and it takes time to train; maybe you should start with "classic" models first and do some good feature engineering.
Finally, you can check the sklearn manual (which I find really good), in particular the data preprocessing chapter; I think it will give you some ideas to try:
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
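To make the plotting and sampling suggestions above concrete, here is a minimal, hypothetical sketch with pandas. The file name "data.csv" and the column name "target" are placeholders for your own data, and correlation only captures linear relationships, so treat this as a first look rather than a definitive feature-selection method.

import pandas as pd

# Work on a manageable random sample of the large file.
df = pd.read_csv("data.csv").sample(n=50_000, random_state=0)

# Correlation of every numeric column with the target gives a rough,
# linear-only ranking of which features might matter.
corr_with_target = (
    df.corr(numeric_only=True)["target"]
      .drop("target")
      .sort_values(key=abs, ascending=False)
)
print(corr_with_target.head(15))

# Quick scatter plot of the most correlated feature against the target.
df.plot.scatter(x=corr_with_target.index[0], y="target", alpha=0.2)

Tree-based feature importances (for example from scikit-learn's RandomForestRegressor fitted on the same sample) are another cheap way to rank columns before committing to a neural network.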
I hope some of this will be helpful.

Related

Can you build a model that normalises FEATURES using the test set while avoiding data leakage?

Just can't wrap my head around this one.
I understand that:
Normalising a target variable using the test set uses information on that target variable in the test set. This means we get inflated performance metrics that cannot be replicated once we receive a new test set (which does not have a target variable available).
However, when we receive a new test set, we do have the predictor variables available. What is wrong with using these to normalise? Yes, the predictors contain information that relates to the target variable, but that is literally the definition of predicting with a model: we use the information in the predictors to get predictions for the target. Why can't the model definition include a step that uses the input data to normalise before predicting?
The performance metrics, surely, wouldn't be skewed, as we are just using information from the predictors.
Yes, the test set is supposed to be 'unseen', but in reality surely it's only the test-set target variable that is unseen, not the predictors.
I have read around this and the answers so far are vague, just repeating that the test set is unseen and that we gain information about it. I would really appreciate an answer on why we can't use specifically the predictors, as I think the target case is obvious.
Thanks in advance!!
Having gone away and thought about my question (normalising our data on the training set as well), I realise this doesn't make much sense. Normalising is not part of the training but something we do before training, so normalising with test-set features is fine as an idea, but we would then have to train on that normalised data using the training-set outcomes. I originally thought "normalise on more data" beats "normalise on less data", but actually we'd be normalising on one set (training + test) and then fitting on another (training). We'd probably get a more poorly trained model as a result, so I now believe it was a bad idea!
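For reference, the usual way to avoid the leakage discussed above is to fit the scaler on the training features only and reuse the fitted scaler for the test set. A minimal sketch with scikit-learn; the arrays are random stand-ins for real data.

import numpy as np
from sklearn.preprocessing import StandardScaler

# X_train / X_test stand in for your own feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=10.0, size=(1_000, 5))
X_test = rng.normal(loc=3.0, scale=10.0, size=(200, 5))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # same statistics reused, no refit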

Tensorflow / Keras: Normalize train / test / realtime Data or how to handle reality?

I started developing some LSTM models and now have some questions about normalization.
Let's pretend I have some time-series data that roughly ranges between +500 and -500. Would it be more realistic to scale the data from -1 to 1, or is 0 to 1 a better way? I tested it, and 0 to 1 seemed to be faster. Is there a wrong way to do it, or would it just be slower to learn?
Second question: when do I normalize the data? I split the data into training and test data; do I have to scale/normalize these separately? Maybe the training data only ranges from -200 to +300 and the test data ranges from -100 to +600. That's not very good, I guess.
But on the other hand, if I scale/normalize the entire dataframe and split it after that, the data is fine for training and test, but how do I handle completely new incoming data? The model is trained on scaled data, so I have to scale the new data as well, right? But what if the new data is 1000? The normalization would turn this into something greater than 1, because it's a bigger number than everything seen before.
To make a long story short: when do I normalize data, and what happens to completely new data?
I hope I could make it clear what my problem is :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from data drawn from a standard Gaussian distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying a bit) help the network keep this property throughout all of its layers, which is usually beneficial.
There are other approaches, like the ones you mentioned; to tell reliably what helps for a given problem and architecture, you simply have to experiment and measure.
2. What about test data?
The mean to subtract and the variance to divide each instance by (or any other statistic gathered by whichever normalization scheme you choose) should be computed from your training dataset. If you take them from the test set, you introduce data leakage (information about the test distribution is incorporated into training) and you may get the false impression that your algorithm performs better than it does in reality.
So compute the statistics over the training dataset and apply them to incoming/validation/test data as well.
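A minimal NumPy sketch of that advice, with placeholder arrays: the statistics are computed once from the training data and then applied unchanged to anything that arrives later, even values outside the original range.

import numpy as np

train = np.random.uniform(-500, 500, size=(10_000, 1))  # stands in for your series

# Statistics come from the training data only.
mean, std = train.mean(), train.std()

def normalize(x):
    # Apply the frozen training statistics to any data, old or new.
    return (x - mean) / std

train_scaled = normalize(train)
new_value = np.array([[1000.0]])   # larger than anything seen in training
print(normalize(new_value))        # simply maps a bit outside the training range

If you would rather keep this step inside the model, tf.keras.layers.Normalization with its adapt() method called on the training data achieves the same effect.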

Train / Test split % for Object Detection - what's the current recommendation?

Using the Tensorflow Object Detection API, what's the current recommendation / best practice around the train / test split percentage for labeled examples? I've seen a lot of conflicting info, anywhere from 70/30 to 95/5. Any recent real world experience is appreciated.
Traditional advice is ~70-75% training and the rest test data. More recent articles indeed suggest a different split. I read 95/2.5/2.5 (train / test / dev for hyperparameter tuning) a lot these days.
I guess your optimal split depends on the amount of available data and the bias/variance characteristics. Poor performance on training data may be caused by underfitting and a need for more training data. If your model is fitting well or even overfitting, you should be able to allocate some of the training data to the test set.
If you're stuck in the middle, you may also consider cross-validation as a computationally expensive but data-friendly option.
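If cross-validation sounds attractive, here is a minimal, generic scikit-learn sketch. The toy classifier and data are placeholders (the Object Detection API itself doesn't provide this), so treat it only as an illustration of the idea.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins: replace with your own features, labels and model.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# 5-fold CV: every example is used for evaluation exactly once, so no single
# fixed split has to be chosen.
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y, cv=5)
print(scores.mean(), scores.std())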
It depends on the size of the dataset, as Andrew Ng suggests (train / dev (or val) / test):
If the size of the dataset is 100 to 10K ==> roughly 60/20/20
If the size of the dataset is 1M or more ==> 98/1/1 or 99.5/0.25/0.25
Note that these are not fixed, just suggestions.
The goal of the test set mentioned here is to give you an unbiased performance measurement of your work. In some works it is OK to have only two sets (they will call them train/test, though the test set there actually works as a dev set; the ratio can be 70/30).
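For a concrete way to carve out a 98/1/1 split along these lines, here is a sketch using two successive calls to scikit-learn's train_test_split. The file list and labels are made up for illustration; with an object-detection dataset you would split the list of annotated image paths the same way.

from sklearn.model_selection import train_test_split

# Stand-ins for your list of annotated examples.
files = [f"img_{i:06d}.jpg" for i in range(100_000)]
labels = [i % 3 for i in range(100_000)]

# First carve off 2% for dev + test, then split that 2% half and half.
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.02, random_state=0)
dev_f, test_f, dev_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.5, random_state=0)

print(len(train_f), len(dev_f), len(test_f))  # 98000 1000 1000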

Stata output variable to matrix with ebalance

I'm using the ebalance Stata package to calculate post-stratification weights, and I'd like to convert the weights output (_webal, which is generated as a double with format %10.0g) to a matrix.
I'd like to normalize all weights in the "control" group, but I can't seem to convert the variable to a matrix in order to manipulate the weights individually (I'm a novice in Stata, so I was just going to do this using a loop; I'd normally just export and do this in R, but I have to calculate the results within a bootstrap). I can, however, view the individual-level weights produced by the output, and I can use them to calculate sample statistics.
Any ideas, anyone? Thanks so much!
This is not an answer, but it doesn't fit within a comment box.
As a self-described novice in Stata, you are asking the wrong question.
Your problem is that you have a variable that you want to do some calculations on, and since you can't just use R and you don't know how to do those (unspecified) calculations directly in Stata, you have decided that the first step is to create a matrix from the variable.
Your question would be better phrased as a simple description of the relevant portions of your data and the calculation you need to do with that data (ebalance is an obscure distraction that probably lost you a few readers), and where you are stuck.
See also https://stackoverflow.com/help/mcve for a discussion of constructing a minimal, complete example with a description of the results you expect for that example.

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having trouble finding interesting data for this purpose on the web. Also, such data is often inconvenient to use because the format doesn't fit my needs.
I need a txt file in which each line represents a mathematical vector, with the elements separated by spaces, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create myself. How can I do this in a smart way with numpy?
By "smart" I mean that it shouldn't be generated uniformly, because that's a little bit boring. How can I generate some interesting clusters?
I want to have 5GB / 10GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several normal distributions with random parameters combined together, one for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
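Putting that together, here is a hypothetical sketch that draws a few Gaussian clusters with random centres and spreads and writes them to a space-separated txt file in the format you describe; the cluster count, dimensionality and point counts are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n_clusters, dim, points_per_cluster = 5, 3, 100_000

rows = []
for _ in range(n_clusters):
    mu = 50 * rng.random(dim)        # random cluster centre
    sigma = 1 + 4 * rng.random(dim)  # random per-axis spread
    rows.append(sigma * rng.standard_normal((points_per_cluster, dim)) + mu)

data = np.vstack(rows)
rng.shuffle(data)  # mix the clusters together (shuffles rows in place)

# One vector per line, elements separated by spaces, as requested.
np.savetxt("synthetic.txt", data, fmt="%.4f", delimiter=" ")

To reach 5-10 GB you would increase points_per_cluster, or generate and write the data in chunks to an open file handle, since holding everything in memory at once may not be practical.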
There is no single good answer to such a question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting, or even well-posed, problem. Clustering as such has no well-defined evaluation; consequently, each method is equally good/bad as long as it has a well-defined internal objective. So k-means will always be good at minimizing within-cluster Euclidean distance and will struggle with sparse data and non-convex, imbalanced clusters. DBSCAN will always be the best in a greedy, density-based sense and will struggle with clusters of diverse densities. GMM will always fit Gaussian mixtures well and will struggle with clusters which are not Gaussian (for example lines, squares, etc.).
From the question one could deduce that you are at the very beginning of your work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at dataset generators, in particular those accessible in scikit-learn (Python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf
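As a quick illustration of the scikit-learn generators mentioned above, a minimal sketch with make_blobs (the parameter values are arbitrary):

import numpy as np
from sklearn.datasets import make_blobs

# Four Gaussian blobs in 3 dimensions with different spreads.
X, y = make_blobs(n_samples=200_000, n_features=3, centers=4,
                  cluster_std=[0.5, 1.0, 2.0, 3.5], random_state=0)

np.savetxt("blobs.txt", X, fmt="%.4f", delimiter=" ")

make_moons and make_circles from the same module produce non-convex shapes that k-means will struggle with, which is handy for comparing the behaviours described above.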