Difference between gridded data and station data - GrADS

I just started working through the GrADS tutorial, and I'm stuck on the first page, which is about 'gridded data' and 'station data'.
What is the difference between gridded data and station data?
Furthermore, when do we use gridded versus station data?

Station data contains observations at different "stations": readings from ASOS sites or other observing platforms, for example. Gridded data is more like model output. You have a grid on which the numerical model is run, so you have a data value at every grid point, both horizontally and vertically.
As for usage, it depends on what you are trying to visualize or analyze.
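To make the distinction concrete, here is a minimal sketch in Python (the grid spacing, station IDs, and all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Gridded data: a value at EVERY point of a regular grid, e.g. model
# output of temperature on a (time, lat, lon) lattice.
lats = np.arange(-90, 90.1, 2.5)                 # 2.5-degree grid (assumed)
lons = np.arange(0, 360, 2.5)
temps = np.random.rand(4, lats.size, lons.size)  # 4 time steps, synthetic

# Station data: irregularly located observation points, one row per report.
stations = pd.DataFrame({
    "station_id": ["KJFK", "KORD", "KSEA"],      # ASOS-style identifiers
    "lat": [40.64, 41.98, 47.44],
    "lon": [-73.78, -87.90, -122.31],
    "temp_c": [21.3, 18.9, 15.2],                # synthetic readings
})
```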
Let me know if you have other questions; I have a lot of experience with GrADS and atmospheric science gridded data.

How can I combine two time-series datasets with different time-steps?

I want to train a multivariate LSTM model using data from two datasets, MIMIC-I and MIMIC-III. The problem is that the vital signs in the first dataset are recorded minute by minute, while in MIMIC-III the data is recorded hourly, so there is an interval difference between how the data is recorded in the two datasets.
I want to predict diagnoses from the vital signs by giving streams/sequences of vital signs to my model every 5 minutes. How can I merge both datasets for my model?
You need to be able to find a common field on which you can do a merge, e.g. patient_ids or the like. You can do the same with ICU episode identifiers. It's been a while since I worked on the MIMIC dataset, so I can't recall exactly what those fields were.
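As a minimal sketch of such a merge in pandas (patient_id and charttime are assumed field names for illustration, not necessarily the real MIMIC columns):

```python
import pandas as pd

# Tiny synthetic stand-ins; check the actual MIMIC schemas for real names.
m1 = pd.DataFrame({"patient_id": [1, 1],
                   "charttime": pd.to_datetime(["2024-01-01 00:00",
                                                "2024-01-01 00:05"]),
                   "heart_rate": [72, 75]})
m3 = pd.DataFrame({"patient_id": [1],
                   "charttime": pd.to_datetime(["2024-01-01 00:00"]),
                   "resp_rate": [16]})

# Merge on the shared keys; how="left" keeps every 5-minutely row.
merged = m1.merge(m3, on=["patient_id", "charttime"], how="left")
```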
Dataset   | Granularity | Subsampling for 5-minutely
----------|-------------|---------------------------
MIMIC-I   | Minutely    | Subsample every 5th reading
MIMIC-III | Hourly      | Interpolate the 5-minutely readings between each pair of consecutive hourly readings
The interpolation method you choose to get the between-hour readings could be as simple as forward-filling the last value. If the readings are more volatile, a more complex method may be appropriate.
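A minimal sketch of both operations with pandas (the synthetic frames and column name are stand-ins, not real MIMIC fields):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two datasets.
minutely_idx = pd.date_range("2024-01-01", periods=120, freq="min")
mimic1_vitals = pd.DataFrame({"heart_rate": np.random.randint(60, 100, 120)},
                             index=minutely_idx)

hourly_idx = pd.date_range("2024-01-01", periods=3, freq="h")
mimic3_vitals = pd.DataFrame({"heart_rate": [72, 80, 76]}, index=hourly_idx)

# MIMIC-I: subsample every 5th minutely reading.
mimic1_5min = mimic1_vitals.resample("5min").first()

# MIMIC-III: upsample to a 5-minute grid, forward-filling the last value.
mimic3_5min = mimic3_vitals.resample("5min").ffill()

# A smoother alternative: time-based linear interpolation.
mimic3_interp = mimic3_vitals.resample("5min").interpolate(method="time")
```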

Does Google's AutoML Table shuffle my data samples before training/evaluation?

I searched through the documentation but still have no clue whether or not the service shuffles the data before training/evaluation. I need to know this because my data is a time series, so a realistic evaluation would train the model on earlier samples and evaluate it on later ones.
Can someone please let me know the answer, or guide me on how to figure this out?
I know that I can export the evaluation results and work on them, but BigQuery does not seem to respect the order of the original data, and there is no absolute time feature in the data.
It doesn't shuffle the data, but it does split it.
Take a look here: About controlling data split. It says:
By default, AutoML Tables randomly selects 80% of your data rows for training, 10% for validation, and 10% for testing.
If your data is time-sensitive, you should use the Time column.
By using it, AutoML Tables will use the earliest 80% of the rows for training, the next 10% of rows for validation, and the latest 10% of rows for testing.
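If you ever need to reproduce or verify that behaviour on exported data, a chronological 80/10/10 split is easy to do by hand. This is a generic sketch, not AutoML's actual implementation:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str):
    """Split a frame into train/validation/test (80/10/10) by time order."""
    df = df.sort_values(time_col)
    n = len(df)
    train = df.iloc[: int(n * 0.8)]
    valid = df.iloc[int(n * 0.8) : int(n * 0.9)]
    test = df.iloc[int(n * 0.9) :]
    return train, valid, test
```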

How to train an LSTM for multiple time series data, in both univariate and multivariate scenarios?

I have data for hundreds of devices (pardon me, I am not specifying much detail about the devices or the data recorded for them). For each device, data is recorded on an hourly basis.
The recorded data has 25 dimensions.
I have a few prediction tasks, including time series forecasting, where I am using an LSTM. Because I have hundreds of devices, and each device is a (multivariate) time series, in total my data is multiple time series with multivariate data.
To deal with the multiple time series, my first approach is to concatenate the data one series after another, treat the result as one time series (it can be either univariate or multivariate), and train my LSTM model on it.
But with this approach (concatenating the time series), I am actually losing the time property of my data, so I need a better approach.
Please suggest some ideas or blog posts.
Kindly don't confuse multiple time series with multivariate time series data.
You may consider a one-fits-all model or Seq2Seq, as e.g. this Google paper suggests. The approach works as follows:
Let us assume that you want to make a 1-day-ahead forecast (24 values) and you are using the last 7 days (7 * 24 = 168 values) as input.
In time series analysis the data is time dependent, so you need a validation strategy that respects this time dependence, e.g. a rolling-forecast approach. Also set aside hold-out data for testing your final trained model.
In the first step, you generate slices of length 168 + 24 out of your many time series (see the Google paper for an illustration). Each x input will have length 168 and each y target length 24. Use all of your generated slices, pooled across devices, to train the LSTM/GRU network, and finally predict on your hold-out set.
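A minimal sketch of the slicing step and a simple multi-output LSTM in Keras (the window sizes come from the answer; the layer sizes, synthetic data, single-target assumption, and the Dense head standing in for a full Seq2Seq decoder are mine):

```python
import numpy as np
import tensorflow as tf

IN_LEN, OUT_LEN = 168, 24   # 7 days in, 1 day out
N_FEATURES = 25             # dimensions recorded per hour

def make_slices(series, in_len=IN_LEN, out_len=OUT_LEN, target_col=0):
    """Cut one (time, features) array into sliding (x, y) windows."""
    xs, ys = [], []
    for start in range(len(series) - in_len - out_len + 1):
        xs.append(series[start : start + in_len])
        ys.append(series[start + in_len : start + in_len + out_len, target_col])
    return np.array(xs), np.array(ys)

# Pool slices from every device into one training set ("one-fits-all").
device_series = [np.random.rand(1000, N_FEATURES) for _ in range(5)]  # synthetic
xs, ys = zip(*[make_slices(s) for s in device_series])
X, y = np.concatenate(xs), np.concatenate(ys)

# Forecast all 24 values at once with a Dense output head.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(IN_LEN, N_FEATURES)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(OUT_LEN),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32)
```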
Good papers on this issue:
- Foundations of Sequence-to-Sequence Modeling for Time Series
- Deep and Confident Prediction for Time Series at Uber
More:
- Kaggle Winning Solution
- Kaggle Web Traffic Time Series Forecasting
The list is not comprehensive, but you can use it as a starting point.

Objective vs. subjective text classifier

I am trying to build a classifier for subjective and objective text using IMDb data. For objective data points I am using the movies' plot summaries as input, whereas for subjective data points I am using reviews of the movies.
I took each complete plot summary as one data point, whereas for reviews, each review by a single user is a single data point. In my database, different reviews of the same movie by different users are entered as separate data points.
After this I cleaned special characters from the words, removed stop words, calculated the information gain to create the word dictionary, and applied Naive Bayes using word frequencies to calculate the probabilities.
Now my questions are:
1. Is my algorithm for building the classifier correct?
2. My classifier is heavily biased toward "objective". Am I making a mistake in the creation of the training data?
3. I want to create a generic classifier that can be used for tweets or text extracted from blogs. Is movie review data sufficient? Right now it is not working even for movie review data.
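For reference, the pipeline described in the question maps roughly onto this scikit-learn sketch (a minimal toy version; the synthetic texts and the mutual-information feature selection standing in for information gain are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus: plot summaries (objective) vs. user reviews (subjective).
texts = [
    "A detective investigates a series of murders in 1940s Los Angeles.",
    "The film follows a young wizard attending a school of magic.",
    "I absolutely loved this film, the acting was superb!",
    "What a boring, predictable mess. I want my two hours back.",
]
labels = [0, 0, 1, 1]  # 0 = objective, 1 = subjective

clf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),   # tokenize, drop stop words
    ("select", SelectKBest(mutual_info_classif, k=10)),  # information-gain-style selection
    ("nb", MultinomialNB()),                             # word-frequency Naive Bayes
])
clf.fit(texts, labels)
print(clf.predict(["An astronaut is stranded on Mars and must survive."]))
```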

Kinect normalize depth

I have some Kinect data of somebody standing (reasonably) still and performing sets of punches. It is given to me as an x, y, z coordinate for each of the 20 joints, so I have 60 data points per frame.
I'm trying to perform a classification task on the punches; however, I'm having some problems normalising my data. As you can see from the graph, there are sections with much higher 'amplitude' than the others; my belief is that this is due to how close the person was to the Kinect sensor when the readings were taken. (The graph is actually the first principal component obtained by PCA for each frame; multiple sequences of the same punch are strung together in the graph.)
Looking back at the data files, it looks like the sequences that are 'out' have a z coordinate (depth from the sensor) of ~2.7, whereas the others tend to hover around 3.3-3.6.
How can I perform a normalization with the depth values to make the sequences closer to each other? I've already tried differentiating to get the velocity; although that helps to normalise, the output ends up too similar and makes it very hard to classify.
Edit: I should mention that I am already normalizing by subtracting the hip position from each joint, in an attempt to make the coordinates relative.
The Kinect can output some strange values when the tracked person is standing near the edges of the Kinect's field of view. I would either completely ignore those values or replace them with the average of the previous 2 and the next 2 readings.
For example:
1,2,1,12,1,2,3
Replace 12 with (2 + 1 + 1 + 2) / 4 = 1.5
You can do this across the whole array of values; this way you get a more normalised line/graph, as the sketch below shows.
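A minimal sketch of that neighbour-averaging in Python (the threshold rule for deciding what counts as a 'strange' value is my assumption; tune it for your data):

```python
import numpy as np

def smooth_outliers(values, threshold=2.0):
    """Replace outliers with the mean of the two previous and two next values.

    A reading counts as an outlier when it deviates from that local mean by
    more than `threshold` times the series' standard deviation (assumed rule).
    """
    values = np.asarray(values, dtype=float)
    out = values.copy()
    std = values.std()
    for i in range(2, len(values) - 2):
        local_mean = (values[i-2] + values[i-1] + values[i+1] + values[i+2]) / 4
        if abs(values[i] - local_mean) > threshold * std:
            out[i] = local_mean
    return out

print(smooth_outliers([1, 2, 1, 12, 1, 2, 3]))  # the 12 becomes 1.5
```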
You can also use the clippedEdges value to determine whether one or more joints are outside the view.