Data updates with featuretools / DFS - data-science

In the ML 2.0 and AI PM papers it implies update data - which could be either existing data or new data - happens dynamically (in real-time). For example, in the AI PM paper it says, "Rather, we have demonstrated a complete system that works in the real world, on continually updating live data."
Do you mean update data is automatically pre-processed into appropriate feature vectors and included in the next model re-training cycle? Or, is the model being updated dynamically?

In this case, the data update means new data is automatically appended to existing data and then transformed into new feature vectors. These feature vectors can be used to retrain the model or score using an existing model.
The automation is that the feature engineering on the new data may depend on historical data to compute, so the APIs in Featuretools aim to abstract that away as much as possible from the developer. This is achieved using the Entityset.concat(..) method.

Related

Implement data generator in federated training

(I have posted the question on https://github.com/tensorflow/federated/issues/793 and maybe also here!)
I have customized my own data and model to federated interfaces and the training converged. But I am confused about an issue that in an images classification task, the whole dataset is extreme large and it can't be stored in a single federated_train_data nor be imported to memory for one time. So I need to load the dataset from the hard disk in batches to memory real-timely and use Keras model.fit_generator instead of model.fit during training, the approach people use to deal with large data.
I suppose in iterative_process shown in image classification tutorial, the model is fitted on a fixed set of data. Is there any way to adjust the code to let it fit to a data generator?I have looked into the source codes but still quite confused. Would be incredibly grateful for any hints.
Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.
In fact, when writing TFF, there are generally three levels at which one may be writing:
TensorFlow defining local processing (IE, processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only a single placement.
Native TFF defining the way data is communicated across placements. For example, writing tff.federated_sum inside of a tff.federated_computation decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".
Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.
If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data in to a federated computation becomes relatively simple; it is just done at the Python level.
One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then you can instantiate a new list of tf.data.Datasetsand pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop would be here, in the code associated to this paper.
One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded in when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of tf.data.Dataset more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource for this construct; hopefully that may help the mental model on what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.

Data Model/Schema decoupling in Data Processing Pipeline suing Event Driven Architecture

I was wondering how Microservices in the Streaming Pipeline based on Event Driven Architecture can be truly decoupled from the data model perspective. We have implemented a data processing pipeline using Event-Driven Architecture where the data model is very critical. Although all the Microservices are decoupled from the business perspective, they are not truly decoupled as the data model is shared across all the services.
In the ingestion pipeline, we have collected data from multiple sources where they have a different data model. Hence, a normalizer microservice is required to normalize those data models to a common data model that can be used by downstream consumers. The challenge is Data Model can change for any reason and we should be able to easily manage the change here. However, that level of change can break the consumer applications and can easily introduce a cascade of modification to all the Microservices.
Is there any solution or technology that can truly decouple microservices in this scenario?
This problem is solved by carefully designing the data model to ensure backward and forward compatibility. Such design is important for independent evolution of services, rolling upgrades etc. A data model is said to be backward compatible if a new client (using new model) can read / write the data written by another client (using old model). Similarly, forward compatibility means a client (using old data model) can read / write the data written by another client (using new data model).
Let's say in a Person object is shared across services in a JSON encoded format. Now a one of the services introduces a new field alternateContact. A service consuming this data and using the old data model can simply ignore this new field and continue its operation. If you're using Jackson library, you'd use #JsonIgnoreProperties(ignoreUnknown = true). Thus the consuming service is designed for forward compatibility.
Problem arises when the service (using old data model) deserializes a Person data written with the new model, updates one or more field values and writes the data back. Since the unknown properties are ignored, the write will result in data loss.
Fortunately, binary encoding format such as Protocol Buffer 3.5 and later versions preserve unknown fields during deserialization using old model. Thus when you serialize the data back, the new fields remain as is.
There maybe other data model evolutions hou need to deal with like field removal, field rename etc. The nasic idea is you need to be aware of and plan for these possibilities early on in the design phase. The common data encoding formats are JSON, Apache Thrift, Protocol Buffer, Avro etc.

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the doc but I could not find a method for it. I want to de cross validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that an has been answered in another question. I'm asking on how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation partition the dataset into multiple "shards". Note that for best performance, sharding should be to data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
I am afraid you cannot. The dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole -- in that regards it might be a bit of a misnomer.
Also, if you could, this would probably be a bad idea. You would rather have this train/test split done once and for all.
it let you review those sets offline
if the split is done each time you run an experiment there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset)
See also a related question about how to split a set into training & testing in tensorflow.

How Data and String are treated in graphlab

I am having large date set in which some of columns are Date and other are categorical Data like Status, Department Name, Country Name.
So how this data is treated in graphlab when i call the graphlab.linear_regression.create method, does i have to pre-process this data and convert them into numbers or can directly provide to graphlab.
Graphlab is mostly used for computing tabular and graph based datasets, and have high scalability and performance. In graphlab.linear_regression.create, graphlab have inbuilt feature of understanding the type of data and giving most suitable method of linear regression for optimizing results. For Example, for numeric data of target and feature both, most of the time, graphlab takes Newtons Method of linear regression. Similarly, depending on the dataset, understands the need and gives method accordingly.
Now, about preprocessing, graphlab only takes SFrame for learning that need to be parsed correctly before any learning. While creating an SFrame, unprocessed and error creating data are always reflected and throws an error. So, in order to go through any learning, you need to have a clean data. If SFrame accepts the data, and also your chosen target and feature for learning that you want, you are good to go but pre-processing and cleaning data is always recommended. Also, its always a good practice to do feature engineering before any learning algorithm, and redefining data types before learning is always recommended for accuracy.
About your point on how data is treated in Graphlab, I would say, it depends!. Some datasets are tabular and are treated accordingly and some in graph structure. Graphlab performs very well when comes to regression tree and boosted classifiers which follows decision tree concept and are quite time and resource consuming in other libraries than graphlab.
For me, graphlab performed very well while creating recommendation engine where I had dataset of nodes and edges and boosted tree classifier with 18 iterations too worked flawless in quite scalable time and I must say, even for tree structured data, graphlab performs very well. I hope this answer helps.

analysis Fitbit walking and sleeping data

I'm participating in small data analysis competition in our school.
We use Fitbit wearable devices, which is loaned to each participants by host of contest.
For 2 months during the contest, they walk and sleep with this small device 24/7,
allow it to gather data about participant's walk count with heart rate(bpm), etc.
and we need to solve some problems based on these participants' data
like, example,
show the relations between rainy days and participants' working out rate using the chart,
i think purpose of problem is,
because of rain, lot of participants are expected to be at home.
can you show some cause and effect numerically?
i'm now studying python library numpy, pandas with ipython notebook.
but still i have no idea about solving these problems..
could you recommend some projects or sites use for references? i really eager to win this competition.:(
and lastly, sorry for my poor English.
Thank you.
that's a fun project. I'm working on something kind of similar.
Here's what you need to do:
Learn the fitbit API and stream the data from the fitbit accelerometer and gyroscope. If you can combine this with heart rate data, great. The more types of data you have, the more effective your algorithm will be. You can store this data in a simple csv file (streaming the accel/gyro data at 50Hz is recommended). Or setup a web server and store it in a database for easy access
Learn how to use pandas and scikit learn
[optional but recommended]: Learn matplotlib so you can graph you data and get a feel for how it looks
Load the data into pandas and create features on the data - notably using 1-2 second sliding window analysis with 50% overlap. Good features include (for all three Accel X, Y, Z): max, min, standard deviation, root mean square, root sum square and tilt. Polynomials will help.
Since this is a supervised classification problem, you will need to create some labelled data - so do this manually (state 1 = rainy day, state 2 = non-rainy day) and then train a classification algorithm. I would recommend a random forest
Test using unlabeled data - don't forget to use cross validation
Voila, you now have a highly accurate model and will win the competition. Plus you've learned about a bunch of really cool Python and machine learning stuff.
For more tutorials on how all this stuff works, I'd highly recommend the Kaggle tutorial projects
BONUS: If you want to take it to a new level, you can start adding smoothers on top of your classifier, for example by using a Hidden Markov Model as explained in this talk
BONUS 2: Go get a PhD in Human Activity Recognition.