Using statistics to extract missing variables in a given dataset?

I would like to know which statistical approaches are best suited in data science for introducing new features into a given dataset.
Thanks!

Related

What is the proper way of using featuretools for single table data?

Assume that I have a dataset consisting of a single table; for instance, you can consider the Titanic dataset on Kaggle.
Now, what is the proper way of using Featuretools to get the most benefit from it, given that Featuretools is designed especially for relational data?
By 'proper' I mean: I know that when creating the EntitySet the index parameter will just be the index of the dataset, but what should my new index be when normalizing the entity? Also, is it okay to use RFE blindly for feature selection?
You can get the most benefit from Featuretools by normalizing the entity set. The more normalized an entity set is, the more DFS can leverage the relational structure to generate better features.
The objective of the normalization process is to eliminate redundant data. So, the new index with additional variables should be one that helps towards this objective. This guide goes into more depth on creating an entity from a de-normalized table.
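For concreteness, here is a minimal sketch of that workflow, assuming the pre-1.0 Featuretools API (entity_from_dataframe / normalize_entity) and a local copy of the Kaggle Titanic CSV; column names are from that dataset:

```python
# Sketch only: normalize a single-table EntitySet so DFS can use the
# added relational structure. Assumes Featuretools < 1.0 and titanic.csv.
import featuretools as ft
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed local copy of the Kaggle data

es = ft.EntitySet(id="titanic")
es = es.entity_from_dataframe(entity_id="passengers",
                              dataframe=df,
                              index="PassengerId")

# Pull repeated attributes out into their own entities; the new index is the
# column whose values would otherwise be duplicated across rows.
es = es.normalize_entity(base_entity_id="passengers",
                         new_entity_id="classes",
                         index="Pclass")
es = es.normalize_entity(base_entity_id="passengers",
                         new_entity_id="embark_ports",
                         index="Embarked")

# DFS can now aggregate across the parent entities as well.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="passengers",
                                      max_depth=2)
```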
For feature selection, I think RFE can be used judiciously, with the objectives of improving the accuracy and reducing the complexity of the model.
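As a hedge against using RFE "blindly", a small sketch with scikit-learn's RFECV, which picks the number of features by cross-validation (shown here on synthetic data rather than the Titanic table), could look like this:

```python
# Sketch only: recursive feature elimination with cross-validated
# selection of the number of features, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask of selected features
```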

What is better for managing data in Python: Orange.data.Table or Pandas?

I am doing data mining and I don't know whether to use Table or Pandas.
Any information that helps me select the most suitable library for managing my dataset would be welcome. Thanks for any answer that helps me with this.
I am an Orange programmer, and I'd say that if you are writing python scripts to analyze data, start with numpy + sklearn or Pandas.
To create an Orange.data.Table, you need to define a Domain, which Orange uses for data transformations. Thus, tables in Orange are harder to create (but can, for example, provide automatic processing of testing data).
Of course, if you need to interface something specific from Orange, you will have to make a Table.
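To illustrate the difference, here is a minimal sketch assuming Orange 3 and a toy two-column dataset: pandas wraps raw arrays directly, while Orange requires an explicit Domain describing each column first.

```python
# Sketch only: the same small dataset as a pandas DataFrame and as an
# Orange Table (Orange 3 API assumed).
import numpy as np
import pandas as pd
from Orange.data import ContinuousVariable, DiscreteVariable, Domain, Table

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])

# pandas: just wrap the array
df = pd.DataFrame(X, columns=["x1", "x2"])
df["target"] = y

# Orange: declare the variables first, then build the Table
domain = Domain(
    [ContinuousVariable("x1"), ContinuousVariable("x2")],
    DiscreteVariable("target", values=["no", "yes"]),
)
table = Table.from_numpy(domain, X, y)
```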

Neural Network: Convert HTML Table into JSON data

I'm kind of new to neural networks and just started learning to code them by trying some examples.
Two weeks ago I was searching for an interesting challenge and I found one. But I'm about to give up because it seems to be too hard for me... Still, I was curious to know whether any of you are able to solve this.
The Problem: Assume there are ".htm" files that contain tables about the same topic, but the table structure isn't the same for every file. For example: we have a lot of ".htm" files containing information about teacher substitutions per day per school. Because the structure of those ".htm" files isn't the same for every file, it would be hard to program a parser that could extract the data from those tables. So my thought was that this is a task for a neural network.
First Question: Is this a task a neural network can/should handle, or am I mistaken about that?
Because a neural network seemed to me to fit this kind of challenge, I tried to think of an input. I came up with two options:
First Input Option: Take the HTML code (only from the body tag) as a string and convert it to a tensor.
Second Input Option: Convert the HTML tables into images (via Canvas, maybe) and feed this input to the DNN through Conv2D layers.
Second Question: Are those options any good? Do you have a better solution for this?
After that I wanted to figure out how I would make a DNN output this heavily dynamic data for me. My thought was to convert my desired JSON output into tensors and feed them to the DNN while training, and for every prediction I would expect the DNN to return a tensor that is convertible into a JSON output...
Third Question: Is it even possible to get such a detailed output from a DNN? And if yes: do you think the output would be suitable for this task?
Last Question: Assuming all my assumptions are correct - wouldn't training this DNN take forever? Let's say you have an RTX 2080 Ti for it. What would you guess?
I guess that's it. I hope I can learn a lot from you guys!
(I'm sorry about my bad English - it's not my native language)
Addition:
Here is a more in-depth example. Let's say we have a ".htm" file that looks like this:
The task would be to get all the relevant information from this table. For example:
All Students from Class "9c" don't have lessons in their 6th hour due to cancellation.
1) This is not a particularly suitable problem for a neural network, as your domain is structured data with clear dependencies inside. Tree-based ML algorithms tend to show much better results on such problems.
2) Both of your choices of input are very unstructured; learning from such data would be nearly impossible. There are clear ways to give more knowledge to the model. For example, you have the same data in different formats, and the difference is only the structure. That means the model needs to learn a mapping from one structure to another; it doesn't need to know anything about the data itself. Hence, words can be tokenized with unique identifiers to remove unnecessary information. The .htm data can be parsed into a tree, as can the JSON. Then there are different ways to represent graph structures that can be used in an ML model (see the sketch below).
3) It seems that the only adequate option for the output is a sequence of identifiers pointing to unique entities from the text. The whole problem is then similar to seq2seq, which is best solved by RNNs with an encoder-decoder architecture.
I believe that, if there is enough data and the .htm files don't contain a huge amount of noise, the task can be completed. Training time depends hugely on the selected model and its complexity, as well as on the diversity of the initial data.
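As a rough illustration of point 2, here is a minimal sketch assuming BeautifulSoup and a hypothetical "substitutions.htm" file: the markup is parsed into rows of cell strings, and each distinct string is replaced by an integer id, so a model would only have to deal with structure plus a small vocabulary.

```python
# Sketch only: parse HTML tables into a JSON-like tree and tokenize
# the cell text. File name and vocabulary scheme are illustrative.
import json
from bs4 import BeautifulSoup

def table_to_rows(html: str):
    """Extract every table in the document as a list of rows of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
            rows.append(cells)
        tables.append(rows)
    return tables

def tokenize(tables):
    """Replace each distinct cell string with an integer id (toy vocabulary)."""
    vocab = {}
    tokenized = []
    for rows in tables:
        tokenized.append(
            [[vocab.setdefault(cell, len(vocab)) for cell in row] for row in rows]
        )
    return tokenized, vocab

html = open("substitutions.htm", encoding="utf-8").read()  # hypothetical file
tables = table_to_rows(html)
ids, vocab = tokenize(tables)
print(json.dumps(tables, ensure_ascii=False, indent=2))
```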

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the docs but I could not find a method for it. I want to do cross-validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied to the data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
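A minimal sketch of both transformations, assuming TensorFlow 2.x in eager mode (real cross-validation fold assignment usually needs more care, e.g. shuffling and stratification):

```python
# Sketch only: shard splits a dataset into disjoint partitions;
# window builds a dataset whose elements are themselves datasets.
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# 1) shard: every num_shards-th element, starting at `index`
fold_0 = dataset.shard(num_shards=5, index=0)   # elements 0, 5
fold_1 = dataset.shard(num_shards=5, index=1)   # elements 1, 6

# 2) window: sub-datasets of size 2
windows = dataset.window(size=2, shift=2, drop_remainder=True)
for w in windows:
    print(list(w.as_numpy_iterator()))  # [0, 1], [2, 3], ...
```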
I am afraid you cannot. The Dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole -- in that regard it might be a bit of a misnomer.
Also, if you could, this would probably be a bad idea. You would rather have the train/test split done once and for all:
it lets you review those sets offline;
if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset).
See also a related question about how to split a set into training & testing in tensorflow.

Is there a way to efficiently index table inheritance in MSSQL?

I am toying with the idea of moving from a Table Per Hierarchy model to a Table Per Type model.
I haven't been able to find any conclusive material on whether there is an efficient way (from a performance perspective) to do this.
Are there any indexing techniques that can be used to ensure good performance with large datasets in a Table Per Type database model?