What is the proper way of using featuretools for single table data?

Assume that I have a dataset consisting of a single table; for instance, consider the Titanic dataset on Kaggle.
What is the proper way of using Featuretools to get the most benefit from it, given that Featuretools is designed especially for relational data?
By 'proper' I mean: I know that when creating the entity set, the index parameter will just be the index of the dataset, but what should my new index be when normalizing the entity? Also, is it okay to use RFE blindly for feature selection?

You can get the most benefit from Featuretools by normalizing the entity set. The more normalized the entity set is, the better DFS can leverage the relational structure to generate better features.
The objective of the normalization process is to eliminate redundant data, so the new index (together with any additional variables you move into the new entity) should be chosen with that objective in mind. This guide goes into more depth on creating an entity from a de-normalized table.
For feature selection, I think RFE can be used judiciously, with the goal of improving the accuracy and reducing the complexity of a model.
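For concreteness, here is a minimal sketch using the pre-1.0 Featuretools API (method names changed in 1.x, e.g. normalize_entity became normalize_dataframe) with the Kaggle Titanic column names; the file path, the choice of Pclass as the new index, and the RFE settings are illustrative assumptions, not recommendations:
import featuretools as ft
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("titanic/train.csv")  # hypothetical path to the Kaggle CSV

es = ft.EntitySet(id="titanic")
es = es.entity_from_dataframe(entity_id="passengers",
                              dataframe=df,
                              index="PassengerId")

# Normalizing pulls a repeated value (here the ticket class) out into its own
# entity so DFS can build per-class aggregations; columns that depend only on
# the new index could be moved over via additional_variables.
es = es.normalize_entity(base_entity_id="passengers",
                         new_entity_id="classes",
                         index="Pclass")

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="passengers")

# RFE used "judiciously": numeric features only, label dropped to avoid leakage.
X = (feature_matrix.select_dtypes("number")
                   .drop(columns=["Survived"], errors="ignore")
                   .fillna(0))
y = df.set_index("PassengerId").loc[X.index, "Survived"]
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
selector.fit(X, y)
selected = X.columns[selector.support_]
The number of features to keep is something to tune with cross-validation rather than fix blindly.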

Related

Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are encoded using a one-hot encoder.
I want to use PCA (Principal Component Analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to apply PCA to just some columns of the dataframe, keep the other columns as they are, and then use a machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones and then use the other, non-merged features alongside the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.
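If you want to try it, scikit-learn's ColumnTransformer does exactly this kind of partial transform; a minimal sketch, where the file name, the column split and the numbers of components and clusters are illustrative assumptions:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

df = pd.read_csv("data.csv")  # hypothetical: the 32-column frame, already one-hot encoded

onehot_cols = [c for c in df.columns if c.startswith("cat_")]  # assumed naming convention

# PCA is applied only to the one-hot block; remainder="passthrough"
# keeps the other columns exactly as they are.
preprocess = ColumnTransformer([("pca", PCA(n_components=5), onehot_cols)],
                               remainder="passthrough")

model = make_pipeline(preprocess, KMeans(n_clusters=4, n_init=10))
labels = model.fit_predict(df)
Whether to also scale the passthrough columns before clustering is a separate decision, and the correlation check suggested above is a reasonable way to sanity-check the split.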

Is there a way to efficiently index table inheritance in MSSQL?

I am toying with the idea of moving from a Table Per Hierarchy model to a Table Per Type model.
I haven't been able to find any conclusive material on whether there is an efficient way (from a performance perspective) to do this.
Are there any indexing techniques that can be used to ensure good performance with large datasets in a Table Per Type database model?

Where should I begin with this database design?

I have 5 tables all unnormalised and I need to create an ER model, a logical model, normalise the data and also a bunch of queries.
Where would you begin? Would you normalise the data first? Create the ER model and the relationships?
There are two ways to start data modelling: top-down and bottom-up.
The top-down approach is to ask what things (tangible and intangible) are important to your system. These things become your entities. You then go through the process of figuring out how your entities are related to each other (your relationships) and then flesh out the entities with attributes. The result is your conceptual or logical model. This can be represented in ERD form if you like.
Either along the way or after your entities, relationships and attributes are defined, you go through the normalization process and make other implementation decisions to arrive at your physical model - which can also be represented as an ERD.
The bottom-up approach is to take your existing relations - i.e. whatever screens, reports, datastores, or whatever existing data representations you have and then perform a canonical synthesis to reduce the entire set of data representation into a single, coherent, normalized model. This is done by normalizing each of your views of data and looking for commonalities that let you bring items together into a single model.
Which approach you use depends a little bit on personal choice, and quite a bit on whether you have existing data views to start from.
I think you should first prepare the list of entities and attributes, so that you get a complete picture of the data.
Once you are clear about the data, you can start creating the master tables and then normalize them.
After the complete database has been designed and normalized, you can create the ER diagram very easily.
I would start by evaluating and then preparing the list of entities and attributes within your data.
I would do it in this order:
Identify the relationships.
Create the ER model.
Normalise the data.
I know many others will have a different opinion, but this is the way I would go ahead with it :)

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but cannot see how I can associate with a document my own custom feature vector with precomputed weights. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically a zset is a sorted set of key-value pairs.
So you can have a sorted set per feature, where each feature maps to its postings:
feature -> [ { docid, score }, { docid, score }, ... ]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
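For illustration, a minimal sketch of that layout with the redis-py client (the key names and weights are made up, and it assumes a Redis server on localhost):
import redis  # pip install redis

r = redis.Redis()  # assumes a local Redis server

# one sorted set per feature: member = doc id, score = precomputed weight
r.zadd("feat:color=red", {"doc1": 0.8, "doc7": 0.3})
r.zadd("feat:shape=round", {"doc1": 0.5, "doc9": 0.9})

# weighted union of the posting lists of the query features
r.zunionstore("tmp:query", {"feat:color=red": 1.0, "feat:shape=round": 0.5})

# top 10 candidates by accumulated score, highest first
top = r.zrevrange("tmp:query", 0, 9, withscores=True)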
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allow you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index, and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just look up the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
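For the pair-wise case, the lookup-and-score path can be as small as this sketch, which assumes the vectors are stored as {feature: weight} dicts:
from math import sqrt

def cosine(u, v):
    # iterate over the smaller dict; only shared features contribute to the dot product
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0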
The right place to look would be org.apache.lucene.search.Similarity; you should be able to adapt it to your needs and set your version as the default with setDefault(Similarity similarity).
I would be careful with expectations for speed gains (relative to iterating through all pairs), however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first a boolean pass ("are all of the AND terms contained? any of the OR terms?"), then scoring of whatever passes. While for tf-idf you lose nothing on the way, for other scoring functions you might.
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
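As a concrete flavour of LSH for cosine similarity, here is a random-hyperplane sketch (one of several LSH variants; dense numpy vectors are assumed for brevity, and the number of planes is a tuning knob):
import numpy as np

def lsh_bucket_ids(vectors, n_planes=16, seed=0):
    # entities whose vectors point in similar directions tend to get the same
    # sign pattern, hence the same bucket id; score pairs only within a bucket
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_planes))
    bits = (vectors @ planes) >= 0
    return bits.astype(np.int64) @ (1 << np.arange(n_planes))
Candidate pairs are then generated only inside each bucket, or across several independent hash tables to trade recall for speed.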

When to choose Cassandra over a SQL/Semantic Store solution?

I have 30-40 GB of data and 3 developer machines (Core Duo i4, 3GB). The data is a set of graph-like structures, and I have queries that traverse the graphs. Is there a guideline that could help me decide between Cassandra and a classic solution, e.g., SQL or a semantic store? My current plan is to set up Cassandra and see how it works, but I would like to learn more before starting the installation.
I would not use Cassandra for any kind of graph-level structure. It has been about 6 months since I looked into doing something similar, so maybe Cassandra has moved on since then, but I found it was fundamentally limited by the fact that it only has row-level indexes.
For a graph-based structure (assuming a simplistic one-arc-per-row layout) you really need column indexes as well. If you want to traverse the graph, you want to be able to start from a particular node A and find all the arcs that go out from that node (assuming a directed graph); without a column index you'd have to do a row scan of the entire dataset, as there is no built-in functionality for saying "give me the rows that have A in a particular column".
To achieve this you effectively have to design a data layout for Cassandra that gives you an inverted index. This is somewhat tricky and requires you to know ahead of time the types of queries that you want to answer; answering new types of queries at a later date may be very difficult or impossible if you don't design well. These slides demonstrate the idea, but I hope it is clear that you effectively have to construct your own indexes.
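To make the "construct your own indexes" point concrete, here is a sketch of a query-driven layout using the Python cassandra-driver; the keyspace and table names are made up, and it assumes a running cluster:
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("graphks")  # hypothetical keyspace

# The partition key plays the role of the inverted index: one table per
# traversal direction you need to support.
session.execute("""CREATE TABLE IF NOT EXISTS arcs_by_source (
    source text, target text, weight double,
    PRIMARY KEY (source, target))""")
session.execute("""CREATE TABLE IF NOT EXISTS arcs_by_target (
    target text, source text, weight double,
    PRIMARY KEY (target, source))""")

# every arc is written twice, once per table
session.execute("INSERT INTO arcs_by_source (source, target, weight) VALUES (%s, %s, %s)",
                ("A", "B", 1.0))
session.execute("INSERT INTO arcs_by_target (target, source, weight) VALUES (%s, %s, %s)",
                ("B", "A", 1.0))

# "all arcs leaving A" is now a single-partition read instead of a full scan
rows = session.execute("SELECT target, weight FROM arcs_by_source WHERE source = %s", ("A",))
Query shapes that were not anticipated would need new tables, which is exactly the design constraint described above.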
For graph structures that can be decomposed to triples, consider an RDF store; for more complex structures, consider a full-blown graph database. If you really want to do NoSQL, you can probably build something on top of a document database, as they tend to have much better indexing, but again you'll have to think carefully about how you store your data.