Data Visualization: Plotting friendship relations - data-visualization

I guess those who have worked in communities and social networks might have some experience in this.
I am trying to plot a graph of all the friendships that exists on my site and in doing so identify clusters of strongly interconnected users.
Does anyone have any experience in doing something like this? Also, does SQL Server 2008 BI have tools that allows for this type of modelling?
Thanks

Programming Collective Intelligence's chapter 5 is dedicated to optimization and network visualization. Using the modules available here and the snippet below, I could make the following image:
>>> import optimization
>>> import socialnetwork
>>> sol = optimization.annealingoptimize(socialnetwork.domain, socialnetwork.crosscount, step=50, cool=0.99)
>>> socialnetwork.drawnetwork(sol)
The advantages of this approach is that you can easily change the cost function, use different optimization algorithms, or use another library to view the solution.

Take a look at neato from the Graphviz command line tool suite. AS input it takes a so called .dot file. The format is straight forward you should just be able to iterate over all friendship relations in your system and write them into the file.

For inspiration, take a look at these social graphs from "Visual Complexity" collection.
Many visualizations have explanatory papers and articles mentioning graphing tools, libraries and algorithms used to obtain the images.
Examples from "Social Networks" category:

Your graph will be probably reasonably large, so GraphViz is a poor choice. It does a nice job for tiny graphs, but not for huge ones. I'd recommend that you try aiSee instead (here are some example graphs). It requires graphs to be specified in a simple human-readable format called GDL.
(source: aisee.com)
Sample social network http://www.aisee.com/graph_of_the_month/pubmed5.gif
(source: aisee.com)

For visualization, have a look at the Javascript Infovis Toolkit.

You might take a look at the Girvan-Newman algorithm, the output of which gives you an idea of community structure in the form of a dendrogram.

You should look at Mark Shepherd's SpringGraph which is a neat and sexy way of showing big graphs.

Please take a look at the prefuse visualization toolkit

Check out Wikipedia -- Social Network which does talk about social network analysis and graphing relations between users. I think the basic idea is you use a graph to map all the relations and then the more shared relations there are, the higher the interconnected relationships.

Related

What is needed for a recommendation engine based on word/text input

I'm new to the Machine-Learning (AI) technology. I'm developing a messenger app for Android/IOs where I would like to recommend the users based on the texts/word/conversation a product from a relative small product portfolio.
Example 1:
In case the user of the messenger writes a sentence including the words "vine", "dinner", "date" the AI should recommend a bottle of vine to the user.
Example 2:
In case the user of the app writes that he has drunk a good coffee this morning, the AI should recommend a mug to the user.
Example 3:
In case the user writes something about a cute boy she met last day, the AI should recommend a "teddy bear" to the user.
I'm a Software Developer since almost 20 year with experience in the development of C/C++/Java based application (Android and IOs apps) as well as some experience in Google Cloud Platform. The ML/AI technology is completely new to me. Okay, I know the basics (input data is needed to train the ML/AI system etc.), but I wonder If there is already a framework which could help me to develop such a system which solves the above described uses-case.
I would appreciate it, if you could give me some hints where and how to start.
Thank you and regards
It is definitely possible to implement such an application, in case you want to do it in Google Cloud you will need some understanding of Tensorflow.
First of all, I recommend to you to do the Machine Learning Crash Course, for a good introduction to Machine Learning and to start to familiarize yourself with TensorFlow. Afterwards I recommend to take a look into Tensorflow tutorials which will give you a more practical introduction to Tensorflow, and include various examples on building/training/testing models.
Once you are famirialized with Tensorflow, you can jump into learning how to run jobs in the Machine Learning engine, you can start by following the quickstart. The documentation includes detailed guides on how to use the ml-engine, plus multiple samples and tutorials.
Since I believe that your application would fall into the Recommender System type, here you can see an example model, in Google Cloud ML Engine, on how to recommend items to users based on his previous searches. In your case, you would have to build a model in order to recommend items to users based on his previous words in the sentence.
The second option, in case you don't want to go through the hassle of building a new model from scratch, would be to use the Google Cloud Natural Language API, which you can understand as pre-trained models using Google (incredibly big) data. In your case, I believe that the Content Classifying API would help you achieve what your application intends to do, however, the outputs (which you can see here) are limited to what the model was trained to do, and might not be specific enough for your application, however it is an easy solution and you can still profit of this API in order to extract labels/information and send it as input to another model.
I hope that these links provide you with some foundations on what is possible to do with Tensorflow in the ML Engine, and are useful to you.

Forming conditional distributions in TensorFlow probability

I am using Tensorflow Probability to build a VAE which includes image pixels as well as some other variables. The output of the VAE:
tfp.distributions.Independent(tfp.distributions.Bernoulli(logits), 2, name="decoder-dist")
I am trying to understand how to form other conditional distributions based on this which I can use with the inference methods (MCMC or VI). Say the output above was P(A,B,C | Z), how would I take that distribution to form a posterior P(A|B, C, Z) that I could perform inference on? I have been trying to read through the docs but I am having some trouble grasping them.
The answer to your question depends very much on the nature of the joint model within which you'd like to do the conditioning. Much has been written about the topic, and in short it's a very hard problem in general :). Without knowing a bit more about the particulars of your problem, it's near impossible to recommend a useful generic inference procedure. However, we do have a bunch of examples (scripts and jupyter/colab notebooks) in the TFP repo here: https://github.com/tensorflow/probability/tree/master/tensorflow_probability/examples
In particular, there's
The Hierarchical Linear Model example, which is a sort of Rosetta stone showing how to do posterior inference using Hamiltonian Monte Carlo (an MCMC technique) in TFP, R, and Stan,
The Linear Mixed Effects Model example, showing how you might use VI to solve a standard LME problem,
among many others. You can click the "Run in Google Colab" link at the top of any of these notebooks to open and run on them on https://colab.research.google.com.
Please feel free, also, to reach out on to us via email at tfprobability#tensorflow.org. This is a public Google Group where users can engage with the team that builds TFP directly. If you provide us some more info there on what you'd like to do, we're happy to provide guidance on modeling and inference with TFP.
Hope this is gives at least a start in the right direction!

Searching for tools to do bayesian network "structure" learning

There are a lot of programs that do parameter learning for Bayes nets. I am having a hard time finding libraries or tools that do (or try to do) structure learning. Specifically, one that uses an information theoretic approach, by looking at the information gain from adding an edge, or analyzing the cross entropy across Random Variables to determine if they have any relationships or are independent. This is not the core problem I am trying to work on, but learning structure is an important part of it. So finding an existing tool/library would help immensely.
Try the bnlearn library. It contains structure learning, parameter learning, inference and various well-known example datasets such as Sprinkler, Asia, Alarm.
Github documentation pages
Examples can be found here
Blog about detecting causal relationships can be found here.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.

How do you find the most discriminant terms in binary document classification?

I want to use feature selection to find the terms in a document that are most useful for a binary classification task.
I've been looking around:
This mentions Mutual Information and the chi-squared test metric
http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html
MATLAB has a number of functions as well:
http://www.mathworks.com/help/toolbox/stats/brj0qbu.html
Feature Selection in MATLAB
Of the above, relieff and rankfeatures look promising.
I do not know if my data follows a normal distribution. Any thoughts on which technique performs the best? Are there any newer methods you would suggest? The focus is to increase classification accuracy.
Thank you!
Since the answer is highly dependent on the nature of your data, I'd suggest playing with several options, possibly using a hold-out set for verification.
The easiest path would probably be to use Weka or RapidMiner for experimenting. Choosing from the plethora of options provided by them, you'll probably get acquainted with several other methods.
Having said that, I have found Mutual Information/Infogain to be useful on a large variety of problems.