Visualizing big data with Cytoscape - cytoscape.js

I am new to Cytoscape and need some advice. I have a file with two columns, source and destination, where each row describes one edge (source node -> destination node).
For example, a sample starting from the top may look like this:
src | dst
12.251.512 | 12.623.743
51.734.312 | 23.233.991
6334.6231.123 | 42.532.54453
It has over a million lines, and I need a way to visualize it.
Is Cytoscape the right tool for this kind of visualization job? If so,
What methods can be used to simplify such large networks so that visualization actually gives us some information? Any plugin/tool/technique advice would be very much appreciated.

If you have so many elements, your users are going to have trouble making sense of the graph. Try hiding elements using the style: http://js.cytoscape.org/#style/visibility
You can show N elements at a time based on filters. What metrics you use depends on your data. A basic one might be degree.
You could control the filtering with sliders, toggles, etc.

Related

Implementations of (fully) dynamic connectivity data structures

The dynamic connectivity problem for graphs consists in maintaining a graph data structure that allows for adding and deleting edges of the graph.
Moreover, the data structure should support connectivity queries.
Typically, such a query is of the form "Are the nodes u and v connected in the graph?"
There are variants of the dynamic connectivity problem that also support different connectivity queries like 2-edge-connectivity or biconnectivity.
My question is: Are there existing efficient implementations of dynamic connectivity data structures?
By efficient I mean data structures with low amortized operation costs.
In particular, I am NOT interested in trivial implementations with a complexity of O(n) per operation!
Below I describe in more detail what I am looking for and what I already know.
If only edge insertions are allowed, the dynamic connectivity problem can be solved by the well-known disjoint-set (aka union-find) data structure.
For this data structure there are implementations available in many different programming languages.
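For reference, here is a minimal union-find sketch in Python, with path compression and union by size; it is a generic illustration of the incremental case, not taken from any particular library:

class DisjointSet:
    """Union-find with path compression and union by size.

    Supports incremental connectivity only: edges can be added, never deleted.
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Find the root, then point every visited node directly at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        # Add the edge (a, b): merge the smaller tree into the larger one.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def connected(self, a, b):
        # "Are a and b in the same component?"
        return self.find(a) == self.find(b)

# Example:
# ds = DisjointSet(3)
# ds.union(0, 1)
# ds.connected(0, 1)  # True
# ds.connected(0, 2)  # False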
Unfortunately, this does not seem to be the case for the dynamic connectivity problem that also allows edge deletions.
The situation is even worse for data structures that also allow other connectivity queries like 2-edge- or biconnectivity.
To the best of my knowledge the algorithms presented in Holm et al. (2001) are still state of the art for many dynamic connectivity problems.
This publication was accompanied by an experimental study; however, as far as I can tell, the code was never made publicly available. Also, only implementations for the regular connectivity problem are discussed there, not for 2-edge- or biconnectivity.
The algorithms by Holm et al. (and also by other authors) are highly non-trivial.
Even though the algorithms are described in much detail, implementing them in practice requires a lot of expertise.
Because of this I am looking for existing implementations of different dynamic connectivity data structures.
The table below summarizes the (currently underwhelming) set of implementations I know of, by combination of supported manipulations and queries.
Graph Manipulations               | Connectivity  | 2-edge-connectivity | Biconnectivity
incremental (adding edges)        | disjoint-set  | -                   | -
decremental (deleting edges)      | Rafael Glikis | -                   | -
fully (adding and deleting edges) | -             | -                   | -
I have searched for implementations in different places: I have looked on GitHub, I have looked through the external links in the relevant Wikipedia articles, and I have skimmed a lot of literature, all without success.
I expect we will need a framework for trying things out so that we can discuss this in concrete terms.
I have implemented a small Windows application that accepts user queries to read, build, edit and query the connectivity of a graph, showing the time taken to execute each.
Sample run:
Supported queries
add v1 v2 : add link to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.539246 0.539246 query
type query> delete 23 20
4720 vertices 27443 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.004432 0.004432 query
type query> add 23 20
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.0046639 0.0046639 query
The complete application is at https://github.com/JamesBremner/graphConnectivity
To demonstrate how this application can be used, I built it with the graph engine at https://github.com/JamesBremner/PathFinderFeb2023 and ran it on a couple of the test datasets from https://dyngraphlab.github.io/
dataset            | edge count | delete | add
3elt.graph.seq.txt | 27,443     | 5ms    | 5ms
144.graph.seq.txt  | 2,148,787  | 13ms   | 13ms
To get the average time to perform multiple queries, use the random command, like this:
Supported queries
add v1 v2 : add link to graph
add random n : add n random links to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
type query> add random 10
4720 vertices 27454 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
10 1.62e-06 1.62e-05 randomAdd

Reducing the number of unique categories in pandas

I have a dataset with job metrics, and one of my features is industry. It is a categorical feature with 1200 unique values. Before I go on to build a model, I need to figure out how best to encode it, especially because it has so many unique values. Does anyone have any tips or guidance as to where I should start?
The picture below shows the top 9 industries. I am thinking of selective encoding - maybe one-hot encoding only the 15-20 most frequent values - but I will be thankful for any suggestions. Thanks
I have tried looking through several resources, but couldn't find anything promising so far.
[A picture of the 9 most occurring industries]
https://i.stack.imgur.com/tDAEk.jpg
You could one-hot encode everything, and maybe check correlations against the target to see which categories may be informative features.
If the data is too large to do this, then yes, perhaps use selective encoding as you said: conditionally fill everything else as "other" and then proceed with one-hot encoding.
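For illustration, a minimal pandas sketch of that selective approach, keeping the top N most frequent categories and collapsing everything else into "other" before one-hot encoding; the column name "industry" and top_n=15 are assumptions, not taken from the original dataset:

import pandas as pd

def encode_top_categories(df, column="industry", top_n=15):
    # Keep the top_n most frequent categories, collapse the rest into "other".
    top = df[column].value_counts().nlargest(top_n).index
    reduced = df[column].where(df[column].isin(top), other="other")
    # One-hot encode the reduced column and replace the original.
    dummies = pd.get_dummies(reduced, prefix=column)
    return df.drop(columns=[column]).join(dummies)

# Hypothetical usage:
# df = pd.read_csv("jobs.csv")
# df = encode_top_categories(df, column="industry", top_n=15)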

Algorithms to select price ranges

What is the best way to represent a series of item price ranges so as to reduce noise for the end user?
Typically, when an item is displayed on e-commerce sites, a histogram of price ranges is shown. Are there standard algorithms that these sites use for this display?
Well, it seems to me that you would first and foremost need a way to aggregate this data. That having been said, if you have that data and need to create a histogram, it can be fairly simple in the programming language R (here is some documentation: http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/base/html/hist.html ). There is also an R extension I've read about that allows you to post/run R code in wiki-like pages ( http://mars.wiwi.hu-berlin.de/mediawiki/slides/index.php/R_extension_-_Mediawiki ).
If you already have this data (in this case prices), I don't think you need an algorithm so much as a way to display it in some type of graph. I think R should be useful. I hope this helps!
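If you would rather not use R, the same bucketing idea is easy to sketch in Python with numpy.histogram; this is a generic illustration, not an algorithm any particular e-commerce site is known to use:

import numpy as np

def price_buckets(prices, n_bins=10):
    # Bin the prices into n_bins equal-width ranges and count items per range.
    counts, edges = np.histogram(prices, bins=n_bins)
    # Return (low, high, count) triples suitable for rendering as a bar chart.
    return [(edges[i], edges[i + 1], int(counts[i])) for i in range(len(counts))]

# Example: bucket a small list of prices.
# for low, high, count in price_buckets([9.99, 12.50, 14.00, 45.0, 47.5, 99.0]):
#     print(f"{low:8.2f} - {high:8.2f}: {count}")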

Adding custom data to GapMinder

Does anyone have any experience adding their own data to GapMinder, the really cool software that Hans Rosling uses in his TED talks? I have an array of objects in JSON that would be easy to show as moving bubbles. This would be really cool.
I can see that my Ubuntu box has what looks like data in /opt/Gapminder Desktop/share/assets/graphs/world, but I would need to figure out:
How to add a measure to a graph
How to add a data series
How to set the time range of the data
How to identify the measures to follow at each time step
and so on.
Just for the record: if you want to use Gapminder with your own dataset, you have to convert your data into a format suitable for Gapminder. More specifically, looking in assets/graphs/world, you will have to:
Edit the file overview.xml, which contains the tree structure of all the indicators (just copy/paste an entry and specify your own data);
Convert your data by copying the structure of the XML files in that directory (this is the tricky part): you can specify some metadata in the preamble, and then specify your own data series, with something like:
<t1 m="i20,50.0,99.0,1992" d="90.0, ... ,50.0, ..."/> where i20 is the country id, which is followed by the minima and maxima of the series, and the year it refers to.
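As a rough Python sketch of generating such a line, assuming the m attribute layout described above (country id, series minimum, series maximum, start year) and one comma-separated value per year in d; the element name t1 comes from the example, everything else is hypothetical:

import xml.etree.ElementTree as ET

def series_element(country_id, start_year, values):
    # Build one data line in the layout described above:
    # m = "<country id>,<series min>,<series max>,<start year>"
    # d = comma-separated values, one per year starting at start_year.
    elem = ET.Element("t1")
    elem.set("m", f"{country_id},{min(values)},{max(values)},{start_year}")
    elem.set("d", ",".join(str(v) for v in values))
    return ET.tostring(elem, encoding="unicode")

# Example for a hypothetical indicator, country id "i20", starting in 1992:
# print(series_element("i20", 1992, [50.0, 52.3, 55.1, 60.0]))
# -> <t1 m="i20,50.0,60.0,1992" d="50.0,52.3,55.1,60.0" />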
In my humble opinion, Gapminder is a great app, but it definitely needs more work on integration with other datasets. It may well be better to use Google Motion Chart, as you did, or MooGraph (site and doc), although the latter is unfortunately not as great as Gapminder.
@Stefano, the information you provided is very valuable. Is a detailed specification of the XML files containing the data available somewhere?
Anyway, just to enrich your response, I also found that:
The overview.xml file:
The link between nations and their IDs is in this file.
The structure of the menus for the selection of the indicators is also in the same file (at the bottom), under the section <indicatorCategorization>.
The structure of the datafile XML:
For each line, the year represents the first year of the series, and then the values follow, one per year, comma separated.
Thanks,
Max
I ended up using the Google Motion Chart API; this is what I came up with.

Designing a file processing system that handles many file formats, parsing, validation, and persistence

If you had to design a file processing component/system that could take in a wide variety of file formats (including proprietary formats such as Excel), parse/validate the data, and store it to a DB, how would you do it?
NOTE: 95% of the time, one line of input data will equal one record in the database, but not always.
Currently I'm using some custom software I designed to parse/validate/store customer data to our database. The system identifies a file by its location in the file system (from an FTP drop) and then loads an XML "definition" file. (The correct XML is loaded based on where the input file was dropped off.)
The XML specifies things like the file layout (delimited or fixed width) and field-specific items (length, data type (numeric, alpha, alphanumeric), and which DB column to store the field in).
<delimiter><![CDATA[ ]]></delimiter>
<numberOfItems>12</numberOfItems>
<dataItems>
  <item>
    <name>Member ID</name>
    <type>any</type>
    <minLength>0</minLength>
    <maxLength>0</maxLength>
    <validate>false</validate>
    <customValidation/>
    <dbColumn>MembershipID</dbColumn>
  </item>
Because of this design, the input files must be text (fixed width or delimited) and have a 1-to-1 relation from input file data field to DB column.
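For illustration, a rough Python sketch of how such a definition-driven parser might map one delimited line to DB columns; the tag names follow the excerpt above, but the wrapping root element, the file names, and the omission of fixed-width handling and real validation are all assumptions:

import xml.etree.ElementTree as ET

def parse_line(definition_path, line):
    # Load the XML definition and split one delimited input line into
    # a {db_column: value} dict. Only the delimited case is sketched here.
    root = ET.parse(definition_path).getroot()
    delimiter = root.findtext("delimiter")
    fields = line.split(delimiter)
    record = {}
    for item, value in zip(root.find("dataItems"), fields):
        record[item.findtext("dbColumn")] = value
    return record

# Hypothetical usage:
# record = parse_line("member_feed_definition.xml", "12345 Smith John")
# db.insert("Members", record)  # persistence layer not shown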
I'd like to extend the capabilities of our file processing system to take in Excel, or other file formats.
There are at least a half dozen ways I can proceed but I'm stuck right now because I don't have anyone to really bounce the ideas off of.
Again: if you had to design a file processing component that could take in a wide variety of file formats (including proprietary formats such as Excel), parse/validate the data, and store it to a DB, how would you do it?
Well, a straightforward design is something like...
+-----------+
|  reader1  |---\
+-----------+    \
                  \   +----------------+      +--------+
                   \--|   validation   |------|   DB   |
                   /--|                |      |        |
                  /   +----------------+      +--------+
+-----------+    /
|  reader2  |---/
+-----------+
Readers take care of file validation (does the data exist?) and parsing, the validation section takes care of any business logic, and the DB... is a DB.
So part of what you'd have to design is the Generic ReaderToValidator data container. That's more of a business logic kind of container. I suspect you want the same kind of data regardless of the input format, so G.R.2.V. is not going to be too hard.
You can make this polymorphic by designing a GR2V superclass with the Validator method and the data members; then each reader subclasses off GR2V and fills in the data with its own ReadParseFile method. That will introduce a bit more coupling, though, than a strict procedural approach. I'd go procedural for this, since the data is being procedurally processed in the conceptual design.
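As a rough Python sketch of that split; the class names (Record, DelimitedReader, ExcelReader) and the whole structure are made up for illustration, not taken from the original system:

class Record(dict):
    """Generic reader-to-validator container: one parsed input record,
    keyed by destination DB column, regardless of the source format."""

class Reader:
    def read(self, path):
        # Each subclass turns one file into a list of Record objects.
        raise NotImplementedError

class DelimitedReader(Reader):
    def __init__(self, delimiter, columns):
        self.delimiter = delimiter
        self.columns = columns  # ordered list of DB column names

    def read(self, path):
        records = []
        with open(path) as f:
            for line in f:
                values = line.rstrip("\n").split(self.delimiter)
                records.append(Record(zip(self.columns, values)))
        return records

class ExcelReader(Reader):
    def read(self, path):
        # Placeholder: parse the workbook with a spreadsheet library
        # (e.g. openpyxl) and emit the same Record objects.
        raise NotImplementedError

def validate(record):
    # Business-logic checks shared by every input format.
    return all(v != "" for v in record.values())

def process(reader, path, db_insert):
    for record in reader.read(path):
        if validate(record):
            db_insert(record)  # persistence callback, e.g. one INSERT per record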
You may want to start a blog, then perhaps if you are on something like LinkedIn you can point the discussion to your blog, or start a discussion on LinkedIn, as some of the discussions there go on for a while.
SO is good for specifics; it seems like true discussion is not so easily done here. Comments are too small for an interchange of ideas, so I would tend to go elsewhere.
Although such discussions should be technology-agnostic, I suspect that you'll probably find that the Java and .Net camps don't meet too much. I would look at The Server Side but I do Java and hence look for Java stuff.