Flat files and relational databases give us a mechanism to serialize structured data. XML is superb for serializing hierarchical, tree-like data.
But many problems are best represented by graphs. A thermal simulation program will, for instance, work with temperature nodes connected to each other through resistive edges.
So what is the best way to serialize a graph structure? I know XML can, to some extent, do it---in the same way that a relational database can serialize a complex web of objects: it usually works but can easily get ugly.
I know about the dot language used by the graphviz program, but I'm not sure this is the best way to do it. This question is probably the sort of thing academia might be working on and I'd love to have references to any papers discussing this.
How do you represent your graph in memory?
Basically you have two (good) options:
an adjacency list representation
an adjacency matrix representation
The adjacency list representation works best for sparse graphs, the matrix representation for dense graphs.
If you use such a representation in memory, then you can serialize that representation instead.
If it has to be human readable, you could still opt to create your own serialization format. For example, you could write down the matrix representation the way you would write any "normal" matrix: just print out the rows and columns and all the data in them, like so:
   1  2  3
1 #t #f #f
2 #f #f #t
3 #f #t #f
(This is an unoptimized, unweighted representation, but it works for directed graphs.)
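To make that concrete, here is a small sketch (plain Python, with made-up helper and file names) that writes and reads the matrix layout above; it is only an illustration, not a standard format:

def dump_matrix(labels, matrix, path):
    # Write the header row of node labels, then one row per node,
    # using #t / #f as in the example above.
    with open(path, "w") as f:
        f.write(" ".join(labels) + "\n")
        for label, row in zip(labels, matrix):
            cells = " ".join("#t" if cell else "#f" for cell in row)
            f.write(label + " " + cells + "\n")

def load_matrix(path):
    with open(path) as f:
        labels = f.readline().split()
        matrix = [[cell == "#t" for cell in line.split()[1:]] for line in f]
    return labels, matrix

# The 3-node directed graph from the table above.
labels = ["1", "2", "3"]
matrix = [[True, False, False],
          [False, False, True],
          [False, True, False]]
dump_matrix(labels, matrix, "graph.txt")
print(load_matrix("graph.txt") == (labels, matrix))  # True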
Typically, relationships in XML are expressed through the parent/child hierarchy. XML can handle graph data, but not in that manner; to handle graphs in XML you should use the xs:ID and xs:IDREF schema types.
As an example, assume that node/@id is an xs:ID type and that link/@ref is an xs:IDREF type. The following XML shows the cycle of three nodes 1 -> 2 -> 3 -> 1.
<data>
  <node id="1">
    <link ref="2"/>
  </node>
  <node id="2">
    <link ref="3"/>
  </node>
  <node id="3">
    <link ref="1"/>
  </node>
</data>
Many development tools have support for ID and IDREF too. I have used Java's JAXB (Java Architecture for XML Binding), which supports these through the @XmlID and @XmlIDREF annotations. You can build your graph using plain Java objects and then use JAXB to handle the actual serialization to XML.
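If you are not using Java, the same ID/IDREF-style XML can be read back with any XML library. As a hedged illustration, here is how the document above could be turned into an adjacency list with Python's standard xml.etree.ElementTree (the element and attribute names are the ones from the example; nothing else is assumed):

import xml.etree.ElementTree as ET

xml_text = """
<data>
  <node id="1"><link ref="2"/></node>
  <node id="2"><link ref="3"/></node>
  <node id="3"><link ref="1"/></node>
</data>
"""

root = ET.fromstring(xml_text)
# Build an adjacency list: node id -> list of referenced node ids.
adjacency = {
    node.get("id"): [link.get("ref") for link in node.findall("link")]
    for node in root.findall("node")
}
print(adjacency)  # {'1': ['2'], '2': ['3'], '3': ['1']}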
XML is very verbose. Whenever I do it, I roll my own. Here's an example of a three-node directed acyclic graph. It's pretty compact and does everything I need it to do:
0: foo
1: bar
2: bat
----
0 1
0 2
1 2
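For illustration, a small Python sketch (made-up function name, no particular library) that parses this format back into labels and edges:

def load_graph(path):
    # Parse the "id: label" header section and the "src dst" edge section,
    # separated by a line of dashes, as in the example above.
    labels, edges = {}, []
    with open(path) as f:
        section = "nodes"
        for line in f:
            line = line.strip()
            if not line:
                continue
            if set(line) == {"-"}:
                section = "edges"
            elif section == "nodes":
                node_id, label = line.split(":", 1)
                labels[int(node_id)] = label.strip()
            else:
                src, dst = map(int, line.split())
                edges.append((src, dst))
    return labels, edges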
Adjacency lists and adjacency matrices are the two common ways of representing graphs in memory. The first decision you need to make when choosing between them is what you want to optimize for. Adjacency lists are very fast if you need to, for example, get the list of a vertex's neighbors. On the other hand, if you are doing a lot of testing for edge existence, or you have a graph representation of a Markov chain, then you'd probably favor an adjacency matrix.
The next question you need to consider is how much you need to fit into memory. In most cases, where the number of edges in the graph is much smaller than the total number of possible edges, an adjacency list is going to be more efficient, since you only need to store the edges that actually exist. A happy medium is to represent the adjacency matrix in compressed sparse row (CSR) format, in which you keep a vector of the non-zero entries from top left to bottom right, a corresponding vector indicating which column each non-zero entry is found in, and a third vector indicating where each row starts in the other two vectors (plus one final entry equal to the total number of non-zeros). For example, the matrix
[[0.0, 0.0, 0.3, 0.1]
[0.1, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0]
[0.5, 0.2, 0.0, 0.3]]
can be represented as:
vals: [0.3, 0.1, 0.1, 0.5, 0.2, 0.3]
cols: [2, 3, 0, 0, 1, 3]
rows: [0, 2, 3, 3, 6]
Compressed sparse row is effectively an adjacency list (the column indices function the same way), but the format lends itself a bit more cleanly to matrix operations.
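If you have scipy available, you can sanity-check those three vectors against scipy.sparse, which uses exactly this layout (data, indices, indptr):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0.0, 0.0, 0.3, 0.1],
                  [0.1, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.2, 0.0, 0.3]])

sparse = csr_matrix(dense)
print(sparse.data)     # vals: [0.3 0.1 0.1 0.5 0.2 0.3]
print(sparse.indices)  # cols: [2 3 0 0 1 3]
print(sparse.indptr)   # row pointers: [0 2 3 3 6]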
One example you might be familiar with is Java serialization. It effectively serializes the object graph, with each object instance being a node and each reference being an edge. The algorithm is recursive but skips anything it has already written. A sketch of the traversal (record_properties and neighbours are placeholders for whatever your format needs):
def serialize(x, done=None):
    # 'done' tracks objects that have already been serialized, so that
    # shared references and cycles are only written out once.
    if done is None:
        done = set()
    if id(x) in done:
        return
    done.add(id(x))
    record_properties(x)           # placeholder: write x's own fields
    for child in neighbours(x):    # placeholder: follow each reference/edge
        serialize(child, done)
Another way of course is as a list of nodes and edges, which can be done as XML, or in any other preferred serialization format, or as an adjacency matrix.
On a less academic, more practical note: in CubicTest we use XStream (Java) to serialize tests to and from XML. XStream handles graph-structured object relations, so you might learn a thing or two from looking at its source and the resulting XML. You're right about the ugly part, though; the generated XML files don't look pretty.
Related
Suppose I have bunches of polynomials/data like the n = 36 shown below:
They are all quite similar, but with slightly different roof (peak) and amplitude. What is the best approach for coding a sequence of coefficients/changes so that I can use this sequence to transform one polynomial into another, say: the blue one + a change sequence -> the green one?
P.S.:
I tried fitting the data with a Gaussian curve, but unfortunately the results were very poor, so I have to use polynomials.
Currently the data are fitted with numpy.polyfit(x, y, 35).
Edit:
The intention is to find a way to generically describe the transformation between two polynomials, so that I can apply it to future ones: if I later get a totally new polynomial like the ones above, I can use this transformation to change it in a specific manner, i.e. increase/decrease the roof/amplitude. By "specific manner" I mean that, as the graph shows, the change in y is largest around the roof's x position and falls off along +x/-x, much like a Gaussian curve; unfortunately a Gaussian curve cannot express the data itself.
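As a purely illustrative sketch (the curves below are placeholders, not your data), one crude way to encode a "change sequence" is simply the difference between the two fitted coefficient vectors, which can then be added to another fit made on the same x grid:

import numpy as np

# Placeholder curves standing in for the "blue" and "green" data.
x = np.linspace(0.0, 10.0, 200)
y_blue = np.exp(-(x - 5.0) ** 2)
y_green = 1.2 * np.exp(-(x - 5.0) ** 2)

coeff_blue = np.polyfit(x, y_blue, 35)
coeff_green = np.polyfit(x, y_green, 35)

change = coeff_green - coeff_blue                    # one possible "change sequence"
y_transformed = np.polyval(coeff_blue + change, x)   # approximately the green curve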
I'm developing a regression model, but I ran into a problem when preparing the data: 17 out of 20 features are categorical, and each of them has a lot of categories. With one-hot encoding, my data table turns into a 10000x6000 table. How should I prepare this kind of data?
I used PCA to try to reduce the dimensionality, but even 70% of the variance takes about 2500 features. That's why I'm asking here.
Unfortunately, I can't attach the dataset, as it is confidential
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data onto a linear subspace, which means it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so are able to represent a greater amount of the variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
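A minimal sketch of such an autoencoder, assuming Keras/TensorFlow is available and that the one-hot table is already a NumPy array X of shape (10000, 6000); the layer sizes and bottleneck width are arbitrary choices, not recommendations:

from tensorflow import keras

input_dim, code_dim = 6000, 64

inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(512, activation="relu")(inputs)
encoded = keras.layers.Dense(code_dim, activation="relu")(encoded)
decoded = keras.layers.Dense(512, activation="relu")(encoded)
decoded = keras.layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# X is assumed to be the one-hot encoded table (not shown here):
# autoencoder.fit(X, X, epochs=20, batch_size=64)
# X_reduced = encoder.predict(X)   # non-linear 64-dimensional representation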
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also a PCA decomposition) will give you great insight into which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. scikit-learn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benefit of a random forest is that it works well on tabular data (as is the case here) and can easily be trained using numerical codes for the categorical features, meaning your data vector need only be of dimension 20! (See the sketch below.)
Tree-based models (such as random forests) have been shown to outperform deep learning in many cases, and this may be one of them.
TL;DR: if you use a random forest, it can learn even from numerical codes for the categories, and you avoid creating incredibly large vectors for your data.
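A hedged sketch of that approach with scikit-learn, assuming the data sit in a pandas DataFrame df with made-up column names for the 17 categorical and 3 numeric features and a target y:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Illustrative column split; adjust to your real feature names.
categorical_cols = [f"cat_{i}" for i in range(17)]
numeric_cols = [f"num_{i}" for i in range(3)]

preprocess = ColumnTransformer(
    [("categories", OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1), categorical_cols)],
    remainder="passthrough",  # numeric columns pass through unchanged
)

model = Pipeline([
    ("encode", preprocess),
    ("forest", RandomForestRegressor(n_estimators=300, random_state=0)),
])

# df and y are assumed to exist (the dataset is confidential):
# model.fit(df[categorical_cols + numeric_cols], y)
# predictions = model.predict(df[categorical_cols + numeric_cols])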
I want to implement column generation using Dantzig-Wolfe decomposition.
In the algorithm, the feasible polyhedron of the problem is represented as a convex combination of its extreme points and extreme rays. Thus we build a mapping between the original problem and the master problem: x = ∑ μi·xi, where the xi are the extreme points of the original feasible polyhedron.
I want to know how I can implement this mapping in my code, i.e., for each μi, obtain its corresponding extreme point xi. Each extreme point is a list of the coordinates of the original variables, and the problem has a large number of variables, so that list will be very long. If I store the full coordinate list for every μi, it will be expensive.
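Purely as a hedged illustration (not a full column-generation implementation, and the numbers are made up): one way to avoid storing full coordinate vectors is to keep each extreme point as a sparse map of its non-zero coordinates and expand x = ∑ μi·xi on demand:

# Illustrative sketch: store each extreme point as a dict of its non-zero
# coordinates only, and expand the master solution back to the original x.
extreme_points = [
    {0: 1.0, 7: 2.5},        # x_1: non-zeros at variables 0 and 7
    {3: 4.0},                # x_2: non-zero at variable 3
]
mu = [0.25, 0.75]            # weights from the master problem, summing to 1

x = {}                       # original-variable solution, also kept sparse
for weight, point in zip(mu, extreme_points):
    for var_index, value in point.items():
        x[var_index] = x.get(var_index, 0.0) + weight * value

print(x)  # {0: 0.25, 7: 0.625, 3: 3.0}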
So, I have a set of texts I'd like to do some clustering analysis on. I've computed the Normalized Compression Distance between every pair of texts, so now I have essentially built a complete graph with weighted edges that looks something like this:
text1, text2, 0.539
text2, text3, 0.675
I'm having tremendous difficulty figuring out the best way to plug this data into scipy's hierarchical clustering methods. I can probably convert the distance data into a table like the one on this page. How can I format this data so that it can easily be plugged into scipy's HAC code?
You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. The related functions explicitly specify how distance between clusters should be calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single link (i.e. the distance between two clusters is the distance between their closest points). All of these methods will return a multidimensional array representing the agglomerative clustering. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on this clustering, such as visualizing or flattening it.
However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Based on the fact that the github issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can get around this issue by passing the complete distance matrix into the scipy.spatial.distance.squareform function, which will convert it into the format which is actually accepted (a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix). You can then pass the result to one of the scipy.cluster.hierarchy functions.
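A hedged sketch of that workflow (the distance values below are invented stand-ins for the NCD results):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

texts = ["text1", "text2", "text3"]

# Redundant (square, symmetric) NCD distance matrix; values are illustrative.
distances = np.array([[0.000, 0.539, 0.812],
                      [0.539, 0.000, 0.675],
                      [0.812, 0.675, 0.000]])

condensed = squareform(distances)            # condensed (upper-triangular) form
clustering = linkage(condensed, method="single")
labels = fcluster(clustering, t=0.6, criterion="distance")
print(dict(zip(texts, labels)))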
I would like to arithmetically add two large System.Arrays element-wise in IronPython and store the result in the first array, like this:
for x in range(arrA.GetLength(0)):
    for y in range(arrA.GetLength(1)):
        for z in range(arrA.GetLength(2)):
            arrA.SetValue(arrA.GetValue(x, y, z) + arrB.GetValue(x, y, z), x, y, z)
However, this seems very slow. Having a C background, I would like to use pointers or iterators, but I do not know the fast IronPython idiom for this. I cannot use Python lists, as my objects are strictly of type System.Array. The element type is float and the array is 3-dimensional.
What is the fastest (or at least a fast) way to perform this computation?
Edit:
The number of elements is approx. 256^3.
3d float means that the array can be accessed like this: array.GetValue(indexX, indexY, indexZ). I am not sure how the respective memory is organized in IronPython's System.Array.
Background: I wrote an interface to an IronPython API, which gives access to data in a simulation software tool. I retrieve 3d scalar data and accumulate it to a temporal array in my IronPython script. The accumulation is performed 10,000 times and should be fast, so that the simulation does not take ages.
Is it possible to use the numpy library developed for IronPython?
https://pytools.codeplex.com/wikipage?title=NumPy%20and%20SciPy%20for%20.Net
It appears to be supported, and as far as I know it is as close as you can get in Python to C-style pointer functionality with arrays and such.
Create an array:
import numpy as np

# Use a float dtype so that in-place multiplication by 3.0 is allowed.
x = np.array([[1, 2, 3], [4, 5, 6]], np.float64)
Multiply all elements by 3.0:
x *= 3.0
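Applied to the accumulation described in the question, the idea would look roughly like this (load_frame is a made-up stand-in for whatever returns one 3-D snapshot as a NumPy array; getting the simulation data into NumPy in the first place depends on the .NET bridge and is not shown):

import numpy as np

# Illustrative only: accumulate many 3-D frames into one running total
# with a single vectorized, in-place addition per step.
shape = (64, 64, 64)                     # roughly (256, 256, 256) in the real case
accumulator = np.zeros(shape, dtype=np.float32)

def load_frame(step):
    # Placeholder for the real data source mentioned in the question.
    return np.random.rand(*shape).astype(np.float32)

for step in range(10):                   # 10,000 steps in the real use case
    accumulator += load_frame(step)      # element-wise, no Python-level loop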