Pandas panelnd vs dataframe with hierarchical index - pandas

I was wondering when and why I should prefer a panel(nd) over a dataframe with hierarchical index, and vice versa. In my very brief experience, I would say that the former is more convenient for slicing, while the latter for mathematical operations. My particular need would be to interactively manipulate 3-5 dimensional panels with convenient slicing and element-wise operations.
Thanks,
Giacomo

Generally stick with a multi-indexed frame, as it is more fully supported.
A panelnd is like a generalized n-dim Panel, good mainly for single-dtyped data. It does work like a Panel, but has some quirks and missing features (that's why it's experimental).
There are ways to apply some operations to multiple slabs of an n-dim object (esp. via the new apply in 0.13.1; see here).
Once I get to more than 3 dimensions, I mainly 'hold' the data and slice to work on it in 2 dimensions, then reassemble it if needed. Storage can also be convenient for these higher-dim objects (e.g. via HDFStore), and was the reason they were created in the first place.
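To make the "hold, slice to 2 dimensions, reassemble" workflow concrete, here is a minimal sketch with a multi-indexed frame. The level names (`item`, `date`), the column names and the values are made up for illustration; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-dimensional data: (item, date, field) -> value.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["2014-01-01", "2014-01-02", "2014-01-03"]],
    names=["item", "date"],
)
frame = pd.DataFrame(
    np.arange(12, dtype=float).reshape(6, 2),
    index=index,
    columns=["x", "y"],
)

# Slice one "slab" down to an ordinary 2-D frame indexed by date.
slab_a = frame.xs("a", level="item")

# Element-wise operations apply to the whole frame at once.
doubled = frame * 2

# Work on each slab separately, then get the result back in the
# original multi-indexed shape (here: subtract each item's column mean).
demeaned = frame.groupby(level="item").transform(lambda s: s - s.mean())
```

Adding a further dimension would just mean adding another index level, so the same `xs` and `groupby(level=...)` calls should carry over.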

Related

Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are encoded using a one-hot encoder.
I want to use PCA (principal component analysis) to reduce the dimension and make the machine learning process easier.
Is it possible to use PCA on just some columns of the dataframe, keep the other columns as they are, and then use a machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones, but then using other, non-merged ones in addition to the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.
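As a rough sketch of what "PCA on part of the columns" could look like, the snippet below reduces only a one-hot block and leaves the remaining columns untouched. The data is random and purely illustrative, `pca_reduce` is a hypothetical helper written with a plain NumPy SVD rather than any particular library's PCA class, and the number of components is an arbitrary choice.

```python
import numpy as np

def pca_reduce(block, n_components):
    """Project the rows of `block` onto its top principal components."""
    centered = block - block.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
one_hot = np.eye(4)[rng.integers(0, 4, size=100)]  # 100 x 4 one-hot block (made-up data)
numeric = rng.normal(size=(100, 3))                # 100 x 3 columns kept as they are

reduced = pca_reduce(one_hot, n_components=2)
features = np.hstack([reduced, numeric])           # 100 x 5 matrix to hand to the clustering step

# Correlation between the PCA-merged columns and the untouched ones,
# to check whether the latter mostly duplicate the former.
cross_corr = np.corrcoef(features, rowvar=False)[:2, 2:]
```

Whether the untouched columns need rescaling before clustering depends on the data and the algorithm, so treat this only as a starting point.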

Titan vertex centric indices vs Neo4j labels

I was trying to make a comparison between these two technologies when approaching this and I was wondering if any of you already have some experience dealing with any or both of them?
I am mainly interested in performance numbers when dealing with similar use cases.
The difference between the two concepts is the difference between global and local indexing.
As I understand it, Neo4j vertex labels allow you to break up your index space by "categories" of vertices. In this way, an O(log(|V|)) lookup becomes an O(log(|V|/c)) lookup, where c is the number of categories/labels you have over your vertex set and the expression assumes an equal number of vertices in each category. As such, vertex labels aid in global index calls, as the cost is a function of V.
Next, Titan's vertex-centric indices sort and index the incident edges of a vertex. The cost to find a particular edge by its label/properties incident to a vertex is O(log(inc(v))), where inc(v) is the size of the incident edge set to vertex v. As such, vertex-centric indices are local indices as this is a function of v.
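To illustrate the idea of a local index (this is a toy sketch, not how Titan implements it), each vertex below keeps its incident edges sorted by `(label, weight)`, so finding the edges with a given label and a minimum weight is a binary search over that vertex's own edges only. The class name, the `weight` property and the sample data are all assumptions made for the example.

```python
import bisect
from collections import defaultdict

class VertexCentricIndex:
    """Toy local index: per vertex, incident edges sorted by (label, weight)."""

    def __init__(self):
        # vertex -> sorted list of (label, weight, neighbour)
        self._edges = defaultdict(list)

    def add_edge(self, v, label, weight, neighbour):
        # A plain list keeps the sketch short; insertion is O(inc(v)) here,
        # whereas a real index would use a tree-like structure.
        bisect.insort(self._edges[v], (label, weight, neighbour))

    def edges(self, v, label, min_weight=float("-inf")):
        """Neighbours reached from v via `label` edges with weight >= min_weight."""
        incident = self._edges.get(v, [])
        # Two binary searches: O(log(inc(v))), independent of |V|.
        lo = bisect.bisect_left(incident, (label, min_weight))
        hi = bisect.bisect_left(incident, (label, float("inf")))
        return [neighbour for _, _, neighbour in incident[lo:hi]]

index = VertexCentricIndex()
index.add_edge("hub", "follows", 3, "u1")
index.add_edge("hub", "follows", 7, "u2")
index.add_edge("hub", "likes", 5, "p1")
heavy_follows = index.edges("hub", "follows", min_weight=5)
```

The lookup cost depends only on the edges incident to `"hub"`, which is why this kind of index helps with supernodes while a global index does not.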
As I understand it, Neo4j does not support vertex-centric indices. You see this concept currently in Titan, OrientDB, and TinkerGraph (…and RDF stores sort in this manner as well -- via spog pairings). All known graph databases support global indices, though I believe only Neo4j and OrientDB support a vertex set partition via the concept of a label.
Again, assuming my assumptions are correct about the use of vertex labels in Neo4j, we are talking about two different use cases — global vs. local indexing. From the perspective of the supernode problem, global indices do not quell the issue of traversing through a large vertex, while this is the sole purpose of the local vertex-centric indices.
You can read about the supernode problem and vertex-centric indices here:
http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/
Agreeing with everything Marko said, one could take it further and argue that in the graph database world local indexes can (and even should) substitute global ones. In my opinion, the single greatest advantage of a graph data model is that it lets you encode your data model into the graph topology, gaining qualitative advantages in terms of flexibility, ease of evolution and performance. With this in mind, I'd argue that labels in Neo4j actually detract from all this; reifying a label into a node with adjacent edges pointing to the source having that label is much more in line with the "schema is the graph" philosophy.
Of course, if your engine lacks local indexes we are back at the supernode problem. But if you do have them (something which I'd say should be a requirement for something to be called a graph database), you can easily transform your label into a node L, and create relationships pointing to that node for those vertices which you want labeled with L:
v -[L]-> L
meaning that v has label L. Now if you want this in Titan to behave like a Neo4j label, just make the -[L]-> relation "manyToOne" (see Titan cardinality constraints) and create a vertex-centric index. This pattern lets you get everything that you could with labels and much more; you can:
effectively use this as a namespace for properties relating to that label
sort your elements inside one label
nest labels easily without losing performance (just use a composite key)
separate the declaration of a label L from how elements labeled with it are accessed
Labels may afford some design patterns that improve performance by de-densifying the graph. For example: they eliminate the need for type nodes, which can often get quite dense. Labels can optionally be associated with a unique index. Here, the ability to index a property isn't new, but the ability to constrain it uniquely is. If you were previously doing work in your application, you may experience some performance gains by letting the database handle this. (It's certainly much more convenient to do so.) Finally, if you don't assign a unique index to a label, it will still be indexed, in order to help performance for certain kinds of queries (e.g. "give me all of the nodes having label ")
All that said, while labels may help with performance in certain cases, they were introduced more with ease-of-use in mind. We're just getting started with Neo4j 2.1, which specifically addresses dense node performance (something I know you've been waiting for), along with other performance & scalability improvements... including removing (for all practical purposes eliminating) the upper size limits.
Philip

If I use python pandas, is there any need for structured arrays?

Now that pandas provides a data frame structure, is there any need for structured/record arrays in numpy? There are some modifications I need to make to an existing code which requires this structured array type framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?
pandas's DataFrame is a high level tool while structured arrays are a very low-level tool, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (structured arrays can't do most things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
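As a small illustration of "interpreting a binary blob as a table-like structure", the sketch below defines a record layout, reinterprets raw bytes with it, and then hands the result to pandas. The field names and values are invented for the example; with a file on disk, `np.memmap` accepts the same kind of dtype.

```python
import numpy as np
import pandas as pd

# Layout of one record: 4-byte int, 8-byte float, 3 raw bytes (15 bytes in total).
record = np.dtype([("id", "<i4"), ("price", "<f8"), ("code", "S3")])

# Stand-in for a binary blob that would normally come from a file or a socket.
raw = np.array([(1, 9.5, b"abc"), (2, 3.25, b"xyz")], dtype=record).tobytes()

# Reinterpret the bytes as a table without parsing or copying them.
table = np.frombuffer(raw, dtype=record)
prices = table["price"]  # column access by field name

# Moving to pandas is a single call once the higher-level tools are wanted.
df = pd.DataFrame.from_records(table)
```

So the two are not really competitors: the structured array describes the bytes, and the DataFrame is what you would typically work with afterwards.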
I'm currently in the middle of a transition to Pandas DataFrames from various Numpy arrays. This has been relatively painless since Pandas, AFAIK, is built largely on top of Numpy. What I mean by that is that .mean(), .sum() etc. all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitrary pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round-trips to my database).
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
Also, if you're like me and your main use of Scipy is the statistics module, pystatsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth
I never took the time to dig into pandas, but I use structured arrays quite often in numpy. Here are a few considerations:
Structured arrays are as convenient as recarrays with less overhead, if you don't mind losing the possibility to access fields by attribute. But then, have you ever tried to use min or max as a field name in a recarray?
NumPy has been developed over a far longer period than pandas, with a larger crew, and it has become ubiquitous enough that a lot of third-party packages rely on it. You can expect structured arrays to be more portable than pandas dataframes.
Are pandas dataframes easily picklable? Can they be sent back and forth with PyTables, for example?
Unless you're 100% sure that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.

Haskell: list/vector/array performance tuning

I am trying out Haskell to compute partition functions of models in statistical physics. This involves traversing quite large lists of configurations and summing various observables - which I would like to do as efficiently as possible.
The current version of my code is here: https://gist.github.com/2420539
Some strange things happen when trying to choose between lists and vectors to enumerate the configurations; in particular, to truncate the list, using V.toList . V.take (3^n) . V.fromList (where V is Data.Vector) is faster than just using take, which feels a bit counter-intuitive. In both cases the list is evaluated lazily.
The list itself is built using iterate; if instead I use Vectors as much as possible and build the list by using V.iterateN, again it becomes slower ...
My question is, is there a way (other than splicing V.toList and V.fromList at random places in the code) to predict which one will be the quickest? (BTW, I compile everything using ghc -O2 with the current stable version.)
Vectors are strict, and have O(1) subsets (e.g. take). They also have an optimized insert and delete. So you will sometimes see performance improvements by switching data structures on the fly. However, it is usually the wrong approach -- keeping all data in either one form or the other is better. (And you're using UArrays as well -- further confusing the issue).
General rules:
If the data is large and is being transformed only in bulk fashion, using a dense, efficient structure like a vector makes sense.
If the data is small, and traversed linearly and rarely, then lists make sense.
Remember that operations on lists and vectors have different complexity, so while iterate . replicate on lists is O(n) but lazy, the same on vectors will not necessarily be as efficient (you should prefer the built-in methods in vector to generate arrays).
Generally, vectors should always be better for numerical operations. It might be that you have to use different functions than you do with lists.
I would stick to vectors only. Avoid UArrays, and avoid lists except as generators.

Object Oriented implementation of graph data structures

I have been reading quite a bit about graph data structures lately, as I intend to write my own UML tool. As far as I can see, what I want can be modeled as a simple graph consisting of vertices and edges. Vertices will have a few values, and so will best be represented as objects. Edges do not, as far as I can see, need to be either directed or weighted, but I do not want to choose an implementation that makes it impossible to include such properties later on.
Being educated in pure object-oriented programming, the first thing that comes to my mind is representing vertices and edges by classes, for example:
Class: Vertice
- Array arrayOfEdges;
- String name;
Class: Edge
- Vertice from;
- Vertice to;
This gives me the possibility to later introduce weights, direction, and so on. Now, when I read up on implementing graphs, it seems that this is a very uncommon solution. Earlier questions here on Stack Overflow suggest adjacency lists and adjacency matrices, but being completely new to graphs, I have a hard time understanding why those are better than my approach.
The most important aspects of my application are the ability to easily calculate which vertex is clicked and moved, and the ability to add and remove vertices and edges between the vertices. Will this be easier to accomplish in one implementation than in another?
My language of choice is Objective-C, but I do not believe that this should be of any significance.
Here are the two basic graph types along with their typical implementations:
Dense Graphs:
Adjacency Matrix
Incidence Matrix
Sparse Graphs:
Adjacency List
Incidence List
In the graph framework (closed source, unfortunately) that I've been writing (>12k loc graph implementations + >5k loc unit tests and still counting) I've been able to implement (Directed/Undirected/Mixed) Hypergraphs, (Directed/Undirected/Mixed) Multigraphs, (Directed/Undirected/Mixed) Ordered Graphs, (Directed/Undirected/Mixed) KPartite Graphs, as well as all kinds of Trees, such as Generic Trees, (A,B)-Trees, KAry-Trees, Full-KAry-Trees, (Trees to come: VP-Trees, KD-Trees, BKTrees, B-Trees, R-Trees, Octrees, …).
And all without a single vertex or edge class. Purely generics. And with little to no redundant implementations.
Oh, and as if this wasn't enough they all exist as mutable, immutable, observable (NSNotification), thread-unsafe and thread-safe versions.
How? Through excessive use of Decorators.
Basically all graphs are mutable, thread-unsafe and not observable. So I use Decorators to add all kinds of flavors to them (resulting in no more than 35 classes, vs. 500+ if implemented without decorators, right now).
While I cannot give any actual code, my graphs are basically implemented via Incidence Lists by use of mainly NSMutableDictionaries and NSMutableSets (and NSMutableArrays for my ordered Trees).
My Undirected Sparse Graph has nothing but these ivars, e.g.:
NSMutableDictionary *vertices;
NSMutableDictionary *edges;
The ivar vertices maps vertices to adjacency maps of vertices to incident edges ({"vertex": {"vertex": "edge"}})
And the ivar edges maps edges to incident vertex pairs ({"edge": {"vertex", "vertex"}}), with Pair being a pair data object holding an edge's head vertex and tail vertex.
Mixed Sparse Graphs would have a slightly different mapping of adjacency/incidence lists and so would Directed Sparse Graphs, but you should get the idea.
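Since the original code cannot be shared, here is a rough Python sketch of the two-dictionary layout described above, with plain dicts standing in for the NSMutableDictionary ivars. The class and method names are hypothetical, and parallel edges are deliberately rejected to keep it short.

```python
class UndirectedSparseGraph:
    """Incidence-list graph held in two dictionaries, with no vertex or edge class."""

    def __init__(self):
        self.vertices = {}  # vertex -> {adjacent vertex -> incident edge}
        self.edges = {}     # edge -> (vertex, vertex)

    def add_vertex(self, v):
        self.vertices.setdefault(v, {})

    def add_edge(self, edge, u, v):
        # Edge objects are dictionary keys, so each one must be unique.
        if edge in self.edges:
            raise ValueError("edge objects must be unique")
        if v in self.vertices.get(u, {}):
            raise ValueError("parallel edges are not supported in this sketch")
        self.add_vertex(u)
        self.add_vertex(v)
        self.vertices[u][v] = edge
        self.vertices[v][u] = edge
        self.edges[edge] = (u, v)

    def remove_edge(self, edge):
        u, v = self.edges.pop(edge)
        del self.vertices[u][v]
        if u != v:  # a self-loop has only one adjacency entry
            del self.vertices[v][u]

    def remove_vertex(self, v):
        # Drop every incident edge first so no dangling references remain.
        for edge in list(self.vertices[v].values()):
            self.remove_edge(edge)
        del self.vertices[v]

    def neighbours(self, v):
        return set(self.vertices[v])

graph = UndirectedSparseGraph()
graph.add_edge("e1", "a", "b")
graph.add_edge("e2", "b", "c")
```

Any hashable object can serve as a vertex or an edge here, which is roughly the role that the NSCopying requirement plays for dictionary keys in the Objective-C version.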
A limitation of this implementation is that every vertex and every edge needs to have an object associated with it. And to make things a bit more interesting (sic!), each vertex object needs to be unique, and so does each edge object. This is because dictionaries don't allow duplicate keys. Also, objects need to implement NSCopying. NSValueTransformers or value-encapsulation are a way to sidestep these limitations, though (same goes for the memory overhead from dictionary key copying).
While the implementation has its downsides, there's a big benefit: immense versatility!
There's hardly any type of graph that I could think of that's impossible to achieve with what I already have. Instead of building each type of graph with custom-built parts, you basically go to your box of Lego bricks and assemble the graphs just the way you need them.
Some more insight:
Every major graph type has its own Protocol, here are a few:
HypergraphProtocol
MultigraphProtocol [tagging protocol] (allows parallel edges)
GraphProtocol (allows directed & undirected edges)
UndirectedGraphProtocol [tagging protocol] (allows only undirected edges)
DirectedGraphProtocol [tagging protocol] (allows only directed edges)
ForestProtocol (allows sets of disjunct trees)
TreeProtocol (allows trees)
ABTreeProtocol (allows trees of a-b children per vertex)
FullKAryTreeProtocol [tagging protocol] (allows trees of either 0 or k children per vertex)
The protocol nesting implies inheritance (of both protocols and implementations).
If there's anything else you'd like more insight into, feel free to leave a comment.
Ps: To give credit where credit is due: Architecture was highly influenced by the
JUNG Java graph framework (55k+ loc).
Pps: Before choosing this type of implementation I had written a little brother of it with just undirected graphs, which I wanted to expand to also support directed graphs. My implementation was pretty similar to the one you are providing in your question. This is what brought my first (rather naïve) project to an abrupt end back then: "Subclassing a set of inter-dependent classes in Objective-C and ensuring type-safety". Adding simple directedness to my graph caused my entire code to break apart. (I didn't even use the solution that I posted back then, as it would have just postponed the pain.) Now with the generic implementation I have more than 20 graph flavors implemented, with no hacks at all. It's worth it.
If all you want is drawing a graph and being able to move its nodes on the screen, though, you'd be fine with just implementing a generic graph class that can then later on be extended to specific directedness, if needed.
An adjacency matrix will have a bit more difficulty than your object model in adding and removing vertices (but not edges), since this involves adding and removing rows and columns from a matrix. There are tricks you could use to do this, like keeping empty rows and columns, but it will still be a bit complicated.
When moving a vertex around the screen, the edges will also be moved. This also gives your object model a slight advantage, since it will have a list of connected edges and will not have to search through the matrix.
Both models have an inherent directedness to the edges, so if you want to have undirected edges, then you will have to do additional work either way.
I would say that overall there is not a whole lot of difference. If I were implementing this, I would probably do something similar to what you are doing.
If you're using Objective-C I assume you have access to Core Data, which would probably be a great place to start. I understand you're creating your own graph; the strength of Core Data is that it can do a lot of the checking you're talking about for free if you set up your schema properly.