How to move my pandas dataframe to d3?

I am new to Python and have worked my way through a few books on it. Everything is great, except visualizations. I really dislike matplotlib and Bokeh requires too heavy of a stack.
The workflow I want is:
Data munging/analysis using pandas in an IPython notebook -> visualization using d3 in Sublime Text 2
However, being new to both Python and d3, I don't know the best way to export my pandas dataframe to d3. Should I just have it as a csv? JSON? Or is there a more direct way?
Side question: is there any (reasonable) way to do everything in an IPython notebook instead of switching to Sublime Text?
Any help would be appreciated.

Basically, there is no single best format that will fit all your visualization needs.
It really depends on the visualizations you want to obtain.
For example, a Stacked Bar Chart takes a CSV file as input, while an adjacency matrix visualization takes a JSON format.
From my experience:
to display relations between items, like an adjacency matrix or a chord diagram, you will prefer a JSON format, which lets you describe only the relations that actually exist. Data are stored as in a sparse matrix, and several levels of data can be nested using dictionaries. Moreover, this format can be parsed directly in Python.
to display properties of an array of items, a CSV format can be fine. A perfect example can be found here with a parallel chart display.
to display hierarchical data, like a tree, JSON is best suited.
To help you figure out which format you need, the best thing to do is have a look at this d3js gallery.
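Whichever format you pick, pandas can write both directly; a minimal sketch (the file names and columns here are made up):

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# For d3.csv(...)-style loaders
df.to_csv("data.csv", index=False)

# orient="records" produces [{"name": "a", "value": 1}, ...], which d3.json(...) can consume
df.to_json("data.json", orient="records")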

You can use D3 directly inside Jupyter / IPython. Try the two links below:
http://blog.thedataincubator.com/2015/08/embedding-d3-in-an-ipython-notebook/
https://github.com/cmoscardi/embedded_d3_example/blob/master/Embedded_D3.ipynb
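As a rough sketch of the approach (not code from those posts; the element id and CDN URL are just illustrative), you can have a notebook cell emit a <div> plus a script that draws into it:

from IPython.display import HTML

# Renders a d3-driven element inline in the notebook output cell.
HTML("""
<div id="viz"></div>
<script src="https://d3js.org/d3.v7.min.js"></script>
<script>
  d3.select("#viz").append("p").text("hello from d3 inside the notebook");
</script>
""")

Whether the script executes can depend on the notebook frontend, so treat this as a starting point rather than a guaranteed recipe.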

Related

Geoviews bokeh vs matplotlib for plotting large xarrays

I am trying to plot a large xarray dataset of x=1000 by y=1000 for t=10 different timestamps in Google Colab.
See the following example notebook:
https://colab.research.google.com/drive/1HLCqM-x8kt0nMwbCjCos_tboeO6VqPjn
However, when I try to plot this with gv.extension('bokeh') it doesn't give any output, whereas gv.extension('matplotlib') does correctly show a plot of this data.
I suppose it has something to do with the amount of data bokeh can store in one view?
I already tried setting dynamic=True, which does make it work, but for my use case the delay when viewing different timestamps is not really desirable. The same goes for datashader's regrid, which makes it run faster but has the same unwanted delay when switching timestamps.
Is there a way to plot this large xarray with bokeh making it as smoothly visible and slidable as with matplotlib?
Or are there any other ways I can try and visualise this data interactively (on a web app)?
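For reference, the setup being described looks roughly like this; a sketch with small synthetic data standing in for the real 1000 x 1000 x 10 array (variable and dimension names are illustrative, not taken from the linked notebook):

import numpy as np
import xarray as xr
import geoviews as gv

gv.extension('bokeh')

# Small synthetic stand-in for the real dataset
data = xr.Dataset(
    {"value": (("time", "y", "x"), np.random.rand(10, 100, 100))},
    coords={"time": np.arange(10), "y": np.linspace(0, 1, 100), "x": np.linspace(0, 1, 100)},
)

# One image per timestamp; dynamic=False pre-renders every frame,
# dynamic=True renders lazily (the behaviour discussed above).
images = gv.Dataset(data).to(gv.Image, ["x", "y"], "value", dynamic=False)
images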

What is better for managing data in Python: Orange.data.Table or pandas?

I am doing data mining and I don't know whether to use Table or pandas.
Any information to help me select the most suitable library for managing my dataset would be welcome. Thanks for any answer that helps me with this.
I am an Orange programmer, and I'd say that if you are writing python scripts to analyze data, start with numpy + sklearn or Pandas.
To create an Orange.data.Table, you need to define Domain, which Orange uses for data transformations. Thus, tables in Orange are harder to create (but can, for example, provide automatic processing of testing data).
Of course, if you need to interface something specific from Orange, you will have to make a Table.
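As a rough illustration of that point, building a Table by hand looks something like this with Orange3 (the variable names and values are made up):

import numpy as np
from Orange.data import ContinuousVariable, DiscreteVariable, Domain, Table

# The Domain declares every column up front; this is what lets Orange apply
# the same transformations to training and testing data automatically.
domain = Domain(
    [ContinuousVariable("height"), ContinuousVariable("weight")],
    DiscreteVariable("class", values=("a", "b")),
)
X = np.array([[1.7, 65.0], [1.8, 80.0]])
y = np.array([0, 1])
table = Table.from_numpy(domain, X, y)

With pandas, by contrast, pd.DataFrame(...) simply infers the columns from the data you pass in.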

What is the safest way to handle different versions of a DataFrame in pandas?

I'm learning some pandas/ML type stuff. Right now I'm doing a Kaggle tutorial, and the example data we've been given has a bunch of features. I suspect that some of these features are adding noise to the model rather than helping. So, I want to apply several models to the data with all features (as in the tutorial) and record their scores as a baseline. Then, I want to remove one feature at a time, and use the same models on the data without that one feature, and compare the scores.
What's the best way to do this? Naively, I'd just make a different copy of the dataset for each removed feature, but copy() is a little confusing in pandas (as of version 0.20, the docs say it makes a deep copy by default, which should be exactly what I want, right? A copy with no connection/reference to the original?). I tried it and it didn't seem to actually make a copy.
Is there a better way? Thank you.
Using a for loop:
variables = locals()
features = ['A', 'B', 'C']
for i in features:
    variables["dfremoved{0}".format(i)] = df.drop(i, axis=1)
    # Do your fit and predict here within the for loop
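A variant that avoids writing into locals() is to keep the per-feature frames in a plain dict; a sketch assuming df holds the features and model is whatever estimator you are comparing:

features = ['A', 'B', 'C']
dropped = {}
for col in features:
    # df.drop returns a new, independent DataFrame; the original df is untouched
    dropped[col] = df.drop(columns=col)
    # fit/score `model` on dropped[col] here and record the score for comparison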

Visualizing a large data series

I have a seemingly simple problem, but an easy solution is eluding me. I have a very large series (tens or hundreds of thousands of points), and I just need to visualize it at different zoom levels, but generally zoomed well out. Basically, I want to plot it in a tool like MATLAB or pyplot, but knowing that each pixel can't represent the potentially many hundreds of points that map to it, I'd like to see both the min and the max of all the array entries that map to a pixel, so that I can generally understand what's going on. Is there a simple way of doing this?
Try hexbin. By setting the reduce_C_function I think you can get what you want. Ex:
import matplotlib.pyplot as plt
import numpy as np

# x, y are the point coordinates and C the values to aggregate, i.e. C = f(x, y)
plt.hexbin(x, y, C=C, reduce_C_function=np.max)
plt.show()
would give you a hexagonal heatmap where the color in the pixel is the maximum value in the bin.
If you only want to bin in one direction, see this method.
The first option you may want to try is Gephi: https://gephi.org/
Here is another option, though I'm not quite sure it will work. It's hard to say without seeing the data.
Try going to this link: http://bl.ocks.org/3887118. Do you see, toward the bottom of the page, data.tsv with all of the values? If you can save your data to resemble this, then the HTML code above should be able to build your data into the scatter plot example shown in that link.
Otherwise, try visiting this link to fashion your data to a more appropriate web page.
There is a set of research tools called TimeSearcher 1-3 that provide some examples of how to deal with large time-series datasets; see the example images from TimeSearcher 2 and 3.
I realized that simple plot() in MATLAB actually gives me more or less what I want. When zoomed out, it renders all of the datapoints that map to a pixel column as vertical line segments from the minimum to the maximum within the set, so as not to obscure the function's actual behavior. I used area() to increase the contrast.
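The same kind of min/max rendering can be reproduced explicitly in matplotlib by bucketing the series into roughly one bin per pixel column and plotting the envelope; a rough sketch with synthetic data:

import numpy as np
import matplotlib.pyplot as plt

y = np.random.randn(200_000).cumsum()      # stand-in for the large series
n_bins = 1000                              # roughly one bucket per pixel column
trimmed = y[: len(y) // n_bins * n_bins]   # drop the remainder so it reshapes evenly
buckets = trimmed.reshape(n_bins, -1)

x = np.arange(n_bins)
plt.fill_between(x, buckets.min(axis=1), buckets.max(axis=1))
plt.show()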

If I use python pandas, is there any need for structured arrays?

Now that pandas provides a data frame structure, is there any need for structured/record arrays in numpy? There are some modifications I need to make to an existing code which requires this structured array type framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?
pandas's DataFrame is a high level tool while structured arrays are a very low-level tool, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (structured arrays can't do most things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
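As a rough illustration of that low-level use, a structured dtype can be laid directly over a binary file with np.memmap (the file name and fields here are made up):

import numpy as np

# Each record is interpreted straight from the bytes on disk; nothing is
# loaded into memory until a field is actually touched.
dt = np.dtype([("timestamp", "i8"), ("price", "f8"), ("volume", "i4")])
records = np.memmap("ticks.bin", dtype=dt, mode="w+", shape=(1_000_000,))

records["price"][:5] = 1.0   # field access on a slice, not the whole file
records.flush()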
I'm currently in the middle of a transition to Pandas DataFrames from various Numpy arrays. This has been relatively painless since Pandas, AFAIK, is built largely on top of Numpy. What I mean by that is that .mean(), .sum(), etc. all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitrary pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round-trips to my database).
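For illustration, that kind of cross-section looks roughly like this (the index names and values are made up):

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["2012-01", "2012-02"], ["AAPL", "MSFT"]], names=["month", "ticker"]
)
df = pd.DataFrame({"ret": np.random.randn(4)}, index=index)

# Cross-section: every month for a single ticker, without a database round-trip
apple = df.xs("AAPL", level="ticker")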
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
Also, if you're like me and your main use of Scipy is the statistics module, pystatsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth
I never took the time to dig into pandas, but I use structured arrays quite often in numpy. Here are a few considerations:
structured arrays are as convenient as recarrays with less overhead, if you don't mind losing the ability to access fields by attribute. But then, have you ever tried to use min or max as a field name in a recarray?
NumPy has been developed over a far longer period than pandas, with a larger crew, and it has become ubiquitous enough that a lot of third-party packages rely on it. You can expect structured arrays to be more portable than pandas dataframes.
Are pandas DataFrames easily picklable? Can they be sent back and forth with PyTables, for example?
Unless you're 100% sure that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.
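To make the min/max naming point above concrete, a quick sketch: a plain structured array reaches the field by key, while a recarray's attribute lookup finds the ndarray method of the same name first.

import numpy as np

arr = np.zeros(3, dtype=[("min", "f8"), ("max", "f8")])
arr["min"]              # the 'min' field of the structured array

rec = arr.view(np.recarray)
rec.min                 # resolves to ndarray.min, not the field
rec["min"]              # the field is still reachable by key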