Run the same IPython notebook code on two different data files, and compare - matplotlib

Is there a good way to modularize and re-use code in IPython Notebook (Jupyter) when doing the same analysis on two different sets of data?
For example, I have a notebook with a lot of cells doing analysis on a data file. I have another data file of the same format, and I'd like to run the same analysis and compare the output. None of these options looks particularly appealing for this:
Copy and paste the cells to a second notebook. The analysis code is now duplicated and harder to update.
Move the analysis code into a module and run it for both files. This would lose the cell-by-cell format of the figures that are currently generated and simply jumble them all together in one massive cell.
Load both files in one notebook and run the analyses side by side. This also involves a lot of copy-and-pasting, and doesn't generalize well to 3 or 4 different data files.
Is there a better way to do this?

You could lace demo directives into the standalone module, as per the IPython Demo Mode example.
Then when actually executing it in the notebook, you make a call to the demo object wrapper each time you want to step to the next important part. So your cells would mostly consist of calls to that demo wrapper object.
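A minimal sketch of that workflow (script name, file name, and columns are made up): the analysis lives in a plain script whose important steps are separated by # <demo> stop markers, and the notebook steps through it with IPython.lib.demo.Demo.

    # analysis_demo.py -- hypothetical standalone analysis script
    import pandas as pd
    import matplotlib.pyplot as plt

    DATA_FILE = 'experiment_a.csv'   # edit (or parameterize) per data file
    data = pd.read_csv(DATA_FILE)
    print(data.describe())

    # <demo> stop

    data.plot(x='time', y='value')   # hypothetical column names
    plt.show()

Each notebook cell then just advances the demo by one block:

    from IPython.lib.demo import Demo

    demo = Demo('analysis_demo.py')
    demo()   # cell 1: runs everything up to the first <demo> stop
    demo()   # cell 2: runs the plotting block; demo.reset() starts over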
Option 2 is clearly the best for code re-use; moving shared analysis code into a module is arguably the de facto standard across all of software engineering.
I would argue that the notebook concept itself doesn't scale well to 3, 4, 5, ... different data files. Notebook presentations are not meant to be batch-processing receptacles. If you find yourself needing to do parameter sweeps across different data sets, and wanting to re-run analyses on top of the different data loaded for each parameter group (even when the 'parameters' might be as simple as different file names), it is a bad code smell. It likely means the level of analysis being performed 'interactively' is wrong: witnessing analysis interactively and performing batch processing are two largely incompatible goals.
A much better idea is to batch process all of the parameter sets separately, 'offline' from the point of view of any presentation, and then build a set of stand-alone functions that can produce visual results from the computed and stored batch results. The notebook then becomes just a series of function calls, each of which produces summary data (some of which could be examples drawn from a selection of parameter sets during batch processing) across all of the parameter sets at once, inviting the necessary comparisons and presenting the result data side by side.
'Witnessing' an entire interactive presentation that performs analysis on one parameter set, then changing some global variable / switching to a new notebook / running more cells in the same notebook in order to 'witness' the same presentation on a different parameter set, sounds borderline useless to me. I cannot imagine a situation where that mode of consuming the presentation is not strictly worse than a targeted summary presentation that first computes results for all parameter sets of interest and then assembles the important results into a comparison.
Perhaps the only case I can think of would be toy pedagogical demos, like some toy frequency data and a series of notebooks that do some simple Fourier analysis. But that is exactly the kind of case that begs for the analysis functions to be moved into a helper module, with the notebook itself just letting you selectively declare which toy input file to run on top of.
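For what it's worth, a minimal sketch of that 'helper module plus thin notebook' layout (module, file, and column names are all hypothetical):

    # analysis_utils.py -- hypothetical helper module shared by all notebooks
    import pandas as pd
    import matplotlib.pyplot as plt

    def load_and_summarise(path):
        data = pd.read_csv(path)
        return data.groupby('condition')['value'].mean()   # hypothetical columns

    def plot_comparison(results):
        # results: dict mapping a label to the summary for one data file
        fig, axes = plt.subplots(1, len(results), sharey=True, squeeze=False)
        for ax, (label, summary) in zip(axes[0], results.items()):
            summary.plot.bar(ax=ax, title=label)
        return fig

The notebook then reduces to selecting the input files and calling the comparison functions:

    from analysis_utils import load_and_summarise, plot_comparison

    files = {'run A': 'data_a.csv', 'run B': 'data_b.csv'}   # add more files as needed
    results = {label: load_and_summarise(path) for label, path in files.items()}
    plot_comparison(results);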

Related

Why are my final sample sizes different on SPSS compared to PROCESS MACRO

I'm doing a moderation analysis in SPSS. I recoded all my variables and coded 'inapplicable', 'don't know', and 'doesn't apply' as -99 so they would be treated as missing values and excluded from the analysis. When I run the data through the PROCESS macro, I get a sample size of 680. When I run it through SPSS, I get 768. I also centered the variables in both PROCESS and SPSS.
Not sure what I’m doing wrong and why I’m getting two different sample sizes.
Any suggestions?
I tried running a frequency analysis but I'm not getting the same results as PROCESS. Not sure if I'm doing something wrong there.

C5.0 gives back only a single leaf

I'm doing a data analysis task in SPSS Modeler and have finally arrived at the point in the stream where I'm trying to fit some models to the data.
However, when I run the C5.0 modeling node on my data, it generates a modeling nugget containing only a single leaf, so there are no decision rules in the model. I partitioned the data into train and test subsets (70-30) beforehand. I did not use misclassification costs and used properly predefined attribute roles. On the model page I checked the Use partitioned data, Build model for each split, Group symbolics, and Use global pruning options; I also tried expert mode, but it fails in simple mode too. I have tried different options, but it gives the same output without a single split.
How can I make the model give back a more complex decision tree? I assume this is not the expected outcome.
Any suggestions are welcome.
Please check the distribution of your target variable and share it.
If the balance differs greatly from 50%-50%, you may need to balance your inputs first.
Misclassification costs are another technique to force the model to produce an output, but again they should be based on your empirical distributions.
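If it helps, the same sanity check can be sketched outside Modeler in a few lines of Python (file and column names are hypothetical); the point is simply to look at the target's class frequencies before trusting a single-leaf tree:

    import pandas as pd

    df = pd.read_csv('training_partition.csv')   # hypothetical export of the training data
    freq = df['target'].value_counts(normalize=True)
    print(freq)   # e.g. 0.97 vs 0.03 would explain a tree that never splits

    # One simple remedy: downsample the majority class before modelling.
    minority = df[df['target'] == freq.idxmin()]
    majority = df[df['target'] == freq.idxmax()].sample(n=len(minority), random_state=0)
    balanced = pd.concat([minority, majority])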

How Data and String are treated in graphlab

I have a large data set in which some columns are dates and others are categorical data such as Status, Department Name, and Country Name.
How is this data treated in GraphLab when I call the graphlab.linear_regression.create method? Do I have to pre-process the data and convert these columns to numbers, or can I provide them to GraphLab directly?
GraphLab is mostly used for computing on tabular and graph-based datasets, and has high scalability and performance. In graphlab.linear_regression.create, GraphLab has a built-in feature that inspects the type of the data and chooses the most suitable method of linear regression for optimizing results. For example, when the target and features are all numeric, GraphLab will most of the time use Newton's method; similarly, depending on the dataset, it understands what is needed and picks the method accordingly.
Regarding preprocessing: GraphLab only accepts an SFrame for learning, so the data needs to parse correctly before any learning can happen. While creating an SFrame, unprocessed or malformed data is flagged and throws an error, so you need reasonably clean data in order to train anything. If the SFrame accepts the data, along with your chosen target and features, you are good to go, but pre-processing and cleaning the data is still recommended. It is also good practice to do feature engineering before running any learning algorithm, and redefining data types before learning generally helps accuracy.
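As a rough sketch (assuming GraphLab Create, with made-up file and column names): string columns can be passed to linear_regression.create directly, since it treats str-typed features as categorical internally, while date columns are easier to use once expanded into numeric parts:

    import graphlab

    sf = graphlab.SFrame.read_csv('records.csv')   # type inference happens here

    # Expand the date column into numeric year/month/day features.
    sf['Date'] = sf['Date'].str_to_datetime('%Y-%m-%d')   # adjust to your date format
    sf = sf.add_columns(sf['Date'].split_datetime(column_name_prefix='date'))

    features = ['Status', 'Department Name', 'Country Name',   # str columns: used as categorical
                'date.year', 'date.month', 'date.day']
    model = graphlab.linear_regression.create(sf, target='Amount', features=features)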
As for how data is treated in GraphLab: it depends. Some datasets are tabular and are treated accordingly, and some have a graph structure. GraphLab performs very well with regression trees and boosted classifiers (which follow the decision-tree concept), which are quite time- and resource-consuming in other libraries.
For me, GraphLab performed very well when building a recommendation engine where I had a dataset of nodes and edges, and a boosted tree classifier with 18 iterations also worked flawlessly in quite scalable time. So even for tree-structured data, GraphLab performs very well. I hope this answer helps.

solutions for cleaning/manipulating big data (currently using Stata)

I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows) which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50G of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I suspect the background operations are causing Stata to run into virtual memory.)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and use informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have re-written some of my Stata data manipulation code in SQL, and the response time is much quicker. I believe I can make large optimization gains by using indexes correctly and, if necessary, parallel processing via partitions/shards. Once all the data manipulation is done, I can import the data into Stata via ODBC.
Since you are familiar with Stata, there is a well-documented Stata FAQ about large data sets, "Dealing with Large Datasets", that you might find helpful.
I would clean by columns: split them up, run any column-specific cleaning routines, and merge them back in later.
Depending on your machine resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should also reduce the size of your dataset quite a lot.
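Outside of Stata, the "partition, process, recombine" idea from the question can also be sketched with chunked reads in Python/pandas (file and column names are hypothetical): each chunk is cleaned and aggregated independently, and only the small per-chunk summaries are combined at the end.

    import pandas as pd

    partial = []
    for chunk in pd.read_csv('full_data.csv', chunksize=5_000_000):
        chunk = chunk[chunk['value'] != -99]                 # per-chunk cleaning
        partial.append(chunk.groupby('group_id')['value']
                            .agg(['sum', 'count']))          # per-chunk aggregation

    # Combine the chunk-level sums and counts into overall per-group means
    # (in spirit, what Stata's collapse would do on the full dataset).
    combined = pd.concat(partial).groupby(level=0).sum()
    combined['mean'] = combined['sum'] / combined['count']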

Correctness testing for process modelling application

Our group is building a process modelling application that simulates an industrial process. The final output of this process is a set of numbers representing chemistry and flow rates.
This application is based on some very old software that uses the exact same underlying mathematical model to create the simulation. Thousands of variables are involved in the simulation.
Although each component has been unit tested, we now need to be able to make sure that the data output produced by our software matches that of the old simulation software. I am wondering how best to approach this issue in a formalised and rigorous manner.
The old program works by specifying the input via a text file, so I was thinking we could programmatically take each variable, adjust its value in the file (and correspondingly in our new application), then compare the outputs between the new and old applications. We would do this for every variable in the model.
We know the allowable range for each variable, so I suppose a random sample of a few values across each variable's range is enough to show correctness for that particular variable.
Any thoughts on this approach? Any other ideas?
Comparing the output of the old and new applications is definitely a good idea. This is sometimes called back-to-back testing.
Regarding test input samples, familiarize yourself with the following concepts:
Equivalence partitioning
Boundary-value analysis
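A sketch of such a back-to-back harness in Python (the executables, input format, and output format are placeholders for your own programs), combining boundary values with a few random interior points per variable:

    import random
    import subprocess

    TOL = 1e-6   # acceptable absolute difference between old and new outputs

    def write_input_file(path, inputs):
        # Hypothetical input format: one "name = value" line per variable.
        with open(path, 'w') as f:
            for name, value in inputs.items():
                f.write(f'{name} = {value}\n')

    def run_simulator(exe, input_path, output_path):
        # Hypothetical convention: the simulator reads the input file and
        # writes "name value" pairs to the output file.
        subprocess.run([exe, input_path, output_path], check=True)
        results = {}
        with open(output_path) as f:
            for line in f:
                name, value = line.split()
                results[name] = float(value)
        return results

    def sample_points(lo, hi, n_interior=3):
        # Boundary values plus a few random interior points.
        return [lo, hi] + [random.uniform(lo, hi) for _ in range(n_interior)]

    def back_to_back(variable_ranges, baseline):
        # Vary one variable at a time within its allowable range and compare outputs.
        failures = []
        for name, (lo, hi) in variable_ranges.items():
            for value in sample_points(lo, hi):
                inputs = dict(baseline, **{name: value})
                write_input_file('case.txt', inputs)
                old = run_simulator('./old_simulator', 'case.txt', 'old.out')
                new = run_simulator('./new_simulator', 'case.txt', 'new.out')
                for key, old_val in old.items():
                    new_val = new.get(key)
                    if new_val is None or abs(old_val - new_val) > TOL:
                        failures.append((name, value, key, old_val, new_val))
        return failures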