How to dynamically update a table based on selecting a single data point on a time series graph

I currently have Apache Superset set up with two visualizations, one being a Time Series Graph and the other being a Table. Both visualizations use the same dataset. I want the values of the table to be dynamically updated based on which data point I am hovering over or selecting in the Time Series Graph. Is this type of functionality possible in Apache Superset? I know it is possible in Power BI, but I'd like to know whether Superset is capable of accomplishing this as well.
(As far as I can tell right now, each visualization is independent. The only time the visualizations are linked is when a filter from a filter visualization is applied, which affects the overall dataset.)

Related

C5.0 gives back only a single leaf

I'm doing a data analysis task in SPSS Modeler and I have finally arrived at the point in the stream where I'm trying to fit some models to the data.
However, when I ran the C5.0 modeling node on my data, it generated a modeling nugget containing only a single leaf, so there are no decision rules in the model. I had partitioned the data beforehand into train and test subsets (70-30). I did not use misclassification costs and used the properly predefined attribute roles. On the node's Model page I checked the Use partitioned data, Build model for each split, Group symbolics, and Use global pruning options; I also tried expert mode, but it fails in simple mode too. I have tried different options, but it gives the same output without a single split.
How can I make the model give back a more complex decision tree? I suppose this is not the expected outcome.
Any suggestions are welcome.
Please check the distribution of your target variable and share it.
If the balance differs greatly from 50%-50%, you may need to balance your inputs first.
Misclassification costs are another technique that can give you an output, but again they should be based on your empirical distributions.
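If you can export the training data, a quick way to inspect the class balance is a relative frequency count; the file and column names below are hypothetical:

    import pandas as pd

    # Hypothetical file and column names -- adjust to your data.
    df = pd.read_csv("training_data.csv")

    # Relative frequency of each target class.
    print(df["target"].value_counts(normalize=True))

    # A heavily skewed split (e.g. 0.97 / 0.03) often leads C5.0 to predict
    # the majority class everywhere, i.e. a tree with a single leaf.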

Fast reporting with user parameters and temp result sets

I have come across a problem with reporting from SQL Server databases using SSRS, that I wonder if you could help me with.
Suppose you have a huge amount of data in a table and you want to select only the rows within certain criteria, and you want to allow the users to specify those criteria (for example, a start date and an end date). You then want to take that data (within the criteria) and perform a ton of other transformations on it, producing various temporary result sets along the way (using CTEs, table variables, or temp tables) to finally produce the report. This basically takes ages in SQL. You can do it, but your users might have to wait an hour or two from the moment they've hit View Report to their report being rendered.
I don't know much about MDX or DAX, cubes or tabular models, but I wonder if there is a quicker way to do what I want. Note the important aspect of the problem: the user specifies criteria that have to go all the way back to the original table, and then various transformations (including temp result sets) have to be applied to produce the final report.
What is the best way to do this? Am I doing it the only way possible? I know it's a broad question, but I'd like to know, theoretically, what the answer is. Where should I be looking? Should I be looking at Cubes? Tabular Models? Should I be using R in SQL Server?
There is always a balance when it comes to handling large datasets. Sometimes it makes sense to do some of the work ahead of time so that on-demand reports can run in a reasonable amount of time.
In order for a model to be a good option, here are some general guidelines:
- Many reports would be able to use common attributes from the model
- The data involves aggregates, not just lists of records
- The data does not need to be live
- You have plenty of development and testing time
- Anyone who would be using it as a data source will have to be trained on the structure and be at least slightly familiar with MDX
Another option for you to consider is to have a stored procedure that "prepares" the data for you overnight in a separate table. This table could be well indexed, because the write time is not as important. The report would then point to this table to be able to quickly retrieve the data it needs to present. This shifts most of the preparation/aggregation work to the overnight job. You can still, of course, have parameters that limit how much of this data you pull back.
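As a rough sketch of that pattern (the connection string, table, and column names here are all hypothetical, and the same logic could just as well live in a stored procedure run by SQL Agent), a nightly job could look like this:

    import pyodbc

    # Hypothetical connection string and object names -- adjust to your environment.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=mydb;Trusted_Connection=yes"
    )
    cur = conn.cursor()

    # Rebuild the pre-aggregated reporting table from the large source table.
    cur.execute("TRUNCATE TABLE dbo.ReportPrepared")
    cur.execute("""
        INSERT INTO dbo.ReportPrepared (ReportDate, CustomerId, TotalAmount)
        SELECT CAST(OrderDate AS date), CustomerId, SUM(Amount)
        FROM dbo.BigSourceTable
        GROUP BY CAST(OrderDate AS date), CustomerId
    """)
    conn.commit()

The SSRS dataset then only has to filter dbo.ReportPrepared by the user's start and end date parameters, which is a cheap, indexed range scan instead of the full transformation pipeline.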
Based on the little bit of information you've given us (300 million rows in a single non-normalized table), there is definitely a faster way. However, there will not be any quick solutions and you haven't provided enough information for me to give any recommendations.
I think you may need to seek some professional help to review your infrastructure and needs along with your usage and objectives so you can be pointed in the right direction.

Run the same IPython notebook code on two different data files, and compare

Is there a good way to modularize and re-use code in IPython Notebook (Jupyter) when doing the same analysis on two different sets of data?
For example, I have a notebook with a lot of cells doing analysis on a data file. I have another data file of the same format, and I'd like to run the same analysis and compare the output. None of these options looks particularly appealing for this:
Copy and paste the cells to a second notebook. The analysis code is now duplicated and harder to update.
Move the analysis code into a module and run it for both files. This would lose the cell-by-cell format of the figures that are currently generated and simply jumble them all together in one massive cell.
Load both files in one notebook and run the analyses side by side. This also involves a lot of copy-and-pasting, and doesn't generalize well to 3 or 4 different data files.
Is there a better way to do this?
You could lace demo directives into the standalone module, as per the IPython Demo Mode example.
Then when actually executing it in the notebook, you make a call to the demo object wrapper each time you want to step to the next important part. So your cells would mostly consist of calls to that demo wrapper object.
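A minimal sketch of what that can look like, assuming the analysis lives in a hypothetical script whose sections are separated by IPython's # <demo> stop markers:

    # analysis_script.py (hypothetical), with sections separated by demo markers:
    #
    #     df = load_data("data_a.csv")
    #     # <demo> stop
    #     plot_overview(df)
    #     # <demo> stop
    #     fit_model(df)

    from IPython.lib.demo import Demo

    demo = Demo("analysis_script.py")

    # Each call (in its own notebook cell) executes up to the next
    # '# <demo> stop' marker, so figures still appear cell by cell.
    demo()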
Option 2 is clearly the best for code re-use, it is the de facto standard arguably in all of software engineering.
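As a minimal sketch of option 2 that keeps the cell-by-cell figures, move each analysis step into its own function in a module (all file, module, and column names below are hypothetical):

    # analysis.py -- one function per "cell" of the analysis.
    import pandas as pd
    import matplotlib.pyplot as plt

    def load(path):
        return pd.read_csv(path)

    def plot_distribution(df):
        df["value"].hist()      # hypothetical column name
        plt.show()

    def summarize(df):
        return df.describe()

In the notebook, each cell is then just a call into the module, and both data files reuse the same code:

    import analysis

    df_a = analysis.load("data_a.csv")
    df_b = analysis.load("data_b.csv")

    analysis.plot_distribution(df_a)    # one cell
    analysis.plot_distribution(df_b)    # next cell
    analysis.summarize(df_a)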
I argue that the notebook concept itself doesn't scale well to 3, 4, 5, ... different data files. Notebook presentations are not meant to be batch processing receptacles. If you find yourself needing to do parameter sweeps across different data sets, and wanting to re-run analyses on top of the different data loaded for each parameter group (even when the 'parameters' might be as simple as different file names), it raises a bad code smell. It likely means the level of analysis being performed in an 'interactive' way is wrong. Witnessing analysis 'interactively' and at the same time performing batch processing are two pretty much incompatible goals.
A much better idea would be to batch process all of the parameter sets separately, 'offline' from the point of view of any presentation, and then build a set of stand-alone functions that can produce visual results from the computed and stored batch results. Then the notebook will just be a series of function calls, each of which produces summary data (some of which could be examples from a selection of parameter sets during batch processing) across all of the parameter sets at once, to invite the necessary comparisons and meaningfully present the result data side by side.
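A minimal sketch of that separation (all file and column names hypothetical): a batch script computes and stores results for every parameter set offline, and the notebook only loads and compares them.

    # batch_run.py -- compute results for every data file offline.
    import json
    import pandas as pd

    def analyze(path):
        df = pd.read_csv(path)
        # Hypothetical column name; return whatever summary statistics matter.
        return {"mean": float(df["value"].mean()), "rows": len(df)}

    if __name__ == "__main__":
        results = {name: analyze(name + ".csv") for name in ["data_a", "data_b", "data_c"]}
        with open("results.json", "w") as f:
            json.dump(results, f)

The notebook then reduces to loading results.json and presenting all parameter sets side by side, e.g. pd.DataFrame(json.load(open("results.json"))).T for a quick comparison table.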
'Witnessing' an entire interactive presentation that performs analysis on one parameter set, then changing some global variable / switching to a new notebook / running more cells in the same notebook in order to 'witness' the same presentation on a different parameter set sounds borderline useless to me, in the sense that I cannot imagine a situation where that mode of consuming the presentation is not strictly worse than consuming a targeted summary presentation that first computed results for all parameter sets of interest and assembled important results into a comparison.
Perhaps the only case I can think of would be toy pedagogical demos, like some toy frequency data and a series of notebooks that do some simple Fourier analysis or something. But that's exactly the kind of case that begs for the analysis functions to be made into a helper module, and the notebook itself just lets you selectively declare which toy input file you want to run the notebook on top of.

Is OLAP/MDX a good way to process data w/ unknown values at various aggregation levels

I'm new to OLAP, so perhaps I don't know the right terminology to use for this question, but bear with me here.
I work with lots of hierarchical, multidimensional data where parent/aggregated cells mostly have data, but child/leaf cells are often missing data (attribute values are unknown but non-zero). I currently use a combination of scripting and SQL to work with it, but that's getting unwieldy. It seems like OLAP cubes and MDX are better suited to the structure of the data, but not necessarily to tasks I need to do with it. For example:
OLAP seems mainly designed for read-only reporting; I do a lot of modifications to the data in batch processes
OLAP seems to like having complete leaf-level data to calculate aggregates; my data has missing values at various levels
Examples of what I want to do:
Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
Queries and calculations need to be able to handle the unknowns properly. Ideally be able to easily query how much of an aggregated cell's value is made up of estimated vs. known values, possibly compute confidence/error statistics, or check whether we can derive an exact value for an unknown when it has a known parent and all known siblings, etc.
Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
Could an OLAP server and MDX be a good tool for this type of work? Are there any other tools that would work well for manipulating hierarchical/multidimensional/gap-filled data?
Those are some demanding needs for an OLAP system - interesting and challenging :-)
- Load original multi-level data into cube and preserve known parents; don't overwrite or display their values as calculated aggregates of children (which may be incomplete).
You can change the way cubes aggregate values in a hierarchy. Doing this in one hierarchy is fine; doing it in multiple hierarchies might start to get complicated. It's worth checking twice whether there is a mathematically unique solution to the problem when multiple 'special' hierarchies are involved.
- Create/update/delete cells in a cube based on results from complicated queries/joins of other cubes. Sometimes a cube needs to be transformed to use a slightly different dimension definition.
Here you can use writeback (the MDX UPDATE CUBE statement), but I think it's a bit too simple for your needs. Implementations depend on the vendor. Pay attention: creating cells can kill your memory, as for large cubes you can quickly have millions of cells in a subcube.
What is the sparsity of your model? That is, the number of cells with data divided by the total number of cells.
Some models have sparsities of 1e-30; it's easy to explode if you're updating all cells ;-).
- Users require estimates for unknown values. I can create decent estimates, but need to adjust them so they conform to known parents/children across all dimensions and levels (this is much harder than it sounds). I am already doing this, but it involves pulling the data out of the RDBMS into a custom executable.
This is looking complicated. The issues here are the complexity of the algorithms, whether a solution can be expressed in the MDX language, and how well it matches the OLAP engine (i.e. whether it is fast enough). You're taking the risk that it explodes, but have a look at the MDX SCOPE statement.
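Outside of MDX, the basic adjustment being described - scaling estimated children so they sum to a known parent - can be sketched as below; all names are hypothetical, and reconciling across several hierarchies at once is considerably harder than this single-parent case:

    # Proportionally scale estimated child values so they sum to a known parent total.
    def reconcile_children(children, parent_total):
        """children: dict of child name -> estimated value; parent_total: known parent value."""
        estimated_sum = sum(children.values())
        if estimated_sum == 0:
            # Degenerate case: spread the parent evenly (one possible convention).
            return {k: parent_total / len(children) for k in children}
        factor = parent_total / estimated_sum
        return {k: v * factor for k, v in children.items()}

    # Example: estimates sum to 90 but the known parent value is 100.
    print(reconcile_children({"east": 30.0, "west": 60.0}, 100.0))
    # {'east': 33.33..., 'west': 66.66...}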
- Data can be large... up to tens of millions of fact table rows. Performance needs to be decent for batch jobs (minutes are ok, hours not so much).
That should not be a real challenge.
To answer your question, I don't think so. We have a similar problem - in the field of genetics - and we are going to solve it by 'adding' a dedicated calculation module to our OLAP solution. It's an interesting ongoing project.

What is the difference between Grid and Matrix in SAP B1?

When should I use a Matrix and when should I use a Grid?
When you want to work with UDOs, use a Matrix; otherwise use a Grid.
Some differences:
Matrix
- To load the data you need a DBDataSource
- Effective for UDOs
- Filled automatically by SAP B1 when you navigate
Grid
- You can load the data using a SQL query
- Has problems with columns of type double (you must use a workaround for this)
- You cannot modify UDO data (it is for viewing only); it can't be linked to a UDO
- You can create levels of visualization (expand/collapse rows)
- A Grid is much faster when loading large amounts of data
Bye.
In short:
Grid is faster in certain situations, but Matrix is more versatile (at least in SBO version 2007 and earlier).
Check out the SAP SDN forums at https://www.sdn.sap.com/; they are a big source of information.
A Grid is best used for display purposes; it can easily display huge amounts of data. A Matrix, on the other hand, is best used for maintaining data because it can be bound to a UDO, letting the user easily access the data at screen level. No extra code needs to be written as with a Grid; everything is handled by the SAP object.