I have Essbase as the BI solution (for predictive analytics and data mining) at my current workplace. It's a really clunky tool, hard to configure and slow to use. We're looking at alternatives. Any pointers as to where I can start?
Is Microsoft Analysis Services an option I can look at? SAS or any others?
Essbase's focus and strength are in the information management space, not in predictive analytics and data mining.
The top (and most expensive) players in this space are SAS (with the Enterprise Miner and Enterprise Guide combination) and IBM with SPSS.
Microsoft SSAS (Analysis Services) is a lot less expensive (it's included with some SQL Server editions) and has good data mining capabilities, but it is more limited in the OR (operations research) and econometrics/statistics space.
You could also use R, an open-source alternative that is growing in popularity and capability over time; for example, some strong BI players (SAP, MicroStrategy, Tableau, etc.) are developing R integration for predictive analytics and data mining.
Check www.kpionline.com; it is a cloud-based product built on Artus.
It has many prebuilt dashboards, scenarios, and functions for doing analysis.
Another tool you could check is MicroStrategy. It has many analysis functions.
I am looking to convert an Excel spreadsheet into more of a scalable solution for reporting. The volume of data is not very large: at the moment the spreadsheet is around 5k rows and grows by about 10 rows every day. There are also semi-frequent changes in how we capture information, i.e. new columns as we start to mature the processes. The spreadsheet just stores attribute or dimension data on cases.
I am just unsure whether I should use a traditional SQL database or a NoSQL database (or any other tool). I have no experience with NoSQL, but I understand that it is designed to be very flexible, which is what I want compared to a traditional DB.
Any thoughts would be appreciated :) !
Your dataset is really small and any SQL database (say, PostgreSQL) will work just fine. Stay away from NoSQL DBs as they are more limited in terms of reporting capability.
However, since your fact schema is still not stable ("new columns as we start to mature the processes"), you may simply use your spreadsheet as a data source in BI tools. To keep your reports up to date, you may use the following process:
Store your spreadsheet on cloud storage (like Google Drive or OneDrive).
Use a codeless automation platform (like Zapier) to set up a job that syncs the spreadsheet file with the BI tool when it changes. This is easily possible if the BI tool is SeekTable, for instance.
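If you do go the SQL-database route mentioned above instead, here is a minimal sketch of what that could look like in PostgreSQL (the table and column names are made up for illustration): a plain column for each stable attribute, plus a JSONB column to absorb the attributes that are still changing, so schema churn doesn't force a migration every time a new column appears.

    -- Hypothetical PostgreSQL schema for the "cases" spreadsheet.
    -- Stable columns get real columns; still-evolving attributes land in JSONB.
    CREATE TABLE cases (
        case_id     serial PRIMARY KEY,
        created_on  date  NOT NULL,
        status      text  NOT NULL,
        extra_attrs jsonb NOT NULL DEFAULT '{}'::jsonb  -- new attributes go here first
    );

    -- Reporting queries can still reach into the flexible part:
    SELECT status,
           count(*)                                          AS case_count,
           count(*) FILTER (WHERE extra_attrs ? 'escalated') AS escalated_count
    FROM cases
    GROUP BY status;

Once a JSONB attribute stabilises, it can be promoted to a real column, and most BI tools can query a table like this directly.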
Currently I am looking for a big data technology that supports big data geospatial analysis. I came across ESRI and found that its main strength is geospatial data analysis and visualization. However, they currently don't have extensive support for big data geospatial analysis, except for ArcGIS GeoAnalytics Server, which requires licensing. At the same time, I found how powerful Google BigQuery is, which recently added support for geospatial processing and analysis (pay for what you use, per second).
What I would like to know is: which tool should I pick for geospatial big data processing, analysis and visualization? And which tool (ESRI vs. BigQuery) is better used for what?
I would like to run complex queries on a very large temporal geospatial dataset and finally visualize the results on a map.
Please note that I have just started my research on geospatial big data processing and I would like to choose between the alternative tools out there.
Any help is much appreciated!!
(Note that Stack Overflow doesn't always welcome this type of question... but you can always come to https://reddit.com/r/bigquery for more discussion.)
For "Geospatial big data processing, analysis and visualization" my favorite recommendation right now is Carto+BigQuery. Carto is one of the leading GIS analysis companies, and they recently announced moving one of their backends to BigQuery. They even published some handy notebooks showing how to work with Carto and BigQuery:
https://carto.com/blog/carto-google-bigquery-data/
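To give a feel for the BigQuery side, here is a sketch of the kind of temporal geospatial query the question describes; the project, dataset, table and column names are invented for illustration:

    -- Hypothetical BigQuery GIS query: count events within 1 km of a reference
    -- point, per day, over a large temporal geospatial table.
    SELECT
      DATE(event_ts) AS event_day,
      COUNT(*)       AS events_nearby
    FROM `my_project.my_dataset.events`
    WHERE ST_DWITHIN(
            ST_GEOGPOINT(lon, lat),          -- point built from the row's coordinates
            ST_GEOGPOINT(-73.9857, 40.7484), -- reference point (longitude, latitude)
            1000)                            -- distance threshold in meters
    GROUP BY event_day
    ORDER BY event_day;

A result set like this can then be handed to Carto (or any other mapping layer) for visualization.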
I don't want to use ADL and ADLA as a black box. I need to understand how the gears turn under the hood so I can use them efficiently.
Where can I find information that describes the internals:
how a U-SQL query is processed
how parallelism works
how storage is organized in ADL at a low level
how a database's storage is organized in ADL at a low level (is it rowstore or columnstore?)
how partitioning is organized
etc.
There are a lot of books and whitepapers that describe RDBMS engine internals. Does anything like that exist for ADL/ADLA?
A lot of people here work on Azure. Could you publish any drafts/whitepapers we could use as-is (unofficially)?
Some of that information is available in presentations we have given. For example, you can find some of these presentations on my SlideShare account at: http://www.slideshare.net/MichaelRys.
To answer some of your questions above:
The current clustered-index version of U-SQL tables is stored in your catalog folder as so-called structured stream files. These are highly compressible, scaled-out files that use a row-oriented structure with self-contained metadata and statistics (more detailed stats can be created). The table construct provides two-level partitioning: addressable partitions and internal distribution schemes (HASH, RANGE, etc.). Both help with parallelization, although distribution schemes are more for performance while partitions are more for data lifecycle management. There is no limit on them, although the sweet spot is 1 GB to 4 GB per distribution bucket.
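As a rough illustration of that two-level scheme, a U-SQL table declaration combining a clustered index, addressable partitions and a hash distribution might look roughly like the following. The names are invented and the exact DDL syntax varies between U-SQL releases, so treat this as a sketch of the concepts rather than copy-paste DDL:

    // Hypothetical U-SQL table DDL -- syntax is approximate, check the current U-SQL reference.
    CREATE TABLE dbo.Events
    (
        UserId    int,
        Region    string,
        EventDate DateTime,
        INDEX idx_Events CLUSTERED (UserId ASC)  // clustered index, stored as structured stream files
        PARTITIONED BY (EventDate)               // addressable partitions: data lifecycle management
        DISTRIBUTED BY HASH (UserId) INTO 16     // internal distribution scheme: parallelism/performance
    );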
1 AU is basically 1 container. And ADLS is NOT HDFS architecturally, but it offers the WebHDFS API for compatibility.
This is a pretty broad question. I assume you've started with the existing documentation on ADLA and U-SQL?
https://learn.microsoft.com/en-us/azure/data-lake-analytics/
https://msdn.microsoft.com/library/azure/mt591959
ADLA went GA in November 2016, compared to SQL Server in 1987 - that's a very apples-and-oranges comparison.
Maybe we can start with your specific questions?
I've just started work at a new company that has a data warehouse using some bizarre proprietary ETL built in PHP.
I'm looking for arguments as to why it's worth the investment to move to a standard system such as SSIS or Informatica. The primary reasons I have at the moment are:
A wider and more diverse community of developers available for contract work, replacements etc.
A large online knowledge base/support network
Ongoing updates and support will be better
What are other good high-level arguments for bringing in a little standardisation? :)
The only real disadvantage is that a lot of the data sources are web APIs returning individual row-by-row records, which are more easily looped through with PHP than with a standard ETL tool.
Here are some more:
Simplifies development and deployment process.
Easy to debug and incorporate changes. Would reduce maintenance and enhancement costs.
Industry-standard ETL tools perform better on large volumes of data because they use techniques like grid computing, parallel processing, and partitioning.
They can support many types of data as source or target, so there is less impact if source or target systems are migrated to a different data store.
Code is reusable: the same component can be used in multiple processes.
I'm an ETL developer using different tools for ETL tasks. The same question arises in all our projects: the importance of data profiling before the data warehouse is built and before the ETL is built for data movement. Usually I have done data profiling (i.e. finding bad data, data anomalies, counts, distinct values, etc.) using pure SQL, because ETL tools do not provide a good alternative for this (there are some data quality components in our tools, but they are not very sophisticated). One option is to use the R programming language or tools like SPSS Modeler for this kind of exploratory data analysis, but usually such tools are not available, or do not cope well when there are millions of rows of data.
How do you do this kind of profiling using SQL? Are there any helper scripts available? How do you do this kind of exploratory data analysis before data cleaning and ETL?
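For context, the kind of pure-SQL profiling I currently do looks roughly like this (the table and column names are just placeholders):

    -- Profile a single column: row counts, nulls, cardinality, value range.
    SELECT
        COUNT(*)                      AS row_count,
        COUNT(customer_id)            AS non_null_count,
        COUNT(*) - COUNT(customer_id) AS null_count,
        COUNT(DISTINCT customer_id)   AS distinct_count,
        MIN(customer_id)              AS min_value,
        MAX(customer_id)              AS max_value
    FROM staging.source_table;

    -- Frequency distribution to spot anomalies and bad codes.
    SELECT status_code, COUNT(*) AS cnt
    FROM staging.source_table
    GROUP BY status_code
    ORDER BY cnt DESC;

Writing and maintaining this per column is the tedious part I would like to avoid.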
Load the data into some staging system and use the Data Profiling Task from SSIS. Use this link http://gowdhamand.wordpress.com/2012/07/27/data-profiling-task-in-ssis/ to see how to do the data analysis. Hope this helps.
I found a good tool for this purpose: DataCleaner. It seems to do most of the things I want to do with data in the EDA process.
Use edaSQL, an exploratory data analysis tool for SQL, which can help with data profiling and analysis:
https://pypi.org/project/edaSQL/
source code:
https://github.com/selva221724/edaSQL