Pentaho slow reporting - pentaho

have been working on a projet about data integration, analysing and reporting using Pentaho. So at last, I needed to do some reporting using Pentaho report designer, weekly. The problem that is our data is so big (about 4M/day), so the reporting platform was too slow and we can't do any other queries from tables in use, until we kill the process. Is there any solution to this ? A reporting tool or platform that we can use instead of Pentaho tool without having to change the whole thing and get from the first ETL steps.

I presume you mean 4M records/day.
That’s not particularly big for reporting, but the devil is in the details. You may need to tweak your db structure and analyze the various queries at play.
As you describe it, there isn’t much info to go with to give you a more detailed answer.

Related

BigQuery Testing, Debugging, and Design Patterns

We use BigQuery as the main data warehouse in our company.
We have gotten very efficient with SQL syntax and we write multi-page SQL queries with valid Syntax to analyze our data.
The main problem that we are struggling with are terrible logic mistakes in our queries. For example, it could be that a > should have been a >=, or that a join was treating NULL values the wrong way.
The effect is that we are getting wrong data out of BigQuery.
The logic within our data structure is so complicated ("what again was the definition of Customer Type ABC?") that it's terribly difficult to actually pull out anything useable. We estimate that up to 50% of analytics that we pull out of BigQuery are plain wrong.
Of course this is a problem that significantly hurts our bottom line and leads to wrong business decision. It has gotten so bad that we are craving for a normalized database structure that at least could be comprehended easier.
My hope is that maybe we are just missing certain design patterns to properly use BigQuery. However I find zero guidance about this online. The SQL we are using is so complex that I'm starting to think that although the Syntax is correct, SQL was not made for this. What we are doing feels like fitting a complex program into a single function, which in turn becomes untestable and a nightmare to work with.
I would appreciate any input and guidance
I can empathize here. I don't think your issue is unique, and there isn't one best practice. I can tell you what we have done to help with these same issues.
We are a small team of analysts, and only have a couple TB of data to crunch daily so your mileage will vary with these tips depending on your situation.
We use DBT - https://www.getdbt.com/. It has a free command line version, or you can pay for DBT cloud if you aren't confident with command line tools. It will help you go from Pages long SQL queries to smaller digestible chunks that are easier to maintain.
It helps with 3 main use cases for us.
database normalization/summarization - you can easily write queries, have them dependent on each other, have them scheduled to run at a certain time, while doing a lot of the more complex data engineering tasks for you. Such as making sure to run things in the right order, and that no circular references exist. This part of the tool helped us migrate away from pages long SQL queries to smaller digestible chunks that are useful in multiple applications.
documentation: there is a documentation site built in. So you can document a column and write out the definition of 'customer' easily.
Testing. We write loads of tests. We have a 100% accepted answer to certain metrics. Any time we need to reference this metric in other queries, or transform data to slice that metric by other dimensions, we write a test to make sure the new transformation matches back to the 100% accepted answer.
We have explored DBT, unfortunately we didn't have the bandwidth to support it at the company level. As an alternative we use airflow to build and maintain datasets in Bigquery. We use the BigQuery operators to interface with BQ through airflow. This helps us in the following ways:
Ability to build custom operators that can help with organizational level bells and whistles (integration with internal systems, data lifecycle management, lineage management etc.)
Ability to break down complex pieces of SQL into smaller manageable blocks that can be reused
Ability to incorporate testing in the process. You can build testing into your pipeline DAG or can build out separate DAGs of tests that can monitor your datasets and send out reports.
Ability to replay and recreate datasets
Ability to easily manage schema changes
I am sure there are other use cases where airflow helps, but these are some of the things that come to mind.

Alter data source

Power BI can connect to various data sources and run SELECT queries.
Is it possible to run also other queries (INSERT INTO, UPDATE...)?
Now I need it for a postgresql database, but could use also for others in the future.
No, you can't run directly INSERT/UPDATE queries from Power BI. This isn't the idea of the tool. If you find you need it, then probably there is a major flaw in your design, or you are not using the right tool for this job. But there are few ways to workaround this (again, I'm not saying that you SHOULD do it). Usually this is done in a combination with custom written Power App, embedded in your report in Power Apps visual. The idea is that the app will write to the database, and will refresh your report after that (if needed).
You can start here and I will recommend you to look at this in-depth session - Writing back data to PowerBI from your reports.
The answer is No if I am very straight forward. PBI is a analysis platform for data. There are probably some advance way to do that but, this is not logical or good idea to think about manipulating data from report or from any BI tools. You can search answers from different blog where the same questions asked. For more details, you can check below links-
help link 1
help link 2

Proper way to move data to a data warehouse

I am in the middle of a small project aimed to eventually create a data warehouse. I am currently moving data from a flat file system and two SQL Server databases. The project started in C# to automate the processing of data from the flat file system. Along with this, the project executes stored procedures to bring data from other databases. They are accessing the data from other databases using linked servers.
I am wondering if this is incorrect as even though it does get the job done, there may better approach? The other way I have thought about this is to use the app to pull data from each DB then push it to the data warehouse, but I am not sure about performance. Is there another way? Any path that I can look into is appreciated.
'proper' is a pretty relative term. I have seen a series of stored procedures, SSIS (microsoft), and third party tools. THey each have some advantages
Stored procedures
Using a job to schedule a series of stored procedures that insert rows from one server to the next works. I find sql developers more likely to take this path...it's flexible in design and good SQL programmers can accomplish nearly anything in here. That said, it is exceedingly difficult to support / troubleshoot / maintain / alter (especially if the initial developer(s) are no longer with the company). There is usually very poor error handling here
SSIS and other tools such as pentaho or data stage or ...google search it, theres a few.
This gives a more graphical design interface, although I've seen SSIS packages that simply called a stored procedures in order that may as well just been a job. These tools are really what you make of them. They give very easy to see work flows and are substantially robust when it comes to error handling and troubleshooting ability (trust me, every ETL process is going to have a few bad days and you'll be very happy for any logging you have to identify what you want). I find configuring a servers resources (multiple processors for example) is significantly easier with these tools. They all come with quite the learning curve though.
I find SQL developers are very much inclined to use the stored procedure route while people from a DBA background are generally more inclined to use the tools. If you're investing the time into it, the SSIS or equivlent tool is a better way to go from the future of your company standpoint, though takes a bit more to implement.
In choosing what to use you need to consider the following factors:
How much data are we talking about moving and how quickly does it need to be moved. There is s huge difference between using a linked server to move45,000 records and using it to move 100,000,000 records. Consider alo the expected growth of the data set to be moved over time. A process taht is fine in the early stages may chocke and die once you get more records. Tools like SSIS are much faster once you know how to use them which brings us to point 2.
How much development time do you have and what tools does the developer and the person who will maintain the import over time know? SSIS for instance is a complex tool, it can take a long time to feel comfortable with it.
How much data cleaning and transforming do you need to do? What kind of error trapping and exception processing do you need, what kind of logging will you need? The more complex the process, the more likely you will need to bite the bullet and learn an ETL specific tool.
Even there is a few answers, and I agree with two of them, I have to give my subjective opinion about the wider picture.
I am in the middle of a small project aimed to eventually create a data warehouse.
Question name perfectly suits to your question description. It could be very helpful to future readers. So, your project should create data warehouse. However it's small, learn to develop projects with scalability. Always!
In that point of view, search and study about how data warehouse project should look like. And develop each step.
Custom software vs Stored Procedures (Linked DBs) vs ETL
Custom software (in this case your C# project) should be used in two cases:
Medium scale projects where budget ETL cannot do everything
You're working for Enterprise level IT company, so developing your solution is cheaper and more manageable
And perhaps you think for tiny straight-forward projects. But NO, because those projects can grow and very quick outgrow your solution (new tables, new sources, changing ERP or CRM, ect).
If you're using just SQL Server, if you no need for data cleansing, if you no need for data profiling, if you no need for external data, Stored Procedures are OK. But, a lot of 'ifs' is here. And again, you're loosing scalability (your managment what's to add some data from Google Spreadsheet they internly use, KPI targets for example).
ETL tools are one native step in data warehouse development. In begining, there could be few table copy operation, or some SQL's, one source, one target. As far as your project is growing, you can adding new transformations.
SSIS is perhaps best as you're using SQL Server, but there is some good, free tools.

What is the best approach to archiving operational data?

I have a sql server 2012 database which is the backend to an asp.net MVC application, storing customer and order information. This database is accessed under high load and high usage.
I know have a requirement to be able to generate ad hoc reports from the database accessing the same data as the MVC application works with. I am concerned what impact this would have on the database server and the database itself, around locking etc. As such their is a distinction between the data, for the app its operational, but for the reports its more data warehouse oriented.
Therefore I am looking at my options as to the best approach to avoid such.
I am considering creating another database on a different server and archive the data to it using a sql job at regular intervals during the day. Only concern around this is that it would require maintenance and also a dependency to ensure any necessary changes are made to the target database when the source database changes.
What other options opened to me in such a situation and what advice could be given regarding such? What is the best approach to such?
You don't have to think of your own solution to keep the databases in sync. SQL Server has build in ways to achieve this.
Database Mirroring
Replication
Always On Availability Groups
If you're using Enterprise Edition of SQL Server 2012 then I would look into Always On Availability Groups if not then (Transactional) Replication. Both of these solutions can keep a second read-only and near real-time copy of the database.
As Steve McConell suggests you should make no assumptions about performance. You should just measure it before making any decisions. It is not a wise choice to make design choices without knowing the actual performance overhead. So I would suggest to measure, or simulate the performance overhead before even consider using a complex architecture, because you would not know if it's worth the trouble.
Anyway, I think that your approach is right. I would create a windows service which periodically retrieves the data I need from my database and stores them in my warehouse (the new database). I don't think you would ever find a tool keeping consistency between the two schemas, unless you want one schema to be an exact copy of the other.
I don't know your exact needs and perhaps my suggestion is an overkill but I would encourage you to consider using an OLAP approach in the data warehouse where your reporting data will come from. I have to warn you that these systems are oriented in really big data and advanced reporting needs but perhaps you can take some ideas from them. Since you are familiar with the Microsoft ecosystem, I would suggest using Business Intelligence Studio. You could there build an OLAP cube using your normal database as data source and integrate advanced reporting.
Hope I helped.

PostgreSQL performance monitoring tool

I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for some database performance optimization tool/technique.
Database optimization is usually a combination of two things
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the amount of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external - more efficient - datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit-counters) and doesn't need ACID-semantics you can also think about moving it out of the Postgres database to more efficient data stores.
Optimizing the query runtime is more tricky - this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model or changing the fundamental approach the application takes when it comes to working with the database. See for example the Pagination done the Postgres way talk by Markus Winand on how to rethink the concept of pagination to make it more database efficient
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgfouine which has already been mentioned earlier in this discussion, it has since been replaced by pgbadger which is written in a more friendly language, is much faster and more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query-logging enabled, which can cause a noticeable performance hit on the database or bring you into disk space troubles on top of the fact that parsing the log with the tool can take quite some time and won't give you up-to-date insights on what is going in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database - pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality - tracking how often a given "normalized query" (Query string minus all expression literals) has been run and how long it took in total. Due to the fact that this is done while the query is actually run this is done in a very efficient manner, the measurable overhead was less than 5% in synthetic benchmarks.
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There's been work on a third extension trying to address this fact and offer nicer representation of the data called pg_statsinfo (along with pg_stats_reporter), but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem I started working on a commercial project which is focussed around pg_stat_statements and pg_stat_plans and augments the information collected by lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres Monitoring area i also started compiling a list at the Postgres Wiki which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema more simple. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's only one of two programs I'm willing to use VMWare on my Mac to use.
Munin is quite simple yet effective to get trends of how the database is evolving and performing over time. In the standard kit of Munin you can among other thing monitor the size of the database, number of locks, number of connections, sequential scans, size of transaction log and long running queries.
Easy to setup and to get started with and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
Other than that, I'm as interested in the answers to this question as you are.
Check out Lightning Admin, it has a GUI for capturing log statements, not perfect but works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.