Hive on Accumulo recommended settings - hive

We use Hive (v. 1.2.1) to read Accumulo (v. 1.7.1) tables with SQL-like queries.
Are there any special settings we can configure in Hive (or elsewhere) to improve performance or stability?
Given that we use Hive this way, is there any point in trying things like Hive indexing or settings such as "hive.auto.convert.join", or do they work differently and have little effect in this case?
Thank you!

Obligatory: I wrote (most of) the AccumuloStorageHandler, but I am by no means a Hive expert.
The biggest gain you will probably find is when you can structure your query so that you prune the row-space (via a predicate in the WHERE clause over the :rowid-mapped column). To my knowledge, there isn't much (any?) query optimization that is pushed down into Accumulo itself.
Depending on your workload, you could use Hive to generate your own "index tables" in Accumulo. If you can make a custom table that has the column you want to actively query stored in the Accumulo row, your queries should run much faster.
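For illustration, here is a minimal HiveQL sketch of such an index table, assuming the hive-accumulo-handler and the Accumulo connection properties (instance, zookeepers, user, password) are already configured in the session; the table and column names are made up.
  -- The column you want to query actively (category) is mapped to the Accumulo row;
  -- the remaining columns live in column family f.
  CREATE TABLE products_by_category (
    category   STRING,
    product_id STRING,
    price      DOUBLE
  )
  STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
  WITH SERDEPROPERTIES ('accumulo.columns.mapping' = ':rowid,f:product_id,f:price')
  TBLPROPERTIES ('accumulo.table.name' = 'products_by_category');

  -- Populate it from the existing Accumulo-backed table (here called products):
  INSERT OVERWRITE TABLE products_by_category
  SELECT category, product_id, price FROM products;

  -- A predicate over the :rowid-mapped column lets the handler restrict the
  -- Accumulo ranges it scans instead of reading the whole table:
  SELECT product_id, price FROM products_by_category WHERE category = 'books';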

Related

Tableau takes forever to use a PostgreSQL view

I am trying to connect Tableau to a SQL view I made in PostgreSQL.
This view returns ~80k rows with 12 fields. On my local PostgreSQL database, it takes 7 seconds to execute. But when I try to create a chart in a worksheet using this view, it takes forever to display anything (more than 2 minutes just to add a field).
The view is complex and involves many joins, coalesces, and case expressions due to business specificities.
Do you guys have any ideas for improving this?
Thank you very much for your help ! :-)
Best,
Max
The Tableau documentation has helpful info on performance optimization:
https://help.tableau.com/current/pro/desktop/en-us/performance_tips.htm
I highly recommend the whitepaper on designing efficient dashboards mentioned on that site; it's a bit dated, but the advice is timeless.
For starters, learn to use the Performance Recorder in Tableau to find out what tasks are causing delays, and if they involve queries, to capture the SQL that Tableau emits.
With Tableau, and many other client tools, the standard first approach is to see what SQL the client tool generates, then execute that SQL without using the client tool, say just in psql in your case. If you can reproduce the slow query just in SQL, then you are better positioned to either
Optimize your database, say with indices or by restructuring your schema, OR
Understand why your client tool, Tableau in this case, generated that inefficient query, and reason about what you could do differently in Tableau so that it generates different SQL
The whitepaper I mentioned should be helpful
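For illustration — the view, table, and column names below are made up — that workflow in psql might look like:
  -- Run the SQL captured by the Performance Recorder under EXPLAIN:
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM my_reporting_view WHERE order_date >= DATE '2020-01-01';

  -- If the plan shows a sequential scan on a large underlying table that the
  -- view filters or joins on, an index on that column may help:
  CREATE INDEX orders_order_date_idx ON orders (order_date);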

Custom SQL Query in Tableau

Does using 'custom SQL' instead of joins in Tableau increase the performance of extract refresh on the server? Can someone explain it briefly?
The answer to almost every performance question is first, "it depends" and second, test and understand the measurement results. Real results carry more weight than advice from anyone on the Internet (from me or anyone else)
Still, custom SQL is usually not helpful for increasing performance in Tableau, and often hurts. It is usually much better to define your relationships in Tableau and let Tableau then generate optimized SQL for each view -- just as you let a compiler generate optimized machine code.
When you use custom SQL, you prevent Tableau from optimizing the SQL it generates. It has to run the SQL you provide in a subquery.
The best use case for custom SQL in Tableau is to access database specific features, or possibly windowing queries. Most other SQL functionality is available by using the corresponding Tableau features.
If you do have a complex slow custom SQL query that you must use, it is usually a good idea to make an extract so you only pay the performance cost during extract refresh.
So in your case, I'd focus effort on streamlining or eliminating the custom SQL, monitoring the query plan for the generated SQL, and indexing your database to best support that query.
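To make the subquery point concrete, here is an illustrative, not literal, comparison; the exact SQL Tableau emits depends on your data source and version, and the table names are invented.
  -- Built on a table or relationship, Tableau can emit a query shaped like:
  SELECT region, SUM(sales) AS sum_sales
  FROM orders
  GROUP BY region;

  -- Built on Custom SQL, your query is treated as an opaque derived table:
  SELECT t.region, SUM(t.sales) AS sum_sales
  FROM (
    -- the Custom SQL you typed into Tableau
    SELECT o.region, o.sales, c.segment
    FROM orders o JOIN customers c ON c.id = o.customer_id
  ) t
  GROUP BY t.region;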

Tuning OBIEE generated SELECT queries

We have our data marts/warehouse on Oracle 11g implemented as a star schema. Business reports are designed using OBIEE. I come from an ETL background and have very little knowledge of OBIEE.
Once the OBIEE RPD is designed, I see that OBIEE starts generating SELECT queries in the background to feed data into the reports. On many occasions, I have noticed that the SELECT queries are not optimized (big fact table is fully scanned more than once in separate WITH clauses).
When report performance is bad, the OBIEE queries are sent to the ETL team for performance tuning. I'm confused about how I can tune them because they are auto-generated. I know there is an option to write custom SQL in OBIEE (without going via the RPD) for each report, but our standards do not allow that, and I also think it does not leverage the benefits of OBIEE.
Has anyone faced a problem like above? How to tune such queries?
Firstly, you're right that custom SQL (known as direct database query) is not a good idea in principle, though it is useful on occasion. But it's not the solution to your problem.
Tuning the OBI queries generated is an OBI RPD task, for the OBI developer; tuning the database for the OBI queries generated is a database/ETL task. But you can't really do one without the other – OBI needs to be designed so as to generate suitable queries, and the database needs to be designed in such a way that suitable good queries can be generated to answer the question being asked.
OBI is basically a SQL generator, and if the RPD model is suboptimal, then the resulting query will be suboptimal. OBI will generate SQL based on the information it has in the RPD about the layout and structure of the data and database.
You're obviously coming at it from the database side, and so to you the SQL is bad because it isn't what you'd write. It's also possible that the database design is bad for getting an answer to the question that OBI is being asked.
As jackohug says, OBIEE is a SQL generator, and the general approach is to optimize the environment for the queries OBIEE generates rather than try to change those queries. Depending on the performance problem, there are a few tricks you can try.
First of all, is your table partitioned, and can your reports benefit from the partitioning?
Second, add indexes on the fact table so that filters on the dimensions speed up access to the fact table.
Third, build aggregate tables that summarize the fact table, so that reports which don't show much detail first hit the aggregate table with far less data; only as users drill down through the structure (applying filters to the data they are interested in as they go) do they reach the detailed fact table, and by then the filters avoid full scans (the second and third ideas are sketched below).
You could also tell OBIEE to use hints when accessing the tables, although, as with Direct Database Query, I wouldn't recommend it; I would first try optimizing with the three approaches above.
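A rough Oracle sketch of the second and third ideas (all names invented; in OBIEE the aggregate table would then be mapped in as an additional logical table source, or built with the Aggregate Persistence wizard):
  -- Bitmap index on a fact foreign key so dimension filters drive the access path:
  CREATE BITMAP INDEX fact_sales_date_bix ON fact_sales (date_key);

  -- Aggregate (summary) table at a coarser grain than the fact table:
  CREATE TABLE fact_sales_month AS
  SELECT d.month_key, f.product_key, SUM(f.amount) AS amount
  FROM   fact_sales f
  JOIN   dim_date d ON d.date_key = f.date_key
  GROUP  BY d.month_key, f.product_key;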
Regards
If you have Diagnostics and Tuning Pack licenses, you can run the SQL Tuning Advisor. The SQL Tuning Advisor runs the optimizer in tuning mode, and it may be able to generate a SQL Profile with a better execution plan. Sometimes the advisor recommends indexes for tuning as well. Neither SQL Profiles nor indexes require a change to the application.
I've yet to have much success with the SQL tuning advisor. Some experience in SQL tuning and a bit of research can typically produce a far better plan.
If all the layers are built well and all you need is a final tweak then add a hidden column to the start of the report (Answer/Analysis) with a SQL hint.
I'd be very careful about adding hints through the RPD layers because of the many different and unexpected ways that others will join and use the tables.
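Purely for illustration — table, column, and index names are made up — a hint injected that way ends up in the generated SQL looking roughly like:
  SELECT /*+ INDEX(f fact_sales_date_bix) */
         d.month_name, SUM(f.amount)
  FROM   fact_sales f
  JOIN   dim_date d ON d.date_key = f.date_key
  WHERE  d.year_num = 2015
  GROUP  BY d.month_name;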

Hive with Lucene

Is it possible to use Hive to query a Lucene index that is distributed over Hadoop?
Hadapt is a startup whose software bridges Hadoop with a SQL front-end (like Hive) and hybrid storage engines. They offer an archival text search capability that may meet your needs.
Disclaimer: I work for Hadapt.
As far as I know you can essentially write custom "row-extraction" code in Hive so I would guess that you could. I've never used Lucene and barely used Hive, so I can't be sure. If you find a more conclusive answer to your question, please post it!
I know this is a fairly old post, but thought I could offer a better alternative.
In your case, instead of going through the hassle of mapping your HDFS Lucene index to a Hive schema, it's better to push the data into Pig, because Pig can read flat files. Unless you want a relational way of storing your data, you could process it through Pig and use HBase as your DB.
You could write a custom input format for Hive to access a Lucene index in Hadoop.
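The InputFormat itself would have to be written in Java against the Lucene and Hadoop APIs; the Hive side is just DDL. A sketch, with a hypothetical class name for the input format:
  CREATE EXTERNAL TABLE lucene_docs (doc_id STRING, body STRING)
  STORED AS
    INPUTFORMAT  'com.example.hive.LuceneIndexInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/data/lucene/indexes';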

PostgreSQL performance monitoring tool

I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for some database performance optimization tool/technique.
Database optimization is usually a combination of two things
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the number of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external, more efficient datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit counters) and doesn't need ACID semantics, you can also think about moving it out of the Postgres database to a more efficient data store.
Optimizing the query runtime is more tricky - this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model or changing the fundamental approach the application takes when it comes to working with the database. See for example the Pagination done the Postgres way talk by Markus Winand on how to rethink the concept of pagination to make it more database efficient
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries, including their runtime, and then parsing the query log. A good tool for this is pgfouine, which has already been mentioned earlier in this discussion; it has since been replaced by pgbadger, which is written in a friendlier language, is much faster, and is more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query logging enabled, which can cause a noticeable performance hit on the database or get you into disk-space trouble, on top of the fact that parsing the log with the tool can take quite some time and won't give you up-to-date insight into what is going on in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database: pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality: tracking how often a given "normalized query" (the query string minus all expression literals) has been run and how long it took in total. Because this is done while the query actually runs, it is very efficient; the measurable overhead was less than 5% in synthetic benchmarks.
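For example, with pg_stat_statements enabled (shared_preload_libraries = 'pg_stat_statements' in postgresql.conf, then CREATE EXTENSION in the database), the heaviest queries are one SELECT away; note that the timing column is named total_exec_time in newer PostgreSQL releases:
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

  -- Top 10 normalized queries by total time spent:
  SELECT calls,
         round(total_time::numeric, 1)           AS total_ms,
         round((total_time / calls)::numeric, 1) AS avg_ms,
         query
  FROM   pg_stat_statements
  ORDER  BY total_time DESC
  LIMIT  10;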
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There has been work on a third extension, pg_statsinfo (along with pg_stats_reporter), that tries to address this and offer a nicer representation of the data, but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem, I started working on a commercial project which is focused on pg_stat_statements and pg_stat_plans and augments the collected information with lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres monitoring area, I also started compiling a list at the Postgres Wiki which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember correctly, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema simpler. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's one of only two programs I'm willing to run VMware on my Mac for.
Munin is quite simple yet effective for seeing trends in how the database is evolving and performing over time. With the standard Munin kit you can, among other things, monitor the size of the database, the number of locks, the number of connections, sequential scans, the size of the transaction log, and long-running queries.
It is easy to set up and get started with, and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is to try all your queries from psql using EXPLAIN and see if there are sequential scans that can be converted to index scans by adding indexes or by rewriting the query.
Other than that, I'm as interested in the answers to this question as you are.
Check out Lightning Admin, it has a GUI for capturing log statements, not perfect but works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.