From the docs:
Normal JOIN operations require that the right-side table contains less
than 8 MB of compressed data. The EACH modifier is a hint that informs
the query execution engine that the JOIN might reference two large
tables. The EACH modifier can't be used in CROSS JOIN clauses.
When possible, use JOIN without the EACH modifier for best
performance. Use JOIN EACH when table sizes are too large for JOIN.
Why isn't that automatic?
Is there a way to simplify this? Can I just always use JOIN EACH, or always use JOIN? (It seems I can't always use JOIN because of the 8 MB limitation quoted above.)
BigQuery parallelizes the processing of information across many servers, which pass condensed information on to further servers in a tree topology. Everything ends up in a root node, and some of the BigQuery limitations come from this bottleneck: you can read "unlimited" amounts of data, but the output of a query has to fit into a single server.
(more details in the 2010 Dremel paper http://research.google.com/pubs/pub36632.html)
To overcome this limitation the EACH keyword was introduced: It forces a shuffle at the starting level, allowing the parallelization of tasks - without requiring a single output node - and allowing JOINing tables of unlimited size. This approach has some drawbacks, like losing the ability to ORDER BY the end result, as no single node will have visibility onto the whole output.
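For reference, a minimal legacy BigQuery SQL sketch of the keyword; the dataset and table names are placeholders:

SELECT a.user_id, b.event
FROM [mydataset.big_table_a] a
JOIN EACH [mydataset.big_table_b] b
  ON a.user_id = b.user_id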
Would it be possible for BigQuery to detect when to use EACH automatically? Ideally, but for now the EACH keyword allows you to complete previously impossible operations - with the drawback of requiring your awareness of it.
Related
In my corporate project, I need to cross join a dataset of over a billion rows with another of about a million rows using Spark SQL. Since a cross join was used, I decided to divide the first dataset into several parts (each having about 250 million rows) and cross join each part with the million-row one. I then made use of UNION ALL.
Now I need to improve the performance of the join processes. I heard it can be done by partitioning the data and distributing the work to Spark workers. My questions are: how can effective performance be achieved with partitioning, and what are the other ways to do this without using partitioning?
Edit: filtering already included.
Well, in all scenarios, you will end up with tons of data. Be careful, and try to avoid Cartesian joins on big data sets as much as possible, as they usually end with OOM exceptions.
Yes, partitioning can be the way to help you, because you need to distribute your workload from one node to more nodes or even to the whole cluster. The default partitioning mechanism is a hash of the key, or the original partitioning key from the source (Spark takes this from the source directly). You first need to evaluate what your partitioning key is right now; afterwards you can perhaps find a better partitioning key/mechanism and repartition the data, thereby distributing the load. The join must still be done either way, but it will be done with more parallel sources.
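In Spark SQL, one way to act on this is to pre-distribute both sides on the join key with DISTRIBUTE BY, so that matching rows land in the same partitions before the join. A hedged sketch; big_table, small_table, and join_key are hypothetical names:

SELECT b.*, s.*
FROM (SELECT * FROM big_table DISTRIBUTE BY join_key) b
JOIN (SELECT * FROM small_table DISTRIBUTE BY join_key) s
  ON b.join_key = s.join_key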
There should be some filters on your join query. You can use the filter attributes as a key to partition the data and then join based on the partitioned data.
Consider this case where we have 2 tables. The requirement is to implement a function to select the top 10 records (ordered by some rules) from TABLE_A and TABLE_B, where table_a.id == table_b.a_id == X. There are two options:
Using a JOIN in the SQL query;
Making 2 selection queries against the db: SELECT * FROM table_a WHERE id = X and SELECT * FROM table_b WHERE a_id = X, fetching 10 records from each query (let's assume the ordering is correct in this case) into memory, then joining them in the code (using a for loop and a hashtable, or something like that).
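For reference, Option 1 as a single query might look like this; the ORDER BY column is a placeholder, since the ordering rules aren't specified:

SELECT a.*, b.*
FROM table_a a
JOIN table_b b ON b.a_id = a.id
WHERE a.id = X
ORDER BY b.some_rule_column   -- placeholder for "some rules"
LIMIT 10;                     -- LIMIT/TOP syntax varies by database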
I've heard that JOIN might lower system performance (it said "db performance" here originally, but that was wrong; see the follow-up below for reference). Besides, in this case we only query for 10 results at most, so it is acceptable to load them all into memory and join them there.
My question is, is there a general guideline in the industry, to say under what circumstances would we recommend using JOIN in database layer instead of doing it in memory, and when to do the opposite?
============
Follow up:
So here are some reasons/scenarios I've read for "moving JOIN from the database layer to the service layer":
If we are joining multiple tables, they will all be locked at once. And if the operation takes time and the service requires low response time, it might block other executions;
Hard to maintain in a big system. Changes to the tables involved in the JOIN might break the query.
There might be some historical reasons in those complicated systems: data might be migrated/created in different dbs (or db systems, say one table in DynamoDB and the other one in Postgres), which makes a JOIN in the database layer impossible.
To answer simply, it depends.
Generally, it is preferable to do data operations closer to the data, instead of bringing the data up into higher layers to operate on it there. You can see many PL/SQL-based implementations where operations are done close to the data. Languages like PL/SQL (Oracle) or T-SQL (SQL Server) are designed for complex data operations.
But if you have an application that brings data from disparate systems and has to join between them, you have to do it in memory.
If we are joining multiple tables, they will all be locked at once. And if the operation takes time and the service requires low response time, it might block other executions;
Readers do not block other readers: they take what is called a shared lock, and once the read operation is over the shared lock is released. As @TimBiegeleisen mentioned, you can create indexes to speed up read operations, based on the need. It is always preferable to read only the needed columns (projection) and the needed rows (filtering).
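For example, an index to support the lookup in this question (the index name is arbitrary):

CREATE INDEX ix_table_b_a_id ON table_b (a_id);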
Hard to maintain in a big system. Changes to the tables involved in the JOIN might break the query.
As long as you are selecting only the needed columns, instead of SELECT *, you should not have issues. If many changes are coming, you can consider creating a SCHEMABINDING view, so that schema changes to the underlying tables are blocked while the view exists.
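A minimal SQL Server sketch of a schema-bound view; dbo.table_a, dbo.table_b, and the column names are hypothetical:

CREATE VIEW dbo.v_a_with_b
WITH SCHEMABINDING
AS
SELECT a.id, a.name, b.a_id, b.score
FROM dbo.table_a AS a
INNER JOIN dbo.table_b AS b
    ON b.a_id = a.id;

While this view exists, attempts to drop or alter the referenced columns of the underlying tables will fail, which surfaces breaking changes early.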
There might be some historical reasons in those complicated systems: data might be migrated/created in different dbs (or db systems, say one table in DynamoDB and the other one in Postgres), which makes a JOIN in the database layer impossible.
Design the application for the current need. Don't assume that something like that will happen in the future and design for it, compromising current application performance. If there is a definite future need, go for in-memory operations. Otherwise, it is better to go for database JOINs.
I have a project where I have very complex queries (Legacy project)
There are a lot of queries, stored procedures, etc. in the project.
Queries have anywhere from 10-30 joins and slow filtering.
The project cannot be modified right away (it will take at least a year's worth of work).
Is there a hardware way to increase performance? Can I use some smart Azure setup with increased computing power to increase speed?
Or what things do I need to look for in a physical server?
Avoid Multiple Joins in a Single Query.
Eliminate Cursors from the Query.
Avoid Use of Non-correlated Scalar Subqueries.
Avoid Multi-statement Table-Valued Functions (TVFs).
Creation and Use of Indexes.
Understand the Data.
Create a Highly Selective Index.
Filter the data of each huge table separately into a hash (temp) table, then join the hash tables rather than the actual tables in the stored procedure to combine the result (see the sketch below).
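A minimal T-SQL sketch of that temp-table pre-filtering idea; the table and column names (dbo.orders, dbo.customers, order_date) are hypothetical:

SELECT order_id, customer_id, amount
INTO #orders_recent                    -- filter the huge table once, up front
FROM dbo.orders
WHERE order_date >= '2024-01-01';

SELECT c.customer_id, c.name, o.amount
FROM dbo.customers AS c
INNER JOIN #orders_recent AS o         -- join the small filtered set instead
    ON o.customer_id = c.customer_id;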
Hardware approaches:
More RAM on motherboard and disk controller.
Additional processors (may require a different SQL Server license).
Faster storage devices.
If data is on an external SAN device, consider switching to a device with a faster connection type (Fibre Channel vs iSCSI, ATA over Ethernet (AoE), or HyperSCSI).
You can scale your performance with Azure using the Standard and Premium tiers. So if you have slow queries you can always throw more hardware at them, with the benefit of doing so only when needed. You can set your database to scale automatically, so when your demand is low you pay less, and when it's high your workload doesn't suffer.
Azure provides Query Performance Insight (QPI), which basically identifies which of your queries are the most expensive ones so you can optimize them first.
Azure also provides different advisors, like the index advisor, which will learn how you use your database and recommend indexes to be created or dropped. The best thing about this is that it can do it automatically for you.
If you're thinking of an on-prem solution, you should consider the operating system cost and hardware cost as well, plus the time and cost of setting everything up and configuring it properly. Creating geo-replicated setups can add another level of complexity. So if you need to start fresh and your business requirements allow cloud services, I'd say Azure is the way to go, since it provides rich telemetry and all kinds of smart database capabilities (and more are coming month after month). Also, don't forget that Azure is updated roughly every month, while boxed editions get cumulative update packages only after half a year or more.
Hardware is very important: the better, the faster.
But you also need to check your queries from a performance perspective.
For example:
Check for missing indexes with actual execution plans; if any are reported, you should add them.
Learn how to read execution plans and STATISTICS output. CPU cost is important, and table scans are deadly :) avoid them.
Rebuild indexes frequently.
If you are using MS SQL, you need the "NOLOCK" hint after tables; otherwise you lock your table while reading/selecting.
When you are joining tables, try to add your conditions in the JOIN, not in the WHERE clause.
For example:
SELECT * FROM TABLE_A A WITH (NOLOCK)
INNER JOIN TABLE_B B WITH (NOLOCK)
ON A.ID = B.ID
WHERE B.SOMECOLUMN IS NOT NULL

SELECT * FROM TABLE_A A WITH (NOLOCK)
INNER JOIN TABLE_B B WITH (NOLOCK)
ON A.ID = B.ID AND B.SOMECOLUMN IS NOT NULL
The second one is better.
Avoid ORDER BY and DISTINCT if they are not necessary.
:)
Sometimes I'm going nuts over different query execution plans in development, integration test, and production systems. Apart from the usual analysis I run, I just want to know:
Can some of the query optimiser's transformation operations be deactivated on a system level (just as they can be deactivated on a per-query level using hints)?
In this case, I'd expect a UNION ALL PUSHED PREDICATE operation for a query looking roughly like this:
SELECT ...
FROM (SELECT ... FROM A
UNION ALL
SELECT ... FROM B)
WHERE X = :B1
A and B are views, both selecting from the same tables containing X, where X is the primary key. It is important that the selection on X is pushed into both views A and B before fetching all of A and B's data. And it is also possible, because no complex transformations are required.
So apart from deactivated indexes, bad statistics, bind variable peeking issues, and all the other usual suspects, is there a possibility that the whole Oracle instance just can't do one or two transformations because they're switched off?
Yes. Various and sundry initialization parameters control query transformation and optimization, and a significant number of them aren't documented.
The following query shows all the undocumented parameters, at least for 10g:
SELECT a.ksppinm "Parameter",
       b.ksppstvl "Session Value",
       c.ksppstvl "Instance Value"
FROM x$ksppi a
INNER JOIN x$ksppcv b
        ON a.indx = b.indx
INNER JOIN x$ksppsv c
        ON a.indx = c.indx
WHERE a.ksppinm LIKE '/_%' ESCAPE '/'
/
Similarly, setting event 10053 will generate an optimization trace file, which will show what parameters (documented or otherwise) affected the generation of the query plan.
If you want to have stable execution plans across different instances, you may be able to achieve this by exporting the optimizer statistics from the reference system and importing them into the others.
Examples can be found in the manual.
You might also want to lock the statistics in the target environments after the import so that they are not changed.
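A minimal DBMS_STATS sketch of this flow; the schema name SCOTT and the statistics table name STATS_TAB are placeholders:

-- On the reference system:
EXEC DBMS_STATS.CREATE_STAT_TABLE('SCOTT', 'STATS_TAB');
EXEC DBMS_STATS.EXPORT_SCHEMA_STATS('SCOTT', 'STATS_TAB');

-- Move STATS_TAB to the target (e.g. with export/import or Data Pump), then:
EXEC DBMS_STATS.IMPORT_SCHEMA_STATS('SCOTT', 'STATS_TAB');
EXEC DBMS_STATS.LOCK_SCHEMA_STATS('SCOTT');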
There are a number of database initialization parameters that can enable or disable various optimizer options and different query transformations. So if you have different initialization parameters set in different environments, you can definitely end up in a situation where one environment can do a particular transform and another cannot despite having identical data structures and statistics.
In the case of this particular query, my mind goes immediately to the OPTIMIZER_SECURE_VIEW_MERGING parameter. That definitely has the potential to cause problems for this particular type of construct.
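If OPTIMIZER_SECURE_VIEW_MERGING does turn out to differ between environments, it can be inspected and, where appropriate, changed instance-wide. A sketch (test before touching production):

SHOW PARAMETER optimizer_secure_view_merging

ALTER SYSTEM SET optimizer_secure_view_merging = FALSE;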
I would like to know if there is really a performance gain between these two options:
Option 1 :
I do a SQL query with a join to select all Users and their Ranks.
Option 2 :
I do one SQL query to select all Users.
I iterate over the users and do another SQL query to get the Ranks of each User.
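To make the two options concrete, a sketch with hypothetical users and ranks tables:

-- Option 1: one query with a join
SELECT u.id, u.name, r.rank_name
FROM users u
JOIN ranks r ON r.user_id = u.id;

-- Option 2: one query for the users, then one query per user
SELECT id, name FROM users;
SELECT rank_name FROM ranks WHERE user_id = ?;  -- repeated for each user row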
In code, option two is easier for me to implement, but only because of the way I designed my persistence layer.
So, I would like to know the impact on performance. Beyond what limit should I consider taking Option 1 instead of Option 2?
Generally speaking, the DB server is always faster at joining than application code. Remember you will have to do an extra query with a network round trip for each join. However, if your first result set is small and your indexes are well tuned, this model can work fine.
If you are only doing this to re-use your ORM solution, then you may be fighting a losing battle. I have invariably found that I need read-only datasets that can only be produced with SQL, so I now use ORM for per-object CRUD operations and regular SQL for searches, reports, aggregates etc.
If ranks are static values, consider caching them in your application.
If you need users frequently and ranks only rarely, consider lazy-loading of ranks. (e.g., separate queries, but the second query gets used only occasionally).
Use the join if you're always going to need both sets of data, and they have to be current copies of the database.
Prototype any likely choices, and run performance tests.
EDIT: Further thoughts on your persistence layer, because I'm facing this one myself. Consider adding "persistence-like" classes that handle joins as their basic query, and are read-only. Whether this fits your particular scenario is for you to decide, but a lot of database access for many apps is based on joins, which can be rather large and complex. If you can handle these in a consistent manner with your persistent, updatable objects, it can be a big win for your overall architecture. Conceptually, it's a lot like having a view in the database, and querying the view instead of writing a join, but you're doing it all in code.
It depends upon how many users you anticipate. Option one will definitely be faster, but with a reasonable amount of data, the difference will be negligible.
In 99% of situations a join will be faster.
However, there is one rare situation when it can be slower: if you are doing a one-to-many join on a table with a large row size and you are hitting the network bandwidth limit.
For example, suppose there is a 1 MB blob column in T1, and you join to T2, which contains 100 rows for each T1 row. The result set would be the T1 row count multiplied by 100.
So if you query one T1 row with the join, you get roughly a 100 MB result set; if you instead fetch the T1 row (1 MB) and then do a separate select to fetch the 100 T2 rows for that T1 row, the result set will be about 1 MB.
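A sketch of the two access patterns, with hypothetical tables T1 (id, big_blob) and T2 (t1_id, detail):

-- Join: big_blob is repeated on each of the ~100 matching T2 rows,
-- so roughly 100 MB crosses the network for a single T1 row.
SELECT t1.id, t1.big_blob, t2.detail
FROM t1
JOIN t2 ON t2.t1_id = t1.id
WHERE t1.id = 42;

-- Two queries: the blob is transferred only once (~1 MB total).
SELECT id, big_blob FROM t1 WHERE id = 42;
SELECT detail FROM t2 WHERE t1_id = 42;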