Why Query_parallelism affects the result of a join between two UUID columns - ignite

I'm running the following test on ignite 2.10.0
I have 2 tables created with a query_parallelism=1 and without affinity key.
When I join the 2 following tables I have the result as expected.
0: jdbc:ignite:thin://localhost:10800> SELECT "id" AS "_A_id", "source_id" AS "_A_source_id" FROM PUBLIC."source_ml_blue";
+--------------------------------------+--------------------------------------+
| _A_id | _A_source_id |
+--------------------------------------+--------------------------------------+
| 86c068cd-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 |
+--------------------------------------+--------------------------------------+
1 row selected (0.004 seconds)
0: jdbc:ignite:thin://localhost:10800> SELECT "id" AS "_B_id", "flx_src_ip_text" AS "_B_src_ip" FROM PUBLIC."source_nprobe_tcp_blue";
+--------------------------------------+-----------+
| _B_id | _B_src_ip |
+--------------------------------------+-----------+
| 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 1.1.1.1 |
+--------------------------------------+-----------+
1 row selected (0.003 seconds)
0: jdbc:ignite:thin://localhost:10800> SELECT _A."id" AS "_A_id", _A."source_id" AS "_A_source_id", _B."id" AS "_B_id", _B."flx_src_ip_text" AS "_B_src_ip" FROM PUBLIC."source_ml_blue" AS "_A" INNER JOIN PUBLIC."source_nprobe_tcp_blue" AS "_B" ON "_A"."source_id"="_B"."id";
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
| _A_id | _A_source_id | _B_id | _B_src_ip |
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
| 86c068cd-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 1.1.1.1 |
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
1 row selected (0.005 seconds)
If I delete and create the same tables with a query_parallelism = 8, I do not have a SQL error (the parallelism is equal on the 2 tables) BUT the result of the join is empty.
any idea why I get this behavior ?

You observe this behaviour because of optimisations for parallel query execution. Most likely your records landed to different partitions (handled by a different thread). If you increase the number of records in both tables you will see a subset of this join as a result.
The most elegant option here is to let "_A"."source_id" and "_B"."id" be affinity keys. Most likely ignite.jdbc.distributedJoins is going to affect performance for clustered installation. Affinity collocation will make items with matching "_A"."source_id" and "_B"."id" reside in the same partition to avoid cross-partitional interaction (for clustered environments it would lead to additional networks hops).

The problem comes from the SQL client : it has to be aware of the parallelism.
On DBeaver, I had to enable ignite.jdbc.distributedJoins in the connection properties to make the request works properly.

Related

Is comparing two tables faster by importing them into a sql database or by using jdbc?

Background
I need to compare two tables in two different datacenters to make sure they're the same. The tables can be hundreds of millions, even a billion lines.
An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.
The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.
Here's an example input and output:
Input
table1:
Name | Age |
Alice| 25.0|
Bob | 49 |
Jim | 45 |
Cal | 52 |
table2:
Name | Age |
Bob | 49 |
Cal | 42 |
Alice| 25 |
Output:
table1 missing rows (empty):
Name | Age |
| |
table2 missing rows:
Name | Age |
Jim | 45 |
mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25 | table2|
Cal | 52 | table1|
Cal | 42 | table2|
Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.
Question
Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?
OR
Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?
Edits:
The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose for creating a new database would be to compare these tables using SQL, there are no other uses.
You should definitively start with the database option. Especially if the databases are connected with a database link you can easy set up the transfer of the data.
Such comparison often leads to a full outer join of the two sources and the experience tell us that DIY joins are notorically less performant that the native database implementation (you can deploy for example a parallel option).
Anyway you may try to implement some sofisticated algoritm that can make the compare without the necessity to transfer the whole table.
An example is based on the Merkle Trees where you first scan both source in their location to recognise which parts are identical (that can be ignored) and transfer and compare only the party with a difference.
So if you expect the tables are nearly identical and have keys that allows some hierarchy such approach could end better than a brute force full compare.
The faster solution is to load both tables to variables (memory) in your programing language and then compare them with your favorite algorithm.
Copy them first to a new table is the more than the double of time in read/write operations to disk, especially the write ones.

Index is not being used by optimizer

I have a query which is performing very badly due to full scan of a table.I have checked the statistics rebuild the indexes but its not working.
SQL Statement:
select distinct NA_DIR_EMAIL d, NA_DIR_EMAIL r
from gcr_items , gcr_deals
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
order by 1
Execution Plan :
Plan hash value: 3180018891
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 00:11:42 |
| 1 | SORT ORDER BY | | 8 | 00:11:42 |
| 2 | HASH UNIQUE | | 8 | 00:11:42 |
|* 3 | HASH JOIN | | 7385 | 00:11:42 |
|* 4 | VIEW | index$_join$_002 | 10462 | 00:00:05 |
|* 5 | HASH JOIN | | | |
|* 6 | INDEX RANGE SCAN | GCR_DEALS_IDX12 | 10462 | 00:00:01 |
| 7 | INDEX FAST FULL SCAN| GCR_DEALS_IDX1 | 10462 | 00:00:06 |
|* 8 | TABLE ACCESS FULL | GCR_ITEMS | 7386 | 00:11:37 |
-------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("GCR_DEALS"."GCR_DEALS_ID"="GCR_ITEMS"."GCR_DEALS_ID")
4 - filter("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
5 - access(ROWID=ROWID)
6 - access("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
8 - filter(DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER("NA_ORG_OWNER_EMAI
L")))=DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER(:P55_DIRECT))))
In the beginning a part of the condition in the WHERE clause must be decomposed (or "decompiled" - or "reengeenered") into a simpler form without using decode function, which a form can be understandable by the query optimizer:
AND
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
into:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trimm(upper(NA_ORG_OWNER_EMAIL))
)
To find rows in the table based on values stored in the index, Oracle uses an access method named Index scan, see this link for details:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i52300
One of the most common access method is Index Range Scan see here:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i45075
The documentation says (in the latter link) that:
The optimizer uses a range scan when it finds one or more leading
columns of an index specified in conditions, such as the following:
col1 = :b1
col1 < :b1
col1 > :b1
AND combination of the preceding conditions for leading columns in the
index
col1 like 'ASD%' wild-card searches should not be in a leading
position otherwise the condition col1 like '%ASD' does not result in a
range scan.
The above means that the optimizer is able to use the index to find rows only for query conditions that contain basic comparision operators: = < > <= >= LIKE which are used to comparing simple values with plain column names. What the documentation doesn't clearly say - and you need to deduce it reading between the lines - is a fact that when some function is used in the condition, in a form function( column_name ) or function( expression_involving_column_names ) , then the index range scan cannot be used. In this case the query optimizer must evaluate this expression individually for each row in the table, thus must read all rows (perform a full table scan).
A short conclusion and a rule of thumb:
Functions in the WHERE clause can prevent the optimizer from using
indexes
If you see some function somewhere in the WHERE clause, then it is a sign that you are
running the red light
STOP immediately and think three times how
this function impact the query optimizer and the performance of your
query, and try to rewrite the condition to a form that the optimizer
is able to understand.
Now take a look at our rewritten condition:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trimm(upper(NA_ORG_OWNER_EMAIL))
)
and STOP - there are still two functions trim and upper applied to a column named NA_ORG_OWNER_EMAIL. We need to think how they can impact the query optimizer.
I assume that you have created a plain index on a single column: CREATE INDEX somename ON GCR_ITEMS( NA_ORG_OWNER_EMAIL ).If yes, then the index contains only plain values of NA_ORG_OWNER_EMAIL.
But the query is trying to find trimm(upper(NA_ORG_OWNER_EMAIL)) values, which are not stored in the index, so this index cannot be used in this case.
This condition requires a function based index:
https://docs.oracle.com/cd/E11882_01/appdev.112/e41502/adfns_indexes.htm#ADFNS00505
CREATE INDEX somename ON GCR_ITEMS( trim( upper( NA_ORG_OWNER_EMAIL )))
Unfortunately even the function based index will still not help, because the condition in the query is too general - if a value of :P55_DIRECT = ALL the query must retrieve all rows from the table (perform a full table scan), otherwise must use the index to search value within it.
This is because the query is planned (think of it as "compiled") by the query optimizer only once, during it's first execution. Then the plan is stored in the cache and used to execute the query for all further executions. A value of the parameter is not know in advance, so the plan must consider each possible cases, thus will always perform a full table scan.
In 12c there is a new feature "Adaptive query optimalization":
https://docs.oracle.com/database/121/TGSQL/tgsql_optcncpt.htm#TGSQL94982
where the query optimizer analyses each parameters of the query on each runs, and is able to detect that the plan is not optimal for some runtime parameters, and choose a better "subplans" depending on actual parameter's value ... but you must use 12c, and additionally pay for Enterprise Edition, because only this edition includes that feature. And it's still not certain if the adaptive plan will work in this case or not.
What you can do without paying for 12c EE is to DIVIDE this general query into two separate variants, one for a case where :P55_DIRECT = ALL, and the other for remaining cases, and run an appropriate variant in the client (your application) depending on the value of this parameter.
A version for :P55_DIRECT = ALL, that will perform a full table scan
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
order by 1
and a version for other cases, that will use the function based index:
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
trim(upper(:P55_DIRECT)) = trimm(upper(NA_ORG_OWNER_EMAIL))
order by 1

How to flatten a one-to-many relationship

While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables tables that look like
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME | EVENT_DATE
---------------------------------------
1 | 1 | ACC_APPLIED | 2014-01-01
2 | 1 | ACC_OPENED | 2014-01-02
3 | 1 | ACC_CLOSED | 2014-01-02
There is a one-to-many relationship between master and the events table.Since, given a limited number of event names I proposing that we denormalize this structure into something that looks like
ID | CUST_NAME | CUST_EMAIL | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID |ACC_CLOSE_DATE_ID
-----------------------------------------------------------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM | 20140101 | 20140102 | 20140103
THE DATE_ID columns refer to entries inside the time dimension table.
First question : Is this a good idea ? What are the other alternatives to this scheme ?
Second question : How do I implement this using Talend Open Studio ? I figured out a way in which I moved the data for each event name into it's own temporary table along with cust_id using the tMap component and later linked them together using another tMap. Is there another way to do this in talend ?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yeah, this is a good idea, this is called a cumulative snapshot fact. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
Not sure how to do this in Talend (dont know the tool) but it would be quite easy to implement in SQL using a Case or Pivot statement
Regarding only your first question, it's certainly a good idea -- unless there is any possibility of the same persons applying-opening-closing their account more than once AND you want to keep all this information in their history (so UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse. So, denormalizing will certainly be a good choice in this case. Following article almost fits perfectly to clear the air over such scenarios,
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/

Storing a COUNT of values in a table

I have a table with data along the (massively simplified) lines of:
User | Value
-----|------
UsrA | 100
UsrA | 102
UsrB | 100
UsrA | 100
UsrB | 101
and, for reasons far to obscure to go into, I need to store the COUNT of each value in a table for future retrieval - ending up with something like
User | Value100Count | Value101Count | Value102Count
-----|---------------|---------------|--------------
UsrA | 2 | 0 | 1
UsrB | 1 | 1 | 0
However, there could be up to 255 different Values - meaning potentially 255 different ValueXCount columns. I know this is a horrible way to do things, but is there an easy way to get the data into a format that can be easily INSERTed into the destination table? Is there a better way to store the COUNT of values per user (unfortunately I do need to store this information; grabbing it from the source table each time isn't an option)?
The whole thing isn't very pretty, but you know that, rather than your table with 255 columns I'd consider setting up another table with:
User | Value | CountOfValue
And set a primary key over User and Value.
You could then insert the count's for given user/value combos into the CountOfValue field
As I said, the design is horrible and it feels like you would be better off starting from scratch, normalizing and doing counts live.
Check out indexed views. You can maintain the table automatically, with integrity and as a bonus it can get used in queries that already do count(*) on that data.

Vertica and joins

I'm adapting a web analysis tool to use Vertica as the DB. I'm having real problems optimizing joins. I tried creating pre-join projections for some of my queries, and while it did make the queries blazing fast, it slowed data loading into the fact table to a crawl.
A simple INSERT INTO ... SELECT * FROM which we use to load data into the fact table from a staging table goes from taking ~5 seconds to taking 20+ minutes.
Because of this I dropped all pre-join projections and tried using the Database Designer to design query specific projections but it's not enough. Even with those projections a simple join is taking ~14 seconds, something that takes ~1 second with a pre-join projection.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
We're running Vertica on a 5 node cluster, each node having 2 x quad core CPU and 32 GB of memory. The tables in my example query have 188,843,085 and 25,712,878 rows respectively.
The EXPLAIN output looks like this:
EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT sessio
n.id) as visits FROM owa_session as session JOIN owa_referer AS referer_vi
a_ ON session.referer_id = referer_via_.id WHERE session.yyyymmdd BETWEEN
'20121123' AND '20121123' AND session.site_id = '49' GROUP BY referer_via_
.url ORDER BY visits DESC LIMIT 250;
Access Path:
+-SELECT LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
| Output Only: 250 tuples
| Execute on: Query Initiator
| +---> SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| | Order: count(DISTINCT "session".id) DESC
| | Output Only: 250 tuples
| | Execute on: All Nodes
| | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE
STATISTICS)] (PATH ID: 2)
| | | Aggregates: count(DISTINCT "session".id)
| | | Group By: referer_via_.url
| | | Execute on: All Nodes
| | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows
: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | | Group By: referer_via_.url, "session".id
| | | | Execute on: All Nodes
| | | | +---> JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID:
4) Outer (RESEGMENT)
| | | | | Join Cond: ("session".referer_id = referer_via_.id)
| | | | | Execute on: All Nodes
| | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463, Rows: 1 (ST
ALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | | Projection: public.owa_session_projection
| | | | | | Materialize: "session".id, "session".referer_id
| | | | | | Filter: ("session".site_id = '49')
| | | | | | Filter: (("session".yyyymmdd >= 20121123) AND ("session"
.yyyymmdd <= 20121123))
| | | | | | Execute on: All Nodes
| | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K, Rows:
26M] (PATH ID: 6)
| | | | | | Projection: public.owa_referer_DBD_1_seg_Potency_2012112
2_Potency_20121122
| | | | | | Materialize: referer_via_.id, referer_via_.url
| | | | | | Execute on: All Nodes
To speedup join:
Design session table as being partitioned on column "yyyymmdd". This will enable partition pruning
Add condition on column "yyyymmdd" to _referer_via_ and partition on it, if it is possible (most likely not)
have column site_id as possible close to the beginning of order by list in used (super)projection of session
have both tables segmented on referer_id and id correspondingly.
And having more nodes in cluster do help.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
I guess the amount affected would vary depending on data sets and structures you are working with. But, since this is the variable you changed, I believe it is safe to say the pre-join projection is causing the slowness. You are gaining query time at the expense of insertion time.
Someone please correct me if any of the following is wrong. I'm going by memory and by information picked up with conversations with others.
You can speed up your joins without a pre-join projection a few ways. In this case, the referrer ID. I believe if you segment your projections for both tables with the join predicate that would help. Anything you can do to filter the data.
Looking at your explain plan, you are doing a hash join instead of a merge join, which you probably want to look at.
Lastly, I would like to know via the explain plan or through system tables if your query is actually using the projections Database Designer has recommended. If not, explicitly specify them in your query and see if that helps.
You seem to have a lot of STALE STATISTICS.
Responding to STALE statistics is important. Because that is the reason why your queries are slow. Without statistics about the underlying data, Vertica's query optimizer cannot choose the best execution plan. And responding to STALE statistics only improves SELECT performance not update performance.
If you update your tables regularly do remember there are additional things you have to consider in VERTICA. Please check the answer that I posted to this question.
I hope that should help improve your update speed.
Explore the AHM settings as explained in that answer. If you don't need to be able to select deleted rows in a table later, it is often a good idea to not keep them around. There are ways to keep only the latest epoch version of the data. Or manually purge deleted data.
Let me know how it goes.
I think your query could use some more of being explicit. Also don't use that Devil BETWEEN Try this:
EXPLAIN SELECT
referer_via_.url as referralPageUrl,
COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_
ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd <= '20121123'
AND session.yyyymmdd > '20121123'
AND session.site_id = '49'
GROUP BY referer_via_.url
-- this `visits` column needs a table name
ORDER BY visits DESC LIMIT 250;
I'll say I'm really perplexed as to why you would use the same DATE with BETWEEN may want to look into that.
this is my view coming from an academic background working with column databases, including Vertica (recent PhD graduate in database systems).
Blockquote
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
Blockquote
Yes, updating projections is very slow and you should ideally do it only in large batches to amortize the update cost. The fundamental reason is that each projection represents another copy of the data (of each table column that is part of the projection).
A single row insert requires adding one value (one attribute) to each column in the projection. For example, a single row insert in a table with 20 attributes requires at least 20 column updates. To make things worse, each column is sorted and compressed. This means that inserting the new value in a column requires multiple operations on large chunks of data: read data / decompress / update / sort / compress data / write data back. Vertica has several optimization for updates but cannot hide completely the cost.
Projections can be thought of as the equivalent of multi-column indexes in a traditional row store (MySQL, PostgreSQL, Oracle, etc.). The upside of projections versus traditional B-Tree indexes is that reading them (using them to answer a query) is much faster than using traditional indexes. The reasons are multiple: no need to access head data as for non-clustered indexes, smaller size due to compression, etc. The flipside is that they are way more difficult to update. Tradeoffs...