Why Synapse Analytics chooses a much bigger table to broadcast when joining - azure-synapse

I have two tables. Table 1 has around 800 rows and the other table has 80,000,000 rows.
I make a simple join between the tables. It really frustrates me that it keeps broadcasting the big table when joining, instead of the small one, causing serious performance problems.
Even if I use an option such as "MERGE JOIN", it still broadcasts the big table. (I come from a Spark background, and if I understand correctly, it should shuffle (for a sort-merge join) instead of broadcasting anything.)
The execution plan looks like this.
I never have such problems with simple joins like this when working with Spark. Can somebody help me with this?

In order to avoid data movement operators in execution plans you should set up your table distribution properly. The best practice here would be to REPLICATE your smaller table and to HASH distribute your large table on a suitable column that provides good distribution.
Some sample DDL:
CREATE TABLE fact.yourBigTable (
...
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH( someColumn )
);
CREATE TABLE dim.yourSmallTable (
...
)
WITH
(
CLUSTERED INDEX ( someColumn ),
DISTRIBUTION = REPLICATE
);
Obviously you have to experiment with your workload and data to find the right combination. Generally I would say updating columnstore indexes isn't a great idea: consider CTAS or another alternative.
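For example, a minimal sketch of the CTAS pattern mentioned above, for a Synapse dedicated SQL pool (the column names someColumn and someMeasure are placeholders, not from the question):
-- Rebuild the table via CTAS instead of updating the columnstore in place
CREATE TABLE fact.yourBigTable_new
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH( someColumn )
)
AS
SELECT someColumn,
       someMeasure * 1.1 AS someMeasure  -- illustrative transformation instead of an UPDATE
FROM fact.yourBigTable;

-- Swap the new copy in once it is ready
RENAME OBJECT fact.yourBigTable TO yourBigTable_old;
RENAME OBJECT fact.yourBigTable_new TO yourBigTable;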

Related

indexing of Temptables

Good day to everyone!
I have a migration process from a remote query: I fetch data and store it in a #TempTable.
The question is, which is better: creating the index right after creating the #temptable, or inserting the data into the #temptable first and then creating the index? And why? Or is it better to process the data in the remote query before inserting it into a #temptable?
ex.
Select * into #BiosData
from sometable a
where (a.Status between 3 and 5)
CREATE CLUSTERED INDEX IDX_MAINID ON #BiosData([MAINID])
**Process the data retrieved above....**
OR this?
select a.MAINID into #BiosData
from sometable a
inner join Transactions.sometable c
on a.ID = c.fld_ID
inner join Reference.sometable b
on cast(a.[ID]/1000000000000 as decimal (38,0)) = b.fld_ID
where a.version > b.fld_version
and (a.Status between 3 and 5)
Thank you for your tips and suggestions :) I'm a newbie in SQL, please be gentle with me :)
As a generic rule:
If you create a fresh table, are going to insert data into it, and it needs an index, then it is faster to insert the data first and create the index afterwards. Why? Because creating an index means calculating it over data that already exists, whereas inserting data into an indexed table continuously reshuffles the index contents, which also have to be written. So by creating the index afterwards you avoid the overhead of updating the index while inserting.
Exception 1: if you want to have the index combined with the data, so that when a read hits the index to find a particular value it also has the data available in the same read operation. In Oracle this is called an index-organized table. I think in MS SQL it is called a clustered index, but I'm not 100% sure.
Exception 2: if your index is used to enforce some constraint, then creating the index first is a good option to make sure the constraint is maintained during the inserts.
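As a small illustration of Exception 2 (a hypothetical sketch using a made-up #BiosStaging table, reusing the MAINID and Status columns from the question): creating a unique index before the insert makes duplicates fail at insert time instead of being discovered afterwards.
CREATE TABLE #BiosStaging (MAINID int NOT NULL);

-- Enforce uniqueness up front, so duplicate MAINIDs fail as they arrive
CREATE UNIQUE CLUSTERED INDEX IDX_Staging_MAINID ON #BiosStaging (MAINID);

INSERT INTO #BiosStaging (MAINID)
SELECT DISTINCT a.MAINID
FROM sometable a
WHERE a.Status BETWEEN 3 AND 5;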
In your case: I notice that the more complex query has an additional where clause; it may result in fewer inserts and hence faster processing. However, if the tables used in the complex query have additional indexes which speed up that query, make sure similar indexes are also created on the temp table.
Finally: indexes are typically used to reduce disk I/O, and temporary tables are, if I am not mistaken, maintained in memory. So adding indexes is not guaranteed to increase speed...

is it considerably faster to query different tables than having a where clause

imagine that we have this table:
create table Foo(
id int,
name varchar,
k int --can be 1 or 2 or 3
)
or we could have 3 tables, one for each value of k:
create table Fook1(
id int,
name varchar
)
...
create table Fook2
...
create table Fook3
is it going to be considerably faster to do:
select * from Foo where k = 3
than doing:
select * from Fook3
Potentially, using multiple tables could be faster than using a single table (particularly if those tables are going to have many millions of records), but there would be trade-offs in terms of ease of use, manageability, etc.
However, you could have the benefits of both by partitioning your table.
-Do-Not-Do-That-
Oh, wait, that's not helpful, it's just belligerent :)
Partitioning the data in this way CAN yield performance benefits. But it also introduces other costs:
- Queries that need to span all three tables become more complex
- Your schema becomes more cluttered
- It's easier to make mistakes
- It's hard to ensure referential integrity
- You may need to include a view to unify the 3 tables
You are most likely much better off with an index that includes k. Depending on how you query the data, k may even be the first field in that index. When you specify k = ?, it just needs to do a very quick check in the index and then you're only looking at the relevant portion of the table. And, if the index is a clustered index, the data is even physically stored in that order.
I'd highly recommend making use of indexes in this way before partitioning your data. Partitioning is an optimisation with costs, and so should only be approached when it can be shown to be necessary, not used as a safety net early in design.
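A minimal sketch of that kind of index (assuming SQL Server syntax; the index name is made up):
-- k as the leading column, so WHERE k = ? seeks straight to the relevant range
CREATE CLUSTERED INDEX IX_Foo_k ON Foo (k, id);

SELECT id, name
FROM Foo
WHERE k = 3;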
It may depend on the DB, so a real example is needed. For instance, in Oracle you can use partitioning, which does exactly what you describe here behind the curtains, or create a materialized view with the union, and then you have the option to do both.
Normally, I'd say that you should create a correct implementation first and then tune; premature optimization is the root of all evil, especially with DBs. I think it is quite likely that your bottleneck is not going to be where you expect it.

Is it reasonable to divide data into different tables, based on a column value?

If I have a large table with a column which has a rather limited range of values (e.g. < 100), is it reasonable to divide this table into several tables with names tied to that column value?
E.g. a table like with columns:
table "TimeStamps": [Id] [DeviceId] [MessageCounter] [SomeData]
where [DeviceId] is the "limited range" column would be separated into several different tables:
table "TimeStamps1": [Id] [MessageCounter] [SomeData]
table "TimeStamps2": [Id] [MessageCounter] [SomeData]
...
table "TimeStampsN": [Id] [MessageCounter] [SomeData]
The problem I am having with my original table is that finding the largest MessageCounter value for some DeviceId values takes a really long time to execute (see this post).
If the tables were separated, finding the maximum MessageCounter should be an O(1) operation.
[Edit]
Just stumbled upon this again, so I thought I would update it. The problem that originally brought me here was performance issues when querying the original database. However, after adding additional indexes and scheduling index reorganization jobs, I was able to get great performance with the normalized form. The SSMS Database Engine Tuning Advisor tool was a great help for identifying bottlenecks and suggesting the missing indexes.
While you could do it as a last-ditch performance optimization, I would advise against it. Mainly because it makes it very difficult to accommodate new DeviceIDs.
At any rate, doing this should not be necessary. If there's an index for DeviceID, the DBMS should be able to filter on it very quickly. That's what a DBMS is for, after all...
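For instance, a hedged sketch of the kind of index that usually makes that query fast (the index name and the DeviceId value are illustrative):
-- One index seek per DeviceId answers the MAX(MessageCounter) question
CREATE INDEX IX_TimeStamps_Device_Counter
    ON TimeStamps (DeviceId, MessageCounter);

SELECT MAX(MessageCounter)
FROM TimeStamps
WHERE DeviceId = 42;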
I fear this approach would add a great deal to the complexity of any application which needed to access this data. An alternate approach, which gains you whatever benefits you might get from putting each device in a separate table while still keeping all devices in the same table, would be to partition the table on DeviceID. I suggest that you investigate table partitioning to see if it fits your needs.
Share and enjoy.
This is what a distributed database is for. The servers share a table in the same database based on some column. You tell the servers how to distribute the table based on ranges of column values. Once this is set up you just query the table and aren't concerned on which server the data actually resides.
Have you considered Database partitioning? This is the baked in solution for the type of problem you've described. See: Partitioned Tables and Indexes in SQL Server 2005
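A minimal sketch of what that looks like in SQL Server (the boundary values, column types and filegroup are placeholders):
-- Partition rows by DeviceId instead of splitting them into separate tables
CREATE PARTITION FUNCTION pfDeviceId (int)
    AS RANGE LEFT FOR VALUES (25, 50, 75);

CREATE PARTITION SCHEME psDeviceId
    AS PARTITION pfDeviceId ALL TO ([PRIMARY]);

CREATE TABLE TimeStampsPartitioned
(
    Id             int          NOT NULL,
    DeviceId       int          NOT NULL,
    MessageCounter int          NOT NULL,
    SomeData       varchar(100) NULL
) ON psDeviceId (DeviceId);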

Which one have better performance : Derived Tables or Temporary Tables

Sometimes we can write a query with either a derived table or a temporary table. My question is: which one is better, and why?
A derived table is a logical construct.
It may be stored in tempdb, built at runtime by reevaluating the underlying statement each time it is accessed, or even optimized away entirely.
A temporary table is a physical construct. It is a table in tempdb that is created and populated with the values.
Which one is better depends on the query they are used in, the statement that is used to derive the table, and many other factors.
For instance, CTEs (common table expressions) in SQL Server can (and most probably will) be reevaluated each time they are used. This query:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will most probably yield two different NEWID()'s.
In this case, a temporary table should be used since it guarantees that its values persist.
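For contrast, a minimal sketch of the temp-table version (using a hypothetical #q), where the generated value is materialized once and therefore comes back identical from both branches:
-- Materialize the NEWID() once
SELECT NEWID() AS uuid
INTO #q;

SELECT *
FROM #q
UNION ALL
SELECT *
FROM #q;

DROP TABLE #q;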
On the other hand, this query:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM master
) q
WHERE rn BETWEEN 80 AND 100
is better with a derived table, because using a temporary table will require fetching all values from master, while this solution will just scan the first 100 records using the index on id.
It depends on the circumstances.
Advantages of derived tables:
A derived table is part of a larger, single query, and will be optimized in the context of the rest of the query. This can be an advantage if the query optimization helps performance (it usually does, with some exceptions). Example: if you populate a temp table and then consume the results in a second query, you are in effect tying the database engine to one execution method (run the first query in its entirety, save the whole result, run the second query), whereas with a derived table the optimizer might be able to find a faster execution method or access path.
A derived table only "exists" in terms of the query execution plan - it's purely a logical construct. There really is no table.
Advantages of temp tables
The table "exists" - that is, it's materialized as a table, at least in memory, which contains the result set and can be reused.
In some cases, performance can be improved or blocking reduced when you have to perform some elaborate transformation on the data - for example, if you want to fetch a 'snapshot' set of rows out of a base table that is busy, and then do some complicated calculation on that set, there can be less contention if you get the rows out of the base table and unlock it as quickly as possible, then do the work independently. In some cases the overhead of a real temp table is small relative to the advantage in concurrency.
I want to add an anecdote here as it leads me to advise the opposite of the accepted answer. I agree with the thinking presented in the accepted answer, but it is mostly theoretical. My experience has led me to recommend temp tables over derived tables, common table expressions and table-valued functions. We used derived tables and common table expressions extensively, with much success, based on thinking consistent with the accepted answer, until we started dealing with larger result sets and/or more complex queries. Then we found that the optimizer did not optimize well with the derived table or CTE.
I looked at an example today that ran for 10:15. I inserted the results from the derived table into a temp table, joined the temp table in the main query, and the total time dropped to 0:03. Usually when we see a big performance problem we can quickly address it this way. For this reason I recommend temp tables unless your query is relatively simple and you are certain it will not be processing large data sets.
The big difference is that you can put constraints, including a primary key, on a temporary table. For big tables (I mean millions of records) you can sometimes get better performance with temporary tables. I have a key query that needs 5 joins (each join happens to be similar). Performance was OK with 2 joins, but on the third, performance went bad and the query plan went crazy. Even with hints I could not correct the query plan. I tried restructuring the joins as derived tables and had the same performance issues. With temporary tables I can create a primary key (and then sort on the PK when I populate the table). When SQL could join the 5 tables and use the PK, performance went from minutes to seconds. I wish SQL supported constraints on derived tables and CTEs (even if only a PK).
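A hedged sketch of that pattern (all table and column names here are made up for illustration):
-- Materialize the intermediate result with a primary key the optimizer can use
CREATE TABLE #intermediate
(
    CustomerId int NOT NULL PRIMARY KEY
);

INSERT INTO #intermediate (CustomerId)
SELECT DISTINCT o.CustomerId
FROM dbo.Orders o
WHERE o.OrderDate >= '2024-01-01'
ORDER BY o.CustomerId;  -- mirrors the "sort on PK when populating" point above

SELECT c.CustomerId, c.Name
FROM dbo.Customers c
JOIN #intermediate i
    ON i.CustomerId = c.CustomerId;

DROP TABLE #intermediate;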

faster way to use sets in MySQL

I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..
There are roughly 7 million rows in the table. Currently, the table is being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
In the actual query, currently over 560 user_ids are in the IN clause. With several million records in the table, this query is slow!
There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.
I know that SELECT * is A Bad Thing and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. There are another 50 of those being returned, but they are needed for the report.
I imagine there's a much better way to access the data for these user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?
I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.
EDIT
Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
does that help?
First, check if there is an index on user_id and make sure it's used.
You can do that by running EXPLAIN.
Second, create a temporary table, populate it with the ids, and use it in a JOIN:
CREATE TABLE temptable (user_id INT NOT NULL);

-- populate with the ids to look up (a few from the question, for illustration)
INSERT INTO temptable (user_id)
VALUES (32343), (45676), (12345), (98765), (66010);

SELECT *
FROM temptable t
JOIN customers c
ON c.user_id = t.user_id;
Third, how many rows does your query return?
If it returns almost all rows, then it will just be slow, since it has to push all these millions of rows over the connection channel to begin with.
NULL will not slow your query down, since the IN condition only matches non-NULL values, which are indexed.
Update:
The index is used, and the plan is fine, except that it returns more than half a million rows.
Do you really need to put all 638,000 rows into the report?
Hope it's not printed: bad for rainforests, global warming and stuff.
Speaking seriously, you seem to need either aggregation or pagination in your query.
"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).
First things first: does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too huge to reasonably fit into RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what's holding things up.
I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10 etc.) and at least two spindles. If you're not, go away and get that.
You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
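A hedged sketch of the composite-primary-key approach described above, assuming MySQL/InnoDB and the customers table from the question (and assuming record_id is not an AUTO_INCREMENT column that has to stay the leading column of a key):
-- Recluster the table so rows for the same user_id are stored together;
-- user_id must be NOT NULL to be part of the primary key.
ALTER TABLE customers
    MODIFY user_id INT NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id);
Note that rebuilding a 7-million-row table like this takes time and locks the table, so it is something to try on a copy first.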
As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.
BUT my biggest tips are:
Have a goal in mind, work out what it is, and when you reach it, stop.
Don't take anybody's word for it - try it and see
Ensure that your performance test system is the same hardware spec as production
Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
Use synthetic data if it is not possible to use production data (copying production data may be logistically difficult (remember your database is > 32 GB); it may also violate security policies).
If your query is optimal (as it probably already is), try tuning the schema, then the database itself.
Is this your most important query? Is it a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still has to make random disk reads to retrieve the columns (key lookups), even after finding the records that match (an index seek on the user_id index).
If you cannot change the clustered index, then you might want to consider an ETL process (the simplest is a trigger that inserts into another table with better indexing). This should yield faster results.
Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.
Are they the same ~560 ids every time? Or is it a different ~560 ids on different runs of the query?
You could just insert your 560 user_ids into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.
You could try inserting the ids you need to query on into a temp table and inner joining both tables. I don't know if that would help.