Distinct performance in Redshift - sql

I am trying to populate multiple dimension tables from a single base table.
Sample Base Table:
| id | empl_name | emp_surname | country | dept | university |
| 1 | AAA | ZZZ | USA | CE | U_01 |
| 2 | BBB | XXX | IND | CE | U_01 |
| 3 | CCC | XXX | CAN | IT | U_02 |
| 4 | CCC | ZZZ | USA | MECH | U_01 |
Required Dimension tables :
emp_name_dim with values - AAA,BBB,CCC
emp_surname_dim with values - ZZZ,XXX
country_dim with values - USA,IND,CAN
dept_dim with values - CE,IT,MECH
university_dim with values - U_01,U_02
Now, to populate the above dimension tables from the base table, I am thinking of 2 approaches:
1. Get the distinct values from the base table for the combination of all the above columns, create a single temp table from that, and use that temp table to create each individual dimension table. Here, I will read the base table only once, but with a wider column combination.
2. Create separate temp tables holding the distinct values for each dimension. This way we need to read the base table multiple times, but each temp table will be smaller (i.e. fewer rows and only a single column's distinct values).
Which approach is better if we consider for performance?
Note :
Base table is huge containing millions of rows.
The above columns are just a sample. The actual table has around 50 columns that I need to consider for the distinct combination.

Scanning the large table only once is the way to go.
There is also another way to get the distinct values which in some cases will be faster than DISTINCT: perform a GROUP BY on all the columns. Run both as a bake-off to see which is faster. In general, if the result will have a small number of rows (small enough to fit in memory), then DISTINCT will be faster. However, if the result will be large, then GROUP BY will be faster. There are a lot of corner cases and factors (distribution style, for example) that can affect this rule of thumb, so testing both will tell you which is faster in your case.
Given that you have 50 columns and you want all the unique combinations, I'd guess that the output set will be large and that GROUP BY will win, but this is just a guess.
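To make the single-scan idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Redshift (an assumption for the sake of a runnable example; timings in SQLite say nothing about Redshift). The table name base and the temp table name base_distinct are hypothetical. It shows that DISTINCT and GROUP BY return the same combinations, and that the dimension tables can then be built from the small temp table rather than from the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical base table mirroring the sample in the question.
conn.execute(
    "CREATE TABLE base (id INTEGER, empl_name TEXT, emp_surname TEXT, "
    "country TEXT, dept TEXT, university TEXT)"
)
conn.executemany(
    "INSERT INTO base VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "AAA", "ZZZ", "USA", "CE", "U_01"),
        (2, "BBB", "XXX", "IND", "CE", "U_01"),
        (3, "CCC", "XXX", "CAN", "IT", "U_02"),
        (4, "CCC", "ZZZ", "USA", "MECH", "U_01"),
    ],
)

cols = "empl_name, emp_surname, country, dept, university"

# The two formulations to benchmark against each other on the real cluster.
distinct_rows = conn.execute(f"SELECT DISTINCT {cols} FROM base").fetchall()
group_by_rows = conn.execute(
    f"SELECT {cols} FROM base GROUP BY {cols}"
).fetchall()

# Approach 1: scan the big table once, keep the combinations in a temp table.
conn.execute(
    f"CREATE TEMP TABLE base_distinct AS SELECT {cols} FROM base GROUP BY {cols}"
)

# Each dimension is then derived from the temp table only.
conn.execute(
    "CREATE TABLE country_dim AS SELECT DISTINCT country FROM base_distinct"
)
countries = sorted(r[0] for r in conn.execute("SELECT country FROM country_dim"))
```

On the real cluster, the bake-off would consist of timing the two SELECT statements against the full base table.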


Selecting Sorted Records Prior to Target Record

The background to this question is that we have had to hand-roll replication between a 3rd party Oracle database and our SQL Server database since there are no primary keys defined in the Oracle tables but there are unique indexes.
In most cases the following method works fine: we load the values of the columns in the unique index along with an MD5 hash of all column values from each corresponding table in the Oracle and SQL Server databases and are able to then calculate what records need to be inserted/deleted/updated.
However, in one table the sheer number of rows precludes us from loading all records into memory from the Oracle and SQL Server databases. So we need to do the comparison in blocks.
The method I am considering is: to query the first n records from the Oracle table and then - using the same sort order - to query the SQL Server table for all records up to the last record that was returned from the Oracle database and then compare the two data sets for what needs to be inserted/deleted/updated.
Then once that has been done to load the next n records from the Oracle database and query the records in the SQL Server table that when sorted in the same way fall between (and include) the first and last records in that data set.
My question is: how to achieve this in SQL Server? If I have the values of the nth record (having queried the table in Oracle with a certain sort order) how can I return the range of records up to and including the record with those values from SQL Server?
I have the following table:
| AQ000001_10_25/07/2004 00:00:00_14_1 | AQ000001 | 10 | 2004-07-25 00:00:00.000 | 14 | 1 | Black 2.5mm Cable |
| AQ000004_91_26/07/2004 00:00:00_15.4833333333333_64 | AQ000004 | 91 | 2004-07-26 00:00:00.000 | 15.4333333333333 | 63 | 2.5mm Yellow Cable |
| AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18 | AQ000005 | 31 | 2004-07-26 00:00:00.000 | 10.8333333333333 | 18 | Rotary Cam Switch |
| AQ000012_50_26/07/2004 00:00:00_11.3_17 | AQ000012 | 50 | 2004-07-26 00:00:00.000 | 11.3 | 17 | 3Mtr Heavy Gauge Cable |
The Id field is basically a concatenation of the five fields which make up the unique index on the table i.e. SOU_ORDREF, SOU_LINESEQ, SOU_DATOVER, SOU_TIMEOVER, and SOU_SEQ.
What I would like to do is to be able to query, for example, all the records (when sorted by those columns) up to the record with the Id 'AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18' which would give us the following result (I'll just show the ids):
| Id |
| AQ000001_10_25/07/2004 00:00:00_14_1 |
| AQ000004_91_26/07/2004 00:00:00_15.4833333333333_64 |
| AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18 |
So, the query has not included the record with Id 'AQ000012_50_26/07/2004 00:00:00_11.3_17' since it comes after 'AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18' when we order by SOU_ORDREF, SOU_LINESEQ, SOU_DATOVER, SOU_TIMEOVER, and SOU_SEQ.
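One possible way to express "all records up to and including this key" is an expanded lexicographic comparison over the key columns, which needs no row-value syntax and so should carry over to SQL Server. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 module as a runnable stand-in, the table name orders and the column descr are hypothetical, and it presumes both databases sort the key columns identically (collation differences between Oracle and SQL Server would break that).

```python
import sqlite3

# Key columns of the unique index, in sort order (names from the question).
KEY_COLS = ["SOU_ORDREF", "SOU_LINESEQ", "SOU_DATOVER", "SOU_TIMEOVER", "SOU_SEQ"]


def up_to_predicate(cols):
    """Build 'key <= target' as (a < ?) OR (a = ? AND b < ?) OR ...
    with '<=' on the last column so the target row itself is included."""
    terms = []
    for i, col in enumerate(cols):
        op = "<=" if i == len(cols) - 1 else "<"
        parts = [f"{c} = ?" for c in cols[:i]] + [f"{col} {op} ?"]
        terms.append("(" + " AND ".join(parts) + ")")
    return " OR ".join(terms)


def predicate_params(target):
    """Parameters in placeholder order: term i uses the first i+1 key values."""
    params = []
    for i in range(len(target)):
        params.extend(target[: i + 1])
    return params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (SOU_ORDREF TEXT, SOU_LINESEQ INTEGER, "
    "SOU_DATOVER TEXT, SOU_TIMEOVER REAL, SOU_SEQ INTEGER, descr TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("AQ000012", 50, "2004-07-26 00:00:00.000", 11.3, 17, "3Mtr Heavy Gauge Cable"),
        ("AQ000001", 10, "2004-07-25 00:00:00.000", 14.0, 1, "Black 2.5mm Cable"),
        ("AQ000005", 31, "2004-07-26 00:00:00.000", 10.8333333333333, 18, "Rotary Cam Switch"),
        ("AQ000004", 91, "2004-07-26 00:00:00.000", 15.4333333333333, 63, "2.5mm Yellow Cable"),
    ],
)

# Key values of the last record returned by the other database.
target = ("AQ000005", 31, "2004-07-26 00:00:00.000", 10.8333333333333, 18)
sql = (
    f"SELECT SOU_ORDREF FROM orders WHERE {up_to_predicate(KEY_COLS)} "
    f"ORDER BY {', '.join(KEY_COLS)}"
)
block = [r[0] for r in conn.execute(sql, predicate_params(target))]
```

For the second and later blocks, the same construction with the comparisons reversed (strictly greater than the previous block's last key) could be ANDed in as a lower bound.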

Searching a "vertical" table in SQLite

Tables are usually laid out in a "horizontal" fashion:
| 1 | Jim | Jones |
| 2 | Adam | Smith |
Here, however, is a table with the same data in a "vertical" layout:
|rowID|recID| Property | Value |
| 1 | 1 |FirstName | Jim | \
+-----+-----+----+-----+-------+ These two rows constitute a single logical record
| 2 | 1 |LastName | Jones | /
| 3 | 2 |FirstName | Adam | \
+-----+-----+----+-----+-------+ These two rows are another single logical record
| 4 | 2 |LastName | Smith | /
Question: In SQLite, how can I search the vertical table efficiently and in such a way that recIDs are not duplicated in the result set? That is, if multiple matches are found with the same recID, only one (any one) is returned?
Example (incorrect):
SELECT rowID from items WHERE "Value" LIKE "J%"
returns of course two rows with the same recID:
1 (Jim)
2 (Jones)
What is the optimal solution here? I can imagine storing intermediate results in a temp table, but hoping for a more efficient way.
(I need to search through all properties, so the SELECT cannot be restricted with e.g. "Property" = "FirstName". The database is maintained by a third-party product; I suppose the design makes sense because the number of property fields is variable.)
To avoid duplicate rows in the result returned by a SELECT, use DISTINCT:
SELECT DISTINCT recID
FROM items
WHERE "Value" LIKE 'J%'
However, this works only for the values that are actually returned, and only for entire result rows.
In the general case, to return one result record for each group of table records, use GROUP BY to create such groups.
For any column that does not appear in the GROUP BY clause, you then have to choose which value in the group to return; here we use MIN to pick one rowID per record:
SELECT MIN(rowID)
FROM items
WHERE "Value" LIKE 'J%'
GROUP BY recID
To make this query more efficient, create an index on the recID column.
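The two queries can be tried end to end with Python's built-in sqlite3 module. This is a small sketch on the sample data from the question; the index name items_recid_idx is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE items (rowID INTEGER PRIMARY KEY, recID INTEGER, '
    '"Property" TEXT, "Value" TEXT)'
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, 1, "FirstName", "Jim"),
        (2, 1, "LastName", "Jones"),
        (3, 2, "FirstName", "Adam"),
        (4, 2, "LastName", "Smith"),
    ],
)
# Index suggested in the answer; the name is hypothetical.
conn.execute("CREATE INDEX items_recid_idx ON items (recID)")

# One row per matching logical record, returning only recID.
distinct_rec_ids = [
    r[0]
    for r in conn.execute(
        """SELECT DISTINCT recID FROM items WHERE "Value" LIKE 'J%'"""
    )
]

# One row per matching logical record, plus the smallest matching rowID.
one_row_per_record = conn.execute(
    """SELECT recID, MIN(rowID) FROM items
       WHERE "Value" LIKE 'J%' GROUP BY recID"""
).fetchall()
```

Both "Jim" and "Jones" match the pattern, but each query reports record 1 only once.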

SQL: Creating a common table from multiple similar tables

I have multiple databases on a server, each with a large table where most rows are identical across all databases. I'd like to move this table to a shared database and then have an override table in each application database which has the differences between the shared table and the original table.
The aim is to make updating and distributing the data easier as well as keeping database sizes down.
Problem constraints
The table is a hierarchical data store with date based validity.
table DATA (
ID int primary key,
CODE nvarchar,
PARENT_ID int foreign key references DATA(ID),
END_DATE datetime
)
Each unique CODE in DATA may have a number of rows, but at most a single row where END_DATE is null or greater than the current time (a single valid row per CODE). New references are only made to valid rows.
Updating the shared database should not require anything to be run in application databases. This means any override tables are final once they have been generated.
Existing references to DATA.ID must point to the same CODE, but other columns do not need to be the same. This means any current rows can be invalidated if necessary and multiple occurrences of the same CODE may be combined.
PARENT_ID references must have same parent CODE before and after the split. The actual PARENT_ID value may change if necessary.
The shared table is updated regularly from an external source and these updates need to be reflected in each database's DATA. CODEs that do not appear in the external source can be thought of as invalid, new references to these will not be added.
Existing functionality will continue to use DATA, so the new view (or alternative) must be transparent. It may, however, contain more rows than the original provided earlier constraints are met.
New functionality will use the shared table directly.
Select performance is a concern, insert/update/delete is not.
The solution needs to support SQL Server 2008 R2.
Possible solution
-- in a single shared DB
-- in each app DB
DATA_SHARED (synonym to DATA_SHARED in shared DB)
Take an existing DATA table to become DATA_SHARED.
Exclude IDs with more than one possible CODE so only rows common across all databases remain. These missing rows will be added back once the data is updated the first time.
Unfortunately every DATA_OVERRIDE will need all rows that differ in any table, not only rows that differ between DATA_SHARED and the previous DATA. There are several IDs that differ only in a single database, this causes all other databases to inflate. Ideas?
This solution causes DATA_SHARED to have a discontinuous ID space. It's a mild annoyance rather than a major issue, but worth noting.
edit: I should be able to keep all of the rows in DATA_SHARED, just invalidate them, then I only need to store differing rows in DATA_OVERRIDE.
I can't think of any situations where PARENT_ID references become invalid, thoughts?
1 | A | NULL | NULL
2 | A1 | 1 | 2020
3 | A2 | 1 | 2010
1 | A | NULL | NULL
2 | X | NULL | NULL
3 | A2 | 1 | 2010
4 | X1 | 2 | NULL
5 | A1 | 1 | 2020
After initial processing (DATA_SHARED created from DB1.DATA):
1 | A | NULL | NULL
3 | A2 | 1 | 2010
-- END_DATE is omitted from DATA_OVERRIDE as every row is implicitly invalid
2 | A1 | 1
2 | X |
4 | X1 | 2
5 | A1 | 1
After update from external data where A1 exists in source but X and X1 don't:
1 | A | NULL | NULL
3 | A2 | 1 | 2010
6 | A1 | 1 | 2020
edit: The DATA view would be something like:
select D.ID, ...
from DATA_SHARED D
left join DATA_OVERRIDE O on D.ID = O.ID
where O.ID is null
union all
select O.ID, ...
from DATA_OVERRIDE O
order by ID
Given the small number of rows in DATA_OVERRIDE, performance is good enough.
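As a rough, runnable illustration of that view, here is a sketch in Python's sqlite3 (a stand-in for SQL Server 2008 R2, so the syntax details are assumptions). It uses the DB2 rows from the example after initial processing and projects only ID, CODE and PARENT_ID, since END_DATE is omitted from DATA_OVERRIDE. The extra override row for ID 3 at the end is hypothetical and only there to show an override shadowing a shared row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DATA_SHARED (
        ID INTEGER PRIMARY KEY, CODE TEXT, PARENT_ID INTEGER, END_DATE INTEGER
    );
    CREATE TABLE DATA_OVERRIDE (
        ID INTEGER PRIMARY KEY, CODE TEXT, PARENT_ID INTEGER
    );
    INSERT INTO DATA_SHARED VALUES (1, 'A', NULL, NULL), (3, 'A2', 1, 2010);
    INSERT INTO DATA_OVERRIDE VALUES (2, 'X', NULL), (4, 'X1', 2), (5, 'A1', 1);

    -- Shared rows that are not overridden, plus every override row.
    CREATE VIEW DATA AS
    SELECT D.ID, D.CODE, D.PARENT_ID
    FROM DATA_SHARED D
    LEFT JOIN DATA_OVERRIDE O ON D.ID = O.ID
    WHERE O.ID IS NULL
    UNION ALL
    SELECT O.ID, O.CODE, O.PARENT_ID
    FROM DATA_OVERRIDE O;
    """
)

rows = conn.execute("SELECT ID, CODE, PARENT_ID FROM DATA ORDER BY ID").fetchall()

# Hypothetical: override shared row 3 locally; the view must return it once.
conn.execute("INSERT INTO DATA_OVERRIDE VALUES (3, 'A2', NULL)")
rows_after = conn.execute(
    "SELECT ID, CODE, PARENT_ID FROM DATA ORDER BY ID"
).fetchall()
```

The anti-join (LEFT JOIN ... WHERE O.ID IS NULL) is what keeps an overridden ID from appearing twice.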
I also considered an approach where instead of DATA_SHARED sharing IDs with the original DATA, there would be mapping tables to link DATA.IDs to DATA_SHARED.IDs. This would mean DATA_SHARED would have a much cleaner ID-space and there could be less data duplication, but the DATA view would require some fairly heavy joins. The additional complexity is also a significant negative.
Thank you for your time if you made it all the way to the end, this question ended up quite long as I was thinking it through as I wrote it. Any suggestions or comments would be appreciated.

SQL Server: Use a column to save order of the record

I'm facing a database that keeps the ORDERING in columns of the table.
It's like:
Id Name Description Category OrderByName OrderByDescription OrderByCategory
1 Aaaa bbbb cccc 1 2 3
2 BBbbb Aaaaa bbbb 2 1 2
3 cccc cccc aaaaa 3 3 1
So, when the user wants to order by name, the SQL uses ORDER BY OrderByName.
I think this doesn't make any sense, since that's what indexes are for. I tried to find an explanation for this design but haven't found one. Is this faster than using indexes? Is there any scenario where this is really useful?
It can make sense for many reasons but mainly when you don't want to follow the "natural order" given by the ORDER BY clause.
This is a scenario where this can be useful :
SQL Fiddle
MS SQL Server 2008 Schema Setup:
CREATE TABLE Table1
([Id] int, [Name] varchar(15), [OrderByName] int)
INSERT INTO Table1
([Id], [Name], [OrderByName])
VALUES
(1, 'Del Torro', 2),
(2, 'Delson', 1),
(3, 'Delugi', 3)
Query 1:
SELECT *
FROM Table1
Results:
| Id | Name | OrderByName |
| 1 | Del Torro | 2 |
| 2 | Delson | 1 |
| 3 | Delugi | 3 |
Query 2:
SELECT *
FROM Table1
ORDER BY OrderByName
Results:
| Id | Name | OrderByName |
| 2 | Delson | 1 |
| 1 | Del Torro | 2 |
| 3 | Delugi | 3 |
I think it makes little sense for two reasons:
Who is going to maintain this set of values in the table? You need to update them every time any row is added, updated, or deleted. You can do this with triggers, or horribly buggy and unreliable constraints using user-defined functions. But why? The information that seems to be in those columns is already there. It's redundant because you can get that order by ordering by the actual column.
You still have to use massive conditionals or dynamic SQL to tell the application how to order the results, since you can't say ORDER BY @column_name.
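To illustrate the second point, here is one way the application side might look: since the sort column cannot be passed as a bind parameter, the column name is chosen from a fixed whitelist and spliced into the statement. This is a sketch in Python with sqlite3 standing in for SQL Server; the SORT_COLUMNS mapping and the fetch_sorted helper are hypothetical names, not anything from the question.

```python
import sqlite3

# Hypothetical mapping from a user-facing sort key to a real column name.
# Only these fixed strings are ever interpolated into the SQL text.
SORT_COLUMNS = {"id": "Id", "name": "Name", "custom": "OrderByName"}


def fetch_sorted(conn, sort_key):
    """Return (Id, Name) rows ordered by the column mapped to sort_key."""
    column = SORT_COLUMNS.get(sort_key)
    if column is None:
        raise ValueError(f"unsupported sort key: {sort_key!r}")
    return conn.execute(
        f"SELECT Id, Name FROM Table1 ORDER BY {column}"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Id INTEGER, Name TEXT, OrderByName INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(1, "Del Torro", 2), (2, "Delson", 1), (3, "Delugi", 3)],
)

by_custom = [name for _, name in fetch_sorted(conn, "custom")]
by_name = [name for _, name in fetch_sorted(conn, "name")]
```

The same whitelist works whether the target is the text column or a precomputed ordering column, so the ordering columns do not remove the need for this kind of branching.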
Now, I'm basing my assumptions on the fact that the ordering columns still reflect the alphabetical order in the relevant columns. It could be useful if there is some customization possible, e.g. if you wanted all Smiths listed first, and then all Morts, and then everyone else. But I don't see any evidence of this in the question or the data.
This could be useful if the ordering was customizable - that is, if users did not want to see the list in alphabetical order, but rather in some custom order.
An index on the int columns would be smaller than an index on the column that holds the actual text, but I don't see that there is any real benefit to this in most cases.

How can I speed up a count(*) which is already using indexes? (MyISAM)

I have 3 large tables (10k, 10k, and 100M rows) and am trying to do a simple count on a join of them, where all the joined columns are indexed. Why does the COUNT(*) take so long, and how can I speed it up (without triggers and a running summary)?
mysql> describe SELECT COUNT(*) FROM `metaward_alias` INNER JOIN `metaward_achiever` ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`) INNER JOIN `metaward_award` ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`) WHERE `metaward_award`.`owner_id` = 8;
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | metaward_award | ref | PRIMARY,metaward_award_owner_id | metaward_award_owner_id | 4 | const | 1552 | |
| 1 | SIMPLE | metaward_achiever | ref | metaward_achiever_award_id,metaward_achiever_alias_id | metaward_achiever_award_id | 4 | paul.metaward_award.id | 2498 | |
| 1 | SIMPLE | metaward_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.metaward_achiever.alias_id | 1 | Using index |
3 rows in set (0.00 sec)
But actually running the query takes about 10 minutes, and I'm on MyISAM so the tables are fully locked down for that duration.
I guess the reason is that you do a huge join over three tables (without applying the WHERE clause first, the result would be 10k * 10k * 100M = 10^16 rows). Try to reorder the joins (for example, start with metaward_award, then join only metaward_achiever and see how long that takes, then try to plug in metaward_alias, possibly using a subquery to force your preferred evaluation order).
If that does not help, you might have to denormalize your data, for example by storing the number of aliases for a particular metaward_achiever. Then you'd get rid of one join altogether. Maybe you can even cache the sums for metaward_award, depending on how and how often your data is updated.
Another thing that might help is getting all your database content into RAM :-)
Make sure you have indexes on:
metaward_alias id
metaward_achiever alias_id
metaward_achiever award_id
metaward_award id
metaward_award owner_id
I'm sure many people will also suggest counting on a specific column, but in MySQL this doesn't make any difference for your query.
You could also try to set the condition on the main table instead of one of the joined tables. That would give you the same result, but it could be faster (I don't know how clever MySQL is):
SELECT COUNT(*) FROM `metaward_award`
INNER JOIN `metaward_achiever`
ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`)
INNER JOIN `metaward_alias`
ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`)
WHERE `metaward_award`.`owner_id` = 8
10 minutes is way too long for that query. I think you must have a really small key cache. You can get its size in bytes with:
SELECT @@key_buffer_size
First off, you should run ANALYZE TABLE or OPTIMIZE TABLE. They'll sort your index and can slightly improve the performance.
You should also see if you can use more compact types for your columns. For instance, if you're not going to have more than 16 million owners or awards or aliases, you can change your INT columns to MEDIUMINT (UNSIGNED, of course). Perhaps even SMALLINT in some cases? That will reduce your index footprint and you'll fit more of it in the cache.
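A back-of-the-envelope calculation shows the scale of that saving. The sketch below uses MySQL's documented integer storage sizes (SMALLINT 2 bytes, MEDIUMINT 3, INT 4) and assumes, purely for illustration, a single-column index on the 100M-row table. It counts raw key bytes only and ignores row pointers, index page overhead and MyISAM key compression, so real index sizes will differ.

```python
# Storage size in bytes of MySQL integer types.
INT_BYTES = {"SMALLINT": 2, "MEDIUMINT": 3, "INT": 4}


def unsigned_max(type_name):
    """Largest value an UNSIGNED column of this type can hold."""
    return 2 ** (8 * INT_BYTES[type_name]) - 1


def raw_key_bytes(rows, type_name):
    """Raw key data for a single-column index (no pointers, no page overhead)."""
    return rows * INT_BYTES[type_name]


ROWS = 100_000_000  # assumed size of the largest table in the question

# Bytes of key data saved per index by switching INT -> MEDIUMINT.
saving = raw_key_bytes(ROWS, "INT") - raw_key_bytes(ROWS, "MEDIUMINT")

print(f"MEDIUMINT UNSIGNED max: {unsigned_max('MEDIUMINT'):,}")
print(f"raw key bytes saved per index: {saving:,}")
```

Whether that is enough for the index to fit in the key cache depends on the key_buffer_size reported by the query above.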