Very slow SQL join, with limited amount of data

We have a very slow running piece of SQL, and I was wondering if anyone has any
advice on speeding it up.
We are collecting the data from a large number of tables (21) into a single table
for later processing. The tables are temporary tables, and exist only for the query.
All the tables share three columns (USN, DATASET, and INTERNAL_ID), and
the combination of the three is unique in each table, but the same values exist
in all the tables. It it possible that INTERNAL_ID is also unique, but I am not sure.
Each table contains six rows of data, and the output table also contains six rows.
I.e., each table contains the following data; the first three columns are the same in each table, and the remaining columns contain different data per table.
USN DATASET INTERNAL_ID <more stuff>
20 BEN 67 ...
20 APP 68 ...
30 BEN 70 ...
30 BEN 75 ...
50 CRM 80 ...
70 CRM 85 ...
The server is SQL 2008 R2 with 4 x 2.3GHz cores, 32GB memory, which is sitting
idle and should be more than adequate.
The INSERT INTO query itself takes approximately 3 seconds.
What can I do to either find out why the code is so slow, or to speed it up? Is there a maximum number of joins that I should use in a single query?
CREATE TABLE #output (
USN INT,
DATASET VARCHAR(150),
INTERNAL_ID INT,
MASTER_DATA INT,
EX1_DATA INT,
EX2_DATA INT,
EX3_DATA INT,
-- More columns
)
The full output table consists of 247 columns, with 71 integers, 11 floats, 44 datetimes and 121 varchars with a total size of 16,996 characters!!! I would expect each varchar to have around 20-30 characters.
CREATE TABLE #master (
USN INT,
DATASET VARCHAR(150),
INTERNAL_ID INT,
MASTER_DATA INT,
-- More columns
)
CREATE TABLE #ex1 (
USN INT,
DATASET VARCHAR(150),
INTERNAL_ID INT,
EX1_DATA INT,
-- More columns
)
CREATE TABLE #ex2 (
USN INT,
DATASET VARCHAR(150),
INTERNAL_ID INT,
EX2_DATA INT,
-- More columns
)
-- Repeat for ex3 .. ex20
Most of the ex tables are 10-11 columns with a couple in the 20-30 column range.
-- Insert data into master, ex1..ex20
INSERT INTO #output(USN, DATASET, INTERNAL_ID, MASTER_DATA, EX1_DATA, EX2_DATA, ...)
SELECT #master.USN, #master.DATASET, #master.INTERNAL_ID, #master.MASTER_DATA, #ex1.EX1_DATA, #ex2.EX2_DATA, ...
FROM
#master
LEFT JOIN #ex1 ON #master.USN = #ex1.USN AND
#master.DATASET = #ex1.DATASET AND
#master.INTERNAL_ID = #ex1.INTERNAL_ID
LEFT JOIN #ex2 ON #master.USN = #ex2.USN AND
#master.DATASET = #ex2.DATASET AND
#master.INTERNAL_ID = #ex2.INTERNAL_ID
-- continue until we hit #ex20

I would add an index to each of the temporary tables, matching the data (unique).
I would start with an index on the two INT columns only, and if that is not enough I would add the DATASET column to the index.
Also, the order in which you JOIN tables sometimes makes (or made, in previous versions of MS SQL) a huge difference, so start the JOINs from the smallest table (if possible).
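For example, a unique index covering the shared key trio could be created on each temporary table after loading it (a minimal sketch; the index names and the clustered choice are assumptions, not the original poster's code):
CREATE UNIQUE CLUSTERED INDEX IX_master_key ON #master (USN, INTERNAL_ID, DATASET);
CREATE UNIQUE CLUSTERED INDEX IX_ex1_key ON #ex1 (USN, INTERNAL_ID, DATASET);
CREATE UNIQUE CLUSTERED INDEX IX_ex2_key ON #ex2 (USN, INTERNAL_ID, DATASET);
-- Repeat for #ex3 .. #ex20
Comparing the actual execution plan of the INSERT before and after should show whether the joins pick up the new indexes.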

If there is more than one row with a given USN, DATASET, INTERNAL_ID in each of these tables, the size of the resulting table will grow exponentially with the number of joins. If this is the case, consider reworking your statement, or replacing it with a number of simpler ones.
Consider adding an index on the join column with the highest cardinality in each of the #ex1..#ex20 tables (or even a composite index on two columns, or the entire trio).
And, of course, if there are constraints on the resulting temporary table, you need an index for each such constraint as well.

Related

How to shard from existing data in a table in Postgresql

I have a large table inter, which contains 50 billion rows. Each row consists of two columns; both are effectively foreign keys referencing IDs of the other two tables (just the relation; foreign key constraints were not set in the database).
My table structure is like:
create table test_1(
id integer primary key,
content varchar(300),
content_len integer
);
create index test_1_id_len on test_1(id, content_len);
--this has 1.5 billion rows.
-- example row1: 1, 'alskfnla', 8
-- example row2: 1, 'asdgaagder', 10
-- example row3: 1, 'dsafnlakdsvn', 12
create table test_2(
id integer primary key,
split_str char(3)
);
--this has 60,000 rows.
-- example row1: 1, 'abc'
-- example row2: 2, 'abb'
create table inter(
id_1 integer, -- id of test_1
id_2 integer -- id of test_2
);
create index test_index_1 on inter(id_1);
create index test_index_2 on inter(id_2);
create index test_index_1_2 on inter(id_1, id_2);
--this has 50 billion rows.
-- example row1: 1, 2
-- example row2: 1, 3
-- example row3: 1, 4
Further, I need to do some queries like
select *
from inter
inner join test_1 on(test_1.id = inter.id_1)
where id_2 in (1,2,3,4,5,67,8,9,10)
and test_1.content_len = 30
order by id_2;
The reason I want to shard the table is that I could not create the indexes on the two columns (the transaction did not finish after one week, and it used up all the virtual memory).
So I am considering sharding the table by one of the columns. This column has around 60,000 values, from 1 to 60,000, and I would like to shard the table into 60,000 subtables. I did some searching, but most of the articles do it with a trigger, which cannot be applied in my case since the data are already in the table. Does anyone know how to do that? Thanks a lot!
ENV: redhat, RAM 180GB, postgresql 11.0
You don't want to shard the table, but partition it.
60000 partitions is too many. Use list partitioning to split the table in something like at most 600 partitions. Make sure to upgrade to PostgreSQL v12 so that you can benefit from the latest performance improvements.
The hard part will be moving the data without excessive downtime. Perhaps you can use triggers to capture changes while you INSERT INTO ... SELECT and catch up later.
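A minimal sketch of the partitioned layout (the table name, partition names, and boundaries are assumptions; the answer above suggests list partitioning, but contiguous ranges of id_2 are shown here simply because they need fewer clauses to cover 60,000 values with about 600 partitions):
CREATE TABLE inter_part (
    id_1 integer,
    id_2 integer
) PARTITION BY RANGE (id_2);

-- ~600 partitions of 100 id_2 values each; in practice generate these with a script
CREATE TABLE inter_part_p001 PARTITION OF inter_part FOR VALUES FROM (1) TO (101);
CREATE TABLE inter_part_p002 PARTITION OF inter_part FOR VALUES FROM (101) TO (201);
-- ...

-- Bulk-copy the existing rows (expect a long run time for 50 billion rows)
INSERT INTO inter_part SELECT id_1, id_2 FROM inter;

-- An index on the partitioned table cascades to every partition in v11+
CREATE INDEX ON inter_part (id_2, id_1);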

How to Duplicate a Small Table To All Amps?

I have a Small Table in a Teradata Database that consists of 30 rows and 9 columns.
How do I duplicate the Small Table across all amps?
Note: this is the opposite of what one usually wants to do with a large table, which is to distribute the rows evenly.
You cannot "duplicate" the same table content across all AMPs. What you can do is store all rows of the table on a single AMP by distributing the rows unevenly. So, if I understand the request, you want all rows from your small table to be stored on one AMP only.
If so, you can create a column that has the same value for all rows (if you don't already have one). You can make it an INTEGER column in order to use less space. Then make that column the primary index of the table, and turn your actual key columns into secondary indexes.
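A minimal sketch of such a table (table, column, and index names are placeholders, not from the original post):
CREATE TABLE small_tbl_one_amp (
    amp_key INTEGER DEFAULT 1 NOT NULL,  -- same value in every row, so every row hashes to the same AMP
    id INTEGER NOT NULL,
    col_1 VARCHAR(50)
    -- remaining columns ...
) PRIMARY INDEX (amp_key);

-- keep the real key usable as a secondary index
CREATE UNIQUE INDEX (id) ON small_tbl_one_amp;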
You can check how the rows are distributed across the AMPs with the queries below.
SELECT
TABLENAME
,VPROC AS NUM_AMP
,CAST(SUM(CURRENTPERM)/(1024*1024*1024) AS DECIMAL(18,5)) AS USEDSPACE_IN_GB
FROM DBC.TABLESIZEV
WHERE UPPER(DATABASENAME) = UPPER('databasename') AND UPPER(TABLENAME) = UPPER('tablename')
GROUP BY 1, 2
ORDER BY 1;
or
SELECT
HASHAMP(HASHBUCKET(HASHROW(primary_index_columns))) AS "AMP"
,COUNT(*) AS CNT
FROM databasename.tablename
GROUP BY 1
ORDER BY 2 DESC;

How do I create a wide table in SQL server 2016?

With this code (*), the creation of a wide table in SQL Server keeps giving me this error:
Msg 1702, Level 16, State 1, Line 11
CREATE TABLE failed because column '2010/12/01' in table 'PriceToBookFinalI' exceeds the maximum of 1024 columns.
USE [Style]
GO
CREATE TABLE [dbo].[PriceToBookFinalI]
(DocID int PRIMARY KEY,
[2006/12/29][Money],
[2007/01/01][Money],
...
SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS);
GO
(2614 columns)
Looking for a good hint !
Here is the background set of data I want to import to my wide table
The solution for this is to normalize your design. Even if you could fit it into the 1,024-column limit, your design is not a good idea. For example, what if you wanted to know the average amount a DocID changed per month? That would be a nightmare to write in this model.
Try this instead.
CREATE TABLE dbo.PriceToBookFinalI (
DocID INT PRIMARY KEY,
SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
CREATE TABLE dbo.PriceToBookFinalMoney (
DocID INT,
DocDate DATE,
DocAmount MONEY,
CONSTRAINT PK_PriceToBookFinalMoney
PRIMARY KEY CLUSTERED
(
DocID,
DocDate
)
);
You can easily join the table with the SpecialPurposeColumns to the table with the dates and amounts for each DocID. You can still pivot the dates, if desired, into the format you provided above. Having the date as a value in a column gives you much more flexibility in how you use the data, better performance, and naturally handles more dates.
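If the original one-column-per-date shape is still needed for reporting, a PIVOT over the normalized table can produce it. A minimal sketch, abbreviated to two dates (the CONVERT style is an assumption chosen to match the original column names):
SELECT DocID, [2006/12/29], [2007/01/01]
FROM (
    SELECT DocID,
           CONVERT(varchar(10), DocDate, 111) AS DocRef,  -- yyyy/mm/dd, matching the column names above
           DocAmount
    FROM dbo.PriceToBookFinalMoney
) AS src
PIVOT (
    SUM(DocAmount) FOR DocRef IN ([2006/12/29], [2007/01/01])
) AS p;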
Normalise it, and produce the columns as part of your query instead:
Create table Price (DocID INT primary key,
DocRef Varchar(30), -- the values from your [DATES] column
DocDate DATE,
DocValue MONEY);
Create your table with three columns: ID, Date, Amount. Each ID will have multiple rows in the table (for each date there's an amount value for).
There is a column count limitation in SQL Server:
https://msdn.microsoft.com/en-us/library/ms143432.aspx
Columns per nonwide table 1,024
Columns per wide table 30,000
You can use a "wide table", which uses sparse columns and a column set: https://msdn.microsoft.com/en-us/library/cc280521.aspx
BUT the table will still have a limit of 8,060 bytes per row, so most of your columns should have no data.
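If you go the wide-table route anyway, the data columns must be declared SPARSE (and therefore nullable) for the 30,000-column limit to apply instead of 1,024. A minimal sketch, abbreviated to two date columns:
CREATE TABLE dbo.PriceToBookFinalI (
    DocID int PRIMARY KEY,
    [2006/12/29] money SPARSE NULL,
    [2007/01/01] money SPARSE NULL,
    -- ... remaining date columns, all declared SPARSE ...
    SpecialPurposeColumns xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);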
So the problem is in your design. It looks like months should be rows, not columns, or maybe some other table structure would be better; that cannot be guessed without seeing how the data is used in the application.

Fetching data from multiple columns of same type in SQL Server.

I have a SQL Server table with 3 similar columns: Telephone 1, Telephone 2, and Telephone 3. User will provide a telephone number, and SQL should fetch the data in the least possible time in an optimum way from one of the 3 columns. Telephone number can exist in any of the 3 columns.
I'm thinking of one or two options:
Creating a 4th column combining the 3 telephone columns. And, doing a search on the concatenated value.
May be a child table with only the 3 telephone columns with a CLUSTERED index.
Is there a better way? (I'm sure there's one.)
I know we can do a hash of the 3 columns and do a faster search, but I don't know much about hashing. Has anyone worked on a similar situation?
Well, you can do a search by doing:
where @USERNUMBER in (telephone1, telephone2, telephone3)
However, databases in general find it difficult to optimize such queries.
The right solution is to normalize the data. That is, create a new table, maybe called something like PersonTelephones that would have, among other columns, a PersonId, and a TelephoneNumber. Then, you are not limited to just one number.
This table can be indexed on the telephone number to optimize searches on that column.
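A minimal sketch of that normalized table (names and sizes are assumptions, not from the original answer):
CREATE TABLE dbo.PersonTelephones (
    PersonId        INT         NOT NULL,
    TelephoneNumber VARCHAR(16) NOT NULL,
    CONSTRAINT PK_PersonTelephones PRIMARY KEY (PersonId, TelephoneNumber)
);

-- supports fast lookups by number
CREATE INDEX IX_PersonTelephones_Number
    ON dbo.PersonTelephones (TelephoneNumber);

-- lookup by phone number
SELECT PersonId
FROM dbo.PersonTelephones
WHERE TelephoneNumber = @PhoneNumber;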
I totally agree with other answer(s) that involve normalizing the data. That is probably the best solution.
However, if you are stuck with the existing data model, you could try a stored proc like the one below.
I have assumed that you are looking for an exact match.
CREATE PROC FindPersons
    @PhoneNumber VARCHAR(16)
AS
BEGIN
    --Create a temp table here with a column that matches the PK
    --of your main table (the one with the 3 phone number columns).
    --I'll assume that a phone number search can return multiple rows.
    CREATE TABLE #Persons (
        PersonId INT NOT NULL
    )

    --Just in case the temp table gets populated with a lot of records.
    CREATE INDEX IDX_Persons_Id ON #Persons(PersonId)

    INSERT INTO #Persons
    SELECT pt.PersonId
    FROM PersonsTable pt
    WHERE pt.Telephone1 = @PhoneNumber

    --If the above statement inserts zero rows,
    --try again on the 2nd phone column.
    --Depending on your business needs, you may
    --want to run it regardless.
    IF @@ROWCOUNT = 0
    BEGIN
        INSERT INTO #Persons
        SELECT pt.PersonId
        FROM PersonsTable pt
        WHERE pt.Telephone2 = @PhoneNumber

        --If the above statement inserts zero rows,
        --try again on the 3rd phone column.
        --Depending on your business needs, you may
        --want to run it regardless.
        IF @@ROWCOUNT = 0
        BEGIN
            INSERT INTO #Persons
            SELECT pt.PersonId
            FROM PersonsTable pt
            WHERE pt.Telephone3 = @PhoneNumber
        END
    END

    --Select data from the main table.
    SELECT pt.*
    FROM PersonsTable pt
    --PK column from main table is indexed. The join should perform efficiently.
    JOIN #Persons p
        ON p.PersonId = pt.PersonId
END

Copy data between tables in different databases without PKs (like synchronizing)

I have a table (A) in a database that doesn't have a PK; it has about 300k records.
I have a subset copy (B) of that table in another database; it has only 50k records and contains a backup for a given time range (July data).
I want to copy the missing records from table B into table A without duplicating existing records, of course. (I can create a database link to make things easier.)
What strategy can I follow to successfully insert into A the missing rows from B?
These are the table columns:
IDLETIME NUMBER
ACTIVITY NUMBER
ROLE NUMBER
DURATION NUMBER
FINISHDATE DATE
USERID NUMBER
.. 40 extra varchar columns here ...
My biggest concern is the lack of a PK. Can I create something like a hash or a PK using all the columns?
What could be a possible way to proceed in this case?
I'm using Oracle 9i for table A and Oracle XE (10g) for B.
The approximate number of elements to copy is 20,000
Thanks in advance.
If the data volumes are small enough, I'd go with the following
CREATE DATABASE LINK A CONNECT TO ... IDENTIFIED BY ... USING ....;
INSERT INTO COPY
SELECT * FROM table@A
MINUS
SELECT * FROM COPY;
You say there are about 20,000 to copy, but not how many in the entire dataset.
The other option is to delete the current contents of the copy and insert the entire contents of the original table.
If the full datasets are large, you could go with a hash, but I suspect that it would still try to drag the entire dataset across the DB link to apply the hash in the local database.
As long as no duplicate rows should exist in the table, you could apply a unique or primary key across all columns. If the overhead of such a key/index would be too much to maintain, you could also query the database from your application to see whether a row already exists, and only perform the insert if it is absent.
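A minimal sketch of that idea (the table, constraint, and database link names are placeholders, and only a few of the columns are listed; note that a unique key over dozens of wide VARCHAR columns may exceed Oracle's index key size limit):
ALTER TABLE a
    ADD CONSTRAINT a_all_cols_uk UNIQUE (idletime, activity, role, duration,
                                         finishdate, userid /* , remaining columns */);

-- insert only the rows from the backup that are not already present
INSERT INTO a (idletime, activity, role, duration, finishdate, userid /* , ... */)
SELECT src.idletime, src.activity, src.role, src.duration, src.finishdate, src.userid /* , ... */
FROM   b@b_link src
WHERE  NOT EXISTS (
         SELECT 1
         FROM   a
         WHERE  a.idletime   = src.idletime
         AND    a.activity   = src.activity
         AND    a.role       = src.role
         AND    a.duration   = src.duration
         AND    a.finishdate = src.finishdate
         AND    a.userid     = src.userid
         /* and the remaining columns */
       );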