I have a large table, inter, which contains 50 billion rows. Each row consists of two columns, both of which are logically foreign keys referencing the IDs of two other tables (only the relation exists; no foreign key constraints were created in the database).
My table structure is like:
create table test_1 (
    id integer primary key,
    content varchar(300),
    content_len integer
);
create index test_1_id_len on test_1 (id, content_len);
--this has 1.5 billion rows.
-- example row1: 1, 'alskfnla', 8
-- example row2: 2, 'asdgaagder', 10
-- example row3: 3, 'dsafnlakdsvn', 12
create table test_2 (
    id integer primary key,
    split_str char(3)
);
--this has 60,000 rows.
-- example row1: 1, 'abc'
-- example row2: 2, 'abb'
create table inter (
    id_1 integer, -- id of test_1
    id_2 integer  -- id of test_2
);
create index test_index_1 on inter (id_1);
create index test_index_2 on inter (id_2);
create index test_index_1_2 on inter (id_1, id_2);
--this has 50 billion rows.
-- example row1: 1, 2
-- example row2: 1, 3
-- example row3: 1, 4
Further, I need to do some queries like
select *
from inter
    inner join test_1 on (test_1.id = inter.id_1)
where id_2 in (1, 2, 3, 4, 5, 67, 8, 9, 10)
    and test_1.content_len = 30
order by id_2;
The reason I want to shard the table is that I could not create the indexes on the two columns (the index-building transactions had not finished after a week and exhausted the virtual memory).
So I am considering sharding the table by one of the columns. This column has around 60,000 distinct values, from 1 to 60,000, and I would like to split the table into 60,000 subtables. I did some searching, but most articles do this with a trigger, which does not apply in my case since the data are already in the table. Does anyone know how to do that? Thanks a lot!
ENV: Red Hat, 180 GB RAM, PostgreSQL 11.0
You don't want to shard the table, but partition it.
60,000 partitions are too many. Use list partitioning to split the table into something like at most 600 partitions. Make sure to upgrade to PostgreSQL v12 so that you can benefit from the latest partitioning performance improvements.
The hard part will be moving the data without excessive downtime. Perhaps you can use triggers to capture changes while you INSERT INTO ... SELECT, and catch up later.
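As a rough sketch only (the partitioned table and partition names below are made up; in practice you would generate the ~600 partition definitions with a script, each covering a block of id_2 values):

-- Hypothetical list-partitioned copy of "inter"; each partition covers ~100 id_2 values.
create table inter_part (
    id_1 integer,
    id_2 integer
) partition by list (id_2);

create table inter_part_0001 partition of inter_part
    for values in (1, 2, 3 /* ... through 100 */);
create table inter_part_0002 partition of inter_part
    for values in (101, 102, 103 /* ... through 200 */);
-- ... remaining partitions generated by script ...

-- Bulk copy, then build the indexes once the data is in place:
insert into inter_part select id_1, id_2 from inter;
create index on inter_part (id_1, id_2);
create index on inter_part (id_2);

Since the id_2 values are contiguous integers, range partitioning (FOR VALUES FROM ... TO ...) would express the same grouping with far less typing, if you prefer it.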
I have a Small Table in a Teradata Database that consists of 30 rows and 9 columns.
How do I duplicate the Small Table across all AMPs?
Note: this is the opposite of what one usually wants to do with a Large Table, which is to distribute the rows evenly.
You cannot "duplicate" the same table content across all AMPs. What you can do is store all rows of the table on a single AMP by distributing the rows unevenly. So, if I understand the request, you want all rows from your small table to be stored on one AMP only.
If so, you can add a column that has the same value for all rows (if you don't already have one). You can make it an INTEGER column to use less space. Then make this column the primary index of the table, and turn your actual keys into secondary indexes, as in the sketch below.
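A rough sketch of that idea (all names here are invented and the column list is abbreviated):

-- Every row gets the same value in dummy_pi, so every row hashes to the same AMP.
CREATE TABLE small_table_one_amp (
    dummy_pi INTEGER NOT NULL,   -- constant value, e.g. 1, for all rows
    key_col  INTEGER NOT NULL,   -- your real key
    col_2    VARCHAR(50)
    -- ... remaining columns ...
)
PRIMARY INDEX (dummy_pi);

-- Keep the real key usable via a secondary index.
CREATE UNIQUE INDEX (key_col) ON small_table_one_amp;

INSERT INTO small_table_one_amp
SELECT 1, key_col, col_2 /* , ... */ FROM small_table;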
You can check how the rows are distributed across the AMPs with the code below.
SELECT
    TABLENAME
    ,VPROC AS NUM_AMP
    ,CAST(SUM(CURRENTPERM)/(1024*1024*1024) AS DECIMAL(18,5)) AS USEDSPACE_IN_GB
FROM DBC.TABLESIZEV
WHERE UPPER(DATABASENAME) = UPPER('databasename')
  AND UPPER(TABLENAME) = UPPER('tablename')
GROUP BY 1, 2
ORDER BY 1;
or
SELECT
    HASHAMP(HASHBUCKET(HASHROW(primary_index_columns))) AS "AMP"
    ,COUNT(*) AS CNT
FROM databasename.tablename
GROUP BY 1
ORDER BY 2 DESC;
I have a table of millions of rows that is constantly changing (new rows are inserted, existing rows are updated, and some are deleted). I'd like to query 100 new rows every minute, but these rows can't be ones I've queried before. The table has about two dozen columns and a primary key.
Happy to answer any questions or provide clarification.
A simple solution is to have a separate table with just one row to store the last ID you fetched.
Let's say that's your "table of millions of rows":
-- That's your table with millions of rows
CREATE TABLE test_table (
    id serial unique,
    col1 text,
    col2 timestamp
);
-- Data sample
INSERT INTO test_table (col1, col2)
SELECT 'test', generate_series
FROM generate_series(now() - interval '1 year', now(), '1 day');
You can create the following table to store an ID:
-- Table to keep the last fetched id
CREATE TABLE last_query (
    last_query_id int references test_table (id)
);
-- Initial row
INSERT INTO last_query (last_query_id) VALUES (1);
Then with the following query, you will always fetch 100 rows never fetched from the original table and maintain a pointer in last_query:
WITH last_id AS (
    SELECT last_query_id FROM last_query
), new_rows AS (
    SELECT *
    FROM test_table
    WHERE id > (SELECT last_query_id FROM last_id)
    ORDER BY id
    LIMIT 100
), update_last_id AS (
    UPDATE last_query SET last_query_id = (SELECT MAX(id) FROM new_rows)
)
SELECT * FROM new_rows;
Rows will be fetched in order of id (oldest rows first).
You basically need a unique, sequential value that is assigned to each record in this table. That allows you to search for the next X records where the value of this field is greater than the last one you got from the previous page.
The easiest way would be to have an identity column as your PK: simply start from the beginning and include a "where id > #last_id" filter on your query. This is a fairly straightforward way to page through data, regardless of underlying updates. However, if you already have millions of rows and are constantly creating and updating, an ordinary integer identity is eventually going to run out of numbers (a bigint identity column is unlikely to run out of numbers in your great-grandchildren's lifetimes, but not all DBs support anything other than a 32-bit identity).
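A minimal sketch of that approach in PostgreSQL (the table and column names are assumed), using a bigint identity so the key space will not run out:

CREATE TABLE events (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data text
);

-- Fetch the next 100 rows after the last id you processed
-- (":last_id" is a placeholder for the value stored from the previous batch).
SELECT *
FROM events
WHERE id > :last_id
ORDER BY id
LIMIT 100;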
You can do the same thing with a "CreatedDate" datetime column, but since these dates aren't 100% guaranteed to be unique, depending on how the date is set you might have more than one row with the same creation timestamp, and if those records cross a "page boundary" you'll miss any that fall beyond the end of your current page.
Some SQL systems' GUID generators are guaranteed to be not only unique but sequential. You'll have to look into whether PostgreSQL's GUIDs work this way; if they're true V4 GUIDs, they'll be totally random except for the version identifier and you're out of luck. If you do have access to sequential GUIDs, you can filter just as with an integer identity column, only with many more possible key values.
With this code (*), the creation of a wide table in SQL Server keeps giving me this error:
Msg 1702, Level 16, State 1, Line 11
CREATE TABLE failed because column '2010/12/01' in table 'PriceToBookFinalI' exceeds the maximum of 1024 columns.
USE [Style]
GO
CREATE TABLE [dbo].[PriceToBookFinalI] (
    DocID int PRIMARY KEY,
    [2006/12/29] [Money],
    [2007/01/01] [Money],
    ...
    SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
GO
(2614 columns)
Looking for a good hint!
Here is the background set of data I want to import into my wide table.
The solution for this is to normalize your design. Even if you could fit it into the 1,024-column limit, this design is not a good idea. For example, what if you wanted to know the average amount a DocID changed per month? That would be a nightmare to write in this model.
Try this instead.
CREATE TABLE dbo.PriceToBookFinalI (
    DocID INT PRIMARY KEY,
    SpecialPurposeColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);

CREATE TABLE dbo.PriceToBookFinalMoney (
    DocID INT,
    DocDate DATE,
    DocAmount MONEY,
    CONSTRAINT PK_PriceToBookFinalMoney
        PRIMARY KEY CLUSTERED (DocID, DocDate)
);
You can easily join the table with the SpecialPurposeColumns to the table with the dates and amounts for each DocID. You can still pivot the dates, if desired, into the format you provided above. Having the date as a value in a column gives you much more flexibility in how you use the data, gives better performance, and naturally handles more dates.
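For instance, a hedged sketch of such a pivot (only two of the dates are shown, and the yyyy/mm/dd conversion via style 111 is an assumption about the desired column headings):

SELECT DocID, [2006/12/29], [2007/01/01]
FROM (
    SELECT DocID,
           CONVERT(varchar(10), DocDate, 111) AS DocDate,  -- yyyy/mm/dd
           DocAmount
    FROM dbo.PriceToBookFinalMoney
) AS src
PIVOT (
    SUM(DocAmount) FOR DocDate IN ([2006/12/29], [2007/01/01])
) AS pvt;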
Normalise it, and do the pivoting into columns as part of your query:
CREATE TABLE Price (
    DocID INT PRIMARY KEY,
    DocRef VARCHAR(30), -- the values from your [DATES] column
    DocDate DATE,
    DocValue MONEY
);
Create your table with columns for ID, Date, and Amount. Each ID will have multiple rows in the table (one for each date there is an amount value for).
There is a column count limitation in SQL Server:
https://msdn.microsoft.com/en-us/library/ms143432.aspx
Columns per nonwide table 1,024
Columns per wide table 30,000
You can use a "wide table", which relies on sparse columns and column sets: https://msdn.microsoft.com/en-us/library/cc280521.aspx
BUT the table will still have a limitation of 8,060 bytes per row, so most of your columns should have no data.
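If you do go that route, a minimal sketch might look like this (the table name is made up and only a few of the 2,614 date columns are shown; the rest would all need to be SPARSE as well):

CREATE TABLE dbo.PriceToBookWide (
    DocID int PRIMARY KEY,
    [2006/12/29] money SPARSE NULL,
    [2007/01/01] money SPARSE NULL,
    -- ... remaining date columns, all declared SPARSE ...
    SpecialPurposeColumns xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);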
So the problem is in your design. It looks like the months should be rows, not columns, or maybe some other table structure would be better; it cannot be guessed without seeing how the data is used in the application.
I have a table with 5 billion rows (table1) and another table with 3 billion rows (table2). The two tables are related, and table1 is a child of table2. I have to delete 3 billion rows from table1 and their related rows from table2. I tried using the FORALL method from PL/SQL, but it didn't help much. Then I thought of using an Oracle partitioning strategy. Since I am not a DBA, I would like to know whether partitioning an existing table on the primary key column, for a selected set of IDs, is possible. My primary key is a 64-bit auto-generated number.
It is hard to partition the objects online (though it can be done using dbms_redefinition), and it is not necessary given the details you gave.
The best idea would be to recreate the objects without the undesired rows.
For example, some simple code would look like this:
create table undesired_data as
    (select * from table1 where /* condition identifying the undesired rows */);

create table table1_new as
    (select * from table1 where key not in (select key from undesired_data));
create table table2_new as
    (select * from table2 where key not in (select key from undesired_data));

rename table1 to table1_old;
rename table2 to table2_old;
rename table1_new to table1;
rename table2_new to table2;

-- recreate the constraints and indexes;
-- check that everything is ok;
-- drop table1_old and table2_old;
This requires taking consumers offline, but the downtime for them will be very small if the scripts are right (you should test them in a test environment first).
Sounds very dubious.
If it is a real use case, then you don't delete: you create another table, well defined (including partitioning), and you fill it using insert /*+ append */ into MyNewTable select ....
The most common practice is to define partitions on dates (record creation date, event date, etc.).
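As a rough sketch only (the column names, partition key, and "keep" condition below are all assumptions):

-- A well-defined, date-partitioned replacement table.
CREATE TABLE MyNewTable (
    id         NUMBER(19)    NOT NULL,
    created_dt DATE          NOT NULL,
    payload    VARCHAR2(100)
)
PARTITION BY RANGE (created_dt)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_initial VALUES LESS THAN (DATE '2020-01-01'));

-- Direct-path load of only the rows you want to keep.
INSERT /*+ APPEND */ INTO MyNewTable
SELECT id, created_dt, payload
FROM   table1
WHERE  created_dt >= DATE '2020-01-01';   -- placeholder for the real "keep" condition
COMMIT;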
Again, if this is a real use case, I strongly recommend that you get real, hands-on help rather than seeking advice on the internet and doing it yourself.
We have a very slow running piece of SQL, and I was wondering if anyone has any
advice on speeding it up.
We are collecting the data from a large number of tables (21) into a single table
for later processing. The tables are temporary tables, and exist only for the query.
All the tables share three columns (USN, DATASET, and INTERNAL_ID), and
the combination of the three is unique in each table, but the same values exist
in all the tables. It it possible that INTERNAL_ID is also unique, but I am not sure.
Each table contains six rows of data, and the output table also contains six rows.
I.e., each table has the following data; the first three columns are the same in each table, and the remaining columns contain different data for each table.
USN DATASET INTERNAL_ID <more stuff>
20 BEN 67 ...
20 APP 68 ...
30 BEN 70 ...
30 BEN 75 ...
50 CRM 80 ...
70 CRM 85 ...
The server is SQL 2008 R2 with 4 x 2.3GHz cores, 32GB memory, which is sitting
idle and should be more than adequate.
The INSERT INTO query itself takes approximately 3 seconds.
What can I do either to find out the reason the code is so slow or to speed it up? Is there a maximum number of joins that I should do in a single query?
CREATE TABLE #output (
    USN INT,
    DATASET VARCHAR(150),
    INTERNAL_ID INT,
    MASTER_DATA INT,
    EX1_DATA INT,
    EX2_DATA INT,
    EX3_DATA INT,
    -- More columns
)
The full output table consists of 247 columns: 71 integers, 11 floats, 44 datetimes, and 121 varchars with a total declared size of 16,996 characters! In practice I would expect each varchar to hold around 20-30 characters.
CREATE TABLE #master (
    USN INT,
    DATASET VARCHAR(150),
    INTERNAL_ID INT,
    MASTER_DATA INT,
    -- More columns
)
CREATE TABLE #ex1 (
    USN INT,
    DATASET VARCHAR(150),
    INTERNAL_ID INT,
    EX1_DATA INT,
    -- More columns
)
CREATE TABLE #ex2 (
    USN INT,
    DATASET VARCHAR(150),
    INTERNAL_ID INT,
    EX2_DATA INT,
    -- More columns
)
-- Repeat for ex3 .. ex20
Most of the ex tables are 10-11 columns with a couple in the 20-30 column range.
-- Insert data into master, ex1..ex20
INSERT INTO #output (USN, DATASET, INTERNAL_ID, MASTER_DATA, EX1_DATA, EX2_DATA, ...)
SELECT #master.USN, #master.DATASET, #master.INTERNAL_ID, #master.MASTER_DATA,
       #ex1.EX1_DATA, #ex2.EX2_DATA, ...
FROM #master
    LEFT JOIN #ex1 ON #master.USN = #ex1.USN AND
                      #master.DATASET = #ex1.DATASET AND
                      #master.INTERNAL_ID = #ex1.INTERNAL_ID
    LEFT JOIN #ex2 ON #master.USN = #ex2.USN AND
                      #master.DATASET = #ex2.DATASET AND
                      #master.INTERNAL_ID = #ex2.INTERNAL_ID
    -- continue until we hit ex20
I would add an index to each of the temporary tables, matching the data (unique where possible).
I would start with an index on just the two int columns, and if that is not enough I would add the DATASET column to the index, as in the sketch below.
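A sketch of that suggestion (the index names here are made up):

-- Start with the two integer columns on each temp table...
CREATE INDEX IX_ex1_usn_internal ON #ex1 (USN, INTERNAL_ID);
CREATE INDEX IX_ex2_usn_internal ON #ex2 (USN, INTERNAL_ID);
-- ... repeat for #ex3 through #ex20 and for #master ...

-- ... and if that is not selective enough, widen the key to the full trio:
-- CREATE UNIQUE INDEX IX_ex1_key ON #ex1 (USN, INTERNAL_ID, DATASET);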
And sometimes the order in which you JOIN tables makes (or made, in previous versions of MS SQL) a huge difference, so start the JOINs from the smallest table if possible.
If there is more than one row with a given USN, DATASET, INTERNAL_ID combination in each of these tables, the size of the resulting table will grow rapidly with each additional join. If this is the case, consider reworking your statement or replacing it with a number of simpler ones.
Consider an index on the join column with the highest cardinality in each of the #ex1-#ex20 tables (or even a composite index on two of the columns, or the entire trio).
And, of course, if there are constraints on the resulting temporary table, you need an index for each such constraint as well.