SQL Database Design - Single-column table - SELECT efficiency?

I'm putting together a database which I want to be very efficient for SELECT queries as all the data in the database will be created once and multiple read-only queries run on that data.
I have multiple tables (~20) and each have a composite primary key which is made up of a combination of Time (int) and either ProductID (int) or ServiceID (int) depending on the table.
I understand to maximize read/SELECT efficiency I should generally de-normalize the data to prevent expensive table joins.
So considering that, if I want to optimize read performance, should I:
have 3 single-column tables containing all the possible Time, ProductID and ServiceID values, and have these as foreign keys in each of the tables, or
keep all 20 tables completely independent?

The fastest SELECT statement is an index SEEK from a single table.
If you only care about SELECT performance, and don't have to worry about writing new data to the tables, then design your tables around your expected queries, so that all the data you need for each query can be found in one table, and that table has an index on the expected search arguments.
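As a minimal sketch of that advice (the table, columns and values below are hypothetical, not taken from the question):

-- One read-optimized table per expected query, already denormalized,
-- with a composite primary key matching the expected search arguments.
CREATE TABLE ProductMetrics (
    Time      int NOT NULL,
    ProductID int NOT NULL,
    Price     decimal(18,4) NOT NULL,
    Quantity  int NOT NULL,
    CONSTRAINT PK_ProductMetrics PRIMARY KEY (Time, ProductID)
);

-- This query is answered with a single index seek on the primary key,
-- with no joins involved.
SELECT Price, Quantity
FROM ProductMetrics
WHERE Time = 20240101 AND ProductID = 42;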


Inheriting from one base table? Good idea?

I would like to create a base table as follows:
CREATE TABLE IF NOT EXISTS basetable
(
id BIGSERIAL NOT NULL PRIMARY KEY,
createdon TIMESTAMP NOT NULL DEFAULT(NOW()),
updatedon TIMESTAMP NULL
);
Then all other tables will inherit from this table, so it will contain the ids of all records. Will there be performance problems with more than 20 billion records (distributed across the ~10 tables)?
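For reference, a minimal sketch of that plan in PostgreSQL (the child table name and column are hypothetical):

CREATE TABLE IF NOT EXISTS orders
(
    amount NUMERIC NOT NULL
) INHERITS (basetable);

-- orders now has id, createdon and updatedon from basetable in addition to
-- its own columns, and its rows are visible when querying basetable.
-- Note that the PRIMARY KEY constraint itself is not inherited, so uniqueness
-- of id across all child tables is not enforced automatically.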
Having one table from which "all other tables will inherit" sounds like a strange idea but you might have a use case that is unclear to us.
Regarding your question specifically, having 20B rows is going to work, but as Gordon mentioned, you will have performance challenges. Querying a row by ID will be perfectly fine, but searching rows by timestamp ranges, even with indexes, will be slower (how much slower depends on how fast your server is).
For large tables, a good solution is to use table partitioning (see https://wiki.postgresql.org/wiki/Table_partitioning). Based on what you query the most in your WHERE clause (id, createdon or updatedon) you can create partitions for that column and PostgreSQL will be able to read only the partition it needs instead of the entire table.
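If you are on PostgreSQL 11 or later, a minimal sketch of declarative range partitioning on createdon could look like this (the yearly bounds are just an example; note that the partition key has to be part of the primary key):

CREATE TABLE basetable
(
    id        BIGSERIAL NOT NULL,
    createdon TIMESTAMP NOT NULL DEFAULT now(),
    updatedon TIMESTAMP NULL,
    PRIMARY KEY (id, createdon)
) PARTITION BY RANGE (createdon);

CREATE TABLE basetable_2023 PARTITION OF basetable
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE basetable_2024 PARTITION OF basetable
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- A query constrained on createdon only touches the matching partition:
SELECT * FROM basetable
WHERE createdon >= '2024-03-01' AND createdon < '2024-04-01';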

Partitioning table in postgres

I have a simple table structure in postgres which has a site and site_pages table which is a one to many relationship. The tables join on site.id to site_pages.site_id
These tables are still performing quickly but are growing fast, and I'm aware they might not perform well for much longer, so I just want to be prepared.
I had two ideas:
Partition on site.id and site_pages.site_id, grouping by 1M rows, but then some queries will select from multiple partitions.
Partition by active (True/False), but that will probably only be a short-term fix.
Is there a better approach I'm missing?
Table Structure
site ~ 7 million rows
id
url
active
site_pages ~ 60 million rows
id
site_id
page_url
active
I don't think that partitioning in the classical sense will help you there. If you end up having to select from all partitions, you won't end up faster.
If most of the queries access only active data and you want to optimize for that case, you could introduce an old_site and an old_site_pages table and move data there when it becomes inactive. Queries accessing all data will have to use a UNION of the current and the old data and might become slower, but queries accessing active data can become fast.
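A minimal sketch of that archive approach (PostgreSQL syntax; the view name and the site id are arbitrary examples):

-- Archive tables with the same structure and indexes as the live ones.
CREATE TABLE old_site (LIKE site INCLUDING ALL);
CREATE TABLE old_site_pages (LIKE site_pages INCLUDING ALL);

-- Move a site into the archive in a single statement.
WITH moved AS (
    DELETE FROM site WHERE id = 123 RETURNING *
)
INSERT INTO old_site SELECT * FROM moved;

-- Queries that need everything go through a UNION of current and old data.
CREATE VIEW all_sites AS
SELECT * FROM site
UNION ALL
SELECT * FROM old_site;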
Tables with just a few columns should perform acceptably up to some hundreds of millions of rows. Based on that, I think you can skip partitioning the site table for now.
As for site_pages, partitioning will help you if you use the partitioning criteria in your SELECTs. This means if you partition by site_id (grouped by some millions of rows) and have CHECK criteria set properly for each table (CHECK site_id >= 1000000 AND site_id < 2000000) then your SELECT ... WHERE site_id = 1536987 will not use UNION. It will only read partitions that match your criteria, thus going through only one table. You can see it from EXPLAIN.
And finally, you could move NOT active sites and site_pages into different tables - some archive.
P.S.: I assume you know how to set up partitioning on Postgres (subtables should INHERIT parent table, add check constraints, index each subtable, etc).
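A minimal sketch of that inheritance-style setup for site_pages (the range bounds are just an example):

-- Child table holding site_id values from 1,000,000 up to (not including) 2,000,000.
CREATE TABLE site_pages_1m
(
    CHECK (site_id >= 1000000 AND site_id < 2000000)
) INHERITS (site_pages);

CREATE INDEX ON site_pages_1m (site_id);

-- With constraint_exclusion enabled (the default setting 'partition' is enough),
-- this only reads the child tables whose CHECK constraint matches:
SELECT * FROM site_pages WHERE site_id = 1536987;

-- Inserts have to be routed to the right child table, either by a trigger
-- on site_pages or by inserting into the children directly.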

dictionary database, one table vs table for each char

I have a very simple database containing one table T:
wOrig nvarchar(50), not null
wTran nvarchar(50), not null
The table has over 50 million rows. I execute a simple query:
select wTran from T where wOrig = 'myword'
The query takes about 40 seconds to complete. I divided the table based on the first char of wOrig and the execution time became much smaller than before (roughly in line with each new table's size).
Am I missing something here? Shouldn't the database use a more efficient way to do the search, like a binary search?
My question: what changes to the database options, given this situation, could make the search more efficient while keeping all the data in one table?
You should be using an index. For your query, you want an index on T(wOrig). Your query will be much faster:
create index idx_T_wOrig on T(wOrig);
Depending on considerations such as space and insert/update characteristics, a clustered index on (wOrig) or (wOrig, wTran) might be the best solution.
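If this is SQL Server (the nvarchar columns suggest it), a covering nonclustered index is also worth testing; a minimal sketch, assuming the table is named T as in the question:

create index idx_T_wOrig_covering on T (wOrig) include (wTran);

-- select wTran from T where wOrig = 'myword' can then be answered with an
-- index seek on wOrig plus the included wTran column, without any lookup
-- back into the base table.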

Best practices for multiple table joins UNION vs JOIN

I'm using a query which brings ~74 fields from different database tables.
The query consists of 10 FULL JOINS and 15 LEFT JOINS.
Each join condition is based on different fields.
The query fetches data from a main table in which almost 90% of the columns are foreign keys.
I'm using those joins to resolve the foreign keys, but some of the data doesn't require all of those joins, because by its own logic that type of data doesn't use that information.
Let me give an example:
Each employee can have multiple Tasks. There are four types of tasks (1, 2, 3, 4).
Each TaskType has a different meaning. When running the query, I get data for all those task types and then apply some logic to show them separately.
My question is: is it better to use UNION ALL and split the 4 different cases into separate queries? That way I could use only the required joins for each case in each branch of the union.
Thanks,
I would think it depends strongly on the size (row count) of the main table and, for example, the task tables.
Say your main table has tens of millions of rows and the task tables are smaller: a UNION over all task types may scan the large main table once per branch, whereas a single join with the smaller task tables can do it with one scan of the main table.
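A minimal sketch of the UNION ALL variant (all table and column names here are hypothetical):

-- Each branch joins only the tables its task type actually needs.
SELECT e.employee_id, t.task_id, t.task_type, d1.detail
FROM employee e
JOIN task t        ON t.employee_id = e.employee_id
JOIN type1_data d1 ON d1.task_id = t.task_id
WHERE t.task_type = 1
UNION ALL
SELECT e.employee_id, t.task_id, t.task_type, d2.detail
FROM employee e
JOIN task t        ON t.employee_id = e.employee_id
JOIN type2_data d2 ON d2.task_id = t.task_id
WHERE t.task_type = 2;
-- Branches for task types 3 and 4 would follow the same pattern; whether this
-- beats one wide query with all the joins depends mainly on how many times
-- the large tables end up being scanned.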

Index for table in SQL Server 2012

I had a question on indexes. I have a table like this:
id BIGINT PRIMARY KEY NOT NULL,
cust_id VARCHAR(8) NOT NULL,
dt DATE NOT NULL,
sale_type VARCHAR(10) NOT NULL,
sale_type_sub VARCHAR(40),
amount DOUBLE PRECISION NOT NULL
The table has several million rows. Assuming that queries will often filter results by date ranges, sale types, amounts above and below certain values, and that joins will occur on cust_id... what do you all think is the ideal index structure?
I wasn't sure if a clustered index would be best, or individual indexes on each column? Both?
Any serious table in SQL Server should always have a well-chosen, good clustering key - it makes so many things faster and more efficient. From your table structure, I'd use the ID as the clustering key.
Next, you say joins occur on cust_id - so I would put an index on cust_id. This speeds up joins in general and is a generally accepted recommendation.
Next, it really depends on your queries. Are they all using the same columns in their WHERE clauses? Or do you get queries that use dt, and others that use sale_type separately?
The point is: the fewer indices the better - so if ever possible, I'd try to find one compound index that covers all your needs. But if you have an index on three columns (e.g. on (sale_type, dt, amount)), then that index can be used for queries
using all three columns in the WHERE clause
using sale_type and dt in the WHERE clause
using only sale_type in the WHERE clause
but it could NOT be used for queries that use dt or amount alone. A compound index always requires you to use the n left-most columns in the index definition - otherwise it cannot be used.
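A short sketch of that rule for the table above (the table name dbo.sales and the literal values are assumptions for the example):

create index IX_sales_type_dt_amount on dbo.sales (sale_type, dt, amount);

-- These predicates can use an index seek on it (left-most columns present):
--   WHERE sale_type = 'retail' AND dt >= '2020-01-01' AND amount > 100
--   WHERE sale_type = 'retail' AND dt >= '2020-01-01'
--   WHERE sale_type = 'retail'

-- These cannot seek on this index (left-most column sale_type is missing):
--   WHERE dt >= '2020-01-01'
--   WHERE amount > 100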
So my recommendation would be:
define the clustering key on ID
define a nonclustered index on cust_id for the JOINs
examine your system to see what other queries you have - what criteria are being used for selection, and how often do those queries execute? Don't over-optimize a query that's executed once a month - but do spend time on those that are executed dozens of times every hour.
Add one index at a time - let the system run for a bit - do you measure an improvement in query times? Does it feel faster? If so: leave that index. If not: drop it again. Iterate until you're happy with the overall system performance.
The best way to find indexes for your table is SQL Server Profiler.