Postgres partitioning suggestions for a large database

I am rebuilding a system and migrating all of the data from MySQL to Postgres (along with Redis as a cache layer).
The structure is as follows:
users table with under 5,000 rows
items table with a few billion rows
chars table with under a million rows
accounts table with under 200k rows
Each item row has a foreign key referencing chars(char_id).
Each char row has a foreign key referencing accounts(acc_id).
Each account row has a foreign key referencing users(user_id).
I have also added the users(user_id) foreign key to the items table, to avoid having to join items, chars, accounts, and users every time I simply want to look up items by user_id.
I want to know the most efficient way to access the items rows. Most likely that table will eventually be sharded to a remote server, once a single server can't handle the whole thing.
In some places I've read that partitioning is a good idea here; others say indexing/views are sufficient.
So my idea is to partition items by user_id, since queries will almost always be limited to that user's data. A typical query will SELECT ~100 or so items for a specific user that are on a char of the correct type, where that char is also on an account of the correct type (two joins).
My first question: is partitioning by user a good idea? If so, what is the ideal layout? If not, what are the alternatives - views?
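To make the idea concrete, here is a rough sketch of the layout I have in mind, using Postgres declarative hash partitioning (column names like char_type and acc_type are placeholders, and the partition count would obviously need tuning):

    -- items partitioned by hash of user_id; the partition key must be part of the PK
    CREATE TABLE items (
        item_id bigint  NOT NULL,
        char_id bigint  NOT NULL REFERENCES chars (char_id),
        user_id integer NOT NULL REFERENCES users (user_id),
        PRIMARY KEY (user_id, item_id)
    ) PARTITION BY HASH (user_id);

    -- e.g. 64 partitions; Postgres routes rows automatically
    CREATE TABLE items_p0 PARTITION OF items
        FOR VALUES WITH (MODULUS 64, REMAINDER 0);
    -- ...and so on for remainders 1 through 63

    -- the typical query (two joins); the user_id predicate lets the
    -- planner prune down to a single partition
    SELECT i.*
    FROM items i
    JOIN chars c    ON c.char_id = i.char_id
    JOIN accounts a ON a.acc_id  = c.acc_id
    WHERE i.user_id = $1
      AND c.char_type = $2
      AND a.acc_type  = $3
    LIMIT 100;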
A secondary question would be, should I put the char and account details on each item row? I don't mind wasting space if it means faster queries.
There's a lot to say and I hope everything makes sense. If it doesn't, please don't hesitate to ask for more details. Thanks in advance.


Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub-row" key to the dataset like this:
ROW_NUMBER() OVER (PARTITION BY "all data columns" ORDER BY SystemTime) AS data_row_nr
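Spelled out against a hypothetical staging table (the partition columns here are made up; in practice it would be every data column of the ledger), the suggestion looks roughly like this:

    SELECT
        s.*,
        ROW_NUMBER() OVER (
            PARTITION BY account_no, position_no, amount  -- "all data columns"
            ORDER BY SystemTime
        ) AS data_row_nr
    FROM staging.ledger AS s;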
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You would solve the technical problem, but I doubt there would be much business value in it.
First of all you should distinguish: the tables you receive with a primary key are dimensions, where you can recognise changes and build history.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not full-loaded but loaded based on some delta criterion.
In any case you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as a data warehouse keeps a longer history than the source system).
So my to-do list:
Check whether the duplicates are intended or illegal.
Try to find a delta criterion to load the fact tables (see the sketch after this list).
If everything fails, make the primary key out of all columns plus a single attribute numbering the duplicates, and build the history on that.
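As a minimal sketch of the delta idea, assuming the rows carry a usable SystemTime (as in the question) and a hypothetical dwh.ledger target:

    -- load only rows newer than what the DWH already holds
    -- (a first load must seed the table, since MAX() over an empty table is NULL)
    INSERT INTO dwh.ledger
    SELECT s.*
    FROM staging.ledger AS s
    WHERE s.SystemTime > (SELECT MAX(SystemTime) FROM dwh.ledger);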

Advantages and disadvantages of automatic database number generation for each row vs manual numbering for each row

Imagine two tables implemented as described below:
In the first table, the row numbers are created automatically by the database system.
In the second table, the row numbers are created manually by the programmer, in sequential order.
The main question is: what are the advantages and disadvantages of these two approaches?
One distinct advantage of having the database manage auto-numbering over creating the numbers manually is that the database implementation is thread-safe, while a manual implementation usually (in 99.9% of cases) is not (it's hard to do correctly).
On the other hand, the database implementation does not guarantee sequential numbering - there can be gaps in the numbers.
Given these two facts, an auto-increment column should be used only as a surrogate key, when the values of the column do not have any business meaning but are simply used as a row identifier.
Please note that when using a surrogate key, it's best to also enforce uniqueness of a natural key - otherwise you might get rows where all the data is duplicated except the surrogate key.
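A minimal sketch of that combination, with a made-up users table (standard-SQL identity syntax; the exact spelling varies by engine: SERIAL, AUTO_INCREMENT, IDENTITY):

    CREATE TABLE users (
        user_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- surrogate key
        email   varchar(255) NOT NULL UNIQUE,  -- natural key kept unique
        name    varchar(100) NOT NULL
    );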
When the database automatically creates the numbers, you have less work.
Think about a sign-up system where you have fields like name, email, password and so on:
1.) If the number is generated by the database, you can just insert the data into the table.
2.) If not, you have to fetch the last number first - so instead of a single INSERT you have a SELECT plus an INSERT, as sketched below.
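Roughly, with a made-up users table, the two variants look like this:

    -- 1.) database-generated number: a single statement, safe under concurrency
    INSERT INTO users (name, email, password) VALUES (?, ?, ?);

    -- 2.) manual numbering: a read followed by a write; two concurrent
    -- sign-ups can read the same MAX(user_id) and collide
    SELECT MAX(user_id) + 1 FROM users;
    INSERT INTO users (user_id, name, email, password) VALUES (?, ?, ?, ?);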
Another reason: what happens when you delete a row in your table?
Maybe in a forum you want to delete an account but not all of its posts. You can then work with a workaround: when a post's user_id no longer resolves to a user, you know it is/was a deleted or banned account. But if you give a new user the number of a deleted user, you will run into trouble.
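For illustration, with hypothetical posts and users tables, the workaround amounts to something like:

    -- posts whose user_id no longer resolves belong to a deleted/banned account
    SELECT p.post_id,
           COALESCE(u.name, '[deleted]') AS author
    FROM posts AS p
    LEFT JOIN users AS u ON u.user_id = p.user_id;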

Update and delete records in the fact table

I have a fact table with five dimension tables associated with it. Typically, the fact table contains the surrogate keys of each dimension and has no business/surrogate key of its own. I am trying to load the fact table with the data resulting from the staging fact table, i.e. insert new records. However, I notice the fact table can also handle other operations such as Update or Delete. A conditional split was used in the SSIS package for this purpose, to check whether all surrogate keys are 0 and then make the new insert. My question is: can I use the surrogate keys for Update or Delete operations?
I made an insert on the fact table just to give an idea of how the data will look.
The answer is yes, you can. BUT, will there be a situation where one employee sold the same product, from the same supplier, to the same customer, on the same day? Perhaps a different order on the same day? (this is based on the data you present in the question)
If all the surrogate keys together can uniquely identify a record, update fact records to your heart's content. But if that is not the case, you could end up updating records you do not intend to update.
I tend to include an order number in the fact tables I design to help avoid that situation, but you may not have one in your actual fact tables. Including the order number is a pattern referred to as a degenerate dimension. I have found it to be pretty handy.
Anyway, the answer is the same. You can update fact records based on surrogate keys, as long as all of them together can uniquely identify the row(s) you want to update.
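As a sketch, with a hypothetical fact_sales table whose five dimension keys mirror the setup in the question:

    -- safe only if the five keys together uniquely identify the row(s)
    UPDATE fact_sales
    SET quantity = ?, sales_amount = ?
    WHERE employee_key = ?
      AND product_key  = ?
      AND supplier_key = ?
      AND customer_key = ?
      AND date_key     = ?;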
Don't throw caution to the wind: be sure your data warehouse is designed so that you can do this if you need to. Being able to do in-place updates of facts can be nice, versus delete and replace, in that there can be fewer steps in the ETL process.

SQL table performance with foreign key

I have a website that needs to do a lot of active searching of users. I have a User table which contains links to all the full user details, but that is only really of interest when looking at your own account. When searching for other users, there is very limited information you need, so in order to make searches faster and more efficient, every time you update your user details, the code writes an entry to a separate table called UserLight - which only contains about 8 columns and is all pure data, i.e. no links to other child tables or collection objects, just string data for speed. Each user can only have one UserLight entry at a time, which is the summary representation of how their account appears to other users.
My question is about performance: does it matter that I am making the UserId a foreign key constraint with the User table? That way you cannot create a UserLight entry without the corresponding row in User, and when you delete the User row, it automatically cascades and deletes the UserLight entry. That is ideal and how I would like to have it, but I'm just wondering if having this FK constraint on the UserLight table in any way slows down read or write operations on this table. If it does, I am happy to drop the FK constraint, have a completely isolated table with no constraints or external references to other objects, and just manage housekeeping manually; but if the FK constraint doesn't affect performance at all, I would prefer to keep it.
It will not hamper your performance; on the contrary, it is preferred to have the data constrained, so as to avoid insert/delete/update anomalies.
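A sketch of the constraint as described in the question (the summary columns are placeholders):

    CREATE TABLE UserLight (
        UserId int NOT NULL PRIMARY KEY
            REFERENCES Users (UserId) ON DELETE CASCADE, -- one row per user, removed with the user
        Alias  varchar(100) NOT NULL
        -- ...plus the other plain string/data columns
    );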

How to properly index my database to increase query performance

I'm working on a simple login page using OpenID: if the user has just registered for an OpenID, I need to create a new entry in the database for them; otherwise I just display their alias with a greeting. Every time somebody gets authenticated with their OpenID, I must find their alias by looking up which user has the given OpenID, and it seems that this might be fairly slow if the primary key is the UserID (and there are millions of users).
I'm using SQL Server 2008 and I have two tables in my database (Users and OpenIDs): I plan to check whether the OpenID exists in the OpenIDs table, then use the corresponding UserID to get the rest of the user information from the Users table.
The Users table is indexed by UserID and has the following columns:
UserID (pk)
EMail
Alias
OpenID (fk)
The OpenIDs table is indexed by OpenID and has the following columns:
OpenID (pk)
UserID (fk)
Alternately, I can index the Users table by UserID and OpenID (i.e have 2 indexes) and completely drop the OpenIDs table.
What would be the recommended way to improve the query for a user with the matching OpenID in this case: index the Users table with two keys or use the OpenIDs table to find the matching UserID?
Maybe the answers to What are some best practices and “rules of thumb” for creating database indexes? can help you.
Without knowing what kind of queries you'll be running in detail, I would recommend indexing the two foreign key columns - Users.OpenID and OpenIDs.UserID.
Indexing the foreign keys is typically a good idea to help with JOIN conditions and other queries.
But quite honestly, if you use the OpenIDs table only to check the existence of an OpenID, you'd be much better off just indexing that column (possibly with a unique index?) in the Users table and being done with it. The OpenIDs table as you have it now serves no real purpose at all - it just takes up space for redundant information.
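For example (the index name is illustrative):

    -- enforce uniqueness and get the fast lookup in one go
    CREATE UNIQUE INDEX IX_Users_OpenID ON Users (OpenID);

    -- the login check then becomes a single indexed seek
    SELECT UserID, Alias
    FROM Users
    WHERE OpenID = @openId;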
Other than that: you need to observe how your application behaves, sample some usage data, and then see which kinds of queries run most often and take the longest, and start your performance tweaking there. Don't over-do the ahead-of-time performance optimizations - too many indices can be worse than having none at all!
Every time somebody gets authenticated with their Open ID, I must find their alias by looking up which user has the given OpenID and it seems that it might be fairly slow if the primary key is the UserID (and there are millions of users).
Actually, quite the contrary! If you have a value that's unique amongst millions of rows, finding that single value is quite quick, even with millions of users. A B-tree index lookup needs only a handful of page reads (the index is typically just 3-4 levels deep even over millions of rows), and bang! you have your one user out of a million. If you have an index on that OpenID column, the lookup should be pretty fast indeed. Such a highly selective index (one value picks out 1 in a million) works very efficiently.