Conditionally linking Postgres rows to data in various other tables - sql

I have a product table that is updated using CSV feeds from various suppliers. Each feed has its own table, however products can appear multiple times in the same supplier table, and in multiple supplier tables. Each product can only occur once in our main table though. I don't anticipate ever using more than about ten different supplier tables. Tables are updated at least daily, and at most every 6-8 hours, and read speeds are a much higher priority than write speeds. There are usually about 500,000 enabled products at any given time.
My first plan was to store the table name and primary key ID in that table for each product, then recalculate it during each update, but according to the responses here, having to do that is an indication that the database isn't designed correctly.
Using a view to combine these tables into a single virtual table seems like it'd help a lot with the organization. That way, I can just create a rule to make one column an SQL query, then index that column to increase search/read speed. The rules that determine where to pull supplier information from are somewhat involved, and need to take country and price into account, as well as perhaps a few other things.
So I guess the question here is, is there a correct way of doing this? Or is it going to be messy no matter how I do it? Also, am I on the right track?

Using a view unifying all your feed tables might well simplify the form of your queries, but you cannot index a view. (Well, in Oracle I think you can index a MATERIALIZED view, but that's a special case).
Structurally, I find it a bit suspect that you split your supplier feeds into separate tables; doing so may simplify and speed updates from the supplier feeds, and it is certainly the fastest alternative for queries against specific, individual feeds, but it's ugly for updating (recomputing?) the main table, and it is poorly suited to relating rows of the main table back to the particular supplier feed from which they were drawn.
If you need fast queries against the supplier feeds, independent of the main table, and you also need the main table to be related to a detail table containing supplier-specific information, then perhaps your best bet would be to maintain a physical auxiliary table as the UNION ALL of all the per-supplier tables (this requires those tables to have the same structure), each with a distinct supplier ID. In Oracle, you can automate that as a MATERIALIZED VIEW, but with most DBMSs you would need to maintain that table manually.
The auxiliary table can be indexed, can be joined to the main table as needed in queries, and can be queried fairly efficiently. If appropriate, it can be used to update the main table.
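A minimal sketch of what such an auxiliary table could look like in PostgreSQL, assuming the per-supplier tables share the same columns (all names here are hypothetical):

-- Hypothetical unified feed table; supplier_id identifies the source feed.
CREATE TABLE supplier_feed_all (
    supplier_id   integer NOT NULL,
    supplier_sku  text    NOT NULL,
    product_name  text,
    price         numeric(10,2),
    country       text
);

-- Rebuild it after each feed import (simple, if not the most incremental, approach).
TRUNCATE supplier_feed_all;
INSERT INTO supplier_feed_all (supplier_id, supplier_sku, product_name, price, country)
SELECT 1, sku, name, price, country FROM supplier_feed_acme
UNION ALL
SELECT 2, sku, name, price, country FROM supplier_feed_globex;

-- Index it for the read-heavy workload.
CREATE INDEX supplier_feed_all_sku_idx ON supplier_feed_all (supplier_sku);

From there, the main product table can join to (or be refreshed from) this one table instead of ten.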

Hmm, why not just create one product table that contains data from all suppliers? Have a field in that table that identifies which supplier. When you get your input feeds, update this one table rather than having a separate table for each supplier. If you're using COPY to import a CSV file into a db table, fine, but then the imported table is just a temporary work table. Promptly copy the data from there into the "real", unified table. Then the import table can be dropped or truncated, or more likely you keep it around for troubleshooting. But you don't use it within the program.
You should be able to copy from the import table to the unified table with a single insert statement. Even if the tables are large I'd expect that to be fast. It would almost surely be faster overall to do one mass insert for each import than to have a view that does a union on 10 tables and try to work with that. If the unified table has all the data from all suppliers plus a supplier field, then I don't see why you would ever need to query the raw import tables. Except, that is, for troubleshooting problems with the import, but fine, so you keep them around for that. Unless you're constrained on disk space so that keeping what amounts to duplicates of every record is a problem, I'd think this would be the easy solution. If disk space is an issue, then drop the import table immediately after copying the data to the unified table, and keep the original raw import on backup media somewhere.
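For what it's worth, a rough PostgreSQL sketch of that import step (table, column and path names are assumptions, and the ON CONFLICT clause presumes a unique key on supplier + SKU in the unified table):

-- Staging table mirroring the CSV layout (kept around for troubleshooting if desired).
CREATE TABLE IF NOT EXISTS feed_import (
    supplier_id integer,
    sku         text,
    name        text,
    price       numeric(10,2),
    country     text
);
TRUNCATE feed_import;
COPY feed_import FROM '/path/to/feed.csv' WITH (FORMAT csv, HEADER true);  -- or \copy from the client

-- Promote the rows into the unified products table in a single statement.
INSERT INTO products (supplier_id, sku, name, price, country)
SELECT supplier_id, sku, name, price, country
FROM feed_import
ON CONFLICT (supplier_id, sku) DO UPDATE
    SET name = EXCLUDED.name, price = EXCLUDED.price, country = EXCLUDED.country;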


SQL - multiple tables vs one big table

I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for a new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
Think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resources, go with one table. As other users have appropriately pointed out, databases can handle millions of rows no problem... well, it depends on the data that is in them. The row size can make a big difference, for example when storing VARCHAR(MAX) or VARBINARY(MAX) columns, and several of them per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier against a single table! And maintenance is easier too, from an archival perspective.
But if you never access the old data and you need the performance in the primary table, some sort of archive might make sense.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer and will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b), but...
with a new column describing the time period (e.g. 04.2016, 05.2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put an index on the column, and you can, probably, execute fast queries on it.
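For instance, something along these lines (table and column names are just illustrative):

CREATE TABLE measurements (
    period_date date NOT NULL,  -- a real date instead of a '04.2016'-style text label
    value       numeric
);
CREATE INDEX measurements_period_date_idx ON measurements (period_date);

-- Querying one month, or any range, stays simple and can use the index:
SELECT *
FROM measurements
WHERE period_date >= DATE '2016-04-01'
  AND period_date <  DATE '2016-05-01';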
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write or for the database to execute? An example would be nice for us to get a picture of your actual requirements.

Indexing items marked as deleted

Due to client requirements I need to implement the following scenario:
Whenever a user wants to delete a record representing a document, that particular record needs to be marked as deleted using a simple BOOLEAN is_deleted flag.
Document is a general name for one of the tables that store invoices, orders or offers.
Everything is pretty dead simple, but I wonder if there is a way to index records to perform quick searching and somehow skip/omit deleted items (or whether there is no need to worry about performance at all and a simple WHERE is_deleted = false clause is enough).
Other solutions/advice would be appreciated as well.
PostgreSQL supports partial indexes. You can do something like:
create index document_id_not_deleted_idx ON document(id) where not is_deleted;
You can even create unique indexes if you need unique subsets over portions of your data.
Of course getting the right columns in your index is an exercise, but it is quite manageable.
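For example, if documents also carried a number that must be unique only among live documents, a unique partial index could express that (document_number is a hypothetical column):

-- Only undeleted documents must have a unique document_number.
CREATE UNIQUE INDEX document_number_live_uq
    ON document (document_number)
    WHERE NOT is_deleted;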
Another option you might like to explore is to move those records to another table, or to use partitioning to separate the deleted and undeleted rows (which would amount to broadly the same thing).
That would let you keep all the records of interest in a smaller table that can be indexed differently from that of the deleted records.
If you went down the partitioning route you'd have a DOCUMENTS master table with DOCUMENTS_DELETED and DOCUMENTS_LIVE tables inheriting from it.
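A rough sketch of that inheritance-based layout in PostgreSQL (column list trimmed for brevity; in newer PostgreSQL versions, declarative partitioning with PARTITION BY LIST (is_deleted) would be the more modern equivalent):

CREATE TABLE documents (
    id         bigint PRIMARY KEY,
    is_deleted boolean NOT NULL DEFAULT false,
    payload    text
);

CREATE TABLE documents_live    (CHECK (NOT is_deleted)) INHERITS (documents);
CREATE TABLE documents_deleted (CHECK (is_deleted))     INHERITS (documents);

-- Queries against the parent see both children; queries against documents_live
-- touch only the smaller table, which can be indexed on its own.
CREATE INDEX documents_live_id_idx ON documents_live (id);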

Database design: one bigger table vs split tables with the same columns

I have a database program for a store. As you know, there are two types of invoice in it: one for the things I bought and the other for when I sold them. The two tables are almost identical, like:
invoice table
Id
customerName
date
invoiceType
and an invoiceDetails table, which has
id
invoiceId
item
price
amount
My question is simple: is it better to keep the design like that, or to split each table into two separate tables?
A couple of my friends suggested splitting the tables into one for saleInvoice and one for buyInvoice to speed up querying.
So what are the pros and cons of each approach? I feel that if I split them I'm not following the DRY rule.
I am using NHibernate BTW, so it's kind of weird to have two identical classes with different names.
Both approaches would work. If you use the single table approach, then the invoiceType column would be your discriminator field. In your NHibernate mapping, this discriminator field would be used by NHibernate to decide which type (i.e. a purchase or a sale) to instantiate for a given row in the table (see section 5.1.6 of the NHibernate mapping guide). For ad hoc SQL queries or reporting queries, you could create two views, one to return only rows with invoiceType = 'purchase' and one to return only rows with invoiceType = 'sale'.
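The two views could be as simple as something like this (assuming 'purchase' and 'sale' are the discriminator values):

CREATE VIEW purchaseInvoice AS
    SELECT * FROM invoice WHERE invoiceType = 'purchase';

CREATE VIEW saleInvoice AS
    SELECT * FROM invoice WHERE invoiceType = 'sale';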
Alternatively, you could create two separate tables, one for purchase and one for sales. As you point out, these two tables would have nearly identical schemas and nhibernate mapping files.
If you are anticipating very high transaction volumes, you would want to put purchases and sales on two different physical disks. With two different tables, this can be accomplished by putting them into different filegroups. With a single table, you could still accomplish this by creating a SQL Server partitioned table. Before you go to this trouble, you might want to evaluate whether this really is necessary and whether disk access to the table is really going to be the performance bottleneck. You don't want to spend a lot of time doing premature optimization if it is not necessary.
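If you did go that route, a partitioned-table definition might look roughly like this (T-SQL sketch; the filegroup names are assumptions and would need to exist already, and the partitioning column has to be part of the primary key):

-- Two partitions: values sorting before 'sale' (i.e. 'purchase'), and 'sale' onwards.
CREATE PARTITION FUNCTION pf_invoice_type (varchar(20))
    AS RANGE RIGHT FOR VALUES ('sale');

CREATE PARTITION SCHEME ps_invoice_type
    AS PARTITION pf_invoice_type TO (fg_purchases, fg_sales);

CREATE TABLE invoice (
    id           int IDENTITY NOT NULL,
    customerName varchar(100),
    [date]       date,
    invoiceType  varchar(20) NOT NULL,
    CONSTRAINT pk_invoice PRIMARY KEY (id, invoiceType)  -- partition column must be in the key
) ON ps_invoice_type (invoiceType);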
My preference would be to have a single table with a discriminator column, to better follow DRY principles. Unless I had solid numbers indicating it was necessary, I would hold off implementing a partitioned table until if and when it became necessary.
I'd ask myself, how do I intend to use this information? Will I need sales and buy invoices in the same queries? Am I likely to need specialized information eventually (highly likely in my experience) for each type? And if I do, will I need child tables for only one type? How would that affect referential integrity? Would a change to one automatically mean I needed a change to the other? How large is the table likely to be (it would have to be in the multi-millions before I would consider that it might need to be split out due to size alone)? How likely is it that I would mix the information up by accident if both are in the same table and include both when I didn't want to? The answers would determine whether I needed to split it out. I would tend to see these as two separate functions, and it would take a lot to convince me to put them in one table.

When to Create, When to Modify a Table?

I wanted to know what I should consider while deciding whether to create a new table or modify an existing table in a SQL DB. I use both MySQL and SQLite.
-Edit- I always thought that if I can put a column into a table where it makes sense and can be used by every row, then I would always modify the table. However, at work, if it's for a different 'release' we put it in a different table.
You can modify existing tables, as long as:
1. you are keeping the database normalized
2. you are not breaking code that uses the table
You can create new tables even if 1. and 2. are true, for the following reasons:
Performance reasons
Clarity in your schema logic.
Not sure if I'm understanding your question correctly, but one thing I always try to consider is the impact on existing data.
Taking the case of an application which relies on a database...
When you update the application (including database schema updates), it is important to ensure that any existing, in-use databases will be either backwards compatible with the application, or there is way to migrate and update the existing database.
Generally if the data is in a one-to-one relationship with the existing data in the table and if the table row size is not too large already and if there aren't too many records in the table, then I usually alter the table to accept the new column.
However, suppose I want to add a column with a default value to a table where it doesn't exist. Adding it to the table with 50 million records might not be so speedy a process and it might lock up the table on production when we move the change up. In this case, putting it into a separate table and adding the records to it may work out better. In general, I wouldn't do this unless my testing has shown that adding and populating the column will take an unacceptably long time. I would prefer to keep the record together where possible.
Same thing with the overall record size. SQL Server has a limit on the number of bytes that can be stored in a record (8,060 bytes per row); it will allow you to create a structure that is potentially larger than that, but it will not allow you to put more than the limit into a specific record. Further, narrower tables tend to be faster to access due to how they are stored. Frequently, people will create a table in a one-to-one relationship (we call them extended tables in our structure) for additional columns that are not as frequently used. If the fields from both tables will be frequently used, often they still create two tables but have a view that will pick out all the columns needed.
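A sketch of that extended-table pattern and its companion view (T-SQL, all names hypothetical):

-- Core table holds the frequently used, narrow columns.
CREATE TABLE customer (
    customer_id int PRIMARY KEY,
    name        varchar(100) NOT NULL
);

-- One-to-one extension table for wider, rarely used columns.
CREATE TABLE customer_extended (
    customer_id int PRIMARY KEY REFERENCES customer (customer_id),
    notes       varchar(max),
    photo       varbinary(max)
);

-- A view that picks out all the columns when both sets are needed.
CREATE VIEW customer_full AS
SELECT c.customer_id, c.name, e.notes, e.photo
FROM customer AS c
LEFT JOIN customer_extended AS e ON e.customer_id = c.customer_id;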
And of course, if the data is in a one-to-many relationship, you need a related table, not just a new column.
Incidentally, you should always do an ALTER TABLE through a script rather than the SSMS GUI, as it is more efficient and easier to move to prod.

What is the best way to query data from multiple tables and databases?

I have 5 databases which represent different regions of the country. In each database, there are a few hundred tables, each with 10,000-2,000,000 transaction records. Each table is a representation of a customer in the respective region. Each of these tables has the same schema.
I want to query all tables as if they were one table. The only way I can think of doing it is creating a view that unions all tables, and then just running my queries against that. However, the customer tables will change all the time (as we gain and lose customers), so I'd have to change the query for my view to include new tables (or remove ones that are no longer used).
Is there a better way?
EDIT
In response to the comments, (I also posted this as a response to an answer):
In most cases, I won't be removing any tables; they will remain for historic purposes. As I posted in a comment to one response, the idea was to reduce the time it takes a smaller customer (one with only 10,000 records) to query their own history. There are about 1000 customers with an average of 1,000,000 rows (and growing) apiece. If I were to add all records to one table, I'd have nearly a billion records in that table. I also thought I was planning for the future, in that when we get, say, 5000 customers, we don't have one giant table holding all transaction records (this may be an error in my thinking). So then, is it better not to divide the records as I have done? Should I mash it all into one table? Will indexing on customer IDs prevent delays in querying data for smaller customers?
I think your design may be broken. Why not use one single table with a region and a customer column?
If I were you, I would consider refactoring to one single table, and if necessary (for reverse compatibility for example), I would use views to provide the same info as in the previous tables.
Edit to answer OP comments to this post:
One table with 10,000,000,000 rows in it will do just fine, provided you use proper indexing. Database servers are built to cope with this kind of volume.
Performance is definitely not a valid reason to split one such table into thousands of smaller ones!
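For example, a composite index keyed on the customer lets a small customer's history be read without touching anyone else's rows (table and column names are illustrative):

CREATE INDEX idx_transactions_customer_date
    ON transactions (customer_id, transaction_date);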
The architecture of this system smells like it needs a vastly different approach if there are a few hundred tables and each has the same schema.
Why are you adding or removing tables at all? This should not be happening under any normal circumstances.
Agree with Brann,
That's an insane DB schema design. Why didn't you go with (or is it an option to change to) a single normalised structure with columns to filter by region and by whatever condition separates each table within a region database?
With your current structure you're stuck with some horribly large (~500-table) unioned view that you would have to dynamically regenerate as often as new tables appear in the system.
Two solutions:
1. Write a stored procedure that builds the view for you by enumerating all the table names in the 5 databases and assembling the view with UNION, just as you would do it by hand (see the sketch below).
2. Create a new database with one table and, each night for example, import all the records of all the tables into it.
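A rough sketch of option 1 for a single region database (T-SQL; STRING_AGG needs SQL Server 2017+, all the names are assumptions, and you would repeat or loop this for each of the five databases):

DECLARE @sql nvarchar(max);

-- Glue one SELECT per customer table together with UNION ALL.
SELECT @sql = STRING_AGG(
        CAST(N'SELECT ' + QUOTENAME(name, '''') + N' AS customer_table, * FROM dbo.'
             + QUOTENAME(name) AS nvarchar(max)),
        N' UNION ALL ')
FROM sys.tables
WHERE schema_id = SCHEMA_ID(N'dbo');

SET @sql = N'CREATE OR ALTER VIEW dbo.all_transactions AS ' + @sql + N';';
EXEC sys.sp_executesql @sql;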
Sounds like you're stuck somewhere between a multi-tenant and a single-tenant database schema. Specifically, you're storing it as "light" multi-tenant (separate tables rather than separate databases) but querying it as single-tenant, one query to rule them all.
In the short term, have your data access layer dynamically pick the table to query rather than unioning everything together for one uber query.
In the long term, pick one approach and stick to it: one database and one table, or many databases.
Here are some posts on the subject.
What are the advantages of using a single database for EACH client?
http://msdn.microsoft.com/en-us/library/aa479086.aspx