What is the best way to query data from multiple tables and databases? - sql

I have 5 databases which represent different regions of the country. In each database, there are a few hundred tables, each with 10,000-2,000,000 transaction records. Each table is a representation of a customer in the respective region. Each of these tables has the same schema.
I want to query all tables as if they were one table. The only way I can think of doing it is creating a view that unions all tables, and then just running my queries against that. However, the customer tables will change all the time (as we gain and lose customers), so I'd have to change the query for my view to include new tables (or remove ones that are no longer used).
Is there a better way?
EDIT
In response to the comments (I also posted this as a response to an answer):
In most cases, I won't be removing any tables; they will remain for historic purposes. As I posted in a comment to one response, the idea was to reduce the time it takes a smaller customer (one with only 10,000 records) to query their own history. There are about 1000 customers with an average of 1,000,000 rows (and growing) apiece. If I were to add all records to one table, I'd have nearly a billion records in that table. I also thought I was planning for the future, in that when we get, say, 5000 customers, we don't have one giant table holding all transaction records (this may be an error in my thinking). So then, is it better not to divide the records as I have done? Should I mash it all into one table? Will indexing on customer IDs prevent delays in querying data for smaller customers?

I think your design may be broken. Why not use one single table with a region and a customer column?
If I were you, I would consider refactoring to one single table, and if necessary (for reverse compatibility for example), I would use views to provide the same info as in the previous tables.
Edit to answer OP comments to this post :
One table with 10,000,000,000 rows in it will do just fine, provided you use proper indexing. Database servers are built to cope with this kind of volume.
Performance is definitely not a valid reason to split one such table into thousands of smaller ones!
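For illustration, here is a minimal sketch of what that single-table layout and a backwards-compatibility view could look like. The table, column, region, and customer names are invented for the example, not taken from the question:

    -- One consolidated table instead of one table per customer:
    CREATE TABLE transactions (
        transaction_id BIGINT        NOT NULL PRIMARY KEY,
        region         VARCHAR(50)   NOT NULL,
        customer_id    INT           NOT NULL,
        amount         DECIMAL(18,2) NULL,
        created_at     DATETIME      NOT NULL
    );

    -- Index so a small customer's queries don't pay for everyone else's rows:
    CREATE INDEX ix_transactions_customer ON transactions (customer_id, created_at);

    -- Optional compatibility view mimicking one of the legacy per-customer tables:
    CREATE VIEW customer_1234_transactions AS
        SELECT transaction_id, amount, created_at
        FROM transactions
        WHERE region = 'Northeast' AND customer_id = 1234;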

The architecture of this system smells like it needs a vastly different approach if there are a few hundred tables and each has the same schema.
Why are you adding or removing tables at all? This should not be happening under any normal circumstances.

Agree with Brann,
That's an insane DB schema design. Why didn't you go with (or is it an option to change to) a single normalised structure with columns to filter by region and by whatever condition separates each table within a region database?
With that structure you're stuck with some horribly large (~500 tables) unioned view that you would have to regenerate dynamically as regularly as new tables appear in the system.

2 solutions:
1. Write a stored procedure that builds the view for you by enumerating all table names in the 5 databases and building the view with UNION ALL, as you would do it by hand (a rough sketch follows below).
2. Create a new database with one table and, each night for example, import all the records from all the tables into it.
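For what it's worth, a rough sketch of option 1, assuming SQL Server 2016+ (for CREATE OR ALTER) and invented database and view names; in practice you would also filter sys.tables down to just the customer tables:

    CREATE PROCEDURE dbo.RebuildAllTransactionsView
    AS
    BEGIN
        DECLARE @sql NVARCHAR(MAX) = N'';

        -- Collect every customer table from each region database:
        SELECT @sql = @sql + N' UNION ALL SELECT * FROM '
                           + QUOTENAME(t.db_name) + N'.dbo.' + QUOTENAME(t.table_name)
        FROM (
            SELECT 'RegionDb1' AS db_name, name AS table_name FROM RegionDb1.sys.tables
            UNION ALL
            SELECT 'RegionDb2', name FROM RegionDb2.sys.tables
            -- ...repeat for the remaining region databases
        ) AS t;

        -- Strip the leading ' UNION ALL ' and (re)create the view:
        SET @sql = N'CREATE OR ALTER VIEW dbo.AllTransactions AS '
                 + STUFF(@sql, 1, 11, N'');
        EXEC sp_executesql @sql;
    END;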

Sounds like you're stuck somewhere between a multi-tenant and a single-tenant database schema. Specifically, you're storing it as "light" multi-tenant (separate tables vs separate databases) but querying it as single-tenant: one query to rule them all.
In the short term, have your data access layer dynamically pick the table to query rather than unioning everything together for one uber query.
In the long term, pick one approach and stick to it: one database and one table, or many databases.
Here are some posts on the subject.
What are the advantages of using a single database for EACH client?
http://msdn.microsoft.com/en-us/library/aa479086.aspx

Related

SQL - multiple tables vs one big table

I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
Think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resources, go with one table. As other users have appropriately pointed out, databases can handle millions of rows with no problem. Well, it depends on the data that is in them: the row size can make a big difference, for example if you store VARCHAR(MAX) or VARBINARY(MAX) columns, and several of them per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier on a single table, and maintenance is easier too from an archival perspective.
But if you never access the data and you need the performance in the primary table, some sort of archive might make sense.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer and will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b), but...
"...with new column describing the time period (e.g. 04.2016, 05.2016, ...)"
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put an index on the column and you can, probably, execute fast queries on it.
"My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated."
Complicated for you to write or for the database to execute? An example would help us get a picture of your actual requirements.
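As a sketch of what option b) with a real date column might look like in PostgreSQL (table and column names here are made up):

    -- Hypothetical consolidated table:
    CREATE TABLE measurements (
        id          bigserial PRIMARY KEY,
        observed_at date      NOT NULL,   -- an actual date instead of a '04.2016' label
        value       numeric
    );

    CREATE INDEX idx_measurements_observed_at ON measurements (observed_at);

    -- Pulling one month back out stays simple and can use the index:
    SELECT *
    FROM measurements
    WHERE observed_at >= DATE '2016-04-01'
      AND observed_at <  DATE '2016-05-01';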

Conditionally linking Postgres rows to data in various other tables

I have a product table that is updated using CSV feeds from various suppliers. Each feed has its own table, however products can appear multiple times in the same supplier table, and in multiple supplier tables. Each product can only occur once in our main table though. I don't anticipate ever using more than about ten different supplier tables. Tables are updated at least daily, and at most every 6-8 hours, and read speeds are a much higher priority than write speeds. There are usually about 500,000 enabled products at any given time.
My first plan was to store the table name and primary key ID in that table for each product, then recalculate it during each update, but according to the responses here, having to do that is an indication that the database isn't designed correctly.
Using a view to combine these tables into a single virtual table seems like it'd help a lot with the organization. That way, I can just create a rule to make one column an SQL query, then index that column to increase search/read speed. The rules that determine where to pull supplier information from are somewhat involved, and need to take country and price into account, as well as perhaps a few other things.
So I guess the question here is, is there a correct way of doing this? Or is it going to be messy no matter how I do it? Also, am I on the right track?
Using a view unifying all your feed tables might well simplify the form of your queries, but you cannot index a view. (Well, in Oracle I think you can index a MATERIALIZED view, but that's a special case).
Structurally, I find it a bit suspect that you split your supplier feeds into separate tables. Doing so may simplify and speed up updates from the supplier feeds, and it is certainly the fastest alternative for queries against specific, individual feeds, but it's ugly for updating (recomputing?) the main table, and it doesn't soundly support relating rows of the main table back to the particular supplier feed from which they were drawn.
If you need fast queries against the supplier feeds, independent of the main table, and you also need the main table to be related to a detail table containing supplier-specific information, then perhaps your best bet would be to maintain a physical auxiliary table as the UNION ALL of all the per-supplier tables (this requires those tables to have the same structure), each with a distinct supplier ID. In Oracle, you can automate that as a MATERIALIZED VIEW, but with most DBMSs you would need to maintain that table manually.
The auxiliary table can be indexed, can be joined to the main table as needed in queries, and can be queried fairly efficiently. If appropriate, it can be used to update the main table.
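A rough sketch of that auxiliary UNION ALL table in PostgreSQL; the supplier names and columns below are assumptions, not from the question:

    -- Build the combined table; the per-feed tables are assumed to share a structure:
    CREATE TABLE supplier_products_all AS
        SELECT 'acme'::text AS supplier, product_code, price, country FROM feed_acme
        UNION ALL
        SELECT 'globex', product_code, price, country FROM feed_globex;
        -- ...one branch per supplier feed

    CREATE INDEX idx_supplier_products_code ON supplier_products_all (product_code);

    -- Refresh it after each feed import, e.g. by truncating and re-running the inserts.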
Hmm, why not just create one product table that contains data from all suppliers? Have a field in that table that identifies which supplier. When you get your input feeds, update this one table rather than having a separate table for each supplier. If you're using COPY to import a CSV file into a db table, fine, but then the imported table is just a temporary work table. Promptly copy the data from there into the "real", unified table. Then the import table can be dropped or truncated, or more likely you keep it around for troubleshooting. But you don't use it within the program.
You should be able to copy from the import table to the unified table with a single insert statement. Even if the tables are large, I'd expect that to be fast. It would almost surely be faster overall to do one mass insert for each import than to have a view that does a union on 10 tables and try to work with that. If the unified table has all the data from all suppliers plus a supplier field, then I don't see why you would ever need to query the raw import tables, except for troubleshooting problems with the import, but fine, you keep them around for that.
Unless you're constrained on disk space so that keeping what amounts to duplicates of every record is a problem, I'd think this would be the easy solution. If disk space is an issue, then drop the import table immediately after copying the data to the unified table, and keep the original raw import on backup media somewhere.
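For example, a sketch of that staging-then-insert flow in PostgreSQL; the file path, table names, and columns are invented:

    -- Load the raw CSV into a throwaway staging table
    -- (or use \copy from psql if the file lives on the client):
    COPY import_acme (product_code, price, country)
    FROM '/data/feeds/acme.csv' WITH (FORMAT csv, HEADER true);

    -- Push it into the unified table in one statement, tagging the supplier:
    INSERT INTO products (supplier, product_code, price, country)
    SELECT 'acme', product_code, price, country
    FROM import_acme;
    -- If products must stay unique per product_code, a dedup step or an
    -- ON CONFLICT clause against a unique constraint would be needed here.

    -- Keep import_acme around for troubleshooting, or clear it if space matters:
    TRUNCATE import_acme;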

Which database structure is efficient? One table with 10,000 records or 1000 tables with 10 records?

We at college are making an application to generate PDF documents from Excel sheet records using Java SE. I have thought about two approaches to design the database. In one approach, there will be one table that will contain a lot of records (50K every year). In the other approach, a lot of tables will be created at runtime (1000 every year) and each table will contain at most 50 records.
Which approach is more efficient in terms of overall time performance?
Multiple tables of identical structure almost never makes sense.
Databases are designed to have many records in few tables.
50K records is not "a lot" of records. You don't specify what database you will be using, but most commercial-grade databases can handle many, many millions of records in a table.
This is assuming you have proper indexes, etc. If you have to keep creating tables for your application, then there is something wrong with your design, and you need to re-think it.
When building a relational database the basic rule would be to avoid redundancy.
Look over your data and try to separate things that tend to repeat. If you notice a column or a group of columns that repeat across multiple entries, create a new table for them. This way you will achieve the best performance when querying.
Otherwise, if the values are unique across the entries just keep the minimum number of tables.
You should just look for some design rules for relational databases. You will find some examples as well.
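As a tiny illustration of pulling repeated details out into their own table (the table and column names here are invented for the example):

    -- Details that repeat on many rows go into their own table...
    CREATE TABLE students (
        student_id INT          PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        department VARCHAR(100)
    );

    -- ...and the record rows just reference them:
    CREATE TABLE exam_records (
        record_id  INT PRIMARY KEY,
        student_id INT NOT NULL REFERENCES students (student_id),
        subject    VARCHAR(100),
        marks      INT
    );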
50k records is not much for a database. If it's all the same type of data (same structure), it belongs in the same table. Only if size and speed become an issue should you consider splitting up the data over multiple tables (or more likely: different servers).

One large table or split into two smaller tables?

Is there any performance benefit to splitting a large table with roughly 100 columns into 2 separate tables? This would be in terms of inserting, deleting and selecting tasks? I'm using SQL Server 2008.
If one of the fields is a CLOB or BLOB and you anticipate it holding a huge amount of data, you won't need that field very often, and the result set will be transmitted over a long pipe (like server to a web-based client), then I think putting that field in a separate table would be appropriate.
But just returning 100 regular fields probably won't tax your system so much as to justify a separate table and a join.
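A sketch of that kind of split, with the large column moved to a 1-to-1 side table (invented names, SQL Server syntax):

    -- Frequently used columns stay in the main table:
    CREATE TABLE documents (
        document_id INT           NOT NULL PRIMARY KEY,
        title       NVARCHAR(200) NOT NULL,
        created_at  DATETIME      NOT NULL
    );

    -- The rarely needed large payload lives in a 1-to-1 side table:
    CREATE TABLE document_contents (
        document_id INT NOT NULL PRIMARY KEY
            REFERENCES documents (document_id),
        content     VARBINARY(MAX) NULL
    );

    -- Most queries never touch the blob; join only when it is actually needed:
    SELECT d.title, c.content
    FROM documents AS d
    JOIN document_contents AS c ON c.document_id = d.document_id
    WHERE d.document_id = 42;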
The only benefit you might see is if a number of columns are only occasionally populated. In which case putting those into their own table and only adding a row when there is data might make sense in terms of overall row overhead and, depending on the number of rows, overall page count for the table(s). That said, this is one of the reasons they introduced sparse columns in SQL Server 2008.
Given the maintenance and other overhead of managing two tables instead of one (especially since people can act on individual tables if they choose), it's unlikely to be worth it.
Can you describe what type of entity needs to have over 100 columns? Perhaps the data model is just wrong in the first place.
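For reference, declaring the sparse columns mentioned above is just a column option (SQL Server 2008+; the table and columns below are made up):

    CREATE TABLE orders (
        order_id      INT NOT NULL PRIMARY KEY,
        customer_id   INT NOT NULL,
        -- rarely populated columns cost almost nothing when NULL:
        gift_message  NVARCHAR(500) SPARSE NULL,
        discount_code VARCHAR(50)   SPARSE NULL
    );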
I would say no, as it would take more execution time to join the 2 tables whenever you wanted to do something.
It depends whether you use these fields at the same time in your application.
This kind of performance improvement is really bad: it makes your source code impossible to understand. If you have performance trouble with this table, add something (like a table containing the 15 fields you'll use in a given request, updated via trigger); don't modify your clean solution.
If you don't have a performance problem, don't do anything; you'll see later!

Table with a lot of columns

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables: one with small rows containing only the core info that can be read quickly, and an extra table containing all the info you rarely use, which you can look up when needed.
Taking another tack, from a maintenance & testing point of view, if, as you say, you have 3 distinct groups of data in the one table, albeit all with the same unique id (e.g. member_id), it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of the members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well, to ensure no knock-on impacts?
Also consider audit trail purposes, if you want to track the last user ID/timestamp to change a member's data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/performance answer, but maybe something to look at from a DB & app design point of view.
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
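For example, a hedged sketch of replacing Colour1/Colour2/Colour3 with a child table; the items table and column names are assumed:

    -- One row per colour instead of a fixed number of colour columns:
    CREATE TABLE item_colours (
        item_id  INT          NOT NULL REFERENCES items (item_id),
        position INT          NOT NULL,   -- 1, 2, 3, ... N
        colour   NVARCHAR(50) NOT NULL,
        PRIMARY KEY (item_id, position)
    );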
1-1 may be easier if you have, say, Member_Info; Member_Pref; Member_Profile. Having too many columns can cause problems if you want lots of varchar(255) columns, as you may go over the row-size limit, and it just makes it too confusing.
Just make sure you have the correct foreign key constraints and such, so there's always 1 row in each table with the same member_id.
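A minimal sketch of that 1-to-1 layout with the foreign keys in place (the non-key columns are invented):

    CREATE TABLE Member_Info (
        member_id INT           NOT NULL PRIMARY KEY,
        user_name NVARCHAR(100) NOT NULL
    );

    CREATE TABLE Member_Pref (
        member_id  INT NOT NULL PRIMARY KEY
            REFERENCES Member_Info (member_id),
        newsletter BIT NOT NULL DEFAULT 0
    );

    CREATE TABLE Member_Profile (
        member_id INT NOT NULL PRIMARY KEY
            REFERENCES Member_Info (member_id),
        bio       NVARCHAR(MAX) NULL
    );

    -- Sharing member_id as both primary key and foreign key keeps the tables 1-to-1.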