I'm currently working with MS SQL 2005 and have a table with 17 columns, where the space the data in each row takes up is only a little less than the per-row limit in MS SQL 2005. It is certain that I cannot break this up into smaller tables, as the data stored in it is imported from Excel sheets whose contents I am not in control of.
Now the point is that, for almost everything on the website that uses this database, this main table provides the result sets, and those result sets are known in advance. So which of the two would be better:
a) I make use of the big table every time.
b) I create smaller tables, and depopulate/populate them as soon as data is edited in the big table.
For example: Excel sheets containing product details arrive (almost weekly) from various manufacturers, and they are stored in the PRODUCTS (big) table. Now there are queries like:
SELECT DISTINCT Brand_name, Model_name FROM PRODUCTS
and
SELECT DISTINCT Brand_name, Model_name FROM PRODUCTS WHERE Price < 10
and about 10-15 like these.
Now my question is: should I build pre-aggregated tables for these queries, which would amount to about 5 tables in addition to the PRODUCTS table, and update them whenever a sheet comes in, or should I just run all my retrieval queries against the PRODUCTS table?
The PRODUCTS table would contain at most about 500,000 rows at any one time.
I would be inclined to stick with your single table. 500k records isn't overly massive. If you make sure it's properly indexed for the common selects you run against it, you will probably find it is fairly quick.
Try running some controlled, repeatable tests to see what sort of speed gains you can get with the right indexes.
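For example, something along these lines might be a starting point (index names are made up; adjust to your real workload):

CREATE NONCLUSTERED INDEX IX_Products_Brand_Model
    ON dbo.PRODUCTS (Brand_name, Model_name);

-- For the price-filtered variants, leading on Price lets the WHERE clause seek first:
CREATE NONCLUSTERED INDEX IX_Products_Price_Brand_Model
    ON dbo.PRODUCTS (Price, Brand_name, Model_name);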
Related
My customer has a table with ~150 columns in their DB. I don't have access to the DB, I only have summary stats about each column in the table - distinct values in each column along with their likelihood of occurrence.
I'm trying to create a representative copy of this table on my own DB so that I can run queries against it for testing purposes. The only way I know to do this is to write a huge SELECT statement that uses the random() function to randomly choose between the possible values of each column (and other methods for timestamps and IDs). This SELECT is then used inside an INSERT INTO.
This approach just isn't scalable though. I want to be able to do this for a lot more tables. Is there an easier way to do this? I'd like to avoid paid tools if possible.
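A simplified version of what I'm doing now looks roughly like this (the table, columns and percentages are invented for illustration, and this is the SQL Server flavour of the per-row random trick; other engines have their own functions):

CREATE TABLE dbo.FakeOrders (status varchar(10), country char(2));

-- Generate 10,000 rows; widen the cross join / TOP for larger volumes.
WITH nums AS (
    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.FakeOrders (status, country)
SELECT
    CASE WHEN ABS(CHECKSUM(NEWID())) % 100 < 70 THEN 'open' ELSE 'closed' END,  -- ~70/30 split
    CASE WHEN ABS(CHECKSUM(NEWID())) % 100 < 55 THEN 'US'   ELSE 'BR'     END   -- ~55/45 split
FROM nums;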
I have a 260-column table in SQL Server. When we run "SELECT COUNT(*) FROM table" it takes almost 5-6 just to get the count. The table contains close to 90-100 million records with 260 columns, and more than 50% of the columns contain NULL. On top of that, users can also build dynamic SQL queries against the table from the UI, so searching 90-100 million records takes time to return results. Is there a way to improve the find functionality on a SQL table where the filter criteria can be anything? Can anyone suggest the fastest way to get aggregate data over 25 GB of data? The UI shouldn't hang or time out.
Investigate horizontal partitioning. This will really only help query performance if you can force users to put the partitioning key into the predicates.
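A minimal sketch of that, assuming users commonly filter on a RegionId column (the column, boundary values and index names are all assumptions):

CREATE PARTITION FUNCTION pfRegion (int)
    AS RANGE RIGHT FOR VALUES (10, 20, 30, 40);

CREATE PARTITION SCHEME psRegion
    AS PARTITION pfRegion ALL TO ([PRIMARY]);

-- Rebuilding the clustered index on the scheme spreads the rows across partitions:
CREATE CLUSTERED INDEX cix_BigTable
    ON dbo.BigTable (RegionId, Id)
    ON psRegion (RegionId);

-- The optimiser can now skip every partition except the one holding RegionId = 20:
SELECT COUNT(*) FROM dbo.BigTable WHERE RegionId = 20;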
Try vertical partitioning, where you split one 260-column table into several tables with fewer columns. Put all the values which are commonly required together into one table. The queries will only reference the table(s) which contain columns required. This will give you more rows per page i.e. fewer pages per query.
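A rough sketch of the idea, with a shared Id key and made-up column groupings:

-- The columns most queries actually need:
CREATE TABLE dbo.BigTable_Core (
    Id         int PRIMARY KEY,
    CustomerId int,
    RecordDate date,
    Amount     decimal(18, 2)
);

-- The remaining, rarely used and mostly-NULL columns:
CREATE TABLE dbo.BigTable_Rest (
    Id         int PRIMARY KEY REFERENCES dbo.BigTable_Core (Id),
    Notes      nvarchar(max),
    LegacyCode varchar(50)
);

-- Queries that only need core columns never touch the wide table;
-- the rest join on Id only when the extra columns are requested.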
You have a high fraction of NULLs. Sparse columns may help, but calculate your percentages as they can hurt if inappropriate. There's an SO question on this.
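For example (the column name is an assumption):

-- Mark an already mostly-NULL column as SPARSE:
ALTER TABLE dbo.BigTable ALTER COLUMN OptionalCode ADD SPARSE;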
Filtered indexes and filtered statistics may be useful if the DB often runs similar queries.
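For instance, if 'Open' rows are what people keep filtering on (table, column and predicate are assumptions):

CREATE NONCLUSTERED INDEX ix_BigTable_Open
    ON dbo.BigTable (CustomerId, RecordDate)
    WHERE Status = 'Open';

CREATE STATISTICS st_BigTable_Open
    ON dbo.BigTable (CustomerId)
    WHERE Status = 'Open';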
As the guys say in the comments, you need to analyse a few of the queries and see which indexes would help you the most. If your queries do a lot of text searching, you could use the full-text search feature of SQL Server. Here you will find a nice reference with good examples.
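A hedged sketch of the full-text setup (table, column and catalog names are assumptions; the table needs an existing single-column, non-nullable unique index, here PK_BigTable):

CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.BigTable (Description)
    KEY INDEX PK_BigTable
    ON ftCatalog;

-- CONTAINS uses the full-text index instead of a LIKE '%...%' table scan:
SELECT Id, Description
FROM dbo.BigTable
WHERE CONTAINS(Description, N'widget');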
Things that came to mind:
[SQL Server 2012+] If you are using SQL Server 2012, you can use the new Columnstore Indexes (sketched after this list).
[SQL Server 2005+] If you are filtering a text column, you can use Full-Text Search
If there is some function you apply frequently to a column (SOUNDEX of the column, for example), you could create a PERSISTED computed column so that the value doesn't have to be computed every time (also sketched after this list).
Use temp tables (indexed ones will be much better) to reduce the number of rows to work on.
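Rough sketches of the columnstore and computed-column suggestions above (all object names are assumptions):

-- [SQL Server 2012+] nonclustered columnstore index for aggregate-style queries
-- (note: in 2012 it makes the table read-only until the index is disabled or dropped):
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_BigTable
    ON dbo.BigTable (CustomerId, RecordDate, Amount);

-- Persisted computed column so SOUNDEX isn't re-evaluated on every query, plus an index on it:
ALTER TABLE dbo.BigTable
    ADD CustomerSoundex AS SOUNDEX(CustomerName) PERSISTED;

CREATE NONCLUSTERED INDEX ix_BigTable_CustomerSoundex
    ON dbo.BigTable (CustomerSoundex);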
@Twelfth's comment is very good:
"I think you need to create an ETL process and start changing this into a fact table with dimensions."
Changing my comment into an answer...
You are moving from a transaction world where these 90-100 million records are recorded and into a data warehousing scenario where you are now trying to slice, dice, and analyze the information you have. Not an easy solution, but odds are you're hitting the limits of what your current system can scale to.
In a past job, I had several (6) data fields on each record that were pretty much free text and randomly populated depending on where the data was generated (they were search queries, and people entered basically what they would enter in Google). With 6 fields like this, I created a dim_text table that took each entry in any of those 6 fields and replaced it with an integer. That left me with a table of two columns, text_id and text. Any time a user searched for a specific entry in any of the 6 columns, I would search the dim_text table, which was optimized (indexed) for this sort of query, to get back an integer matching the value I wanted. I would then search for all occurrences of that integer across the 6 fields instead. Searching one table highly optimized for this type of free-text lookup and then querying the main table for instances of the integer is far quicker than searching 6 free-text fields directly.
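A rough sketch of that layout, with all names invented:

CREATE TABLE dbo.dim_text (
    text_id int IDENTITY(1, 1) PRIMARY KEY,
    [text]  nvarchar(400) NOT NULL UNIQUE   -- the unique index keeps lookups cheap
);

-- The main table stores integers instead of the six free-text values:
CREATE TABLE dbo.SearchFact (
    fact_id   bigint IDENTITY(1, 1) PRIMARY KEY,
    text_id_1 int NULL, text_id_2 int NULL, text_id_3 int NULL,
    text_id_4 int NULL, text_id_5 int NULL, text_id_6 int NULL
    -- plus whatever measures the record carries
);

-- To find rows where any of the six fields matches a search term:
DECLARE @searchTerm nvarchar(400) = N'example term';
DECLARE @id int = (SELECT text_id FROM dbo.dim_text WHERE [text] = @searchTerm);

SELECT *
FROM dbo.SearchFact
WHERE @id IN (text_id_1, text_id_2, text_id_3, text_id_4, text_id_5, text_id_6);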
I'd also create aggregate tables (reporting tables, if you prefer the term) for your common aggregates. There are quite a few options here that your business setup will determine. For example, if each row is an item on a sales invoice and you need to show sales by date, it may be better to aggregate total sales by invoice and save that to a table; then, when a user wants totals by day, an aggregate is run on the aggregate of the invoices to determine the totals by day (so you've 'partially' aggregated the data in advance).
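A small sketch of that two-step aggregation (table and column names are invented):

-- Step 1, run when new data arrives: aggregate line items up to invoice level.
SELECT InvoiceId, InvoiceDate, SUM(LineAmount) AS InvoiceTotal
INTO dbo.agg_InvoiceTotals
FROM dbo.SalesLines
GROUP BY InvoiceId, InvoiceDate;

-- Step 2, run at report time: daily totals roll up from the much smaller aggregate.
SELECT InvoiceDate, SUM(InvoiceTotal) AS DayTotal
FROM dbo.agg_InvoiceTotals
GROUP BY InvoiceDate;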
Hope that makes sense...I'm sure I'll need several edits here for clarity in my answer.
I have an aggregate data set that spans multiple years. The data for each respective year is stored in a separate table named Data. The data is currently sitting in MS ACCESS tables, and I will be migrating it to SQL Server.
I would prefer that data for each year is kept in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year is approx. 1.5M records of 40ish fields.
I am trying to avoid having to do an excessive number of UNIONS in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding number of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utility? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?
If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you use this, why not put all the data into one table? Then, if you wanted, you could create the views the other way around:
create view Data2001
as
(
select * from AllData
where CreateDate >= '1/1/2001'
and CreateDate < '1/1/2002'
)
A single table is likely the best choice for this type of query. However, you have to balance that against the other work the db is doing.
One choice you did not mention is creating a view that contains the unions and then querying the view. That way you only have to add the union statement to the view each year, and all queries using the view will remain correct. Personally, if I did that, I would write a creation script that creates the new table and then adjusts the view to add the union for that table. Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
One way to do this is by using horizontal partitioning.
You basically create a partitioning function that informs the DBMS to create separate tables for each period, each with a constraint informing the DBMS that there will only be data for a specific year in each.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a schema is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem a lot, depending on your query plans it shouldn't be any big deal (for a decently specced SQL Server). Refer to the documentation.
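A rough sketch of such a setup, assuming a CreateDate column and one partition per year (boundary values and filegroup placement are simplified):

CREATE PARTITION FUNCTION pfYear (datetime)
    AS RANGE RIGHT FOR VALUES ('20010101', '20020101', '20030101');

CREATE PARTITION SCHEME psYear
    AS PARTITION pfYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.AllData (
    Id         int IDENTITY(1, 1) NOT NULL,
    CreateDate datetime NOT NULL,
    -- the other ~40 columns go here
    CONSTRAINT PK_AllData PRIMARY KEY (CreateDate, Id)
) ON psYear (CreateDate);

-- A query with WHERE CreateDate >= '20020101' AND CreateDate < '20030101'
-- only touches the 2002 partition.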
I can't add comments due to low rep, but I definitely agree with one table, and partitioning is helpful for large data sets and is supported in SQL Server, which the data is being migrated to.
If the data is heavily used and frequently updated then monthly partitioning might be useful, but if not, given the size, partitioning probably isn't going to be very helpful.
I am currently writing an application that needs to be able to select a subset of IDs from Millions of users...
I am currently writing software to select a group of 100,000 IDs from a table that contains the whole Brazilian population, 200,000,000 rows (200M). I need to be able to do this in a reasonable amount of time... ID in the table = ID in the XML.
I am thinking of parsing the XML file and starting a thread that performs a SELECT statement against the database; I would need a connection for each thread. Still, this seems like a brute-force approach; perhaps there is a more elegant way?
1) what is the best database to do this?
2) what is a reasonable limit to the amount of db connections?
Making 100,000 queries would take a long time, and splitting the work across separate threads won't help you much, as you are reading from the same table.
Don't fetch a single record at a time; instead, divide the 100,000 items into reasonably small batches, for example 1,000 items each, which you can send to the database. Create a temporary table in the database with those ID values, and join it against the main table to get those records.
Using MS SQL Server for example, you can send a batch of items as an XML to a stored procedure, which can create the temporary table from that and query the database table.
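A hedged sketch of that stored procedure (the parameter, element and table names are made up):

CREATE PROCEDURE dbo.GetPeopleByIdBatch
    @ids xml   -- e.g. '<ids><id>1</id><id>2</id></ids>', roughly 1000 ids per call
AS
BEGIN
    SET NOCOUNT ON;

    -- Shred the batch of ids into an indexed temp table:
    CREATE TABLE #ids (Id bigint PRIMARY KEY);

    INSERT INTO #ids (Id)
    SELECT x.n.value('.', 'bigint')
    FROM @ids.nodes('/ids/id') AS x(n);

    -- One set-based join instead of thousands of individual lookups:
    SELECT p.*
    FROM dbo.People AS p
    JOIN #ids AS i ON i.Id = p.Id;
END;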
Any modern DBMS that can handle an existing 200M-row table should have no problem comparing it against a 100K-row table (assuming your hardware is up to scratch).
Ideal solution: import your XML (at least the IDs) into a new table, ensure the columns you're comparing on are indexed correctly, and then query.
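In other words, something like this (names are assumptions; load the IDs with whatever bulk tool you prefer):

CREATE TABLE dbo.ImportedIds (Id bigint NOT NULL PRIMARY KEY);

-- ...bulk load the ~100K ids from the XML into dbo.ImportedIds...

SELECT p.*
FROM dbo.People AS p
JOIN dbo.ImportedIds AS i ON i.Id = p.Id;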
What language? If you're using .NET, you could load your XML and SQL data as data sources, and then I believe there are some enumerable functions that could be used to compare the data.
Do this:
Parse the XML and store the extracted IDs in a temporary table.¹
From the main table, select only the rows whose ID is also present in the temporary table:
SELECT * FROM MAIN_TABLE WHERE ID IN (SELECT ID FROM TEMPORARY_TABLE)
A decent DBMS will typically do the job quicker than you can, even if you employed batching/chunking and parallelization on your end.
¹ Temporary tables are typically created using the CREATE [GLOBAL|LOCAL] TEMPORARY TABLE ... syntax, and you'll probably want the table private to the session (check your DBMS's interpretation of GLOBAL vs. LOCAL). If your DBMS of choice doesn't support temporary tables, you can use "normal" tables instead, but be careful not to let concurrent sessions mess with the table while you are still using it.
I have 5 databases which represent different regions of the country. In each database, there are a few hundred tables, each with 10,000-2,000,000 transaction records. Each table is a representation of a customer in the respective region. Each of these tables has the same schema.
I want to query all tables as if they were one table. The only way I can think of doing it is creating a view that unions all tables, and then just running my queries against that. However, the customer tables will change all the time (as we gain and lose customers), so I'd have to change the query for my view to include new tables (or remove ones that are no longer used).
Is there a better way?
EDIT
In response to the comments (I also posted this as a response to an answer):
In most cases I won't be removing any tables; they will remain for historical purposes. As I posted in a comment on one response, the idea was to reduce the time it takes a smaller customer (one with only 10,000 records) to query their own history. There are about 1,000 customers with an average of 1,000,000 rows (and growing) apiece. If I were to add all records to one table, I'd have nearly a billion records in that table. I also thought I was planning for the future, in that when we get, say, 5,000 customers, we don't have one giant table holding all transaction records (this may be an error in my thinking). So then, is it better not to divide the records as I have done? Should I mash it all into one table? Will indexing on customer IDs prevent delays in querying data for smaller customers?
I think your design may be broken. Why not use one single table with a region and a customer column?
If I were you, I would consider refactoring to one single table, and if necessary (for reverse compatibility for example), I would use views to provide the same info as in the previous tables.
Edit, to answer the OP's comments on this post:
One table with 10,000,000,000 rows in it will do just fine, provided you use proper indexing. Database servers are built to cope with this kind of volume.
Performance is definitely not a valid reason to split one such table into thousands of smaller ones!
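A sketch of what that single-table design might look like (all names and types are assumptions):

CREATE TABLE dbo.Transactions (
    TransactionId bigint IDENTITY(1, 1) PRIMARY KEY,
    RegionId      int            NOT NULL,
    CustomerId    int            NOT NULL,
    TranDate      datetime       NOT NULL,
    Amount        decimal(18, 2) NOT NULL
);

-- Indexing on CustomerId keeps a small customer's queries from scanning everyone's rows:
CREATE NONCLUSTERED INDEX ix_Transactions_Customer
    ON dbo.Transactions (CustomerId, TranDate);
GO

-- Backwards-compatible per-customer views, if old code expects per-customer tables:
CREATE VIEW dbo.Customer_1234_Transactions
AS
    SELECT * FROM dbo.Transactions WHERE CustomerId = 1234;
GO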
The architecture of this system smells like it needs a vastly different approach if there are a few hundred tables and each has the same schema
Why are you adding or removing tables at all? This should not be happening under any normal circumstances.
Agree with Brann,
That's an insane DB schema design. Why didn't you go with (or is it an option to change to) a single normalised structure, with columns to filter by region and by whatever condition separates each table within a region's database?
With the current structure you're stuck with some horribly large (~500-table) unioned view that you would have to regenerate dynamically every time new tables appear in the system.
Two solutions:
1. Write a stored procedure that builds the view for you by parsing the table names in the 5 databases and constructing the UNION view as you would by hand (a rough sketch follows below).
2. Create a new database with one table and import into it, for example each night, all the records from all the tables.
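A rough sketch of option 1 using dynamic SQL over the catalog views (the database names and the AllCustomers view name are assumptions):

DECLARE @sql nvarchar(max) = N'';

-- Build 'SELECT * FROM db.dbo.table UNION ALL ...' for every customer table:
SELECT @sql = @sql
    + CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
    + N'SELECT * FROM ' + t.qualified_name
FROM (
    SELECT N'RegionNorth.dbo.' + QUOTENAME(name) AS qualified_name FROM RegionNorth.sys.tables
    UNION ALL
    SELECT N'RegionSouth.dbo.' + QUOTENAME(name) FROM RegionSouth.sys.tables
    -- one branch per regional database
) AS t;

IF OBJECT_ID(N'dbo.AllCustomers', N'V') IS NOT NULL
    EXEC (N'DROP VIEW dbo.AllCustomers');

SET @sql = N'CREATE VIEW dbo.AllCustomers AS ' + @sql;
EXEC (@sql);

Wrap that in a stored procedure and schedule it to run whenever customer tables are added or removed, and the view stays current.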
Sounds like you're stuck somewhere between a multi-tenant and a single-tenant database schema. Specifically, you're storing it as "light" multi-tenant (separate tables rather than separate databases) but querying it as single-tenant: one query to rule them all.
In the short term, have your data access layer dynamically pick the table to query rather than unioning everything together into one uber-query.
In the long term, pick one approach and stick to it: one database and one table, or many databases.
Here are some posts on the subject.
What are the advantages of using a single database for EACH client?
http://msdn.microsoft.com/en-us/library/aa479086.aspx