I was wondering if anyone has ever had a chance to measure how 100 joined tables would perform?
Each table would have an ID column with a primary index, and all tables are 1:1 related.
It is a common problem within many data entry applications where we need to collect 1000+ data points. One solution would be to have one big table with 1000+ columns and the alternative would be to split them into multiple tables and join them when it is necessary.
So perhaps the more realistic question is how 30 tables (30 columns each) would behave in a multi-table join.
500K-1M rows should be the expected size of the tables.
Cheers
As a rule of thumb, any more than 25 joins might be a performance problem. I try to keep joins below 10-15, but it depends on the database activity, the number of concurrent users, and the ratio of reads to writes.
Suggest you look at indexed views.
With any well-tuned database, 'good' indexes for the query workload are the key.
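For example, a minimal indexed view sketch (SQL Server syntax; the table and column names are invented purely for illustration) that materializes the join of two of the 1:1 tables so reads no longer pay for it:

CREATE VIEW dbo.PersonWide
WITH SCHEMABINDING
AS
SELECT p.ID, p.FirstName, p.LastName, d.Height, d.Weight
FROM dbo.PersonCore AS p
JOIN dbo.PersonDetail AS d ON d.ID = p.ID;
GO
-- The unique clustered index is what persists the joined rows.
CREATE UNIQUE CLUSTERED INDEX IX_PersonWide ON dbo.PersonWide (ID);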
They'd most likely perform terribly, unless you had a very small number of rows per table.
Go for a wider table, but normalize it properly. My guess is that if you normalize your data properly, you will have a slightly more sane design.
What you describe is similar to the implementation of a column-oriented database (Wikipedia). The data is stored in "column-major" format, which slows down adding each row but is much faster for querying when a WHERE clause restricts the returned rowset.
Why is it that you would rather split the rows? Is it that you measure the data elements for each row at different times? Or is it that the query result of a row would be very large?
Since first posting this, you answered below that your reason for wanting to split the table is that you usually only work with a subset of the data.
In that case, splitting the table can improve performance (the amount of runtime consumed by the query) to some extent. This matters most when your database engine handles large rows slowly, which may be an important factor in your wanting to work with less data.
If performance is not a concern, rather than using SQL JOINs, it might serve you to explicitly list the columns you wish to retrieve in each query. For example, if you only wish to retrieve width, height, and length for a row, you could use:
SELECT width, height, length FROM datatable;
rather than
SELECT * FROM datatable;
and accomplish the same improvement of getting less data returned. The SQL statements used would probably be shorter than the alternative join statements we were considering.
There's no way to better organise the tables? For example a "DataPointTypes" and "DataPointValues" table?
For example (and I don't know your particular circumstances), if all of your tables are like "WebsiteDataPoints (WebsitePage, Day, Visits)", "StoreDataPoints (Branch, Week, Sales)", etc., you could instead have:
DataPointSources(Name)
(with data: Website,Store)
DataPointTypes(SourceId, ColumnName)
(with data: (Website, WebsitePage), (Website, Day), (Store, Branch), (Store, Sales) etc.)
DataPointEntry(Id, Timestamp)
DataPointValues (EntryId, TypeId, Value (as varchar, probably))
(with data: (1, Website-WebsitePage, 'pages.php'), (2, Store-Branch, 'MainStore'), (1, Website-Day, '12/03/1980'), (2, Store-Sales '35') etc.)
In this way each table becomes a source, each column becomes a type, each row becomes an entry, and each cell becomes a value.
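A rough DDL sketch of that layout (the column types and the extra key columns are my guesses, not a spec):

CREATE TABLE DataPointSources (
    SourceId   int PRIMARY KEY,
    Name       varchar(100) NOT NULL            -- e.g. 'Website', 'Store'
);
CREATE TABLE DataPointTypes (
    TypeId     int PRIMARY KEY,
    SourceId   int NOT NULL REFERENCES DataPointSources (SourceId),
    ColumnName varchar(100) NOT NULL            -- e.g. 'WebsitePage', 'Branch'
);
CREATE TABLE DataPointEntry (
    Id         int PRIMARY KEY,
    EntryTime  datetime NOT NULL                -- the 'Timestamp' column above, renamed to avoid the reserved word
);
CREATE TABLE DataPointValues (
    EntryId    int NOT NULL REFERENCES DataPointEntry (Id),
    TypeId     int NOT NULL REFERENCES DataPointTypes (TypeId),
    Value      varchar(255),                    -- everything stored as text
    PRIMARY KEY (EntryId, TypeId)
);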
Related
I have two result tables for calculating Monthly Allowance and OT Allowance.
Should I combine the two tables into one table, use a view, or just join the tables in the SQL statement?
Example
Result
The question becomes: why are there two tables in the first place? If the two tables are always 1:1 (every time a row goes into the first table it must also go into the second), then yes, combine them and save the space of the duplicate data and indexes. If it is possible for the two tables to have different entries (one always gets a row per "incident" while the other may not, or data goes sometimes into the first table and sometimes into the second), then keep them as two to minimize the dead space per row. The data should drive the table design. You don't say how many rows you are talking about. If the rows, tables, and DB are small, does it really matter, other than trying to design well (always a good idea)?
Another related consideration is how much data is already in the current design and how much effort is needed to convert to a different design. Is it worth the effort? I prefer to normalize the data "mostly" (I am not fanatical about it), but I also tend not to mess with what isn't broken.
Sorry if you were wanting a more clear-cut answer...
Short Intro:
When a dozen nested calculation queries are required, is it more efficient to
A) perform each operation separately (saving each result into a table and then reading that table for the next query), or
B) have one large set of nested selects?
Full Description:
I am trying to calculate some advanced forecasts from a series of input tables in SQL.
I am building around a dozen 'modules' that are separated into their own schemas, and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module are dumped into the same output table once completed.
Queries range from 7k-200k rows.
A single schema's/module's tables might look like this:
Input Table 1
Input Table 2
Input Table 3
Input Table 4
Calculation Query 1 Result Table
Calculation Query 2 Result Table
Calculation Query 3 Result Table
Calculation Query 4 Result Table
Calculation Query 5 Result Table
Calculation Query 6 Result Table
Final Output
Each calculation query uses the results of the previous one (for the most part). The final output is the result of the final calculation query. The calculations are not very complex: a partitioned max, a basic formula (+, -, *, /), a SUM, etc. Normally there are only 1-3 of these per calculation step, and always on the same column.
The main reason this is split into multiple calculation queries (instead of one super-formula) is that each calculation joins the outputs in a different way and uses different input tables; also, some are based on previous-row results (such as partitioned max or LAG).
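For illustration only, here is the shape of one such step; the table name and formula are invented, but it shows a partitioned MAX and a previous-row LAG on the Value column:

SELECT ForecastID, [Group], [Year], [Type], Value,
       MAX(Value) OVER (PARTITION BY [Group], [Year]) AS GroupYearMax,
       Value - LAG(Value) OVER (PARTITION BY [Group] ORDER BY [Year]) AS ChangeFromPriorYear
FROM dbo.CalcStep3Result;   -- placeholder name for one intermediate results table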
My requirements are as follows:
A procedure that calculates the final output from step 1 onward and merges it into Final Output.
A procedure that calculates up to a selected calculation query, merges into its respective results table, and stops. Consider this the 'overriding final'.
I DON'T need to store the calculation results of intermediate queries - only the final output, or the 'overriding final' if selected.
My Problem:
I am trying to optimise the entire process - at this point it looks like it will take around 10-15 seconds. I want it to be 1 second - however I appreciate this is probably not possible.
What I have tried:
First, I created a single procedure for each calculation query that merged the results into its respective output table. Using this method, each calculation query must read from the database and then merge into its output table.
I tried temp tables; however, I don't see why this would be optimal, because I already have permanent tables for the calculation steps, and they are indexed with the next step in mind.
I then assumed it would be faster to simply nest all the queries into one super-procedure, or maybe even have a sequence of table-valued functions.
My Question:
However, I ran into a question that I could not find an answer to, which is the following:
Inserting results into a table at every calculation step might slow the process down (especially as the tables are indexed on 2-4 columns), but at least the data will be indexed for the next step.
Nesting selects would save the effort of inserting data, but those results wouldn't be indexed - right? Or wrong?
Are intermediate select results intelligently indexed? And given my scenario, what advice would you give on how to approach this? Maybe I am missing something really simple.
Additional Info:
Most of my larger query results (150-200K rows) have 4 columns that need to be indexed.
All of my tables only have one column that needs calculating - the rest are indexed.
For Example:
ForecastID, Group, Year, Type, Sub-Type, Value
So I have to index Group, Year, Type and Sub-Type to Join multiple input tables and then calculate on the Value column.
I am telling you this in case having index-heavy tables influences your advice. I won't ask for help on optimizing indexes here, due to the overwhelming quantity of advice already available and because it's a different question!
Query optimization is often more art than science; there are few hard and fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, time to hit the high points.
Indexes' effect on loading tables - indexes have a similar performance impact on inserts as triggers do. Unless you have a filtered index, each insert has to update every index on the table, so with three indexes you are looking at quadrupling the number of writes per insert. At one read per insert and a small table size of 200K (very doable for a table scan), three indexes probably put you outside the butter zone for the cost vs. benefit of having those indexes on your work tables.
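As a sketch of what that implies for a work table (dbo.InputTable1 and the column list are placeholders), you could load it as a heap first and add the one index the next step needs afterwards, so each of the ~200K inserted rows pays only for the base write:

SELECT ForecastID, [Group], [Year], [Type], Value
INTO #CalcStep1
FROM dbo.InputTable1;                          -- bulk load, no indexes to maintain per row

CREATE INDEX IX_CalcStep1
    ON #CalcStep1 ([Group], [Year], [Type]);   -- built once, after the load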
Nesting results - like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk, it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200K rows with smallish datatypes and a modern server, you should be OK performance-wise with nested queries, as long as you're only producing one result set at a time. Once again this varies based on your setup; if you are strapped for RAM, drop the results into a temp table.
Joins - another good reason to use temp tables/nested queries is to avoid excessively large joins. Logically, the first step in a join is a full Cartesian product between the tables, which is then filtered based on the ON and WHERE clauses. The join process is heavily optimized in all RDBMSs, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes; however, when tables reach large sizes this can become a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again, the butter zone between subsets and full table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
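A sketch of the subset-then-join idea, with dbo.BigTableA, dbo.BigTableB, and the Year filter invented for illustration:

SELECT a.ForecastID, a.[Group], a.Value
INTO #SubsetA
FROM dbo.BigTableA AS a
WHERE a.[Year] = 2015;

SELECT b.ForecastID, b.[Group], b.Value
INTO #SubsetB
FROM dbo.BigTableB AS b
WHERE b.[Year] = 2015;

-- Join the two much smaller sets instead of the full tables.
SELECT a.ForecastID, a.[Group], a.Value + b.Value AS CombinedValue
FROM #SubsetA AS a
JOIN #SubsetB AS b
  ON b.ForecastID = a.ForecastID
 AND b.[Group] = a.[Group];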
Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.
It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A, persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
If you want to use approach B, or mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.
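As a hedged sketch of what the CTE version might look like (dbo.InputTable1 and the step logic are invented; each CTE stands in for one calculation step):

WITH Step1 AS (
    SELECT ForecastID, [Group], [Year], SUM(Value) AS StepValue
    FROM dbo.InputTable1
    GROUP BY ForecastID, [Group], [Year]
),
Step2 AS (
    SELECT ForecastID, [Group], [Year], StepValue,
           MAX(StepValue) OVER (PARTITION BY [Group]) AS GroupMax
    FROM Step1
)
SELECT ForecastID, [Group], [Year],
       StepValue / NULLIF(GroupMax, 0) AS FinalValue
FROM Step2;
-- If the plan shows spills or repeated scans, replace a CTE with SELECT ... INTO #Step1
-- and index the temp table instead.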
I'm looking to essentially have a centralized table with a number of lookup tables that surround it. The central table is going to be used to store 'Users' and the lookup tables will be user attributes, like 'Religion'. The central table will store an Id, like ReligionId, and the lookup table would contain a list of religions.
Now, I've done a lot of digging into this and I've seen many people comment saying that a UserAttribute table might be the best way to go, essentially using an EAV pattern. I'm not looking to do this. I realize that my strategy will be join-heavy and that's why I ask this question here. I'm looking for a way to optimize those joins.
If the central table has 100 lookup tables, how could it be optimized to be faster than just doing a massive 100-table inner join? Some ideas come to mind, like using many smaller joins, sub-selects, and views. I'm open to anything, including a combination of these strategies. Again, just to note, I'm not looking to do anything EAV-related. I need the lookup tables for other reasons, and I like normalized data.
All suggestions considered!
Here's a visual look:
Edit: Is this insane?
Optimization techniques will likely depend on the size of the center table and intended query patterns. This is very similar to what you get in data warehousing star schemas, so approaches from that paradigm may help.
For one, ensure the size of each row is absolutely as small as possible. Disk space may be cheap, but disk throughput, memory, and CPU are potential bottlenecks. You want small rows so that the engine can read them quickly and cache as much as possible in memory.
A materialized/indexed view with the joins already performed allows the joins to essentially be precomputed. This may not work well if the center table is written to a lot or is very large.
Anything you can do to optimize a single join should be done for all 100. Appropriate indexes based on the selectivity of the column, etc.
Depending on what kind of queries you are performing, other techniques from data warehousing or OLAP may apply. If you are doing lots of GROUP BYs, then this is likely an area to look into. Data warehousing techniques can be applied within SQL Server with no additional tooling.
Ask yourself why so many attributes are being queried and how they are being presented? For most analysis it is not necessary to join with lookup tables until the final step where you materialize a report, at which time you may only have grouped by on a subset of columns and thus only need some of the lookup tables.
GROUP BYs should generally be able to group on the lookup IDs without needing the text/description from the lookup table, so a join is not necessary. If your lookups have other information relevant to the query at hand, then consider denormalizing it into the central table to eliminate the join, and/or make that discrete value its own lookup, essentially splitting the existing lookup ID into another ID.
You could implement a master code table that combines the code tables into a single table with a CodeType column. This is not the same as EAV, because you'd still have a column in the center table for each code type and a join for each, whereas EAV is usually used to normalize out an arbitrary number of attributes. (Note: I personally hate master code tables.)
Lastly, consider normalizing the center table if you are not doing data warehousing.
Are there lots of NULL values in certain lookup ID columns? Is the table sparse? This is an indication that you can pull some columns out into 1-to-0/1 relationships to reduce the size of the center table. For example, a Person table that includes address information can have a PersonAddress table pulled out of it.
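A minimal sketch of that split (names invented): the optional address columns move into a child table keyed by PersonId, so only people who actually have an address get a row there:

CREATE TABLE Person (
    PersonId int PRIMARY KEY,
    Name     varchar(100) NOT NULL
);
CREATE TABLE PersonAddress (
    PersonId int PRIMARY KEY REFERENCES Person (PersonId),  -- 1-to-0/1: at most one row per person
    Street   varchar(200),
    City     varchar(100),
    PostCode varchar(20)
);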
Partitioning the table may improve performance if there is a large number of rows and you can determine that certain rows, perhaps those with an old datetime from a couple of years in the past, would rarely be queried.
Update: see "Ask yourself why so many attributes are being queried and how they are being presented?" above. Consider a user who wants to know the number of sales grouped by year, department, and product. You should have IDs for each of these, so you can group by those IDs on the center table and, in an outer query, join the lookups for only the columns that remain. This ensures the aggregation doesn't pull in unnecessary information from lookups that aren't needed anyway.
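A sketch of that pattern, assuming made-up tables SalesFact, DimYear, DimDepartment, and DimProduct: aggregate on the integer keys first, then join the lookup text in an outer query so the lookups never take part in the expensive aggregation:

SELECT y.YearName, d.DepartmentName, p.ProductName, agg.TotalSales
FROM (
    SELECT YearId, DepartmentId, ProductId, SUM(SaleAmount) AS TotalSales
    FROM SalesFact
    GROUP BY YearId, DepartmentId, ProductId
) AS agg
JOIN DimYear AS y ON y.YearId = agg.YearId
JOIN DimDepartment AS d ON d.DepartmentId = agg.DepartmentId
JOIN DimProduct AS p ON p.ProductId = agg.ProductId;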
If you aren't doing aggregations, then you probably aren't querying large numbers of records at a time, so join performance is less of a concern and should be taken care of with appropriate indexes.
If you're querying large numbers of records at a time and pulling in all the information, I'd look hard at the business case for this. No one sits down at their desk, opens a report with a million rows and 100 columns in it, and does anything meaningful with all of that data that couldn't be accomplished in a better way.
The only case for such a query would be a dump of all data intended for export to another system, in which case performance shouldn't be as much of a concern, since it can be scheduled overnight.
Since you are set on your approach, you can consider duplicating data in order to join fewer times, in a similar way to what is done in an OLAP database.
http://en.wikipedia.org/wiki/OLAP_cube
That said, I don't think this is the best way to do it if you have 100 properties.
Have you tried exporting it to Microsoft Excel Power Pivot with Power Query? You can do fast data analysis, with pretty awesome ways to present it, using Power View (video sample).
I have multiple tables that data can be queried from with joins.
In regards to database performance:
Should I run multiple selects from multiple tables for the required data?
or
Should I write 1 select that uses a bunch of Joins to select the required data from all the tables at once?
EDIT:
The WHERE clause I will be using for the select contains indexed fields of the tables. It sounds like, because of this, it will be faster to use one select statement with many joins. I will, however, still test the performance difference between the two.
Thanks for all the great answers.
Just write one query with joins. If you are concerned about performance, there are a number of options, including:
Creating indexes that will help the performance of your selects
Create a persisted denormalized form of the data you want so you can query one table. This would most likely be an indexed view or another table.
This can be one of those well-gee-it-depends questions, but generally, if you're writing straight SQL, do one query - especially since the joins might limit some of the data you get back.
There is a good chance that if you do multiple point queries for one record in each table, even when using the primary key of the table for lookup, the connection cost of each query will exceed the cost of the actual query.
It depends on how the tables are joined. If you end up with a cross-product of all the tables, then it would be better to do individual selects. However, if your tables are properly indexed and well thought out, one query with multiple joins will be more efficient.
If you have proper indexes on your tables, you might be better off with the JOINs, but they are often the cause of bottlenecks. Instead of multiple selects, you might look at ways to denormalize your data. It is far less "expensive" to update a count or timestamp in multiple tables when a user performs an operation, since it saves you from having to join those tables later.
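For instance, a sketch with invented posts/comments tables: maintaining a denormalized comment_count on the parent row means listing pages never need to join the comments table:

INSERT INTO comments (post_id, body) VALUES (42, 'Nice post');

UPDATE posts
SET comment_count = comment_count + 1,
    last_commented_at = NOW()
WHERE id = 42;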
The best tool I find for performance tuning of queries is EXPLAIN. You type EXPLAIN before the query and you can see how many rows are scanned. The lower the number, the better; it means your indexes are working. The other thing is, when creating indexes, use compound indexes on multiple fields and order them left to right in the order they appear in the WHERE clause.
For example you have 10,000 rows in sometable:
SELECT id, name, description, status FROM sometable WHERE name LIKE 'someName%' AND status = 'Active';
You could type EXPLAIN before the query and it might report 10,000 as the number of rows scanned to find matches. You then create a compound index:
ALTER TABLE sometable ADD INDEX idx_st_search (name, status);
You then run EXPLAIN on the query again, and it might report 1 as the number of rows scanned, with performance significantly improved.
It depends on your table designs.
Most of the time one large query is better, but be sure to:
use primary keys in the WHERE clause as much as you can for joins;
use indexed fields, or create indexes for fields that are used in WHERE clauses.
I have 60 columns in a table.
1). I want to add one more column to that table. Will there be any impact on performance?
2). How many columns can I add?
3). any idea for the avoid recursion. [I have no idea here - annakata]
Yes, but one more column is less of a problem than the fact that you already have 60.
I bet most of them are nullable?
With very wide tables (many columns) it becomes harder to write maintainable SQL. You are forced to deal with lots of exceptions due to the NULLS.
See also this post which asks how many fields is too many?
Will there be any impact on performance?
If you are adding a "Notes" type TEXT column, or a Blob storing the Image of a User, and most/many of your queries are
SELECT * FROM MyTable
then you will definitely be creating a performance issue.
If you always explicitly only name the columns your query needs, like:
SELECT Col1, ColX, ColN
FROM MyTable
then adding a new column will have little to no impact on performance. However, wider rows mean fewer records per data page, so there is SOME impact, and if you are adding an index on the new column then that index has to be maintained too - but if your application needs it, that is a necessary "cost".
We have plenty of tables with > 60 columns. However, I would like to think that that is By Design, rather than because the table has just grown willy-nilly.
If I were you I would be less concerned with the fact that you have to add one more column, and more concerned with deciding whether 60 columns is appropriate.