Querying large dataset for statistics in SQL Server? - sql

Say I have a sample for which 5 million data objects are stored as rows in SQL Server. If I need to run some stats on the data, would it be better to have a table for each sample, or one giant table, where I would select by sample id and then run the stats?
There may eventually be hundreds or even thousands of samples- which seems like one massive table.
But I'm not a SQL Server expert so I can't say whether one would be faster than the other...
Or maybe a better way to deal with such a large data set? I was hoping to use SQL CLR with C# to do my heavy lifting...

If you need to deal with such a large dataset, my gut feeling tells me T-SQL and working in sets will be significantly faster than anything you can do in SQL-CLR and a RBAR (row-by-agonizing-row) approach... dealing with large sets of data, summing up and selecting, that's what T-SQL is always been made for and what it's good at.
5 million rows isn't really an awful lot of data - it's a nice size dataset. But if you have the proper indices in place, e.g. on the columns you use in your JOIN conditions, in your WHERE clause and your ORDER BY clause, you should be just fine.
If you need more and more detailed advice - try to post your table structure, explain how you will query that table (what criteria you use for WHERE and ORDER BY) and we should be able to provide some more feedback.

Related

Data profiling of columns for big table (SQL Server)

I have table with over 40 million records. I need to make data profiling, including Nulls count, Distinct Values, Zeros and Blancs, %Numeric, %Date, Needs to be Trimmed, etc.
The examples that I was able to find are always including implementation of the task using cursors. For big table such solution is performance killer.
I would be happy if I receive suggestions and examples which give better performance alternatives. Is it possible to create multiple stored procedures and combine the results in a table? I have not used stored procedures so far, so I base my question only on understanding that I got from documentation.
As Gordon mentioned, you should include your table's schema and some sample data to get the best answers, but a couple things you can look into are as follows:
Columnstore Indexes - These can be helpful for analytical querying against a table, e.g. SUM(), COUNT(), COUNT(DISTINCT) etc. This is because of the efficiencies in compression that can be achieved up and down the column for analytics. This is useful if you need a "real time" based on answer every time you query against the data.
You can periodically stage and update the results in a data warehouse type table. You basically can store the results to those aggregations in it's own table and periodically update it with either a SQL Agent Job (this isn't necessarily a real time solution) or use triggers to automatically update your data warehouse table (which will be closer to a real time solution but can be performance heavy if not implemented in a lean manner).
OLAP Cubes - This is more of an automated way to the above solution and has better maintainability but is also more advanced of a solution. This is a methodology for building out an actual OLAP based data warehouse.
In terms of difficulty of implementation and based on the size of your data (which isn't anything too huge) my recommendation would be to start with columnstore indexes and see how that helps your queries. I've had much success using them for analytical querying. Otherwise my remaining recommendations are in order of difficulty as well.

Splitting up long queries for readability / maintainability

I have inherited a Access database that has a query that SELECTs over 50 columns. Even in the MS Access graphical query design tool, it's too much information to handle IMO.
A quick disclaimer - I have a programming background, but almost no experience with databases.
For this particular problem, I need data from 50+ columns. Is there a better way to get this information than one huge query?
I am a bit at a loss on how to proceed. The consensus on the web seems to be that SQL queries should be kept relatively small and well formated. I just don't see how to apply that principle when you need so many fields. Can I use lots of simple queries with a UNION or INSERT INTO a temporary table? Any suggestion or RTFMs?
EDIT: More info on the application. The data is spread across 14 tables. I'm grabbing the data to write it out to an external file which has 50+ fields per row (think CSV version of a spreadsheet).
EDIT: Managing and debugging SQL queries in MS Access looks like it contains relevant advice.
I think you're worrying yourself over nothing at all. Pulling 50 fields from 14 tables in order to output to a flat file is nothing at all. The point is that you're intentionally denormalizing data because you need a flat-file output. By definition, that will mean lots of columns.
As others have said, 50 columns in one table would be out of the ordinary (though not necessarily a design error), but in the situation you describe, I don't see an issue at all.
In short, it sounds like a perfectly fine setup and I just have to wonder what makes you think it's a problem to begin with.
I would echo most of the comments about needing 50+ columns every time you run your query.
However, if it's the ugliness of the query that gets you the 50+ columns that is causing your grief, you could create a view for that query. It doesn't get rid of the fact that you're dealing with a bunch of data, but it could encapsulate a large, hairy beast of an SQL statement.
If you're retrieving data from all columns, then you can write SELECT * FROM ....
If there are 50 columns in a single table, the database design is questionable at best. That might practically mean that you'll just have to cope with not-so-elegant solutions.
Do you really need all that data at once? If not, it's probably better to only SELECT the data you need at a given time. You should provide more information on the actual task.
If your data is coming from multiple tables, you could use Common Table Expressions to divide the query into groupings.

Which is more efficient : 2 single table queries or 1 join query

Say tableA has 1 row to be returned but will have 100 columns returned while tableB has 100 rows to be returned but only one column from each. TableB has a foreign key for table A.
Will a left join of tableA to tableB return 100*100 cells of data while 2 separate queries return 100 + 100 cells of data or 50 times less data or is that a misunderstanding of how it works?
Is it ever more efficient to use many simple queries rather than fewer more complex ones?
First and foremost, I would question a table with 100 columns, and suggest that there is a possibly a better design for your schema. In the real world, this number of columns is less common, so typically the difference in the amount of data returned with one query vs. two becomes less significant. 100 columns in a table is not necessarily bad, just a flag that it shold be considered.
However, assuming your numbers are what they are to make clear the question, there are a few important variables to consider:
1 - What is the speed of the link between the db server and the application server? If it is very slow, then you are probably better off minimizing the amount data returned vs. the number of queries you run. If it is not slow, then you will likely expend more time in the execution of two queries than you would returning the increased payload. Which is better can only be determined by testing in your own environment.
2 - How efficient is the transport protocol itself? Perhaps there is some kind of compression of the data, or an even more clever algorithm that knows column 2 through 101 are duplicate for every row, so it only passes them once. Strategies like this in the transport protocol would mitigate any of your concerns. Again, this is why you need to test in your own envionment to know for sure.
As others have pointed out, you also need to consider what will be done with the data once you get it (e.g., JOINs, GROUPing, etc), but I am limiting my response to the specifics of your question around query count vs. payload size.
What is best at joining? A database engine or client code? Saying that, I use both techniques: it depends on the client and how data will be used.
Where the data requires some processing to, say, render on a web page I'd probably split header and details recordsets. We do use this because we have some business logic between DB and HTML
Where it's consumed simply and linearly, I'd join in the database to avoid unnecessary processing. For example, simple reports or exports
It depends, if you only take into account the SQL efficiency obviusly several simpler and smaller result queries will be more efficient.
But you need to take into account the whole process if the join will be made otherwise on the client or you need to filter results after the join, then probably the DBM will be more efficient that doing it on your code.
Coding is always a tradeoff between diferent systems, DB vs Client, RAM vs CPU... you need to be conscious about this and try to find the perfect solution.
In this case probably 2 queries outperform 1 but that is not a general solution.
I think that your question basically is about database normalization. In general, it is advisable to normalize a database into multiple tables (using primary and foreign keys) and to join them as needed upon queries. This is better for insert/update performance and for keeping the data consistent, and usually results in smaller database sizes as well.
As for the row numbers returned, only a cross join would actually return 100*100 rows; any inner or outer join will not create all combinations, but rather tie together rows on the given conditions, and for outer joins preserve rows which could not be matched. Wikipedia has some samples in its JOIN article.
For very query-intense applications, the performance may be better when using less normlized tables. However, as always with optimizations, I'd only consider going into that direction after seeing real measurable problems (e.g. with a profiling tool).
In general, try to keep the number of roundtrips to the database low; a large number of single simple queries will suffer from the overhead of talking to the DB engine (network etc.). If you need to execute complex series of statements, consider using stored procedures.
Generally fewer queries makes for better performance, as long as the queries return data that is actually related. There is no point in trying to put unrelated data into the same query just to reduce the number or queries.
There are of course exceptions, and your example may be one of them. However, it depends on more than the number of fields returnes, like what the fields actually return, i.e. the actual amount of data.
As an example of how the number of queries affects performance, I can mention a solution that I have (sadly enough) seen many times. In that solution the programmer would first get a number of records from one table, then loop through the records and run another query for each record to get the related records from another table. This clearly results in a lot of queries, and a solution having either one or two queries would be much more efficient.
“Is it ever more efficient to use many simple queries rather than fewer more complex ones?”
The query that requires the least amount of data to traverse, and gives you no more than what you need is the more efficient one. Beyond this, there can be RDBMS specific conditions that can be more efficient on one RDBMS system than another. At the very low level, when you deal with less data, then your results can be retrieved much quicker, so efficient queries are queries that only work with the least amount of data needed to get you the result you are looking for.

Sql optimization Techniques

I want to know optimization techniques for databases that has nearly 80,000 records,
list of possibilities for optimizing
i am using for my mobile project in android platform
i use sqlite,i takes lot of time to retreive the data
Thanks
Well, with only 80,000 records and assuming your database is well designed and normalized, just adding indexes on the columns that you frequently use in your WHERE or ORDER BY clauses should be sufficient.
There are other more sophisticated techniques you can use (such as denormalizing certain tables, partitioning, etc.) but those normally only start to come into play when you have millions of records to deal with.
ETA:
I see you updated the question to mention that this is on a mobile platform - that could change things a bit.
Assuming you can't pare down the data set at all, one thing you might be able to do would be to try to partition the database a bit. The idea here is to take your one large table and split it into several smaller identical tables that each hold a subset of the data.
Which of those tables a given row would go into would depend on how you choose to partition it. For example, if you had a "customer_id" field that could range from 0 to 10,000 you might put customers 0 - 2500 in table1, 2,500 - 5,000 in table2, etc. splitting the one large table into 4 smaller ones. You would then have logic in your app that would figure out which table (or tables) to query to retrieve a given record.
You would want to partition your data in such a way that you generally only need to query one of the partitions at a time. Exactly how you would partition the data would depend on what fields you have and how you are using them, but the general idea is the same.
Create indexes
Delete indexes
Normalize
DeNormalize
80k rows isn't many rows these days. Clever index(es) with queries that utlise these indexes will serve you right.
Learn how to display query execution maps, then learn to understand what they mean, then optimize your indices, tables, queries accordingly.
Such a wide topic, which does depend on what you want to optimise for. But the basics:
indexes. A good indexing strategy is important, indexing the right columns that are frequently queried on/ordered by is important. However, the more indexes you add, the slower your INSERTs and UPDATEs will be so there is a trade-off.
maintenance. Keep indexes defragged and statistics up to date
optimised queries. Identify queries that are slow (using profiler/built-in information available from SQL 2005 onwards) and see if they could be written more efficiently (e.g. avoid CURSORs, used set-based operations where possible
parameterisation/SPs. Use parameterised SQL to query the db instead of adhoc SQL with hardcoded search values. This will allow better execution plan caching and reuse.
start with a normalised database schema, and then de-normalise if appropriate to improve performance
80,000 records is not much so I'll stop there (large dbs, with millions of data rows, I'd have suggested partitioning the data)
You really have to be more specific with respect to what you want to do. What is your mix of operations? What is your table structure? The generic advice is to use indices as appropriate but you aren't going to get much help with such a generic question.
Also, 80,000 records is nothing. It is a moderate-sized table and should not make any decent database break a sweat.
First of all, indexes are really a necessity if you want a well-performing database.
Besides that, though, the techniques depend on what you need to optimize for: Size, speed, memory, etc?
One thing that is worth knowing is that using a function in the where statement on the indexed field will cause the index not to be used.
Example (Oracle):
SELECT indexed_text FROM your_table WHERE upper(indexed_text) = 'UPPERCASE TEXT';

What generic techniques can be applied to optimize SQL queries?

What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they effect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in sql server query analyzer (make sure you turn on "show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran sql server query profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Query profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that the Auto Statistics is set to on. SQL needs this to help decide the optimal execution. See Mike Gunderloy's great post for more info. Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented. Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
Use a with statment to handle query filtering.
Limit each subquery to the minimum number of rows possible.
then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master,
taxReturns
WHERE master.ssn = taxReturns.ssn
A subqueries within a with statement may end up as being the same as inline views,
or automatically generated temp tables. I find in the work I do, retail data, that about 70-80% of the time, there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL query analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. if you frequently use a column as a way to order or limit your dataset an index can make a big difference. I saw in a recent article that select distinct can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balanace things accordingly.
Indexes
Statistics
on microsoft stack, Database Engine Tuning Advisor
Some other points (Mine are based on SQL server, since each db backend has it's own implementations they may or may not hold true for all databases):
Avoid correlated subqueries in the select part of a statement, they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR statements (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION if (And only if) the two statments are mutually exclusive and return the same results either way.
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID = null
In an UPDATE query add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be done when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 million orders in a report. Pre-calculations should be done in triggers so that they are always up-to-date is the underlying data changes. And it doesn't have to be just numbers either, we havea calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp table tend to be faster for large data set and table variables faster for small ones. In addition you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious but you would not believe how often I end up fixing this. Do not join to tables that you are not using to filter the records or actually calling one of the fields in the select part of the statement. Unnecessary joins can be very expensive.
It is an very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once and creating 100,000,00 records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting not just the user interface to enter data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables, it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (Able to use indexes).