For testing purposes I require a large number of queries.
Creating these manually is not an option, so I am searching for a tool that will do this automatically.
Sadly, the only solution I have found (sqlsmith) is limited to PostgreSQL and SQLite.
Are there any similar tools for SQL Server?
"I do not know from what random place people will want to travel to a random other place, so instead, let's create roads for every possible combination of origin and destination".
That sounds kind of insane, doesn't it? The same applies to what you seem to want to achieve. You are basically hoping to find a tool that generates random queries against your database so you can feed them to the tuning advisor, which will then suggest indexes to optimize hypothetical queries.
If you want to performance tune your database, you should have a pretty good idea of the type of questions your users will be throwing at it, as well as the structure of your data. Typical questions that will help you get started would be things like:
What is the most common search my users would do against this table?
What criteria are they most likely to use?
Which columns are guaranteed or likely to contain unique data in every row?
Which columns will most likely have low selectivity? (e.g. Male/Female)
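The answers to the last two questions translate fairly directly into index choices. As a rough, hypothetical sketch (table and column names invented): a nearly unique column is a strong index candidate on its own, while a two-value column rarely is, though it can still earn its keep as a trailing column of a composite index.

-- High-selectivity column: a good standalone index candidate
CREATE INDEX idx_customers_email ON Customers (Email);

-- A low-selectivity column (e.g. Gender) is rarely worth indexing alone,
-- but may help as the trailing column of a composite index
CREATE INDEX idx_customers_region_gender ON Customers (Region, Gender);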
Are you looking to generate random data for multiple tables? We generally use the Redgate data generator tool for that.
For SQL tuning purposes I would suggest:
https://www.brentozar.com/blitzindex/
http://www.nguyenlamminhdieu.com/zone/213/news/vi-VN/zone/213/news/351-database-engine-tuning-advisor-in-sql-server.aspx
I have used Oracle for the past half year and learned some tricks of SQL tuning, but now our DB is moving to Greenplum and the project manager has suggested we change some of the code written in Oracle SQL for efficiency or grammar reasons.
I am curious: are SQL tuning techniques the same for different DB engines, like Oracle, PostgreSQL, MySQL, and so on? If yes or no, why? Any suggestions are welcome!
For example (a sketch comparing the first pair follows this list):
1. IN or EXISTS
2. COUNT(*) or COUNT(column)
3. whether or not to use an index
4. selecting explicit columns instead of SELECT *
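On the first pair, the same filter can usually be written either way, and which version is faster depends on the engine's optimizer, so it is worth re-testing on the new platform. A minimal sketch with invented table and column names:

-- Version using IN
SELECT o.order_id
FROM orders o
WHERE o.customer_id IN (SELECT c.customer_id FROM customers c WHERE c.region = 'EU');

-- Equivalent version using EXISTS
SELECT o.order_id
FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c
              WHERE c.customer_id = o.customer_id AND c.region = 'EU');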
For the most part the syntax that is used will remain the same; there may be small differences from one engine to another, and you may run into different terms to achieve some of the more specific output or to do more complex tasks. In order to achieve parity you will need to learn those new terms.
As far as tuning goes, this will vary from system to system. Specifically, going from Oracle to Greenplum, you are moving from a database where query efficiency is often driven by putting an index on the data, to a parallel execution system where efficiency is gained by effectively distributing the data across multiple systems and querying them in parallel. In Greenplum, indexing is an additional layer that usually does not add benefit, just additional overhead.
Even within a single system, changing the storage engine type can result in different ways to optimize a query. In practice, queries are often moved to a new platform and work, but are far from optimal because they don't take advantage of the optimizations of that platform. I would strongly suggest getting an understanding of the new platform; you should not go in assuming that a query optimized for one platform is the optimal way to run it on another.
Getting into the specifics of why they differ requires someone to be an expert in both in order to compare them. I don't claim to know much about Greenplum.
The basic principles, which I would expect all developers to learn over time, don't really change. But there are "quirks" of individual engines which make specific differences. From your question I would personally anticipate points 1 and 4 to remain the same.
Indexing is something which does vary. For example, the ability to use two indexes in a single query was not (is not?) ubiquitous. I wouldn't like to guess which DBMSs can or can't use columns from the second field onwards in a composite index. And the way indexes are maintained is very different from one DBMS to the next.
From my own experience I've also seen differences caused by:
Different capabilities in the data access path. As an example, one optimisation is for a DBMS to create a bitmap of rows (matching and not matching), then combine multiple bitmaps to select rows. A DBMS with this feature can use multiple indexes in a single query; one without it can't (see the sketch after this list).
Availability of hints / lack of hints. Not all DBMS support them. I know they are very common in Oracle.
Different locking strategies. This is a big one and can really affect update and insert queries.
In some cases DBMS have very specific capabilities for certain types of data such as geographic data or searchable free text (natural language). In these cases the way of working with the data is entirely different from one DBMS to the next.
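To make the bitmap example above concrete, here is a minimal sketch with invented table, column, and index names: a query filtering on two independently indexed columns. An engine that can combine bitmaps may intersect both indexes; one that can't will typically use at most one of them and filter the remaining rows directly.

-- Two single-column indexes on the same (hypothetical) table
CREATE INDEX idx_orders_status ON orders (status);
CREATE INDEX idx_orders_region ON orders (region);

-- A bitmap-capable optimizer can build one bitmap per index and AND them together;
-- other engines will usually pick a single index and scan from there.
SELECT order_id, total
FROM orders
WHERE status = 'SHIPPED'
  AND region = 'EU';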
In my project:
Data is not going to be modified (query only).
There will be more than 1,000,000 instances of data.
Query performance is critical.
In the case of using SQL, it would be a single table with 7 columns (no joins).
There are also different classification approaches used in NoSQL, which are given below with some examples:
Column: Accumulo, Cassandra, HBase
Document: Clusterpoint, Couchdb, Couchbase, MarkLogic, MongoDB
Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE
Graph: Allegro, Neo4J, OrientDB, Virtuoso, Stardog
Source: http://en.wikipedia.org/wiki/NoSQL#cite_note-7
First of all, does the database system really make an observable difference in performance for this case?
If it does, can you please explain which one is more suitable for my project, SQL or NoSQL, and if NoSQL, which classification approach?
Thank you in advance
I am currently involved in a project to set up a "standard" database with a huge amount of data. We start by implementing it in SQL to see the performance of the queries; once this is done, we address the problem of performance.
There are multiple reasons for this, but to name a few:
Standard SQL is easily implemented and consistent across multiple instances (as of the present day).
If you know SQL, you can make a fast implementation, which saves time and gets the project going.
There are loads of resources available about SQL implementations.
I cannot answer for NoSQL, but hopefully someone else can fill that in.
The important question you need to ask is what kind of queries you will be performing. For example, ClusterPoint offers real-time aggregation, so if you need to group results and extract summaries, it gives you great performance.
For a regular key/value they should all perform pretty well, so pick the one you are most comfortable with.
In my limited experience working with ORMs (so far LLBL Gen Pro and Entity Framework 4), I've noticed that, inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure whether this applies to it or not, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view (a sketch follows this list)
Use a stored procedure and create models and mappings on the result set returned
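For the first workaround, here is a minimal sketch of such a narrowing view, with hypothetical table, view, and column names; the ORM entity is then mapped against the view instead of the wide base table:

-- View exposing only the columns the lightweight model needs
CREATE VIEW CustomerSummary AS
SELECT CustomerId,
       FirstName,
       LastName,
       Email
FROM Customers;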
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First off: great question, and about time someone asked this! :-)
Yes, the fact an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main fact for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent table doesn't grab all those large blobs of data (a DDL sketch follows this list)
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
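A minimal DDL sketch of the first suggestion, assuming a hypothetical Documents table whose heavy payload is moved out into a 1:1 side table:

-- Lean base table that most queries touch
CREATE TABLE Documents (
    DocumentId INT PRIMARY KEY,
    Title      VARCHAR(200),
    CreatedAt  DATETIME
);

-- 1:1 side table holding the large payload
CREATE TABLE DocumentContents (
    DocumentId INT PRIMARY KEY,
    Content    VARBINARY(MAX),  -- the BLOB lives here, out of the hot path
    FOREIGN KEY (DocumentId) REFERENCES Documents (DocumentId)
);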
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30) or if a column holds a lot of data, for example over 5,000 characters in a field.
The approach I have used is to just map another object to the existing table, but with only the fields I need. So for a search that populates a table with 100 rows I would have a
MyObjectLite, but when I click to view the details of that row I would call GetById and return a MyObject that has all the columns.
Another approach is to use custom SQL or stored procs, but I only think you should go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a projection, Transformers.AliasToBean, and a DTO. Here is how it looks in the Criteria API (the enclosing CreateCriteria call on a Package entity is assumed here to make the snippet complete):
var result = session.CreateCriteria<Package>()  // entity type assumed from the projected properties
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("PackageName"), "Caption"))
    .SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)))
    .List<PackageNameDTO>();
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.
I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, number 1 is the clear winner to me. If it were <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different where all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
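As a hedged illustration of "index well on the appropriate fields" (table and column names invented): if the varying WHERE clause mostly filters on the same couple of columns, a composite index covering them is usually the first thing to try.

-- Composite index matching the columns the WHERE clause filters on most often
CREATE INDEX idx_items_category_status ON items (category_id, status);

-- A typical narrow query that can be served via the index above
SELECT item_id, name, price
FROM items
WHERE category_id = 42
  AND status = 'ACTIVE';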
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look at how you could optimize it using caching. (Premature optimization is the root of all evil, as Knuth once said.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
First of all, let us dismiss #2. Searching tables is a database server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is done. If it is done in the application code, then, as with #2, you have the same problem.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and did not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
If you do this:
SELECT * FROM users;
MySQL should perform two queries: one to find out which fields are in the table and another to bring back the data you asked for.
When doing
SELECT id, email, password FROM users;
MySQL only reaches for the data, since the fields are explicit.
About limits: it is always best to query only the quantity of rows you will need, no more, no less. More data means more time to retrieve it.
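A small illustration of that last point, assuming the users table above and a page-based UI that only shows a handful of rows at a time:

-- Fetch only the columns and rows the current page needs
SELECT id, email
FROM users
ORDER BY id
LIMIT 50 OFFSET 0;  -- first page of 50 rows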
I've been reading a little about temporary tables in MySQL but I'm an admitted newbie when it comes to databases in general and MySQL in particular. I've looked at some examples and the MySQL documentation on how to create a temporary table, but I'm trying to determine just how temporary tables might benefit my applications and I guess secondly what sorts of issues I can run into. Granted, each situation is different, but I guess what I'm looking for is some general advice on the topic.
I did a little googling but didn't find exactly what I was looking for on the topic. If you have any experience with this, I'd love to hear about it.
Thanks,
Matt
Temporary tables are often valuable when you have a fairly complicated SELECT you want to perform and then perform a bunch of queries on that...
You can do something like:
CREATE TEMPORARY TABLE myTopCustomers
SELECT customers.*, COUNT(*) AS num FROM customers JOIN purchases USING (customerID)
JOIN items USING (itemID) GROUP BY customers.customerID HAVING num > 10;
And then do a bunch of queries against myTopCustomers without having to do the joins to purchases and items on each query. Then when your application no longer needs the database handle, no cleanup needs to be done.
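For instance, later lookups can hit the temporary table directly, with no joins back to purchases or items (num and customerID come from the query above):

-- Re-use the precomputed result without repeating the joins
SELECT customerID, num
FROM myTopCustomers
WHERE num > 50
ORDER BY num DESC;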
Almost always you'll see temporary tables used for derived tables that were expensive to create.
First a disclaimer - my job is reporting so I wind up with far more complex queries than any normal developer would. If you're writing a simple CRUD (Create Read Update Delete) application (this would be most web applications) then you really don't want to write complex queries, and you are probably doing something wrong if you need to create temporary tables.
That said, I use temporary tables in Postgres for a number of purposes, and most will translate to MySQL. I use them to break up complex queries into a series of individually understandable pieces. I use them for consistency - by generating a complex report through a series of queries, and offloading some of those queries into modules that I use in multiple places, I can make sure that different reports are consistent with each other. (And make sure that if I need to fix something, I only need to fix it once.) And, rarely, I deliberately use them to force a specific query plan. (Don't try this unless you really understand what you are doing!)
So I think temp tables are great. But that said, it is very important for you to understand that databases generally come in two flavors. The first is optimized for pumping out lots of small transactions, and the other is optimized for producing a smaller number of complex reports. The two types need to be tuned differently, and a complex report run on a transactional database runs the risk of blocking transactions (and therefore making web pages not return quickly). Therefore you generally want to avoid using one database for both purposes.
My guess is that you're writing a web application that needs a transactional database. In that case, you shouldn't use temp tables. And if you do need complex reports generated from your transactional data, a recommended best practice is to take regular (eg daily) backups, restore them on another machine, then run reports against that machine.
The best place to use temporary tables is when you need to pull a bunch of data from multiple tables, do some work on that data, and then combine everything to one result set.
In MS SQL, Temporary tables should also be used in place of cursors whenever possible because of the speed and resource impact associated with cursors.
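As a hedged T-SQL sketch of that cursor-replacement idea (table and column names are invented): instead of walking rows one by one with a cursor, stage the rows of interest in a temp table and update them with a single set-based statement.

-- Stage the rows that need work into a temp table
SELECT OrderId
INTO   #StaleOrders
FROM   Orders
WHERE  Status = 'PENDING'
  AND  CreatedAt < DATEADD(day, -30, GETDATE());

-- One set-based update instead of a row-by-row cursor loop
UPDATE o
SET    o.Status = 'EXPIRED'
FROM   Orders o
JOIN   #StaleOrders s ON s.OrderId = o.OrderId;

DROP TABLE #StaleOrders;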
If you are new to databases, there are some good books by Joe Celko that review best practices for ANSI SQL. SQL for Smarties describes in great detail the use of temp tables, the impact of indexes, WHERE clauses, etc. It's a great reference book with in-depth detail.
I've used them in the past when I needed to create evaluated data. That was before the time of views and subselects in MySQL, though, and I generally use those now where I would have needed a temporary table. The only time I might still use them is if the evaluated data took a long time to create.
I haven't done them in MySQL, but I've done them on other databases (Oracle, SQL Server, etc).
Among other tasks, temporary tables provide a way for you to create a queryable (and returnable, say from a sproc) dataset that's purpose-built. Let's say you have several tables of figures -- you can use a temporary table to roll those figures up to nice, clean totals (or other math), then join that temp table to others in your schema for final output. (An example of this, in one of my projects, is calculating how many scheduled calls a given sales-related employee must make per week, bi-weekly, monthly, etc.)
I also often use them as a means of "tilting" the data -- turning columns to rows, etc. They're good for advanced data processing -- but only use them when you need to. (My golden rule, as always, applies: If you don't know why you're using x, and you don't know how x works, then you probably shouldn't use it.)
Generally, I wind up using them most in sprocs, where complex data processing is needed. I'd love to give a concrete example, but mine would be in T-SQL (as opposed to MySQL's more standard SQL), and also they're all client/production code which I can't share. I'm sure someone else here on SO will pick up and provide some genuine sample code; this was just to help you get the gist of what problem domain temp tables address.