Fairly new to SQL. You can order records returned with something like:
SELECT * FROM MyTable ORDER BY SalesPrice;
Without physically extracting all records, clearing the table, and then rewriting them in some customised order, is there a way to ask the database driver to reorder the physical data in the database table by some important column or key (e.g. "SalesPrice")?
This would not be desirable for tables with many records or where the data changes regularly, but may make more sense as a maintenance action where the table is relatively static.
Using DAO and ODBC, though I'm hoping for a general answer that might apply to a variety of databases.
SQL tables represent unordered sets. Period. Well, there is one nuance to that: clustered indexes. With a clustered index you can arrange to have the data stored in a particular order. However, that does not guarantee that select * from table will return the results in that order.
If you are concerned about the performance of such a query, you want an index on (salesprice). Then use order by salesprice in the query.
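As a minimal sketch of the advice above, the following uses Python's built-in sqlite3 module as a stand-in for whatever database sits behind DAO/ODBC (an assumption; the question does not name the engine). The index name idx_mytable_salesprice and the Id column are hypothetical. The point is that the order comes from ORDER BY, and the index only helps the engine produce that order cheaply.

```python
import sqlite3

def ordered_prices(prices):
    """Return SalesPrice values via ORDER BY, plus the plan SQLite reports."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, SalesPrice REAL)")
    conn.executemany(
        "INSERT INTO MyTable (SalesPrice) VALUES (?)", [(p,) for p in prices]
    )
    # Hypothetical index name; it gives the engine a cheap way to read by price.
    conn.execute("CREATE INDEX idx_mytable_salesprice ON MyTable (SalesPrice)")
    query = "SELECT SalesPrice FROM MyTable ORDER BY SalesPrice"
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
    rows = [row[0] for row in conn.execute(query)]
    conn.close()
    return rows, plan

rows, plan = ordered_prices([19.99, 5.0, 12.5])
print(rows)  # sorted because of ORDER BY, not because of storage order
print(plan)  # plan text varies by SQLite version
```

The rows are inserted in arbitrary order and never physically rearranged; only the query asks for an order.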
Related
I have a table with about 20-25 million records. I have to put them into another table, filtered by some condition and also sorted. Example:
Create table X AS
select * from Y
where item <> 'ABC'
Order By id;
I know that ORDER BY uses a single reducer to guarantee total order in the output.
I need an optimized way to do the sorting for the above query.
SQL tables represent unordered sets. This is especially true in parallel databases where the data is spread among multiple processors.
That said, Hive does support clustered indexes (which essentially define partitions) and sorting within the partitions. The documentation is quite specific, though, that this is not supported with CREATE TABLE AS:
CTAS has these restrictions:
The target table cannot be a partitioned table.
You could do what you want by exporting the data and re-importing it.
However, I would suggest that you figure out what you really need without requiring the data to be ordered within the database.
So I have a table in a SQL database. I've performed sort operations and fetched a certain number of records, and now my objective is to insert these values into a new database. How can I do this, considering the sort operations? Would it be easier to export the whole table and then perform the sort in the new database? I am using SQL Server 2005.
It doesn't matter in what order you insert the data into the new table. If you put a clustered index on the table, the data will be stored in the order of the index, regardless of the order in which it was inserted. (And note that storing it in this order is no guarantee that your query results will come out in this order.)
The only instance where the order of inserts might matter is if you have an IDENTITY column on the table, and are allowing it to auto-populate. Then the IDENTITY will increment in the order that the rows are inserted.
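The IDENTITY point can be illustrated with a small sketch. It uses Python's sqlite3 module, where an AUTOINCREMENT column plays the role of a SQL Server IDENTITY column (an assumption made so the example runs anywhere); the Source and Target tables are hypothetical.

```python
import sqlite3

def copy_in_price_order(rows):
    """Copy rows into a table with an auto-populated key, sorted by price."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Source (Name TEXT, SalesPrice REAL)")
    conn.executemany("INSERT INTO Source VALUES (?, ?)", rows)
    # AUTOINCREMENT stands in for IDENTITY: keys are handed out per inserted row.
    conn.execute(
        "CREATE TABLE Target ("
        "Id INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT, SalesPrice REAL)"
    )
    # The ORDER BY here decides which row receives which generated Id.
    conn.execute(
        "INSERT INTO Target (Name, SalesPrice) "
        "SELECT Name, SalesPrice FROM Source ORDER BY SalesPrice"
    )
    result = conn.execute("SELECT Id, Name FROM Target ORDER BY Id").fetchall()
    conn.close()
    return result

print(copy_in_price_order([("b", 2.0), ("c", 3.0), ("a", 1.0)]))
```

Reading the rows back in a particular order still requires an ORDER BY on the generated key.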
That said, treating your question as academic, if you did want to export data in a certain order from one database to another, you could use an SSIS dataflow and specify an ORDER BY in the query in the Source component.
I have multiple tables that data can be queried from with joins.
In regards to database performance:
Should I run multiple selects from multiple tables for the required data?
or
Should I write 1 select that uses a bunch of Joins to select the required data from all the tables at once?
EDIT:
The where clause I will be using for the select contains Indexed fields of the tables. It sounds like because of this, it will be faster to use 1 select statement with many joins. I will however still test the performance difference between the 2.
Thanks for all the great answers.
Just write one query with joins. If you are concerned about performance there are a number of options including:
Creating indexes that will help the performance of your selects
Creating a persisted denormalized form of the data you want so you can query one table. This would most likely be an indexed view or another table.
This can be one of those, well-gee-it-depends, but generally if you're writing straight SQL do one query--especially since the joins might limit some of the data you get back.
If you instead run multiple point queries, fetching one record from each table by its primary key, there is a good chance that the connection cost of each query will exceed the cost of the query itself.
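To make the trade-off concrete, here is a minimal sketch using Python's sqlite3 module. The Customers and Orders tables are hypothetical, and the statement counter is only a rough proxy for round trips to a real server. Both approaches return the same rows; the join needs one statement, the point-query approach needs one per order plus one.

```python
import sqlite3

def build_db():
    """Create two small hypothetical tables linked by CustomerId."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY,
                             CustomerId INTEGER, Total REAL);
        INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO Orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0);
    """)
    return conn

def with_join(conn):
    """One statement: the engine combines both tables."""
    sql = """SELECT o.OrderId, c.Name, o.Total
             FROM Orders o
             JOIN Customers c ON c.CustomerId = o.CustomerId
             ORDER BY o.OrderId"""
    return conn.execute(sql).fetchall(), 1

def with_point_queries(conn):
    """One statement for the orders, then one primary-key lookup per order."""
    statements = 1
    result = []
    orders = conn.execute(
        "SELECT OrderId, CustomerId, Total FROM Orders ORDER BY OrderId"
    ).fetchall()
    for order_id, customer_id, total in orders:
        name = conn.execute(
            "SELECT Name FROM Customers WHERE CustomerId = ?", (customer_id,)
        ).fetchone()[0]
        statements += 1
        result.append((order_id, name, total))
    return result, statements

conn = build_db()
print(with_join(conn))
print(with_point_queries(conn))
```

Against a networked database each extra statement also pays latency, which an in-memory example cannot show.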
It depends on how the tables are joined. If you do a cross-product of all tables, then it would be better to do individual selects. However, if your tables are properly indexed and well thought out, one query with multiple joins would be more efficient.
If you have proper indexes on your tables, you might be better off with the JOINs, but they are often the cause of bottlenecks. Instead of multiple selects, you might look at ways to de-normalize your data. It is often far less "expensive" to update a count or timestamp in multiple tables when a user performs an operation than to join those tables on every read.
The best tool I find for performance tuning of queries is EXPLAIN. You type EXPLAIN before the query and you can see how many rows are scanned. The lower the number the better, since a low number means your indexes are working. The other thing is, when creating indexes, consider compound indexes on multiple fields. The order of the columns in the index matters (the order of the conditions in the WHERE clause does not): columns compared for equality generally belong first, from left to right.
For example, say you have 10,000 rows in sometable:
SELECT id, name, description, status FROM sometable WHERE name LIKE '%someName%' AND status = 'Active';
You could type EXPLAIN before the query and it might report 10,000 as the number of rows scanned to find a match. You then create a compound index:
ALTER TABLE sometable ADD INDEX idx_st_search (name, status);
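You then run EXPLAIN on the query again, and it might report far fewer rows scanned, with performance significantly improved. Note that a pattern with a leading wildcard such as '%someName%' cannot be used to seek into the index on name; a large reduction is realistic only when the pattern is anchored at the start (for example 'someName%') or when status, the equality column, leads the index.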
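The before/after comparison can be sketched as follows. This is an illustration under assumptions, not the exact MySQL session described above: it uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, an equality predicate on name instead of the LIKE pattern, and an index with status as the leading column.

```python
import sqlite3

def plan_details(conn, sql):
    """Return the readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def compare_plans():
    """Show the plan for the same query before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sometable ("
        "id INTEGER PRIMARY KEY, name TEXT, description TEXT, status TEXT)"
    )
    query = (
        "SELECT id, name, description, status FROM sometable "
        "WHERE status = 'Active' AND name = 'someName'"
    )
    before = plan_details(conn, query)  # no usable index: full table scan
    # Assumed column order: both columns are equality tests in this query.
    conn.execute("CREATE INDEX idx_st_search ON sometable (status, name)")
    after = plan_details(conn, query)   # the index can now be searched
    conn.close()
    return before, after

before, after = compare_plans()
print("before:", before)
print("after: ", after)
```

SQLite reports a scan versus a search rather than a row count, but the reading is the same: a search on the index means far fewer rows are examined.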
It depends on your table designs.
Most of the time one large query is better, but be sure to:
Use primary keys in the join conditions and WHERE clause as much as you can.
Use indexed fields, or create indexes, for the fields used in WHERE clauses.
Let's say we have:
SELECT *
FROM Pictures
JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
WHERE Pictures.UserId = @UserId
ORDER BY Pictures.UploadDate DESC
In this case, the database first joins the two tables and then works on the derived table, which I think would mean the indexes on the individual tables would be of no use, unless you can come up with an index that is bound to some column in the derived table?
You have a fundamental misunderstanding of how SQL works. The SQL language specifies what result set should be returned. It says nothing about how the database should achieve those results.
It is up to the database engine to parse the statement and come up with an execution plan (hopefully an efficient one) that will produce the correct results. Many modern relational databases have sophisticated query optimizers that completely pull apart the statement and derive execution plans that seem to have no relationship with the original query (at least not to the untrained eye).
The execution plan for the same query can even change over time if the engine uses a cost based optimizer. A cost based optimizer makes decisions based on statistics that have been gathered about data and indexes. As the statistics change, the execution plan can also change.
With your simple query you assume that the database has to join the tables and create a temporary result set before it applies the where clause. That might be how you think about the problem, but the database is free to implement it entirely differently. I doubt there are many (if any) databases that would create a temporary result set for your simple query.
This is not to say that you cannot ever predict when an index may or may not be used. But it takes practice and experience to get a feel for how a database might execute a query.
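One way to build that feel is to ask the engine for its plan. The sketch below runs the query from the question through SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module. The table layout, the PictureId column, and the index idx_pictures_user_date are assumptions made for illustration, and other engines will show different plans.

```python
import sqlite3

def picture_query_plan():
    """Return SQLite's plan for the Pictures/Categories join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Categories (CategoryId INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Pictures (PictureId INTEGER PRIMARY KEY,
                               CategoryId INTEGER,
                               UserId INTEGER,
                               UploadDate TEXT);
        -- Hypothetical index covering the filter and the sort column.
        CREATE INDEX idx_pictures_user_date ON Pictures (UserId, UploadDate);
    """)
    sql = """EXPLAIN QUERY PLAN
             SELECT *
             FROM Pictures
             JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
             WHERE Pictures.UserId = ?
             ORDER BY Pictures.UploadDate DESC"""
    plan = [row[-1] for row in conn.execute(sql, (42,))]
    conn.close()
    return plan

for step in picture_query_plan():
    print(step)
```

In this setup the plan references the index on Pictures directly, so the engine does not need to materialize the join before filtering.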
This will join the tables, giving you all the category information whenever a picture's CategoryId matches a CategoryId in the Categories table (and no result for a particular picture if there is no such category).
This query will likely return several rows of data. The indexes of either table will be useful no matter which table you would like to access.
Normally your program would loop through the result set.
CategoryId will give you the row in Categories with all the relevant fields in that Category and 'Picture.Id' (assuming there is such a field) will give you a reference to that exact picture row in the database.
You can then manipulate either table by using the relevant index
"UPDATE Categories SET .... WHERE CategoryId = " +
"UPDATE Pictures ..... WHERE PictureId =" +
or some such depending on your programming environment.
Whether an index is used is up to the optimizer, and depends on what is occurring in the query. For the query posted, there's nothing obvious to stop an index from being used. However, not all databases operate the same way: MySQL has traditionally used at most one index per table in a SELECT (check the query plan, because the optimizer might interpret the JOIN so that another index may be used).
The things most likely to prevent an index from being used are functions or operations applied to the indexed column, for example extracting the month from a date, or wildcarding the left side of a LIKE pattern.
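A small sketch of the date case, using SQLite through Python's sqlite3 module (an assumption; the Events table and idx_events_date index are hypothetical). Wrapping the column in a function forces a scan, while an equivalent range condition on the bare column lets the index be searched.

```python
import sqlite3

def plans_for_month_filter():
    """Compare plans for a function-wrapped filter and a plain range filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Events (Id INTEGER PRIMARY KEY, Title TEXT, EventDate TEXT)"
    )
    conn.execute("CREATE INDEX idx_events_date ON Events (EventDate)")

    def plan(sql):
        return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

    # The function hides the column value from the index.
    wrapped = plan("SELECT * FROM Events WHERE strftime('%m', EventDate) = '05'")
    # A range on the bare column (here May 2024 only) can seek into the index.
    ranged = plan(
        "SELECT * FROM Events "
        "WHERE EventDate >= '2024-05-01' AND EventDate < '2024-06-01'"
    )
    conn.close()
    return wrapped, ranged

wrapped, ranged = plans_for_month_filter()
print("function on column:", wrapped)
print("range on column:   ", ranged)
```

The two filters are not strictly equivalent (the first matches May of any year), so a rewrite like this has to respect what the query actually needs.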
I have a SQL query which fetches the first N rows in a table which is designed as a low-level queue.
select top N * from my_table where status = 0 order by date asc
The intention behind this query is as follows:
First, this question is intended to be database agnostic, as my implementation will support sql server, oracle, DB2 and sybase. The sql syntax above of "top N" is just an example.
The table can contain millions of rows.
N is a relatively small number in comparison, e.g. 100.
status is 0 when the row is in the queue. Later it is changed to 1 to indicate that it is in processing. After processing it is deleted. So it is expected that at least 90% of the rows in the table will be with status 0.
rows in the table should be fetched according to their date, hence the order by clause.
What is the optimal index to make this query run fastest?
I initially thought the index should be on (date, status), but I am not sure about it anymore. Since the status column will contain mostly zeros, is there an added-value to it? Will it be sufficient to index by (date) alone?
Or maybe it should be (status, date)?
I don't think there is an efficient solution that will be RDBMS independent. For example, Oracle has bitmap indexes, SQL Server has filtered (partial) indexes, and I don't see reasons not to use them if, for instance, MySQL or SQLite has nothing similar. Also, historically SQL Server implements clustered tables (or IOTs in the Oracle world) way better than Oracle does, so having a clustered index on the date column may work perfectly for SQL Server, but not for Oracle.
I'd rather change the approach a bit. If you say 90% of rows don't satisfy the status=0 condition, why not try refactoring the schema and adding a new table (or materialized view) that holds only the records you are interested in? The number of new programmable objects required for keeping that table up to date and merging data with the original table is relatively small, even if the RDBMS doesn't support materialized views directly. Also, if it's possible to redesign the underlying logic so that rows are never updated, only inserted or deleted, then it will help avoid lock contention, and as a result the whole system will perform better.
Have a clustered index on Date and a non-clustered index on Status.
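Since the best choice differs between engines, one practical approach is to try each candidate index from the question and read the plan. The sketch below does this with SQLite through Python's sqlite3 module, which is only a stand-in for SQL Server, Oracle, DB2 and Sybase (an assumption); LIMIT 100 replaces "top N", and idx_queue is a hypothetical name.

```python
import sqlite3

def queue_plan(index_columns):
    """Return SQLite's plan for the queue query given one candidate index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (id INTEGER PRIMARY KEY, status INTEGER, date TEXT)"
    )
    # index_columns is a column list such as "status, date" or "date".
    conn.execute(f"CREATE INDEX idx_queue ON my_table ({index_columns})")
    sql = (
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM my_table WHERE status = 0 ORDER BY date ASC LIMIT 100"
    )
    plan = [row[-1] for row in conn.execute(sql)]
    conn.close()
    return plan

for candidate in ("status, date", "date, status", "date"):
    print(candidate, "->", queue_plan(candidate))
```

With (status, date), SQLite can seek to status = 0 and read rows already in date order, so no separate sort step appears in the plan. Whether that beats an index on (date) alone when most rows have status 0 is something to measure on each target engine with realistic data.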