I have a huge table of more than 10 million rows, and I need to efficiently grab a random sample of 5000 of them. I have some constraints that reduce the total rows I am looking at to about 9 million.
I tried using ORDER BY NEWID(), but that query takes too long because it has to scan every row in the table.
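Roughly, the attempt looked like this (table name and filters are placeholders):

SELECT TOP 5000 *
FROM my_big_table
WHERE <constraints>   -- the filters that cut it down to ~9 million rows
ORDER BY NEWID();     -- sorts every qualifying row by a fresh GUID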
Is there a faster way to do this?
If a pseudo-random sample is acceptable and you're on SQL Server 2005/2008, take a look at TABLESAMPLE. For instance, here is an example from SQL Server 2008 / AdventureWorks 2008 that samples by row count:
USE AdventureWorks2008;
GO
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (100 ROWS)
WHERE EmailPromotion = 2;
The catch is that TABLESAMPLE isn't exactly random, as it samples whole physical pages rather than individual rows. You also may not get back exactly 5000 rows unless you limit with TOP as well. If you're on SQL Server 2000, you'll have to either generate a temporary table that matches the primary key, or fall back to a method based on NEWID().
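For example, a sketch combining the two to land on exactly 5000 rows (the oversample size here is a guess you may need to tune):

SELECT TOP (5000) FirstName, LastName
FROM Person.Person
TABLESAMPLE (20000 ROWS)   -- oversample, since the sampled row count is only approximate
WHERE EmailPromotion = 2;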
Have you looked into using the TABLESAMPLE clause?
For example:
select *
from HumanResources.Department tablesample (5 percent)
SQL Server 2000 solution, according to Microsoft (instead of slow NEWID() on large tables):
SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS int)) % 100) < 10
The SQL Server team at Microsoft realized that not being able to take random samples of rows easily was a common problem in SQL Server 2000; so, the team addressed the problem in SQL Server 2005 by introducing the TABLESAMPLE clause. This clause selects a subset of rows by choosing random data pages and returning all of the rows on those pages. However, for those of us who still have products that run on SQL Server 2000 and need backward compatibility, or who need truly row-level randomness, the BINARY_CHECKSUM query is a very effective workaround.
Explanation can be found here:
http://msdn.microsoft.com/en-us/library/cc441928.aspx
Yeah, tablesample is your friend (note that it's not random in the statistical sense of the word):
Tablesample at msdn
I am transitioning from SQL Server to BigQuery and noticed that in BigQuery the TOP function is only allowed in aggregate queries. Therefore the code below would not work:
SELECT TOP 5 * FROM TABLE
This is a habit I've developed when learning new tables and getting more information about the data. Is there an alternative for selecting a few rows from a table? The following select-all query works, but it is incredibly inefficient and takes a long time to run on large tables:
SELECT * FROM TABLE
In BigQuery, you can use LIMIT as in:
SELECT t.*
FROM TABLE t
LIMIT 5;
But I caution you to be very careful with this. BigQuery charges by the amount of data scanned in the columns you access, not by the number of rows returned. So, on a large table, such a query can be quite expensive.
You can also go into the BigQuery GUI, navigate to the table, and click on "Preview". The preview functionality is free.
As Gordon Linoff mentioned, using the LIMIT clause in BigQuery may be very expensive when used with big tables. To make exploratory queries more cost-effective, BigQuery now supports the TABLESAMPLE operator; see also Using table sampling.
Sampling returns a variety of records while avoiding the costs associated with scanning and processing an entire table.
Query example:
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (2 PERCENT)
If you are querying views, for example, or TABLESAMPLE SYSTEM is not working for other reasons, what you can do is use e.g. [...] WHERE RAND() < 0.05 to get 5% of the results randomly selected. Make sure to put it at the end of your query, in the WHERE clause.
This also works with views and when you are not the owner of the table. :)
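A minimal sketch, assuming a view named dataset.my_view:

SELECT *
FROM dataset.my_view
WHERE RAND() < 0.05;   -- keeps each row with roughly 5% probability; the full table is still scanned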
I am new to SQL, and I have a large table with several hundred rows whose contents I need to view in full. Is there a command in SQL that acts like the less command in Linux, allowing me to step through the output of a SELECT statement one screen at a time? The pseudo-code for what I'm after would be, for example:
SELECT * from table less
What you're looking for is called "paging" or "pagination"
In MySQL and Postgres it's LIMIT n OFFSET m: https://www.postgresql.org/docs/8.3/static/queries-limit.html
In SQL Server it's OFFSET m FETCH NEXT n: https://technet.microsoft.com/en-us/library/gg699618(v=sql.110).aspx
This QA has a more thorough answer: How universal is the LIMIT statement in SQL?
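For example, a sketch of fetching rows 21-30 in each dialect (my_table and id are placeholders):

-- MySQL / PostgreSQL
SELECT * FROM my_table ORDER BY id LIMIT 10 OFFSET 20;

-- SQL Server 2012+
SELECT * FROM my_table ORDER BY id OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;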
SQL Server supports TOP N:
SELECT TOP 10 * FROM table
Certain other dialects of SQL like SQLite support LIMIT
In DB2, you need to use SELECT * FROM table FETCH FIRST 10 ROWS ONLY
I have a very simple query on a table with 60 million rows:
select id, max(version) from mytable group by id
It returns 6 million records and takes more than one hour to run. I just need to run it once because I am transferring the records to another new table that I keep updated.
I tried a few things that didn't work for me but that are often suggested here on Stack Overflow:
inner query with select top 1 / order by desc: it is not supported in Sybase ASE
left outer join where a.version < b.version and b.version is null (sketched below): I interrupted the query after more than an hour, and barely a hundred thousand records had been found
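The join variant looked roughly like this:

SELECT a.id, a.version
FROM mytable a
LEFT OUTER JOIN mytable b
  ON a.id = b.id AND a.version < b.version
WHERE b.version IS NULL;   -- keep a row only if no higher version exists for its id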
I understand that Sybase has to do a full scan.
Why could the full scan be so slow?
Is the slowness due to the Sybase ASE instance itself or specific to the query?
What are my options to reduce the running time of the query?
I am not intimately familiar with Sybase optimization. However, your query is really slow. Here are two ideas.
First, add an index on mytable(id, version desc). At a minimum, this is a covering index for the query, meaning that all the columns it uses appear in the index. Sybase is probably smart enough to satisfy the group by with an ordered scan of the index.
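A minimal sketch of that index (the name is a placeholder; syntax may vary by ASE version):

CREATE INDEX ix_mytable_id_version
ON mytable (id, version DESC);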
Another option uses the same index, but with a correlated subquery:
select t.id
from mytable t
where t.version = (select max(t2.version)
from mytable t2
where t2.id = t.id
);
This would be a full table scan (a little expensive but not an hour's worth) and an index lookup on each row (pretty cheap). The advantage of this approach is that you can select all the columns you want. The disadvantage is that if two rows have the same maximum version for an id, you will get both in the result set.
Edit: Here, Nicolas, is a more precise answer. I have no particular experience with Sybase, but I did gain experience working with tons of data on a fairly small SQL Server machine. From that experience I learned that when you work with a large amount of data and the server doesn't have enough memory to handle it, you hit bottlenecks (I guess it takes time to write the intermediate results to disk). I think that is your case (60 million rows), but again, I don't know Sybase, and it depends on many factors, such as how many columns mytable has, how much RAM the server has, etc.
Here are the results of a small experiment I just ran:
I ran the two queries below on SQL Server and PostgreSQL.
Query 1:
SELECT id, max(version)
FROM mytable
GROUP BY id
Query 2:
SELECT id, version
FROM
(
SELECT id, version, ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) as RN
FROM mytable
) q
WHERE q.rn = 1
On PostgreSQL, mytable has 2,878,441 rows.
Query 1 takes 31.458 sec and returns 1,200,146 rows.
Query 2 takes 41.787 sec and returns 1,200,146 rows.
On SQL Server, mytable has 1,600,010 rows.
Query 1 takes 6 sec and returns 537,232 rows.
Query 2 takes 10 sec and returns 537,232 rows.
So far, your query is always faster. So I tried bigger tables.
On PostgreSQL, mytable now has 5,875,134 rows.
Query 1 takes 100.915 sec and returns 2,796,800 rows.
Query 2 takes 98.805 sec and returns 2,796,800 rows.
On SQL Server, mytable now has 11,712,606 rows.
Query 1 takes 28 min 28 sec and returns 6,262,778 rows.
Query 2 takes 2 min 39 sec and returns 6,262,778 rows.
Now we can form a hypothesis. In the first part of the experiment, both servers had enough memory to deal with the data, so GROUP BY was faster. The second part of the experiment suggests that too much data kills GROUP BY performance; ROW_NUMBER() seems to avoid that bottleneck.
Caveats: I don't have a bigger table on PostgreSQL, nor do I have a Sybase server at hand.
For this experiment I was using PostgreSQL 9.3.5 on x86_64 and SQL Server 2012 - 11.0.2100.60 (X64)
Maybe, Nicolas, this experiment will help you.
So, in the end, the nonclustered index on (id, version desc) did the trick without requiring any change to the query. Creating the index also takes an hour, but then the query responds in a few seconds. I guess that's still better than maintaining another table that could cause data integrity issues.
The max() function does not help the optimizer use the index.
Perhaps you should create a function-based index on max(version):
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc32300.1550/html/sqlug/CHDDHJIB.htm
The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020?
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that SELECT * is considered poor practice in many circles; they want you to specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID, for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>, <Column Name2>, ...
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
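As a hedged illustration, with a made-up function name and boundaries, a range partition function and a query against its second partition might look like:

-- A partition scheme and table mapping are also required; omitted here for brevity.
CREATE PARTITION FUNCTION pf_id_range (int)
AS RANGE LEFT FOR VALUES (1000000, 2000000, 3000000);

SELECT *
FROM my_table
WHERE $PARTITION.pf_id_range(ID) = 2;   -- IDs from 1,000,001 to 2,000,000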
Take a look at the following white paper for more detailed information on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps; please feel free to pose further questions.
Cheers, John
I use wrapper queries to select around the core query and then isolate just the ROW numbers that I wish to take from it - this lets the SQL server do all the heavy lifting inside the CORE query and pass out only the small slice of the table that I have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are the aliases of the derived tables that serve as the wrappers.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
-- CORE QUERY START
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
-- CORE QUERY END
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query, as fetching unnecessary data in these CORE queries can cost you serious overhead.
Use TOP to select only a limited amount of rows, like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
For this to be efficient, there has to be an index on the ID column.
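A minimal sketch of both pieces (the index name is a placeholder):

CREATE INDEX IX_my_table_ID ON my_table (ID);

SELECT TOP 10 *
FROM my_table
WHERE ID >= 300010
ORDER BY ID;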
When I worked on the Zend Framework's database component, we tried to abstract the functionality of the LIMIT clause supported by MySQL, PostgreSQL, and SQLite. That is, creating a query could be done this way:
$select = $db->select();
$select->from('mytable');
$select->order('somecolumn');
$select->limit(10, 20);
When the database supports LIMIT, this produces an SQL query like the following:
SELECT * FROM mytable ORDER BY somecolumn LIMIT 10, 20
This was more complex for brands of database that don't support LIMIT (which, by the way, is not part of the standard SQL language). If you can generate row numbers, you can make the whole query a derived table and use BETWEEN in the outer query. This was the solution for Oracle and IBM DB2. Microsoft SQL Server 2005 has a similar row-number function, so one can write the query this way:
SELECT z2.*
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY id) AS zend_db_rownum, z1.*
FROM ( ...original SQL query... ) z1
) z2
WHERE z2.zend_db_rownum BETWEEN #offset+1 AND #offset+#count;
However, Microsoft SQL Server 2000 doesn't have the ROW_NUMBER() function.
So my question is, can you come up with a way to emulate the LIMIT functionality in Microsoft SQL Server 2000, solely using SQL? Without using cursors or T-SQL or a stored procedure. It has to support both arguments for LIMIT, both count and offset. Solutions using a temporary table are also not acceptable.
Edit:
The most common solution for MS SQL Server 2000 seems to be like the one below, for example to get rows 50 through 75:
SELECT TOP 25 *
FROM (
SELECT TOP 75 *
FROM table
ORDER BY field ASC
) a
ORDER BY field DESC;
However, this doesn't work if the total result set is, say, 60 rows. The inner query returns all 60 rows, because they're all within the top 75. Then the outer query returns rows 36-60, which doesn't fit in the desired "page" of 50-75. Basically, this solution works unless you need the last "page" of a result set whose size doesn't happen to be a multiple of the page size.
Edit:
Another solution works better, but only if you can assume the result set includes a column that is unique:
SELECT TOP n *
FROM tablename
WHERE key NOT IN (
SELECT TOP x key
FROM tablename
ORDER BY key
);
Conclusion:
No general-purpose solution seems to exist for emulating LIMIT in MS SQL Server 2000. A good solution exists if you can use the ROW_NUMBER() function in MS SQL Server 2005.
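For instance, a sketch of fetching rows 51-75 on SQL Server 2005 (table and sort key are placeholders):

SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, t.*
    FROM mytable t
) numbered
WHERE rn BETWEEN 51 AND 75;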
Here is another solution, which only works in SQL Server 2005 and newer because it uses the EXCEPT statement. But I share it anyway.
If you want to get records 50-75, write:
select * from (
SELECT top 75 COL1, COL2
FROM MYTABLE order by COL3
) as foo
except
select * from (
SELECT top 50 COL1, COL2
FROM MYTABLE order by COL3
) as bar
SELECT TOP n *
FROM tablename
WHERE key NOT IN (
SELECT TOP x key
FROM tablename
ORDER BY key DESC
);
When you need LIMIT only, MS SQL has the equivalent TOP keyword, so that is clear.
When you need LIMIT with OFFSET, you can try some hacks like those described above, but they all add overhead: e.g. ordering one way and then the other, or the expensive NOT IN operation.
I think all those cascades are unneeded. The cleanest solution, in my opinion, is to just use TOP without an offset on the SQL side, and then seek to the required starting record with the appropriate client method, like mssql_data_seek in PHP. While this isn't a pure SQL solution, I think it is the best one because it doesn't add any overhead (the skipped-over records will not be transferred over the network when you seek past them, if that is what worries you).
I would try to implement this in my ORM, as it is pretty simple there. If it really needs to be in SQL Server, then I would look at the code generated by LINQ to SQL for the following LINQ to SQL statement and go from there. The MSFT engineer who implemented that code was part of the SQL team for many years and knew what he was doing.
var result = myDataContext.mytable.Skip(pageIndex * pageSize).Take(pageSize);