SQL Wildcard Search - Efficiency? - sql

There has been a debate at work recently about the most efficient way to search an MS SQL database using LIKE and wildcards. We are comparing %abc%, %abc, and abc%. One person has said that you should always have the wildcard at the end of the term (abc%). So, according to them, if we wanted to find something that ended in "abc", it would be most efficient to use `reverse(column) LIKE reverse('%abc')`.
I set up a test using SQL Server 2008 (R2) to compare each of the following statements:
select * from CLMASTER where ADDRESS like '%STREET'
select * from CLMASTER where ADDRESS like '%STREET%'
select * from CLMASTER where ADDRESS like reverse('TEERTS%')
select * from CLMASTER where reverse(ADDRESS) like reverse('%STREET')
CLMASTER holds about 500,000 records. About 7,400 addresses end in "Street", and about 8,500 addresses contain "Street" somewhere, not necessarily at the end. Each test run took 2 seconds, and they all returned the same number of rows except for %STREET%, which found an extra 900 or so results because it picked up addresses that had an apartment number on the end.
Since the SQL Server test didn't show any difference in execution time I moved into PHP where I used the following code, switching in each statement, to run multiple tests quickly:
<?php
require_once("config.php");
$connection = odbc_connect($connection_string, $U, $P);
$totaltime = array();
for ($i = 0; $i < 500; $i++) {
    // microtime(true) returns the current Unix time in seconds as a float.
    $starttime = microtime(true);
    $Message = odbc_exec($connection, "select * from CLMASTER where ADDRESS like '%STREET%'");
    $Message = odbc_result($Message, 1);
    $endtime = microtime(true);
    $totaltime[] = ($endtime - $starttime);
}
odbc_close($connection);
echo "<b>Test took an average of:</b> ".round(array_sum($totaltime)/count($totaltime),8)." seconds per run.<br>";
echo "<b>Test took a total of:</b> ".round(array_sum($totaltime),8)." seconds to run.<br>";
?>
The results of this test were about as ambiguous as the results from testing in SQL Server.
%STREET completed in 166.5823 seconds (.3331 average per query), and averaged 500 results found in .0228.
%STREET% completed in 149.4500 seconds (.2989 average per query), and averaged 500 results found in .0177. (Faster time per result because it finds more results than the others, in similar time.)
reverse(ADDRESS) like reverse('%STREET') completed in 134.0115 seconds (.2680 average per query), and averaged 500 results found in .0183 seconds.
reverse('TEERTS%') completed in 167.6960 seconds (.3354 average per query), and averaged 500 results found in .0229.
We expected this test to show that %STREET% would be the slowest overall, but it was actually faster than %STREET and had the best average time to return 500 results. The suggested reverse(ADDRESS) like reverse('%STREET') was the fastest to run overall, but was a little slower in time to return 500 results.
Extra fun: A coworker ran Profiler on the server while we were running the tests and found that the double wildcard produced a significant increase in CPU usage, while the other tests were within 1-2% of each other.
Are there any SQL efficiency experts out there who can explain why having the wildcard at the end of the search string would be better practice than having it at the beginning, and perhaps why searching with wildcards at the beginning and end of the string was faster than having the wildcard just at the beginning?

Having the wildcard at the end of the string, like 'abc%', would help if that column were indexed, as it would be able to seek directly to the records which start with 'abc' and ignore everything else. Having the wild card at the beginning means it has to look at every row, regardless of indexing.
Good article here with more explanation.
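The difference between a seek and a scan can be sketched outside the database. The following is a minimal illustration in Python, not SQL Server's actual implementation: a sorted list stands in for an index on ADDRESS, and the sample addresses and the function names `prefix_seek` and `suffix_scan` are made up for this example.

```python
import bisect

# Hypothetical data standing in for an indexed ADDRESS column.
# An index keeps its keys in sorted order, so we sort the list.
addresses = sorted([
    "1 MAIN STREET", "12 OAK AVENUE", "12 OAK STREET",
    "7 ELM STREET APT 4", "9 PINE ROAD", "99 STREET LANE",
])

def prefix_seek(sorted_keys, prefix):
    """Emulate LIKE 'prefix%': binary-search to the first candidate,
    then read a contiguous range and stop at the first non-match."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches

def suffix_scan(keys, suffix):
    """Emulate LIKE '%suffix': sort order does not help, so every key is examined."""
    return [key for key in keys if key.endswith(suffix)]

print(prefix_seek(addresses, "12 OAK"))
print(suffix_scan(addresses, "STREET"))
```

The prefix search touches only the matching range plus one extra key, while the suffix search has to look at every entry no matter how the list is ordered.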

Only wildcards at the end of a Like character string will use an index.
You should look at using FTS Contains if you want to improve speed of wildcards at the front and back of a character string. Also see this related SO post regarding Contains versus Like.

According to Microsoft, it is more efficient to use only the closing wildcard because the query can then use an index, if one exists, rather than performing a scan. Think about how the search might work: if you have no idea what comes before the search term, you have to scan everything, but if the wildcard is only at the tail end, the rows can be ordered and you can even (depending on what you're looking for) do a quasi-binary search.
Some operators in joins or predicates tend to produce resource-intensive operations. The LIKE operator with a value enclosed in wildcards ("%a value%") almost always causes a table scan. This type of table scan is a very expensive operation because of the preceding wildcard. LIKE operators with only the closing wildcard can use an index because the index is part of a B+ tree, and the index is traversed by matching the string value from left to right.
So, the above quote also explains why there was a huge processor spike when running two wildcards. It completed faster only by happenstance, because there was enough horsepower to cover up the inefficiency. When trying to determine the performance of a query, look at the execution of the query rather than the resources of the server, because those can be misleading. If I have a server with enough horsepower to serve a weather vane and I'm running queries on tables as small as 500,000 rows, the results are going to be misleading.
Aside from the fact that Microsoft's documentation answers your question, when doing performance analysis, consider taking the dive into learning how to read the execution plan. It's an investment and very dry, but it will be worth it in the long run.
In short though, whoever said that a trailing-only wildcard is more efficient is correct.

In MS SQL, square brackets in a LIKE pattern match any single character from the listed set, so [ABC] matches one 'A', 'B', or 'C', not the string 'ABC'. Suppose the table name is student:
1) If you want names ending with 'A', 'B', or 'C':
select * from student where student_name like '%[ABC]'
2) If you want names starting with 'A', 'B', or 'C':
select * from student where student_name like '[ABC]%'
3) If you want names that contain 'A', 'B', or 'C' anywhere:
select * from student where student_name like '%[ABC]%'
To match the literal string 'ABC' instead, drop the brackets: '%ABC', 'ABC%', or '%ABC%'.
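The bracket syntax in these queries matches one character from a set, which a small pattern translator makes visible. This is a rough sketch in Python that covers only a subset of LIKE (%, _ and bracketed sets of literal characters); `like_to_regex` and `like` are hypothetical helper names, the sample names are invented, and matching here is case-sensitive, whereas SQL Server's behavior depends on the column's collation.

```python
import re

def like_to_regex(pattern):
    """Translate a small subset of LIKE syntax to a compiled regular expression.

    Handles %, _ and [set] with literal characters only. Ranges such as
    [A-C] and negation such as [^A] are deliberately left out of this sketch.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        closing = pattern.find("]", i + 1) if ch == "[" else -1
        if ch == "%":
            parts.append(".*")           # any run of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        elif closing > i + 1:
            # Exactly one character drawn from the bracketed set.
            parts.append("[" + re.escape(pattern[i + 1:closing]) + "]")
            i = closing
        else:
            parts.append(re.escape(ch))  # literal character
        i += 1
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """Return True when the whole value matches the LIKE-style pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

for name in ["ERICA", "ZABC", "DAVID"]:
    print(name, like(name, "%[ABC]"), like(name, "%ABC"))
```

"ERICA" matches '%[ABC]' because it ends in a single 'A', but it does not match '%ABC', which requires the three-character suffix.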

Related

Why the difference in speed between these SQL queries?

I'm currently trying to write a query to sift through our ERP DB and I noticed a very odd drop in speed when I removed a filter condition.
This is what the Query looked like before (it took less than a second to complete, typically returned anywhere from 10 to hundreds of records depending on the order and item)
SELECT TOP 1000 jobmat.job, jobmat.suffix, jobmat.item, jobmat.matl_qty,
jobmat.ref_type, jobmat.ref_num, spec.NoteContent, spec.NoteDesc,
job.ord_num, jobmat.RowPointer
FROM jobmat
INNER JOIN ObjectNotes AS obj ON obj.RefRowPointer = jobmat.RowPointer
INNER JOIN SpecificNotes AS spec ON obj.SpecificNoteToken = spec.SpecificNoteToken
INNER JOIN job ON job.job = jobmat.job AND job.suffix = jobmat.suffix
WHERE ord_num LIKE '%3766%' AND ref_type = 'P' AND
(spec.NoteDesc LIKE '%description%' OR spec.NoteContent LIKE '%COMPANY%DWG%1162402%')
And this is what I changed the WHERE clause to:
WHERE ord_num LIKE '%3766%' AND ref_type = 'P' AND
spec.NoteContent LIKE '%COMPANY%DWG%1162402%'
Running it after making this modification bumped my runtime up to about 9 seconds (it normally returns 1-3 records). Is there an obvious reason that I'm missing? I would have thought that the runtime should have been roughly the same. Any help is greatly appreciated!
Edit: I have run both versions of this query many times to test, and the runtimes are fairly consistent: <1 second for version 1, ~9 seconds for version 2.
Your original query has to find 1000 matching rows before it returns the result set. It is doing this by creating a result set based on the JOINs and then filtering based on the WHERE clause.
The condition:
(spec.NoteDesc LIKE '%description%' OR spec.NoteContent LIKE '%COMPANY%DWG%1162402%')
matches more rows than:
(spec.NoteContent LIKE '%COMPANY%DWG%1162402%')
Hence, the second version has to go through and process more rows to get 1,000 matches.
That is why the second version is slower.
If you want the second version to seem faster, you can add an ORDER BY clause. This requires reading all the data in both cases. Because the ORDER BY then takes additional time related to the number of matching rows, you'll probably see that the second is slightly faster than the first. One caveat: both will then take at least 9 seconds.
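The early-exit effect of TOP can be sketched with a plain loop. This is an illustration under simplified assumptions, not a model of SQL Server's executor: rows arrive as a stream, the two predicates are invented stand-ins for a broad and a narrow filter, and `rows_read_for_top` is a hypothetical helper.

```python
def rows_read_for_top(rows, predicate, limit):
    """Count how many rows a streaming scan reads before it has `limit` matches."""
    read = 0
    matched = 0
    for row in rows:
        read += 1
        if predicate(row):
            matched += 1
            if matched == limit:
                break  # TOP is satisfied, so the scan can stop early
    return read, matched

rows = range(1, 100_001)

def broad(n):
    """Stand-in for the OR condition: one row in ten matches."""
    return n % 10 == 0

def narrow(n):
    """Stand-in for the single LIKE: only two rows match in total."""
    return n % 50_000 == 0

print(rows_read_for_top(rows, broad, 1000))   # stops after 10,000 rows
print(rows_read_for_top(rows, narrow, 1000))  # reads all 100,000 rows
```

With the broad filter the loop stops a tenth of the way through; with the narrow filter it never reaches the limit and has to read everything.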
EDIT:
I like the above explanation, but if all the rows are processed in both cases, then it might not be correct. Here is another explanation, but it is more speculative.
It depends on NoteContent being very long, so the LIKE has a noticeable effect on query performance. The second version might be "optimized" by pushing the filtering clauses to the node in the processing that reads the data. This is something that SQL Server regularly does. That means that the LIKE is being processed on all the data in the second case.
Because of the OR condition, the first example cannot push the filtering down to that level. Then, the condition on NoteContent is short-circuited in multiple ways:
The join conditions.
The other WHERE conditions.
The first LIKE condition.
Because of all this filtering, the second condition is run on many fewer rows, and the code runs faster. Usually filtering before joining is a good idea, but there are definitely exceptions to this rule.

How to do server paging in SQL correct?

My situation: My application is slow. As slow as it gets... mostly because I have the feeling that the server paging for my dataTables / grids is wrongly implemented.
Let's start:
I have a SQL Server 2008 database, one table with all the information, 10 columns in it, at the moment 19K rows
My application is based on a JavaScript and ASP.Net backend code.
My SQL query is:
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Created DESC) AS 'RowNumber'
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
AND [xxx] LIKE '%1%'
AND [yyy] LIKE '%2%'
)
SELECT *
FROM Ordered
WHERE RowNumber BETWEEN 1 AND 41;
At the moment this query runs for around 27 to 32 seconds, and anything over 30 seconds hits a timeout. That is with 19K rows collected over 1 year, which means that within a month at the latest every query will time out.
As far as I understand, the ORDER BY in this query is the problem: there is no index for it.
The query first sorts, then selects everything with a generated row number, then selects only 40 rows (and of course on page 2 of my grid it gets rows 41 to 81).
I COULD create an index on "Created desc" and the query would be much faster, BUT every column in my grid is sortable, which means "Created desc" could be any other column of my table, in either asc or desc order!
So, how to improve this?
//Edit:
Sorry, I forgot this:
The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds...
Which means the "WITH Ordered AS" is the problem here!
First things first: you have a performance problem, so approach it with a proper methodology and measure appropriately. "The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds... Which means ..." is amateurism. Read How to analyse SQL Server performance for the correct ways to measure performance. And before we continue: if you start from 6 seconds, you have already lost the game.
Now, on to the question.
WHERE State in('Appointed','Accepted') AND [xxx] LIKE '%1%' AND [yyy] LIKE '%2%'
This expression is basically non-indexable. Even if you add an index on State it will not help because of the low cardinality (few values with many rows each). And like '% ... %' is unindexable because it searches for values in the middle of the text.
You could try to replace like '% ... %' with a full-text search like CONTAINS ..., which will be faster provided you search for specific enough terms. But it does require you to deploy and properly configure the full-text indexes.
As for the paging, I do not much favor the ROW_NUMBER approach. Even when an index on the sort column exists, it involves a scan and count to skip the preceding rows, and it gets slower and slower as you go to higher pages. I much prefer the key-based approach:
SELECT TOP (page size) ...
WHERE keys > <last row>
ORDER BY...
but this approach is more difficult to implement as it requires keeping track of keys rather than the page number.
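A minimal sketch of that key-based approach, run against an in-memory SQLite database from Python so it is self-contained. The Meetings table is reduced to two invented columns (Id, Title), `fetch_page` is a hypothetical helper, and SQLite's LIMIT stands in for T-SQL's TOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Meetings (Id INTEGER PRIMARY KEY, Title TEXT)")
conn.executemany(
    "INSERT INTO Meetings (Id, Title) VALUES (?, ?)",
    [(i, f"Meeting {i}") for i in range(1, 101)],
)

def fetch_page(conn, last_id, page_size):
    """Return the next page: rows whose key is greater than the last key seen.

    The database can seek straight to last_id in the primary key index,
    so the cost does not grow with the page number.
    """
    return conn.execute(
        "SELECT Id, Title FROM Meetings WHERE Id > ? ORDER BY Id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = fetch_page(conn, 0, 40)
# The caller remembers the last key of the previous page, not a page number.
page2 = fetch_page(conn, page1[-1][0], 40)
print(page1[0], page1[-1])
print(page2[0], page2[-1])
```

The trade-off is the one named above: the client has to carry the last key forward, and jumping directly to an arbitrary page number is no longer straightforward.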
But expect no miracles. You are asking a relational OLTP system to do the work of an ElasticSearch/Solr. It will never work as you expect. Use a tool appropriate for the job (a Search engine). Also read Dynamic Search Conditions in T‑SQL for a more thorough discussion, but again, expect no miracles.

SQL Server: Returning different number of rows from the same source data?

I have encountered today by far the most strange behaviour of SQL Server 2008 for me so far (or maybe I was already too tired to figure out what is going on).
I have a fairly complicated statement operating on around 1,100,000 rows. My source data is in table used by the subquery and it doesn't change. The statement looks something like this (I'll include only the parts that could cause the error in my opinion):
SELECT
-- here are some columns straight from the subquery and some hardcoded columns like
'My company' AS CompanyName
-- and here is the most important part
SUM(Subq.ImportantFloatWithBigPrecision)
FROM
(
SELECT
-- Here are some simple columns fetched from table and after that:
udf.GetSomeMappedValue(Value) AS MappedValueFromFunction
,ImportantFloatWithBigPrecision
FROM MyVeryBigTable
WHERE ImportantFloatWithBigPrecision <> 0
AND(...)
) Subq
GROUP BY everything except SUM(Subq.ImportantFloatWithBigPrecision)
HAVING SUM(Subq.ImportantFloatWithBigPrecision)<>0
ORDER BY (...)
I've checked the subquery quite a few times and it returns the same results every time, but the whole query returns 850-855 rows from the same data! Most of the time (around 80%) it is 852 rows, but sometimes it's 855, sometimes 850, and I really have no idea why.
Surprisingly, removing the <> 0 condition from the subquery helped at first glance (so far I got the same results 6 times), but it has a drastic impact on performance (for this number of rows it runs about 8-9 mins longer; remember the udf? :/).
Could someone explain this to me? Any ideas/questions? I'm completely clueless...
EDIT (inspired by Crono): We're running this on two environments: dev and test and comparing the results so maybe some of the settings may differ.
EDIT 2: Values of ImportantFloatWithBigPrecision are from a really wide range (around -1,000,000 to +1,000,000), but there are 'few' (probably, at this scale, around 25k-30k) rows which have values really close to 0 (first non-zero digit in the 6th place after the separator, some start even further), both negative and positive.
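The thread does not include a confirmed answer. One thing worth ruling out, offered as a hypothesis rather than a diagnosis, is that floating-point addition is not associative: if the rows of a group are added in a different order from run to run (for example under a parallel plan, which is an assumption about this workload), a sum that is mathematically close to zero can come out as exactly zero on one run and non-zero on another, and `HAVING SUM(...) <> 0` would then keep or drop that group. The values below are contrived to make the effect obvious, and `running_sum` is a hypothetical helper that adds strictly left to right.

```python
def running_sum(values):
    """Add values strictly left to right, as a plain accumulator would."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three values, summed in two different orders.
a = running_sum([1e16, 1.0, -1e16])   # (1e16 + 1.0) - 1e16: the 1.0 is lost to rounding
b = running_sum([1e16, -1e16, 1.0])   # (1e16 - 1e16) + 1.0: the 1.0 survives

print(a, b)
print(a != 0, b != 0)   # the "HAVING SUM(...) <> 0" test gives different answers
```

If this turns out to be the cause, comparing against a small tolerance or summing a DECIMAL-typed value are the usual ways to make the result order-independent; whether either fits this data is not something the source settles.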

Access & SQL Server: Number of uses since date aggregate problem - new reporting problem (solved aggregate issue)

BACKGROUND:
I've been trying to streamline the work involved in running a report in my program. Lately, I've had to supply a listing of the job numbers an instrument has been used on, along with the listing of items, for cost/benefit analysis, mostly to see how often an instrument has been used since it was last serviced/calibrated and when anyone last used it. I was looking to integrate this into the query that helps generate the report, but I keep hitting a brick wall with the number of uses, since I want that aggregate to be based on the date the instrument was last calibrated (a field in the same query). I can get it to give me the total number of uses in the system, but it will not accept the limitation that it should count only the times used since the last calibration.
PROBLEM:
Attempts to put an aggregate function in my report for the number of uses since the item's calibration are met either with undesired results, or the dreaded 'aggregate missing' error (don't remember the exact warning).
-- Edited to add 8/12/2011 # 16:09 --
An additional problem with the Max aggregate has been found: instruments that have never been used are excluded by this query.
DETAILS:
Here is the query that does work so far:
SELECT
dbo_tblPOGaugeDetail.intGagePOID,
dbo_tblPOGaugeDetail.strGageDetailID,
dbo_Gage_Master.Description,
dbo_Gage_Master.Manufacturer,
dbo_Gage_Master.Model_No,
dbo_Gage_Master.Gage_SN,
dbo_Gage_Master.Unit_of_Meas,
dbo_Gage_Master.User_Defined,
dbo_Gage_Master.Calibration_Frequency,
dbo_Gage_Master.Calibration_Frequency_UOM,
dbo_tblPOGaugeDetail.bolGageLeavePriceBlank,
dbo_tblPOGaugeDetail.intGageCost,
dbo_Gage_Master.Last_Calibration_Date,
dbo_Gage_Master.Next_Due_Date,
dbo_tblPOGaugeDetail.bolGageEvaluate,
dbo_tblPOGaugeDetail.bolGageExpedite,
dbo_tblPOGaugeDetail.bolGageAccredited,
dbo_tblPOGaugeDetail.bolGageCalibrate,
dbo_tblPOGaugeDetail.bolGageRepair,
dbo_tblPOGaugeDetail.bolGageReturned,
dbo_tblPOGaugeDetail.bolGageBER,
dbo_tblPOGaugeDetail.intTurnaroundDaysOut,
qryRCEquipmentLastUse.MaxOfdatDateEntered
FROM (dbo_tblPOGaugeDetail
INNER JOIN dbo_Gage_Master ON dbo_tblPOGaugeDetail.strGageDetailID = dbo_Gage_Master.Gage_ID)
INNER JOIN qryRCEquipmentLastUse ON dbo_Gage_Master.Gage_ID = qryRCEquipmentLastUse.Gage_ID
ORDER BY dbo_tblPOGaugeDetail.strGageDetailID;
But I can't seem to aggregate a count of Uses (making a Count(strCustomerJobNum)) from the tblGageActivity with the following fields:
strGageID
strCustomerJobNum
datDateEntered
datTimeEntered
I tried to add a field to the formerly listed query to do a Count(strCustomerJobNum) where datDateEntered matched the Last_Calibration_Date from the calling query - but I got the 'missing aggregate' error. If I leave this condition out - it will run - but will list every instrument ever sent out only if it's had a usage count of at least one (not what I want at all, sadly).
I also want to make sure that if I should get a zero uses count - I will get a zero back instead of my expected records minus the null results.
I hope someone out there can tell me where I am going wrong with this - I want to save the time I am currently spending running an activity report in another program whenever I want to generate this report. Thanks in advance, and let me know if you need me to post more information.
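One common pattern for "count since a date, and return zero when there is nothing to count" is a LEFT JOIN with the date test in the join condition, combined with COUNT on a column from the joined table. The sketch below runs it in SQLite from Python so that it is self-contained; the table and column names follow the ones in this question, the sample rows are invented, dates are stored as ISO text, and the SQL would likely need adapting for Access (for example as a pass-through query).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gage_Master (Gage_ID TEXT PRIMARY KEY, Last_Calibration_Date TEXT);
CREATE TABLE tblGageActivity (strGageID TEXT, strCustomerJobNum TEXT, datDateEntered TEXT);
INSERT INTO Gage_Master VALUES
  ('G1', '2011-06-01'), ('G2', '2011-07-01'), ('G3', '2011-05-01');
INSERT INTO tblGageActivity VALUES
  ('G1', 'J100', '2011-05-15'),   -- before G1's calibration: not counted
  ('G1', 'J101', '2011-06-10'),
  ('G1', 'J102', '2011-07-20'),
  ('G2', 'J103', '2011-06-15');   -- before G2's calibration: not counted
""")

rows = conn.execute("""
SELECT gm.Gage_ID, COUNT(a.strCustomerJobNum) AS UsesSinceCalibration
FROM Gage_Master AS gm
LEFT JOIN tblGageActivity AS a
       ON a.strGageID = gm.Gage_ID
      AND a.datDateEntered >= gm.Last_Calibration_Date
GROUP BY gm.Gage_ID
ORDER BY gm.Gage_ID
""").fetchall()

print(rows)
```

Because COUNT(column) skips NULLs, an instrument with no qualifying activity (G2) or no activity at all (G3) still appears, with a count of 0, instead of dropping out of the result.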
-- Edited to add 08/15/2011 # 14:41 --
I managed to solve the Max() aggregate problem by creating a 'pure' first-step query, qryRCEquipmentLastUse, to get a listing of all instruments with their most recent date of use.
qryRCEquipmentLastUse:
SELECT dbo.tblGageActivity.strGageID, Max(dbo.tblGageActivity.datDateEntered) AS datLastDateUsed
FROM dbo.tblGageActivity
GROUP BY dbo.tblGageActivity.strGageID;
Then I created a 'pure' listing of all instruments that have no usage at all as a query named qryRCEquipmentNeverUsed.
qryRCEquipmentNeverUsed:
SELECT dbo_Gage_Master.Gage_ID, NULL AS datLastDateUsed
FROM dbo_Gage_Master LEFT JOIN dbo_tblGageActivity ON dbo_Gage_Master.Gage_ID = dbo_tblGageActivity.strGageID
WHERE (((dbo_tblGageActivity.strGageID) Is Null));
NOTE: The NULL was inserted so that the third combining UNION query will not fail due to a mismatch in the number of fields being retrieved from the tables.
At last, I created a UNION query named qryCombinedUseEquipment to combine the two into a list:
qryCombinedUseEquipment:
SELECT *
FROM qryRCEquipmentLastUse
UNION SELECT *
FROM qryRCEquipmentNeverUsed;
Using this last union query to feed the Last Used date to the parent query works in datasheet view, but when the parent query is called in the report - I get a blank report; so a nudge in the right direction would still be wonderfully appreciated.
APPENDIX
Same script as above, but with shorter table aliases (in case someone finds that clearer):
SELECT
gd.intGagePOID,
gd.strGageDetailID,
gm.Description,
gm.Manufacturer,
gm.Model_No,
gm.Gage_SN,
gm.Unit_of_Meas,
gm.User_Defined,
gm.Calibration_Frequency,
gm.Calibration_Frequency_UOM,
gd.bolGageLeavePriceBlank,
gd.intGageCost,
gm.Last_Calibration_Date,
gm.Next_Due_Date,
gd.bolGageEvaluate,
gd.bolGageExpedite,
gd.bolGageAccredited,
gd.bolGageCalibrate,
gd.bolGageRepair,
gd.bolGageReturned,
gd.bolGageBER,
gd.intTurnaroundDaysOut,
lu.MaxOfdatDateEntered
FROM (dbo_tblPOGaugeDetail gd
INNER JOIN dbo_Gage_Master gm ON gd.strGageDetailID = gm.Gage_ID)
INNER JOIN qryRCEquipmentLastUse lu ON gm.Gage_ID = lu.Gage_ID
ORDER BY gd.strGageDetailID;
Piece by piece...
First -- I suspect you're trying to answer too many questions at once (as evidenced by 23 fields in your SELECT), which will make aggregation near-impossible. Start by narrowing down the scope of the query -- What question is this query attempting to answer? (You can always make more queries to answer other questions... :-)
1) How many uses since last calibration?
2) How many uses since last ...use? (not sure what you mean by that -- maybe last sign-out, or last rental, etc.?)
Tip -- learn to use table aliases. Large queries are difficult to read; worse because of repeated table names.
1) Ex.: dbo_tblPOGaugeDetail.intGagePOID becomes d.intGagePOID
Here's a sample that might get you started:
SELECT
d.strCustomerJobNum,
Max(d.last_calibration_date), -- not sure what you named that field
Count(d.strCustomerJobNum)
FROM
dbo_tblPOGaugeDetail d
GROUP BY
d.strCustomerJobNum
Does this work:
SELECT dbo_tblPOGaugeDetail.intGagePOID, dbo_tblPOGaugeDetail.strGageDetailID,
OuterGageMaster.Description, OuterGageMaster.Manufacturer, OuterGageMaster.Model_No,
OuterGageMaster.Gage_SN, OuterGageMaster.Unit_of_Meas, OuterGageMaster.User_Defined,
OuterGageMaster.Calibration_Frequency, OuterGageMaster.Calibration_Frequency_UOM,
dbo_tblPOGaugeDetail.bolGageLeavePriceBlank, dbo_tblPOGaugeDetail.intGageCost,
OuterGageMaster.Last_Calibration_Date, OuterGageMaster.Next_Due_Date,
dbo_tblPOGaugeDetail.bolGageEvaluate, dbo_tblPOGaugeDetail.bolGageExpedite,
dbo_tblPOGaugeDetail.bolGageAccredited, dbo_tblPOGaugeDetail.bolGageCalibrate,
dbo_tblPOGaugeDetail.bolGageRepair, dbo_tblPOGaugeDetail.bolGageReturned,
dbo_tblPOGaugeDetail.bolGageBER, dbo_tblPOGaugeDetail.intTurnaroundDaysOut,
qryRCEquipmentLastUse.MaxOfdatDateEntered,
(Select Count(strCustomerJobNum)
FROM tblGageActivity WHERE
OuterGageMaster.Last_Calibration_Date=tblGageActivity.datDateEntered) As JobCount
FROM
(dbo_tblPOGaugeDetail INNER JOIN dbo_Gage_Master OuterGageMaster ON
dbo_tblPOGaugeDetail.strGageDetailID = OuterGageMaster.Gage_ID) INNER JOIN
qryRCEquipmentLastUse ON OuterGageMaster.Gage_ID = qryRCEquipmentLastUse.Gage_ID
ORDER BY
dbo_tblPOGaugeDetail.strGageDetailID;
or is that what you tried?
Summary Problem:
Attempts to put an aggregate function in my report for the number of uses since the item's calibration are met either with undesired results, or the dreaded 'aggregate missing' error.
Solution:
I decided to leave the query driving the report alone, and instead use DLookup and DCount: DLookup retrieves the last used date from a query that provides the last used date of all the instruments, and DCount retrieves the number of uses an instrument has had since its last calibration.
Using the query described in the problem description, I am able to retrieve the last used date for all instruments. I used a =DLookup statement as the source for a text box on the report's subreport dealing with various items as such:
=IIf((DLookUp("[qryRCCombinedUseEquipment]![datLastDateUsed]","[qryRCCombinedUseEquipment]","[qryRCCombinedUseEquipment]![strGageID]=[strGageDetailID]")) Is Null Or ([bolGageReturned]=True),"",DLookUp("[qryRCCombinedUseEquipment]![datLastDateUsed]","[qryRCCombinedUseEquipment]","[qryRCCombinedUseEquipment]![strGageID]=[strGageDetailID]"))
This allows items that have never been used to return a NULL result, which will display as a blank text box.
The number of uses, however, would not feed off a query using =DCount (I tried, it would take over ten minutes to retrieve results, if it ever did). However, using the underlying activity table, I used the following statement:
=IIf([bolGageReturned],"","Used " & DCount("[dbo_tblGageActivity]![strGageID]","[dbo_tblGageActivity]","[dbo_tblGageActivity]![strGageID] = [strGageDetailID] And [dbo_tblGageActivity]![datDateEntered] Between [txtLastCalibrationDate] And date()") & " times since last calibration")
It would retrieve a number of times used since the instrument was last calibrated, but no uses that are before that or after today (some jobs are post dated, strangely). Of course, this is SLOW (about thirty seconds for a large document with thirty or forty instruments).
Does anyone else have a better solution for this, or will I have to take the performance hit? If no one has any better ideas, I will accept this as the answer after five days (8/21/2011) .

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data has to be analysed before the first row is returned. A LIMIT clause will not help you there, because it is applied after the output is computed. Can you give concrete data and the SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-creation privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a sample of the data to work your queries out on, and you can inner join sample_table against the other tables to speed up testing. As long as unique_id is not correlated with the data, the sample behaves much like a random one, so your query results should be roughly representative of what you will get on the full tables. Note: a prime modulus is a safeguard against periodic patterns in how the IDs were assigned; it is not strictly required. The example above will shrink your table down to about 0.1% of the original size (0.0987%, to be exact).
Most databases also have better sampling and random number methods than just using mod. Check the documentation to see what's available for your version.
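The size of such a sample is easy to check without a database. A minimal sketch, assuming consecutive integer IDs from 1 to 1,000,000 (an invented range) and the modulus 1013 from the example:

```python
MODULUS = 1013  # the modulus used in the sampling query

# Hypothetical key space: consecutive integer IDs.
ids = range(1, 1_000_001)

# Same filter as "where mod(unique_id, 1013) = 1".
sample = [i for i in ids if i % MODULUS == 1]

fraction = len(sample) / len(ids)
print(len(sample), f"{fraction:.4%}")
```

With consecutive IDs the filter keeps every 1013th row, so the sample is systematic rather than random; it is representative only to the extent that the ID carries no pattern related to the data.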
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface you have to use an unbuffered version of the function that executes the query.