Difference between "TOP" and "SAMPLE" in TeraData SQL - sql

What is the difference between "TOP" and "SAMPLE" in TeraData SQL? Are they the same?

From TOP vs SAMPLE:
TOP 10 means "first 10 rows in sorted
order". If you don't have an ORDER BY,
then by extension it will be
interpreted as asking for "ANY 10
rows" in any order. The optimizer is
free to select the cheapest plan it
can find and stop processing as soon
as it has found enough rows to return.
If this query is the only thing
running on your system, TOP may appear
to always give you exactly the same
answer, but that behavior is NOT
guaranteed.
SAMPLE, as you have observed, does
extra processing to try to randomize
the result set yet maintain the same
approximate distribution. At a very
simple level, for example, it could
pick a random point at which to start
scanning the table and a number of
rows to skip between rows that are
returned.

‘Sample’ command:
Sel * from tablename
sample 100
That will give you a sample of 100 different records from the table.
The SAMPLE command will give DIFFERENT results each time you run it.
TOP command:
sel top 100 * from tablename;
This will give the first 100 rows of the table.
The TOP command will give you THE SAME results each time you run it.

Related

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time-window. The time window can be size of "one day" to "many years". There are measurement values round about every minute in DB.
So the number of entries for a time-window can be quite different. Say from few hundrets to several thousands or millions.
Those values are ment to be visualiuzed in a graphical chart-diagram on a webpage.
If the chart is - lets say - 800px wide, it does not make sense to get thousands of rows from database if time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to this. If PostgreSQL f.i. provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unixtimestamp but a date and "trunc" it, avergage the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structue seems not really optimized for this: Query for last year took ~75sec (slow test machine) but finally got only a one value per day. This can be combined with avg(value), but this further increases query time... (~82sec). I will see if it's possible to further optimize this. But I now have an idea how "downsampling" data works, especially with aggregation in combination with "group by".
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
row_number() over (partition by tile order by timestamp) as seqnum
from (select md.*, ntile(800) over (order by timestamp) as tile
from measuredata md
where . . . -- your filtering conditions here
) md
) md
where seqnum = 1;

CosmosDB: How does SELECT TOP work?

I have a 450GB database... with millions of records.
Here is an example query:
SELECT TOP 1 * FROM c WHERE c.type='water';
To speed up our queries, I thought about just taking the first one but we have noticed that the query still takes quite a while, despite the very first record in the database matching our constraints.
So, my question is, how does the SELECT TOP 1 really work? Does it:
A) Select ALL records and then return just the first (top) one where
type='water'
B) Return the first record which is encountered where type='water'
Try this line, noting the offset limit:
SELECT * FROM c WHERE c.type='water' OFFSET 0 LIMIT 1
For more information about the offset limit:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-offset-limit
Assuming you aren't sorting your results (which you query isn't) then TOP 1 will return the first result as soon as it finds one. This should then end the query.
Cosmos db Explorer doesn't work with the TOP Command, It's an existing issue. It works fine in SDK Call.
Check some Top command usage below
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-subquery

PDO SQL issue displaying multiple rows when using COUNT()

To display my results from PDO, I always use following PHP code for example:
$STH = $DBH->prepare("SELECT logo_id, guess_count, guessed, count(id) AS counter FROM guess WHERE user_id=:id");
$STH->bindParam(":id",$loginuser['id']);
$STH->execute();
while($row = $STH->fetch()){
print_r($row);
}
Now the issue is that I only get one result. I used to use $STH->rowCount() to check the amount of rows returned, but this method isn't really advised for SELECT statements because in some databases it doesn't react correctly. So I used the count(id) AS counter, but now I only get one result every time, even though the value of $row['counter'] is larger than one.
What is the correct way to count the amount of results in one query?
If you want to check the number of rows that are returned by a query, there are a couple of options.
You could do a ->fetchAll to get an array of all rows. (This isn't advisable for large result sets (i.e. a lot of rows returned by the query); you could add a LIMIT clause on your query to avoid returning more than a certain number of rows, if what you are checking is whether you get more than one row back, you would only need to retrieve two rows.) Checking the length of the array is trivial.
Another option is to run a another, separate query, to get the count separately, e.g.
SELECT COUNT(1) AS counter FROM guess WHERE user_id=:id
But, that approach requires another round trip to the database.
And the old standby SQL_CALC_ROUND_ROWS is another option, though that too can have problematic performance with large sets.
You could also just add a loop counter in your existing code:
$i = 0;
while($row = $STH->fetch()){
$i++
print_r($row);
}
print "fetched row count: ".$i;
If what you need is an exact count of the number of rows that satisfy a particular predicate PRIOR to running a query to return the rows, then the separate COUNT(1) query is likely the most suitable approach. Yes, it's extra code in your app; I recommend you preface the code with a comment that indicates the purpose of the code... to get an exact count of rows that satisfy a set of predicates, prior to running a query that will retrieve the rows.
If I had to process the rows anyway, and adding LIMIT 0,100 to the query was acceptable, I would go for the ->fetchAll(), get the count from the length of the array, and process the rows from the array.
You have to use GROUP BY. Your query should look like
SELECT logo_id, guess_count, guessed, COUNT(id) AS counter
FROM guess
WHERE user_id=:id
GROUP BY logo_id, guess_count, guessed

Biased random in SQL?

I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a boost factor.
So I essentially have the fields ID and BOOST.The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as much on ID 2 and 7 times as much on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10. I choose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.
So 0.5 hit goes to the second element
And 0.5 hit goes to the (second and first) element which is chosen from randomly so so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66
I need some modification or new a method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery. "Please choose a random ad with given probability"... however i need it for another purpose but just to give you a last picture what it should do.
edit:
Thanks to kens answer i thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET #randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors which added up surpasses the random value
then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: because the count of entries having one boost is always different. For example if there is only 1-boosted entry i get it in 1 of 10 calls, but if there are 1 million with 7, each of them is hardly ever returned...
so this doesnt work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor ... but i am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE #sample TABLE (id int, boost int)
INSERT #sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
#sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest straightforward solution with two queries, using cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select table rows, calculating cumulative sum of boosts, and find the first row, which has cumulative boost greater than {random number}:
SET #cumulative_boost=0;
SELECT
id,
#cumulative_boost:=(#cumulative_boost + boost) AS cumulative_boost,
FROM
table
WHERE
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: Every person had a calculated number of tickets in the final draw. If you had more tickets then you would have an higher chance to win "the lottery".
Since I didn't trust any of the found results rand() * multiplier or the one with -log(rand()) on the web I wanted to implement my own straightforward solution.
What I did and in your case would look a little bit like this:
(SELECT id, boost FROM foo) AS values
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= values.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum number of boost is less than the maximum returned in the number query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100 then joining on numbers.counter <= values.boost would mean that if a row has a boost of 2 it would end up duplicated in the final result. If a row has a boost of 100 it would end up in the final set 100 times. Or in another words. If sum of boosts is 4212 which it was in my case you would have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly make sure to use a large table, or make sure that the id's don't skip any numbers. Better yet and probably a bit faster you might even create a temporary table which would simply have all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost

How to limit result set size for arbitrary query in Ingres?

In Oracle, the number of rows returned in an arbitrary query can be limited by filtering on the "virtual" rownum column. Consider the following example, which will return, at most, 10 rows.
SELECT * FROM all_tables WHERE rownum <= 10
Is there a simple, generic way to do something similar in Ingres?
Blatantly changing my answer. "Limit 10" works for MySql and others, Ingres uses
Select First 10 * from myTable
Ref
select * from myTable limit 10 does not work.
Have discovered one possible solution:
TIDs are "tuple identifiers" or row addresses. The TID contains the
page number and the index of the offset to the row relative to the
page boundary. TIDs are presently implemented as 4-byte integers.
The TID uniquely identifies each row in a table. Every row has a
TID. The high-order 23 bits of the TID are the page number of the page
in which the row occurs. The TID can be addressed in SQL by the name
`tid.'
So you can limit the number of rows coming back using something like:
select * from SomeTable where tid < 2048
The method is somewhat inexact in the number of rows it returns. It's fine for my requirement though because I just want to limit rows coming back from a very large result set to speed up testing.
Hey Craig. I'm sorry, I made a Ninja Edit.
No, Limit 10 does not work, I was mistaken in thinking it was standard SQL supported by everyone. Ingres uses (according to doc) "First" to solve the issue.
Hey Ninja editor from Stockholm! No worries, have confirmed that "first X" works well and a much nicer solution than I came up with. Thankyou!