Getting the same results in snowflake using RANDOM([seed]) - sql

From the snowflake documentation -
If a statement that calls RANDOM is executed more than once, there is no guarantee that RANDOM will generate the same set of values each time. This is true whether or not you specify a seed.
What's the use of a random seed if it doesn't let me write reproducible code? Is there a way around this, so that if I run the same query again I get the same rows every time, even when the rows are ordered randomly using a seed?
For example,
SELECT ID,
ROW_NUMBER() OVER (PARTITION BY group_name ORDER BY RANDOM(123)) AS random_n
FROM my_table
QUALIFY random_n < 100

Repeatable random numbers are really tricky in a parallel database -- and in general, simply not worth the effort. This is even harder on cross-platform databases.
As the documentation suggests, the purpose of random(seed) is to return the same value for multiple calls within a row. This seems like a micro-efficiency, because you should be able to generate the same effect using a CTE or subquery.
The documentation also suggests using SEQ functions for certain purposes. In fact, you can build your own repeatable pseudo-random number generator from the seq values, assuming the underlying ordering of the data is constant. My guess is that Snowflake prefers this method for a repeatable generator.
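As a rough sketch of that idea (not Snowflake-specific, and not taken from the Snowflake docs): derive the sort key from a stable column plus a seed using plain integer arithmetic, so the same seed always yields the same order. The example runs against SQLite from Python as a stand-in. The table contents, the `repeatable_sample` helper, and the LCG-style constants are all assumptions made for illustration, and it presumes a stable unique `id` column and SQLite 3.25+ for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, group_name TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, group_name) VALUES (?, ?)",
    [(i, "a" if i % 2 else "b") for i in range(1, 21)],
)

def repeatable_sample(conn, seed, per_group):
    """Return up to `per_group` ids per group in a seed-dependent, repeatable order.

    The sort key is one step of a linear congruential mix applied to id + seed
    (hypothetical constants), with id as a tie-breaker. No RANDOM() call is involved.
    """
    sql = """
        SELECT id, group_name
        FROM (
            SELECT id, group_name,
                   ROW_NUMBER() OVER (
                       PARTITION BY group_name
                       ORDER BY ((id + :seed) * 1103515245 + 12345) % 2147483648, id
                   ) AS random_n
            FROM my_table
        )
        WHERE random_n <= :per_group
        ORDER BY group_name, random_n
    """
    return conn.execute(sql, {"seed": seed, "per_group": per_group}).fetchall()

first_run = repeatable_sample(conn, seed=123, per_group=3)
second_run = repeatable_sample(conn, seed=123, per_group=3)
```

Both runs return the same rows because the "randomness" is a pure function of `id` and the seed. The mix is weak as a random number generator, so treat it as a shuffle for sampling, not as anything statistically rigorous.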


Does the ORDER BY clause return a Virtual Table?

My understanding is that relational tables aren't ordered.
I also understand that each step, or phase, of the query execution returns a "virtual table" which is passed as input to the next phase.
But if tables are never actually ordered, what's happening during/after the ORDER BY phase?
I'm just trying to understand what might happen with a query like this:
SELECT col1, col2
FROM mytable
ORDER BY col1
LIMIT 1;
Edit:
To clarify. I know what the query above outputs. I'm trying to better understand each phase/step of the underlying execution.
The (logical) order of execution (EDIT: different from the physical execution) for the above query would be:
FROM
SELECT
ORDER BY
LIMIT
I'm trying to understand what's going on during the ORDER BY phase. My understanding is that a virtual table is passed from the SELECT phase to the ORDER BY phase (in this case, a table with col1 and col2), but I don't know what's being returned by ORDER BY and subsequently passed to LIMIT.
Does the ORDER BY clause return a Virtual Table?
Sometimes.
The database engine tries as much as it can not to produce a materialized result set (that you call virtual table). Most of the time it's more efficient to work the rows one by one, so they can be successively processed by each execution step until they are returned to the client app.
However, this is not always possible. In such cases, the engine is forced to materialize an intermediate result that actually takes the form you are thinking about. But again, this is expensive, and is usually avoided.
The (logical) order of execution for the above query would be:
FROM
SELECT
ORDER BY
LIMIT
No. This is just how a SQL query is written and is unrelated to the actual execution steps. Take that sequence as a good pedagogical tool, useful [for learning purposes only] to understand how the result is produced. Behind the scenes, the engine cheats in all kinds of ways to do as little work as possible to produce the result you asked for. You wouldn't believe it if you saw it.
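To make the "cheating" concrete, here is a toy model in Python (not how any particular engine is implemented): the logical plan sorts everything and then applies LIMIT, while a streaming plan makes a single pass and keeps only the best rows seen so far. The `rows` data and both function names are made up for illustration.

```python
import heapq

# (col1, col2) tuples in arbitrary storage order; hypothetical sample data.
rows = [(3, "c"), (1, "a"), (2, "b"), (1, "z")]

def logical_plan(rows, limit):
    """FROM -> SELECT -> ORDER BY -> LIMIT, materializing the full sorted list."""
    ordered = sorted(rows, key=lambda r: r[0])
    return ordered[:limit]

def streaming_plan(rows, limit):
    """Single pass over the rows, retaining only the `limit` smallest by col1."""
    return heapq.nsmallest(limit, iter(rows), key=lambda r: r[0])

top_logical = logical_plan(rows, 1)
top_streaming = streaming_plan(rows, 1)
```

Both produce the same answer for `ORDER BY col1 LIMIT 1`, but the second never builds a sorted intermediate table. Note that with ties on col1, SQL itself does not say which of the tied rows comes first.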
The underlying table is not sorted when you use ORDER BY; only the results returned by the SELECT statement are. That query returns the first row of the sorted result. Since the default order is ascending, that is the row with the lowest value in col1.
The order of rows in a table is unknown; tables are unsorted by definition.
A result set produced by a SELECT without an ORDER BY is also unsorted.
But since ORDER BY is the last step before LIMIT and OFFSET, the result set they receive is in that specific order.

Does GROUP BY in any way imply the records' order?

Consider the following code:
SELECT s, COUNT(*)
FROM p
GROUP BY s;
Should I expect the records to be sorted with respect to s? In my experience, in Access 2007, it seems to be the case that the command implies an order.
You should never make that assumption when using SQL. It is always best to add an explicit ORDER BY:
order by s
This is because SQL (the language) does not guarantee the ordering of result sets with no ORDER BY.
That said, MS Access is going to return the results in order, because I think it has only one algorithm for calculating GROUP BY: sorting the list.
However, other algorithms are definitely out there. SQL Server, for instance, has hash-based algorithms and parallel algorithms.
So, you might as well learn how to write correct queries.
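A minimal runnable version of the corrected query, using SQLite from Python as a stand-in for Access (the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (s TEXT)")
conn.executemany(
    "INSERT INTO p (s) VALUES (?)",
    [("b",), ("a",), ("c",), ("a",), ("b",), ("a",)],
)

# GROUP BY does the aggregation; ORDER BY is what guarantees the output order.
counts = conn.execute(
    "SELECT s, COUNT(*) FROM p GROUP BY s ORDER BY s"
).fetchall()
```

With the explicit ORDER BY, the groups come back sorted by `s` no matter which grouping algorithm the engine picks.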

In SQL, does LIMIT return the row that was inserted last in chronological order?

Suppose the following rows are inserted in chronological order into a table:
row1, row2, row3, row4, ..., row1000, row1001.
After a while, we delete/remove the latest row1001.
As in this post: How to get Top 5 records in SqLite?
If the below command is run:
SELECT * FROM <table> LIMIT 1;
Will it assuredly provide the "row1000"?
If not, is there an efficient way to get the latest row(s) without traversing all the rows, i.e. without using a combination of ORDER BY and DESC?
[Note: For now I am using "SQLite", but it will be interesting for me to know about SQL in general as well.]
You're misunderstanding how SQL works. You're thinking row-by-row which is wrong. SQL does not "traverse rows" as per your concern; it operates on data as "sets".
Others have pointed out that relational database cannot be assumed to have any particular ordering, so you must use ORDER BY to explicitly specify ordering.
However, what has not been mentioned yet is that, to ensure the query performs efficiently, you need to create an appropriate index.
Whether you have an index or not, the correct query is:
SELECT <cols>
FROM <table>
ORDER BY <sort-cols> [DESC] LIMIT <no-rows>
Note that if you don't have an index the database will load all data and probably sort in memory to find the TOP n.
If you do have the appropriate index, the database will use the best index available to retrieve the TOP n rows as efficiently as possible.
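You can see the difference for yourself by asking the engine for its plan. Below is a small sketch using SQLite's `EXPLAIN QUERY PLAN` from Python. The `events` table, its columns, and the index name are hypothetical, and the exact plan wording varies between SQLite versions, so the code only checks whether a temporary sort step is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
)

QUERY = "SELECT id, payload FROM events ORDER BY created_at DESC LIMIT 5"

def plan_details(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def needs_sort(details):
    """True if the plan mentions a temporary b-tree, i.e. an explicit sort step."""
    return any("TEMP B-TREE" in d for d in details)

# Without an index on the sort column, the engine has to sort the rows itself.
sorts_without_index = needs_sort(plan_details(conn, QUERY))

# With an index on created_at, it can walk the index in order and stop early.
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")
sorts_with_index = needs_sort(plan_details(conn, QUERY))
```

Other databases expose the same information through their own `EXPLAIN` variants; the point is only that the query text stays the same while the index changes how much work is done.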
Note that the SQLite documentation is very clear on the matter. The section on ORDER BY explains that ordering is undefined. And nothing in the section on LIMIT contradicts this (it simply constrains the number of rows returned).
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
This behaviour is also consistent with the ANSI standard and all major SQL implementations. Note that any database vendor that guaranteed any kind of ordering would have to sacrifice performance to the detriment of queries trying to retrieve data but not caring about order. (Not good for business.)
As a side note, making flawed assumptions about ordering is an easy mistake (similar to flawed assumptions about uninitialised local variables).
RDBMS implementations are very likely to make ordering appear consistent. They follow a certain algorithm for adding data, a certain algorithm for retrieving data. And as a result, their operations are highly repeatable (it's what we love (and hate) about computers). So things repeatably look the same.
Theoretical examples:
Inserting a row results in the row being added to the next available free space. So data appears sequential. But an update would have to move the row to a new location if it no longer fits.
The DB engine might retrieve data sequentially from clustered index pages and seem to use clustered index as the 'natural ordering' ... until one day a page-split puts one of the pages in a different location.
Or a new version of the DBMS might cache certain data for performance, and suddenly the order changes.
Real-world example:
The MS SQL Server 6.5 implementation of GROUP BY had the side-effect of also sorting by the group-by columns. When MS (in version 7 or 2000) implemented some performance improvements, GROUP BY would, by default, return data in a hashed order. Many people blamed MS for breaking their queries when in fact they had made false assumptions and failed to ORDER BY their results as needed.
This is why the only guarantee of a specific ordering is to use the ORDER BY clause.
No. Table records have no inherent order. So it is undefined which row(s) to get with a LIMIT clause without an ORDER BY.
SQLite in its current implementation may return the latest inserted row, but even if that were the case you must not rely on it.
Give a table a datetime column or some sortkey, if record order is important for you.
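A minimal sketch of the sort-key approach, replaying the scenario from the question in SQLite via Python. The `log` table and its column names are assumptions; an `AUTOINCREMENT` integer key plays the role of the sort key here, and a datetime column would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)"
)

# Insert row1 .. row1001 in chronological order, then delete the latest one.
conn.executemany(
    "INSERT INTO log (label) VALUES (?)",
    [(f"row{i}",) for i in range(1, 1002)],
)
conn.execute("DELETE FROM log WHERE label = 'row1001'")

# The explicit sort key makes "latest" well defined.
latest = conn.execute(
    "SELECT label FROM log ORDER BY id DESC LIMIT 1"
).fetchone()[0]
```

Because `id` is the primary key, the engine can read the last entry directly from the key's b-tree instead of scanning the table.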
In SQL, data is stored in tables unordered. What comes out first one day might not be the same the next.
ORDER BY, or some other specific selection criteria is required to guarantee the correct value.

Select without order by

It is my understanding that select is not guaranteed to always return the same result.
Following query is not guaranteed to return the same result every time:
select * from myTable offset 10000 limit 100
My question is: if myTable is not changed between executions of the select (no deletions or inserts), can I rely on it returning the same result set every time?
Or to put it in another way if my database is locked for changes can I rely on select returning the same result?
I am using postgresql.
Tables and result sets (without order by) are simply not ordered. It really is that simple.
In some databases, under some circumstances, the order will be consistent. However, you should never depend on this. Subsequent releases, for instance, might invalidate the query.
For me, I think the simplest way to understand this is by thinking of parallel processing. When you execute a query, different threads might go out and start to fetch data; which values are returned first depends on non-reproducible factors.
Another way to think of it is to consider a page cache that already has pages in memory -- probably from the end of the table. The SQL engine could read the pages in any order (although in practice this doesn't really happen).
Or, some other query might have a row or page lock, so that page gets skipped when reading the records.
So, just accept that unordered really means unordered. Add an order by if you want data in a particular order. If you use a clustered index key, then there is basically no performance hit.
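For the paging query in the question, that means ordering by a unique key before applying offset and limit. Here is a small sketch, run against SQLite from Python as a stand-in for PostgreSQL (the `LIMIT ... OFFSET ...` syntax is accepted by both). The table contents and the `fetch_page` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO myTable (id, val) VALUES (?, ?)",
    [(i, f"v{i}") for i in range(1, 51)],
)

def fetch_page(conn, offset, limit):
    """Return one page of ids, ordered by the unique key so pages are well defined."""
    rows = conn.execute(
        "SELECT id FROM myTable ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [r[0] for r in rows]

page_a = fetch_page(conn, offset=10, limit=5)
page_b = fetch_page(conn, offset=10, limit=5)
```

Ordering by a unique column matters: if the sort column has duplicates, rows that tie can still swap places between executions.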

Is LIMIT clause in HIVE really random?

The documentation of HIVE notes that the LIMIT clause returns rows chosen at random. I have been running a SELECT on a table with more than 800,000 records with LIMIT 1, but it always returns the same record.
I'm using the Shark distribution, and I am wondering whether this has anything to do with this unexpected behavior. Any thoughts would be appreciated.
Thanks,
Visakh
Even though the documentation states it returns rows at random, it's not actually true.
It returns "rows chosen at random" in the sense that, without any where/order by clause, you get rows in whatever order they happen to come out of the database. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in can't be determined.
As soon as you slap an order by x DESC limit 5 on there, it returns the last 5 rows of whatever you're selecting from.
To get rows returned at random, you would need to use something like: order by rand() LIMIT 1
However, it can have a speed impact if your indexes aren't set up properly. Usually I do a min/max to get the IDs on the table, then pick a random number between them, then select those records (in your case, that would be just 1 record), which tends to be faster than having the database do the work, especially on a large dataset.
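The min/max approach could look roughly like this. It is sketched with SQLite from Python rather than Hive, and the `items` table, its contents, and the `pick_random_row` helper are assumptions made for the example. It presumes a numeric `id` column; the `id >= target` lookup is one way to cope with gaps in the ids.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

# Ids 1..100 with every tenth id missing, to mimic deleted rows.
ids = [i for i in range(1, 101) if i % 10 != 0]
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item{i}") for i in ids],
)

def pick_random_row(conn, rng):
    """Pick one row by drawing a random id between MIN(id) and MAX(id)."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM items").fetchone()
    target = rng.randint(lo, hi)
    # If the drawn id was deleted, fall forward to the next existing id.
    return conn.execute(
        "SELECT id, name FROM items WHERE id >= ? ORDER BY id LIMIT 1",
        (target,),
    ).fetchone()

rng = random.Random(42)  # seeded only so the example is repeatable
picked = pick_random_row(conn, rng)
```

One caveat with this shortcut: a row that follows a gap is picked more often than its neighbours, so the selection is only approximately uniform when ids are sparse.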
To be safe you want to use
select * from table
distribute by rand()
sort by rand()
limit 10000;
The documentation may have been updated since this question was originally posted in 2014, but as of December 2017, the documentation reads, "The following query returns 5 arbitrary customers".
In this case, "arbitrary" means the method of selection is either not deterministic or not worth the trouble to document. In other words, you shouldn't count on it as a reliable method for getting a specific subset of records (e.g., for sampling). You should only use the Limit clause without an Order By clause if you are looking for expediency and want to get a small result set as quickly as possible (e.g., for QA purposes). Otherwise, use one of Order By, Cluster By, or Distribute By/Sort By as appropriate.