Order by random number not working in Impala

Order by random number not working in Impala - sql

When I run this in Hue against an Impala data source, I get an unexpected result
select id, rand(unix_timestamp(now())) as ord
from (
select 1 as id union select 2 union select 3 union select 4
) x
order by ord
I'd expect to get the rows ordered in a random order by whatever comes up in the ord column. Instead, I always get 3, 4, 1, 2 with numbers in ord that aren't in order:
id ord
3 0.8899110606358904
4 0.3766916056055071
1 0.09029531483086539
2 0.7446466436351866
So, what part of ORDER BY does Impala not understand? Or what am I missing?

It seems that Impala 2.8 and lower has this bug. This is claimed to be fixed in Impala 2.9, but I haven't tested it yet.
In the meantime, since I don't need very robust randomness, I just used:
order by mod((((unix_timestamp() + MY_ID_COLUMN) * 7621) + 1), 32768)
(There is a definite pattern to such results, but it is probably good enough to fool a casual observer.)
Thanks to Vamsi Prabhala for pointing me in the right direction.

Related

WHILE Window Operation with Different Starting Point Values From Column - SQL Server [duplicate]

In SQL there are aggregation operators, like AVG, SUM, COUNT. Why doesn't it have an operator for multiplication? "MUL" or something.
I was wondering, does it exist for Oracle, MSSQL, MySQL ? If not is there a workaround that would give this behaviour?

By MUL do you mean progressive multiplication of values?
Even with 100 rows of some small size (say 10s), your MUL(column) is going to overflow any data type! With such a high probability of mis/ab-use, and very limited scope for use, it does not need to be a SQL Standard. As others have shown there are mathematical ways of working it out, just as there are many many ways to do tricky calculations in SQL just using standard (and common-use) methods.
Sample data:
Column
1
2
4
8
COUNT : 4 items (1 for each non-null)
SUM : 1 + 2 + 4 + 8 = 15
AVG : 3.75 (SUM/COUNT)
MUL : 1 x 2 x 4 x 8 ? ( =64 )
For completeness, the Oracle, MSSQL, MySQL core implementations *
Oracle : EXP(SUM(LN(column))) or POWER(N,SUM(LOG(column, N)))
MSSQL : EXP(SUM(LOG(column))) or POWER(N,SUM(LOG(column)/LOG(N)))
MySQL : EXP(SUM(LOG(column))) or POW(N,SUM(LOG(N,column)))
Care when using EXP/LOG in SQL Server, watch the return type http://msdn.microsoft.com/en-us/library/ms187592.aspx
The POWER form allows for larger numbers (using bases larger than Euler's number), and in cases where the result grows too large to turn it back using POWER, you can return just the logarithmic value and calculate the actual number outside of the SQL query
* LOG(0) and LOG(-ve) are undefined. The below shows only how to handle this in SQL Server. Equivalents can be found for the other SQL flavours, using the same concept
create table MUL(data int)
insert MUL select 1 yourColumn union all
select 2 union all
select 4 union all
select 8 union all
select -2 union all
select 0
select CASE WHEN MIN(abs(data)) = 0 then 0 ELSE
EXP(SUM(Log(abs(nullif(data,0))))) -- the base mathematics
* round(0.5-count(nullif(sign(sign(data)+0.5),1))%2,0) -- pairs up negatives
END
from MUL
Ingredients:
taking the abs() of data, if the min is 0, multiplying by whatever else is futile, the result is 0
When data is 0, NULLIF converts it to null. The abs(), log() both return null, causing it to be precluded from sum()
If data is not 0, abs allows us to multiple a negative number using the LOG method - we will keep track of the negativity elsewhere
Working out the final sign
sign(data) returns 1 for >0, 0 for 0 and -1 for <0.
We add another 0.5 and take the sign() again, so we have now classified 0 and 1 both as 1, and only -1 as -1.
again use NULLIF to remove from COUNT() the 1's, since we only need to count up the negatives.
% 2 against the count() of negative numbers returns either
--> 1 if there is an odd number of negative numbers
--> 0 if there is an even number of negative numbers
more mathematical tricks: we take 1 or 0 off 0.5, so that the above becomes
--> (0.5-1=-0.5=>round to -1) if there is an odd number of negative numbers
--> (0.5-0= 0.5=>round to 1) if there is an even number of negative numbers
we multiple this final 1/-1 against the SUM-PRODUCT value for the real result

No, but you can use Mathematics :)
if yourColumn is always bigger than zero:
select EXP(SUM(LOG(yourColumn))) As ColumnProduct from yourTable

I see an Oracle answer is still missing, so here it is:
SQL> with yourTable as
2 ( select 1 yourColumn from dual union all
3 select 2 from dual union all
4 select 4 from dual union all
5 select 8 from dual
6 )
7 select EXP(SUM(LN(yourColumn))) As ColumnProduct from yourTable
8 /
COLUMNPRODUCT
-------------
64
1 row selected.
Regards,
Rob.

With PostgreSQL, you can create your own aggregate functions, see http://www.postgresql.org/docs/8.2/interactive/sql-createaggregate.html
To create an aggregate function on MySQL, you'll need to build an .so (linux) or .dll (windows) file. An example is shown here: http://www.codeproject.com/KB/database/mygroupconcat.aspx
I'm not sure about mssql and oracle, but i bet they have options to create custom aggregates as well.

You'll break any datatype fairly quickly as numbers mount up.
Using LOG/EXP is tricky because of numbers <= 0 that will fail when using LOG. I wrote a solution in this question that deals with this

Using CTE in MS SQL:
CREATE TABLE Foo(Id int, Val int)
INSERT INTO Foo VALUES(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)
;WITH cte AS
(
SELECT Id, Val AS Multiply, row_number() over (order by Id) as rn
FROM Foo
WHERE Id=1
UNION ALL
SELECT ff.Id, cte.multiply*ff.Val as multiply, ff.rn FROM
(SELECT f.Id, f.Val, (row_number() over (order by f.Id)) as rn
FROM Foo f) ff
INNER JOIN cte
ON ff.rn -1= cte.rn
)
SELECT * FROM cte

Not sure about Oracle or sql-server, but in MySQL you can just use * like you normally would.
mysql> select count(id), count(id)*10 from tablename;
+-----------+--------------+
| count(id) | count(id)*10 |
+-----------+--------------+
| 961 | 9610 |
+-----------+--------------+
1 row in set (0.00 sec)

where clause with = sign matches multiple records while expected just one record

I have a simple inline view that contains 2 columns.
-----------------
rn | val
-----------------
0 | A
... | ...
25 | Z
I am trying to select a val by matching the rn randomly by using the dbms_random.value() method as in
with d (rn, val) as
(
select level-1, chr(64+level) from dual connect by level <= 26
)
select * from d
where rn = floor(dbms_random.value()*25)
;
My expectation is it should return one row only without failing.
But now and then I get multiple rows returned or no rows at all.
on the other hand,
>>select floor(dbms_random.value()*25) from dual connect by level <1000
returns a whole number for each row and I failed to see any abnormality.
What am I missing here?

The problem is that the random value is recalculated for each row. So, you might get two random values that match the value -- or go through all the values and never get a hit.
One way to get around this is:
select d.*
from (select d.*
from d
order by dbms_random.value()
) d
where rownum = 1;
There are more efficient ways to calculate a random number, but this is intended to be a simple modification to your existing query.
You also might want to ask another question. This question starts with a description of a table that is not used, and then the question is about a query that doesn't use the table. Ask another question, describing the table and the real problem you are having -- along with sample data and desired results.

Oracle SQL - How to "pivot" one row to many

In Oracle 12c, I have a view, which takes a little time to run. When I add the where clause, it will return exactly one row of interest. The row has columns/value like this...
I need this flipped so that I can see one row per EACH "set". I need the SQL to return something like
I know I can do a UNION ALL for each of the entry sets, but as the view takes a little while to run, plus there are about 30 different sets (I only showed 3 - Car, Boat, and truck)
Is there a better way of doing this? I have looked at PIVOT/UNPIVOT, but I didn't see how to make this work.

I think you are looking for UNPIVOT
WITH TEMP_DATA (ID1, CarPrice, CarTax, BoatPrice, BoatTax, TruckPrice, TruckTax)
AS (
select 'AAA', 1, 2, 3, 4, 5, 6 from dual )
select TYPE, PRICE, TAX
from temp_data
unpivot
(
(PRICE, TAX)
for TYPE IN
(
(CarPrice, CarTax) as 'CAR',
(BoatPrice, BoatTax) as 'BOAT',
(TruckPrice, TruckTax) as 'TRUCK'
)
)
;
OUTPUT:
TYPE PRICE TAX
----- ---------- ----------
CAR 1 2
BOAT 3 4
TRUCK 5 6

Ordered count of consecutive repeats / duplicates

I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity for spurts often but sometimes for a long time. You need to aggregate but when you do so, you need to look at how long it was repeating and make various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour than moves around to 20.1 and so on. You'll want to evaluate the accuracy when it's stuck. You could also place some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado:
--here's my data, you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into subquery
--what we need is something to partion on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to sub-queries?
I have tested one alternative method, but it still relies on a sub-query and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table and if the timestamp is between those, you classify it as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events'. Do you think that's a better way to go?

Test case
First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:
CREATE TEMP TABLE data(
system_measured int
, time_of_measurement int
, measurement int
);
INSERT INTO data VALUES
(1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);
Simplified query
Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:
WITH x AS (
SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
ORDER BY time_of_measurement) = measurement
THEN 0 ELSE 1 END AS step
FROM data
)
, y AS (
SELECT *, sum(step) OVER(PARTITION BY system_measured
ORDER BY time_of_measurement) AS grp
FROM x
)
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM y
ORDER BY system_measured, time_of_measurement;
Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.
Faster with plpgsql function:
CREATE OR REPLACE FUNCTION x.f_repeat_ct()
RETURNS TABLE (
system_measured int
, time_of_measurement int
, measurement int, repeat_ct int
) LANGUAGE plpgsql AS
$func$
DECLARE
r data; -- table name serves as record type
r0 data;
BEGIN
-- SET LOCAL work_mem = '1000 MB'; -- uncomment an adapt if needed, see below!
repeat_ct := 0; -- init
FOR r IN
SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
IF r.system_measured = r0.system_measured
AND r.measurement = r0.measurement THEN
repeat_ct := repeat_ct + 1; -- start new array
ELSE
repeat_ct := 0; -- start new count
END IF;
RETURN QUERY SELECT r.*, repeat_ct;
r0 := r; -- remember last row
END LOOP;
END
$func$;
Call:
SELECT * FROM x.f_repeat_ct();
Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters which would take precedence if not qualified.
Billions of rows
If you have billions of rows, you may want to split this operation up. I quote the manual here:
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation. Currently, the point at
which data begins being written to disk is controlled by the work_mem
configuration variable. Administrators who have sufficient memory to
store larger result sets in memory should consider increasing this
parameter.
Consider computing rows for one system at a time or set a high enough value for work_mem to cope with the load. Follow the link provided in the quote on more about work_mem.
One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.

Get a certain number of contiguous rows

I've got a school project, kind of huge, and only a few days left, so here is one problem I'm stuck on, hope you can help.
I've got this table:
places(id, otherInfo)
Simple enough, well I need to make a SQL query or PL/SQL function to retrieve a certain number of contiguous rows. For instance if I call the function using getContiguousPlaces(3); On the table which has the rows:
ID
1
4
5
6
18
19
I want to get rows with ID 4, 5 and 6.
How could I do that?

SELECT p.id, p.prev_id1, p.prev_id2
FROM (
SELECT id, LAG(id, 1) OVER(ORDER BY id) prev_id1, LAG(id, 2) OVER(ORDER BY id) prev_id2
FROM places
) p
WHERE p.prev_id1 = id-1
AND p.prev_id2 = id-2
here you go: http://sqlfiddle.com/#!4/a20e1/1
But I guess you could retrieve the datas differently if you wished.
edit: this case if for the parameter "3". But if you want to adapt it with a number "n", you either have to use it within a dynamic query, or maybe use the partition by clause. I'm going to have a look at this one now...

Try this:
WITH t AS
(SELECT p.*, LEVEL l,
CONNECT_BY_ROOT id AS cbr
FROM places p CONNECT BY
PRIOR id = id-1)
SELECT *
FROM t
WHERE cbr IN
(SELECT cbr
FROM t
WHERE l = 3 )
and l <= 3
ORDER BY cbr,
id
The constant 3 should be a parameter
Here is a fiddle

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Order by random number not working in Impala - sql

Related

WHILE Window Operation with Different Starting Point Values From Column - SQL Server [duplicate]

where clause with = sign matches multiple records while expected just one record

Oracle SQL - How to "pivot" one row to many

Ordered count of consecutive repeats / duplicates

Get a certain number of contiguous rows

Categories

Resources