Random() in Redshift CTE returns wildly incorrect results under certain conditions - sql

(Cross posting this from the AWS forums...)
Need a fairly sizable chunk of dummy data for this. I used this list of English words: http://www.mieliestronk.com/corncob_lowercase.txt
I'm seeing a MASSIVE difference in the number of results I get for seemingly equivalent queries involving the random() function within a CTE in Amazon Redshift. I'm trying to take a random sample: one query returns an actual sample as expected, while the other basically just returns the entire list of items I was trying to sample.
Can somebody take a look at this? Am I doing something wrong? Is there another issue here?
/* Create tables to hold words */
create table main_words(word varchar(max));
create table couple_words(word varchar(max));
/* Get some words */
copy main_words
from 'S3 LOCATION OF CORNCOB FILE'
credentials 'aws_access_key_id=ID;aws_secret_access_key=KEY'
csv;
/* Put a few in another table */
insert into
couple_words
select top 5000
word
from
main_words;
/* Returns about 500 results */
with the_cte as
(
select
word,
random() as random_value
from
main_words
where
word not in (select word from couple_words)
)
select
count(*)
from
the_cte
where
random_value > .99;
/* Returns about 58,000 results (basically, the whole list) */
with the_cte as
(
select
word
from
main_words
where
word not in (select word from couple_words)
and random() > .99
)
select
count(*)
from
the_cte;
/* Clean up */
drop table if exists main_words;
drop table if exists couple_words;

Have you tried it on a different server?
I just created a sample on SQL Fiddle with 100 rows plus random() > 0.9 and the results are very similar.
First CTE
| count |
|-------|
| 4 |
Second CTE
| count |
|-------|
| 13 |
Average count(*) with 10 runs
| CTE 1 | CTE 2 |
|-------|-------|
| 8.3   | 9.8   |

I suspect some funky query rewriting. If you have to have the inner query, you can use LIMIT 2147483647 inside and see what comes up.
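To make that concrete, here is one reading of that suggestion (a sketch only, not verified against Redshift's planner): put the LIMIT on the NOT IN subquery of the second query, so the planner is less inclined to flatten it into a join around the volatile random() predicate. 2147483647 is just the largest 32-bit integer, so nothing is actually limited.
/* Second query again, with LIMIT added to the inner subquery */
with the_cte as
(
select
word
from
main_words
where
word not in (select word from couple_words limit 2147483647)
and random() > .99
)
select
count(*)
from
the_cte;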

Related

where clause with = sign matches multiple records while expected just one record

I have a simple inline view that contains 2 columns.
-----------------
rn | val
-----------------
0 | A
... | ...
25 | Z
I am trying to select a val by matching the rn randomly by using the dbms_random.value() method as in
with d (rn, val) as
(
select level-1, chr(64+level) from dual connect by level <= 26
)
select * from d
where rn = floor(dbms_random.value()*25)
;
My expectation is it should return one row only without failing.
But now and then I get multiple rows returned or no rows at all.
On the other hand,
select floor(dbms_random.value()*25) from dual connect by level < 1000
returns a whole number for each row, and I fail to see any abnormality.
What am I missing here?
The problem is that the random value is recalculated for each row. So you might get two (or more) rows whose rn matches a freshly generated random value -- or go through all the rows and never get a hit.
One way to get around this is:
select d.*
from (select d.*
      from d
      order by dbms_random.value()
     ) d
where rownum = 1;
There are more efficient ways to calculate a random number, but this is intended to be a simple modification to your existing query.
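For what it's worth, here is one such variant (just a sketch, not the only approach): generate the random index once in its own subquery and join to it, so dbms_random.value() is evaluated a single time. The /*+ materialize */ hint guards against the optimizer merging that subquery back into a per-row predicate, and *26 (rather than *25) is used so that every letter, including 'Z', can actually be picked.
with d (rn, val) as
(
  select level-1, chr(64+level) from dual connect by level <= 26
),
pick as
(
  -- evaluated against the single row of dual, so only one random value is generated
  select /*+ materialize */ floor(dbms_random.value() * 26) as rn from dual
)
select d.rn, d.val
from d
join pick p on p.rn = d.rn;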
You also might want to ask another question. This question starts with a description of a table that is not used, and then the question is about a query that doesn't use the table. Ask another question, describing the table and the real problem you are having -- along with sample data and desired results.

PostgreSQL efficiently find last descendant in linear list

I'm currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series; using certain criteria I split it up to get a list like this
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
    SELECT start_ts, end_ts FROM track_caps tc
    UNION
    SELECT c.rstart_ts, tc.end_ts AS end_ts0
    FROM contract c
    INNER JOIN track_caps tc
        ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( -- final step: after traversing the "linked list", pick the largest timestamp found as end_ts and the smallest as start_ts
    SELECT DISTINCT ON (start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
    FROM (
        SELECT rstart_ts, max(rend_ts) AS rend_ts FROM contract
        GROUP BY rstart_ts
    ) sq
    GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it's quite a memory hog, which is hardly surprising when looking at the output of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It is actually supposed to process around 2,000,000 rows, but with the current solution that seems very unfeasible.
Did I just use recursive queries improperly? Is there a way to reduce the amount of data it produces (like removing the sub-steps)?
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table, and then Postgres needs to recurse for each and every row of the table.
To make that more efficient, only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
                  from track_caps t2
                  where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
    select t1.current_id, t1.next_id, t1.current_id as root_id
    from track_caps t1
    where not exists (select *
                      from track_caps t2
                      where t2.next_id = t1.current_id)
    union
    select c.current_id, c.next_id, p.root_id
    from track_caps c
    join contract p on c.current_id = p.next_id
                   and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
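If you also want just the first and last id of each chain (as the question asks), one option is to drop the c.next_id is not null condition so the terminal row itself is carried through the recursion, then keep only the rows that have no successor. A sketch, assuming the same track_caps(current_id, next_id) layout as above:
with recursive contract as (
    select t1.current_id, t1.next_id, t1.current_id as root_id
    from track_caps t1
    where not exists (select *
                      from track_caps t2
                      where t2.next_id = t1.current_id)
    union
    select c.current_id, c.next_id, p.root_id
    from track_caps c
    join contract p on c.current_id = p.next_id
)
select root_id    as first_id,  -- head of the chain (1, 42, ...)
       current_id as last_id    -- tail of the chain (4, 45, ...)
from contract
where next_id is null
order by first_id;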

sql logical compression of records

I have a table in SQL with more than 1 million records which I want to compress using the following algorithm, and I'm looking for the best way to do that, preferably without using a cursor.
If the table contains all 10 possible last digits (0 to 9) for a number prefix (like 252637 in the following example), we find the most used Source (in our example 'A'), then remove all digits where Source = 'A' and insert the collapsed digit (here 252637) instead of them.
The example below should help for better understanding.
Original table :
Digit (bigint) | Source
---------------|-------
2526370 | A
2526371 | A
2526372 | A
2526373 | B
2526374 | C
2526375 | A
2526376 | B
2526377 | A
2526378 | B
2526379 | B
Compressed result :
252637 |A
2526373 |B
2526374 |C
2526376 |B
2526378 |B
2526379 |B
This is just another version of Tom Morgan's accepted answer. It uses division instead of substring to trim the least significant digit off the BIGINT digit column:
SELECT
    t.Digit/10,
    (
        -- For each group, get the Source character that is most abundant (statistical mode).
        SELECT TOP 1
            Source
        FROM
            table i
        WHERE
            (i.Digit/10) = (t.Digit/10)
        GROUP BY
            i.Source
        ORDER BY
            COUNT(*) DESC
    )
FROM
    table t
GROUP BY
    t.Digit/10
HAVING
    COUNT(*) = 10
I think it'll be faster, but you should test it and see.
You could identify the rows which are candidates for compression without a cursor (I think) by GROUPing by a substring of the Digit (the length -1) HAVING count = 10. That would identify digits with 10 child rows. You could use this list to insert to a new table, then use it again to delete from the original table. What would be left would be rows that don't have all 10, which you'd also want to insert to the new table (or copy the new data back to the original).
Does that make sense? I can write it out a bit better if it doesn't.
Possible SQL Solution:
SELECT
    SUBSTRING(t.Digit, 1, LEN(t.Digit)-1),
    (SELECT TOP 1 Source
     FROM table i
     WHERE SUBSTRING(i.Digit, 1, LEN(i.Digit)-1)
         = SUBSTRING(t.Digit, 1, LEN(t.Digit)-1)
     GROUP BY i.Source
     ORDER BY COUNT(*) DESC
    )
FROM table t
GROUP BY SUBSTRING(t.Digit, 1, LEN(t.Digit)-1)
HAVING COUNT(*) = 10
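Putting the pieces described above together, a rough end-to-end sketch of the insert/delete flow (the table names Digits and Compressed are made up here; the original question doesn't name its table):
-- 1. One collapsed row per prefix that has all 10 last digits, using the most
--    frequent Source for that prefix.
INSERT INTO Compressed (Digit, Source)
SELECT d.Digit / 10,
       (SELECT TOP 1 i.Source
        FROM Digits i
        WHERE i.Digit / 10 = d.Digit / 10
        GROUP BY i.Source
        ORDER BY COUNT(*) DESC)
FROM Digits d
GROUP BY d.Digit / 10
HAVING COUNT(*) = 10;

-- 2. Remove the rows that were just collapsed: same prefix and same Source.
DELETE d
FROM Digits d
JOIN Compressed c ON d.Digit / 10 = c.Digit AND d.Source = c.Source;

-- 3. Everything left over (prefixes without all 10 digits, plus the minority
--    Sources) is copied across unchanged.
INSERT INTO Compressed (Digit, Source)
SELECT Digit, Source FROM Digits;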

SQL based data diff: longest common subsequence

I'm looking for research papers or writings on applying the Longest Common Subsequence algorithm to SQL tables for obtaining a data diff view. Other suggestions on how to solve a table diff problem are also welcome. The challenge is that SQL tables have this nasty habit of getting rather BIG, and applying straightforward algorithms designed for text processing may result in a program that never ends...
So given a table Original:
Key Content
1 This row is unchanged
2 This row is outdated
3 This row is wrong
4 This row is fine as it is
and the table New:
Key Content
1 This row was added
2 This row is unchanged
3 This row is right
4 This row is fine as it is
5 This row contains important additions
I need to find out the Diff:
+++ 1 This row was added
--- 2 This row is outdated
--- 3 This row is wrong
+++ 3 This row is right
+++ 5 This row contains important additions
If you export your tables into CSV files, you can use http://sourceforge.net/projects/csvdiff/
Quote:
csvdiff is a Perl script to diff/compare two csv files with the
possibility to select the separator. Differences will be shown like:
"Column XYZ in record 999" is different. After this, the actual and the
expected result for this column will be shown.
This is probably too simple for what you're after, and it's not research :-), but just conceptual. I imagine you're looking to compare different methods for processing overhead (?).
--This is half of what you don't want ( A )
SELECT o.Key FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content
--This is the other half of what you don't want ( B )
SELECT n.Key FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content
--This is half of what you DO want ( C )
SELECT '+++' as diff, n.key, Content FROM tbl_New n WHERE n.KEY NOT IN( B )
--This is the other half of what you DO want ( D )
SELECT '---' as diff, o.key, Content FROM tbl_Original o WHERE o.Key NOT IN ( A )
--Combining C & D
( C )
Union
( D )
Order By diff, key
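Written out in full, the combined query might look something like this (a sketch; it assumes the tables are tbl_Original(Key, Content) and tbl_New(Key, Content) as in the question, with Key bracketed because it is a reserved word):
SELECT '+++' AS diff, n.[Key], n.Content
FROM tbl_New n
WHERE n.[Key] NOT IN (SELECT n2.[Key]                       -- ( B )
                      FROM tbl_Original o
                      INNER JOIN tbl_New n2 ON o.Content = n2.Content)
UNION
SELECT '---' AS diff, o.[Key], o.Content
FROM tbl_Original o
WHERE o.[Key] NOT IN (SELECT o2.[Key]                       -- ( A )
                      FROM tbl_Original o2
                      INNER JOIN tbl_New n ON o2.Content = n.Content)
ORDER BY diff, [Key];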
Improvements...
- try creating indexed views of the base tables first
- try reducing the length of the content field to its minimum for uniqueness (trial/error), and then use that shorter result to do your comparisons
-- e.g. to get min length (1000 is arbitrary -- just need an exit)
declare @i int
set @i = 1
while @i < 1000 and exists (
    select left(content, @i)
    from tbl_Original
    group by left(content, @i)
    having count(*) > 1
)
begin
    set @i = @i + 1
end

Display more than one row with the same result from a field

I need to show more than one result from each field in a table. I need to do this with only one SQL statement; I don't want to use a cursor.
This seems silly, but the number of rows may vary for each item. I need this in order to print this information afterwards as a Crystal Reports detail.
Suppose I have this table:
idItem Cantidad <more fields>
-------- -----------
1000 3
2000 2
3000 5
4000 1
I need this result, using only one SQL statement:
1000
1000
1000
2000
2000
3000
3000
3000
3000
3000
4000
where each idItem has Cantidad rows.
Any ideas?
It seems like something that should be handled in the UI (or the report). I don't know Crystal Reports well enough to make a suggestion there. If you really, truly need to do it in SQL, then you can use a Numbers table (or something similar):
SELECT
idItem
FROM
Some_Table ST
INNER JOIN Numbers N ON
N.number > 0 AND
N.number <= ST.cantidad
You can replace the Numbers table with a subquery or function or whatever other method you want to generate a result set of numbers that is at least large enough to cover your largest cantidad.
Check out UNPIVOT (MSDN)
If you use a "numbers" table that is useful for this and many similar purposes, you can use the following SQL:
select t.idItem
from myTable t
join numbers n on n.num between 1 and t.Cantidad
order by t.idTtem
The numbers table should just contain all integer numbers from 0 or 1 up to a number big enough so that Cantidad never exceeds it.
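If you don't already have one, here is a minimal sketch for creating and filling such a table on SQL Server (the numbers/num names match the query above; the 10000 cap and the use of sys.all_objects as a row source are arbitrary choices):
CREATE TABLE numbers (num int NOT NULL PRIMARY KEY);

-- cross join a system view with itself just to get plenty of rows to number
;WITH n AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS num
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
INSERT INTO numbers (num)
SELECT num FROM n WHERE num <= 10000;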
As others have said, you need a Numbers or Tally table which is just a sequential list of integers. However, if you knew that Cantidad was never going to be larger than five for example, you can do something like:
Select idItem
From Table
Join (
Select 1 As Value
Union All Select 2
Union All Select 3
Union All Select 4
Union All Select 5
) As Numbers
On Numbers.Value <= Table.Cantidad
If you are using SQL Server 2005, you can use a CTE to do it:
With Numbers As
(
    Select 1 As Value
    Union All
    Select N.Value + 1
    From Numbers As N
    -- stop recursing once the largest Cantidad is covered
    Where N.Value < (Select Max(Cantidad) From Table)
)
Select idItem
From Table
Join Numbers As N
    On N.Value <= Table.Cantidad
Option (MaxRecursion 0);