How do I sample the number of records in another table? - sql

I have code where I'm sampling 50,000 random records. I.e.,
SELECT * FROM Table1
SAMPLE 50000;
That works. However, what I really want to do is sample the number of records that are in a different table. I.e.,
SELECT * FROM Table1
SAMPLE count(*) FROM Table2;
I get an error. What am I doing wrong?

This is not randomized like sample, so bear that in mind. But there also won't be an obvious pattern, I believe it's determined by disk location (don't quote me on that).
SELECT *
FROM Table1
QUALIFY ROW_NUMBER() OVER
( PARTITION BY 1
ORDER BY 1
) <=
( SELECT COUNT(*)
FROM Table2
);
Better way
SELECT TMP.* -- Or list the columns you want with "rnd"
FROM ( SELECT RANDOM(-10000000,10000000) rnd,
T1.*
FROM Table1 T1
) TMP
QUALIFY ROW_NUMBER() OVER
( ORDER BY rnd
) <=
( SELECT COUNT(*)
FROM Table2
);

SELECT TOP 50000 * FROM Table1 ORDER BY NEWID()

Related

sql: first row after the last row with a property

I would like to write a query that returns the first row immediately after the last row with a given property (ordered by id). Id's may not be consecutive.
Ideally it would look something like this:
...
JOIN (select max(id) id from my_table where CONDITION) m
JOIN (select min(id) from my_table where id > m.id) n
However, I can not use identifier m in the second subselect.
It is possible to use nested queries in nested queries, but is there an easier way?
Thank you.
You could use lead() to get the next id before applying the condition:
select t.*
from my_table t join
(select max(next_id) as max_next_id
from (select t.*, lead(id) over (order by id) as next_id
from my_table t
) t
where <condition>
) tt
on t.id = tt.max_next_id;
You could also do:
select t.*
from my_table t
where t.id > (select max(t2.id) from my_table t2 where <condition>)
order by t2.id asc
fetch first 1 row only;
I am not sure how this is getting woven into the rest of your query, so I have used a CTE
WITH max_next AS (
SELECT r.id as max_id
,r.next_id
FROM (
SELECT m.id
,m.next_id
,ROW_NUMBER() OVER (ORDER BY m.id DESC) AS rn
FROM (
SELECT n.* -- to provide data to satisfy CONDITIONS
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
) AS r
WHERE r.rn = 1
)
I would also shrink the n.* to the columns needed by CONDITIONS to a, not be implicit as the * slows the compile time down (or historically has) as all meta data needs to be read to understand what columns is in the ANY, and the while the compile can also prune not used columns, it's faster if you just ask for what you want (in best case just a compile time savings, worse case, it read all the data when you only need x number of columns read)
And borrowing from Gordon solution, the ROW_NUMBER part could be simpler
WITH max_next AS (
SELECT m.id
,m.next_id
--, plus what ever other things you want from m
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
ORDER BY m.id DESC LIMIT 1
)
So for an example for #PIG,
WITH my_table AS (
SELECT column1 AS id
,column2 AS con1
,column3 AS other
FROM VALUES (1,'a',123),(2,'b',234),(3,'a',345),(5,'b',456),(7,'a',567),(10,'c',678)
)
SELECT m.id
,m.next_id
,m.other
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE m.con1 = 'b'
ORDER BY m.id DESC LIMIT 1;
gives 5, 7, 456 which is the last 'b' and the new row, and an extra value on my_table for entertainment purposes (and run on Snowflake to, which means I fixed the prior SQL also.)
This should work, it's pretty straightforward (easy), and it's good that you know records may not be stored in a ordered/consecutive fashion.
SELECT *
FROM my_table
WHERE id = (
SELECT min(id)
FROM my_table
WHERE id > (
SELECT max(id)
FROM my_table
WHERE CONDITION));

Select duplicated data from table

Query
select * from table1
where having count(reference)>1
I want to select * the data which have duplicate data,any idea why my query is not working?
Below are my expect result..
You can make use of window function count to find number of rows per id and reference and then filter to get those which have count more than 1.
;with cte as (
select t.*, count(*) over (partition by id, reference) cnt
from table1 t
)
select * from cte where cnt > 1;
Demo
In the above solution, I have made an assumption that name and id has one to one correspondence (which is true as per your given data). If that's not the case, add name too in the partition by clause:
;with cte as (
select t.*, count(*) over (partition by name, id, reference) cnt
from table1 t
)
select * from cte where cnt > 1;
I might actually approach this by using a subquery with GROUP BY:
SELECT t1.*
FROM table1 t1
INNER JOIN
(
SELECT Name, ID, reference
FROM table1
GROUP BY Name, ID, reference
HAVING COUNT(*) > 1
) t2
ON t1.Name = t2.Name AND
t1.ID = t2.ID AND
t1.reference = t2.reference
Demo here:
Rextester
Try this ), first i get count by partition, after that i get row with count > 1
select No, Name, ID, Reference
from (select count(*) over (partition by name, ID, reference) cnt, table1.* from table1)
where cnt>1
The easy way (although maybe not the best for performance) would be:
select * from table1 where reference in (
select reference from table1 group by reference having count(*)>1
)
In a subselect you have the duplicated data, and in the outter select you have all the data for these references.

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with a field ID and another field as collection_time. I want to select latest record for each ID. Unfortunately combination of (ID, collection_time) time is not unique together in my data. I want just one of records with the maximum collection time. I have tried two solutions but none of them has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1
This results in Resources exceeded error that I guess is because of ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
this solution does not work for me because (ID, collection_time) are not unique together so in JOIN result there would be multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you and is scalable considering the fact that the schema keeps changing, you won't have to change this
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option - combine your both queries into one - first get all records with latest collection_time (using your second query) and then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
As per your comment, Considering you have a table with unique ID's for which you need to find latest collection_time. Here is another way to do it using Correlated Sub-Query. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which will happen with both self-join and correlated subquery in above answers). This solution only works with standard SQL (uncheck "Use Legacy SQL" option):
SELECT
ID,
(SELECT srow.*
FROM UNNEST(t.srows) srow
WHERE srow.collection_time = MAX(srow.collection_time))
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t

Select more columns with MAX function

Need to find in databse max value, but then i need read other values in columns.
Can this be done with one SQL command or I have to use this two commands?
SELECT MAX(id) FROM Table;
SELECT * FROM Table WHERE id = $value;
where $value is variable from 1st command
select * from your_table
where id = (select max(id) from your_table)
or
select t1.* from your_table t1
inner join
(
select max(id) as mid
from your_table
)
t2 on t1.id = t2.mid
Probably the simplest way is:
select *
from t
order by id
limit 1
Or use top 1 or where rownum = 1 or whatever is the right logic for your database.
Note: this only returns one row. If you have duplicate such rows, then comparison to the maximum will give you all of them.
Also, if you are using a database that supports window functions:
select *
from (select t.*, row_number() over (order by id desc) as seqnum
from t
) t
where seqnum = 1;

Select top and bottom rows

I'm using SQL Server 2005 and I'm trying to achieve something like this:
I want to get the first x rows and the last x rows in the same select statement.
SELECT TOP(5) BOTTOM(5)
Of course BOTTOM does not exist, so I need another solution. I believe there is an easy and elegant solution that I'm not getting. Doing the select again with GROUP BY DESC is not an option.
Using a union is the only thing I can think of to accomplish this
select * from (select top(5) * from logins order by USERNAME ASC) a
union
select * from (select top(5) * from logins order by USERNAME DESC) b
Check the link
SQL SERVER – How to Retrieve TOP and BOTTOM Rows Together using T-SQL
Did you try to using rownumber?
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER (Order BY columnName) as TopFive
,ROW_NUMBER() OVER (Order BY columnName Desc) as BottomFive
FROM Table
)
WHERE TopFive <=5 or BottomFive <=5
http://www.sqlservercurry.com/2009/02/select-top-n-and-bottom-n-rows-using.html
I think you've two main options:
SELECT TOP 5 ...
FROM ...
ORDER BY ... ASC
UNION
SELECT TOP 5 ...
FROM ...
ORDER BY ... DESC
Or, if you know how many items there are in the table:
SELECT ...
FROM (
SELECT ..., ROW_NUMBER() OVER (ORDER BY ... ASC) AS intRow
FROM ...
) AS T
WHERE intRow BETWEEN 1 AND 5 OR intRow BETWEEN #Number - 5 AND #Number
Is it an option for you to use a union?
E.g.
select top 5 ... order by {specify columns asc}
union
select top 5 ... order by {specify columns desc}
i guess you have to do it using subquery only
select * from table where id in (
(SELECT id ORDER BY columnName LIMIT 5) OR
(SELECT id ORDER BY columnName DESC LIMIT 5)
)
select * from table where id in (
(SELECT TOP(5) id ORDER BY columnName) OR
(SELECT TOP(5) id ORDER BY columnName DESC)
)
EDITED
select * from table where id in (
(SELECT TOP 5 id ORDER BY columnName) OR
(SELECT TOP 5 id ORDER BY columnName DESC)
)
No real difference between this and the union that I'm aware of, but technically it is a single query.
select t.*
from table t
where t.id in (select top 5 t2.id from table t2 order by MyColumn)
or
t.id in (select top 5 t2.id from table t2 order by MyColumn desc);
SELECT *
FROM (
SELECT x, rank() over (order by x asc) as rown
FROM table
) temp
where temp.rown = 1
or temp.rown = (select count(x) from table)
Then you are out - doing the select again IS the only option, unless you want to pull in the complete result set and then throwing away everything in between.
ANY sql I cna think of is the same way - for the bottom you need to know first either how many items you have (materialize everything or use count(*)) or a reverse sort order.
Sorry if that does not suit you, but at the end.... reality does not care, and I do not see any other way to do that.
I had to do this recently for a very large stored procedure; if your query is quite large, and you want to minimize the amount of queries you could declare a #tempTable, insert into that #tempTable then query from that #tempTable,
DECLARE #tempTable TABLE ( columns.. )
INSERT INTO #tempTable
VALUES ( SELECT.. your query here ..)
SELECT TOP(5) columns FROM #tempTable ORDER BY column ASC -- returns first to last
SELECT TOP(5) columns FROM #tempTable ORDER BY column DESC -- returns last to first