Random sample in bigquery gives inconsistent results - google-bigquery

I'm using the RAND function in BigQuery to get a random sample of data, and unioning it with another sample of the same dataset.
This is for a machine learning problem where I'm interested in one class more than the other.
I've recreated the logic using a public dataset.
SELECT
  COUNT(1),
  bigarticle
FROM (
  SELECT 1 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE num_characters > 50000
), (
  SELECT 0 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE (is_redirect is null) AND (RAND() < 0.01)
)
GROUP BY bigarticle
Most of the time this behaves as expected,
giving one row with the count of rows where num_characters is more than 50k,
and giving another row with a count of a 1% sample of rows where is_redirect is null.
(This is an approximation of the logic I use in my internal dataset).
If you run this query repeatedly, occasionally it gives unexpected results.
In this result set (bquijob_124ad56f_15da8af982e) I only get a single row, containing the count of bigarticle = 1.

RAND does not use a deterministic seed. If you want deterministic results, you need to hash/fingerprint a column in the table and use a modulus to select a subset of values instead. Using legacy SQL:
#legacySQL
SELECT
  COUNT(1),
  bigarticle
FROM (
  SELECT 1 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE num_characters > 50000
), (
  SELECT 0 as [bigarticle]
  FROM [bigquery-public-data:samples.wikipedia]
  WHERE (is_redirect is null) AND HASH(title) % 100 = 0
)
GROUP BY bigarticle;
Using standard SQL in BigQuery, which is recommended since legacy SQL is not under active development:
#standardSQL
SELECT
  COUNT(*),
  bigarticle
FROM (
  SELECT 1 as bigarticle
  FROM `bigquery-public-data.samples.wikipedia`
  WHERE num_characters > 50000
  UNION ALL
  SELECT 0 as bigarticle
  FROM `bigquery-public-data.samples.wikipedia`
  WHERE (is_redirect is null) AND MOD(FARM_FINGERPRINT(title), 100) = 0
)
GROUP BY bigarticle;
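The same idea can be sketched outside SQL. As a rough Python analogue (hypothetical helper name, with hashlib standing in for FARM_FINGERPRINT), a stable hash of a key modulo 100 selects the same ~1% subset on every run, unlike RAND():

```python
import hashlib

def in_sample(key, percent=1):
    # Stable hash of the key, modulo 100: the analogue of
    # MOD(FARM_FINGERPRINT(title), 100). hashlib is used because Python's
    # built-in hash() is salted per process and would not be repeatable.
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent

titles = ["Article %d" % i for i in range(10000)]
sample = [t for t in titles if in_sample(t)]
# Re-running the comprehension yields the identical sample every time.
```

Unlike RAND() < 0.01, membership of a given title never changes between runs, so the two halves of the union stay consistent with each other.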

Related

Using a case statement as an if statement

I am attempting to create an IF statement in BigQuery. I have built a concept that works, but it does not select data from a table; I can only get it to display 1 or 0.
Example:
SELECT --AS STRUCT
  CASE
    WHEN (
      Select Count(1) FROM ( -- If the records are the same this returns 0; if they differ it returns > 0
        Select Distinct ESCO, SOURCE, LDCTEXT, STATUS, DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
        from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
        Except Distinct
        Select Distinct ESCO, SOURCE, LDCTEXT, STATUS, DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
        from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
      )
    ) = 0
    THEN
      (Select * from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest`)
      -- This does not work: "Scalar subquery cannot have more than one column
      -- unless using SELECT AS STRUCT to build STRUCT values at [16:4]"
  END
SELECT --AS STRUCT
  CASE
    WHEN (
      Select Count(1) FROM ( -- If the records are the same this returns 0; if they differ it returns > 0
        Select Distinct ESCO, SOURCE, LDCTEXT, STATUS, DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
        from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
        Except Distinct
        Select Distinct ESCO, SOURCE, LDCTEXT, STATUS, DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
        from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
      )
    ) = 0
    THEN 1 --- This does work
    Else 0
  END
How can I get this query to return results from an existing table?
Your question is still a little generic, so my answer is too; it just mimics your use case to the extent I can reverse-engineer it from your comments.
So, in the code below, project.dataset.yourtable mimics your table, while
project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered mimic your respective views.
#standardSQL
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'bbb' cols, 'latest' filter
), `project.dataset.yourtable_Prior_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'prior'
), `project.dataset.yourtable_Latest_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'latest'
), check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
the result is
Row cols filter
1 aaa prior
2 bbb latest
if you change your table to
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'aaa' cols, 'latest' filter
) ......
the result will be
Row cols filter
Query returned zero records.
I hope this points you in the right direction.
Added more explanations:
I could be wrong, but per your question it looks like you have one table, project.dataset.yourtable, and two views, project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered, which represent the state of your table before and after some event.
So, the first three CTEs in the answer above just mimic the table and views you described in your question.
They are there so you can see the concept and play with it, without any extra work, before adjusting it to your real use case.
For your real use case you should omit them and use your real table and view names, with whatever columns they have.
So the query for you to play with is:
#standardSQL
WITH check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
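The shape of that query can be mirrored in a few lines of Python (illustrative data and names, not the actual tables): take the set difference of the latest and prior snapshots, and return the full table only when that difference is non-empty:

```python
def rows_if_changed(table, prior, latest):
    # EXCEPT DISTINCT analogue: rows present in latest but not in prior.
    changed = bool(set(latest) - set(prior))
    # CROSS JOIN check WHERE check.changed analogue: all rows if changed, else none.
    return list(table) if changed else []

table = [("aaa", "prior"), ("bbb", "latest")]
print(rows_if_changed(table, prior=["aaa"], latest=["bbb"]))  # both rows
print(rows_if_changed(table, prior=["aaa"], latest=["aaa"]))  # []
```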
It should be a very simple IF statement in any language.
Unfortunately, no: it cannot be done with just a simple IF. If you see fit, you can submit a feature request to the BigQuery team for whatever you think makes sense.

MS SQL does not return the expected top row when ordering by DIFFERENCE()

I have noticed strange behaviour in some SQL code used for address matching at the company I work for, and have created some test SQL to illustrate the issue.
; WITH Temp (Id, Diff) AS (
SELECT 9218, 0
UNION
SELECT 9219, 0
UNION
SELECT 9220, 0
)
SELECT TOP 1 * FROM Temp ORDER BY Diff DESC
Returns 9218 but
; WITH Temp (Id, Name) AS (
SELECT 9218, 'Sonnedal'
UNION
SELECT 9219, 'Lammermoor'
UNION
SELECT 9220, 'Honeydew'
)
SELECT TOP 1 *, DIFFERENCE(Name, '') FROM Temp ORDER BY DIFFERENCE(Name, '') DESC
returns 9219, even though DIFFERENCE() is 0 for all records, as you can see here:
; WITH Temp (Id, Name) AS (
SELECT 9218, 'Sonnedal'
UNION
SELECT 9219, 'Lammermoor'
UNION
SELECT 9220, 'Honeydew'
)
SELECT *, DIFFERENCE(Name, '') FROM Temp ORDER BY DIFFERENCE(Name, '') DESC
which returns
Id    Name        DIFFERENCE
9218  Sonnedal    0
9219  Lammermoor  0
9220  Honeydew    0
Does anyone know why this happens? I am writing C# to replace existing SQL, and I need to return the same results so I can test that my code matches. But I can't see why the actual SQL returns 9219 rather than 9218; it doesn't seem to make sense. It appears to come down to the DIFFERENCE() function, yet that returns 0 for all the records in question.
When you call:
SELECT TOP 1 *, DIFFERENCE(Name, '')
FROM Temp l
ORDER BY DIFFERENCE(Name, '') DESC
All three records have a DIFFERENCE value of zero, and hence SQL Server is free to choose from any of the three records for ordering. That is to say, there is no guarantee which order you will get. The same is true for your second query. Actually, it is possible that the ordering for the same query could even change over time. In practice, if you expect a certain ordering, you should provide exact logic for it, e.g.
SELECT TOP 1 *
FROM Temp
ORDER BY Id;
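The effect of a deterministic tie-breaker can be seen in miniature with a Python sketch (illustrative data, mirroring the rows above): when every row ties on the ranking value, only an explicit secondary key such as Id pins down which row comes first:

```python
rows = [(9218, "Sonnedal"), (9219, "Lammermoor"), (9220, "Honeydew")]

# Every row "scores" 0, just as DIFFERENCE(Name, '') is 0 for each name.
score = lambda row: 0

# Ordering by score alone leaves the winner unspecified in SQL; adding
# the Id as a tie-breaker makes the TOP 1 result deterministic.
top1 = min(rows, key=lambda r: (score(r), r[0]))
print(top1)  # (9218, 'Sonnedal')
```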

Offset and fetch in oracle sql developer

We have data in the millions (1,698,393 total rows). Exporting this data as text takes 4 hours. I need to know whether there is a way to reduce the export time for that many records from an Oracle database using SQL Developer.
with cte as (
select *
from (
select distinct
system_serial_number,
( select s.system_status
from eim_pr_system s
where s.system_serial_number=a.system_serial_number
) system_status,
( select SN.cmat_customer_id
from EIM.eim_pr_ib_latest SN
where SN.role_id=19
and SN.system_serial_number=a.system_serial_number
) SN_cmat_customer_id,
( select EC.cmat_customer_id
from EIM.eim_pr_ib_latest EC
where EC.role_id=1
and a.system_serial_number=EC.system_serial_number
) EC_cmat_customer_id
from EIM.eim_pr_ib_latest a
where a.role_id in (1,19)
)
where nvl(SN_cmat_customer_id,0)!=nvl(EC_cmat_customer_id,0)
)
select system_serial_number,
system_status,
SN_CMAT_Customer_ID,
EC_CMAT_Customer_ID,
C.Customer_Name SN_Customer_Name,
D.Customer_Name EC_Customer_Name
from cte,
eim.eim_party c,
eim.eim_party D
where c.CMAT_Customer_ID=SN_cmat_customer_id
and D.CMAT_Customer_ID=EC_cmat_customer_id
offset 5001 rows fetch next 200000 rows only;
You can get rid of a lot of the joins and correlated sub-queries (which will speed things up by reducing the number of table scans) by doing something like:
SELECT a.system_serial_number,
s.system_status,
a.SN_cmat_customer_id,
a.EC_cmat_customer_id,
a.SN_customer_name,
a.EC_customer_name
FROM (
SELECT l.system_serial_number,
MAX( CASE l.role_id WHEN 19 THEN l.cmat_customer_id END ) AS SN_cmat_customer_id,
MAX( CASE l.role_id WHEN 1 THEN l.cmat_customer_id END ) AS EC_cmat_customer_id,
MAX( CASE l.role_id WHEN 19 THEN p.customer_name END ) AS SN_customer_name,
MAX( CASE l.role_id WHEN 1 THEN p.customer_name END ) AS EC_customer_name
FROM EIM.eim_pr_ib_latest l
INNER JOIN
EIM.eim_party p
ON ( p.CMAT_Customer_ID= l.cmat_customer_id )
WHERE l.role_id IN ( 1, 19 )
GROUP BY system_serial_number
HAVING NVL( MAX( CASE l.role_id WHEN 19 THEN l.cmat_customer_id END ), 0 )
<> NVL( MAX( CASE l.role_id WHEN 1 THEN l.cmat_customer_id END ), 0 )
) a
LEFT OUTER JOIN
eim_pr_system s
ON ( s.system_serial_number=a.system_serial_number )
Since your original query is not throwing a "single-row subquery returns more than one row" error (ORA-01427) on the correlated sub-queries, I am assuming that your data is such that only a single row is returned for each correlated query, and the above query should reflect your output (although without some sample data it is difficult to test).
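The conditional-aggregation trick can be sketched in Python (toy data, hypothetical ids) to show how the MAX(CASE role_id ...) pattern collapses the two role rows per serial number into one row, which is what removes the repeated table scans:

```python
from collections import defaultdict

# (serial, role_id, customer_id) rows, as in eim_pr_ib_latest (toy data).
rows = [("S1", 19, 101), ("S1", 1, 202),
        ("S2", 19, 303), ("S2", 1, 303)]

# GROUP BY serial; MAX(CASE role_id WHEN 19/1 THEN id END) analogues.
pivot = defaultdict(lambda: {"SN": None, "EC": None})
for serial, role, cust in rows:
    pivot[serial]["SN" if role == 19 else "EC"] = cust

# HAVING NVL(SN, 0) <> NVL(EC, 0): keep serials whose two ids differ.
mismatched = {s: v for s, v in pivot.items() if (v["SN"] or 0) != (v["EC"] or 0)}
print(mismatched)  # {'S1': {'SN': 101, 'EC': 202}}
```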
Apart from making the query faster, there is a way to achieve a faster export using SQL Developer.
When you use the data grid's export feature, it executes the query again. The only time this won't happen is if you have fetched ALL the rows into the grid. Doing that for very large data sets will be 'expensive' on the client side, but you can avoid it.
For a faster export, add a /*csv*/ comment to your select and wrap the statement with spool c:\my_file.csv ... spool off. Then collapse the script output panel and run the script with F5. As the data is fetched, it is written to that file in CSV format. The available format hints are:
/*csv*/
/*xml*/
/*json*/
/*html*/
/*insert*/
I talk about this feature in detail here.

SQL query to sort and select from the selected

I would like some help with this issue; I have a news table.
I want to select 2000 terms and sort them, then check whether a term exists in those 2000: if it does, show it, otherwise return 0.
Something like this:
SELECT TOP 1000 [terms]
,[frequency]
,[occurance]
,[idf]
,[tfidf]
FROM [Central].[news]
ORDER BY tfidf DESC;
IF @@ROWCOUNT = 0
select 0 as FinalResult;
ELSE
if @@ROWCOUNT < 2000
select * from [CentralFinance].[dbo].[TFIDF_1] where terms = 'project'
Perhaps this is helpful. It is a total guess:
select top 2000 -- or 1000?
terms, frequency, occurance, idf, tfidf
from Central.news
order by tfidf desc;
if @@rowcount > 0 begin
select * from CentralFinance.dbo.TFIDF_1
where terms in (
select top 2000 terms
from Central.news
order by tfidf desc
);
select 1 as FinalResult;
end
else begin
select 0 as FinalResult;
end
Another thought is:
if exists (select 1 from Central.news) begin
select * from CentralFinance.dbo.TFIDF_1
where terms in (
select top 2000 terms
from Central.news
order by tfidf desc
);
select 1 as FinalResult;
end
else begin
select 0 as FinalResult;
end
And finally a third guess:
select sign(count(*)) as FinalResult
from (select 1) dummy
where 'project' in
(
select top 2000 terms
from Central.news
order by tfidf desc
)
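The third guess can be checked in miniature with Python (toy data): rank by tfidf, keep the top N terms, and emit 1 or 0 for membership:

```python
news = [("project", 0.9), ("budget", 0.5), ("alpha", 0.1)]

# TOP 2000 ... ORDER BY tfidf DESC, collected as a set of terms.
top_terms = {term for term, _ in
             sorted(news, key=lambda r: r[1], reverse=True)[:2000]}

final_result = 1 if "project" in top_terms else 0
print(final_result)  # 1
```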
You can use a temp table to store the initial query results and then query the temp table for the new dataset. Or you can add a subquery to the WHERE clause. What do you want returned from the query?

SQL Server check if where clause is true for any row

I want to select the provinces which intersect any railroad. I am doing it like this (using SQL Spatial):
SELECT * FROM ProvinceTable
WHERE (
SELECT count(*)
FROM RailroadTable
WHERE ProvinceTable.Shape.STIntersects(RailroadTable.Shape) > 1
) > 0
But this is not efficient, because it has to check the intersection between every single railroad geometry and province geometry in order to compute the count. It would be better for the WHERE clause to stop as soon as the first intersection is detected, with no need to check the others. Here is what I mean:
SELECT * FROM ProvinceTable
WHERE (
--return true if this is true for any row in the RailroadTable:
-- "ProvinceTable.Shape.STIntersects(RailroadTable.Shape) > 1"
)
So is there a better way to rewrite this query for such a goal?
EDIT
Surprisingly, this query takes the same time and returns no rows:
SELECT * FROM ProvinceTable
WHERE EXISTS (
SELECT *
FROM RailroadTable
WHERE ProvinceTable.Shape.STIntersects(RailroadTable.Shape) > 1
)
You want to use exists:
SELECT pt.*
FROM ProvinceTable pt
WHERE EXISTS (SELECT 1
FROM RailroadTable rt
WHERE pt.Shape.STIntersects(rt.Shape) = 1
);
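Why EXISTS helps can be seen in a small Python analogue (toy geometries as integer sets, with set intersection standing in for STIntersects): any() stops at the first match instead of counting every railroad, just as EXISTS lets the engine stop at the first intersecting shape:

```python
def intersects(province, railroad):
    # Stand-in for Shape.STIntersects(...) = 1 (toy geometries as sets).
    return bool(province & railroad)

provinces = {"P1": {1, 2}, "P2": {9}}
railroads = [{2, 3}, {4, 5}, {5, 6}]

# any() short-circuits: it returns as soon as one railroad intersects,
# the analogue of WHERE EXISTS versus computing COUNT(*) > 0.
crossed = [name for name, shape in sorted(provinces.items())
           if any(intersects(shape, r) for r in railroads)]
print(crossed)  # ['P1']
```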