Most Efficient way to Search Massive Redshift Table for Duplicate Values - sql

I have a large Redshift table (hundreds of millions of rows, with ~50 columns per row).
I need to find rows that have duplicate values in a specific column.
Example:
if my table has the columns 'column_of_interest' and 'date_time', then in those hundreds of millions of rows, I need to find all the instances where the same 'column_of_interest' value appears more than once within a certain 'date_time' range.
eg:
column_of_interest date_time
ROW 1: ABCD-1234 165895896565
ROW 2: FCEG-3434 165895896577
ROW 3: ABCD-1234 165895986688
ROW 4: ZZZZ-9999 165895986689
ROW 5: ZZZZ-9999 165895987790
In the above, since ROW 1 and ROW 3 have the same column_of_interest, I would like that column_of_interest returned; ROW 4 and ROW 5 as well, so I would like those returned too.
So the end result would be:
duplicates
ABCD-1234
ZZZZ-9999
I have found a few things online, but the table is so large that the query times out before any results are returned. Am I going about this the wrong way? Here are a couple that I tried just to get the results back (but they time out before returning).
SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING COUNT(*) > 1
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
LIMIT 200
SELECT a.*
FROM my_table a
JOIN (SELECT column_of_interest, COUNT(*)
FROM my_table
GROUP BY column_of_interest
HAVING count(*) > 1 ) b
ON a.column_of_interest = b.column_of_interest
ORDER BY a.column_of_interest
LIMIT 200

This should be a fine method. And it should not "time out". Your version has a syntax error.
So try:
SELECT column_of_interest, COUNT(*)
FROM my_table
WHERE date_time >= 1601510400000 AND date_time < 1601596800000
GROUP BY column_of_interest
HAVING COUNT(*) > 1
LIMIT 200
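For anyone who wants to verify the clause order, here is a minimal sketch using SQLite via Python as a stand-in for Redshift (table name, column names, and sample rows are taken from the question; the timestamp bounds are adjusted to cover the sample data):

```python
import sqlite3

# In-memory stand-in for the Redshift table, loaded with the question's sample rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (column_of_interest TEXT, date_time INTEGER)")
con.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        ("ABCD-1234", 165895896565),
        ("FCEG-3434", 165895896577),
        ("ABCD-1234", 165895986688),
        ("ZZZZ-9999", 165895986689),
        ("ZZZZ-9999", 165895987790),
    ],
)

# Corrected clause order: WHERE filters rows first, then GROUP BY groups them,
# then HAVING filters the groups.
rows = con.execute(
    """
    SELECT column_of_interest, COUNT(*)
    FROM my_table
    WHERE date_time >= 165895896565 AND date_time < 165895987791
    GROUP BY column_of_interest
    HAVING COUNT(*) > 1
    ORDER BY column_of_interest
    """
).fetchall()
print(rows)  # [('ABCD-1234', 2), ('ZZZZ-9999', 2)]
```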

Related

SQL select two rows in one table generating independent output

I would like to generate an output, where I count two rows from the same table but with different conditions.
Now I have this SQL Statement which works:
select Datum, Count(ID), Count(Fläche)
FROM gustavo
where Fläche > 200
Group by Datum;
but it only gives me the sizes over 480 and the IDs over 480 in both columns. In the first column I would like to have the count of all IDs though. Any idea how that would work?
Thanks a lot
Try this (not checked)
SELECT
    Datum,
    COUNT(ID) AS all_id,
    SUM(CASE WHEN ID > 480 THEN 1 ELSE 0 END) AS id_over_480,
    COUNT(Fläche) AS all_flaeche
FROM gustavo
WHERE Fläche > 200
GROUP BY Datum;
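A sketch of the conditional-aggregation idea, using SQLite via Python on invented rows. Here the Fläche condition is moved from the WHERE clause into a CASE inside the aggregate, so the first count really does cover all IDs, which is what the question asks for (Flaeche stands in for Fläche):

```python
import sqlite3

# Toy version of the 'gustavo' table from the question; the rows are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gustavo (Datum TEXT, ID INTEGER, Flaeche INTEGER)")
con.executemany(
    "INSERT INTO gustavo VALUES (?, ?, ?)",
    [
        ("2020-01-01", 100, 150),
        ("2020-01-01", 500, 300),
        ("2020-01-01", 700, 250),
        ("2020-01-02", 400, 500),
    ],
)

# Count all IDs per Datum, but count Flaeche only when it exceeds 200,
# by expressing the condition as a CASE inside SUM instead of a WHERE filter.
rows = con.execute(
    """
    SELECT Datum,
           COUNT(ID) AS all_id,
           SUM(CASE WHEN Flaeche > 200 THEN 1 ELSE 0 END) AS flaeche_over_200
    FROM gustavo
    GROUP BY Datum
    ORDER BY Datum
    """
).fetchall()
print(rows)  # [('2020-01-01', 3, 2), ('2020-01-02', 1, 1)]
```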

How can I retrieve data from a Hive table from two columns with non null values and top 500 records in one query?

I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0, null, or some integer. The table is partitioned by day and hour.
The table has approx. 8 million x 96 records in a day's partition, and I am referring to 15 daily partitions.
Currently I am running separate queries to retrieve the top 500 records with a value greater than 0 using a rank function: one query to retrieve col_us and another for col_ds.
It is possible that col_US may have a numeric value while col_DS is 0 or null.
Question:
I want to retrieve top 500 non null and non 0 records from each of these columns from one query.
My Query:
From(
SELECT D.COL_US, D.DATESTAMP,
ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
FROM ${wf_table_name} D
WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
AND COL_US > 0)T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;
From your query I can guess that you are trying to get the top 500 rows from your table based on date/time, that is, the latest 500 rows where col_us and col_ds both have a value > 0, not the top 500 from each of these columns.
As per your question, your table may have two types of values, for example:
col_us
0
NULL
10
5
col_ds
5
10
0
NULL
or both column may have >0 value.
So instead of 'AND COL_US > 0' in the WHERE clause, use 'AND (COL_US > 0 AND col_ds > 0)'.
But with this condition you will not get any of the four rows stated above.
So if you want to get 10, 5 from col_us along with 5, 10 from col_ds, then I would say it's not possible using a single query.
Again, since your question states "I want to retrieve top 500 non null and non 0 records from each of these columns from one query", I can guess that you want to get the top 500 records from col_us/col_ds depending on the values of those columns; in that case you must use these columns within the rank clause instead of date/time.
You may be able to get what you want with an updated query depending on the other available columns, but before that I would request you to share exactly what you want (top 500 based on col_us/col_ds, or the latest 500) along with your base and target table structures.
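For what it's worth, the top-N-per-group pattern from the question can be illustrated on toy data. This sketch uses SQLite via Python with invented rows, top 2 standing in for top 500, and the Hive-specific parts (multi-table INSERT, partition variables) omitted:

```python
import sqlite3

# Minimal illustration of the ROW_NUMBER() top-N-per-group pattern.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id TEXT, col_us INTEGER, datestamp TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("a", 5, "2020-10-01"),
        ("a", 7, "2020-10-02"),
        ("a", 0, "2020-10-03"),
        ("b", 3, "2020-10-01"),
        ("b", 9, "2020-10-04"),
    ],
)

# Rank rows within each id by recency, keeping only rows with col_us > 0,
# then take the top 2 per group.
rows = con.execute(
    """
    SELECT id, col_us, datestamp
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY datestamp DESC) AS rnk
          FROM t
          WHERE col_us > 0) ranked
    WHERE rnk <= 2
    ORDER BY id, datestamp DESC
    """
).fetchall()
print(rows)
```

Note how the ("a", 0, ...) row is filtered out before ranking, so the rank numbers only count qualifying rows.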

How to subtract the content of a column of two rows

I have a table like this
and I want to return the difference between the two rows
SQL tables represent unordered sets. There is no ordering, unless a column specifies the ordering.
So, you can get the two values using MAX() and MIN(). This should do what you want:
select max(nbaction) - min(nbaction)
from t;
EDIT:
Given your actual problem, you have multiple choices. Here is one:
SELECT (SELECT nbaction
FROM analyse_page_fait
WHERE operateurdimid = 2
ORDER BY datedimid DESC
FETCH FIRST 1 ROW ONLY
) -
(SELECT nbaction
FROM analyse_page_fait
WHERE operateurdimid = 2
ORDER BY datedimid DESC
OFFSET 1
FETCH FIRST 1 ROW ONLY
) as diff
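A tiny runnable check of the MAX/MIN trick (SQLite via Python; the nbaction values are invented):

```python
import sqlite3

# Stand-in for a two-row table; with exactly two rows the difference
# between them is simply MAX minus MIN.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (nbaction INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(120,), (85,)])

diff = con.execute("SELECT MAX(nbaction) - MIN(nbaction) FROM t").fetchone()[0]
print(diff)  # 35
```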

Using "order by" and fetch inside a union in SQL on as400 database

Let's say I have this table
Table name: Traffic
Seq. Type Amount
1 in 10
2 out 30
3 in 50
4 out 70
What I need is to get the previous smaller and next larger amount of a value. So, if I have 40 as a value, I will get...
Table name: Traffic
Seq. Type Amount
2 out 30
3 in 50
I already tried doing it with MySQL and am quite satisfied with the results
(select * from Traffic where
Amount < 40 order by Amount desc limit 1)
union
(select * from Traffic where
Amount > 40 order by Amount desc limit 1)
The problem lies when I try to convert it to a SQL statement acceptable by AS400. It appears that the order by and fetch clauses (AS400 doesn't have a limit function so we use fetch, or does it?) are not allowed inside the select statement when I use it with a union. I always get a keyword-not-expected error. Here is my statement:
(select seq as sequence, type as status, amount as price from Traffic where
Amount < 40 order by price asc fetch first 1 rows only)
union
(select seq as sequence, type as status, amount as price from Traffic where
Amount > 40 order by price asc fetch first 1 rows only)
Can anyone please tell me what's wrong and how it should be? Also, please share if you know other ways to achieve my desired result.
How about a CTE? From memory (no machine to test with):
with
less as (select * from traffic where amount < 40),
more as (select * from traffic where amount > 40)
select * from traffic
where seq = (select seq from less where amount = (select max(amount) from less))
or seq = (select seq from more where amount = (select min(amount) from more))
I looked at this question from possibly another point of view. I have seen other questions about date-time ranges between rows, and I thought perhaps what you might be trying to do is establish what range some value might fall in.
If working with these ranges will be a recurring theme, then you might want to create a view for it.
create or replace view traffic_ranges as
with sorted as
( select t.*
, smallint(row_number() over (order by amount)) as pos
from traffic t
)
select b.pos range_seq
, b.seq beg_seq
, e.seq end_seq
, b.type beg_type
, e.type end_type
, b.amount beg_amt
, e.amount end_amt
from sorted b
join sorted e on e.pos = b.pos+1
;
Once you have this view, it becomes very simple to get your answer:
select *
from traffic_ranges
where 40 between beg_amt and end_amt
Or to get only one range where the search amount happens to be an amount in your base table, you would want to pick whether to include the beginning value or ending value as part of the range, and exclude the other:
where beg_amt < 40 and end_amt >= 40
One advantage of this approach is performance. If you are finding the range for multiple values, such as a column in a table or query, then having the range view should give you significantly better performance than a query where you must aggregate all the records that are more or less than each search value.
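Here is a runnable sketch of this ranges view on the sample Traffic data, using SQLite via Python in place of DB2 for i (the smallint() cast is dropped and seq/type are used as the column names, matching the sample table):

```python
import sqlite3

# Sample Traffic data from the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE traffic (seq INTEGER, type TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?)",
    [(1, "in", 10), (2, "out", 30), (3, "in", 50), (4, "out", 70)],
)

# View pairing each row with the next-larger amount, forming ranges.
con.execute(
    """
    CREATE VIEW traffic_ranges AS
    WITH sorted AS (
        SELECT t.*, ROW_NUMBER() OVER (ORDER BY amount) AS pos
        FROM traffic t
    )
    SELECT b.pos AS range_seq,
           b.seq AS beg_seq, e.seq AS end_seq,
           b.amount AS beg_amt, e.amount AS end_amt
    FROM sorted b
    JOIN sorted e ON e.pos = b.pos + 1
    """
)

# The range bracketing 40 is (30, 50), i.e. rows with seq 2 and 3.
row = con.execute(
    "SELECT beg_seq, end_seq, beg_amt, end_amt FROM traffic_ranges "
    "WHERE 40 BETWEEN beg_amt AND end_amt"
).fetchone()
print(row)  # (2, 3, 30, 50)
```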
Here's my query using CTE and union inspired by Buck Calabro's answer. Credits go to him and WarrenT for being SQL geniuses!
I won't be accepting my own answer. That will be unfair. hehe
with
apple(seq, type, amount) as (select seq, type, amount from traffic where amount < 40
order by amount desc fetch first 1 rows only),
banana(seq, type, amount) as (select seq, type, amount from traffic where
amount > 40 order by amount asc fetch first 1 rows only)
select * from apple
union
select * from banana
It's a bit slow but I can accept that since I'll only use it once in the program.
This is just a sample. The actual query is a bit different.

Split a query result based on the result count

I have a query based on basic criteria that will return X number of records on any given day.
I'm trying to check the result of the basic query then apply a percentage split to it based on the total of X and split it in 2 buckets. Each bucket will be a percentage of the total query result returned in X.
For example:
Query A returns 3500 records.
If the number of records returned from Query A is <= 3,000, then split the records into a 40% / 60% split (for 3,500 records: 1,400 / 2,100).
If the number of records returned from Query A is >= 3,001 and <= 50,000, then split the records into a 10% / 90% split. Etc., etc.
I want the actual records returned, not just math over the records that returns a single row with a count in it.
I'm not sure how you want to display the different parts of the resulting set of rows, so I've just added an additional column (part) to the result set, containing 1 for rows that belong to the first part and 2 for the second.
select z.*
, case
when cnt_all <= 3000 and cnt <= 40
then 1
when (cnt_all between 3001 and 50000) and (cnt <= 10)
then 1
else 2
end part
from (select t.*
, 100*(count(col1) over(order by col1) / count(col1) over() )cnt
, count(col1) over() cnt_all
from split_rowset t
order by col1
) z
Demo #1: number of rows = 3000.
Demo #2: number of rows = 3500.
For better usability you can create a view using the query above and then query that view filtering by part column.
Demo #3: using a view.
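A scaled-down, runnable sketch of this bucketing query (SQLite via Python; 10 invented rows, thresholds shrunk from 3000/50000 to fit the toy data, and 100.0 used to force decimal division in the running percentage):

```python
import sqlite3

# Toy version of the bucketing query: with 10 rows and the small-table rule
# active, the first 40% of rows (by col1 order) land in part 1, the rest in part 2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE split_rowset (col1 INTEGER)")
con.executemany("INSERT INTO split_rowset VALUES (?)", [(i,) for i in range(1, 11)])

rows = con.execute(
    """
    SELECT col1,
           CASE WHEN cnt_all <= 20 AND cnt <= 40 THEN 1 ELSE 2 END AS part
    FROM (SELECT col1,
                 100.0 * COUNT(*) OVER (ORDER BY col1)
                       / COUNT(*) OVER () AS cnt,
                 COUNT(*) OVER () AS cnt_all
          FROM split_rowset)
    ORDER BY col1
    """
).fetchall()
print(rows)  # rows 1-4 get part 1 (first 40%), rows 5-10 get part 2
```

As the answer suggests, wrapping this in a view and filtering on the part column then returns the actual records of either bucket.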