Sampling unique set of records in Oracle table - sql

I have an Oracle table from which I need to select a given percentage of records for each unique combination of a given set of columns.
For example,
SELECT distinct column1, column2, Column3 from TableX;
returns every unique combination of values in that table. I need a given percentage of the rows for each such combination. Currently I am using the following query to accomplish this, which is lengthy and slow.
SELECT *
FROM tableX Sample ( 3 )
WHERE Column1 = 'value1' and
Column2 = 'value2' and
Column3 = 'value3'
UNION
SELECT *
FROM tableX Sample ( 3 )
WHERE Column1 = 'value1' and
Column2 = 'value2' and
Column3 = 'value4'
UNION
…
…
SELECT *
FROM tableX Sample ( 3 )
WHERE Column1 = 'valueP' and
Column2 = 'valueQ' and
Column3 = 'valueR'
where the combination of suffixes in "value" is unique for that table (obtained from the first query).
How can I shorten the query and make it faster?

Here is one approach:
select t.*
from (select t.*,
row_number() over (partition by column1, column2, column3 order by dbms_random.value
) as seqnum,
count(*) over (partition by column1, column2, column3) as totcnt
from tablex t
) t
where seqnum / totcnt <= 0.10 -- or whatever your threshold is
It uses row_number() to assign a sequential number to rows in each group, in a random order. The where clause chooses the proportion that you want.
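To see the per-group sampling in action without an Oracle instance, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. The assumptions are mine, not part of the answer: the `payload` column and the group sizes are made up, `random()` stands in for `dbms_random.value`, and the filter is written as `seqnum <= 0.10 * totcnt` because SQLite truncates integer division.

```python
import sqlite3

# In-memory stand-in for tableX; the payload column is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablex (column1 TEXT, column2 TEXT, column3 TEXT, payload INTEGER)"
)
rows = [("a", "b", "c", i) for i in range(20)] + [("a", "b", "d", i) for i in range(50)]
conn.executemany("INSERT INTO tablex VALUES (?, ?, ?, ?)", rows)

# Same shape as the answer's query: number the rows of each group in random
# order, then keep the first 10% of each group.
SAMPLE_SQL = """
SELECT column1, column2, column3, payload
FROM (SELECT t.*,
             row_number() OVER (PARTITION BY column1, column2, column3
                                ORDER BY random()) AS seqnum,
             count(*) OVER (PARTITION BY column1, column2, column3) AS totcnt
      FROM tablex t)
WHERE seqnum <= 0.10 * totcnt
"""
sample = conn.execute(SAMPLE_SQL).fetchall()

# Count how many sampled rows each combination contributed.
per_group = {}
for c1, c2, c3, _ in sample:
    per_group[(c1, c2, c3)] = per_group.get((c1, c2, c3), 0) + 1
print(per_group)
```

Which rows are picked changes from run to run, but each combination always contributes the same number of rows (2 of 20 and 5 of 50 here).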

Related

Select row after filter row has a coincident column in sql

I have a database as below
Column1 column2 column3
A123 abc Def
A123 xyz Abc
B456 Gh Ui
I want to select the rows whose column1 value does not occur in any other row, using a SQL query.
In this case, the expected result is only the third row.
How to do it?
Thanks
You could use a join with a subselect that keeps only the values with count = 1:
select m.*
from my_table m
inner join (
select column1
from my_table
group by column1
having count(*) = 1
) t on t.column1 = m.column1
WITH CTE AS (
SELECT COUNT(Column1) OVER (PARTITION BY Column1) AS coincident, *
FROM my_table
)
SELECT * FROM CTE WHERE coincident = 1
I would use window functions:
select Column1, column2, column3
from (select t.*, count(*) over (partition by column1) as cnt
from t
) t
where cnt = 1;
However, there are other fun ways. For instance, aggregation:
select column1, max(column2) as column2, max(column3) as column3
from t
group by column1
having count(*) = 1;
Or if you know one of the other columns is going to have different values on different rows, then not exists may be the most efficient solution:
select t.*
from t
where not exists (select 1
from t t2
where t2.column1 = t.column1 and
t2.column2 <> t.column2
);
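If you want to check that these approaches agree, here is a rough sketch that runs the window-function, aggregation and not exists variants on the question's three sample rows. It uses an in-memory SQLite database from Python rather than the asker's (unstated) database, so treat it as an illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column1 TEXT, column2 TEXT, column3 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("A123", "abc", "Def"), ("A123", "xyz", "Abc"), ("B456", "Gh", "Ui")],
)

QUERIES = {
    # Windowed count per column1, keep rows whose count is 1.
    "window": """
        SELECT column1, column2, column3
        FROM (SELECT t.*, count(*) OVER (PARTITION BY column1) AS cnt FROM t)
        WHERE cnt = 1
    """,
    # Groups of size 1: max() just returns the single row's values.
    "aggregate": """
        SELECT column1, max(column2), max(column3)
        FROM t
        GROUP BY column1
        HAVING count(*) = 1
    """,
    # Relies on column2 differing between rows that share column1.
    "not_exists": """
        SELECT t.*
        FROM t
        WHERE NOT EXISTS (SELECT 1
                          FROM t t2
                          WHERE t2.column1 = t.column1
                            AND t2.column2 <> t.column2)
    """,
}
results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, rows in results.items():
    print(name, rows)
```

On this data all three return only the B456 row. The not exists form would miss duplicates whose column2 values are identical, which is why the answer makes it conditional.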

Subtract 2 rows using case statement in SQL Server 2008

My data is like below, it's in a single table
Column1 Column2
abc 100
abc 200
Now I need a result like the one below:
abc 100 //here 200-100
I am banging my head on how to achieve this.
I have tried to use the row_number and then subtract using case statement like
Select
column1,
sum(
case when rownum=1
then column2
end
-
case when rownum=2
then column2
end
)
from table
group by column1
But this is giving me null.
Assuming there is no attribute which can define row ordering -
;with cte as(
select
row_number() over (partition by Column1 order by (select null)) as IndexId,
Column1,
Column2
from #xyz
)
select sum(case when IndexID=1 then (-1 * Column2) else Column2 end), Column1
from cte
group by Column1
Input data-
create table #xyz (Column1 varchar(10), Column2 int)
insert into #xyz
select 'abc' ,100 union all
select 'abc' ,200
Assuming you have an attribute rownum in table which is always 1 or 2 (it can be generated by some row_number() as you suggest in question, according to any order that is suitable for you)
Column1 Column2 Rownum
------------------------
abc 100 1
abc 200 2
then you can simply use
Select
column1,
sum(
case when rownum=1
then column2
else -column2
end
)
from table
group by column1
It performs a sum of the Column2 per Column1, however, in the row having rownum = 2 the Column2 value is negated. Therefore in our example you end up with 100 + (-200) = -100
You could do:
select column1, max(column2) - min(column2)
from t
group by column1;
Here is a short form of the answer above, if you care (note that IIF requires SQL Server 2012 or later, so it will not work on 2008):
SELECT
column1,
SUM(IIF(rownum=1,column2,-column2))
FROM table
GROUP BY column1
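For comparison, here is a small sketch that runs the signed-sum idea and the max minus min idea side by side. It runs on in-memory SQLite from Python, not SQL Server, and it makes two assumptions of my own: the rows are numbered by Column2 within each Column1 (the question gives no ordering column), and a second group `def` is added just to show the grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (column1 TEXT, column2 INTEGER)")
conn.executemany(
    "INSERT INTO xyz VALUES (?, ?)",
    [("abc", 100), ("abc", 200), ("def", 5), ("def", 12)],  # 'def' rows are made up
)

# Number the rows per column1 (smallest value first, an assumption),
# negate the first row and sum, which gives second - first.
SIGNED_SUM_SQL = """
WITH numbered AS (
    SELECT column1, column2,
           row_number() OVER (PARTITION BY column1 ORDER BY column2) AS rn
    FROM xyz
)
SELECT column1, sum(CASE WHEN rn = 1 THEN -column2 ELSE column2 END)
FROM numbered
GROUP BY column1
ORDER BY column1
"""

# With exactly two rows per group, max - min gives the same difference.
MAX_MIN_SQL = """
SELECT column1, max(column2) - min(column2)
FROM xyz
GROUP BY column1
ORDER BY column1
"""

signed_sum = conn.execute(SIGNED_SUM_SQL).fetchall()
max_min = conn.execute(MAX_MIN_SQL).fetchall()
print(signed_sum)
print(max_min)
```

Both queries give 100 for abc here. Which row gets the minus sign decides whether the signed sum comes out as 100 or -100, so the ordering used for the row number matters.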

Recursive Lag Column Calculation in SQL

I am trying to write a procedure that inserts calculated table data into another table.
The problem I have is that I need each row's calculated column to be influenced by the result of the previous row's calculated column. I tried to lag the calculation itself but this does not work!
Such as:
(Max is a function I created that returns the highest of two values)
Id Product Model Column1 Column2
1 A 1 5 =MAX(Column1*2, Lag(Column2))
2 A 2 2 =MAX(Column1*2, Lag(Column2))
3 B 1 3 =MAX(Column1*2, Lag(Column2))
If I try the above in SQL:
SELECT
Column1,
MyMAX(Column1 * 2, LAG(Column2, 1, 0) OVER (PARTITION BY Product ORDER BY Model ASC)) As Column2
FROM Source
...it says column2 is unknown.
Output I get if I LAG the Column2 calculation:
Select Column1, MyMAX(Column1 * 2, LAG(Column1 * 2, 1, 0) OVER (PARTITION BY Product ORDER BY Model ASC)) As Column2
FROM Source
Id Column1 Column2
1 5 10
2 2 10
3 3 6
Why 6 on row 3? Because 3*2 > 2*2.
Output that I want:
Id Column1 Column2
1 5 10
2 2 10
3 3 10
Why 10 on row 3? Because previous result of 10 > 3*2
The problem is I can't lag the result of Column2 - I can only lag other columns or calculations of them!
Is there a technique of achieving this with LAG or must I use Recursive CTE? I read that LAG succeeds CTE so I assumed it would be possible. If not, what would this 'CTE' look like?
Edit: Or alternatively - what else could I do to resolve this calculation?
Edit
In hindsight, this problem is a running partitioned maximum over Column1 * 2. It can be done as simply as
SELECT Id, Column1, Model, Product,
MAX(Column1 * 2) OVER (Partition BY Model, Product Order BY ID ASC) AS Column2
FROM Table1;
Original Answer
Here's a way to do this with a recursive CTE, without LAG at all, by joining on incrementing row numbers. I haven't assumed that your Id is contiguous, so I have added an additional ROW_NUMBER(). You haven't mentioned any partitioning, so I haven't applied any. The query simply starts at the first row, and then projects the greater of the current Column1 * 2 and the preceding Column2.
WITH IncrementingRowNums AS
(
SELECT Id, Column1, Column1 * 2 AS Column2,
ROW_NUMBER() OVER (Order BY ID ASC) AS RowNum
FROM Table1
),
lagged AS
(
SELECT Id, Column1, Column2, RowNum
FROM IncrementingRowNums
WHERE RowNum = 1
UNION ALL
SELECT i.Id, i.Column1,
CASE WHEN (i.Column2 > l.Column2)
THEN i.Column2
ELSE l.Column2
END,
i.RowNum
FROM IncrementingRowNums i
INNER JOIN lagged l
ON i.RowNum = l.RowNum + 1
)
SELECT Id, Column1, Column2
FROM lagged;
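As a sanity check on the unpartitioned case, the sketch below runs this recursive CTE next to a plain running maximum, `MAX(Column1 * 2) OVER (ORDER BY Id)`, on the three rows from the question. It is run on in-memory SQLite from Python (hence `WITH RECURSIVE`), which is my substitution and not the environment from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (Id INTEGER, Product TEXT, Model INTEGER, Column1 INTEGER)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(1, "A", 1, 5), (2, "A", 2, 2), (3, "B", 1, 3)],
)

# Recursive version: carry the larger of the current Column1 * 2 and the
# previous row's result forward, one row number at a time.
RECURSIVE_SQL = """
WITH RECURSIVE IncrementingRowNums AS (
    SELECT Id, Column1, Column1 * 2 AS Column2,
           ROW_NUMBER() OVER (ORDER BY Id ASC) AS RowNum
    FROM Table1
),
lagged AS (
    SELECT Id, Column1, Column2, RowNum
    FROM IncrementingRowNums
    WHERE RowNum = 1
    UNION ALL
    SELECT i.Id, i.Column1,
           CASE WHEN i.Column2 > l.Column2 THEN i.Column2 ELSE l.Column2 END,
           i.RowNum
    FROM IncrementingRowNums i
    INNER JOIN lagged l ON i.RowNum = l.RowNum + 1
)
SELECT Id, Column1, Column2 FROM lagged ORDER BY Id
"""

# Window version: a running maximum over all rows so far, ordered by Id.
RUNNING_MAX_SQL = """
SELECT Id, Column1, MAX(Column1 * 2) OVER (ORDER BY Id ASC) AS Column2
FROM Table1
ORDER BY Id
"""

recursive_rows = conn.execute(RECURSIVE_SQL).fetchall()
running_rows = conn.execute(RUNNING_MAX_SQL).fetchall()
print(recursive_rows)
print(running_rows)
```

Both return 10 for every row, which matches the "Output that I want" table in the question.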
Edit, Re Partitions
Partitioning is much the same, by just dragging the Model + Product columns through, then partitioning by these in the row numbering (i.e. starting back at 1 each time the Product or Model resets), including these in the CTE JOIN condition and also in the final ordering.
WITH IncrementingRowNums AS
(
SELECT Id, Column1, Column1 * 2 AS Column2, Model, Product,
ROW_NUMBER() OVER (Partition BY Model, Product Order BY ID ASC) AS RowNum
FROM Table1
),
lagged AS
(
SELECT Id, Column1, Column2, Model, Product, RowNum
FROM IncrementingRowNums
WHERE RowNum = 1
UNION ALL
SELECT i.Id, i.Column1,
CASE WHEN (i.Column2 > l.Column2)
THEN i.Column2
ELSE l.Column2
END,
i.Model, i.Product,
i.RowNum
FROM IncrementingRowNums i
INNER JOIN lagged l
ON i.RowNum = l.RowNum + 1
AND i.Model = l.Model AND i.Product = l.Product
)
SELECT Id, Column1, Column2, Model, Product
FROM lagged
ORDER BY Model, Product, Id;

Finding rows that have many similar values and one different one

I'm trying to isolate a problem with a violation of a unique key index. I'm pretty certain that the cause is rows that have the same values in 3 columns not having the same value in the 4th (when they should). As an example...
Key Column1 Column2 Column3 Column4
1 A B C D
2 A B C D
3 A B C D
4 A B C Z
I basically want to select column 4, or some way to let me identify column 4. I know it's a matter of using aggregate functions but I'm not very familiar with them. Can anyone assist with a way to select Key, Column4 for rows that have a different column 4 value and the same column 1-3 values?
This is what you want:
select column1, column2, column3
from t
group by column1, column2, column3
having min(column4) <> max(column4)
Once you get the right values for the first three columns, you can join back in to get the specific rows.
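A minimal sketch of that join back, assuming the sample data from the question plus one clean group that I made up (keys 5 and 6). It runs on in-memory SQLite from Python, and the key column is named `row_key` here only to avoid using a keyword as a column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (row_key INTEGER PRIMARY KEY, "
    "column1 TEXT, column2 TEXT, column3 TEXT, column4 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "B", "C", "D"),
        (2, "A", "B", "C", "D"),
        (3, "A", "B", "C", "D"),
        (4, "A", "B", "C", "Z"),
        (5, "E", "F", "G", "H"),  # hypothetical consistent group
        (6, "E", "F", "G", "H"),
    ],
)

# Find the (column1, column2, column3) groups with more than one column4
# value, then join back to list every row of those groups.
JOIN_BACK_SQL = """
SELECT t.row_key, t.column4
FROM t
JOIN (SELECT column1, column2, column3
      FROM t
      GROUP BY column1, column2, column3
      HAVING min(column4) <> max(column4)) bad
  ON bad.column1 = t.column1
 AND bad.column2 = t.column2
 AND bad.column3 = t.column3
ORDER BY t.row_key
"""
suspects = conn.execute(JOIN_BACK_SQL).fetchall()
print(suspects)
```

This lists keys 1 to 4 with their column4 values and leaves the consistent group out, so the odd Z row is easy to spot.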
Or, you can use window functions like this:
select t.*
from (select t.*, min(column4) over (partition by column1, column2, column3) as min4,
max(column4) over (partition by column1, column2, column3) as max4
from t
) t
where min4 <> max4;
If NULL is a valid "other" value that you want to count, you will need additional logic for that.
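One possible shape for that additional logic, as a sketch only: min() and max() skip NULLs, so a group holding 'D' and NULL looks consistent to the min/max test. Adding a "has a NULL" flag to count(distinct column4) is one way to count NULL as its own value. The data below is made up, and the queries run on in-memory SQLite from Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (column1 TEXT, column2 TEXT, column3 TEXT, column4 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("A", "B", "C", "D"),
        ("A", "B", "C", None),  # NULL as the "other" value
        ("E", "F", "G", "H"),
        ("E", "F", "G", "H"),
        ("I", "J", "K", "L"),
        ("I", "J", "K", "M"),
    ],
)

# Plain min/max comparison: NULLs are ignored by the aggregates.
MIN_MAX_SQL = """
SELECT column1, column2, column3
FROM t
GROUP BY column1, column2, column3
HAVING min(column4) <> max(column4)
ORDER BY column1
"""

# NULL-aware variant: distinct non-NULL values, plus 1 if any NULL is present.
NULL_AWARE_SQL = """
SELECT column1, column2, column3
FROM t
GROUP BY column1, column2, column3
HAVING count(DISTINCT column4)
       + max(CASE WHEN column4 IS NULL THEN 1 ELSE 0 END) > 1
ORDER BY column1
"""

min_max_groups = conn.execute(MIN_MAX_SQL).fetchall()
null_aware_groups = conn.execute(NULL_AWARE_SQL).fetchall()
print(min_max_groups)
print(null_aware_groups)
```

The first query reports only the I/J/K group, while the second also reports A/B/C, where the mismatch is a NULL.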
If you want to get all columns, then (it could be simpler if a windowed count supported distinct, but it does not):
with cte1 as (
select distinct * from Table1
), cte2 as (
select
*,
count(column4) over(partition by column1, column2, column3) as cnt
from cte1
)
select * from cte2 where cnt > 1;
If you just want to select the key columns:
select
column1, column2, column3
from Table1
group by column1, column2, column3
having count(distinct column4) > 1

How do I commit/execute deletion row by row in SQL (MSSQL)

I have a simple but long table (a few million rows).
The table contains many paired rows which I need to delete.
The row data is not distinct!
There are also single rows (which have no paired row).
The pairs are defined by cross-matching information in two columns, concatenated with a third column.
I would like to have only one row for each data identifier.
Therefore, I need myTable to shrink immediately whenever the condition is met.
I tried:
myIndexColumn = Column1 + Column2 + Column3
myReversedIndexColumn = Column2 + Column1 + Column3
CREATE NONCLUSTERED INDEX myIndex1 ON myDB.dbo.myTable (
myIndexColumn ASC
)
CREATE NONCLUSTERED INDEX myIndex2 ON myDB.dbo.myTable (
myReversedIndexColumn ASC
)
DELETE FROM myDB.dbo.myTable
WHERE myIndexColumn in (SELECT myReversedIndexColumn FROM myDB.dbo.myTable)
The problem is that both rows of each pair are deleted instead of one of them being left.
Obviously, that is because the DELETE commits its changes only after running the entire transaction.
If I could persuade MS SQL 2008 R2 Express edition to commit the DELETE as soon as the condition is met, the SELECT clause would output a shorter list for each row tested for deletion.
How do I do that?
To not delete the cases where column1 = column2
DELETE FROM myDB.dbo.myTable
WHERE myIndexColumn in (SELECT myReversedIndexColumn FROM myDB.dbo.myTable)
AND column1 <> column2
To remove the duplicates where column1 = column2
;with cte as
(
select *,
row_number() over (
partition by Column1 + Column2 + Column3
order by (SELECT 1)
) rn
from yourtable
where column1 = column2
)
delete cte where rn > 1
The CTE can be used to delete all duplicates too
;with cte as
(
select *,
row_number() over (
partition by
CASE WHEN Column1 > Column2 THEN Column2 ELSE Column1 END +
CASE WHEN Column1 > Column2 THEN Column1 ELSE Column2 END +
Column3
order by (SELECT 1)
) rn
from yourtable
)
delete cte where rn > 1
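Here is a rough, runnable sketch of the same idea: put the two columns in a fixed order, number the rows within each normalized key, and delete everything after the first. It is written for in-memory SQLite from Python, which is my substitution: SQLite cannot delete through a CTE, so the sketch deletes by `rowid` instead, and it partitions by the separate columns rather than by their concatenation (concatenation can make 'ab' + 'c' collide with 'a' + 'bc'). The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Column1 TEXT, Column2 TEXT, Column3 TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [
        ("A", "B", "x"),
        ("B", "A", "x"),  # reversed pair of the first row
        ("C", "D", "y"),  # single row with no pair
        ("E", "E", "z"),
        ("E", "E", "z"),  # exact duplicate where Column1 = Column2
    ],
)

# min(a, b) / max(a, b) with two arguments are scalar functions in SQLite;
# they play the role of the CASE expressions in the answer.
DEDUPE_SQL = """
DELETE FROM myTable
WHERE rowid IN (
    SELECT rid
    FROM (SELECT rowid AS rid,
                 row_number() OVER (
                     PARTITION BY min(Column1, Column2),
                                  max(Column1, Column2),
                                  Column3
                     ORDER BY rowid) AS rn
          FROM myTable)
    WHERE rn > 1
)
"""
conn.execute(DEDUPE_SQL)

remaining = conn.execute(
    "SELECT Column1, Column2, Column3 FROM myTable ORDER BY rowid"
).fetchall()
print(remaining)
```

After the delete, one row of each pair survives and the unpaired row is untouched, without any row-by-row commits.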