Aggregate randomly? - sql

I'm using Microsoft's SQL Server 2008. I need to aggregate by a foreign key to randomly get a single value, but I'm stumped. Consider the following table:
id fk val
----------- ----------- ----
1 100 abc
2 101 def
3 102 ghi
4 102 jkl
The desired result would be:
fk val
----------- ----
100 abc
101 def
102 ghi
Where the val for fk 102 would randomly be either "ghi" or "jkl".
I tried using NEWID() to get unique random values, however, the JOIN fails since the NEWID() value is different depending on the sub query.
WITH withTable AS (
SELECT id, fk, val, CAST(NEWID() AS CHAR(36)) random
FROM exampleTable
)
SELECT t1.fk, t1.val
FROM withTable t1
JOIN (
SELECT fk, MAX(random) random
FROM withTable
GROUP BY fk
) t2 ON t2.random = t1.random
;
I'm stumped. Any ideas would be greatly appreciated.

I might think about it a little differently, using a special ranking function called ROW_NUMBER().
You basically apply a number to each row, grouped by fk, starting with 1, ordered randomly by using the NEWID() function as a sort value. From this you can select all rows where the row number was 1. The effect of this technique is that it randomizes which row gets assigned the value 1.
WITH withTable(id, fk, val, rownum) AS
(
SELECT
id, fk, val, ROW_NUMBER() OVER (PARTITION BY fk ORDER BY NEWID())
FROM
exampleTable
)
SELECT
*
FROM
withTable
WHERE
rownum = 1
This approach has the added benefit that it handles both the grouping and the randomization in a single pass.
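If you want to try the pattern without a SQL Server instance, here is a minimal sketch that runs it through Python's sqlite3 module. It assumes SQLite 3.25 or later (needed for window functions) and swaps NEWID() for SQLite's RANDOM(); the helper name pick_random_per_fk is made up for illustration.

```python
import sqlite3

def pick_random_per_fk(rows):
    """Return one randomly chosen (fk, val) pair per fk value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE exampleTable (id INTEGER, fk INTEGER, val TEXT)")
    conn.executemany("INSERT INTO exampleTable VALUES (?, ?, ?)", rows)
    # RANDOM() stands in for NEWID(): it shuffles the rows inside each fk group,
    # so a different row can end up with rownum = 1 on each run.
    query = """
        WITH withTable AS (
            SELECT id, fk, val,
                   ROW_NUMBER() OVER (PARTITION BY fk ORDER BY RANDOM()) AS rownum
            FROM exampleTable
        )
        SELECT fk, val FROM withTable WHERE rownum = 1 ORDER BY fk
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

sample = [(1, 100, "abc"), (2, 101, "def"), (3, 102, "ghi"), (4, 102, "jkl")]
picked = pick_random_per_fk(sample)
print(picked)
```

The row for fk 102 should come back as either "ghi" or "jkl", varying between runs.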

You can do this not with aggregation but with row_number():
select id, fk, val
from (select t1.*,
row_number() over (partition by fk order by newid()) as seqnum
from exampleTable t1
) t1
where seqnum = 1

Another option is to copy the rows that share the same fk into a temp table, then SELECT TOP 1 from it with ORDER BY NEWID().
That should work for you.

Related

SQL script to identify row based on min value

How to write a SQL statement (in SQL Server) to get a row with minimum value based on two columns?
For example:
Type Rank Val1 val2
------------------------------
A 6 486.57 38847
B 6 430 56345
C 5 390 99120
D 5 329 12390
E 4 350 11109
E 4 320 11870
The SQL statement should return the last row in above table, because it has min value for Rank, and Val1.
Something like this:
select *
from Table1
where rank = (select min(rank) from Table1)
and Val1 = (select min(Val1)
from Table1
where rank = (select min(rank) from Table1))
Or this, if you like a simple life:
select top 1 *
from Table1
order by rank asc, Val1 asc
with cte as (
select *, row_number() over (order by rank, val1) as rn
from dbo.yourTable
)
select *
from cte
where rn = 1;
The idea here is that I'm assigning a 1..n enumeration to the rows based on rank and, in the case of ties, Val1. I return the row that takes the value of 1. If there is the possibility of a tie, use rank() instead of row_number().
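To see the difference between row_number() and rank() when there is a tie, here is a small sketch using Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). The extra row F is not in the question; it is added only to force a tie on Rank and Val1, and the helper first_rows is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Rank" is quoted so it is always read as a column name.
conn.execute('CREATE TABLE Table1 (Type TEXT, "Rank" INTEGER, Val1 REAL, Val2 INTEGER)')
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [("A", 6, 486.57, 38847), ("B", 6, 430, 56345), ("C", 5, 390, 99120),
     ("D", 5, 329, 12390), ("E", 4, 350, 11109), ("E", 4, 320, 11870),
     ("F", 4, 320, 500)],  # extra row (assumption) that ties with the last E row
)

def first_rows(window_fn):
    """Return (Type, Val2) for the rows numbered 1 by the given window function."""
    query = f"""
        SELECT Type, Val2
        FROM (SELECT *, {window_fn}() OVER (ORDER BY "Rank", Val1) AS rn
              FROM Table1) AS t
        WHERE rn = 1
        ORDER BY Val2
    """
    return conn.execute(query).fetchall()

by_row_number = first_rows("ROW_NUMBER")  # exactly one row, chosen arbitrarily among ties
by_rank = first_rows("RANK")              # every row tied for first place
print(by_row_number, by_rank)
```

With row_number() you get a single row even when two rows tie, and which one is not guaranteed; with rank() both tied rows come back.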
I'm assuming that Type is the primary key for your table, and that you only want a row that has both the lowest Val1 and lowest Val2 (so if one row has the lowest Val1, but not the lowest Val2, this returns no data). I'm not sure about these assumptions, but your question could probably be clarified a bit.
Here's the code:
SELECT
*
FROM
Table1
WHERE
Val1 = (SELECT MIN(Val1) FROM Table1)
AND Val2 = (SELECT MIN(Val2) FROM Table1)

How to find minimum values in a column in sql

If I have a table like this:
id name value
1 abc 1
2 def 4
3 ghi 1
4 jkl 2
How can I select a new table that still has id, name, value but only the ones with a minimum value.
In this example I need this table back:
1 abc 1
3 ghi 1
Finding those values is pretty straightforward:
SELECT *
FROM YourTable
WHERE value = (SELECT MIN(Value) FROM YourTable);
As for the right syntax for putting those rows in another table, that will depend on the database engine that you are using.
An alternative to @Lamak's solution could be to use the rank window function. Depending on the exact scenario, it may perform considerably better:
SELECT id, name, value
FROM (SELECT id, name, value, RANK() OVER (ORDER BY value ASC) AS rk
FROM mytable) t
WHERE rk = 1
I'm not sure this is exactly what you're trying to do, but I think this would work:
--creating #temp1 to recreate your table/example
CREATE TABLE #TEMP1
(id INT NOT NULL PRIMARY KEY,
name CHAR(3) NOT NULL,
value INT NOT NULL)
INSERT INTO #TEMP1
VALUES
(1,'abc',1),
(2,'def',4),
(3,'ghi',1),
(4,'jkl',2)
--verify correct
SELECT * FROM #temp1
--populate new table with min value from table 1
SELECT *
INTO #TEMP2
FROM #TEMP1
WHERE value = (SELECT MIN(value)
FROM #TEMP1)
SELECT * FROM #TEMP2

Update based on Row number

Please consider this table:
id Col1 Col2
--------------------------
1 nima null
18 john null
25 sara null
I have a select that returns these records:
id Col
---------------
1 LA
2 WA
3 FL
I want to update the table above with these records, in the same order you see here. For example, set LA for nima, WA for john, and so on.
How can I do this?
Thanks
You can update based on row_number() and the fact that you can update common table expressions in SQL Server, like this:
with cte1 as (
select Col2, row_number() over(order by id) as rn
from Table1
), cte2 as (
select col, row_number() over(order by id) as rn
from Table2
)
update c1 set
Col2 = c2.col
from cte1 as c1
left outer join cte2 as c2 on c2.rn = c1.rn
Note that if your tables are large, performance may be poor. If that is the case, consider creating temporary tables with row_number columns and either making those columns the primary key or creating appropriate indexes on them.
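For anyone who wants to check the row-number pairing outside SQL Server, here is a rough sketch using Python's sqlite3 module (assuming SQLite 3.25 or later). SQLite cannot update through a CTE the way SQL Server can, so this version swaps in a correlated subquery that looks up the matching row number; the pairing logic is otherwise the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT);
    CREATE TABLE Table2 (id INTEGER PRIMARY KEY, col TEXT);
    INSERT INTO Table1 VALUES (1, 'nima', NULL), (18, 'john', NULL), (25, 'sara', NULL);
    INSERT INTO Table2 VALUES (1, 'LA'), (2, 'WA'), (3, 'FL');
""")

# Number the rows of both tables by id, then copy col across wherever the
# row numbers match. The correlated subquery replaces the updatable CTE.
conn.execute("""
    WITH c1 AS (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM Table1),
         c2 AS (SELECT col, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM Table2)
    UPDATE Table1
    SET Col2 = (SELECT c2.col
                FROM c1 JOIN c2 ON c2.rn = c1.rn
                WHERE c1.id = Table1.id)
""")

updated = conn.execute("SELECT id, Col1, Col2 FROM Table1 ORDER BY id").fetchall()
print(updated)  # [(1, 'nima', 'LA'), (18, 'john', 'WA'), (25, 'sara', 'FL')]
```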

Checking for duplicate data in SQL Server

Please don't ask me why but there is a lot of duplicate data where every field is duplicated.
For example
alex, 1
alex, 1
liza, 32
hary, 34
I will need to eliminate from this table one of the alex, 1 rows
I know this algorithm will be very inefficient, but it does not matter. I will need to remove duplicate data.
What is the best way to do this? Please keep in mind I do not have 2 fields, I actually have about 10 fields to check on.
As you said, yes this will be very inefficient, but you can try something like
DECLARE @TestTable TABLE(
Name VARCHAR(20),
SomeVal INT
)
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'alex', 1
INSERT INTO @TestTable SELECT 'liza', 32
INSERT INTO @TestTable SELECT 'hary', 34
SELECT *
FROM @TestTable
;WITH DuplicateVals AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY Name, SomeVal ORDER BY (SELECT NULL)) RowID
FROM @TestTable
)
DELETE FROM DuplicateVals WHERE RowID > 1
SELECT *
FROM @TestTable
I understand this does not answer the specific question (eliminating dupes in SAME table), but I'm offering the solution because it is very fast and might work best for the author.
Here is a speedy solution, if you don't mind creating a new table: create a new table named NewTable with the same schema.
Execute this SQL
Insert into NewTable
Select
name,
num
from
OldTable
group by
name,
num
Just include every field name in both the select and group by clauses.
Method A. You can get a deduped version of your data using
SELECT field1, field2, ...
INTO Deduped
FROM Source
GROUP BY field1, field2, ...
for example, for your sample data,
SELECT name, number
FROM Source
GROUP BY name, number
yields
alex 1
hary 34
liza 32
then simply delete the old table, and rename the new one. Of course, there are a number of fancy in-place solutions, but this is the clearest way to do it.
Method B. An in-place method is to create a primary key and delete duplicates that way. For example, you can
ALTER TABLE Source ADD sid INT IDENTITY(1,1);
which makes Source look like this
alex 1 1
alex 1 2
liza 32 3
hary 34 4
then you can use
DELETE FROM Source
WHERE sid NOT IN
(SELECT MIN(sid)
FROM Source
GROUP BY name, number)
which will give the desired result. Of course, "NOT IN" is not exactly the most efficient, but it will do the job. Alternatively, you can LEFT JOIN the grouped table (maybe stored in a TEMP table), and do the DELETE that way.
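Method B can be tried quickly with Python's sqlite3 module. This is only a sketch under one assumption: SQLite has no IDENTITY column to add after the fact, so the table's built-in rowid plays the role of sid here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Source (name TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO Source VALUES (?, ?)",
    [("alex", 1), ("alex", 1), ("liza", 32), ("hary", 34)],
)

# Keep the lowest rowid in each (name, number) group and delete the rest.
# rowid stands in for the sid IDENTITY column from the answer.
conn.execute("""
    DELETE FROM Source
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM Source GROUP BY name, number)
""")

remaining = conn.execute("SELECT name, number FROM Source ORDER BY name").fetchall()
print(remaining)  # [('alex', 1), ('hary', 34), ('liza', 32)]
```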
create table DuplicateTable(name varchar(10), number int)
insert DuplicateTable
values
('alex', 1),
('alex', 1),
('liza', 32),
('hary', 34);
with cte
as
(
select *, row_number() over(partition by name, number order by name) RowNumber
from DuplicateTable
)
delete cte
where RowNumber > 1
A bit different solution which requires primary key(or unique index):
Suppose you have a table your_table(id - PK, name, and num)
DELETE
FROM your_table
FROM your_table AS t2
WHERE
(select COUNT(*) FROM your_table y
where t2.name = y.name and t2.num = y.num) >1
AND t2.id !=
(SELECT top 1 id FROM your_table z
WHERE t2.name = z.name and t2.num = z.num);
I assumed that name and num are NOT NULL. If they can contain NULL values, you need to change the WHERE clauses in the subqueries.

Select DISTINCT, return entire row

I have a table with 10 columns.
I want to return all rows for which Col006 is distinct, but return all columns...
How can I do this?
if column 6 appears like this:
| Column 6 |
| item1 |
| item1 |
| item2 |
| item1 |
I want to return two rows, one of the records with item1 and the other with item2, along with all other columns.
In SQL Server 2005 and above:
;WITH q AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY col6 ORDER BY id) rn
FROM mytable
)
SELECT *
FROM q
WHERE rn = 1
In SQL Server 2000, provided that you have a primary key column:
SELECT mt.*
FROM (
SELECT DISTINCT col6
FROM mytable
) mto
JOIN mytable mt
ON mt.id =
(
SELECT TOP 1 id
FROM mytable mti
WHERE mti.col6 = mto.col6
-- ORDER BY
-- id
-- Uncomment the lines above if the order matters
)
Update:
Check your database version and compatibility level:
SELECT @@VERSION
SELECT COMPATIBILITY_LEVEL
FROM sys.databases
WHERE name = DB_NAME()
The key word "DISTINCT" in SQL has the meaning of "unique value". When applied to a column in a query it will return as many rows from the result set as there are unique, different values for that column. As a consequence it creates a grouped result set, and values of other columns are random unless defined by other functions (such as max, min, average, etc.)
If you meant to say you want to return all rows for which Col006 has a specific value, then use the "where Col006 = value" clause.
If you meant to say you want to return all rows for which Col006 is different from all other values of Col006, then you still need to specify what that value is => see above.
If you want to say that the value of Col006 can only be evaluated once all rows have been retrieved, then use the "having Col006 = value" clause. This has the same effect as the "where" clause, but "where" gets applied when rows are retrieved from the raw tables, whereas "having" is applied once all other calculations have been made (i.e. aggregation functions have been run etc.) and just before the result set is returned to the user.
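The where/having distinction can be illustrated with a small sketch using Python's sqlite3 module. The amount column and its values are invented for the example; only Col006 and the table name come from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TheTable (Col006 TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO TheTable VALUES (?, ?)",
    [("item1", 5), ("item1", 20), ("item2", 8), ("item1", 1)],
)

# WHERE filters individual rows before grouping: the row with amount = 1
# is dropped first, then the remaining rows are summed per group.
where_result = conn.execute("""
    SELECT Col006, SUM(amount) FROM TheTable
    WHERE amount > 4
    GROUP BY Col006 ORDER BY Col006
""").fetchall()

# HAVING filters whole groups after aggregation: every row is summed,
# then groups whose total is 10 or less are dropped.
having_result = conn.execute("""
    SELECT Col006, SUM(amount) FROM TheTable
    GROUP BY Col006
    HAVING SUM(amount) > 10
    ORDER BY Col006
""").fetchall()

print(where_result)   # [('item1', 25), ('item2', 8)]
print(having_result)  # [('item1', 26)]
```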
UPDATE:
After having seen your edit, I have to point out that if you use any of the other suggestions, you will end up with random values in all other 9 columns for the row that contains the value "item1" in Col006, due to the constraint further up in my post.
You can group on Col006 to get the distinct values, but then you have to decide what to do with the multiple records in each group.
You can use aggregates to pick a value from the records. Example:
select Col006, min(Col001), max(Col002)
from TheTable
group by Col006
order by Col006
If you want the values to come from a specific record in each group, you have to identify it somehow. Example of using Col002 to identify the record in each group:
select t.Col006, t.Col001, t.Col002
from TheTable t
inner join (
select Col006, min(Col002) as Col002
from TheTable
group by Col006
) x on t.Col006 = x.Col006 and t.Col002 = x.Col002
order by t.Col006
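Here is a quick way to check the join-to-aggregate pattern, sketched with Python's sqlite3 module. The sample values are made up; the point is that the row with the smallest Col002 in each Col006 group is the one returned, with its other columns intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TheTable (Col006 TEXT, Col001 TEXT, Col002 INTEGER)")
conn.executemany(
    "INSERT INTO TheTable VALUES (?, ?, ?)",
    [("item1", "a", 3), ("item1", "b", 1), ("item2", "c", 2), ("item1", "d", 5)],
)

# The derived table x finds the minimum Col002 per group; joining back on
# both columns pulls the full row that owns that minimum.
per_group = conn.execute("""
    SELECT t.Col006, t.Col001, t.Col002
    FROM TheTable t
    INNER JOIN (
        SELECT Col006, MIN(Col002) AS Col002
        FROM TheTable
        GROUP BY Col006
    ) x ON t.Col006 = x.Col006 AND t.Col002 = x.Col002
    ORDER BY t.Col006
""").fetchall()

print(per_group)  # [('item1', 'b', 1), ('item2', 'c', 2)]
```

Note that if two rows in a group share the same minimum Col002, both are returned, so the identifying column should be unique within each group.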
SELECT *
FROM (SELECT DISTINCT YourDistinctField FROM YourTable) AS A
CROSS APPLY
( SELECT TOP 1 * FROM YourTable B
WHERE B.YourDistinctField = A.YourDistinctField ) AS NewTableName
I tried the answers posted above with no luck... but this does the trick!
select * from yourTable where column6 in (select distinct column6 from yourTable);
SELECT *
FROM harvest
GROUP BY estimated_total;
You can use GROUP BY and MIN() to get more specific result.
Let's say that you have id as the primary key.
We want all the DISTINCT values of a column, say estimated_total, and we also need one complete sample row for each distinct value. The following query should do the trick.
SELECT *, min(id)
FROM harvest
GROUP BY estimated_total;
create table #temp
(C1 TINYINT,
C2 TINYINT,
C3 TINYINT,
C4 TINYINT,
C5 TINYINT,
C6 TINYINT)
INSERT INTO #temp
SELECT 1,1,1,1,1,6
UNION ALL SELECT 1,1,1,1,1,6
UNION ALL SELECT 3,1,1,1,1,3
UNION ALL SELECT 4,2,1,1,1,6
SELECT * FROM #temp
SELECT *
FROM(
SELECT ROW_NUMBER() OVER (PARTITION BY C6 Order by C1) ID,* FROM #temp
)T
WHERE ID = 1