I'm using SQL Server and I have a table like this:
id   size     reign
1    large    Riyadh
1    small    Riyadh
2    large    Makkah
2    medium   Makkah
2    small    Jeddah
3    medium   Dammam
I want a query that returns only one size for each id and reign.
For example, for id "1", I want to remove the second value ("small").
Note: I can't alter or make changes to the table.
The result should be like this:
id   size     reign
1    large    Riyadh
2    large    Makkah
2    small    Jeddah
3    medium   Dammam
Assuming you want to prioritize large, medium, and small, in this order, we can try using ROW_NUMBER as follows:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id, reign
                                 ORDER BY CASE size WHEN 'large'  THEN 1
                                                    WHEN 'medium' THEN 2
                                                    WHEN 'small'  THEN 3 END) rn
    FROM yourTable
)
SELECT id, size, reign
FROM cte
WHERE rn = 1
ORDER BY id;
If your data is always in this simple form (meaning there are only those three sizes) and you always want large first if present, then medium, and small last, this can be done with just MIN and GROUP BY plus a suitable ORDER BY clause. This works because 'large', 'medium' and 'small' happen to sort alphabetically in that same priority order:
SELECT id, MIN(size) AS size, reign
FROM yourtable
GROUP BY id, reign
ORDER BY id, size;
This query will produce exactly the result shown in your question.
Verify this here: db<>fiddle
If this logic is not sufficient to meet your requirements, please edit your question and explain in more detail what you need.
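The MIN shortcut above works only because the three labels happen to sort alphabetically in the desired priority order. As a rough sketch of how the same GROUP BY idea might be kept if the labels did not sort that way, the sizes can be mapped to numbers before aggregating and mapped back afterwards. The numeric ranks 1 to 3 are an assumption made for illustration, and yourTable is the placeholder name used above.

```sql
-- Sketch only: rank each size explicitly instead of relying on alphabetical order.
-- The numeric ranks (1 = highest priority) are illustrative assumptions.
SELECT id,
       CASE MIN(CASE size WHEN 'large'  THEN 1
                          WHEN 'medium' THEN 2
                          WHEN 'small'  THEN 3 END)
            WHEN 1 THEN 'large'
            WHEN 2 THEN 'medium'
            WHEN 3 THEN 'small'
       END AS size,
       reign
FROM yourTable
GROUP BY id, reign
ORDER BY id, size;
```

Changing the priority then only means changing the numbers in the inner CASE, at the cost of a longer query than the plain MIN(size) version.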
Related
I have a table with id and score. I want to create a new data set with a sampling method: order the ids by decreasing score and sample every 3rd id, starting from the beginning, until we get 10k positive samples. We would like to do the same in the other direction, starting from the end, to get 10k negative samples.
id    score
24    0.55
58    0.43
987   0.93
How can I write a SQL query to execute this sampling and get the expected output?
To start with, it would be more straightforward to answer this if you stated which database you use (SQL Server, MySQL, etc). Different SQL dialects have different syntax.
BACKGROUND
To answer this question, the main tools you need are the ability to sort, and an ability to take every 3rd row.
I'm using SQL Server here, so sorting includes
TOP modifier in SELECT statements - in other databases it's often LIMIT (e.g., SELECT * FROM Test LIMIT 1000)
ROW_NUMBER() which I believe is relatively common
To get every third row, I use the 'modulus' mathematical function - in SQL Server signified by a % symbol - so, for example
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
APPROACH
There is an example of this in this db<>fiddle - but note that it is only dealing with test data (1000 rows, selecting top and bottom 10).
Running through the steps - and assuming your data is stored in #DataTable:
The following command assigns a row number rn to the data, sorted by the score.
SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable;
To get every third value, start with that data and take every third value (e.g., where the row number is a multiple of 3)
SELECT *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable) AS Ranked
WHERE rn % 3 = 0;
To get the first 10,000 of them, use TOP (or LIMIT, etc)
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable) AS Ranked
WHERE rn % 3 = 0
ORDER BY rn;
Note - to get it the other way/get the highest scores, take the ROW_NUMBER() in reverse order (e.g., ORDER BY score DESC, id DESC).
FINAL ANSWER
Take the above 10,000 rows, do the same in the other direction (e.g., to get the highest scores), then UNION them together. Below it is done with CTEs.
WITH TopScores AS
(SELECT TOP 10000 id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) as rn
FROM #DataTable
) AS RankedScores_down
WHERE RankedScores_down.rn % 3 = 0
ORDER BY RankedScores_down.rn
),
LowScores AS
(SELECT TOP 10000 id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable
) AS RankedScores_up
WHERE RankedScores_up.rn % 3 = 0
ORDER BY RankedScores_up.rn
)
SELECT * FROM TopScores
UNION
SELECT * FROM LowScores
ORDER BY score, id;
Notes
I used 'UNION' rather than 'UNION ALL' because, if there is any overlap (e.g., you have fewer than 60,000 datapoints), we only want to include each sample once.
If you use a different database, you'll need to translate this! This is the benefit of specifying the database you use.
Note that taking every third value (when sorted by score) is not really 'independent' sampling - one would ask why you don't just use all of the top/bottom 30,000 scores. If you want to sample 1 in 3 of them, you could use id % 3 instead of rn % 3. But once again, why would you do this? Why not just collect fewer results and use them all?
Of course, one good reason is to use half the data to check the validity of stats e.g., take half the data, do your model - then check against the other half how good your model is.
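As an illustration of the translation mentioned in the notes, the low-score half might look roughly like this in a database that uses LIMIT instead of TOP (for example PostgreSQL or MySQL 8+). DataTable is an assumed ordinary table standing in for #DataTable, since the # temp-table prefix is specific to SQL Server.

```sql
-- Sketch for LIMIT-style databases; DataTable is a hypothetical stand-in for #DataTable.
SELECT id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) AS rn
      FROM DataTable
     ) AS RankedScores_up
WHERE rn % 3 = 0      -- keep every third row
ORDER BY rn
LIMIT 10000;          -- replaces SELECT TOP 10000
```

The high-score half would follow the same pattern with ORDER BY score DESC, id DESC inside ROW_NUMBER().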
I have a sample table named assets which looks like this:
id   name     block_no
1    asset1   2
2    asset2   2
3    asset3   3
There can be any number of assets in a specific block. I need a minimum of 100 rows from the table, always including every row of each block_no that is returned. For example, if there are 95 rows in block_no 2 and around 20 in block_no 3, I need all 20 of block_no 3, as if I were fetching data in packets based on block_no.
Is this possible and feasible?
Postgres 13 or later
There is a dead simple solution using WITH TIES in Postgres 13 or later:
SELECT *
FROM assets
WHERE block_no >= 2 -- your starting block
ORDER BY block_no
FETCH FIRST 100 ROWS WITH TIES;
This will return at least 100 rows (if enough qualify), plus all peers of the 100th row.
If your table isn't trivially small, an index on (block_no) is essential for performance.
See:
Get top row(s) with highest value, with ties
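The index recommended above could be created along these lines. This is a minimal sketch: the index name assets_block_no_idx is an arbitrary choice, and a default b-tree index is assumed to be sufficient for the ORDER BY and the range filter on block_no.

```sql
-- Sketch: plain b-tree index supporting ORDER BY block_no and the block_no >= ... filter.
-- The index name is an arbitrary, hypothetical choice.
CREATE INDEX assets_block_no_idx ON assets (block_no);
```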
Older versions
Use the window function rank() in a subquery:
SELECT (a).*
FROM (
SELECT a, rank() OVER (ORDER BY block_no) AS rnk
FROM assets a
) sub
WHERE rnk <= 100;
Same result.
I use a little trick with the row type to strip the added rnk from the result. That's an optional addition.
See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
I have some data like this as shown below:
Acc_Id || Row_No
1 1
2 1
2 2
2 3
3 1
3 2
3 3
3 4
and I need a query to get the results as shown below:
Acc_Id || Row_No
1 1
2 3
3 4
Please consider that I'm a beginner in SQL.
I assume you want the Count of the row
SELECT Acc_Id, COUNT(*)
FROM Table
GROUP BY Acc_Id
Try this:
select Acc_Id, MAX(Row_No)
from table
group by Acc_Id
As a beginner, this is your first exposure to aggregation and grouping. You may want to look at the documentation on GROUP BY now that this problem has motivated your interest in a solution. Grouping operates by looking at rows with common values in the columns you specify and collapsing them into a single row that represents the group. In your case, the values in Acc_Id are the names of your groups.
The other answers are both correct, in that the final two columns are going to be equivalent with your data.
select Acc_Id, count(*), max(Row_No)
from T
group by Acc_Id;
If you have gaps in the numbering then they won't be the same. You'll have to decide whether you're actually looking for a count of rows or a maximum of a value within a column. At this point you can also consider a number of other aggregate functions that will be useful to you in the future. (Note that the actual values here are pretty much meaningless in this context.)
select Acc_Id, min(Row_No), sum(Row_No), avg(Row_No)
from T
group by Acc_Id;
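To make the earlier point about gaps concrete, here is a small sketch with made-up data (the rows below are invented for illustration and are not from the question), where Row_No skips from 2 to 5 for Acc_Id 2.

```sql
-- Made-up sample data: Acc_Id 2 has a gap in Row_No (1, 2, 5).
CREATE TABLE T (Acc_Id INT, Row_No INT);
INSERT INTO T (Acc_Id, Row_No) VALUES (1, 1), (2, 1), (2, 2), (2, 5);

SELECT Acc_Id, COUNT(*) AS row_count, MAX(Row_No) AS max_row_no
FROM T
GROUP BY Acc_Id
ORDER BY Acc_Id;
-- Acc_Id 1 -> row_count 1, max_row_no 1
-- Acc_Id 2 -> row_count 3, max_row_no 5
```

With the gap, COUNT(*) and MAX(Row_No) disagree for Acc_Id 2, which is why it matters which of the two you actually want.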
Let's say I have this table
Table name: Traffic
Seq. Type Amount
1 in 10
2 out 30
3 in 50
4 out 70
What I need is to get the previous smaller and next larger amount of a value. So, if I have 40 as a value, I will get...
Table name: Traffic
Seq. Type Amount
2 out 30
3 in 50
I already tried doing it with MySQL and was quite satisfied with the results:
(select * from Traffic where
Amount < 40 order by Amount desc limit 1)
union
(select * from Traffic where
Amount > 40 order by Amount asc limit 1)
The problem arises when I try to convert it to a SQL statement that AS400 accepts. It appears that ORDER BY and FETCH (AS400 doesn't have a LIMIT clause, so we use FETCH, or does it?) are not allowed inside a select statement that is combined with UNION. I always get a "keyword not expected" error. Here is my statement:
(select seq as sequence, type as status, amount as price from Traffic where
Amount < 40 order by price asc fetch first 1 rows only)
union
(select seq as sequence, type as status, amount as price from Traffic where
Amount > 40 order by price asc fetch first 1 rows only)
Can anyone please tell me what's wrong and how it should be? Also, please share if you know other ways to achieve my desired result.
How about a CTE? From memory (no machine to test with):
with
less as (select * from traffic where amount < 40),
more as (select * from traffic where amount > 40)
select * from traffic
where seq = (select seq from less where amount = (select max(amount) from less))
or seq = (select seq from more where amount = (select min(amount) from more))
I looked at this question from possibly another point of view. I have seen other questions about date-time ranges between rows, and I thought perhaps what you might be trying to do is establish what range some value might fall in.
If working with these ranges will be a recurring theme, then you might want to create a view for it.
create or replace view traffic_ranges as
with sorted as
( select t.*
, smallint(row_number() over (order by amount)) as pos
from traffic t
)
select b.pos range_seq
, b.id beg_id
, e.id end_id
, b.typ beg_type
, e.typ end_type
, b.amount beg_amt
, e.amount end_amt
from sorted b
join sorted e on e.pos = b.pos+1
;
Once you have this view, it becomes very simple to get your answer:
select *
from traffic_ranges
where 40 between beg_amt and end_amt
Or to get only one range where the search amount happens to be an amount in your base table, you would want to pick whether to include the beginning value or ending value as part of the range, and exclude the other:
where beg_amt < 40 and end_amt >= 40
One advantage of this approach is performance. If you are finding the range for multiple values, such as a column in a table or query, then having the range view should give you significantly better performance than a query where you must aggregate all the records that are more or less than each search value.
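As a sketch of the multi-value case described above, suppose the search values live in a column of another table. The table lookups and its column search_amt are hypothetical names introduced only for this example. The join condition uses the half-open form shown earlier so that each value matches at most one range.

```sql
-- Sketch: lookups(search_amt) is a hypothetical table of values to classify.
select l.search_amt
     , r.beg_amt
     , r.end_amt
  from lookups l
  join traffic_ranges r
    on r.beg_amt <  l.search_amt
   and r.end_amt >= l.search_amt
;
```

Values below the smallest amount or above the largest amount in traffic fall outside every range and would simply be missing from this result; a left join would keep them with empty range columns.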
Here's my query using CTE and union inspired by Buck Calabro's answer. Credits go to him and WarrenT for being SQL geniuses!
I won't be accepting my own answer. That will be unfair. hehe
with
apple(seq, type, amount) as (select seq, type, amount from traffic where amount < 40
order by amount desc fetch first 1 rows only),
banana(seq, type, amount) as (select seq, type, amount from traffic where
amount > 40 order by amount asc fetch first 1 rows only)
select * from apple
union
select * from banana
It's a bit slow but I can accept that since I'll only use it once in the program.
This is just a sample. The actual query is a bit different.
This is a continuation of my previous question here.
In the following example:
id PRODUCT ID COLOUR
1 1001 GREEN
2 1002 GREEN
3 1002 RED
4 1003 RED
Given a product ID, I want to retrieve only one record - that with GREEN colour, if one exists, or the RED one otherwise. It sounds like I need to employ DISTINCT somehow, but I don't know how to supply the priority rule.
Pretty basic I'm sure, but my SQL skills are more than rusty..
Edit: Thank you everybody. One more question please: how can this be made to work with multiple records, ie. if the WHERE clause returns more than just one record? The LIMIT 1 would limit across the entire set, while what I'd want would be to limit just within each product.
For example, if I had something like SELECT * FROM table WHERE productID LIKE "1%" ... how can I retrieve each unique product, but still respecting the colour priority (GREEN>RED)?
try this:
SELECT top 1 *
FROM <table>
WHERE ProductID = <id>
ORDER BY case when colour ='GREEN' then 1
when colour ='RED' then 2 end
If you want to order it based on another color, you can give it in the case statement
SELECT *
FROM yourtable
WHERE ProductID = (your id)
ORDER BY colour
LIMIT 1
(Green will come before Red, you see. The LIMIT clause returns only one record)
For your subsequent edit, you can do this
select yourtable.*
from
yourtable
inner join
(select productid, min(colour) mincolour
from yourtable
where productid like '10%'
group by productid) v
on yourtable.productid=v.productid
and yourtable.colour=v.mincolour
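The join above relies on 'GREEN' sorting before 'RED' alphabetically. If the priority order were not alphabetical, one possible alternative, assuming a database that supports window functions, is to number the rows within each product by an explicit priority and keep the first. The table and column names follow the placeholders used above; the priority numbers are illustrative.

```sql
-- Sketch: explicit colour priority per product; requires window function support.
select id, productid, colour
from (select t.*,
             row_number() over (partition by productid
                                order by case colour when 'GREEN' then 1
                                                     when 'RED'   then 2
                                                     else 3 end) as rn
      from yourtable t
      where productid like '10%') ranked
where rn = 1
```

Adding further colours then only requires extending the CASE expression.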