PostgreSQL if query? - sql

Is there a way to select records using an if statement?
My table looks like this:
id | num | dis
1 | 4 | 0.5234333
2 | 4 | 8.2234
3 | 8 | 2.3325
4 | 8 | 1.4553
5 | 4 | 3.43324
And I want to select the num and dis where dis is the lowest number... So, a query that will produce the following results:
id | num | dis
1 | 4 | 0.5234333
4 | 8 | 1.4553

If you want all the rows with the minimum value within the group:
SELECT id, num, dis
FROM table1 T1
WHERE dis = (SELECT MIN(dis) FROM table1 T2 WHERE T1.num = T2.num)
Or you could use a join to get the same result:
SELECT T1.id, T1.num, T1.dis
FROM table1 T1
JOIN (
SELECT num, MIN(dis) AS dis
FROM table1
GROUP BY num
) T2
ON T1.num = T2.num AND T1.dis = T2.dis
If you only want a single row from each group, even if there are ties then you can use this:
SELECT id, dis, num FROM (
SELECT id, dis, num, ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis) rn
FROM table1
) T1
WHERE rn = 1
Unfortunately this won't be very efficient. If you need something more efficient then please see Quassnoi's page on selecting rows with a groupwise maximum for PostgreSQL. Here he suggests several ways to perform this query and explains the performance of each. The summary from the article is as follows:
Unlike MySQL, PostgreSQL implements several clean and documented ways to select the records holding group-wise maximums, including window functions and DISTINCT ON.
However, due to the lack of the loose index scan support by the PostgreSQL’s optimizer and the less efficient usage of indexes in PostgreSQL, the queries using these functions take too long.
To work around these problems and improve the queries against the low cardinality grouping conditions, a certain solution described in the article should be used.
This solution uses recursive CTE’s to emulate loose index scan and is very efficient if the grouping columns have low cardinality.
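As a rough illustration of the recursive CTE idea mentioned in the summary, here is a minimal sketch. It is not the article's exact query, and it runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, so it can be tried anywhere. The CTE name nums, its column grp and the index are assumptions made for the example. The CTE walks from one distinct num to the next, and the outer query fetches the row with the lowest dis for each num found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, num INTEGER, dis REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, 4, 0.5234333), (2, 4, 8.2234), (3, 8, 2.3325), (4, 8, 1.4553), (5, 4, 3.43324)],
)
# Hypothetical index: lets each MIN(num) step and per-group lookup use the index.
conn.execute("CREATE INDEX table1_num_dis ON table1 (num, dis)")

query = """
WITH RECURSIVE nums(grp) AS (
    -- start at the smallest num
    SELECT MIN(num) FROM table1
    UNION ALL
    -- jump to the next larger num; yields NULL once the groups are exhausted
    SELECT (SELECT MIN(num) FROM table1 WHERE num > grp)
    FROM nums
    WHERE grp IS NOT NULL
)
SELECT t.id, t.num, t.dis
FROM nums
JOIN table1 t ON t.id = (
    -- one row per group: lowest dis, ties broken by id
    SELECT id FROM table1 WHERE num = nums.grp ORDER BY dis, id LIMIT 1
)
WHERE nums.grp IS NOT NULL
ORDER BY t.num
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 4, 0.5234333), (4, 8, 1.4553)]
```

The number of recursion steps equals the number of distinct num values, which is why this approach only pays off when the grouping column has low cardinality.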

Use this:
SELECT DISTINCT ON (num) id, num, dis
FROM tbl
ORDER BY num, dis
Or if you intend to use other RDBMS in future, use this:
select * from tbl a where dis =
(select min(dis) from tbl b where b.num = a.num)
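Note that the two queries above differ when a group has ties: DISTINCT ON keeps one row per num, while the portable query returns every row that shares the minimum. A small sketch of that difference follows. It runs on SQLite via Python's sqlite3 module, which has no DISTINCT ON, so ROW_NUMBER() is used as a stand-in for the one-row-per-group behaviour. The sample rows, including the tie on num = 8, are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, num INTEGER, dis REAL)")
# Made-up data: rows 3 and 4 tie for the lowest dis in group 8.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, 4, 0.5), (2, 4, 8.2), (3, 8, 1.4), (4, 8, 1.4)],
)

# Portable form: every row whose dis equals the group minimum.
all_min = conn.execute("""
    SELECT id, num, dis
    FROM tbl a
    WHERE dis = (SELECT MIN(dis) FROM tbl b WHERE b.num = a.num)
    ORDER BY id
""").fetchall()

# One row per group (needs SQLite 3.25+ for window functions).
one_per_group = conn.execute("""
    SELECT id, num, dis
    FROM (
        SELECT id, num, dis,
               ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis, id) AS rn
        FROM tbl
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(all_min)        # [(1, 4, 0.5), (3, 8, 1.4), (4, 8, 1.4)]
print(one_per_group)  # [(1, 4, 0.5), (3, 8, 1.4)]
```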

If you need to have IF logic you can use PL/pgSQL.
http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html
But try to solve your issue with plain SQL first if possible, since it will be faster; use PL/pgSQL only when SQL can't solve your problem.

Related

SQL query for all possible combinations from table

I have a table as result of some calculations from SQL database and it looks like this:
[ID] [PAR1] [PAR2]
[A] [110] [0.5]
[B] [105] [1.5]
[C] [120] [2.0]
[D] [130] [3.0]
[E] [115] [5.5]
[F] [130] [6.5]
[G] [120] [7.0]
[H] [110] [7.5]
[I] [105] [8.0]
[J] [120] [9.0]
[K] [110] [9.5]
It's sorted by PAR2 - less means better result.
I need to find the best (lowest) sum of PAR2 from 3 rows, where the sum of PAR1 is at least 350. For example:
combination A+B+C gives the best sum of PAR2 (0.5+1.5+2.0=4.0), but the sum of PAR1 is 110+105+120=335 < 350, so the condition is not met and the result can't be used,
combination A+B+D gives a sum of PAR2 of 0.5+1.5+3.0=5.0, but the sum of PAR1 is 110+105+130=345 < 350, so the condition is not met and the result can't be used,
combination A+B+E gives a sum of PAR2 of 0.5+1.5+5.5=7.5, but the sum of PAR1 is 110+105+115=330 < 350, so the condition is not met and the result can't be used,
combination A+B+F gives a sum of PAR2 of 0.5+1.5+6.5=8.5, but the sum of PAR1 is 110+105+130=345 < 350, so the condition is not met and the result can't be used,
(...)
combination B+C+D gives a sum of PAR2 of 1.5+2.0+3.0=6.5, and the sum of PAR1 is 105+120+130=355 > 350, so the condition is met and we have a winner with the best result of 6.5.
It is an ASP.NET application, so I tried fetching the table from the database and computing the result in VB code-behind, but that means "manual" work with FOR..NEXT loops, which is clumsy and too slow for calculations like this.
I am wondering whether there is a smarter way to get the result directly from a SQL query. Maybe some advanced math functions? Any ideas?
Thanks in advance.
I ran some tests using forpas's solution and, yes, it works very well. But it takes too much time once I add a lot of WHERE conditions, because the original table is very large. So I will try to find a solution using temp tables in a function (not a procedure). Thank you all for your answers.
forpas, special thanks also for the example and explanation; they let me quickly understand your idea. This is master level ;)
You can use a double inner self-join like this:
select top 1 * from tablename t1
inner join tablename t2 on t2.id > t1.id
inner join tablename t3 on t3.id > t2.id
where t1.par1 + t2.par1 + t3.par1 >= 350
order by t1.par2 + t2.par2 + t3.par2
See the demo.
Results:
> ID | PAR1 | PAR2 | ID | PAR1 | PAR2 | ID | PAR1 | PAR2
> :- | ---: | :--- | :- | ---: | :--- | :- | ---: | :---
> A | 110 | 0.5 | C | 120 | 2.0 | D | 130 | 3.0
So the winner is A+C+D because:
110 + 120 + 130 = 360 >= 350
and the sum of PAR2 is
0.5 + 2.0 + 3.0 = 5.5
which is the minimum
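The result can be double-checked against the sample data with a small sketch like the one below. It runs the same double self-join on SQLite through Python's sqlite3 module, so LIMIT 1 stands in for SQL Server's TOP 1, and the column aliases sum_par1 and sum_par2 are names made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id TEXT PRIMARY KEY, par1 INTEGER, par2 REAL)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [("A", 110, 0.5), ("B", 105, 1.5), ("C", 120, 2.0), ("D", 130, 3.0),
     ("E", 115, 5.5), ("F", 130, 6.5), ("G", 120, 7.0), ("H", 110, 7.5),
     ("I", 105, 8.0), ("J", 120, 9.0), ("K", 110, 9.5)],
)

# t2.id > t1.id and t3.id > t2.id make each set of three rows appear exactly once.
best = conn.execute("""
    SELECT t1.id, t2.id, t3.id,
           t1.par1 + t2.par1 + t3.par1 AS sum_par1,
           t1.par2 + t2.par2 + t3.par2 AS sum_par2
    FROM tablename t1
    JOIN tablename t2 ON t2.id > t1.id
    JOIN tablename t3 ON t3.id > t2.id
    WHERE t1.par1 + t2.par1 + t3.par1 >= 350
    ORDER BY sum_par2
    LIMIT 1
""").fetchone()
print(best)  # ('A', 'C', 'D', 360, 5.5)
```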
Check this. I feel it's accurate or close to your requirement:
WITH CTE (ID,PAR1,PAR2)
AS
(
SELECT 'A',110,0.5 UNION ALL
SELECT 'B',105,1.5 UNION ALL
SELECT 'C',120,2.0 UNION ALL
SELECT 'D',130,3.0 UNION ALL
SELECT 'E',115,5.5 UNION ALL
SELECT 'F',130,6.5 UNION ALL
SELECT 'G',120,7.0 UNION ALL
SELECT 'H',110,7.5 UNION ALL
SELECT 'I',105,8.0 UNION ALL
SELECT 'J',120,9.0 UNION ALL
SELECT 'K',110,9.5
)
SELECT B.AID,B.BID,B.CID,SUM_P2,SUM_P1
FROM
(
SELECT * , ROW_NUMBER() OVER (PARTITION BY CHAR_SUM ORDER BY CHAR_SUM) CS
FROM
(
SELECT ASCII(A.ID) + ASCII(B.ID)+ASCII(C.ID) CHAR_SUM,
A.ID AID,B.ID BID,C.ID CID,
(A.PAR2+B.PAR2+C.PAR2) AS SUM_P2,
(A.PAR1+B.PAR1+C.PAR1) AS SUM_P1
FROM CTE A
CROSS APPLY CTE B
CROSS APPLY CTE C
WHERE A.ID <> B.ID AND A.ID <> C.ID AND B.ID <> C.ID
AND (A.PAR1+B.PAR1+C.PAR1) >= 350
) A
)B
WHERE CS = 1
You might try cross joining three copies of the table. This puts every combination of three rows on a single result row, so you can apply the required conditions and pick the minimum value (the first row of the ordered result).
select t1.ID, t2.ID, t3.ID, t1.PAR2 + t2.PAR2 + t3.PAR2
from yourTable t1
cross join
yourTable t2
cross join
yourTable t3
where t1.ID < t2.ID and t2.ID < t3.ID and
t1.PAR1 + t2.PAR1 + t3.PAR1 >= 350
order by t1.PAR2 + t2.PAR2 + t3.PAR2 ASC
While this solution should technically work, cross joining tables is not ideal performance-wise, even more when doing it multiple times. If the size of the table is going to grow over time, and you have the option to apply the calculation at code level, I think it would be advisable to do so.
Edit
Changed the where clause including Serg's suggestion

Slightly different greatest-n-per-group

I have read this comment, which explains the greatest-n-per-group problem and its solution. Unfortunately, I am facing a slightly different problem, and I am failing to find a solution for it.
Let's suppose I have a table with some basic info regarding users. Due to implementation, this info may or may not repeat itself:
+----+-------------------+----------------+---------------+
| id | user_name | user_name_hash | address |
+----+-------------------+----------------+---------------+
| 1 | peter_jhones | 0xFF321345 | Some Av |
| 2 | sally_whiterspoon | 0x98AB5454 | Certain St |
| 3 | mark_jackobson | 0x0102AB32 | Some Av |
| 4 | mark_jackobson | 0x0102AB32 | Particular St |
+----+-------------------+----------------+---------------+
As you can see, mark_jackobson appears twice, although its address is different in each appearance.
Every now and then, an ETL process queries new user_names and fetches the most recent record of each. Afterwards, it stores the user_name_hash in a table to mark that it has already imported that user_name:
+----------------+
| user_name_hash |
+----------------+
| 0xFF321345 |
| 0x98AB5454 |
+----------------+
Everything begins with the following query:
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table
This way, I am able to select the new hashes from my table. Since I need to query the most recent occurrence of a hash, I wrap it as a sub-query:
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash
Perfect! With the ids of my new users, I can query the addresses as follows:
SELECT
address,
user_name_hash
FROM my_table
WHERE Id IN (
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash)
From my perspective, the above query works, but it does not seem optimal. Reading this comment, I noticed I could query the same data using joins. Since I am failing to write the desired query, could anyone help me out and point me in the right direction?
This is the query I have attempted, without success.
SELECT
tb1.address,
tb1.user_name_hash
FROM my_table tb1
INNER JOIN my_table tb2
ON tb1.user_name_hash = tb2.user_name_hash
LEFT JOIN my_hash_table ht
ON tb1.user_name_hash = ht.user_name_hash AND tb1.id > tb2.id
WHERE ht.user_name_hash IS NULL;
Thanks in advance.
EDIT > I am working with PostgreSQL
I believe you are looking for something like this:
SELECT
address,
user_name_hash
FROM my_table t1
JOIN (
SELECT MAX(id) maxid
FROM my_table t2
WHERE NOT EXISTS (
SELECT 1
FROM my_hash_table t3
WHERE t2.user_name_hash = t3.user_name_hash
)
GROUP BY user_name_hash
) t ON t1.ID = t.maxid
I'm using NOT EXISTS instead of EXCEPT since it is more clear to the optimizer.
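To see the query above at work on the sample data from the question, here is a small sketch. It runs on SQLite via Python's sqlite3 module rather than on PostgreSQL, and it stores the hashes as plain text, which is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, user_name TEXT, "
    "user_name_hash TEXT, address TEXT)"
)
conn.execute("CREATE TABLE my_hash_table (user_name_hash TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [(1, "peter_jhones", "0xFF321345", "Some Av"),
     (2, "sally_whiterspoon", "0x98AB5454", "Certain St"),
     (3, "mark_jackobson", "0x0102AB32", "Some Av"),
     (4, "mark_jackobson", "0x0102AB32", "Particular St")],
)
# Hashes that the ETL process has already imported.
conn.executemany(
    "INSERT INTO my_hash_table VALUES (?)", [("0xFF321345",), ("0x98AB5454",)]
)

new_rows = conn.execute("""
    SELECT t1.address, t1.user_name_hash
    FROM my_table t1
    JOIN (
        -- newest id per hash, restricted to hashes not imported yet
        SELECT MAX(id) AS maxid
        FROM my_table t2
        WHERE NOT EXISTS (
            SELECT 1 FROM my_hash_table t3
            WHERE t2.user_name_hash = t3.user_name_hash
        )
        GROUP BY user_name_hash
    ) t ON t1.id = t.maxid
""").fetchall()
print(new_rows)  # [('Particular St', '0x0102AB32')]
```

Only the most recent mark_jackobson row (id 4) comes back, since the other two hashes are already in my_hash_table.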
You can get a better performance using a left outer join (to get the newest records not already imported) and then compute the max id for these records (subquery in the HAVING clause).
SELECT t1.address,
t1.user_name_hash,
MAX(id) AS maxid
FROM my_table t1
LEFT JOIN my_hash_table th ON t1.user_name_hash = th.user_name_hash
WHERE th.user_name_hash IS NULL
GROUP BY t1.address,
t1.user_name_hash
HAVING MAX(t1.id) = (SELECT MAX(t2.id)
FROM my_table t2
WHERE t2.user_name_hash = t1.user_name_hash)

Hive QL Difference between two closest elements in a column

Let's say I have a very simple table like this:
ID: Integer
A 4
A 9
A 2
B 4
B 7
B 3
And I want to groupBy(ID). What would be an appropriate query that tells me the minimum difference - like this
ID: MIN_DIF:
A 2
B 1
Simplicity of the query right now is more important than efficiency, but both the most basic and the most efficient query would be appreciated.
Sidenote: Finding the average distance would be a bonus, but I need min first
You can use lag() or lead():
select id, min(int - prev_int)
from (select t.*, lag(int) over (partition by id order by int) as prev_int
from t
) t
where prev_int is not null
group by id;
An alternative method, which avoids window functions but would probably have much worse performance, is:
select t.id, min(t2.integer - t.integer)
from t join
t t2
on t.id = t2.id
where t2.integer > t.integer
group by t.id;
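A quick way to check the lag() approach against the sample data is sketched below. It runs on SQLite via Python's sqlite3 module instead of Hive, and the value column is called val here (a made-up name, since the question only labels it "Integer"). AVG is included because the question mentions the average distance as a bonus.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("A", 4), ("A", 9), ("A", 2), ("B", 4), ("B", 7), ("B", 3)],
)

# After sorting within each id, the closest pair is always adjacent,
# so comparing each value with its predecessor is enough.
result = conn.execute("""
    SELECT id,
           MIN(val - prev_val) AS min_dif,
           AVG(val - prev_val) AS avg_dif
    FROM (
        SELECT id, val,
               LAG(val) OVER (PARTITION BY id ORDER BY val) AS prev_val
        FROM t
    ) AS d
    WHERE prev_val IS NOT NULL
    GROUP BY id
    ORDER BY id
""").fetchall()
print(result)  # [('A', 2, 3.5), ('B', 1, 2.0)]
```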

How to efficiently get a value from the last row in bulk on SQL Server

I have a table like so
Id | Type | Value
--------------------
0 | Big | 2
1 | Big | 3
2 | Small | 3
3 | Small | 3
I would like to get a table like this
Type | Last Value
--------------------
Small | 3
Big | 3
How can I do this? I understand there is a SQL Server function called LAST_VALUE(...) OVER (...), but I can't get it to work with GROUP BY.
I've also tried using SELECT MAX(ID) & SELECT TOP 1.. but this seems a bit inefficient since there would be a subquery for each value. The queries take too long when the table has a few million rows in it.
Is there a way to quickly get the last value for these, perhaps using LAST_VALUE?
You can do it using row_number():
select
type,
value
from
(
select
type,
value,
row_number() over (partition by type order by id desc) as RN
from your_table -- substitute the actual table name
) TMP
where RN = 1
Can't test this now since SQL Fiddle doesn't seem to work, but hopefully that's ok.
The most efficient method might be not exists, which uses an anti-join for the underlying operator:
select type, value
from likeso l
where not exists (select 1 from likeso l2 where l2.type = l.type and l2.id > l.id)
For performance, you want an index on likeso(type, id).
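Here is a small sketch of the not exists query and the suggested index on the sample data. It runs on SQLite via Python's sqlite3 module rather than SQL Server, the index name is made up, and id is added to the select list only to make it visible which row was picked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE likeso (id INTEGER PRIMARY KEY, type TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO likeso VALUES (?, ?, ?)",
    [(0, "Big", 2), (1, "Big", 3), (2, "Small", 3), (3, "Small", 3)],
)
# Index suggested in the answer; the name is hypothetical.
conn.execute("CREATE INDEX likeso_type_id ON likeso (type, id)")

# Keep a row only if no row of the same type has a larger id.
last_rows = conn.execute("""
    SELECT l.type, l.id, l.value
    FROM likeso l
    WHERE NOT EXISTS (
        SELECT 1 FROM likeso l2
        WHERE l2.type = l.type AND l2.id > l.id
    )
    ORDER BY l.type
""").fetchall()
print(last_rows)  # [('Big', 1, 3), ('Small', 3, 3)]
```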
I wonder whether there is a more efficient solution, but I use the following query for such needs:
Select Id, Type, Value
From ( Select *, Max (Id) Over (Partition By Type) As LastId
From #Table) T
Where Id = LastId

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large; the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but when I use his solution 600 rows are returned. I checked the data and there were in fact duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for PostgreSQL. Doesn't T-SQL have something similar? Trying to google parentheses isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A select distinct will be based on all columns so it does not guarantee the first two to be distinct
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
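As a sketch of how the GROUP BY approach collapses duplicates, the example below uses a made-up table evaluator with shortened column names and invented rows, run on SQLite via Python's sqlite3 module. Two rows share the same first two key columns and end up as one row with the constant '11' in the third column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, loosely modelled on the question.
conn.execute(
    "CREATE TABLE evaluator (emp_no INTEGER, unit_cde TEXT, area TEXT, "
    "create_dte TEXT, modify_dte TEXT)"
)
conn.executemany(
    "INSERT INTO evaluator VALUES (?, ?, ?, ?, ?)",
    [(1, "U1", "05", "2020-01-01", "2020-02-01"),
     (1, "U1", "07", "2020-03-01", "2020-04-01"),  # same emp_no and unit_cde as above
     (2, "U1", "05", "2020-01-15", "2020-01-20")],
)

# One row per (emp_no, unit_cde); MAX picks a date arbitrarily but deterministically.
deduped = conn.execute("""
    SELECT emp_no, unit_cde, '11' AS area, MAX(create_dte), MAX(modify_dte)
    FROM evaluator
    GROUP BY emp_no, unit_cde
    ORDER BY emp_no, unit_cde
""").fetchall()
print(deduped)
# [(1, 'U1', '11', '2020-03-01', '2020-04-01'), (2, 'U1', '11', '2020-01-15', '2020-01-20')]
```

Because each MAX is computed independently, the dates in one output row may come from different source rows, which is fine when it doesn't matter which dates are kept.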
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields, but if the data differs between the other columns you might have to try a more brute-force approach.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
-- number the rows within each (emp_no, eec_planning_unit_cde) pair
SELECT ROW_NUMBER() OVER(PARTITION BY emp_no, eec_planning_unit_cde ORDER BY create_dte) rownumber,
emp_no,
eec_planning_unit_cde,
'11' as area,
create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join doesn't compute an intersection. Let's suppose we have these tables:
T1        T2
n  s      n  s
1  A      2  X
2  B      2  Y
2  C
3  D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
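The difference in row counts can be reproduced with the two small tables above. The sketch below runs on SQLite via Python's sqlite3 module and also includes an EXISTS form, which returns the same rows as the IN query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (n INTEGER, s TEXT)")
conn.execute("CREATE TABLE t2 (n INTEGER, s TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, "A"), (2, "B"), (2, "C"), (3, "D")])
conn.executemany("INSERT INTO t2 VALUES (?, ?)", [(2, "X"), (2, "Y")])

# Inner join: every matching pair of rows, so each t1 row repeats per match in t2.
joined = conn.execute("""
    SELECT t1.n, t1.s, t2.s
    FROM t1 JOIN t2 ON t1.n = t2.n
    ORDER BY t1.s, t2.s
""").fetchall()

# IN / EXISTS: each t1 row at most once, however many matches t2 has.
with_in = conn.execute(
    "SELECT n, s FROM t1 WHERE n IN (SELECT n FROM t2) ORDER BY s"
).fetchall()
with_exists = conn.execute("""
    SELECT n, s FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.n = t1.n)
    ORDER BY s
""").fetchall()

print(joined)       # [(2, 'B', 'X'), (2, 'B', 'Y'), (2, 'C', 'X'), (2, 'C', 'Y')]
print(with_in)      # [(2, 'B'), (2, 'C')]
print(with_exists)  # [(2, 'B'), (2, 'C')]
```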
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, by using @JTC's second query.