SQL select minimum date from same column - sql

I'm trying to write a query based on accounts and their contracts. The table has all contracts for each account, whether the contract is active, expired, etc. I want the query to bring back only the contract with the earliest start date per account, so only one row for each account. However, I don't know the status of the earliest contract for each account. Some might be active, some might be pending. The problem I run into now is that it brings back multiple records for each account if the contract status is in the list I specify. Simple sample code below:
Select t.account, t.contract, t.status, Min(t.start_date)
From table t
where t.status in ('Active','Countersigned','Pending')

If your database supports it (e.g. Oracle, Postgres, SQL Server, and also MySQL 8.0+ and SQLite 3.25+, but not older versions of those two), you can use window functions. For instance, you can rank the contracts within each account by starting_at:
SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts
Then you can use that in a subquery to join to accounts and only take contracts with a rank of 1. You'll need to put it in a subquery because window functions are not allowed inside WHERE (in Postgres at least), and neither are column aliases defined in the same SELECT list. So this won't work:
SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts
WHERE rank = 1
but this will:
SELECT *
FROM (SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts) x
WHERE rank = 1
Note you can easily add filtering by status, etc. to any of these queries.
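As a rough illustration of that last note, here is a minimal sketch that runs the ranked-subquery pattern with the status filter added. It uses Python's built-in sqlite3 module as a stand-in database (window functions need SQLite 3.25 or newer); the sample rows and the alias rnk are invented for the example.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (account_id INTEGER, contract TEXT, status TEXT, starting_at TEXT)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        (1, "c1", "Expired", "2019-01-01"),
        (1, "c2", "Active", "2020-03-01"),
        (1, "c3", "Pending", "2021-06-01"),
        (2, "c4", "Pending", "2020-05-01"),
        (2, "c5", "Active", "2020-09-01"),
    ],
)

# The status filter sits inside the subquery, so only eligible rows are ranked.
query = """
SELECT account_id, contract, status, starting_at
FROM (SELECT c.*,
             rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rnk
      FROM contracts c
      WHERE status IN ('Active', 'Countersigned', 'Pending')) x
WHERE rnk = 1
ORDER BY account_id
"""
earliest = conn.execute(query).fetchall()
print(earliest)
# [(1, 'c2', 'Active', '2020-03-01'), (2, 'c4', 'Pending', '2020-05-01')]
```

Account 1's expired contract c1 is older, but it is filtered out before ranking, so c2 becomes the earliest eligible contract.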

This should work:
select account, contract, status, MinDate
from
(
Select t.account, t.contract, t.status, t.start_date,
Min(t.start_date) over(partition by t.account) MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
) x
where start_date=MinDate

Here is a solution that works as long as no account has several contracts sharing the same MIN(date). If one does, you get multiple rows for that account, and you have to decide which of those N contracts you want to see; I can't decide that for you.
SELECT t.*
FROM (
Select t.account, Min(t.start_date) AS MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
GROUP BY t.account
) AS t2
INNER JOIN table t ON t.account = t2.account AND t.start_date = t2.MinDate
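To make the tie caveat concrete, here is a small sketch of the same GROUP BY plus join pattern, run through Python's sqlite3 module as a stand-in database. The table name contracts and the sample rows are invented. The sketch also repeats the status filter on the outer table, which is my addition: without it, an ineligible contract that happens to share the minimum date would be joined in as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (account TEXT, contract TEXT, status TEXT, start_date TEXT)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        ("A", "c0", "Expired", "2020-01-01"),  # ineligible, but same date as c1/c2
        ("A", "c1", "Active", "2020-01-01"),
        ("A", "c2", "Pending", "2020-01-01"),  # ties with c1 on start_date
        ("A", "c3", "Active", "2021-01-01"),
        ("B", "c4", "Pending", "2020-02-01"),
    ],
)

query = """
SELECT t.account, t.contract
FROM (SELECT account, MIN(start_date) AS MinDate
      FROM contracts
      WHERE status IN ('Active', 'Countersigned', 'Pending')
      GROUP BY account) AS t2
INNER JOIN contracts t ON t.account = t2.account AND t.start_date = t2.MinDate
-- Repeat the filter so the Expired contract on the same date is not returned.
WHERE t.status IN ('Active', 'Countersigned', 'Pending')
ORDER BY t.account, t.contract
"""
rows = conn.execute(query).fetchall()
print(rows)
# [('A', 'c1'), ('A', 'c2'), ('B', 'c4')]
```

Account A comes back twice because c1 and c2 start on the same day; picking one of them needs an extra rule, such as the lowest contract id.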

Related

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads to solve my question. I have three columns in a table, columns being name, customer, and location. I'd like to add an additional column determining which location is most frequent, based off name and customer (first two columns).
I have included a photo of an example where name-Jane customer-BEC in my created column would be "Texas", as that has 2 occurrences as opposed to one for California. Would there be any way to implement this?
If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
(select t2.location
from table1 t2
where t2.name = t.name
group by name, location
order by count(*) desc
fetch first 1 row only
) as most_frequent_location
from table1 t ;
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
max(location) keep (dense_rank first order by location_count desc) over (partition by name) most_frequent_location
from (select t.*,
count(*) over (partition by name, customer, location) as location_count
from table1 t
) t;
Here is a db<>fiddle.
Both of these versions put 'Texas' in all four rows. However, each can be tweaked with minimal effort to put 'California' in the row for ARC.
In Oracle, you can use the aggregate function stats_mode() to compute the most frequently occurring value in a group.
Unfortunately it is not implemented as a window function. So one option uses an aggregate subquery, and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
select name, customer, stats_mode(location) top_location
from mytable
group by name, customer
) s on s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
t.*,
(
select stats_mode(t1.location)
from mytable t1
where t1.name = t.name and t1.customer = t.customer
) top_location
from mytable t
This is more a question about understanding the concepts of a relational database. If you want that information, you would not put it in an additional column. It is data calculated over multiple columns, so why would you store it in the table itself? It is complex to code, and it would also be very expensive for the database (imagine all the rows you would have to calculate that value for if someone inserted a million rows).
Instead you can do one of the following:
Calculate it at runtime, as shown in the other answers.
If you want to make it more persistent, you could embed that query in a view.
If you want to physically store the info, you could use a materialized view.
There is plenty of documentation on those three options in the official Oracle documentation.
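For the view option, a minimal sketch of the idea follows. It uses Python's sqlite3 module as a stand-in for Oracle (SQLite has plain views but no materialized views); the view name location_counts and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, customer TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "California"),
        ("John", "ARC", "Ohio"),
    ],
)

# A plain view stores only the query text; the counts are recomputed on each read.
conn.execute("""
CREATE VIEW location_counts AS
SELECT name, customer, location, COUNT(*) AS location_count
FROM table1
GROUP BY name, customer, location
""")

lookup = """
SELECT location_count FROM location_counts
WHERE name = 'Jane' AND customer = 'BEC' AND location = 'California'
"""
before = conn.execute(lookup).fetchone()[0]
conn.execute("INSERT INTO table1 VALUES ('Jane', 'BEC', 'California')")
after = conn.execute(lookup).fetchone()[0]
print(before, after)  # 1 2
```

Nothing had to be updated after the insert: the view reflects the new row on the next read, which is the point of not storing the derived value in the table.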
Your first step is to construct a query that determines the most frequent location, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful on its own, but the counts can drive row_number(), which numbers the rows within each partition. In the query below, I order by count(*) in descending order so that the most frequent location gets the value 1.
Note that row_number() assigns 1 to exactly one row per partition, so if two locations are tied for most frequent, only one of them is picked.
So, now we have
select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location most_freq_location
from table1 tab
inner join
(select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1
group by Name, Customer, Location) tb_
on tb_.Name = tab.Name
and tb_.Customer = tab.Customer
and freq_name_cust = 1
You can see how it all works in this Fiddle where I deliberately inserted rows with the same frequency for California and Texas for one of the customers for illustration purposes.
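Since the fiddle is not reproduced here, the following sketch runs the same idea end to end with Python's sqlite3 module (SQLite 3.25+ for window functions). The sample rows are my guess at the data in the question, and I made two changes that are assumptions of mine rather than part of the answer: the count is computed in an inner subquery before row_number() is applied, and location is added as a tie-breaker so the result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, customer TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "California"),
        ("Jane", "ARC", "California"),
    ],
)

query = """
SELECT tab.name, tab.customer, tab.location, tb_.location AS most_freq_location
FROM table1 tab
INNER JOIN (
    -- Number the locations of each (name, customer) pair, most frequent first.
    SELECT name, customer, location,
           row_number() OVER (PARTITION BY name, customer
                              ORDER BY cnt DESC, location) AS freq_name_cust
    FROM (SELECT name, customer, location, COUNT(*) AS cnt
          FROM table1
          GROUP BY name, customer, location) counts
) tb_
   ON tb_.name = tab.name
  AND tb_.customer = tab.customer
  AND tb_.freq_name_cust = 1
ORDER BY tab.customer, tab.location
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With this data, every BEC row gets 'Texas' and the single ARC row gets 'California'.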

SQL How to select customers with highest transaction amount by state

I am trying to write a SQL query that returns the name and purchase amount of the five customers in each state who have spent the most money.
Table schemas
customers
|_state
|_customer_id
|_customer_name
transactions
|_customer_id
|_transact_amt
Attempts look something like this
SELECT state, Sum(transact_amt) AS HighestSum
FROM (
SELECT name, transactions.transact_amt, SUM(transactions.transact_amt) AS HighestSum
FROM customers
INNER JOIN customers ON transactions.customer_id = customers.customer_id
GROUP BY state
) Q
GROUP BY transact_amt
ORDER BY HighestSum
I'm lost. Thank you.
Expected results are the names of customers with the top 5 highest transactions in each state.
ERROR: table name "customers" specified more than once
SQL state: 42712
First, you need for your JOIN to be correct. Second, you want to use window functions:
SELECT ct.*
FROM (SELECT c.customer_id, c.customer_name, c.state, SUM(t.transact_amt) AS total,
ROW_NUMBER() OVER (PARTITION BY c.state ORDER BY SUM(t.transact_amt) DESC) as seqnum
FROM customers c JOIN
transactions t
ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name, c.state
) ct
WHERE seqnum <= 5;
You seem to have several issues with SQL. I would start with understanding aggregation functions. You have a SUM() with the alias HighestSum. It is simply the total per customer.
You can get them using aggregation and then by using the RANK() window function. For example:
select
state,
rk,
customer_name
from (
select
*,
rank() over(partition by state order by total desc) as rk
from (
select
c.customer_id,
c.customer_name,
c.state,
sum(t.transact_amt) as total
from customers c
join transactions t on t.customer_id = c.customer_id
group by c.customer_id, c.customer_name, c.state
) x
) y
where rk <= 5
order by state, rk
There are two valid answers already. Here's a third:
SELECT *
FROM (
SELECT c.state, c.customer_name, t.*
, row_number() OVER (PARTITION BY c.state ORDER BY t.transact_sum DESC NULLS LAST, customer_id) AS rn
FROM (
SELECT customer_id, sum(transact_amt) AS transact_sum
FROM transactions
GROUP BY customer_id
) t
JOIN customers c USING (customer_id)
) sub
WHERE rn < 6
ORDER BY state, rn;
Major points
When aggregating all or most rows of a big table, it's typically substantially faster to aggregate before the join. Assuming referential integrity (FK constraints), we won't be aggregating rows that would be filtered otherwise. This might change from nice-to-have to a pure necessity when joining to more aggregated tables. Related:
Why does the following join increase the query time significantly?
Two SQL LEFT JOINS produce incorrect result
Add additional ORDER BY item(s) in the window function to define which rows to pick from ties. In my example, it's simply customer_id. If you have no tiebreaker, results are arbitrary in case of a tie, which may be OK. But every other execution might return different results, which typically is a problem. Or you include all ties in the result. Then we are back to rank() instead of row_number(). See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
As long as transact_amt can be NULL (this has not been ruled out), any sum may end up being NULL as well. With an unsuspecting ORDER BY t.transact_sum DESC, those customers come out on top, because NULL sorts first in descending order in PostgreSQL. Use DESC NULLS LAST to avoid this pitfall. (Or define the column transact_amt as NOT NULL.)
PostgreSQL sort by datetime asc, null first?
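The first two points (aggregate before the join, and add a tie-breaker) can be tried out with a small sketch like the one below. It uses Python's sqlite3 module rather than PostgreSQL, keeps the top 2 instead of the top 5 to stay short, and the sample rows are invented. NULLS LAST is left out because the sample data has no NULL amounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (state TEXT, customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, transact_amt INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("TX", 1, "Ann"), ("TX", 2, "Bob"), ("TX", 3, "Cy"), ("CA", 4, "Dee")],
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 100), (1, 50), (2, 150), (3, 20), (4, 10)],
)

query = """
SELECT state, customer_name, transact_sum, rn
FROM (
    SELECT c.state, c.customer_name, t.transact_sum,
           -- customer_id breaks the tie between Ann and Bob (both total 150).
           row_number() OVER (PARTITION BY c.state
                              ORDER BY t.transact_sum DESC, c.customer_id) AS rn
    FROM (SELECT customer_id, SUM(transact_amt) AS transact_sum
          FROM transactions
          GROUP BY customer_id) t          -- aggregate first, then join
    JOIN customers c ON c.customer_id = t.customer_id
) sub
WHERE rn <= 2
ORDER BY state, rn
"""
top = conn.execute(query).fetchall()
print(top)
# [('CA', 'Dee', 10, 1), ('TX', 'Ann', 150, 1), ('TX', 'Bob', 150, 2)]
```

Without customer_id in the window's ORDER BY, Ann and Bob could swap positions between runs; with rank() instead of row_number() they would both get 1.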

What is the most efficient way to find the first and last entry of an entity in SQL?

I was asked this question in an interview. A table, trips, contains the following columns (customer_id, start_from, end_at, start_at_time, end_at_time), with data structured so that each trip is stored as a separate row and a part of the table looks like this: How would you find the list of all the customers who started yesterday from point A and ended yesterday at point P?
I provided a solution using window functions that identified the list of all customers who started their day at A, and then did an inner join of that list with the customers who ended their day at P (using the same window functions).
The solution I gave was this:
SELECT a.customer_id
FROM
(SELECT a.customer_id
FROM
(SELECT customer_id,
start_from,
row_number() OVER (PARTITION BY customer_id
ORDER BY start_at_time ASC) AS rnk
FROM trips
WHERE to_date(start_at_time)= date_sub(CURRENT_DATE, 1) ) AS a
WHERE a.rnk=1
AND a.start_from='A' ) AS a
INNER JOIN
(SELECT a.customer_id
FROM
(SELECT customer_id,
end_at,
row_number() OVER (PARTITION BY customer_id
ORDER BY end_at_time DESC) AS rnk
FROM trips
WHERE to_date(end_at_time)= date_sub(CURRENT_DATE, 1) ) AS a
WHERE a.rnk=1
AND a.end_at='P' ) AS b ON a.customer_id=b.customer_id
My interviewer said my solution was correct but that there is a more efficient way to solve this problem. I've been searching and trying to find a more efficient way, but I have not found one so far. Can you suggest a more efficient way to solve this problem?
I might use first_value() for this:
select t.customer_id
from (select t.*,
first_value(start_from) over (partition by customer_id order by start_at_time) as first_start,
first_value(end_at) over (partition by customer_id order by start_at_time desc) as last_end
from t
where start_at_time >= date_sub(CURRENT_DATE, 1) and
start_at_time < CURRENT_DATE
) t
where first_start = start_from and -- just some filtering so select distinct is not needed
first_start = 'A' and
last_end = 'P';
I should add that many databases support an equivalent function for aggregation, and I would use that instead.
This assumes that starts are not repeated. To be safe, you can add select distinct, but there is a performance hit for that.
A generalized version of what I would probably have done:
SELECT fandl.a
FROM (
SELECT a, MIN(start) AS t0, MAX(start) AS tN
FROM someTable
WHERE start >= DATE_SUB(CURRENT_DATE, 1) AND start < CURRENT_DATE
GROUP BY a
) AS fandl
INNER JOIN someTable AS st0 ON fandl.a = st0.a AND fandl.t0 = st0.start
INNER JOIN someTable AS stN ON fandl.a = stN.a AND fandl.tN = stN.start
WHERE st0.b1 = 'A' AND stN.b2 = 'P'
;
I used the same date function you did, since you did not specify a SQL dialect.
Note that, in many RDBMS, if there is an (a, start) index, the subquery and joins can be done with the index alone; actual table access would only be required for the final WHERE evaluation.
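Mapped back onto the trips table from the question, the generalized query might look like the sketch below. It runs on Python's sqlite3 module with invented rows, and it passes the day boundaries as parameters instead of calling DATE_SUB so the result is reproducible. Like the generalized version, it assumes the trip that starts last is also the trip that ends last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (customer_id INTEGER, start_from TEXT, end_at TEXT, "
    "start_at_time TEXT, end_at_time TEXT)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?, ?)",
    [
        (1, "C", "D", "2024-01-09 10:00:00", "2024-01-09 11:00:00"),  # other day
        (1, "A", "B", "2024-01-10 08:00:00", "2024-01-10 09:00:00"),
        (1, "B", "P", "2024-01-10 17:00:00", "2024-01-10 18:00:00"),
        (2, "B", "A", "2024-01-10 08:00:00", "2024-01-10 09:00:00"),
        (2, "A", "P", "2024-01-10 12:00:00", "2024-01-10 13:00:00"),
        (3, "A", "P", "2024-01-10 10:00:00", "2024-01-10 11:00:00"),
        (4, "A", "C", "2024-01-10 09:00:00", "2024-01-10 10:00:00"),
        (4, "C", "D", "2024-01-10 15:00:00", "2024-01-10 16:00:00"),
    ],
)

query = """
SELECT fl.customer_id
FROM (SELECT customer_id, MIN(start_at_time) AS t0, MAX(start_at_time) AS tn
      FROM trips
      WHERE start_at_time >= :day_start AND start_at_time < :day_end
      GROUP BY customer_id) AS fl
INNER JOIN trips AS first_trip
        ON first_trip.customer_id = fl.customer_id AND first_trip.start_at_time = fl.t0
INNER JOIN trips AS last_trip
        ON last_trip.customer_id = fl.customer_id AND last_trip.start_at_time = fl.tn
WHERE first_trip.start_from = 'A' AND last_trip.end_at = 'P'
ORDER BY fl.customer_id
"""
params = {"day_start": "2024-01-10 00:00:00", "day_end": "2024-01-11 00:00:00"}
matches = conn.execute(query, params).fetchall()
print(matches)  # [(1,), (3,)]
```

Customer 3 has a single trip that day, so the first and last joins both land on the same row, which is fine.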

filtering out duplicate rows using max

I have a table that, for the most part, is individual users. Occasionally there is a joint user. For a joint user, all the fields in the table will be exactly the same as the primary user except for a b-score field. I want to display only one row of data per account, and use the highest b-score to decide which row to use when it is a joint account (so that only the highest score is displayed).
I thought it would be a simple
SELECT DISTINCT accountNo, MAX(bscore) FROM table, GROUP BY accountNo
but I'm still getting multiple rows for joints
You seem to want the ANSI-standard row_number() function:
select t.*
from (select t.*, row_number() over (partition by accountNo order by bscore desc) as seqnum
from t
) t
where seqnum = 1;
This worked for me, maybe not the most efficient. Correlated sub-query. The key part is accountNo = a.accountNo.
SELECT DISTINCT a.accountNo, (SELECT MAX(bscore) FROM table WHERE accountNo = a.accountNo) bscore
FROM table a
GROUP BY a.accountNo

How to make a SQL query for last transaction of every account?

Say I have a table "transactions" that has columns "acct_id" "trans_date" and "trans_type" and I want to filter this table so that I have just the last transaction for each account. Clearly I could do something like
SELECT acct_id, max(trans_date) as trans_date
FROM transactions GROUP BY acct_id;
but then I lose my trans_type. I could then do a second SQL call with my list of dates and account IDs and get my trans_type back, but that feels very kludgy, since it means either sending data back and forth to the SQL server or creating a temporary table.
Is there a way to do this with a single query, hopefully a generic method that would work with mysql, postgres, sql-server, and oracle.
This is an example of a greatest-n-per-group query. This question comes up several times per week on StackOverflow. In addition to the subquery solutions given by other folks, here's my preferred solution, which uses no subquery, GROUP BY, or CTE:
SELECT t1.*
FROM transactions t1
LEFT OUTER JOIN transactions t2
ON (t1.acct_id = t2.acct_id AND t1.trans_date < t2.trans_date)
WHERE t2.acct_id IS NULL;
In other words, return a row such that no other row exists with the same acct_id and a greater trans_date.
This solution assumes that trans_date is unique for a given account, otherwise ties may occur and the query will return all tied rows. But this is true for all the solutions given by other folks too.
I prefer this solution because I most often work on MySQL, which doesn't optimize GROUP BY very well. So this outer join solution usually proves to be better for performance.
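For anyone who wants to try the outer-join approach quickly, here is a minimal sketch using Python's sqlite3 module with invented rows; only the ORDER BY is added so the output order is stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (acct_id INTEGER, trans_date TEXT, trans_type TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "deposit"),
        (1, "2024-02-01", "withdrawal"),
        (2, "2024-01-15", "deposit"),
    ],
)

# A row survives only if the join finds no later row (t2) for the same account.
query = """
SELECT t1.acct_id, t1.trans_date, t1.trans_type
FROM transactions t1
LEFT OUTER JOIN transactions t2
  ON (t1.acct_id = t2.acct_id AND t1.trans_date < t2.trans_date)
WHERE t2.acct_id IS NULL
ORDER BY t1.acct_id
"""
latest = conn.execute(query).fetchall()
print(latest)
# [(1, '2024-02-01', 'withdrawal'), (2, '2024-01-15', 'deposit')]
```

The query uses only joins and a NULL check, so it does not depend on window functions or CTEs being available.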
This works on SQL Server...
SELECT acct_id, trans_date, trans_type
FROM transactions a
WHERE trans_date = (
SELECT MAX( trans_date )
FROM transactions b
WHERE a.acct_id = b.acct_id
)
Try this
WITH
LastTransaction AS
(
SELECT acct_id, max(trans_date) as trans_date
FROM transactions
GROUP BY acct_id
),
AllTransactions AS
(
SELECT acct_id, trans_date, trans_type
FROM transactions
)
SELECT *
FROM AllTransactions
INNER JOIN LastTransaction
ON AllTransactions.acct_id = LastTransaction.acct_id
AND AllTransactions.trans_date = LastTransaction.trans_date
select t.acct_id, t.trans_type, tm.trans_date
from transactions t
inner join (
SELECT acct_id, max(trans_date) as trans_date
FROM transactions
GROUP BY acct_id
) tm on t.acct_id = tm.acct_id and t.trans_date = tm.trans_date