Transform long rows to wide, filling all cells - SQL

I have long-format data on businesses, with a row for each occurrence of a move to a different location, keyed on business_id -- there can be several move events for any one business establishment.
I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.
+-------------+-----------+---------+---------+
| business_id | year_move | long    | lat     |
+-------------+-----------+---------+---------+
| 001013580   | 1991      | 71.0557 | 42.3588 |
| 001015924   | 1993      | 71.0728 | 42.3504 |
| 001015924   | 1996      | -122.28 | 37.654  |
| 001020684   | 1992      | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+
Then I transform like so:
SELECT longbyyear.*
FROM crosstab($$
   SELECT
      business_id,
      year_move,
      max(longitude::float)
   FROM business_moves
   WHERE year_move::int BETWEEN 1991 AND 2010
   GROUP BY business_id, year_move
   ORDER BY business_id, year_move;
$$)
AS longbyyear(biz_id character varying,
              "long91" float, "long92" float, "long93" float, "long94" float,
              "long95" float, "long96" float, "long97" float, "long98" float,
              "long99" float, "long00" float, "long01" float, "long02" float,
              "long03" float, "long04" float, "long05" float, "long06" float,
              "long07" float, "long08" float, "long09" float, "long10" float);
And it mostly gets me to the desired output.
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id | long91 | long92 | long93 | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … | | | |
| 1000678 | 118.224 | | | | … | | | |
| 1002158 | 121.98 | | | | … | | | |
| 1004092 | 71.2384 | | | | … | | | |
| 1007801 | 118.0312 | | | | … | | | |
| 1007855 | 71.1769 | | | | … | | | |
| 1008697 | 71.0394 | 71.0358 | | | … | | | |
| 1008986 | 71.1013 | | | | … | | | |
| 1009617 | 119.9965 | | | | … | | | |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
The only snag is that I would ideally have populated values for each year, not just values in move years. That way every field would be populated, with a value for each year and the most recent address carrying over to the next year. I could hack this with manual updates (if a cell is blank, use the previous column), but I wondered if there is a clever way to do it, either with the crosstab() function or some other way, possibly coupled with a custom function.

In order to get the current location for each business_id for any given year you need two things:
A parameterized query to select the year, implemented as a SQL language function.
A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.
The function then looks like this:
CREATE FUNCTION business_location_in_year_x(int) RETURNS SETOF business_moves AS $$
   WITH last_move AS (
      SELECT business_id, MAX(year_move) AS yr
      FROM business_moves
      WHERE year_move <= $1
      GROUP BY business_id)
   SELECT lm.business_id, $1::int AS yr, bm.longitude, bm.latitude
   FROM business_moves bm
   JOIN last_move lm ON bm.business_id = lm.business_id
                    AND bm.year_move = lm.yr;
$$ LANGUAGE sql;
The sub-query selects only the most recent move for every business. The main query then adds the longitude and latitude columns and puts the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need a record in this table that gives the establishment and initial location of each business_id, or the business will not show up until after it has moved somewhere else.
Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.
If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.
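A sketch of that tweak (my addition, not part of the original answer; it assumes PostgreSQL 9.3+ for LATERAL, casts longitude to float in case it is stored otherwise, and is shortened to five years for brevity):
SELECT *
FROM crosstab($$
   -- one snapshot per year, via the function defined above
   SELECT b.business_id, b.year_move, b.longitude::float
   FROM generate_series(1991, 1995) AS y(yr)
   CROSS JOIN LATERAL business_location_in_year_x(y.yr) b
   ORDER BY 1, 2
$$
 , 'VALUES (1991),(1992),(1993),(1994),(1995)'
) AS t(biz_id character varying,
       long91 float, long92 float, long93 float, long94 float, long95 float);
The two-parameter form of crosstab() is used because businesses established after 1991 have missing cells in the early years.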

I assume you have actual dates for each business move, so we can make meaningful picks per year:
CREATE TEMP TABLE business_moves (
business_id int, -- why would you use inefficient varchar here?
move_date date,
longitude float,
latitude float);
Building on this, a more meaningful test case:
INSERT INTO business_moves VALUES
(001013580, '1991-1-1', 71.0557, 42.3588),
(001015924, '1993-1-1', 71.0728, 42.3504),
(001015924, '1993-3-3', 73.0728, 43.3504), -- 2nd move this year
(001015924, '1996-1-1', -122.28, 37.654),
(001020684, '1992-1-1', 84.3381, 33.5775);
Complete, very fast solution
SELECT *
FROM crosstab($$
   SELECT business_id, year
        , first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
   FROM (
      SELECT *
           , count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
      FROM (SELECT DISTINCT business_id FROM business_moves) b
      CROSS JOIN generate_series(1991, 2010) year
      LEFT JOIN (
         SELECT DISTINCT ON (1, 2)
                business_id
              , EXTRACT('year' FROM move_date)::int AS year
              , point(longitude, latitude) AS x
         FROM business_moves
         WHERE move_date >= '1991-1-1'
         AND   move_date <  '2011-1-1'
         ORDER BY 1, 2, move_date DESC
         ) bm USING (business_id, year)
      ) sub
$$
  ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
     , x91 point, x92 point, x93 point, x94 point, x95 point
     , x96 point, x97 point, x98 point, x99 point, x00 point
     , x01 point, x02 point, x03 point, x04 point, x05 point
     , x06 point, x07 point, x08 point, x09 point, x10 point);
Result:
biz_id | x91 | x92 | x93 | x94 | x95 | x96 | x97 ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
1015924 | | | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654) | (-122.28,37.654) ...
1020684 | | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...
Step-by-step
Step 1
Repair what you had:
SELECT *
FROM crosstab($$
   SELECT DISTINCT ON (1, 2)
          business_id
        , EXTRACT('year' FROM move_date) AS year
        , point(longitude, latitude) AS long_lat
   FROM business_moves
   WHERE move_date >= '1991-1-1'
   AND   move_date <  '2011-1-1'
   ORDER BY 1, 2, move_date DESC
$$
  ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
     , x91 point, x92 point, x93 point, x94 point, x95 point
     , x96 point, x97 point, x98 point, x99 point, x00 point
     , x01 point, x02 point, x03 point, x04 point, x05 point
     , x06 point, x07 point, x08 point, x09 point, x10 point);
You want lat & lon to make it meaningful, so form a point from both. Alternatively, you could just concatenate a text representation.
You may want even more data. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
Select first row in each GROUP BY group?
Since there can be missing values anywhere in the grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
PostgreSQL Crosstab Query
I also adapted the query to work with move_date (a date) instead of year_move.
Step 2
To address your request:
I would ideally have populated values for each year
Build a full grid of values (one cell per business and year) with a CROSS JOIN of businesses and years:
SELECT *
FROM  (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT  JOIN (
   SELECT DISTINCT ON (1, 2)
          business_id
        , EXTRACT('year' FROM move_date)::int AS year
        , point(longitude, latitude) AS x
   FROM business_moves
   WHERE move_date >= '1991-1-1'
   AND   move_date <  '2011-1-1'
   ORDER BY 1, 2, move_date DESC
   ) bm USING (business_id, year)
The set of years comes from a generate_series() call.
Distinct businesses come from a separate SELECT. You might have a table of businesses you could use instead (and more cheaply); that would also account for businesses that never moved.
LEFT JOIN to actual business moves per year to arrive at a full grid of values.
Step 3
Fill in defaults:
with the most recent address carrying over to the next year.
SELECT business_id, year
     , COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
              , '(0,0)') AS x
FROM (
   SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
   FROM (SELECT DISTINCT business_id FROM business_moves) b
   CROSS JOIN generate_series(1991, 2010) year
   LEFT JOIN (
      SELECT DISTINCT ON (1, 2)
             business_id
           , EXTRACT('year' FROM move_date)::int AS year
           , point(longitude, latitude) AS x
      FROM business_moves
      WHERE move_date >= '1991-1-1'
      AND   move_date <  '2011-1-1'
      ORDER BY 1, 2, move_date DESC
      ) bm USING (business_id, year)
   ) sub;
In the subquery sub, build on the query from step 2 and form groups (grp) of cells that share the same location.
For this purpose, utilize the well-known aggregate function count() as a window aggregate function. NULL values don't count, so the value increases with every actual move, thereby forming groups of cells that share the same location.
In the outer query, pick the first value per group for each row in the same group using the window function first_value(). Voilà.
To top it off, optionally(!) wrap that in COALESCE to fill the remaining cells (no move yet, so the location is unknown) with (0,0). If you do that, there are no remaining NULL values, and you can use the simpler form of crosstab(). That's a matter of taste.
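If the grouping trick is hard to picture, here is a minimal, self-contained illustration (toy values, not taken from the question):
SELECT year, x
     , count(x) OVER (ORDER BY year) AS grp  -- increments only when x IS NOT NULL
FROM  (VALUES (1991, NULL), (1992, 'a'), (1993, NULL), (1994, 'b')) t(year, x);
-- grp = 0, 1, 1, 2: the years 1992 and 1993 end up in group 1,
-- so first_value() per group carries 'a' forward into 1993.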
SQL Fiddle with base queries. crosstab() is not currently installed on SQL Fiddle.
Step 4
Use the query from step 3 in an updated crosstab() call.
All in all, this should be as fast as it gets. Indexes may help some more.
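For example (an assumption on my part, since the best index depends on your actual workload), a multicolumn index matching the filter and the DISTINCT ON sort could be a starting point for the innermost query:
-- Hypothetical index; verify against your own query plans.
CREATE INDEX business_moves_id_date_idx ON business_moves (business_id, move_date DESC);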

Related

How to query to capture recency and scale in one query?

I've built a query that calculates the number of ids from a table, per url_count.
with cte as (
    select id, count(distinct url) url_count
    from table
    group by id
)
select sum(if(url_count >= 1, 1, 0)) scale
from cte
union all
select sum(if(url_count >= 2, 1, 0)) scale
from cte
union all
select sum(if(url_count >= 3, 1, 0)) scale
from cte
union all
select sum(if(url_count >= 4, 1, 0)) scale
from cte
union all
select sum(if(url_count >= 5, 1, 0)) scale
from cte
The query above says: "Give me the list of ids and the number of urls they each go to, then accumulate the number of ids who have gone to [1-5] or more urls."
It's of course a tedious method, but it works and outputs something like:
---------
| scale |
---------
|1213432|
|867554 |
|523523 |
|342232 |
|145889 |
---------
From this table, I also have a date field covering the last 5 days, which I'm working on adding into this query. Therein lies the challenge: trying to add a second layer of information to the query, i.e. recency. I've been working on multiple approaches to building a query that outputs all the combinations of different scales per date.
The sort of output I've imagined is a pivot table which presents something like;
-------------------------------------------------------------
| date | url_co1 | url_co2 | url_co3 | url_co4 | url_co5|
-------------------------------------------------------------
|2020-01-05| 1213432 | 1112321 | 984332 | 632131 | 234124 |
|2020-01-04| 1012131 | 934242 | 867554 | 533242 | 134234 |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
-------------------------------------------------------------
Where url_co[1-5] represents the number of ids that visited [1-5] or more urls, and date gives the date that volume was captured. I have no idea how to write that, because once I query:
with cte as (
    select id, date, count(distinct url) url_count
    from table
    group by id, date
)
I've aggregated per id, per date, and from there something goes wrong. =/
Hope that all made sense!
Please, please help! I would appreciate some guidance.
There must be a methodology for getting the combination of volumes per recency that I've missed!
I don't really follow the full question, but the first query can be simplified to:
select url_count, count(*) as this_count,
       sum(count(*)) over (order by url_count desc) as descending_count
from (select id, count(distinct url) as url_count
      from table
      group by id
     ) t
group by url_count
order by url_count;
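As for the per-date pivot you sketched, conditional aggregation over the per-id, per-date subquery gets you there. A sketch (column and table names taken from the question; untested against your schema):
select date,
       sum(case when url_count >= 1 then 1 else 0 end) as url_co1,
       sum(case when url_count >= 2 then 1 else 0 end) as url_co2,
       sum(case when url_count >= 3 then 1 else 0 end) as url_co3,
       sum(case when url_count >= 4 then 1 else 0 end) as url_co4,
       sum(case when url_count >= 5 then 1 else 0 end) as url_co5
from (select id, date, count(distinct url) as url_count
      from table
      group by id, date
     ) t
group by date
order by date desc;
Each row is a date, and each column counts the ids whose per-date url_count reaches that threshold, which matches the imagined output.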

SQL: Count of rows since certain value first occurred: keep counting

This is a similar scenario to
SQL: Count of rows since certain value first occurred
In SQL Server, I'm trying to calculate the count of days since the same weather as today (let's assume today is 6th August 2018) was first observed in the past 5 days, per town.
Here's the data:
+---------+---------+--------+--------+--------+
| Date | Toronto | Cairo | Zagreb | Ankara |
+---------+---------+--------+--------+--------+
| 1.08.18 | Rain | Sun | Clouds | Sun |
| 2.08.18 | Sun | Sun | Clouds | Sun |
| 3.08.18 | Rain | Sun | Clouds | Rain |
| 4.08.18 | Clouds | Sun | Clouds | Clouds |
| 5.08.18 | Rain | Clouds | Rain | Rain |
| 6.08.18 | Rain | Sun | Sun | Sun |
+---------+---------+--------+--------+--------+
This needs to perform well, but all I have come up with so far is a single query for each town (and there are going to be dozens of towns, not just the four). This works but is not going to scale.
Here's the one for Toronto...
SELECT
    DATEDIFF(DAY, MIN([Date]), GETDATE()) + 1
FROM
    (SELECT TOP 5 *
     FROM Weather
     WHERE [Date] <= GETDATE()
     ORDER BY [Date] DESC) a
WHERE
    Toronto = (SELECT TOP 1 Toronto
               FROM Weather
               WHERE [Date] = GETDATE())
...which correctly returns 4 since today there is rain and the first occurrence of rain within the past 5 days was 3rd August.
But what I want returned is a table like this:
+---------+-------+--------+--------+
| Toronto | Cairo | Zagreb | Ankara |
+---------+-------+--------+--------+
| 4 | 5 | 1 | 5 |
+---------+-------+--------+--------+
Slightly modified from the accepted answer by @Used_By_Already is this code:
CREATE TABLE mytable(
Date date NOT NULL
,Toronto VARCHAR(9) NOT NULL
,Cairo VARCHAR(9) NOT NULL
,Zagreb VARCHAR(9) NOT NULL
,Ankara VARCHAR(9) NOT NULL
);
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180801','Rain','Sun','Clouds','Sun');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180802','Sun','Sun','Clouds','Sun');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180803','Rain','Sun','Clouds','Rain');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180804','Clouds','Sun','Clouds','Clouds');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180805','Rain','Clouds','Rain','Rain');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180806','Rain','Sun','Sun','Sun');
with cte as (
    select date, city, weather
    FROM (
        SELECT * from mytable
    ) AS cp
    UNPIVOT (
        Weather FOR City IN (Toronto, Cairo, Zagreb, Ankara)
    ) AS up
)
select
    date, city, weather, datediff(day, ca.prior, cte.date) + 1 as daysPresent
from cte
cross apply (
    select min(prev.date) as prior
    from cte as prev
    where prev.city = cte.city
      and prev.date between dateadd(day, -4, cte.date) and dateadd(day, 0, cte.date)
      and prev.weather = cte.weather
) ca
order by city, date
Output:
However, what I'm trying now is to keep counting daysPresent up even after those five past days in question, meaning that the last marked row in the output sample should show 6. The logic is to increase the previous number by the count of days between them if there is a gap of less than 5 days between them. If the same weather has not occurred within the past 5 days, go back to 1.
I experimented with LEAD and LAG but cannot get it to work. Is this even the right way to add another layer to it, or would the query need to look entirely different? I'm a bit puzzled.
You have a major problem with your data structure. The values should be in rows, not columns. So, start with:
select d.dte, v.*
from data d cross apply
     (values ('Toronto', Toronto), ('Cairo', Cairo), . . .
     ) v(city, val)
where d.dte >= dateadd(day, -5, getdate());
From there, we can use the window function first_value() (or last_value()) to get the most recent reading. The rest is just aggregation by city:
with d as (
    select d.dte, v.*,
           first_value(v.val) over (partition by v.city order by d.dte desc) as last_val
    from data d cross apply
         (values ('Toronto', Toronto), ('Cairo', Cairo), . . .
         ) v(city, val)
    where d.dte >= dateadd(day, -5, getdate())
)
select city, datediff(day, min(dte), getdate()) + 1
from d
where val = last_val
group by city;
This gives you the information you want, in rows rather than columns. You can re-pivot if you really want, but I advise you to keep the per-city data in separate rows.
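If you do need the one-row, city-per-column shape from the question, here is a hedged sketch of the re-pivot (it fills in the remaining cities from the sample data and assumes the data/dte names used above):
with d as (
    select d.dte, v.*,
           first_value(v.val) over (partition by v.city order by d.dte desc) as last_val
    from data d cross apply
         (values ('Toronto', Toronto), ('Cairo', Cairo), ('Zagreb', Zagreb), ('Ankara', Ankara)
         ) v(city, val)
    where d.dte >= dateadd(day, -5, getdate())
), counts as (
    select city, datediff(day, min(dte), getdate()) + 1 as days
    from d
    where val = last_val
    group by city
)
select max(case when city = 'Toronto' then days end) as Toronto,
       max(case when city = 'Cairo'   then days end) as Cairo,
       max(case when city = 'Zagreb'  then days end) as Zagreb,
       max(case when city = 'Ankara'  then days end) as Ankara
from counts;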

SQL to find max of sum of data in one table, with extra columns

Apologies if this has been asked elsewhere. I have been looking on Stackoverflow all day and haven't found an answer yet. I am struggling to write the query to find the highest month's sales for each state from this example data.
The data looks like this:
| order_id | month | cust_id | state | prod_id | order_total |
+-----------+--------+----------+--------+----------+--------------+
| 67212 | June | 10001 | ca | 909 | 13 |
| 69090 | June | 10011 | fl | 44 | 76 |
... etc ...
My query
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders GROUP BY `month`, `state`
ORDER BY sales;
| month | state | sales |
+------------+--------+--------+
| September | wy | 435 |
| January | wy | 631 |
... etc ...
returns a few hundred rows: the sum of sales for each month for each state. I want it to only return the month with the highest sum of sales, but for each state. It might be a different month for different states.
This query
SELECT `state`, MAX(order_sum) as topmonth
FROM (SELECT `state`, SUM(order_total) order_sum FROM orders GROUP BY `month`,`state`)
GROUP BY `state`;
| state | topmonth |
+--------+-----------+
| ca | 119586 |
| ga | 30140 |
returns the correct number of rows with the correct data. BUT I would also like the query to give me the month column. Whatever I try with GROUP BY, I cannot find a way to limit the results to one record per state. I have tried PartitionBy without success, and have also tried unsuccessfully to do a join.
TL;DR: one query gives me the correct columns but too many rows; the other query gives me the correct number of rows (and the correct data) but insufficient columns.
Any suggestions to make this work would be most gratefully received.
I am using Apache Drill, which is apparently ANSI-SQL compliant. Hopefully that doesn't make much difference - I am assuming that the solution would be similar across all SQL engines.
This one should do the trick
SELECT t1.`month`, t1.`state`, t1.`sales`
FROM (
    /* this one selects month, state and sales */
    SELECT `month`, `state`, SUM(order_total) AS sales
    FROM orders
    GROUP BY `month`, `state`
) AS t1
JOIN (
    /* this one selects the best value for each state */
    SELECT `state`, MAX(sales) AS best_month
    FROM (
        SELECT `month`, `state`, SUM(order_total) AS sales
        FROM orders
        GROUP BY `month`, `state`
    ) AS s
    GROUP BY `state`
) AS t2
ON t1.`state` = t2.`state` AND
   t1.`sales` = t2.best_month
It's basically the combination of the two queries you wrote.
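An alternative worth considering (a sketch on my part, untested in Drill; it assumes your engine supports window functions): rank the months within each state and keep the top one. Note that ROW_NUMBER() keeps exactly one month per state even on ties, unlike the join above.
SELECT `month`, `state`, sales
FROM (
    SELECT `month`, `state`, SUM(order_total) AS sales,
           ROW_NUMBER() OVER (PARTITION BY `state`
                              ORDER BY SUM(order_total) DESC) AS rn
    FROM orders
    GROUP BY `month`, `state`
) t
WHERE rn = 1;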
Try this:
SELECT `month`, `state`, SUM(order_total)
FROM orders
WHERE `month` IN
    (SELECT TOP 1 t.month
     FROM (SELECT `month` AS month, SUM(order_total) order_sum
           FROM orders
           GROUP BY `month`
           ORDER BY order_sum DESC) t)
GROUP BY `month`, state;

Select first & last date in window

I'm trying to select the first & last date in a window, based on the month & year of the date supplied.
Here is example data:
F.rates
| id | c_id | date | rate |
---------------------------------
| 1 | 1 | 01-01-1991 | 1 |
| 1 | 1 | 15-01-1991 | 0.5 |
| 1 | 1 | 30-01-1991 | 2 |
.................................
| 1 | 1 | 01-11-2014 | 1 |
| 1 | 1 | 15-11-2014 | 0.5 |
| 1 | 1 | 30-11-2014 | 2 |
Here is pgSQL SELECT I came up with:
SELECT c_id, first_value(date) OVER w, last_value(date) OVER w FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
ORDER BY date ASC)
Which gives me a result pretty close to what I want:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 15-01-1991 |
| 1 | 01-01-1991 | 30-01-1991 |
.................................
Should be:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 30-01-1991 |
.................................
For some reason last_value(date) returns a value for every record in the window, which makes me think I'm misunderstanding how windows in SQL work. It's as if SQL forms a new window for each row it iterates through, rather than multiple windows for the entire table based on YEAR and MONTH.
So could anyone be kind enough to explain where I'm wrong and how I achieve the result I want?
There is a reason why I'm not using MAX/MIN over a GROUP BY clause. My next step would be to retrieve the associated rates for the dates I selected, like:
| c_id | first_date | last_date | first_rate | last_rate | avg rate |
-----------------------------------------------------------------------
| 1 | 01-01-1991 | 30-01-1991 | 1 | 2 | 1.1 |
.......................................................................
If you want your output to become grouped into a single (or just fewer) row(s), you should use simple aggregation (i.e. GROUP BY), if avg_rate is enough:
SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)
More about window functions in PostgreSQL's documentation:
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
...
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
...
There are options to define the window frame in other ways ... See Section 4.2.8 for details.
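Concretely, your original query starts behaving as you expected once the frame is widened to the whole partition. A sketch of that fix (keeping your window definition):
SELECT DISTINCT c_id,
       first_value(date) OVER w AS first_date,
       last_value(date)  OVER w AS last_date
FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
             ORDER BY date
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
Every row in a partition then carries identical first/last values, and DISTINCT collapses them to one row per month.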
EDIT:
If you want to collapse (min/max aggregation) your data and want to collect more columns than those listed in GROUP BY, you have two choices:
The SQL way
Select the min/max value(s) in a sub-query, then join their original rows back (but this way you have to deal with the fact that the min/max-ed column(s) are usually not unique):
SELECT agg.c_id,
       agg.min    AS first_date,
       agg.max    AS last_date,
       first.rate AS first_rate,
       last.rate  AS last_rate,
       agg.avg    AS avg_rate
FROM (SELECT c_id, min(date), max(date), avg(rate)
      FROM F.rates
      GROUP BY c_id, date_trunc('month', date)) agg
JOIN F.rates first ON agg.c_id = first.c_id AND agg.min = first.date
JOIN F.rates last  ON agg.c_id = last.c_id  AND agg.max = last.date
PostgreSQL's DISTINCT ON
DISTINCT ON is typically meant for this task, but it relies heavily on ordering (only one extremum can be searched for this way at a time):
SELECT DISTINCT ON (c_id, date_trunc('month', date))
c_id,
date first_date,
rate first_rate
FROM F.rates
ORDER BY c_id, date
You can join this query with other aggregated sub-queries of F.rates, but at this point (if you really need both minimum and maximum, and in your case even an average) the SQL-compliant way is more suitable.
Windowing functions aren't appropriate for this. Use aggregate functions instead.
select
c_id, date_trunc('month', date)::date,
min(date) first_date, max(date) last_date
from rates
group by c_id, date_trunc('month', date)::date;
c_id | date_trunc | first_date | last_date
------+------------+------------+------------
1 | 2014-11-01 | 2014-11-01 | 2014-11-30
1 | 1991-01-01 | 1991-01-01 | 1991-01-30
create table rates (
id integer not null,
c_id integer not null,
date date not null,
rate numeric(2, 1),
primary key (id, c_id, date)
);
insert into rates values
(1, 1, '1991-01-01', 1),
(1, 1, '1991-01-15', 0.5),
(1, 1, '1991-01-30', 2),
(1, 1, '2014-11-01', 1),
(1, 1, '2014-11-15', 0.5),
(1, 1, '2014-11-30', 2);

SQL - Select unique rows from a group of results

I have racked my brain on this problem for quite some time. I've also reviewed other questions but was unsuccessful.
The problem I have: a list of results/table with multiple rows and the following columns
| REGISTRATION | ID | DATE | UNITTYPE
| 005DTHGP | 172 | 2007-09-11 | MBio
| 005DTHGP | 1966 | 2006-09-12 | Tracker
| 013DTHGP | 2281 | 2006-11-01 | Tracker
| 013DTHGP | 2712 | 2008-05-30 | MBio
| 017DTNGP | 2404 | 2006-10-20 | Tracker
| 017DTNGP | 508 | 2007-11-10 | MBio
I am trying to select rows with unique REGISTRATIONs where the DATE is max (the latest). The IDs are not proportional to the DATE, meaning the ID could be a low value yet the DATE higher than that of the other matching row, and vice versa. Therefore I can't use MAX() on both the DATE and the ID, and grouping just doesn't seem to work.
The results I want are as follows;
| REGISTRATION | ID | DATE | UNITTYPE
| 005DTHGP | 172 | 2007-09-11 | MBio
| 013DTHGP | 2712 | 2008-05-30 | MBio
| 017DTNGP | 508 | 2007-11-10 | MBio
PLEASE HELP!!!?!?!?!?!?!?
You want nested queries, which not all SQL dialects support. In T-SQL you'd have something like
select r.registration, r.recent, t.id, t.unittype
from (
select registration, max([date]) recent
from #tmp
group by
registration
) r
left outer join
#tmp t
on r.recent = t.[date]
and r.registration = t.registration
TSQL:
declare #R table
(
Registration varchar(16),
ID int,
Date datetime,
UnitType varchar(16)
)
insert into #R values ('A','1','20090824','A')
insert into #R values ('A','2','20090825','B')
select R.Registration,R.ID,R.UnitType,R.Date from #R R
inner join
(select Registration,Max(Date) as Date from #R group by Registration) M
on R.Registration = M.Registration and R.Date = M.Date
This can be inefficient if you have thousands of rows in your table, depending upon how the query is executed (i.e. if it is a row scan and then a select per row).
In PostgreSQL, and assuming your data is indexed so that a sort isn't needed (or there are so few rows you don't mind a sort):
select distinct on (registration) * from whatever order by registration, "date" desc;
Taking each row in registration and descending date order, you will get the latest date for each registration first. DISTINCT throws away the duplicate registrations that follow.
select registration,ID,date,unittype
from your_table
where (registration, date) IN (select registration,max(date)
from your_table
group by registration)
This should work in MySQL:
SELECT registration, id, date, unittype
FROM table_name,
     (SELECT registration AS temp_reg, MAX(date) AS temp_date
      FROM table_name
      GROUP BY registration) AS temp_table
WHERE registration = temp_reg AND date = temp_date
The idea is to use a subquery in the FROM clause which throws up a single row per group containing the correct date and registration (the grouped fields); then use the correct date and registration in a WHERE clause to fetch the other fields of the same row.