MySQL Query: Select most-recent items with a twist - sql

Sorry the title isn't more helpful. I have a database of media-file URLs that came from two sources:
(1) RSS feeds and (2) manual entries.
I want to find the ten most-recently added URLs, but a maximum of one from any feed. To simplify, table 'urls' has columns 'url, feed_id, timestamp'.
feed_id='' for any URL that was entered manually.
How would I write the query? Remember, I want the ten most-recent urls, but only one from any single feed_id.

Assuming feed_id = 0 is the manually entered stuff this does the trick:
select p.* from programs p
left join
(
select max(id) id1 from programs
where feed_id <> 0
group by feed_id
order by max(id) desc
limit 10
) t on id1 = id
where id1 is not null or feed_id = 0
order by id desc
limit 10;
It works because the id column is constantly increasing, so the highest id per feed is also the most recent row. It's also pretty speedy. t is a table alias.
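For anyone who wants to try the id-based query without a MySQL server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The sample rows are made up, the table is cut down to id and feed_id, and SQLite is only a stand-in here, so treat it as an illustration rather than a benchmark.

```python
import sqlite3

# Hypothetical sample data; only the id and feed_id columns of 'programs' are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE programs (id INTEGER PRIMARY KEY AUTOINCREMENT, feed_id INTEGER NOT NULL)"
)
# feed_id = 0 marks a manual entry; ids grow in insertion order.
feed_ids = [1, 1, 0, 2, 2, 0, 1, 3]
conn.executemany("INSERT INTO programs (feed_id) VALUES (?)", [(f,) for f in feed_ids])

QUERY = """
SELECT p.id, p.feed_id
FROM programs p
LEFT JOIN (
    SELECT MAX(id) AS id1          -- newest row of each real feed
    FROM programs
    WHERE feed_id <> 0
    GROUP BY feed_id
    ORDER BY MAX(id) DESC
    LIMIT 10
) t ON t.id1 = p.id
WHERE t.id1 IS NOT NULL OR p.feed_id = 0   -- keep feed winners and all manual rows
ORDER BY p.id DESC
LIMIT 10
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each feed contributes only its newest row, while both manual entries (feed_id = 0) are kept.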
This was my original answer:
(
select
feed_id, url, dt
from feeds
where feed_id = ''
order by dt desc
limit 10
)
union
(
select feed_id, min(url), max(dt) as dt
from feeds
where feed_id <> ''
group by feed_id
order by dt desc
limit 10
)
order by dt desc
limit 10

Assuming this table
CREATE TABLE feed (
feed varchar(20) NOT NULL,
add_date datetime NOT NULL,
info varchar(45) NOT NULL,
PRIMARY KEY (feed,add_date));
this query should do what you want. The inner query selects the last entry by feed and picks the 10 most recent, and then the outer query returns the original records for those entries.
select f2.*
from (select feed, max(add_date) max_date
from feed f1
group by feed
order by max_date desc
limit 10) f1
left join feed f2 on f1.feed=f2.feed and f1.max_date=f2.add_date;

Here's the (abbreviated) table:
CREATE TABLE programs (
id int(11) NOT NULL auto_increment,
feed_id int(11) NOT NULL,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (id)
) ENGINE=InnoDB;
And here's my query based on sambo99's concept:
(SELECT feed_id,id,timestamp
FROM programs WHERE feed_id=''
ORDER BY timestamp DESC LIMIT 10)
UNION
(SELECT feed_id,min(id),max(timestamp)
FROM programs WHERE feed_id<>'' GROUP BY feed_id
ORDER BY timestamp DESC LIMIT 10)
ORDER BY timestamp DESC LIMIT 10;
Seems to work. More testing needed, but at least I understand it. (A good thing!). What's the enhancement using the 'id' column?

You probably want a union. Something like this should work:
(SELECT
url, feed_id, timestamp
FROM rss_items
GROUP BY feed_id
ORDER BY timestamp DESC
LIMIT 10)
UNION
(SELECT
url, feed_id, timestamp
FROM manual_items
GROUP BY feed_id
ORDER BY timestamp DESC
LIMIT 10)
ORDER BY timestamp DESC
LIMIT 10

Would it work to group by the field that you want to be distinct?
SELECT url, feedid FROM urls GROUP BY feedid ORDER BY timestamp DESC LIMIT 10;

MySQL doesn't have the greatest support for this type of query.
You can do it using a combination of "GROUP-BY" and "HAVING" clauses, but you'll scan the whole table, which can get costly.
There is a more efficient solution published here, assuming you have an index on group ids:
http://www.artfulsoftware.com/infotree/queries.php?&bw=1390#104
(Basically, create a temp table, insert into it top K for every group, select from the table, drop the table. This way you get the benefit of the early termination from the LIMIT clause).

Related

Select newest entry for each user without using group by (postgres)

I have a table myTable with four columns:
id UUID,
user_id UUID ,
text VARCHAR ,
date TIMESTAMP
(id is the primary key and user_id is not unique in this table)
I want to retrieve the user_ids ordered by their newest entry, which I am currently doing with this query:
SELECT user_id FROM myTable GROUP BY user_id ORDER BY MAX(date) DESC
The problem is that GROUP BY takes a long time. Is there a faster way to accomplish this? I tried using a window function with PARTITION BY as described here Retrieving the last record in each group - MySQL, but it didn't really speed things up. I've also made sure that user_id is indexed.
My postgres version is 10.4
Edit: The query above that I'm currently using is functionally correct, the problem is that it's slow.
Your query seems like a relevant approach for your requirement:
select user_id
from mytable
group by user_id
order by max(date) desc
I would recommend an index on (user_id, date desc) to speed things up. It needs to be a single index on both columns.
You could also give a try to distinct on, which might, or might not, give you better performance:
select user_id
from (
select distinct on(user_id) user_id, date
from mytable
order by user_id, date desc
) t
order by date desc
Start with an index on user_id, date desc. That might help.
You can also try filtering -- once you have such an index:
select t.user_id
from myTable t
where t.date = (select max(t2.date)
from myTable t2
where t2.user_id = t.user_id
)
order by t.date desc
However, you might find that the order by ends up taking almost as much time as the group by.
This version will definitely use the index for the subquery:
select user_id
from (select distinct on (user_id) user_id, date
from myTable t
order by user_id, date desc
) t
order by date desc;

Select a row with preceding and following rows

I have a table as follows:
CREATE TABLE results (
id uuid primary key UNIQUE,
score integer NOT NULL
)
I need to select a record with a particular UUID and what's around it (say, 5 before and 5 after), ordered by score:
SELECT * FROM results
WHERE id = <SOME_UUID>
ORDER BY score
OFFSET -5 LIMIT 10; -- apparently this is wrong
How can I effectively do that?
It's not 'effective', but you could try this:
select a.* from (SELECT * FROM results
WHERE id <> <SOME_UUID> and score <= (select score from results WHERE id = <SOME_UUID>)
ORDER BY score desc, id desc
LIMIT 5) as a
UNION ALL
SELECT * FROM results
WHERE id = <SOME_UUID>
UNION ALL
select b.* from (SELECT * FROM results
WHERE id <> <SOME_UUID> and score >= (select score from results WHERE id = <SOME_UUID>)
ORDER BY score asc, id asc
LIMIT 5) as b
I tried this on SQL Server, which needed the 'ALL' to run.
Rows that have the same score as the selected row can match both subqueries, so you may get duplicates. To avoid this, wrap the whole statement in another subquery and use select distinct.
One way of solving this is with a rank for each row assigned using a window function and then finding out which ranks you are interested in:
WITH ranked AS (
SELECT id, score, rank() OVER (ORDER BY score) AS rnk
FROM results),
this_rank AS (
SELECT rnk - 5 AS low_rnk FROM ranked
WHERE id = <some uuid>::uuid)
SELECT id, score
FROM ranked, this_rank
WHERE rnk >= low_rnk
ORDER BY rnk
LIMIT 11;
For very high scores you get fewer than 11 rows, rather than rows with NULLs. For very low scores the lower bound drops below 1, so LIMIT 11 simply returns the 11 lowest-ranked rows.
One further detail: A PRIMARY KEY already implies uniqueness so you do not have to use the UNIQUE clause in your table definition.
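To see how the rank-based query behaves in the middle and at both ends of the score range, here is a small sketch using SQLite from Python (window functions need SQLite 3.25 or newer). The ids 'r01' to 'r20' and their scores are made-up test data, and text ids stand in for the uuid column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id TEXT PRIMARY KEY, score INTEGER NOT NULL)")
# Hypothetical data: ids 'r01'..'r20' with distinct scores 1..20.
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [(f"r{n:02d}", n) for n in range(1, 21)],
)

QUERY = """
WITH ranked AS (
    SELECT id, score, rank() OVER (ORDER BY score) AS rnk
    FROM results
),
this_rank AS (
    SELECT rnk - 5 AS low_rnk FROM ranked WHERE id = ?
)
SELECT id, score
FROM ranked, this_rank
WHERE rnk >= low_rnk
ORDER BY rnk
LIMIT 11
"""


def neighbours(target_id):
    """Return the scores of the rows selected around target_id."""
    return [score for _id, score in conn.execute(QUERY, (target_id,))]


middle = neighbours("r10")  # five rows on each side of the target
low = neighbours("r02")     # near the bottom: still 11 rows, mostly after the target
high = neighbours("r19")    # near the top: fewer than 11 rows
print(middle, low, high)
```

With distinct scores the middle case returns exactly five rows on each side. With tied scores rank() assigns equal ranks, so the window can shift.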

How do I get 5 latest comments (SQL query for SQL Server ) for each user?

I have a table that looks like this: comment_id, user_id, comment, last_updated.
Comment_id is a key here. Each user may have multiple comments.
How do I get 5 latest comments (SQL query for SQL Server ) for each user?
Output should be similar to the original table, just limit user's comments to 5 most recent for every user.
Assuming at least SQL Server 2005 so you can use the window function (row_number) and the CTE:
;with cteRowNumber as (
select comment_id, user_id, comment, last_updated, ROW_NUMBER() over (partition by user_id order by last_updated desc) as RowNum
from comments
)
select comment_id, user_id, comment, last_updated
from cteRowNumber
where RowNum <= 5
order by user_id, last_updated desc
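A quick way to check the ROW_NUMBER approach is to run the same CTE on a throwaway database. The sketch below uses SQLite from Python (3.25+ for window functions) rather than SQL Server, with made-up comments for two hypothetical users, so it only illustrates the logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "comment_id INTEGER PRIMARY KEY, user_id INTEGER, comment TEXT, last_updated TEXT)"
)

# Hypothetical data: user 1 has 7 comments, user 2 has 3, one per day.
sample = []
comment_id = 0
for user_id, count in [(1, 7), (2, 3)]:
    for day in range(1, count + 1):
        comment_id += 1
        sample.append((comment_id, user_id, f"comment {comment_id}", f"2024-01-{day:02d}"))
conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)", sample)

QUERY = """
WITH cteRowNumber AS (
    SELECT comment_id, user_id, comment, last_updated,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY last_updated DESC) AS RowNum
    FROM comments
)
SELECT comment_id, user_id, last_updated
FROM cteRowNumber
WHERE RowNum <= 5
ORDER BY user_id, last_updated DESC
"""

latest = conn.execute(QUERY).fetchall()
for row in latest:
    print(row)
```

User 1 is cut down to the five newest comments, while user 2 keeps all three.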
Joe's answer is the best way to do this in SQL Server (at least, I assume it is, I'm not familiar with CTEs). But here's a solution (not very fast!) using standard SQL:
SELECT * FROM comments c1
WHERE (SELECT COUNT(*) FROM comments c2
WHERE c2.user_id = c1.user_id AND c2.last_updated >= c1.last_updated) <= 5
In SQL Server 2005, LIMIT is not valid.
Instead, do something like:
SELECT TOP(5) * FROM Comment WHERE user_id = x ORDER BY comment_id DESC
Note that this assumes that comment_id is monotonically increasing, which may not always be a valid assumption for identity fields (if they need to be renumbered, for example). You may want to consider an alternate field, but the basic structure would be the same.
For example, ordering by a date field, again in descending order so the newest rows come first:
SELECT TOP(5) * FROM Comment WHERE user_id = x ORDER BY last_updated DESC
SELECT TOP 5 * FROM table WHERE user_id = x ORDER BY comment_id DESC
I think that should do it.

Postgres, table1 left join table2 with only 1 row per ID in table1

Ok, so the title is a bit convoluted. This is basically a greatest-n-per-group type problem, but I can't for the life of me figure it out.
I have a table, user_stats:
------------------+---------+---------------------------------------------------------
id | bigint | not null default nextval('user_stats_id_seq'::regclass)
user_id | bigint | not null
datestamp | integer | not null
post_count | integer |
friends_count | integer |
favourites_count | integer |
Indexes:
"user_stats_pk" PRIMARY KEY, btree (id)
"user_stats_datestamp_index" btree (datestamp)
"user_stats_user_id_index" btree (user_id)
Foreign-key constraints:
"user_user_stats_fk" FOREIGN KEY (user_id) REFERENCES user_info(id)
I want to get the stats for each id by latest datestamp. This is a biggish table, somewhere in the neighborhood of 41m rows, so I've created a temp table of user_id, last_date using:
CREATE TEMP TABLE id_max_date AS
(SELECT user_id, MAX(datestamp) AS date FROM user_stats GROUP BY user_id);
The problem is that datestamp isn't unique, since there can be more than 1 stat update in a day (it should have been a real timestamp, but the guy who designed this was kind of an idiot and there's too much data to go back at the moment). So some IDs have multiple rows when I do the JOIN:
SELECT user_stats.user_id, user_stats.datestamp, user_stats.post_count,
user_stats.friends_count, user_stats.favourites_count
FROM id_max_date JOIN user_stats
ON id_max_date.user_id=user_stats.user_id AND date=datestamp;
If I was doing this as subselects I guess I could LIMIT 1, but I've always heard those are horribly inefficient. Thoughts?
DISTINCT ON is your friend.
select distinct on (user_id) * from user_stats order by user_id, datestamp desc, id desc;
Basically you need to decide how to resolve ties, and you need some other column besides datestamp which is guaranteed to be unique (at least over a given user) so it can be used as the tiebreaker. If nothing else, you can use the id primary key column.
Another solution, if you're using PostgreSQL 8.4 or later, is window functions:
WITH numbered_user_stats AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY datestamp DESC) AS RowNum
FROM user_stats
) SELECT u.user_id, u.datestamp, u.post_count, u.friends_count, u.favourites_count
FROM numbered_user_stats AS u
WHERE u.RowNum = 1;
Using the existing infrastructure, you can use:
SELECT u.user_id, u.datestamp,
MAX(u.post_count) AS post_count,
MAX(u.friends_count) AS friends_count,
MAX(u.favourites_count) AS favourites_count
FROM id_max_date AS m JOIN user_stats AS u
ON m.user_id = u.user_id AND m.date = u.datestamp
GROUP BY u.user_id, u.datestamp;
This gives you a single value for each of the 'not necessarily unique' columns. However, it does not absolutely guarantee that the three maxima all appeared in the same row (though there is at least a moderate chance that they will - and that they will all come from the last of entries created on the given day).
For this query, the index on date stamp alone is no help; an index on user ID and date stamp could speed this query up considerably - or, perhaps more accurately, it could speed up the query that generates the id_max_date table.
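The caveat about the maxima not coming from a single row is easy to reproduce. Below is a minimal sketch using SQLite from Python with a cut-down user_stats table and made-up numbers. The derived column is named last_date here instead of date, purely as an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_stats (id INTEGER PRIMARY KEY, user_id INTEGER, datestamp INTEGER,"
    " post_count INTEGER, friends_count INTEGER)"
)
conn.executemany(
    "INSERT INTO user_stats VALUES (?, ?, ?, ?, ?)",
    [
        (1, 42, 20100101, 5, 3),
        (2, 42, 20100102, 10, 5),  # first update on the latest day
        (3, 42, 20100102, 8, 7),   # second update on the same day
    ],
)

QUERY = """
SELECT u.user_id, u.datestamp,
       MAX(u.post_count) AS post_count,
       MAX(u.friends_count) AS friends_count
FROM (SELECT user_id, MAX(datestamp) AS last_date
      FROM user_stats
      GROUP BY user_id) AS m
JOIN user_stats AS u ON m.user_id = u.user_id AND m.last_date = u.datestamp
GROUP BY u.user_id, u.datestamp
"""

combined = conn.execute(QUERY).fetchone()
stored = conn.execute(
    "SELECT user_id, datestamp, post_count, friends_count FROM user_stats"
).fetchall()
print(combined)            # one row per user, built from per-column maxima
print(combined in stored)  # whether that exact combination was ever stored
```

Here post_count comes from one of the two same-day rows and friends_count from the other, so the returned combination was never stored as a row.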
Clearly, you can also write the id_max_date expression as a sub-query in the FROM clause:
SELECT u.user_id, u.datestamp,
MAX(u.post_count) AS post_count,
MAX(u.friends_count) AS friends_count,
MAX(u.favourites_count) AS favourites_count
FROM (SELECT u2.user_id, MAX(u2.datestamp) AS date
FROM user_stats AS u2
GROUP BY u2.user_id) AS m
JOIN user_stats AS u ON m.user_id = u.user_id AND m.date = u.datestamp
GROUP BY u.user_id, u.datestamp;

selecting subsequent records arbitrarily with limit

I want to do a query to retrieve the record immediately after a record for any given record, in a result set ordered by list. I do not understand how to make use of the limit keyword in sql syntax to do this.
I can use WHERE primarykey = number, but how will limiting the result help when I will only have one result?
How would I obtain the next record with an arbitrary primary key number?
I have an arbitrary primary key, and want to select the next one ordered by date.
This will emulate the LEAD() analytic function (i.e. select the next value for each row from the table):
SELECT mo.id, mo.date,
mi.id AS next_id, mi.date AS next_date
FROM (
SELECT mn.id, mn.date,
(
SELECT id
FROM mytable mp
WHERE (mp.date, mp.id) > (mn.date, mn.id)
ORDER BY
mp.date, mp.id
LIMIT 1
) AS nid
FROM mytable mn
ORDER BY
date
) mo,
mytable mi
WHERE mi.id = mo.nid
If you just want to select next row for a given ID, you may use:
SELECT *
FROM mytable
WHERE (date, id) >
(
SELECT date, id
FROM mytable
WHERE id = #myid
)
ORDER BY
date, id
LIMIT 1
This will work most efficiently if you have an index on (date, id).
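Here is a small sketch of the (date, id) row-value comparison, run against SQLite from Python (row values need SQLite 3.15 or newer). The rows and the helper name next_row are made up for illustration. To keep the parameter binding simple, the lookup is split into two statements instead of the single statement with a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, date TEXT NOT NULL)")
conn.execute("CREATE INDEX mytable_date_id ON mytable (date, id)")
# Hypothetical rows; ids 7 and 34 share a date, so id acts as the tiebreaker.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(34, "2024-03-01"), (7, "2024-03-01"), (90, "2024-03-02"), (12, "2024-02-28")],
)


def next_row(row_id):
    """Return the (id, date) row that follows row_id in (date, id) order, or None."""
    anchor = conn.execute(
        "SELECT date, id FROM mytable WHERE id = ?", (row_id,)
    ).fetchone()
    if anchor is None:
        return None  # unknown id
    return conn.execute(
        "SELECT id, date FROM mytable WHERE (date, id) > (?, ?) ORDER BY date, id LIMIT 1",
        anchor,
    ).fetchone()


print(next_row(12), next_row(7), next_row(34), next_row(90))
```

Comparing on the pair instead of on date alone is what keeps rows with the same date from being skipped.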
How about something like this, if you're looking for the one after 34:
SELECT * FROM mytable WHERE primaryKey > 34 ORDER BY primaryKey LIMIT 1
Might be as simple as:
select *
from mytable
where datecolumn > (select datecolumn from mytable where id = #id)
order by datecolumn
limit 1
(Edited after comments)