Approach to Selecting top item matching a criteria - sql

EDIT: my apologies, this was a MSSQL2008 issue.
I have a SQL problem that I've come up against routinely, and normally just solve with a nested query. I'm hoping someone can suggest a more elegant solution.
It often happens that I need to select a result set for a user, conditioned upon it being the most recent, or the most sizeable or whatever.
For example: Their complete list of pages created, but I only want the most recent name they applied to a page. It so happens that the database contains many entries for each page, and only the most recent one is desired.
I've been using a nested select like:
SELECT pg.customName, pg.id
FROM (
select id, max(createdAt) as mostRecent
from pages
where userId = #UserId
GROUP BY id
) as MostRecentPages
JOIN pages pg
ON pg.id = MostRecentPages.id
AND pg.createdAt = MostRecentPages.mostRecent
Is there a better syntax to perform this selection?

Looks like you want
SELECT id, customname
FROM (SELECT id, customname,
row_number() OVER(PARTITION BY id ORDER BY createdat DESC) as pos
FROM pages
WHERE pages.userid = #UserId
) x
WHERE x.pos = 1
(I'm assuming you're using SQL Server from the #UserId parameter. row_number() works on SQL Server 2005 and later, and tbh the above should also work for Oracle, PostgreSQL 8.4...)
This will select all the pages by userid and work out which is the most recent using a sort. An alternative would be something like:
SELECT id, (SELECT TOP 1 customname
FROM pages pages_inner
WHERE pages_inner.id = pages_outer.id
ORDER BY pages_inner.createdat DESC) as customname
FROM (SELECT DISTINCT id FROM pages WHERE pages.userid = #UserId) pages_outer
Which approach is better depends on how many pages rows per id you have compared to pages per userid, I guess.
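To compare the two shapes on something concrete, here is a small sketch that runs the ROW_NUMBER() version and the question's nested MAX() join side by side. It uses Python's built-in sqlite3 module rather than SQL Server (so window functions need SQLite 3.25 or newer), and the sample rows and the userId value 7 are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the question's pages table (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (id INTEGER, userId INTEGER, customName TEXT, createdAt TEXT);
    INSERT INTO pages VALUES
        (1, 7, 'draft', '2010-01-01'),
        (1, 7, 'final', '2010-03-01'),
        (2, 7, 'notes', '2010-02-01'),
        (3, 8, 'other', '2010-02-15');
""")

# Window-function version: number the rows of each page, newest first.
row_number_sql = """
    SELECT id, customName
    FROM (SELECT id, customName,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY createdAt DESC) AS pos
          FROM pages
          WHERE userId = ?) x
    WHERE x.pos = 1
    ORDER BY id
"""

# Nested-aggregate version from the question: join back on MAX(createdAt).
max_join_sql = """
    SELECT pg.id, pg.customName
    FROM (SELECT id, MAX(createdAt) AS mostRecent
          FROM pages
          WHERE userId = ?
          GROUP BY id) m
    JOIN pages pg ON pg.id = m.id AND pg.createdAt = m.mostRecent
    ORDER BY pg.id
"""

latest_by_row_number = conn.execute(row_number_sql, (7,)).fetchall()
latest_by_max_join = conn.execute(max_join_sql, (7,)).fetchall()
print(latest_by_row_number)
print(latest_by_max_join)
```

On this data both queries return the same rows. They can differ when two rows of one page share the same createdAt: the MAX() join returns both, while ROW_NUMBER() picks one of them.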

I'm not sure about better but a different syntax you could try is
SELECT pg.customName, pg.id
FROM pages pg
WHERE userId = #UserId
AND NOT EXISTS
(
SELECT * FROM pages pg2
WHERE pg2.UserId = pg.UserId
AND pg2.id = pg.id
AND pg2.createdAt > pg.createdAt
)
Another alternative would be an OUTER JOIN, as in Bill Karwin's answer to "How to get all the fields of a row using the SQL MAX function?"

For what database (including version)? What you posted could be MySQL, SQL Server, or Sybase...
Using:
SELECT pg.customName,
pg.id
FROM PAGES pg
JOIN (SELECT t.id,
t.userid,
MAX(t.createdAt) as mostRecent
FROM PAGES t
GROUP BY t.id, t.userid) x ON x.id = pg.id
AND x.mostRecent = pg.createdAt
AND x.userid = #UserId
This is the best approach for a portable query, assuming column references are correct. But there are alternatives for limiting the data set: SQL Server uses TOP, MySQL/PostgreSQL/SQLite use LIMIT, Oracle uses ROWNUM.
What's best depends on your data & how the respective optimizer sees it, and your needs (portable vs not). Check the explain plan for the respective database to see how efficient the query is.

Are you using Oracle? Try to see if this query that uses analytic function would work for you. (Don't have access to db right now, so can't test myself.)
SELECT DISTINCT pg.id,
FIRST_VALUE(pg.customName) OVER (PARTITION BY pg.id ORDER BY pg.createdAt DESC) AS customName
FROM pages pg

Assuming SQL Server and your Pages table like so:
CREATE TABLE Pages (
Id int IDENTITY(1, 1) PRIMARY KEY
, CustomName nvarchar(20) NOT NULL
, CreatedAt datetime NOT NULL DEFAULT GETDATE()
, UserId int references Users(Id)
)
I would do the following:
select TOP 1 p.Id as PageId
, p.CustomName
from Pages p
where p.UserId = #UserId
order by p.CreatedAt desc
Or even:
select TOP 1 p.Id as PageId
, p.CustomName
, MAX(p.CreatedAt) DateTimeCreated
from Pages p
where p.UserId = #UserId
group by p.Id
, p.CustomName
order by DateTimeCreated desc
I hope this helps! (If not, please provide further details so that we can be of more help.)

I don't know what your table looks like
Select top 1 pg.createdAt
,pg.customName
,pg.id
from pages pg
where pg.UserId = #UserId
order by pg.createdAt Desc
I need a bit more info on your table(s)

Related

How to query top record group conditional on the counts and strings in a second table

I call on the SQL Gods of the internet!! I so desperately need your help with this query, my livelihood depends on it. I've solved it in Alteryx in like 2 minutes but I need to write this query in SQL and I am relatively new to the language in terms of complex blending and syntax.
Your help would be so appreciated!! :) xoxox I can't begin to describe
Using SSMS I need to use 2 tables 'searches' and 'events' to query...
the TOP 2 [user]s with the highest count of unique search ids in Table 'searches'
Condition that the [user]s in the list have at least 1 eventid in 'events' where [event type] starts with "great"
Here is an example of what needs to happen:
[Image: search, event and end result example]
So the only pieces I have so far are below, but boy oh boy please don't laugh :(
What I was trying to do is:
1. Select a table of unique users with the search counts from the search table
2. Inner join the table selected in step 1 on userid with the table described in step 3
3. Create a table of unique user ids with counts of events with [type] starting with "great"
4. Filter the inner joined table for the top 2 search counts from step 1
SELECT userid, COUNT() as searchcount
FROM searches
GROUP BY userid
INNER JOIN (SELECT userid, COUNT() as eventcount
FROM events WHERE LEFT(type, 5) = "great" AND eventcount>0 Group by userid)
ON searches.userid=events.userId
Obviously, this doesn't work at all!!! I think my structure is off and my method of filtering for "great" is wrong. Also, I don't know how to add the "top 2" clause to the search table query without affecting the inner join. This code needs to be fairly efficient, so if you have a better, more computationally efficient idea... I love you long time
SELECT top(2) userid, COUNT(*) as searchcount FROM searches
where userid in (select userid from events where left(type, 5)='great')
GROUP BY userid
order by count(*) desc
hope above query will serve your purpose.
I think you need exists and the window function dense_rank, as follows:
Select * from
(Select u.userid, dense_rank() over (order by count(*) desc) as rn
From users u join searches s on u.userid = s.userid
Where exists
(select 1 from events e
Where e.userid = u.userid And LEFT(e.type, 5) = 'great')
Group by u.userid ) t Where rn <= 2
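As a rough check of the EXISTS idea, the sketch below builds two tiny tables and asks for the top 2 users by distinct search count among users with at least one "great..." event. It runs on Python's sqlite3, so it uses LIMIT instead of SQL Server's TOP. The column names searchid and eventid and all sample rows are assumptions, since the question only shows the data as an image.

```python
import sqlite3

# Hypothetical schema: the column names searchid/eventid are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE searches (userid TEXT, searchid INTEGER);
    CREATE TABLE events (userid TEXT, eventid INTEGER, type TEXT);
    INSERT INTO searches VALUES
        ('a', 1), ('a', 2), ('a', 3),
        ('b', 4), ('b', 5),
        ('c', 6), ('c', 7), ('c', 8), ('c', 9),
        ('d', 10);
    INSERT INTO events VALUES
        ('a', 1, 'great click'),
        ('b', 2, 'greatest hit'),
        ('c', 3, 'ok click'),
        ('d', 4, 'great view');
""")

# Keep only users with a qualifying event, then rank by distinct searches.
top_users_sql = """
    SELECT s.userid, COUNT(DISTINCT s.searchid) AS searchcount
    FROM searches s
    WHERE EXISTS (SELECT 1
                  FROM events e
                  WHERE e.userid = s.userid
                    AND e.type LIKE 'great%')
    GROUP BY s.userid
    ORDER BY searchcount DESC, s.userid
    LIMIT 2
"""

top_users = conn.execute(top_users_sql).fetchall()
print(top_users)
```

User 'c' has the most searches but no qualifying event, so it is filtered out before the ranking happens. Note that LIMIT 2 (like TOP 2) cuts ties arbitrarily; dense_rank() would keep every user tied for second place.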

Select records where column value is unique

I have a table of posts in a forum (mybb_posts, with the username of the poster).
I want all the posts posted by people who only posted once, in other words, all the rows where username is a single occurrence in the username column.
So far I am using this:
SELECT *
FROM mybb_posts
WHERE username IN
(SELECT username
FROM
(SELECT username,
count(*) COUNT
FROM `mybb_posts`
GROUP BY username) tbl1
WHERE COUNT=1)
But the three nested SELECTs look ugly.
Is there a more elegant/efficient/simple way? All the answers I have seen on SO and elsewhere focus on getting the unique ids.
This is for a MySQL database, if you want to suggest non-standard solutions (but standard ones are preferred).
"all the rows where username is a single occurrence in the username column."
This suggests window functions:
SELECT p.*
FROM (SELECT p.*, COUNT(*) OVER (PARTITION BY p.username) as cnt
FROM mybb_posts p
) p
WHERE cnt = 1;
As a note: You don't need two nested subqueries for your version. You can use a HAVING clause:
SELECT p.*
FROM mybb_posts p
WHERE p.username IN (SELECT p2.username
FROM mybb_posts p2
GROUP BY p2.username
HAVING COUNT(*) = 1
);
The most portable solution that I can think of is not exists and a correlated subquery. This works in most databases, including those that do not support window functions (such as MySQL 5.x versions, or MS Access). This should also be a rather efficient option.
For this, you need a primary key in your table. Assuming that it is called post_id, that would be:
select p.*
from mybb_posts p
where not exists (
select 1
from mybb_posts p1
where p1.username = p.username and p1.post_id <> p.post_id
)
For performance, you need an index on (username, post_id).
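A quick way to convince yourself that the HAVING version and the NOT EXISTS version agree is to run both on a few rows. This sketch uses Python's sqlite3 instead of MySQL; the post_id and message columns and the sample posts are assumptions for illustration.

```python
import sqlite3

# Minimal stand-in for mybb_posts; post_id and message are assumed columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mybb_posts (post_id INTEGER PRIMARY KEY, username TEXT, message TEXT);
    INSERT INTO mybb_posts VALUES
        (1, 'alice', 'hi'),
        (2, 'bob',   'hello'),
        (3, 'alice', 'again'),
        (4, 'carol', 'hey');
    CREATE INDEX ix_posts_username ON mybb_posts (username, post_id);
""")

# Usernames that occur exactly once, via GROUP BY ... HAVING.
having_sql = """
    SELECT p.post_id, p.username
    FROM mybb_posts p
    WHERE p.username IN (SELECT p2.username
                         FROM mybb_posts p2
                         GROUP BY p2.username
                         HAVING COUNT(*) = 1)
    ORDER BY p.post_id
"""

# Same result as an anti-join: no other post by the same username exists.
not_exists_sql = """
    SELECT p.post_id, p.username
    FROM mybb_posts p
    WHERE NOT EXISTS (SELECT 1
                      FROM mybb_posts p1
                      WHERE p1.username = p.username
                        AND p1.post_id <> p.post_id)
    ORDER BY p.post_id
"""

single_posts_having = conn.execute(having_sql).fetchall()
single_posts_not_exists = conn.execute(not_exists_sql).fetchall()
print(single_posts_having)
```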

Bigquery SQL code to pull earliest contact

I have a copy of our Salesforce data in BigQuery, and I'm trying to join the contact table with the account table.
I want to return every account in the dataset, but I only want the contact that was created first for each account.
I've gone around and around in circles today googling and trying to cobble a query together, but all roads lead to either no accounts, a single account, or loads of contacts per account (ignoring the earliest requirement).
Here's the latest query, which produces no results. I think I'm nearly there but still struggling. Any help would be most appreciated.
SELECT distinct
c.accountid as Acct_id
,a.id as a_Acct_ID
,c.id as Cont_ID
,a.id AS a_CONT_ID
,c.email
,c.createddate
FROM `sfdcaccounttable` a
INNER JOIN `sfdccontacttable` c
ON c.accountid = a.id
INNER JOIN
(SELECT a2.id, c2.accountid, c2.createddate AS MINCREATEDDATE
FROM `sfdccontacttable` c2
INNER JOIN `sfdcaccounttable` a2 ON a2.id = c2.accountid
GROUP BY 1,2,3
ORDER BY c2.createddate asc LIMIT 1) c3
ON c.id = c3.id
ORDER BY a.id asc
LIMIT 10
The solution shared above is very BigQuery specific: it does have some quirks you need to work around like the memory error you got.
I once answered a similar question here that is more portable and easier to maintain.
Essentially you need to create a smaller table (even better, a view) with the ID and its first transaction. It's similar to what you shared but slightly different, as you need to group ONLY in the topmost query.
It looks something like this
select
# contact ids that are first time contacts
b.id as cont_id,
b.accountid
from `sfdccontacttable` as b inner join
( select accountid,
min(createddate) as first_tx_time
FROM `sfdccontacttable`
group by 1) as a on (a.accountid = b.accountid and b.createddate = a.first_tx_time)
group by 1, 2
You need to do it this way because otherwise you can end up with multiple IDs per account (if there are any other dimensions associated with it). This approach is also fairly future-proof: dimensions can be added to the underlying tables without affecting the result, and you can use a WHERE clause in the inner query to define a "valid" contact and so on. You can then save that as a view and simply reference it in any subquery or join operation.
Set up a view/subquery for client_first or client_last as:
SELECT * except(_rank) from (
select rank() over (partition by accountid order by createddate ASC) as _rank,
*
FROM `prj.dataset.sfdccontacttable`
) where _rank=1
Basically it uses a window function to rank the rows and returns the first row per account: with ASC that's the first client entry, with DESC the last.
You can do the same for accounts as well, and then the join between the two is simple, as there will be exactly 1 record for each entity.
UPDATE
You could also try using ARRAY_AGG, which has a smaller memory footprint.
#standardSQL
SELECT e.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.createddate ASC LIMIT 1
)[OFFSET(0)] e
FROM `dataset.sfdccontacttable` t
GROUP BY t.accountid
)
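The answers above pick the first contact per account; to also keep accounts that have no contact at all ("every account in the dataset"), one option is to LEFT JOIN the accounts to that per-account first row. Below is a minimal sketch of that idea, run on Python's sqlite3 rather than BigQuery. The table names account and contact and the sample rows are hypothetical, and ROW_NUMBER() with an id tie-breaker is used here instead of rank() so that exactly one contact survives per account.

```python
import sqlite3

# Hypothetical stand-ins for the Salesforce account and contact tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id TEXT PRIMARY KEY);
    CREATE TABLE contact (id TEXT PRIMARY KEY, accountid TEXT, createddate TEXT);
    INSERT INTO account VALUES ('A1'), ('A2'), ('A3');
    INSERT INTO contact VALUES
        ('c1', 'A1', '2020-01-02'),
        ('c2', 'A1', '2020-01-01'),
        ('c3', 'A2', '2020-05-01');
""")

# Number each account's contacts oldest-first, then keep only rn = 1.
# The LEFT JOIN keeps accounts with no contacts (their contact is NULL).
first_contact_sql = """
    SELECT a.id AS account_id, fc.id AS first_contact_id
    FROM account a
    LEFT JOIN (SELECT id, accountid,
                      ROW_NUMBER() OVER (PARTITION BY accountid
                                         ORDER BY createddate ASC, id ASC) AS rn
               FROM contact) fc
           ON fc.accountid = a.id AND fc.rn = 1
    ORDER BY a.id
"""

accounts_with_first_contact = conn.execute(first_contact_sql).fetchall()
print(accounts_with_first_contact)
```

Putting the rn = 1 condition in the ON clause rather than in a WHERE clause is what keeps the contact-less account in the result.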

Inner Join a Table to Itself

I have a table that uses two identifying columns, let's call them id and userid. ID is unique in every record, and userid is unique to the user but is in many records.
What I need to do is get a record for the User by userid and then join that record to the first record we have for the user. The logic of the query is as follows:
SELECT v1.id, MIN(v2.id) AS entryid, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
I'm hoping that I don't have to join the table to a subquery that handles the min() piece of the code as that seems to be quite slow.
I guess (it's not entirely clear) you want to find for every user, the rows of the table that have minimum id, so one row per user.
In that case, you can use a subquery (a derived table) and join it to the table:
SELECT v.*
FROM views AS v
JOIN
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
) AS vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
The above can also be written using a Common Table Expression (CTE), if you like them:
; WITH vm AS
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
)
SELECT v.*
FROM views AS v
JOIN vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
Both would be quite efficient with an index on (userid, id).
With SQL-Server, you could write this using the ROW_NUMBER() window function:
; WITH viewsRN AS
( SELECT *
, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY id) AS rn
FROM views
)
SELECT * --- skipping the "rn" column
FROM viewsRN
WHERE rn = 1 ;
Well, to use the MIN function along with non-aggregate columns, you'd have to group the statement. That's possible with the query you have... (EDIT based on additional info)
SELECT MIN(v2.id) AS entryid, v1.id, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
GROUP BY v1.id, v1.userid
... however if this is just a simple example and you're looking to pull more data with this query, it quickly becomes an unfeasible solution.
What you seem to want is a list of all the user data in this view, with a link on each row leading back to the "first" record that exists for the same user. The above query will get you what you want, but there are much easier ways to determine the first record for each user:
SELECT v1.id, v1.userid
FROM views v1
ORDER BY v1.userid, v1.id
The first record for each unique user is your "entry point". I think I understand why you want to do it the way you specified, and the first query I gave will be reasonably performant, but you'll have to consider whether not having to use the order by clause to get the correct answer is worth it.
edit-1: As pointed out in the comments, this solution also uses a sub-query. However, it does not use aggregate functions, which (depending on the database) might have a huge impact on performance.
The first record per user can be fetched with a correlated sub-query (see below).
Obviously, an index on views.userid is of tremendous value for performance.
SELECT v1.*
FROM views v1
WHERE v1.id = (
SELECT TOP 1 v2.id
FROM views v2
WHERE v2.userid = v1.userid
ORDER BY v2.id ASC
)
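If the goal is the one stated in the question, every row carrying the id of that user's first row, a windowed MIN() gives that shape without a self-join or GROUP BY. This is only a sketch on Python's sqlite3 (SQLite 3.25+ for window functions) with invented sample rows; SQL Server 2005 and later accept the same OVER (PARTITION BY ...) syntax for aggregates.

```python
import sqlite3

# Tiny stand-in for the views table from the question (sample rows invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE views (id INTEGER PRIMARY KEY, userid INTEGER);
    INSERT INTO views VALUES (1, 10), (2, 20), (3, 10), (4, 20), (5, 10);
""")

# MIN(id) computed per userid partition is attached to every row of that user.
entry_sql = """
    SELECT id,
           MIN(id) OVER (PARTITION BY userid) AS entryid,
           userid
    FROM views
    ORDER BY id
"""

rows_with_entry = conn.execute(entry_sql).fetchall()
print(rows_with_entry)
```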

SQL query performance question (multiple sub-queries)

I have this query:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND (
r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
)
Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.
Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.
Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this can't be done without them.
Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and if no pages exist I still want the site returned.
Jack
Edit: I'm using MySQL
If "active" is the first in alphabetical order you might be able to reduce the subqueries to:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND
r.id = (SELECT r2.id
FROM page_revision as r2
WHERE r2.pageId = r.pageId
ORDER BY r2.status, r2.id DESC
LIMIT 1)
Otherwise you can replace ORDER BY line with
ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
These all come from my assumptions on SQL Server, your mileage with MySQL may vary.
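To see how the CASE ordering behaves next to the question's original OR condition, here is a small sketch on Python's sqlite3 (not MySQL or SQL Server). The sample pages and revisions are invented; the status value 'draft' is an assumption standing in for any non-active status.

```python
import sqlite3

# Invented sample data: page 1 has an older active revision and a newer draft,
# page 2 has only drafts, page 3 has two active revisions and a newer draft.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY);
    CREATE TABLE page_revision (id INTEGER PRIMARY KEY, pageId INTEGER,
                                status TEXT, title TEXT);
    INSERT INTO page VALUES (1), (2), (3);
    INSERT INTO page_revision VALUES
        (1, 1, 'active', 'p1 v1'), (2, 1, 'draft',  'p1 v2'),
        (3, 2, 'draft',  'p2 v1'), (4, 2, 'draft',  'p2 v2'),
        (5, 3, 'active', 'p3 v1'), (6, 3, 'active', 'p3 v2'),
        (7, 3, 'draft',  'p3 v3');
""")

# One subquery: active revisions sort first, then newest id wins.
case_sql = """
    SELECT p.id, r.id, r.status
    FROM page AS p
    INNER JOIN page_revision AS r
            ON r.pageId = p.id
           AND r.id = (SELECT r2.id
                       FROM page_revision AS r2
                       WHERE r2.pageId = p.id
                       ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END,
                                r2.id DESC
                       LIMIT 1)
    ORDER BY p.id
"""

# The question's OR condition, for comparison.
or_sql = """
    SELECT p.id, r.id
    FROM page AS p
    INNER JOIN page_revision AS r ON r.pageId = p.id AND (
        r.id = (SELECT MAX(r2.id) FROM page_revision AS r2
                WHERE r2.pageId = r.pageId AND r2.status = 'active')
        OR r.id = (SELECT MAX(r2.id) FROM page_revision AS r2
                   WHERE r2.pageId = r.pageId)
    )
    ORDER BY p.id, r.id
"""

case_rows = conn.execute(case_sql).fetchall()
or_rows = conn.execute(or_sql).fetchall()
print(case_rows)
print(or_rows)
```

On this data the CASE version returns one row per page, while the OR version returns two rows for any page whose newest revision is not the newest active one, which may be one of the "obvious problems" worth checking for.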
Maybe a little re-factoring is in order?
If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.
I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.
Your problem is a particular case of what is described in this question.
The best you can get using standard ANSI SQL seems to be:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.
In MS SQL 2005+ and Oracle:
SELECT o.pageId AS id, o.status, o.title
FROM (
SELECT r.pageId, r.status, r.title,
ROW_NUMBER() OVER (PARTITION BY r.pageId ORDER BY CASE WHEN r.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
FROM page AS p
JOIN page_revision AS r ON r.pageId = p.id
) o
WHERE o.rn = 1
In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.
You'll need to create two indexes and a function that returns the last page revision to use those indexes:
CREATE INDEX ix_revision_page_status_id ON page_revision (pageId, status, id);
CREATE INDEX ix_revision_page_id ON page_revision (pageId, id);
CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
DECLARE rev_id INT;
SELECT o.id
INTO rev_id
FROM (
(SELECT r.id
FROM page_revision r
FORCE INDEX (ix_revision_page_status_id)
WHERE r.pageId = input_id
AND r.status = 'active'
ORDER BY r.id DESC
LIMIT 1)
UNION ALL
(SELECT r.id
FROM page_revision r
FORCE INDEX (ix_revision_page_id)
WHERE r.pageId = input_id
ORDER BY r.id DESC
LIMIT 1)
) o
LIMIT 1;
RETURN rev_id;
END;
SELECT po.id, r.status, r.title
FROM (
SELECT p.*, fn_get_last_revision(p.id) AS rev_id
FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;
This will efficiently use index to get the last revision of the page.
P. S. If you use numeric codes for statuses, with 0 for active, you can get rid of the second index and simplify the query.