Inner Join a Table to Itself - sql

I have a table that uses two identifying columns, let's call them id and userid. ID is unique in every record, and userid is unique to the user but is in many records.
What I need to do is get a record for the User by userid and then join that record to the first record we have for the user. The logic of the query is as follows:
SELECT v1.id, MIN(v2.id) AS entryid, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
I'm hoping that I don't have to join the table to a subquery that handles the min() piece of the code as that seems to be quite slow.

I guess (it's not entirely clear) you want to find for every user, the rows of the table that have minimum id, so one row per user.
In that case, you can use a subquery (a derived table) and join it to the table:
SELECT v.*
FROM views AS v
JOIN
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
) AS vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
The above can also be written using a Common Table Expression (CTE), if you like them:
; WITH vm AS
( SELECT userid, MIN(id) AS entryid
FROM views
GROUP BY userid
)
SELECT v.*
FROM views AS v
JOIN vm
ON vm.userid = v.userid
AND vm.entryid = v.id ;
Both would be quite efficient with an index on (userid, id).
With SQL-Server, you could write this using the ROW_NUMBER() window function:
; WITH viewsRN AS
( SELECT *
, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY id) AS rn
FROM views
)
SELECT * --- skipping the "rn" column
FROM viewsRN
WHERE rn = 1 ;
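As a quick sanity check of the derived-table and ROW_NUMBER() variants above, here is a minimal sketch that runs both against Python's built-in sqlite3 module. SQLite is only a stand-in engine (the question does not depend on it), the sample rows are made up, and the window-function query assumes SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (id INTEGER PRIMARY KEY, userid INTEGER)")
# Hypothetical sample data: user 10 owns ids 1, 3, 5; user 20 owns ids 2, 4.
conn.executemany("INSERT INTO views (id, userid) VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10), (4, 20), (5, 10)])
# The composite index suggested above.
conn.execute("CREATE INDEX views_userid_id ON views (userid, id)")

# Variant 1: join the table to a derived table holding MIN(id) per user.
derived_sql = """
SELECT v.id, v.userid
FROM views AS v
JOIN ( SELECT userid, MIN(id) AS entryid
       FROM views
       GROUP BY userid
     ) AS vm
  ON vm.userid = v.userid
 AND vm.entryid = v.id
ORDER BY v.userid
"""

# Variant 2: number the rows per user and keep row number 1.
window_sql = """
WITH viewsRN AS
( SELECT id, userid,
         ROW_NUMBER() OVER (PARTITION BY userid ORDER BY id) AS rn
  FROM views
)
SELECT id, userid
FROM viewsRN
WHERE rn = 1
ORDER BY userid
"""

first_by_join = conn.execute(derived_sql).fetchall()
first_by_window = conn.execute(window_sql).fetchall()
print(first_by_join)
print(first_by_window)
```

Both queries should return one row per user, namely the row with the lowest id.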

Well, to use the MIN function along with non-aggregate columns, you'd have to group the statement. That's possible with the query you have... (EDIT based on additional info)
SELECT MIN(v2.id) AS entryid, v1.id, v1.userid
FROM views v1
INNER JOIN views v2
ON v1.userid = v2.userid
GROUP BY v1.id, v1.userid
... however if this is just a simple example and you're looking to pull more data with this query, it quickly becomes an unfeasible solution.
What you seem to want is a list of all the user data in this view, with a link on each row leading back to the "first" record that exists for the same user. The above query will get you what you want, but there are much easier ways to determine the first record for each user:
SELECT v1.id, v1.userid
FROM views v1
ORDER BY v1.userid, v1.id
The first record for each unique user is your "entry point". I think I understand why you want to do it the way you specified, and the first query I gave will be reasonably performant, but you'll have to consider whether not having to use the order by clause to get the correct answer is worth it.
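To make the grouped self-join concrete, here is a small sketch run through Python's sqlite3 module (an assumed stand-in engine, with made-up rows). It shows that every row of the table comes back once, tagged with the lowest id recorded for the same user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (id INTEGER PRIMARY KEY, userid INTEGER)")
# Hypothetical sample data: user 10 owns ids 1, 3, 5; user 20 owns ids 2, 4.
conn.executemany("INSERT INTO views (id, userid) VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10), (4, 20), (5, 10)])

# Self-join on userid, then collapse the joined rows back to one per v1 row.
rows_with_entry = conn.execute("""
SELECT MIN(v2.id) AS entryid, v1.id, v1.userid
FROM views v1
INNER JOIN views v2
   ON v1.userid = v2.userid
GROUP BY v1.id, v1.userid
ORDER BY v1.id
""").fetchall()

for entryid, row_id, userid in rows_with_entry:
    print(f"row {row_id} (user {userid}) -> entry row {entryid}")
```

Note that the join produces one intermediate row per pair of records sharing a userid before the GROUP BY collapses them, which is why this form gets expensive for users with many records.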

edit-1: as pointed out in the comments, this solution also uses a sub-query. However, it does not use aggregate functions, which (depending on the database) might have a huge impact on the performance.
Can achieve without sub-query (see below).
Obviously, an index on views.userid is of tremendous value for the performance.
SELECT v1.*
FROM views v1
WHERE v1.id = (
SELECT TOP 1 v2.id
FROM views v2
WHERE v2.userid = v1.userid
ORDER BY v2.id ASC
)

Related

Update of many rows with join extremely slow

I have a table with five relevant fields - id, source, iid, track_hash, alias. I want to group all entries into groups with a common track_hash and then for each row save the id of the row with the lowest source (with ties broken in favor of the highest iid) entry from its group into the alias field. To do this I wrote the following query:
with best as
(SELECT id as bid, track_hash FROM
(SELECT id, track_hash,
RANK () OVER (
PARTITION BY track_hash
ORDER BY source asc, iid DESC
) rank
from albums
)
where rank = 1
)
select bid, a.* from albums a inner join best
on a.track_hash = best.track_hash
This takes a completely reasonable 2 seconds on 24k rows. Now, instead of simply seeing this id, I want to actually save it. For this, I used the following very similar query:
with best as
(SELECT id as bid, track_hash FROM
(SELECT id, track_hash,
RANK () OVER (
PARTITION BY track_hash
ORDER BY source asc, iid DESC
) rank
from albums
)
where rank = 1
)
update albums
set alias = bid FROM albums a inner join best
on a.track_hash = best.track_hash
However, this one takes anywhere between 1 and 10 minutes, and I really don't understand why. Doesn't the engine have to match every row to its best.id/alias anyway, which is exactly what I'm doing with my update? Why is this happening and what am I doing wrong?
Query plan looks like this:
MATERIALIZE 1
CO-ROUTINE 4
SCAN TABLE albums USING INDEX track_hash_idx
USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
SCAN SUBQUERY 4
SCAN TABLE albums USING COVERING INDEX track_hash_idx
SEARCH SUBQUERY 1 USING AUTOMATIC PARTIAL COVERING INDEX (rank=?)
SEARCH TABLE albums AS a USING COVERING INDEX track_hash_idx (track_hash=?)
You don't need the join to albums (again).
The UPDATE ... FROM syntax actually provides an implicit join of albums to best:
UPDATE albums AS a
SET alias = b.bid
FROM best AS b
WHERE a.track_hash = b.track_hash

Bigquery SQL code to pull earliest contact

I have a copy of our salesforce data in bigquery, I'm trying to join the contact table together with the account table.
I want to return every account in the dataset but I only want the contact that was created first for each account.
I've gone around and around in circles today googling and trying to cobble a query together but all roads either lead to no accounts, a single account or loads of contacts per account (ignoring the earliest requirement).
Here's the latest query, which produces no results. I think I'm nearly there but still struggling. Any help would be most appreciated.
SELECT distinct
c.accountid as Acct_id
,a.id as a_Acct_ID
,c.id as Cont_ID
,a.id AS a_CONT_ID
,c.email
,c.createddate
FROM `sfdcaccounttable` a
INNER JOIN `sfdccontacttable` c
ON c.accountid = a.id
INNER JOIN
(SELECT a2.id, c2.accountid, c2.createddate AS MINCREATEDDATE
FROM `sfdccontacttable` c2
INNER JOIN `sfdcaccounttable` a2 ON a2.id = c2.accountid
GROUP BY 1,2,3
ORDER BY c2.createddate asc LIMIT 1) c3
ON c.id = c3.id
ORDER BY a.id asc
LIMIT 10
The solution shared above is very BigQuery specific: it does have some quirks you need to work around like the memory error you got.
I once answered a similar question here that is more portable and easier to maintain.
Essentially you need to create a smaller table (even better, make it a view) with the ID and its first transaction. It's similar to what you shared but slightly different, as you need to group ONLY in the topmost query.
It looks something like this
select
# contact ids that are first time contacts
b.id as cont_id,
b.accountid
from `sfdccontacttable` as b inner join
( select accountid,
min(createddate) as first_tx_time
FROM `sfdccontacttable`
group by 1) as a on (a.accountid = b.accountid and b.createddate = a.first_tx_time)
group by 1, 2
You need to do it this way because otherwise you can end up with multiple IDs per account (if there are any other dimensions associated with it). This approach is also fairly future-proof: dimensions can be added to the underlying tables without affecting the result, and you can use a WHERE clause in the inner query to define a "valid" contact and so on. You can then save that as a view and simply reference it in any subquery or join operation.
Set up a view/subquery for client_first or client_last as:
SELECT * except(_rank) from (
select rank() over (partition by accountid order by createddate ASC) as _rank,
*
FROM `prj.dataset.sfdccontacttable`
) where _rank=1
Basically, it uses a window function to rank the rows and returns the first row per account: with ASC that is the first client entry, with DESC the last one.
You can do the same for accounts as well, and then the join between the two is simple, as there will be exactly 1 record for each entity.
UPDATE
You could also try using ARRAY_AGG, which has a smaller memory footprint.
#standardSQL
SELECT e.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.createddate ASC LIMIT 1
)[OFFSET(0)] e
FROM `dataset.sfdccontacttable` t
GROUP BY t.accountid
)
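Putting the pieces together, here is a rough sketch of "every account plus its earliest contact". It runs on Python's sqlite3 module as a stand-in for BigQuery, so the syntax is not BigQuery-specific; the `account` and `contact` tables and their rows are simplified, hypothetical versions of the Salesforce tables. It uses ROW_NUMBER() instead of RANK() (my substitution) so that ties on createddate still yield exactly one contact, and a LEFT JOIN so accounts without contacts are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id TEXT PRIMARY KEY);
CREATE TABLE contact (id TEXT PRIMARY KEY, accountid TEXT,
                      email TEXT, createddate TEXT);
-- Hypothetical rows: A1 has two contacts, A2 has one, A3 has none.
INSERT INTO account VALUES ('A1'), ('A2'), ('A3');
INSERT INTO contact VALUES
  ('C1', 'A1', 'late@example.com',  '2020-05-01'),
  ('C2', 'A1', 'early@example.com', '2019-01-15'),
  ('C3', 'A2', 'only@example.com',  '2021-03-09');
""")

first_contact_sql = """
WITH first_contact AS (
  SELECT * FROM (
    SELECT c.*,
           -- id is a tie-breaker so equal dates still give one row
           ROW_NUMBER() OVER (PARTITION BY accountid
                              ORDER BY createddate ASC, id ASC) AS rn
    FROM contact AS c
  ) WHERE rn = 1
)
SELECT a.id, fc.id, fc.email, fc.createddate
FROM account AS a
LEFT JOIN first_contact AS fc ON fc.accountid = a.id
ORDER BY a.id
"""

accounts_with_first_contact = conn.execute(first_contact_sql).fetchall()
for row in accounts_with_first_contact:
    print(row)
```

If accounts without any contact should be dropped instead, swapping the LEFT JOIN for an INNER JOIN would do that.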

SQL Server JOINS

Can someone help explain how, when I have 12 rows in table A and 10 in B and I do an inner join, I can get more rows than there are in both A and B?
The same happens with left and right joins...
This is just a simplified example. Let me share one of my issues with you
I have 2 views, which were originally a single SQL query on 2 base tables, Culture and Trials.
When attempting to add another table, Culture Steps, one of the team members separated the SQL into 2 views.
Since this produces an error when updating (the modification cannot be done as it affects multiple base tables), I would like to change the SQL so that I no longer use the views but achieve the same results.
One of the views has
SELECT some columns
FROM dbo.Culture RIGHT JOIN
dbo.Trial ON dbo.Culture.cultureID = dbo.Trial.CultureID LEFT OUTER JOIN
dbo.TrialCultureSteps_view_part1 ON dbo.Culture.cultureID = dbo.TrialCultureSteps_view_part1.cultureID
The other TrialCultureSteps_view_part1 view
SELECT DISTINCT dbo.Culture.cultureID,
(SELECT TOP (1) WeekNr
FROM dbo.CultureStep
WHERE (CultureID = dbo.Culture.cultureID)
ORDER BY CultureStepID) AS normalstartweek
FROM dbo.Culture INNER JOIN
dbo.CultureStep AS CultureStep_1 ON dbo.Culture.cultureID = CultureStep_1.CultureID
So how can I combine the joins to achieve the same results using SQL only on tables, without the need for views?
Welcome to StackOverflow! This link might be a good place to start in your understanding of JOINs. Essentially, the 'problem' you describe boils down to the fact that one or more of your sources (Trial, Culture, or the TrialCultureSteps view) has more than one record per CultureID - in other words, the same CultureID (#1) shows up on multiple rows.
Based solely on that ID, I'd execute the following three queries. Anything that is returned by them is the 'cause' of your duplications - the culture ID shows up more than once, so you'll have to JOIN on more than just CultureID. If, as I half-suspect, your view is the one that has multiple Culture IDs, you'll need to modify it to only return one record, or change the way that you JOIN to it.
SELECT *
FROM Trial
WHERE CultureID IN
(
SELECT CultureID
FROM Trial
GROUP BY CultureID
HAVING COUNT(*) > 1
)
ORDER BY CultureID
SELECT *
FROM Culture
WHERE CultureID IN
(
SELECT CultureID
FROM Culture
GROUP BY CultureID
HAVING COUNT(*) > 1
)
ORDER BY CultureID
SELECT *
FROM TrialCultureSteps_view_part1
WHERE CultureID IN
(
SELECT CultureID
FROM TrialCultureSteps_view_part1
GROUP BY CultureID
HAVING COUNT(*) > 1
)
ORDER BY CultureID
Let me know if any of these return values!
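To see the row multiplication in isolation, here is a tiny sketch using Python's sqlite3 module. The tables `a` and `b` and their key column `k` are made up for illustration: 12 rows and 10 rows, with join keys that repeat on both sides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (k INTEGER)")
conn.execute("CREATE TABLE b (k INTEGER)")
# a: 12 rows, keys 0, 1, 2 each appearing 4 times.
conn.executemany("INSERT INTO a VALUES (?)", [(i % 3,) for i in range(12)])
# b: 10 rows, keys 0 and 1 each appearing 5 times.
conn.executemany("INSERT INTO b VALUES (?)", [(i % 2,) for i in range(10)])

# Every a-row with key 0 or 1 matches 5 b-rows, so 8 * 5 = 40 joined rows.
joined_count = conn.execute(
    "SELECT COUNT(*) FROM a INNER JOIN b ON a.k = b.k"
).fetchone()[0]

# Same kind of duplicate check as the diagnostic queries above.
dup_keys = conn.execute(
    "SELECT k, COUNT(*) FROM a GROUP BY k HAVING COUNT(*) > 1 ORDER BY k"
).fetchall()

print(joined_count)
print(dup_keys)
```

The join returns more rows than either table holds because each row pairs with every row on the other side that shares its key.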
The comments explain the JOIN issues. As for rewriting, any views could be replaced with CTEs.
One other way to rewrite the query would be the following (though having sample data and the expected result would make it easier to confirm that it's correct):
;with TrialCultureSteps_view_part1 AS
(
Select Row_number() OVER (Partition BY CultureID ORDER BY CultureStepID) RowNumber
, WeekNr
, CultureID
FROM dbo.CultureStep
)
SELECT some columns
FROM dbo.Trial LEFT OUTER JOIN
dbo.Culture ON dbo.Culture.cultureID = dbo.Trial.CultureID LEFT OUTER JOIN
TrialCultureSteps_view_part1 ON dbo.Culture.cultureID = TrialCultureSteps_view_part1.cultureID and RowNumber=1
As for Access, I'm less familiar with the syntax, but I know that Row_Number() isn't available and I don't believe it has CTE syntax either. So, we'd need to put in some more nested derived tables.
SELECT some columns
FROM dbo.Trial LEFT OUTER JOIN
dbo.Culture ON dbo.Culture.cultureID = dbo.Trial.CultureID LEFT OUTER JOIN
( Select cs.CultureID, cs.WeekNr FROM
( SELECT CultureID, MIN(CultureStepID) CultureStepID
FROM dbo.CultureStep
GROUP BY CultureID
) Fcs INNER JOIN
CultureStep cs ON fcs.cultureStepID=cs.CultureStepID
) TrialCultureSteps_view_part1 ON dbo.Culture.cultureID = TrialCultureSteps_view_part1.cultureID
The assumption here is that CultureStepID is a PK for CultureStep. There is no assumption that a step must exist for each Culture entry.

What is the best way to find the max record of a table per a foreign key?

At work, I often have to find the max status per a foreign key. I have for the most part always used a correlated sub-query on the join to get the right record. This is assuming the highest primary key is the most recent. Here is a little demo
select
c.plate_number, o.name
from
Car c
inner join Owner o
on o.owner_id = (
select max(owner_id)
from Owner
where owner_type = 'PRIMARY'
)
This is pretty fast in most queries I use, not to mention being able to put extra criteria in the sub-query for type columns. I have tried using NOT EXISTS clauses to make sure there are no higher records, but can't find anything else. Can someone suggest anything better and if so why?
I recommend using the standard windowing functions....
;with cte as (
select c.plateNumber, o.name,
row_number() over (partition by c.ownerId order by purchaseDate desc) rw
from car c
inner join owner o
on o.ownerid = c.ownerid
)
select *
from cte
where rw=1;
This allows you to get whatever you want from either table, and still only get one record.

Approach to Selecting top item matching a criteria

EDIT: my apologies, this was a MSSQL2008 issue.
I have a SQL problem that I've come up against routinely, and normally just solved w/ a nested query. I'm hoping someone can suggest a more elegant solution.
It often happens that I need to select a result set for a user, conditioned upon it being the most recent, or the most sizeable or whatever.
For example: Their complete list of pages created, but I only want the most recent name they applied to a page. It so happens that the database contains many entries for each page, and only the most recent one is desired.
I've been using a nested select like:
SELECT pg.customName, pg.id
FROM (
select id, max(createdAt) as mostRecent
from pages
where userId = #UserId
GROUP BY id
) as MostRecentPages
JOIN pages pg
ON pg.id = MostRecentPages.id
AND pg.createdAt = MostRecentPages.mostRecent
Is there a better syntax to perform this selection?
Looks like you want
SELECT id, customname
FROM (SELECT id, customname,
row_number() OVER(PARTITION BY id ORDER BY createdat DESC) as pos
FROM pages
WHERE pages.userid = #UserId
) x
WHERE x.pos = 1
(I'm assuming you're using SQL Server from the #UserId parameter. row_number() will work for SQL Server 2005, and tbh the above should also work for Oracle, Postgresql 8.4...)
This will select all the pages by userid and work out which is the most recent using a sort. An alternative would be something like:
SELECT id, (SELECT TOP 1 customname
FROM pages pages_inner
WHERE pages_inner.id = pages_outer.id
ORDER BY pages_inner.createdat DESC) as customname
FROM (SELECT DISTINCT id FROM pages WHERE pages.userid = #UserId) pages_outer
Which approach is better depends on how many pages rows per id you have compared to pages per userid, I guess.
I'm not sure about better but a different syntax you could try is
SELECT pg.customName, pg.id
FROM pages pg
WHERE userId = #UserId
AND NOT EXISTS
(
SELECT * FROM pages pg2
WHERE pg2.UserId = pg.UserId
AND pg2.id = pg.id
AND pg2.createdAt > pg.createdAt
)
Another alternative would be an OUTER JOIN as in Bill Karwin's answer here How to get all the fields of a row using the SQL MAX function?
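For what it's worth, the NOT EXISTS form above can be tried out with a minimal sketch on Python's sqlite3 module. SQLite is only an assumed stand-in for SQL Server here, the `pages` rows are invented, and a `?` placeholder takes the place of the #UserId parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is deliberately not unique: the table keeps many entries per page.
conn.execute("""CREATE TABLE pages (
    id INTEGER, userId INTEGER, customName TEXT, createdAt TEXT)""")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", [
    (1, 7, "draft", "2020-01-01"),
    (1, 7, "final", "2020-02-01"),   # newest entry for page 1
    (2, 7, "notes", "2020-01-10"),   # only entry for page 2
    (3, 8, "other", "2020-03-01"),   # belongs to a different user
])

# Keep a row only if no newer row exists for the same user and page.
latest_sql = """
SELECT pg.customName, pg.id
FROM pages pg
WHERE pg.userId = ?
  AND NOT EXISTS
  (
    SELECT * FROM pages pg2
    WHERE pg2.userId = pg.userId
      AND pg2.id = pg.id
      AND pg2.createdAt > pg.createdAt
  )
ORDER BY pg.id
"""

latest_names = conn.execute(latest_sql, (7,)).fetchall()
print(latest_names)
```

One thing to keep in mind: if two entries for the same page share the exact same createdAt, both survive the NOT EXISTS test, so this form can return ties.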
For what database (including version)? What you posted could be MySQL, SQL Server, or Sybase...
Using:
SELECT pg.customName,
pg.id
FROM PAGES pg
JOIN (SELECT t.id,
t.userid,
MAX(t.createdAt) as mostRecent
FROM PAGES t
GROUP BY t.id, t.userid) x ON x.id = pg.id
AND x.mostRecent = pg.createdAt
AND x.userid = #UserId
This is the best approach for a portable query, assuming column references are correct. But there are alternatives for limiting the data set - SQL Server uses TOP, MySQL/PostgreSQL/SQLite use LIMIT, Oracle uses ROWNUM.
What's best depends on your data & how the respective optimizer sees it, and your needs (portable vs not). Check the explain plan for the respective database to see how efficient the query is.
Are you using Oracle? Try to see if this query that uses analytic function would work for you. (Don't have access to db right now, so can't test myself.)
SELECT DISTINCT pg.id,
FIRST_VALUE(pg.customName) OVER (PARTITION BY pg.id ORDER BY pg.createdAt DESC) AS customName
FROM pages pg
Assuming SQL Server and your Pages table like so:
CREATE TABLE Pages (
Id int IDENTITY(1, 1) PRIMARY KEY
, CustomName nvarchar(20) NOT NULL
, CreatedAt datetime NOT NULL DEFAULT GETDATE()
, UserId int references Users(Id)
)
I would do the following:
select TOP 1 p.Id as PageId
, p.CustomName
from Pages p
where p.UserId = #UserId
order by p.CreatedAt desc
Or even:
select TOP 1 p.Id as PageId
, p.CustomName
, MAX(p.CreatedAt) DateTimeCreated
from Pages p
where p.UserId = #UserId
group by p.Id
, p.CustomName
order by MAX(p.CreatedAt) desc
I hope this helps! (If not, please provide further details so that we can be of more help.)
I don't know what your table looks like
Select top 1 pg.createdAt
,pg.customName
,pg.id
from table pg
where pg.UserId = #UserId
order by pg.createdAt Desc
I need a bit more info on your table(s)