How do I join six tables in SQL while only including certain rows?

I have 6 tables of people corresponding to 6 calendar years of data, 2010-2015. Each row in each table has a unique variable id corresponding to an individual who participated for the entire year, and each table has the variable year, which is set to that table's participation year.
If an individual doesn't participate for the entire year, there is no corresponding row in that table.
For example,
enyear2010
id year (other variables)
0000001 2010 .
0000002 2010 .
0000003 2010 .
0000004 2010 .
enyear2011
id year (other variables)
0000002 2011 .
0000003 2011 .
0000004 2011 .
0000005 2011 .
enyear2012
id year (other variables)
0000001 2012 .
0000002 2012 .
0000003 2012 .
0000005 2012 .
In the case of id 1, they didn't participate for the entirety of 2011 but did come back in 2012; id 4 left in 2012; and id 5 joined in 2011.
I’d like to join all these tables together and take rows which occur in at least 2 consecutive years (such that for id 1, they wouldn’t be in this joined table), and create a new variable that corresponds to the # of years a person is in the dataset and when that person started.
merged-table
id startyear enrolledyears (other variables)
0000002 2010 3 .
0000003 2010 3 .
0000004 2010 2 .
0000005 2011 2 .
So far, I was able to conceptualize this as a series of left joins, such that the year variable in each table becomes the startyear variable, but I think the process breaks down when somebody enters not in 2010.
Any advice is greatly appreciated!

Firstly, splitting things into tables named by year is not a good table design. You should just put everything in the same table. As it stands, every year that you add will need to be added to whatever SQL you come up with.
You can make it look like one table like this:
SELECT ID, Year FROM enyear2010
UNION ALL
SELECT ID, Year FROM enyear2011
UNION ALL
SELECT ID, Year FROM enyear2012
Now you can use that construct to get what you want. You put that into something called a CTE:
WITH AllData AS (
SELECT ID, Year FROM enyear2010
UNION ALL
SELECT ID, Year FROM enyear2011
UNION ALL
SELECT ID, Year FROM enyear2012
)
SELECT * FROM AllData
Now you can 'self join' to check if an id is in the prior year also:
WITH AllData AS (
SELECT ID, Year FROM enyear2010
UNION ALL
SELECT ID, Year FROM enyear2011
UNION ALL
SELECT ID, Year FROM enyear2012
)
-- Cur and Prev are used as aliases because CURRENT and PRIOR are reserved words in some dialects
SELECT Cur.ID, Cur.Year
FROM AllData AS Cur
INNER JOIN AllData AS Prev
ON Cur.ID = Prev.ID
AND Cur.Year - 1 = Prev.Year
That gets you the list of people with two consecutive years. Now you just summarise it:
WITH AllData AS (
SELECT ID, Year FROM enyear2010
UNION ALL
SELECT ID, Year FROM enyear2011
UNION ALL
SELECT ID, Year FROM enyear2012
)
SELECT ID, COUNT(*) AS YearsEnrolled, MIN(Year) AS StartYear
FROM AllData
WHERE ID IN (
-- Cur and Prev are used as aliases because CURRENT and PRIOR are reserved words in some dialects
SELECT DISTINCT Cur.ID
FROM AllData AS Cur
INNER JOIN AllData AS Prev
ON Cur.ID = Prev.ID
AND Cur.Year - 1 = Prev.Year
)
GROUP BY ID
I think that's what you're after.
There is probably a smarter way to do it using windowing functions... but someone else will no doubt post it.
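For what it's worth, here is a rough sketch of one windowing approach (often called "gaps and islands"). It is not the query above: it groups each person's years into consecutive runs and reports each run of two or more years, so years on the far side of a gap are not counted. It is run through Python's sqlite3 module only so that it is self-contained. The all_data table is a made-up stand-in for the AllData CTE, the ids are shortened to integers, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows copied from the question (ids shortened to integers).
ROWS = [
    (1, 2010), (2, 2010), (3, 2010), (4, 2010),
    (2, 2011), (3, 2011), (4, 2011), (5, 2011),
    (1, 2012), (2, 2012), (3, 2012), (5, 2012),
]

QUERY = """
WITH Numbered AS (
    -- Within one id, consecutive years share the same (year - row number).
    SELECT id, year,
           year - ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS grp
    FROM all_data
)
SELECT id, MIN(year) AS startyear, COUNT(*) AS enrolledyears
FROM Numbered
GROUP BY id, grp
HAVING COUNT(*) >= 2
ORDER BY id, startyear
"""


def consecutive_runs(rows):
    """Return (id, startyear, enrolledyears) for every run of 2+ consecutive years."""
    conn = sqlite3.connect(":memory:")
    # all_data is a hypothetical table standing in for the AllData CTE.
    conn.execute("CREATE TABLE all_data (id INTEGER, year INTEGER)")
    conn.executemany("INSERT INTO all_data VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(consecutive_runs(ROWS))
# [(2, 2010, 3), (3, 2010, 3), (4, 2010, 2), (5, 2011, 2)]
```

On the sample data this gives the same rows as the merged table in the question. The difference only shows up for someone with a gap followed by a longer run: for years 2010, 2012 and 2013 this sketch reports a start year of 2012 with 2 enrolled years.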

You have to merge all the tables first (with UNION ALL or by creating a temp table), then run the SQL below:
select * from (
select MEMBER_ID, max(YEAR_NUM) MAX_YEAR, MIN(YEAR_NUM) MIN_YEAR, COUNT(YEAR_NUM) YEAR_COUNT
from merged_tables
group by MEMBER_ID) w1
where MAX_YEAR=MIN_YEAR+YEAR_COUNT-1 and YEAR_COUNT>1
The SQL above returns every member ID whose enrolled years form a single unbroken run of more than one year. A member with a gap anywhere in their years is excluded entirely.
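A quick way to check how this filter behaves is to run it on a few rows. The sketch below uses Python's sqlite3 module and writes the same MAX/MIN/COUNT condition as a HAVING clause instead of an outer WHERE. The merged_tables rows are assumptions: members 1 and 2 mirror the question's data, and member 6 is a made-up case with a gap followed by two consecutive years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merged_tables (MEMBER_ID INTEGER, YEAR_NUM INTEGER)")
conn.executemany(
    "INSERT INTO merged_tables VALUES (?, ?)",
    [
        (1, 2010), (1, 2012),             # gap in 2011
        (2, 2010), (2, 2011), (2, 2012),  # unbroken run of three years
        (6, 2010), (6, 2012), (6, 2013),  # hypothetical: gap, then two years in a row
    ],
)

# Same condition as the answer: the years are contiguous exactly when
# MAX = MIN + COUNT - 1 (assuming at most one row per member per year).
unbroken = conn.execute("""
    SELECT MEMBER_ID, MIN(YEAR_NUM) AS MIN_YEAR, COUNT(YEAR_NUM) AS YEAR_COUNT
    FROM merged_tables
    GROUP BY MEMBER_ID
    HAVING MAX(YEAR_NUM) = MIN(YEAR_NUM) + COUNT(YEAR_NUM) - 1
       AND COUNT(YEAR_NUM) > 1
    ORDER BY MEMBER_ID
""").fetchall()

print(unbroken)  # [(2, 2010, 3)]
```

Member 6 is dropped even though 2012 and 2013 are consecutive, which may or may not be what you want.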

Related

LAG function alternative. I need the results for the missing year in between

I have this table so far. However, I would like to obtain a result for 2019, for which there are no records, so it should be 0. Are there any alternatives to the LAG function?
ID Year Year_Count
1 2018 10
1 2020 20
Whenever I use the LAG function in SQL, it gives me the results for 2018. However, I would like to get 0 for 2019 and then 10 for 2018.
LAG(YEAR_COUNT) OVER (PARTITION BY ID ORDER BY YEAR) AS previous_year_count
Untested notepad scribble:
CASE
WHEN 1 = YEAR - LAG(YEAR) OVER (PARTITION BY ID ORDER BY YEAR)
THEN LAG(YEAR_COUNT) OVER (PARTITION BY ID ORDER BY YEAR)
ELSE 0
END AS previous_year_count
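To see what that CASE expression returns, here is a small runnable sketch using SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The records table name is made up, and the 2021 row is a hypothetical addition so that at least one row has a real previous year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, year INTEGER, year_count INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, 2018, 10), (1, 2020, 20), (1, 2021, 30)],  # the 2021 row is hypothetical
)

lagged = conn.execute("""
    SELECT id, year, year_count,
           CASE
               -- Only trust LAG when the previous row really is the previous year.
               WHEN 1 = year - LAG(year) OVER (PARTITION BY id ORDER BY year)
               THEN LAG(year_count) OVER (PARTITION BY id ORDER BY year)
               ELSE 0
           END AS previous_year_count
    FROM records
    ORDER BY id, year
""").fetchall()

for row in lagged:
    print(row)
# (1, 2018, 10, 0)
# (1, 2020, 20, 0)
# (1, 2021, 30, 20)
```

Note that this only fixes the value on existing rows. It does not create a row for 2019.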
I'll add on to Nick's comment here with an example.
The YEARS CTE here creates the table of years he suggested, and the RECORDS CTE matches the data posted above. They are then joined together, with COALESCE filling in the NULL values left by the LEFT JOIN (I filled ID with 0; not sure what your case would be).
You need to LEFT JOIN from the YEARS table and select the YEAR column from YEARS in the final query; otherwise you'd end up with only 2018 and 2020, or with those years and some NULL values.
WITH
YEARS AS
(
SELECT 2016 AS YEAR UNION ALL
SELECT 2017 UNION ALL
SELECT 2018 UNION ALL
SELECT 2019 UNION ALL
SELECT 2020 UNION ALL
SELECT 2021 UNION ALL
SELECT 2022
)
,
RECORDS AS
(
SELECT 1 ID, 2018 YEAR, 10 YEAR_COUNT UNION ALL
SELECT 1, 2020, 20)
SELECT
COALESCE(ID, 0) AS ID,
Y.YEAR,
COALESCE(YEAR_COUNT, 0) AS YEAR_COUNT
FROM YEARS AS Y
LEFT JOIN RECORDS AS R
ON R.YEAR = Y.YEAR
Here is the dbfiddle so you can visualize - https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9e777ad925b09eb8ba299d610a78b999
Vertica SQL is not an available test environment, so this may not work directly but should at least get you on the right track.
The LAG function would not work to get 2019 for a few reasons:
It's a window function and can only read from rows that exist in the data. The default offset for LAG is 1, so LAG(YEAR_COUNT) is the same as LAG(YEAR_COUNT, 1), i.e. the previous row, which here is 2018.
Expressions in the SELECT list typically can't add rows to a result; you would need to bring the missing years in with JOINs.
If 2019 does exist in a prior table and you're using GROUP BY to get the year count, it's possible that you have a WHERE clause excluding the data.
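Putting the two ideas together, one possible sketch is to fill in the missing years first and only then apply LAG, so that "previous row" and "previous year" mean the same thing. This runs on SQLite through Python's sqlite3 module, not Vertica. The records table name, the recursive years CTE (used here instead of a hard-coded list) and the choice to span from the earliest to the latest recorded year are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, year INTEGER, year_count INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)",
                 [(1, 2018, 10), (1, 2020, 20)])

filled_with_lag = conn.execute("""
    WITH RECURSIVE years(year) AS (
        -- Every year from the earliest to the latest one present in records.
        SELECT MIN(year) FROM records
        UNION ALL
        SELECT year + 1 FROM years
        WHERE year < (SELECT MAX(year) FROM records)
    ),
    filled AS (
        -- One row per id and year; years without a record get a count of 0.
        SELECT i.id, y.year, COALESCE(r.year_count, 0) AS year_count
        FROM (SELECT DISTINCT id FROM records) AS i
        CROSS JOIN years AS y
        LEFT JOIN records AS r ON r.id = i.id AND r.year = y.year
    )
    SELECT id, year, year_count,
           LAG(year_count, 1, 0) OVER (PARTITION BY id ORDER BY year)
               AS previous_year_count
    FROM filled
    ORDER BY id, year
""").fetchall()

for row in filled_with_lag:
    print(row)
# (1, 2018, 10, 0)
# (1, 2019, 0, 10)
# (1, 2020, 20, 0)
```

The third argument to LAG is the default returned when there is no previous row, which is why 2018 shows 0 instead of NULL.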

taking account another value in different column with the same value in the main column

id year
1 2010
2 2010
1 2009
3 2010
I'm trying to output 2 and 3 with a SQL query based on the year. It should output the ids that only have a year equal to 2010, and never 2009. Id 1 is not in the output because it has 2009 once in the table. I wonder how to execute this with a SQL query.
You can use a subquery:
select * from t
where id not in
(select id from t where year = 2009)
If you want ids where the minimum year is 2010, you can use group by and having:
select id
from t
group by id
having min(year) = 2010;
EDIT:
Based on the comment, you can be more specific. If you want only 2010, then:
having min(year) = max(year) and min(year) = 2010
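The difference between the two HAVING conditions is easiest to see with an id that has 2010 plus a later year. Here is a small sketch using SQLite through Python's sqlite3 module. Id 4 is a hypothetical row that is not in the question, and the ids helper is just a convenience for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1, 2010), (2, 2010), (1, 2009), (3, 2010),  # rows from the question
        (4, 2010), (4, 2011),                        # hypothetical extra id
    ],
)


def ids(having_clause):
    """Return the ids whose group satisfies the given HAVING condition."""
    sql = f"SELECT id FROM t GROUP BY id HAVING {having_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


earliest_is_2010 = ids("MIN(year) = 2010")
only_2010 = ids("MIN(year) = MAX(year) AND MIN(year) = 2010")

print(earliest_is_2010)  # [2, 3, 4]
print(only_2010)         # [2, 3]
```

Id 4 passes the first condition because its earliest year is 2010, but fails the second because it also has 2011.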

How do I know which lines are not in certain groups?

I have two tables, products and products_years. The first table contains only the product id and name (but I just want the id). The second table contains the id, the start and end dates of the period in which the product was marketed, and the price of each product.
Table products is like this:
SELECT id from products
Results in the following output:
id
1
2
3
4
5
And the query:
select Id, DateStart, DateFinish, sum(price) as price from products_years
group by DateStart,DateFinish, Id
order by Id, DateStart
Results in :
Id DateStart DateFinish price
1 2017 2019 100
2 2017 2019 200
3 2017 2019 40
2 2014 2016 30
4 2014 2016 140
I want to know which products stopped being sold in each period that they became available.
The output would be something like this:
id DateStart DateFinish
4 2017 2019
5 2017 2019
1 2014 2016
3 2014 2016
5 2014 2016
Compute all possible combinations of id and range and exclude those actually coming from table2.
with table1 (id) as (values
(1),(2),(3),(4),(5)
), table2 (id,datestart,datefinish) as (values
(1, 2017, 2019),
(2, 2017, 2019),
(3, 2017, 2019),
(2, 2014, 2016),
(4, 2014, 2016)
)
select id, datestart, datefinish
from table1
cross join (select distinct datestart, datefinish from table2) dates
except
select * from table2
order by datestart desc, id
Note the example uses PG12 syntax though the query is universal.
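As a sanity check, the same cross join plus EXCEPT idea can be run on SQLite through Python's sqlite3 module. This is only a sketch: table1 and table2 are created as real tables here instead of VALUES CTEs, and the second SELECT lists its columns explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER, datestart INTEGER, datefinish INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(1, 2017, 2019), (2, 2017, 2019), (3, 2017, 2019),
     (2, 2014, 2016), (4, 2014, 2016)],
)

missing = conn.execute("""
    -- Every id paired with every distinct period ...
    SELECT id, datestart, datefinish
    FROM table1
    CROSS JOIN (SELECT DISTINCT datestart, datefinish FROM table2) AS dates
    EXCEPT
    -- ... minus the id/period pairs that actually exist.
    SELECT id, datestart, datefinish FROM table2
    ORDER BY datestart DESC, id
""").fetchall()

for row in missing:
    print(row)
# (4, 2017, 2019)
# (5, 2017, 2019)
# (1, 2014, 2016)
# (3, 2014, 2016)
# (5, 2014, 2016)
```

This matches the output requested in the question.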
If I understand you correctly you're trying to find records in table1 that are not in table2?
If so, you'll want to use a JOIN. You could use a LEFT JOIN and check for a NULL Id on the right-hand table (right as in the right side of the join, not as in "correct").
SELECT `table1`.`Id` FROM `table1`
LEFT JOIN `table2`
ON `table1`.`Id` = `table2`.`Id`
WHERE `table2`.`Id` IS NULL
ORDER BY `table1`.`Id`
That statement returns every Id in table1 that never appears in table2. The date columns are left out because they come from table2 and are NULL for every unmatched row, so this finds products that were never sold in any period rather than the products missing from each period.
Here's a great answer to another question that has some helpful visuals for using join statements.

SQL query to duplicate each row 12 times

I have a table which has the columns site, year and sales. This table is unique on site+year, e.g.
site year sales
-------------------
a 2012 50
b 2013 100
a 2006 35
Now what I want to do is make this table unique on site+year+month. Each row gets duplicated 12 times, a month column labelled 1-12 is added, and the sales values get divided by 12, thus:
site year month sales
-------------------------
a 2012 1 50/12
a 2012 2 50/12
...
a 2012 12 50/12
...
b 2013 1 100/12
...
a 2006 12 35/12
I am currently doing this in Python and it works like a charm, but I need to do it in SQL (ideally PostgreSQL, since I will be using this as a data source for Tableau).
It would be very helpful if someone could explain the solution as well, since I am a novice at this.
You can use generate_series() for that:
select t.site, t.year, g.month, t.sales / 12
from the_table t
cross join generate_series(1,12) as g (month)
order by t.site, t.year, g.month;
If the column sales is an integer, you should cast that to a numeric to avoid the integer division: t.sales::numeric / 12
Online example: http://rextester.com/GUWPI39685
Try this approach (For T-SQL - MS SQL) :
DECLARE @T TABLE
(
[site] VARCHAR(5),
[year] INT,
sales INT
)
INSERT INTO @T
VALUES('A',2012,50),('B',2013,100),('C',2006,35)
-- Recursive CTE that generates the month numbers 1 to 12
;WITH CTE
AS
(
SELECT
MonthSeq = 1
UNION ALL
SELECT
MonthSeq = MonthSeq+1
FROM CTE
WHERE MonthSeq <12
)
SELECT
T.[site],
T.[year],
[Month] = CTE.MonthSeq,
sales = T.[sales]/12.0 -- 12.0 avoids integer division
FROM CTE
CROSS JOIN @T T
ORDER BY T.[site],T.[year],CTE.MonthSeq
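Since the asker mentioned Python, here is a self-contained sketch of the same cross join idea that can be run with the standard sqlite3 module. It uses a recursive CTE for the months, because generate_series() may not be available in a stock SQLite build. The site_sales table name is an assumption, and the rows are the three from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site_sales (site TEXT, year INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO site_sales VALUES (?, ?, ?)",
                 [("a", 2012, 50), ("b", 2013, 100), ("a", 2006, 35)])

monthly = conn.execute("""
    WITH RECURSIVE months(month) AS (
        -- Generates the numbers 1 to 12, one row each.
        SELECT 1
        UNION ALL
        SELECT month + 1 FROM months WHERE month < 12
    )
    -- The cross join pairs every site/year row with every month.
    -- Dividing by 12.0 keeps the result fractional instead of truncating.
    SELECT s.site, s.year, m.month, s.sales / 12.0 AS sales
    FROM site_sales AS s
    CROSS JOIN months AS m
    ORDER BY s.site, s.year, m.month
""").fetchall()

print(len(monthly))  # 36
print(monthly[0])    # first row: site 'a', year 2006, month 1
```

Each original row turns into 12 rows, so the three sample rows become 36.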

How to make a single line query include multiple lines in Oracle

I would like to take a set of data and expand it by adding date rows based on an existing field. For instance, if I have the following table (TABLE1):
ID NAME YEAR
1 John 2001
2 Jim 2012
3 Sally 2005
I want to take this data and put it into another table but expand it to include a set of months (and from there I can add monthly information). If I just look at the first record (John) my result would be:
ID NAME YEAR MONTH
1 John 2001 01-JAN-2001
1 John 2001 01-FEB-2001
1 John 2001 01-MAR-2001
...
1 John 2001 01-DEC-2001
I have the mechanism to derive my monthly dates, but how do I extract the data from TABLE1 to make TABLE2? Here is just a quick query but, of course, I get ORA-01427 (single-row subquery returns more than one row), as expected. I'm just not sure how to organize the query to put these two pieces together:
select id,
name,
year,
book_cd,
(SELECT ADD_MONTHS('01-JAN-'|| year, LEVEL - 1)
FROM DUAL CONNECT BY LEVEL <= 12) month
from table1 ;
I realize I can't do this, but I'm not sure how to put the two pieces together. I plan to bulk process records, so it won't be one ID at a time. Thanks for the help.
You can use a cross join:
select t.id,
t.name,
t.year,
t.book_cd,
ADD_MONTHS(to_date(t.year || '-01-01', 'YYYY-MM-DD'), m.rn) as mnth
from table1 t
cross join (select rownum - 1 as rn
from dual
connect by rownum <= 12) m
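If you want to try the shape of this query without an Oracle instance, the sketch below runs the same cross join on SQLite through Python's sqlite3 module. It is not Oracle syntax: a recursive CTE stands in for CONNECT BY, SQLite's date() with a '+N months' modifier stands in for ADD_MONTHS, the dates come back as ISO strings, and book_cd is left out because it is not in the sample table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT, year INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                 [(1, "John", 2001), (2, "Jim", 2012), (3, "Sally", 2005)])

expanded = conn.execute("""
    WITH RECURSIVE m(rn) AS (
        -- Month offsets 0 to 11, like rownum - 1 in the Oracle query.
        SELECT 0
        UNION ALL
        SELECT rn + 1 FROM m WHERE rn < 11
    )
    SELECT t.id, t.name, t.year,
           -- First day of the year plus rn months.
           date(printf('%04d-01-01', t.year), '+' || m.rn || ' months') AS mnth
    FROM table1 AS t
    CROSS JOIN m
    ORDER BY t.id, m.rn
""").fetchall()

print(expanded[0])   # (1, 'John', 2001, '2001-01-01')
print(expanded[11])  # (1, 'John', 2001, '2001-12-01')
```

Every row of table1 is paired with all twelve offsets, so there is no one-ID-at-a-time processing involved.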