How to index for a self join - sql

I'm using SAS University Edition to analyze the following table (actually has 2.5M rows in it)
p_id c_id startyear endyear
0001 3201 2008 2013
0001 2131 2013 2015
0013 3201 2006 2010
where p_id is person_id and c_id is companyid.
I want to get the number of colleagues (the number of persons who worked at the same company during an overlapping span) in a certain year, so I created a table with the distinct p_ids and ran the following query:
PROC SQL;
UPDATE no_colleagues AS t1
SET c2007 = (
SELECT COUNT(DISTINCT t2.p_id) - 1
FROM table AS t2
INNER JOIN table AS t3
ON t3.p_id = t1.p_id
AND t3.c_id = t2.c_id
AND t3.startyear <= t2.endyear /* checks overlapping criteria */
AND t3.endyear >= t2.startyear /* checks overlapping criteria */
AND t3.startyear <= 2007 /* limits number of returns */
AND t2.startyear <= 2007 /* limits number of returns */
);
QUIT;
A single lookup on an indexed query (p_id, c_id, startyear, endyear) takes 0.04 seconds. The query above takes about 1.8 seconds for a single update, and does not use any indexes.
So my question is:
How to improve the query, and/or how to use indices to make sure the self join can use the indices?
Thanks in advance.

Based on your data, I'd do something like this, but maybe you need to tweak the code to fit your needs.
First, create a table with p_id, c_id, year.
So your first guy working at the company 3201 will have 6 observations in this table, one for each worked year.
data have_count;
set have;
do i=startyear to endyear;
worked_in = i;
output;
end;
drop i startyear endyear;
run;
Now you just count and aggregate:
proc sql;
select
worked_in as year
,c_id
,count(distinct p_id) as no_colleagues
from have_count
group by 1,2;
quit;
Result:
year c_id no_colleagues
2006 3201 1
2007 3201 1
2008 3201 2
2009 3201 2
2010 3201 2
2011 3201 1
2012 3201 1
2013 2131 1
2013 3201 1
2014 2131 1
2015 2131 1
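To make the expand-then-count idea easy to check outside SAS, here is a minimal Python sketch of the same two steps. The function name headcount_by_year is made up for illustration, and the rows are just the three sample rows from the question.

```python
from collections import defaultdict

# Sample rows from the question: (p_id, c_id, startyear, endyear).
rows = [
    ("0001", "3201", 2008, 2013),
    ("0001", "2131", 2013, 2015),
    ("0013", "3201", 2006, 2010),
]

def headcount_by_year(rows):
    """Return {(year, c_id): number of distinct persons employed that year}."""
    persons = defaultdict(set)
    for p_id, c_id, start, end in rows:
        # Step 1: one entry per worked year (the DATA step loop).
        for year in range(start, end + 1):
            persons[(year, c_id)].add(p_id)
    # Step 2: count distinct persons per (year, company) (the GROUP BY).
    return {key: len(ids) for key, ids in persons.items()}

counts = headcount_by_year(rows)
for (year, c_id), n in sorted(counts.items()):
    print(year, c_id, n)
```

Note that this counts everyone at the company in that year, so the number of colleagues of any one person is that count minus one.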

A more efficient method:
1) Create a long format table for the results rather than wide format. This will be both easier to populate and easier to work with later.
create table colleagues_by_year (
p_id int,
year int,
colleagues int
);
Now this can be populated with a single insert statement. The only trick is getting the full list of years you want in the final table. There are a few options, but since I'm not too familiar with SAS SQL I'm going to go with a very simple one: a lookup table of years, to which you can join.
create table years (
year int
);
insert into years
values (2007),(2008),...
(A more sophisticated approach would be a recursive query that found the range of all years in the input data).
Now the final insert:
insert into colleagues_by_year
select p_id,
year,
count(*)
from colleagues
join years on
years.year between colleagues.startyear and colleagues.endyear
group by p_id,year
This won't have any rows where the number of colleagues for the year would be 0. If you wanted that you could make years be a left join and only count the rows where years.year is not null.
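Note that count(*) in the insert above counts job rows per person and year. If what you want is the number of other people at the same company in that year, one possibility is a self join through the years table. The following is only a sketch: it runs the SQL through Python's sqlite3 module rather than SAS, the table name jobs is an assumption, and it treats "colleague" as "someone at the same company in the same calendar year", which may be narrower or wider than what you need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (p_id TEXT, c_id TEXT, startyear INT, endyear INT);
    INSERT INTO jobs VALUES
        ('0001', '3201', 2008, 2013),
        ('0001', '2131', 2013, 2015),
        ('0013', '3201', 2006, 2010);
    CREATE TABLE years (year INT);
""")
# Lookup table of years, filled for the range covered by the sample data.
conn.executemany("INSERT INTO years VALUES (?)", [(y,) for y in range(2006, 2016)])

query = """
    SELECT a.p_id, y.year, COUNT(DISTINCT b.p_id) AS colleagues
    FROM years y
    JOIN jobs a
      ON y.year BETWEEN a.startyear AND a.endyear
    LEFT JOIN jobs b                 -- other people at the same company that year
      ON b.c_id = a.c_id
     AND b.p_id <> a.p_id
     AND y.year BETWEEN b.startyear AND b.endyear
    GROUP BY a.p_id, y.year
"""
result = {(p_id, year): n for p_id, year, n in conn.execute(query)}
print(result[("0001", 2008)], result[("0013", 2007)])
```

Because of the LEFT JOIN, person-years without any colleague come out as 0 instead of being dropped.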

Related

Hive QL to populate a sequence of numbers between limits

Not sure how to put this in a straight forward manner but I'm trying to make something work in Hive SQL. I need to create a sequence of numbers from lower limit to upper limit.
Ex:
select min(year) from table
Let's assume it results in 2010
select max(year) from table
Let's assume it results in 2015
I need to publish each year from 2010 to 2015 in a select query.
And I'm trying to put the min calculation & max calculation inside the same SQL which will/should create sequential years in the output.
Any ideas?
Well I have an idea but in order to use it, you will have to define the lowest possible and the largest possible values for the years that might be present in your table.
Let's say the smallest possible year is 1900 and the largest possible year is 2200.
Since the largest possible difference in this case is 2200-1900=300, you will have to use the following string: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ... 298 299 300.
In the query, you split this string using space as a delimiter thus getting an array, and then you explode that array.
Have a look:
SELECT
minval + delta
FROM
(
SELECT
min(year) minval,
max(year) maxval,
split('0 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ... ... 298 299 300', ' ') delta_list
FROM
table
) t
LATERAL VIEW explode(delta_list) dlist AS delta
WHERE (maxval-minval) >= delta
;
So you end up with 301 rows, but you only need the rows with delta values not exceeding the difference between max year and min year, which is what the WHERE clause enforces.
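If the split / explode / filter logic is hard to picture, here is the same idea as a small Python sketch. The function name years_between and the max_span parameter are invented for this illustration; max_span plays the role of the hard-coded 300 in the string.

```python
def years_between(years, max_span=300):
    """Mimic the Hive query: explode a fixed list of deltas, keep the ones in range."""
    minval, maxval = min(years), max(years)
    # Equivalent of split('0 1 2 ... 300', ' '): a list of delta strings.
    delta_list = " ".join(str(d) for d in range(max_span + 1)).split(" ")
    # Equivalent of LATERAL VIEW explode(...) plus the WHERE clause.
    return [minval + int(delta) for delta in delta_list
            if maxval - minval >= int(delta)]

print(years_between([2012, 2010, 2015]))
```

The cost of this approach is that every query walks the whole delta list, however small the actual range is.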
set hivevar:end_year=2019;
set hivevar:start_year=2010;
select ${hivevar:start_year}+i as year
from
(
select posexplode(split(space((${hivevar:end_year}-${hivevar:start_year})),' ')) as (i,x)
)s;
Result:
year
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
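The trick here is that a string of n spaces split on a space gives n+1 empty pieces, and posexplode numbers them from 0. Python's str.split(" ") happens to behave the same way, so the idea can be sketched like this (year_range is a made-up name):

```python
def year_range(start_year, end_year):
    """Build start_year..end_year from the positions of an exploded blank string."""
    # n spaces split on ' ' yields n + 1 empty strings, like split(space(n), ' ').
    blanks = (" " * (end_year - start_year)).split(" ")
    # enumerate() supplies the position, like posexplode's first column.
    return [start_year + i for i, _ in enumerate(blanks)]

print(year_range(2010, 2019))
```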
Have a look also at this answer about generating missing dates.

SQL Server : list months between range

I am trying to calculate the values of financial hedges but I want the input to be as simple as possible. So for example, I have a contract with a quantity of 1,000 per month from 2/1/2017 to 12/31/2018 (it won't always be this date range, it could be 1 month or 3+ years) with a strike price of $3. I want to enter just 1 row of data into 4 columns: [Volume],[Start],[End],[Strike].
The issue is I need to multiply the 2017 data by one price and the 2018 by a different price. The easy answer would be to enter a row for 2017 and a row for 2018 but I don't want to do this because I may have 50 or more contracts to enter in and I want to do as little input as possible.
I would use two columns [Year],[Price] from a price table. It would look like [2017],[4.50] and [2018],[5.25]. I can easily modify this to be monthly instead of annual if it helps simplify things.
I need the final calculation to be along the lines of:
2017: 11,000 * (3 - 4.50) = -$16,500
2018: 12,000 * (3 - 5.25) = -$27,000
Total value = -$43,500
So my question is, how can I get a count of months for each year in the range?
I would like the output to be something like
2017, 11
2018, 12
Calendar tables can be helpful in situations like this, there are many sample scripts out there, here is one: https://www.mssqltips.com/sqlservertip/4054/creating-a-date-dimension-or-calendar-table-in-sql-server/
You could also simply use year and month in your price table and join on the range. An integer Year_Month_Int field could be used for this (201601, 201602, 201603, ...).
Then:
SELECT *
FROM contracts c
JOIN prices p
ON p.Year_Month_Int BETWEEN (YEAR(c.[Start])*100)+MONTH(c.[Start]) AND (YEAR(c.[End])*100)+MONTH(c.[End])
If you included the integer version year-month for start and end in the contracts table you could simplify the JOIN criteria.
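To check the arithmetic from the question independently of the database, here is a rough Python sketch that counts the months per year in a range and applies a yearly price. The function names months_per_year and hedge_value are invented for this example; the prices, volume and strike are the ones given in the question.

```python
from datetime import date

def months_per_year(start, end):
    """Return {year: number of calendar months touched by [start, end]}."""
    counts = {}
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        counts[year] = counts.get(year, 0) + 1
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return counts

def hedge_value(volume, strike, start, end, prices):
    """Return {year: monthly volume * months * (strike - price for that year)}."""
    return {year: volume * n * (strike - prices[year])
            for year, n in months_per_year(start, end).items()}

months = months_per_year(date(2017, 2, 1), date(2018, 12, 31))
values = hedge_value(1000, 3, date(2017, 2, 1), date(2018, 12, 31),
                     {2017: 4.50, 2018: 5.25})
print(months, values, sum(values.values()))
```

Partial months count as whole months here, which matches the 11 and 12 in the question but may not suit contracts that start mid-month.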
Another variant on how to count the number of months in a range would be the following (note that it uses MySQL syntax, not SQL Server):
Setup:
create table soTest(
id integer not null auto_increment,
start_date datetime,
end_date datetime,
primary key (id)
);
insert into soTest (start_date, end_date) values
('2017-02-01 00:00:00','2018-12-31 00:00:00');
Query:
select distinct case when extract(year from start_date) = yr
then format(datediff(concat(yr,'-12-31'), start_date)/30,0)
else format(datediff(end_date, concat(yr,'-01-01'))/30,0) end as qtyMonths,
yr
from (select s.*, extract(year from start_date) yr from soTest s
union all
select s.*, extract(year from end_date) yr from soTest s
) a;
Result:
qtyMonths yr
11 2017
12 2018
Be aware that this technique only counts the months for a single record, and it only looks at the start year and the end year, so any full years in between are skipped. To get data from your contracts you would have to JOIN this with your dataset tables.

SQL: Can GROUP BY contain an expression as a field?

I want to group a set of dated records by year, when the date is to the day. Something like:
SELECT venue, YEAR(date) AS yr, SUM(guests) AS yr_guests
FROM Events
...
GROUP BY venue, YEAR(date);
The above is giving me results instead of an error, but the results are not grouping by year and venue; they do not appear to be grouping at all.
My brute force solution would be a nested subquery: add the YEAR() AS yr as an extra column in the subquery, then do the grouping on yr in the outer query. I'm just trying to learn to do as much as possible without nesting, because nesting usually seems horribly inefficient.
I would tell you the exact SQL implementation I'm using, but I've had trouble discovering it. (I'm working through the problems on http://www.sql-ex.ru/ and if you can tell what they're using, I'd love to know.) Edited to add: Per test in comments, it is probably not SQL Server.
Edited to add the results I am getting (note the first two should be summed):
venue | yr | yr_guests
1 2012 15
1 2012 35
2 2012 12
1 2008 15
I expect those first two lines to instead be summed as
1 2012 50
This works fine in SQL Server 2008.
See a working example here: http://sqlfiddle.com/#!3/3b0f9/6
The code is pasted below.
Create the Events table.
CREATE TABLE [Events]
( Venue INT NOT NULL,
[Date] DATETIME NOT NULL,
Guests INT NOT NULL
)
Insert the Rows.
INSERT INTO [Events] VALUES
(1,convert(datetime,'2012'),15),
(1,convert(datetime,'2012'),35),
(2,convert(datetime,'2012'),12),
(1,convert(datetime,'2008'),15);
GO
-- Testing, select newly inserted rows.
--SELECT * FROM [Events]
--GO
Run the GROUP BY Sql.
SELECT Venue, YEAR(date) AS yr, SUM(guests) AS yr_guests
FROM Events
GROUP BY venue, YEAR(date);
See the Output Results.
VENUE YR YR_GUESTS
1 2008 15
1 2012 50
2 2012 12
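For anyone without SQL Server at hand, the same check can be sketched with Python's built-in sqlite3 module. SQLite has no YEAR() function, so strftime('%Y', ...) stands in for it here; that substitution, and storing the dates as ISO text, are assumptions of this sketch and not part of the original test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Events (venue INT NOT NULL, date TEXT NOT NULL, guests INT NOT NULL);
    INSERT INTO Events VALUES
        (1, '2012-01-01', 15),
        (1, '2012-01-01', 35),
        (2, '2012-01-01', 12),
        (1, '2008-01-01', 15);
""")

# Group directly by the expression, without a nested subquery.
rows = conn.execute("""
    SELECT venue,
           CAST(strftime('%Y', date) AS INTEGER) AS yr,
           SUM(guests) AS yr_guests
    FROM Events
    GROUP BY venue, strftime('%Y', date)
    ORDER BY venue, yr
""").fetchall()

for row in rows:
    print(row)
```

Here too the two 2012 rows for venue 1 collapse into one row with 50 guests, so grouping by an expression is not specific to SQL Server.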
It depends on your database engine (and its SQL dialect).
To be sure it works across different database systems and versions, use a subquery:
SELECT venue, theyear, SUM(guests) AS yr_guests
FROM (
SELECT venue, YEAR(date) AS theyear, guests
FROM Events
) e
GROUP BY venue, theyear
The subquery builds an intermediate table of
venue, theyear, guests
1, 2012, 15
1, 2012, 35
2, 2012, 12
... and so on
and the outer query then groups those rows by venue and year and sums the guests.

Generating "resource allocation" report using SQL

I am trying to generate a report based on the below two tables:
Name Start Year End Year No. Of Students Fill Order
School-ABC 2000 2004 1 1
School-DEF 2000 2004 2 3
School-GHI 2000 2004 1 2
Name Start Year End Year Joined On
Student-1 2000 2004 01-Jan
Student-2 2000 2004 03-Jan
Student-3 2000 2004 02-Jan
Student-4 2000 2004 15-Jan
The expected output is below:
Name Start Year End Year Joined On School
Student-1 2000 2004 01-Jan School-ABC
Student-2 2000 2004 03-Jan School-DEF
Student-3 2000 2004 02-Jan School-GHI
Student-4 2000 2004 15-Jan School-DEF
Logic behind generating the data:
First table contains the list of schools and the seats available (along with the priority in which seats will be allocated to students on FCFS basis)
The second table contains data on the list of students enrolled to schools, with their admission date and the start/end year of course.
I am required to populate based on the "Fill Order", the school that is allocated to each student.
After analyzing the problem for a while, I have come to a conclusion that, this might not be achievable using select queries alone. Currently, I am planning to do it using two Cursors for each table and process the records row-by-row. Is there a better way of doing it or is it possible through select statements? TIA
Note:
The database I use is Oracle 10g
I cannot create any temporary tables or alter the data in any of the tables. I strictly have read-only access to the database.
You could use Oracle analytic functions. row_number() over () can assign a number to each student based on their join date. sum() over () can calculate the first and last student for each school. Combining the two you get:
select stud.name
, stud.startyear
, stud.endyear
, stud.joinedon
, schl.name as SchoolName
from (
select name
, coalesce(sum(NoOfStudents) over (order by FillOrder
range between unbounded preceding and 1 preceding),0)+1 FirstStudent
, sum(NoOfStudents) over (order by FillOrder) as LastStudent
from Schools
) schl
join (
select row_number() over (order by JoinedOn) as StudentRank
, Students.*
from Students
) stud
on stud.StudentRank between schl.FirstStudent and schl.LastStudent
order by
stud.name
Live example at SQL Fiddle.
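The logic of the query (running totals of seats give each school a first and last student rank, and each student's rank is matched against those ranges) can be sketched in plain Python to check it against the expected output. The function name allocate and the tuple layouts are made up for illustration, and the join dates are reduced to the day of January.

```python
from itertools import accumulate

# (name, number of seats, fill order), taken from the first table.
schools = [("School-ABC", 1, 1), ("School-DEF", 2, 3), ("School-GHI", 1, 2)]
# (name, day of January on which the student joined), from the second table.
students = [("Student-1", 1), ("Student-2", 3), ("Student-3", 2), ("Student-4", 15)]

def allocate(schools, students):
    """Assign each student, in join order, to the next free seat by fill order."""
    ordered = sorted(schools, key=lambda s: s[2])
    # Running total of seats = LastStudent for each school.
    last_ranks = list(accumulate(seats for _, seats, _ in ordered))
    ranked = sorted(students, key=lambda s: s[1])
    result = {}
    for rank, (student, _) in enumerate(ranked, start=1):  # row_number()
        for (school, _, _), last in zip(ordered, last_ranks):
            if rank <= last:  # first range that contains the rank
                result[student] = school
                break
    return result

allocation = allocate(schools, students)
for student in sorted(allocation):
    print(student, allocation[student])
```

As with the inner join in the query, students beyond the total number of seats simply get no row.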

IDs that have data for particular years

I have a question regarding Oracle SQL.
My data looks like this:
id year
-- ----
1 2000
1 2001
1 2002
1 2003
1 2006
1 2000
2 2001
2 2002
2 2003
3 2003
3 2005
4 2012
4 2013
I want the id's which have the years 2001, 2002, 2003.
My result set:
id
--
1
2
Please help me with this. I actually tried searching this, but couldn't figure a way to search about my particular problem.
SQL
SELECT t.id
FROM years t
WHERE t.year in(2001,2002,2003)
GROUP BY t.id
Sample SqlFiddle
http://sqlfiddle.com/#!2/4ec9f/2/0
Explanation
You want to filter your data set to only show rows with certain years, so that is what you put in the where clause WHERE t.year in(2001,2002,2003).
Since a single id can be in multiple years, your result set would contain duplicates. To remove the duplicates you could GROUP BY the id or use the DISTINCT keyword to return only unique values.
UPDATE
Based on comments, here's a version that will only display id's that have all three years. We use DISTINCT t.YEAR to avoid counting id's that perhaps would have a single year repeated multiple times. The HAVING COUNT(DISTINCT t.YEAR) = 3 part ensures that we only include id's that have all three years.
SELECT t.id
FROM years t
WHERE t.year in(2001,2002,2003)
GROUP BY t.id
HAVING COUNT(DISTINCT t.YEAR) = 3
Updated sqlFiddle, which includes a data set where id of 3 has two rows for 2003 to show off the logic that only counts unique years for an ID.
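In case the fiddle link goes stale, here is a small sketch that runs both versions against the sample data from the question, using Python's sqlite3 module instead of Oracle. The table name years follows the updated query above; everything else is just the question's rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE years (id INT, year INT)")
conn.executemany("INSERT INTO years VALUES (?, ?)", [
    (1, 2000), (1, 2001), (1, 2002), (1, 2003), (1, 2006), (1, 2000),
    (2, 2001), (2, 2002), (2, 2003),
    (3, 2003), (3, 2005),
    (4, 2012), (4, 2013),
])

# First version: ids that have at least one of the three years.
any_year = [row[0] for row in conn.execute("""
    SELECT t.id FROM years t
    WHERE t.year IN (2001, 2002, 2003)
    GROUP BY t.id
    ORDER BY t.id
""")]

# Updated version: ids that have all three years.
all_years = [row[0] for row in conn.execute("""
    SELECT t.id FROM years t
    WHERE t.year IN (2001, 2002, 2003)
    GROUP BY t.id
    HAVING COUNT(DISTINCT t.year) = 3
    ORDER BY t.id
""")]

print(any_year, all_years)
```

The first list still contains id 3 (it has 2003 only), which is why the HAVING clause is needed to get the result set from the question.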
select distinct id
from table
where year in(2001,2002,2003)