Modify left join clause [duplicate] - sql

I am trying to write an Impala query that does the following with the two tables below:
Table A
Date num
01-16 10
02-20 12
03-20 13
Table B contains every day between 02-20 and 03-20, exclusive, i.e.
Date val
02-21 100
02-22 101
02-23 102
. .
. .
03-19 110
Now we want to calculate, for every day between 02-20 and 03-20 (exclusive), the total value using the A.num of 02-20 (the starting date of the period). For example, the total value for 02-21 is 100*12, for 02-22 it is 101*12, and for 03-19 it is 110*12.
I have written the query
SELECT A.Date,A.num*B.val AS total
FROM TableA A
LEFT JOIN Tableb B
ON B.Date >= A.Date
GROUP BY A.Date,A.num,B.val
But it returns two entries for each day. For instance, for 02-22 it returns 101*12 and 101*10, but I only want 101*12. I noticed that this is caused by the join on B.Date >= A.Date: 02-22 is also greater than 01-16, so the query uses the num values of both 01-16 and 02-20 to compute the total value.
Does anyone know how I should modify this join clause so that it uses only the num of 02-20, instead of both 02-20 and 01-16?
EDIT
Sample output
Date total
02-21 1200
02-22 1212
02-23 1224
. .
. .
03-19 1320

This should work. If need be, change the SUM to either MIN or MAX.
SELECT A.`Date`,SUM(A.`num`*B.`val`) AS `total`
FROM `TableA` A
LEFT JOIN `Tableb` B
ON B.`Date` >= A.`Date`
GROUP BY A.`Date`

This produces the results you need. I didn't see a need for GROUP BY, and you said you only wanted results for the date '02-20', so I just added a WHERE and changed the SELECT to grab Table B's date.
SELECT
B.Date,
A.num * B.val AS total
FROM TableA A
LEFT JOIN Tableb B ON B.Date >= A.Date
WHERE A.Date = '02-20'

NOTE : You should avoid using reserved keywords as table/column name.
Well here you go..
SELECT b.`date`,max(a.`num`*b.`val`) AS `total`
FROM test a
LEFT JOIN `test2` b
ON b.`date` >= a.`date`
WHERE b.date is not null
GROUP BY b.`date`;
sqlfiddle : http://www.sqlfiddle.com/#!9/55c0e1/1
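A minimal sketch of another way to keep only the most recent Table A row before each Table B date, so that 02-21 pairs with 02-20 rather than 01-16. It runs on Python's sqlite3 rather than Impala, and the names table_a, table_b and dt are assumptions made for the example. Not every engine accepts a subquery inside ON, so treat the SQL as illustrative only.

```python
import sqlite3


def totals_per_day():
    """Multiply each table_b value by the num of the latest earlier table_a date."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE table_a (dt TEXT, num INTEGER);
        CREATE TABLE table_b (dt TEXT, val INTEGER);
        INSERT INTO table_a VALUES ('01-16', 10), ('02-20', 12), ('03-20', 13);
        INSERT INTO table_b VALUES ('02-21', 100), ('02-22', 101),
                                   ('02-23', 102), ('03-19', 110);
    """)
    # 'MM-DD' strings sort chronologically within a single year,
    # so MAX() picks the latest table_a date before each table_b date.
    rows = con.execute("""
        SELECT b.dt, a.num * b.val AS total
        FROM table_b b
        JOIN table_a a
          ON a.dt = (SELECT MAX(a2.dt) FROM table_a a2 WHERE a2.dt < b.dt)
        ORDER BY b.dt
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for dt, total in totals_per_day():
        print(dt, total)
```

On the sample rows from the question this yields one row per Table B date, each using num = 12.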

Related

BQ/SQL join two tables in a way that one column fills up with all distinct values from the other table while remaining columns get a null

Hello everyone, this is my first question here. I have been browsing through the questions but couldn't quite find the answer to my problem:
I have a couple of tables which I need to join. The key I join on is non-unique (in this case it's a date). This works fine, but now I also need to group the results by another column without getting cross-join-like results. Each value of this column should appear only once in the result, but the column can hold different values in each table.
Here is an example of what I have and what I would like to get:
Table1
Date/Key    Group Column  Example Value1
01-01-2022  a             1
01-01-2022  d             2
01-01-2022  e             3
01-01-2022  f             4
Table 2
Date/Key    Group Column  Example Value 2
01-01-2022  a             1
01-01-2022  b             2
01-01-2022  c             3
01-01-2022  d             4
Wanted Result:
Table Result
Date/Key    Group Column  Example Value1  Example Value2
01-01-2022  a             1               1
01-01-2022  b             NULL            2
01-01-2022  c             NULL            3
01-01-2022  d             2               4
01-01-2022  e             3               NULL
01-01-2022  f             4               NULL
I have tried a couple of approaches, but I always get results where values in the group column appear multiple times. I am under the impression that full joining and then grouping over the group column should work, but apparently I am missing something. I also figured I could brute-force the result by left joining everything with the ON condition set to table1.date = table2.date AND table1.Groupcolumn = table2.Groupcolumn etc., and then doing UNIONs of all permutations (so each table is on "the left" once). This is not only tedious, but BigQuery doesn't like it because it contains too many subqueries.
I feel kind of bad that my first question is something I should actually know, but I hope someone can help me out!
I do not need a full code solution; a hint at the correct approach would suffice. (Also, in case I missed it: if this was already answered, I would appreciate just a link to it!)
Edit:
So one solution I came up with, which appears to work, was to select the group column of each table and union them as a with() and then join this "list" onto the first table like
list as(Select t1.GroupColumn FROM Table_1 t1 WHERE CONDITION1
UNION DISTINCT Select t1.GroupColumn FROM Table_1 t1 WHERE CONDITION2 ... etc.),
result as (
SELECT s.GroupColumn, t1.Example_Value1, t2.Example_Value2
FROM Table_1 t1
LEFT JOIN( SELECT * FROM list) s
ON S.GroupColumn = t1.GroupColumn
LEFT JOIN Table_2 t2
on S.GroupColumn = t2.GroupColumn
and t1.key = t2.key
...
)
SELECT * FROM result
I think what you are looking for is a FULL OUTER JOIN and then you can coalesce the date and group columns. It doesn't exactly look like you need to group anything based on the example data you posted:
SELECT
coalesce(table1.date_key, table2.date_key) AS date_key,
coalesce(table1.group_column, table2.group_column) AS group_column,
table1.example_value_1,
table2.example_value_2
FROM
table1
FULL OUTER JOIN
table2
USING
(date_key,
group_column)
ORDER BY
date_key,
group_column;
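A minimal sketch for checking the wanted result on the sample data with Python's sqlite3. Older SQLite builds lack FULL OUTER JOIN, so the sketch swaps in a different technique: a union of all (date_key, group_column) pairs followed by two left joins, which emulates the full outer join above. The CTE name all_keys and the function name are assumptions made for the example, and this is not the BigQuery query itself.

```python
import sqlite3


def merged_rows():
    """Emulate a full outer join on (date_key, group_column) with key union + left joins."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE table1 (date_key TEXT, group_column TEXT, example_value_1 INTEGER);
        CREATE TABLE table2 (date_key TEXT, group_column TEXT, example_value_2 INTEGER);
        INSERT INTO table1 VALUES ('01-01-2022', 'a', 1), ('01-01-2022', 'd', 2),
                                  ('01-01-2022', 'e', 3), ('01-01-2022', 'f', 4);
        INSERT INTO table2 VALUES ('01-01-2022', 'a', 1), ('01-01-2022', 'b', 2),
                                  ('01-01-2022', 'c', 3), ('01-01-2022', 'd', 4);
    """)
    rows = con.execute("""
        WITH all_keys AS (
            -- UNION (not UNION ALL) removes duplicate key pairs
            SELECT date_key, group_column FROM table1
            UNION
            SELECT date_key, group_column FROM table2
        )
        SELECT k.date_key, k.group_column, t1.example_value_1, t2.example_value_2
        FROM all_keys k
        LEFT JOIN table1 t1
          ON t1.date_key = k.date_key AND t1.group_column = k.group_column
        LEFT JOIN table2 t2
          ON t2.date_key = k.date_key AND t2.group_column = k.group_column
        ORDER BY k.date_key, k.group_column
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in merged_rows():
        print(row)
```

Each group value appears once because the key list is deduplicated before either table is joined to it.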
Consider the simple approach below:
select * from (
select *, 'example_value1' type from table1 union all
select *, 'example_value2' type from table2
)
pivot (
any_value(example_value1)
for type in ('example_value1', 'example_value2')
)
If applied to the sample data in your question, this should return the wanted result table.

How do you join a table with a different WHERE condition after you already used a join

Hi, I have two tables, employees and medical leaves, related through the employee ID. Basically, I want a result set with one column that counts leaves filtered by month and year, and another column that counts leaves filtered by year only.
EMPLOYEES          MEDICAL
|employee|ID|      |ID|DateOfLeave|
A         1        1   2019/1/3
B         2        1   2019/4/15
C         3        2   2019/5/16
D         4
select employees.employee,Employees.ID,count(medical.dateofleave) as
NumberofLeaves
from employees
left outer join Medical on employees.emp = MedBillInfo.emp
and month(medbillinfo.date) in(1) and year(medbillinfo.date) in (2019)
group by Employees.employee,employees.ID
RESULT SET
|Employee|ID|NumberOfLeaves|YearlyLeaves|--i want to join this column
A 1 1 2
B 2 0 1
C 3 0 0
D 4 0 0
But I have no idea how to extend the current SQL statement to add a yearly leaves column to my current result set, which only has employee, ID and NumberOfLeaves.
I think you want conditional aggregation:
select e.employee, e.ID,
count(*) as num_leaves,
sum(case when month(m.date) = 1 then 1 else 0 end) as num_leaves_in_month_1
from employees e left join
Medical m
on e.emp = m.emp
where m.date >= '2019-01-01' and m.date < '2020-01-01'
group by e.employee, e.ID;
Notes:
This removes the where clause which seems to refer to a non-existent table alias.
The date arithmetic uses direct comparisons rather than functions.
This introduces table aliases so the query is easier to write and to read.
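A minimal sketch of conditional aggregation on the question's sample data, using Python's sqlite3. It is a variant, not the answer's exact query: the year filter sits in the ON clause and the yearly count uses COUNT on a Medical column, so employees without leaves stay in the result with 0. SQLite has no MONTH() function, so strftime is used instead; the column name date_of_leave and ISO-formatted dates are assumptions.

```python
import sqlite3


def leave_counts():
    """Per employee: leaves in January 2019 and leaves in all of 2019."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE employees (employee TEXT, id INTEGER);
        CREATE TABLE medical (id INTEGER, date_of_leave TEXT);
        INSERT INTO employees VALUES ('A', 1), ('B', 2), ('C', 3), ('D', 4);
        INSERT INTO medical VALUES (1, '2019-01-03'), (1, '2019-04-15'),
                                   (2, '2019-05-16');
    """)
    rows = con.execute("""
        SELECT e.employee, e.id,
               -- only January rows contribute 1; unmatched employees contribute 0
               SUM(CASE WHEN strftime('%m', m.date_of_leave) = '01'
                        THEN 1 ELSE 0 END) AS number_of_leaves,
               -- COUNT(column) ignores the NULLs produced by the left join
               COUNT(m.date_of_leave) AS yearly_leaves
        FROM employees e
        LEFT JOIN medical m
          ON m.id = e.id
         AND m.date_of_leave >= '2019-01-01'
         AND m.date_of_leave < '2020-01-01'
        GROUP BY e.employee, e.id
        ORDER BY e.id
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in leave_counts():
        print(row)
```

Keeping the date range in ON rather than WHERE is the design choice that preserves employees C and D in the output.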
Your question probably needs to be corrected as the group by condition does not match with select columns. But based on what you asked, I think you need to use truncate date function in order to group the leaves by year. For SQL Server, there is YEAR(date) function which returns the year of the given date. This date would be MEDICAL.DateOfLeave in your case.

Adding in missing dates from results in SQL

I have a database that currently looks like this
Date     | valid_entry | profile
1/6/2015 | 1           | 1
3/6/2015 | 2           | 1
3/6/2015 | 2           | 2
5/6/2015 | 4           | 4
I am trying to grab the dates, but I need the query to also display dates that do not exist in the list, such as 2/6/2015.
This is a sample of what i need it to be:
Date | valid_entry
1/6/2015 1
2/6/2015 0
3/6/2015 2
3/6/2015 2
4/6/2015 0
5/6/2015 4
My query:
select date, count(valid_entry)
from database
where profile = 1
group by 1;
This query only displays the dates that exist in the table. Is there a way to populate the results with the dates that do not exist there?
You can generate a list of all dates that are between the start and end date from your source table using generate_series(). These dates can then be used in an outer join to sum the values for all dates.
with all_dates (date) as (
select dt::date
from generate_series( (select min(date) from some_table), (select max(date) from some_table), interval '1' day) as x(dt)
)
select ad.date, sum(coalesce(st.valid_entry,0))
from all_dates ad
left join some_table st on ad.date = st.date
group by ad.date, st.profile
order by ad.date;
some_table is your table with the sample data you have provided.
Based on your sample output, you also seem to want group by date and profile, otherwise there can't be two rows with 2015-06-03. You also don't seem to want where profile = 1 because that as well wouldn't generate two rows with 2015-06-03 as shown in your sample output.
SQLFiddle example: http://sqlfiddle.com/#!15/b0b2a/2
Unrelated, but: I hope that the column names are only made up. date is a horrible name for a column. For one because it is also a keyword, but more importantly it does not document what this date is for. A start date? An end date? A due date? A modification date?
You have to use a calendar table for this purpose. In this case you can create an in-line table with the dates required, then LEFT JOIN your table to it:
select t.d, count(db.valid_entry)
from (
SELECT '2015-06-01' AS d UNION ALL SELECT '2015-06-02' UNION ALL SELECT '2015-06-03' UNION ALL
SELECT '2015-06-04' UNION ALL SELECT '2015-06-05' UNION ALL SELECT '2015-06-06') AS t
left join database AS db on t.d = db."date" and db.profile = 1
group by t.d;
Note: Predicate profile = 1 should be applied in the ON clause of the LEFT JOIN operation. If it is placed in the WHERE clause instead then LEFT JOIN essentially becomes an INNER JOIN.
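A minimal sketch of the ON-versus-WHERE point, using Python's sqlite3. SQLite has no generate_series() by default, so the calendar is built with a recursive CTE instead; the table name entries, the column entry_date and ISO-formatted dates are assumptions made for the example.

```python
import sqlite3

CALENDAR = """
    WITH RECURSIVE all_dates(d) AS (
        SELECT (SELECT MIN(entry_date) FROM entries)
        UNION ALL
        SELECT date(d, '+1 day') FROM all_dates
        WHERE d < (SELECT MAX(entry_date) FROM entries)
    )
"""


def entries_per_day(filter_in_on):
    """Sum valid_entry per calendar day for profile 1, filtering in ON or in WHERE."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE entries (entry_date TEXT, valid_entry INTEGER, profile INTEGER);
        INSERT INTO entries VALUES ('2015-06-01', 1, 1), ('2015-06-03', 2, 1),
                                   ('2015-06-03', 2, 2), ('2015-06-05', 4, 4);
    """)
    if filter_in_on:
        # Filter is part of the join: unmatched calendar days survive with NULLs.
        join = "LEFT JOIN entries e ON e.entry_date = ad.d AND e.profile = 1"
    else:
        # Filter runs after the join: rows with NULL profile are thrown away.
        join = "LEFT JOIN entries e ON e.entry_date = ad.d WHERE e.profile = 1"
    rows = con.execute(
        CALENDAR
        + "SELECT ad.d, COALESCE(SUM(e.valid_entry), 0) FROM all_dates ad "
        + join
        + " GROUP BY ad.d ORDER BY ad.d"
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(entries_per_day(filter_in_on=True))
    print(entries_per_day(filter_in_on=False))
```

With the filter in ON, every calendar day appears and the gaps show 0; with the filter in WHERE, only the days that have a profile 1 row remain.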

Count number of repeats in SQL

I tried to solve a problem, but without success.
I have two lists of numbers
{1,2,3,4}
{5,6,7,8,9}
And I have table
ID Number
1 1
1 2
1 7
1 2
1 6
2 8
2 7
2 3
2 9
Now I need to count how many times a number from the second list comes after a number from the first list, but I should count only one match per ID.
In the example table above the result should be 2:
there are three matched pairs, but because we have only two different IDs the result is 2 instead of 3.
Pairs:
1 2
1 7
1 2
1 6
2 3
2 9
Note: I work with MSSQL.
Edit: There is one more column, Date, which determines the order.
Edit 2 - Solution
I wrote this query:
SELECT * FROM table t
left JOIN table tt ON tt.ID = t.ID
AND tt.Date > t.Date
AND t.Number IN (1,2,3,4)
AND tt.Number IN (6,7,8,9)
After this I planned to group by ID and use only one match for each ID, but execution takes a lot of time.
Here is a query that would do it:
select a.id, min(a.number) as a, min(b.number) as b
from mytable a
inner join mytable b
on a.id = b.id
and a.date < b.date
and b.number in (5,6,7,8,9)
where a.number in (1,2,3,4)
group by a.id
Output is:
id a b
1 1 6
2 3 9
So the two pairs are output each on one line, with the value a belonging to the first group of numbers, and the value of column b to the second group.
Here is a fiddle
Comments on attempt (edit 2 to question)
Later you added a query attempt to your question. Some comments about that attempt:
You don't need a left join because you really want to have a match for both values. inner join has in general better performance, so use that.
The condition t.Number IN (1,2,3,4) does not belong in the on clause. In combination with a left join the result will include t records that violate this condition. It should be put in the where clause.
Your concern about performance may be warranted, but can be resolved by adding a useful index on your table, i.e. on (id, number, date) or (id, date, number)
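A minimal sketch that runs the self-join from this answer on the question's sample rows with Python's sqlite3 and counts the matching IDs. The question does not show the Date values, so an integer column ts standing in for the ordering column is an assumption, as is the index name.

```python
import sqlite3


def matched_ids():
    """Return one (id, a, b) row per ID that has a first-list number before a second-list number."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE mytable (id INTEGER, number INTEGER, ts INTEGER);
        INSERT INTO mytable VALUES
            (1, 1, 1), (1, 2, 2), (1, 7, 3), (1, 2, 4), (1, 6, 5),
            (2, 8, 6), (2, 7, 7), (2, 3, 8), (2, 9, 9);
        -- one of the index layouts suggested above
        CREATE INDEX idx_mytable_id_ts_number ON mytable (id, ts, number);
    """)
    rows = con.execute("""
        SELECT a.id, MIN(a.number) AS a, MIN(b.number) AS b
        FROM mytable a
        INNER JOIN mytable b
           ON a.id = b.id
          AND a.ts < b.ts
          AND b.number IN (5, 6, 7, 8, 9)
        WHERE a.number IN (1, 2, 3, 4)
        GROUP BY a.id
        ORDER BY a.id
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    rows = matched_ids()
    print(rows)
    print("count:", len(rows))
```

The number of returned rows is the count the question asks for, since GROUP BY collapses all pairs of one ID into a single row.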

Joining multiple tables containing historical data

I have multiple tables containing historical data, so there is not a 1-to-1 relation between ids.
I have to join on id and on the timestamps indicating when the data was active. TO_TIMESTMP can be null if the data is still active or if it was never set for old data.
My main table after some grouping outputs something like this:
TABLE_A
AID USER_ID AMOUNT FROM_TIMESTMP TO_TIMESTMP
1 1 2 11/21/2012 00:00:00 12/04/2012 11:59:00
1 2 3 11/24/2012 12:00:00 null
2 1 2 11/21/2012 01:00:00 null
then i have another table that i use to link further
TABLE_B
AID CID FROM_TIMESTMP TO_TIMESTMP HIST_ID
1 3 11/01/2012 00:00:00 null 1
1 3 11/21/2012 00:00:00 12/04/2012 11:59:00 2
1 3 11/24/2012 12:00:00 null 3
2 4 11/21/2012 00:59:59 null 4
and my 3rd table looks something like this:
TABLE_C
CID VALUE FROM_TIMESTMP TO_TIMESTMP HIST_ID
3 A 11/01/2012 00:00:00 null 1
3 B 11/21/2012 00:00:00 11/24/2012 11:59:00 2
3 C 11/24/2012 12:00:00 null 3
4 D 11/21/2012 01:00:01 null 4
My expected output if I want to combine table A with Value of from Table C through Table B is:
AID USER_ID AMOUNT FROM_TIMESTMP TO_TIMESTMP VALUE
1 1 2 11/21/2012 00:00:00 12/04/2012 11:59:00 B
1 2 3 11/24/2012 12:00:00 null C
2 1 2 11/21/2012 01:00:00 null D
There are indexes on everything except AMOUNT in Table A and VALUE in Table C, and I use the following SQL to pull out the data.
SELECT a.AID, a.USER_ID, a.AMOUNT, a.FROM_TIMESTMP, a.TO_TIMESTMP, c.VALUE from
(SELECT AID, USER_ID, SUM(AMOUNT) AS AMOUNT, FROM_TIMESTMP, TO_TIMESTMP from TABLE_A GROUP BY AID, USER_ID, FROM_TIMESTMP, TO_TIMESTMP) a
inner join TABLE_B b on b.HIST_ID in (select max(HIST_ID) from TABLE_B
where AID = a.AID and FROM_TIMESTMP <= a.FROM_TIMESTMP+1/2880 and (TO_TIMESTMP>= a.FROM_TIMESTMP or TO_TIMESTMP is null))
inner join TABLE_C c on c.HIST_ID in (select max(HIST_ID) from TABLE_C
where CID = b.CID and FROM_TIMESTMP <= a.FROM_TIMESTMP+1/2880 and (TO_TIMESTMP>= a.FROM_TIMESTMP or TO_TIMESTMP is null));
Due to some inconsistencies in when data is saved, I have added a 30-second grace period when comparing starting timestamps, in case they were created around the same time. Is there a way to improve the way I do this?
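The grace-period arithmetic can be checked with a short Python sketch. It assumes date arithmetic where adding 1 to a date means adding one day, so 1/2880 of a day is 86400 / 2880 = 30 seconds; the helper name within_grace is hypothetical.

```python
from datetime import datetime, timedelta

# One day split into 2880 parts: 86400 s / 2880 = 30 s.
GRACE = timedelta(days=1) / 2880


def within_grace(candidate_from, reference_from):
    """Mirror of: candidate.FROM_TIMESTMP <= reference.FROM_TIMESTMP + 1/2880."""
    return candidate_from <= reference_from + GRACE


if __name__ == "__main__":
    a_from = datetime(2012, 11, 21, 1, 0, 0)   # TABLE_A row with AID 2
    c_from = datetime(2012, 11, 21, 1, 0, 1)   # TABLE_C row with CID 4, one second later
    print(GRACE, within_grace(c_from, a_from))
```

This is why the TABLE_C row with value D, which starts one second after the TABLE_A row, still matches.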
I select the one with MAX(HIST_ID) so cases like AID=1 and USER_ID=2 in TABLE_A only get the newest row that matches id/timestamp from other tables.
In my real data I inner join 4 tables like this (instead of just 2), and it works well on my local test data (pulling just over 42000 lines in 11 sec when asking for all data).
But when I run it on the test environment, where the data volume is closer to production, it runs too slowly, even when I limit the first table to about 6000 lines by requiring FROM_TIMESTMP to be between 2 dates.
Is there a way to improve the performance of my joining of tables by doing it another way?
one simple change to avoid the max() repeated sub queries is:
select a.aid,a.user_id,a.amount,a.from_timestmp,a.to_timestmp,a.value
from (select a.aid,a.user_id,a.amount,a.from_timestmp,a.to_timestmp,c.value,
row_number() over (partition by a.aid,a.user_id order by b.hist_id desc, c.hist_id desc) rn
from (select aid,user_id,sum(amount) amount,from_timestmp,to_timestmp
from table_a
group by aid,user_id,from_timestmp,to_timestmp) a
inner join table_b b
on b.aid = a.aid
and b.from_timestmp <= a.from_timestmp + (1 / 2880)
and ( b.to_timestmp >= a.from_timestmp or b.to_timestmp is null)
inner join table_c c
on c.cid = b.cid
and c.from_timestmp <= a.from_timestmp + (1 / 2880)
and ( c.to_timestmp >= a.from_timestmp or c.to_timestmp is null)) a
where rn = 1
order by a.aid, a.user_id;
There could be many reasons why your query runs faster on one environment and slower on another. Most probably it's because the optimizer has defined two distinct plans and one runs faster. Probably because the statistics are slightly different.
You can certainly optimize your query to use your indexes, but I think your main problem lies with the data and/or data model. And with bad data you'll run into these kinds of problems again and again.
It's pretty common to archive data into the same table; it can be useful to represent transient data that needs to be queried historically. However, having archived data should not make you forget essential rules about database design.
In your case it seems you have three related tables: they would be linked in your entity-relationship model. However, somewhere along the design process they lost this link, so now you can't reliably identify which row is related to which one.
I suggest the following:
If two tables are related in your ER model, add a foreign key. This will ensure that you can always join them if you need to. Foreign keys only add a small cost in DML operations (and only INSERT, DELETE and update to the primary key (?!)). If your data is inserted once and queried many times, the performance impact is negligible.
In your case if (AID, FROM_TIMESTAMP) is your primary key in TABLE_A, then have the same columns in TABLE_B reference TABLE_A's primary key columns. You may need FROM_TIMESTAMP_A and FROM_TIMESTAMP_C if A and C (which seem unrelated) have distinct updating scheme.
If you don't follow this logic, you will have to build your queries differently. If A, B and C are each historically archived yet not fully referenced, you will only be able to answer questions with a single point-in-time reference, questions such as "What was the status of the DB at time TS":
SELECT *
FROM A
JOIN B on A.aid = B.aid
JOIN C on C.cid = B.cid
WHERE a.timestamp_from <= :TS
AND nvl(a.timestamp_to, DATE '9999-12-31') > :TS
AND b.timestamp_from <= :TS
AND nvl(b.timestamp_to, DATE '9999-12-31') > :TS
AND c.timestamp_from <= :TS
AND nvl(c.timestamp_to, DATE '9999-12-31') > :TS
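A minimal sketch of the point-in-time query above, run with Python's sqlite3. The three small tables and their rows are hypothetical (cleanly versioned, non-overlapping history, not the question's data), and COALESCE replaces nvl because SQLite does not provide nvl.

```python
import sqlite3


def status_at(ts):
    """Return (aid, amount, value) rows that were active at the ISO date string ts."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE a (aid INTEGER, amount INTEGER, timestamp_from TEXT, timestamp_to TEXT);
        CREATE TABLE b (aid INTEGER, cid INTEGER, timestamp_from TEXT, timestamp_to TEXT);
        CREATE TABLE c (cid INTEGER, value TEXT, timestamp_from TEXT, timestamp_to TEXT);
        -- hypothetical history: each version ends exactly where the next begins
        INSERT INTO a VALUES (1, 2, '2012-11-01', '2012-11-21'),
                             (1, 3, '2012-11-21', NULL);
        INSERT INTO b VALUES (1, 3, '2012-11-01', NULL);
        INSERT INTO c VALUES (3, 'A', '2012-11-01', '2012-11-24'),
                             (3, 'C', '2012-11-24', NULL);
    """)
    # ISO date strings compare chronologically; a NULL end means "still active".
    rows = con.execute("""
        SELECT a.aid, a.amount, c.value
        FROM a
        JOIN b ON a.aid = b.aid
        JOIN c ON c.cid = b.cid
        WHERE a.timestamp_from <= :ts
          AND COALESCE(a.timestamp_to, '9999-12-31') > :ts
          AND b.timestamp_from <= :ts
          AND COALESCE(b.timestamp_to, '9999-12-31') > :ts
          AND c.timestamp_from <= :ts
          AND COALESCE(c.timestamp_to, '9999-12-31') > :ts
    """, {"ts": ts}).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for ts in ("2012-11-10", "2012-11-22", "2012-11-25"):
        print(ts, status_at(ts))
```

Because each table is filtered by the same instant, exactly one version per table survives when the history does not overlap.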