PostgreSQL: Joining Time Series With Missing Data

I am trying to outer-join multiple time series tables in PostgreSQL on multiple conditions, which include the date column and several other identifier columns.
However, the tables do not have continuous time series, i.e. some dates are missing for some of the join conditions. Furthermore, I don't want "duplicate" table-specific columns to be added for a row when there is no match.
I have tried COALESCE() on the dates to fill in missing dates, which works fine. However, it is the subsequent joins that are causing me problems. I also can't assume that any one of the tables will have rows for all the dates required.
I thought perhaps to use generate_series() for a date range, with empty columns (if that is possible), and then join all the tables onto that?
Please see example below:
I want to join Table A and Table B on the columns date, identifier_1 and identifier_2 as an outer join. However, where a value is not matched I do not want new columns to be added, e.g. table_b_identifier_1.
Table A
id1 and id2 are missing rows on the 03/07 and 04/07, and id1 is also missing a row for the 05/07.
Table B
id2 is missing a row on the 02/07
Desired Output:
Essentially it is a conditional join: if there is a row in both tables for a given date, identifier_1 and identifier_2, merge them into one row; otherwise keep the unmatched row with NULLs in the other table's value column.

It's not clear what went wrong with your attempt to use COALESCE() to fill the column data, but it works as intended in a query like this:
SELECT
COALESCE(a.date, b.date) AS date,
COALESCE(a.identifier_1, b.identifier_1) AS identifier_1,
COALESCE(a.identifier_2, b.identifier_2) AS identifier_2,
a.value_a,
b.value_b
FROM table_a a
FULL JOIN table_b b ON a.date = b.date
AND a.identifier_1 = b.identifier_1
AND a.identifier_2 = b.identifier_2
Please check a demo.
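If you do want a continuous calendar, as you suggested with generate_series(), you can build a date spine, cross it with the distinct identifier pairs, and left-join each table onto that. This is only a sketch: the date range and the exact table/column names are assumptions based on the question.

```sql
-- Sketch (assumed names/dates): a generate_series() date spine crossed with
-- every (identifier_1, identifier_2) pair seen in either table, then each
-- table is left-joined on, so every date/identifier combination gets a row.
WITH calendar AS (
  SELECT d::date AS date
  FROM generate_series(DATE '2023-07-01', DATE '2023-07-05',
                       INTERVAL '1 day') AS g(d)
),
ids AS (
  SELECT identifier_1, identifier_2 FROM table_a
  UNION
  SELECT identifier_1, identifier_2 FROM table_b
)
SELECT c.date, i.identifier_1, i.identifier_2, a.value_a, b.value_b
FROM calendar c
CROSS JOIN ids i
LEFT JOIN table_a a USING (date, identifier_1, identifier_2)
LEFT JOIN table_b b USING (date, identifier_1, identifier_2)
ORDER BY c.date, i.identifier_1, i.identifier_2;
```

Unlike the FULL JOIN above, this also emits rows for dates where neither table has data, which may or may not be what you want.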

Related

Why is Big Query creating a new column instead of joining two columns when using a Join?

When I use a Join in BigQuery, it completes but creates new columns named Id_1 and Date_1 with the same information as the primary key columns. What could cause this? Here is the code:
SELECT
*
FROM
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`
JOIN
`bellabeat-case-study-373821.bellabeat_case_study.sleep_day`
ON
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`.Id = `bellabeat-case-study-373821.bellabeat_case_study.sleep_day`.Id
AND `bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`.Date = `bellabeat-case-study-373821.bellabeat_case_study.sleep_day`.Date
I made the query and expected the tables to join by the Primary keys of Id and Date, but instead this created two new columns with the same information.
When you use * in the select list, the ON variant of a JOIN clause produces all columns from both tables in the result set. If there are columns with the same name on both sides, then both will show up in the result (with slightly different names), as you can see.
You can use the USING variant of the JOIN clause instead, that merges the columns and produces only one resulting column for each column mentioned in the USING clause. This is probably what you want. See BigQuery - INNER JOIN.
Your query could take the form:
SELECT
*
FROM
`bellabeat-case-study-373821.bellabeat_case_study.daily_Activity`
JOIN
`bellabeat-case-study-373821.bellabeat_case_study.sleep_day`
USING (Id, Date)
Note: USING can only be used when the columns you want to join with have the exact same name. It won't be possible to use it if a column is, for example, called id in one table and employee_id in the other one.

Concatenate ALL values from 2 tables using SQL?

I am trying to use SQL to create a table that concatenates all dates from a specific range to all items in another table. See image for an example.
I have a solution where I can create a column of "null" values in both tables and join on that column but wondering if there is a more sophisticated approach to doing this.
Example image
I've tried the following:
Added a constant value to each table
Then I joined the 2 tables on that constant value so that each row matched each row of both tables.
This got the intended result but I'm wondering if there's a better way to do this where I don't have to add the constant values:
SELECT c.Date_,k.user_email
FROM `operations-div-qa.7_dbtCloud.calendar_table_hours_st` c
JOIN `operations-div-qa.7_dbtCloud.table_key` k
ON c.match = k.match
ORDER BY Date_,user_email asc
It's not a concatenation in the image given, it's a join:
SELECT t1.dates AS Date, t2.name AS Person
FROM table_1 t1, table_2 t2;
Cross join should work for you:
It joins every row from both tables with each other. Use this when there is no relationship between the tables.
Did not test so syntax may be slightly off.
SELECT c.Date_,k.user_email
FROM `operations-div-qa.7_dbtCloud.calendar_table_hours_st` c
CROSS JOIN `operations-div-qa.7_dbtCloud.table_key` k
ORDER BY Date_,user_email asc

Best way to combine two tables, remove duplicates, but keep all other non-duplicate values in SQL

I am looking for the best way to combine two tables in a way that removes duplicate records based on email, with a priority of replacing any duplicates with the values in "Table 2". I have considered a full outer join and UNION ALL, but UNION ALL will be too large as each table has several thousand columns. I want to create this combination table as my full reference table and save it as a view, so I can reference it without always adding a union or something to that effect in my already complex statements. From my understanding, a full outer join will not necessarily remove duplicates. I want to:
a. Create table with ALL columns from both tables (fields that don't apply to records in one table will just have null values)
b. Remove duplicate records from this master table based on email field but only remove the table 1 records and keep the table 2 duplicates as they have the information that I want
c. A left-join will not work as both tables have unique records that I want to retain and I would like all 1000+ columns to be retained from each table
I don't know how feasible this even is but thank you so much for any answers!
If I understand your question correctly you want to join two large tables with thousands of columns that (hopefully) are the same between the two tables using the email column as the join condition and replacing duplicate records between the two tables with the records from Table 2.
I had to do something similar a few days ago so maybe you can modify my query for your purposes:
WITH only_in_table_1 AS(
SELECT *
FROM table_1 A
WHERE NOT EXISTS
(SELECT * FROM table_2 B WHERE B.email_field = A.email_field))
SELECT * FROM table_2
UNION ALL
SELECT * FROM only_in_table_1
If the columns/fields aren't the same between tables you can use a full outer join on only_in_table_1 and table_2
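That full-outer-join variant might look like the following. The column names other than email_field are made up purely for illustration.

```sql
-- Sketch: when the two tables' column sets differ, a FULL OUTER JOIN pads the
-- missing side with NULLs. "legacy_score" and "signup_source" are hypothetical
-- columns standing in for fields that exist in only one of the tables.
WITH only_in_table_1 AS (
  SELECT *
  FROM table_1 A
  WHERE NOT EXISTS
    (SELECT 1 FROM table_2 B WHERE B.email_field = A.email_field))
SELECT
  COALESCE(B.email_field, A.email_field) AS email_field,
  A.legacy_score,    -- only in table_1
  B.signup_source    -- only in table_2
FROM only_in_table_1 A
FULL OUTER JOIN table_2 B ON B.email_field = A.email_field;
```

Because only_in_table_1 excludes every email present in table_2, the join never actually matches; it simply stacks the two sides while keeping each table's own columns.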
Try using a FULL OUTER JOIN between the two tables, and then a COALESCE() function on each result-set column to determine from which table/column the result-set column is populated.
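That suggestion could be sketched as follows; listing table_2's column first in each COALESCE() gives its values priority, per the question. The column names here are assumptions.

```sql
-- Sketch (assumed column names): FULL OUTER JOIN on email, preferring
-- table_2's value for every shared column.
SELECT
  COALESCE(b.email_field, a.email_field) AS email_field,
  COALESCE(b.first_name, a.first_name)   AS first_name,
  COALESCE(b.last_name, a.last_name)     AS last_name
  -- ...repeat COALESCE(b.col, a.col) for each shared column...
FROM table_1 a
FULL OUTER JOIN table_2 b ON b.email_field = a.email_field;
```

The drawback with thousands of columns is that every one needs its own COALESCE() line, so you would likely generate the select list from the information schema rather than write it by hand.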

SQL: Combine two tables, into same columns, based on relation table

Not being well versed in complex SQL, I am trying to figure out how I can write a query that will return (almost) the same columns from two tables, based on a "relationship" table. I have tried using UNION, but the number of columns is different between the three tables. I also tried IF...ELSE, but could not get that to work. I have also looked at INCLUDE and EXCLUDE.
Here is my current query:
SELECT
/* Relation Table */
[data_Related_Asset].[ID_Related_Asset]
,[data_Related_Asset].[BIOMED_Tag]
,[data_Related_Asset].[Related_BIOMED_Tag]
/* Lab Table */
,[data_Lab_Asset].[Room]
,[Lab_Area].[Work_Area]
,[data_Lab_Asset].[Pet_Name_Bench]
,[data_Lab_Asset].[BGL_ID]
,[data_Lab_Asset].[BIOMED_Tag] AS LAB_BIOMED
,[data_Lab_Asset].[Endpoint_Tag]
,[Lab_Class].[Class]
,[Lab_Class].[Subclass]
,[Lab_Class].[Subcategory]
/* IT Table */
,[data_IT_Asset].[Room]
,[IT_Area].[Work_Area]
,[data_IT_Asset].[Bench_Instrument]
,[data_IT_Asset].[BIOMED_Tag] AS IT_BIOMED
,[data_IT_Asset].[Endpoint_Tag]
,[IT_Class].[Class]
,[IT_Class].[Subclass]
,[IT_Class].[Subcategory]
FROM [data_Related_Asset]
LEFT JOIN [data_Lab_Asset] ON [data_Lab_Asset].[BIOMED_Tag] = [data_Related_Asset].[Related_BIOMED_Tag]
LEFT JOIN [data_IT_Asset] ON [data_IT_Asset].[BIOMED_Tag] = [data_Related_Asset].[Related_BIOMED_Tag]
LEFT JOIN [tbl_Class] Lab_Class ON [Lab_Class].[ID_Class] = [data_Lab_Asset].[Class_ID]
LEFT JOIN [tbl_Class] IT_Class ON [IT_Class].[ID_Class] = [data_IT_Asset].[Class_ID]
LEFT JOIN [tbl_Work_Area] Lab_Area ON [Lab_Area].[ID_Work_Area] = [data_Lab_Asset].[Work_Area_ID]
LEFT JOIN [tbl_Work_Area] IT_Area ON [IT_Area].[ID_Work_Area] = [data_IT_Asset].[Work_Area_ID]
ORDER BY ID_Related_Asset
The query is being used in a custom app and is set up to search for an "ID" in the [data_Related_Asset].[BIOMED_Tag] column, and return all [Related_BIOMED_Tag] records.
When I run the above query I get all the results I need, but across a lot of columns. If the item being returned is in the LAB table, then the LAB_Asset columns are populated but the IT_Asset columns are all NULL. And if the item is in the IT table, the opposite is true: the LAB_Asset columns are all NULL and the IT_Asset columns are populated. For example, below you can see where rows 2 & 12 returned the IT_Asset information.
I'd like to be able to return everything in the same set of NINE columns to condense the viewed table. (Room, Work_Area, Bench, BGL_ID, BIOMED_Tag, Endpoint_Tag, Class, Subclass, Subcategory) For example, below you can see where I moved the info from the IT_Asset table over to the first columns.
I'm sure I'm missing a simple solution/function here. Any help is greatly appreciated!
You can use UNION, but you just have to ensure that you have the same columns in the same order in each statement being unioned.
So for missing columns just use nulls (or any suitable dummy data) e.g.
SELECT col1, col2, null, col4
from tableA
UNION
SELECT col1, null, col3, null
from tableB
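An alternative to padding a UNION with NULLs, given the query in the question: since each result row populates either the LAB columns or the IT columns but never both, COALESCE() can collapse each pair into a single output column. A sketch reusing the names from the question (the Bench pairing of Pet_Name_Bench with Bench_Instrument is an assumption):

```sql
-- Sketch: collapse the LAB/IT column pairs into one set of nine columns.
-- BGL_ID has no IT counterpart in the question, so it is left as-is.
SELECT
  r.ID_Related_Asset,
  r.BIOMED_Tag,
  r.Related_BIOMED_Tag,
  COALESCE(lab.Room, it.Room)                          AS Room,
  COALESCE(la.Work_Area, ia.Work_Area)                 AS Work_Area,
  COALESCE(lab.Pet_Name_Bench, it.Bench_Instrument)    AS Bench,
  lab.BGL_ID,
  COALESCE(lab.BIOMED_Tag, it.BIOMED_Tag)              AS Asset_BIOMED_Tag,
  COALESCE(lab.Endpoint_Tag, it.Endpoint_Tag)          AS Endpoint_Tag,
  COALESCE(lc.Class, ic.Class)                         AS Class,
  COALESCE(lc.Subclass, ic.Subclass)                   AS Subclass,
  COALESCE(lc.Subcategory, ic.Subcategory)             AS Subcategory
FROM data_Related_Asset r
LEFT JOIN data_Lab_Asset lab ON lab.BIOMED_Tag = r.Related_BIOMED_Tag
LEFT JOIN data_IT_Asset  it  ON it.BIOMED_Tag  = r.Related_BIOMED_Tag
LEFT JOIN tbl_Class     lc ON lc.ID_Class     = lab.Class_ID
LEFT JOIN tbl_Class     ic ON ic.ID_Class     = it.Class_ID
LEFT JOIN tbl_Work_Area la ON la.ID_Work_Area = lab.Work_Area_ID
LEFT JOIN tbl_Work_Area ia ON ia.ID_Work_Area = it.Work_Area_ID
ORDER BY r.ID_Related_Asset;
```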

Create new table by merging two existing tables based on matching field

I am attempting to create a new table using columns from two existing tables and it's not behaving the way I expected.
Table A has 91255063 records and table B has 2372294 records. Both tables have a common field named link_id. Link_id is not unique in either table and will not always exist in table B.
The end result I am looking for is a new table with 91255063 records, essentially all of Table A with any additional data from table B for the records with matching link_id's. I had thought outer join would accomplish this as follows:
use database1
SELECT a.*
,b.[AdditionalData1]
,b.[AdditionalData2]
,b.[AdditionalData3]
into dbo.COMBINEDTABLE
FROM Table1 a
left outer join Table2 b
ON a.LINK_ID = b.LINK_ID
This seems to work when looking at the resulting data; however, the newly created table COMBINEDTABLE now has 98011015 rows. Am I not using the correct join method here?
Most likely you have duplicate LINK_IDs on the right side, so for quite a few rows from Table1 there are multiple matching rows in Table2. You could try using DISTINCT in your SELECT, or specify that you want only the record with the smallest or highest identifier column value (if you have one).
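A quick way to confirm the diagnosis, plus a sketch of the "keep one row per LINK_ID" approach using ROW_NUMBER(); the ordering column some_id is hypothetical, standing in for whatever identifier column Table2 actually has:

```sql
-- Find LINK_IDs that occur more than once in Table2;
-- each of these multiplies the matching Table1 rows in the join.
SELECT LINK_ID, COUNT(*) AS cnt
FROM Table2
GROUP BY LINK_ID
HAVING COUNT(*) > 1;

-- Sketch: keep one deterministic row per LINK_ID before joining,
-- so the result has exactly one row per Table1 row.
SELECT a.*, b.AdditionalData1, b.AdditionalData2, b.AdditionalData3
INTO dbo.COMBINEDTABLE
FROM Table1 a
LEFT JOIN (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY LINK_ID ORDER BY some_id) AS rn
  FROM Table2
) b ON b.LINK_ID = a.LINK_ID AND b.rn = 1;
```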