Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1 | 2019-07-31 |
| 1 | 2019-06-30 |
| 2 | 2019-10-31 |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1 | 2019-07-29 | 1 |
| 1 | 2019-07-30 | 1 |
| 1 | 2019-08-01 | 2 |
| 1 | 2019-08-02 | 2 |
| 1 | 2019-08-03 | 2 |
| 1 | 2019-06-29 | 0 |
| 1 | 2019-06-30 | 0 |
| 2 | 2019-10-30 | 5 |
| 2 | 2019-10-31 | NULL |
| 2 | 2019-11-01 | 6 |
| 2 | 2019-11-02 | 6 |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). I want to keep only the record with the minimum daily_date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but OUTER APPLY doesn't seem to be an option in Spark SQL. I'm wondering whether there's a similarly computationally efficient method that works in Spark.
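One approach in Spark SQL is a range join plus a window function: join daily_df to month_df on the ID and the date window, drop null new_status rows inside the join, then keep the earliest surviving daily row per month with ROW_NUMBER(). A sketch, assuming the table and column names above and the 5-day buffer from the question:
SELECT ID, month_end_date, daily_date, new_status
FROM (
    SELECT m.ID,
           m.month_end_date,
           d.daily_date,
           d.new_status,
           -- rank candidate daily rows so the earliest date comes first
           ROW_NUMBER() OVER (
               PARTITION BY m.ID, m.month_end_date
               ORDER BY d.daily_date
           ) AS rn
    FROM month_df m
    JOIN daily_df d
      ON  d.ID = m.ID
      AND d.daily_date >= m.month_end_date
      AND d.daily_date < date_add(m.month_end_date, 5)  -- 5-day buffer
      AND d.new_status IS NOT NULL                      -- skip null statuses
) t
WHERE rn = 1;
Pushing the IS NOT NULL filter and the date window into the join keeps the intermediate result small, so the window function only has to rank at most a few candidate rows per monthly record.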

Related

Time Series Downsampling/Upsampling

I am trying to downsample and upsample time series data on MonetDB.
Time series database systems (TSDS) usually offer downsampling and upsampling through an operator like SAMPLE BY (1h).
My time series data looks like the following:
sql>select * from datapoints limit 5;
+----------------------------+------------+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
| time | id_station | temperature | discharge | ph | oxygen | oxygen_saturation |
+============================+============+==========================+==========================+==========================+==========================+==========================+
| 2019-03-01 00:00:00.000000 | 0 | 407.052 | 0.954 | 7.79 | 12.14 | 12.14 |
| 2019-03-01 00:00:10.000000 | 0 | 407.052 | 0.954 | 7.79 | 12.13 | 12.13 |
| 2019-03-01 00:00:20.000000 | 0 | 407.051 | 0.954 | 7.79 | 12.13 | 12.13 |
| 2019-03-01 00:00:30.000000 | 0 | 407.051 | 0.953 | 7.79 | 12.12 | 12.12 |
| 2019-03-01 00:00:40.000000 | 0 | 407.051 | 0.952 | 7.78 | 12.12 | 12.12 |
+----------------------------+------------+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
I tried the following query but the results are obtained by aggregating all the values from different days, which is not what I am looking for:
sql>SELECT EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "hour";
| hour | avg_ph |
+======+==========================+
| 0 | 8.041121283524923 |
| 1 | 8.041086970785418 |
| 2 | 8.041152801724111 |
| 3 | 8.04107828783526 |
| 4 | 8.041060110153223 |
| 5 | 8.041167286877407 |
| ... | ... |
| 23 | 8.041219444444451 |
I then tried to aggregate the time series data first by day and then by hour:
SELECT EXTRACT(DATE FROM time) AS "day", EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";
But I am getting the following exception:
syntax error, unexpected sqlDATE in: "select extract(date"
My question: how could I aggregate/downsample the data to a specific period of time (e.g. obtain an aggregated value every 2 days or 12 hours)?
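The syntax error arises because DATE is not a valid EXTRACT field; the day component comes from casting the timestamp to DATE instead. A sketch of the per-day, per-hour aggregation, plus a coarser 12-hour bucket built by integer-dividing the hour (standard SQL, assuming MonetDB's usual CAST support; column names as above):
-- per-day, per-hour buckets
SELECT CAST(time AS DATE) AS "day",
       EXTRACT(HOUR FROM time) AS "hour",
       AVG(ph) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";

-- every 12 hours: hours 0-11 -> bucket 0, hours 12-23 -> bucket 1
SELECT CAST(time AS DATE) AS "day",
       EXTRACT(HOUR FROM time) / 12 AS "half_day",
       AVG(ph) AS avg_ph
FROM datapoints
GROUP BY "day", "half_day";
The same trick extends to multi-day windows if you first derive an integer day number to divide by the bucket width.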

Create results grid from database tables: SQL

I have a table which describes patients' medical symptoms which has the following structure.
Note that patient 1 and patient 2 have two symptoms.
| patientID | symptomName | SymptomStartDate | SymptomDuration |
|-----------|----------------|------------------|-----------------|
| 1 | Fever | 01/01/2020 | 10 |
| 1 | Cough | 02/01/2020 | 5 |
| 2 | ChestPain | 03/01/2020 | 6 |
| 2 | DryEyes | 04/01/2020 | 8 |
| 3 | SoreThroat | 05/01/2020 | 2 |
| 4 | AnotherSymptom | 06/01/2020 | 1 |
Using this data, I want to create a grid showing which symptoms each patient had, in the following format (with 1 indicating that the patient had that symptom and 0 indicating that the patient did not have that symptom)
| patientID | Fever | Cough | ChestPain | DryEyes | SoreThroat | AnotherSymptom | Headache |
|-----------|-------|-------|-----------|---------|------------|----------------|----------|
| 1         | 1     | 1     | 0         | 0       | 0          | 0              | 0        |
| 2         | 0     | 0     | 1         | 1       | 0          | 0              | 0        |
| 3         | 0     | 0     | 0         | 0       | 1          | 0              | 0        |
| 4         | 0     | 0     | 0         | 0       | 0          | 1              | 0        |
Note that none of the patients in this first table have Headache, but table 2 does have a column for Headache filled with 0s. I have a list of all symptoms I want to include as columns in a separate table (let's call that table symptom; it has only two columns: symptomName and symptomID).
Use a crosstab query:
TRANSFORM Count(Symptoms.SymptomStartDate)
SELECT Symptoms.PatientID
FROM Symptoms
GROUP BY Symptoms.PatientID
PIVOT Symptoms.SymptomName
    IN ('Fever','Cough','ChestPain','DryEyes','SoreThroat','AnotherSymptom','Headache');
Apply this format to the Format property of the field SymptomStartDate:
0;;;0
The first section formats the actual counts (here 0 or 1) as numbers, and the fourth section displays Null as 0, which turns the crosstab's counts into the 1/0 grid.
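TRANSFORM/PIVOT is Access-specific; if this needs to run on another SQL engine, conditional aggregation is a portable way to build the same 1/0 grid. A sketch using the question's column names, with the symptom list spelled out by hand:
SELECT patientID,
       MAX(CASE WHEN symptomName = 'Fever'          THEN 1 ELSE 0 END) AS Fever,
       MAX(CASE WHEN symptomName = 'Cough'          THEN 1 ELSE 0 END) AS Cough,
       MAX(CASE WHEN symptomName = 'ChestPain'      THEN 1 ELSE 0 END) AS ChestPain,
       MAX(CASE WHEN symptomName = 'DryEyes'        THEN 1 ELSE 0 END) AS DryEyes,
       MAX(CASE WHEN symptomName = 'SoreThroat'     THEN 1 ELSE 0 END) AS SoreThroat,
       MAX(CASE WHEN symptomName = 'AnotherSymptom' THEN 1 ELSE 0 END) AS AnotherSymptom,
       MAX(CASE WHEN symptomName = 'Headache'       THEN 1 ELSE 0 END) AS Headache
FROM Symptoms
GROUP BY patientID;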

Conditional Merge with nested Rank in Pandas

I'm trying to translate a conditional merge with a nested rank from SQL to Python/Pandas.
Specifically, I would like to merge two tables with a condition that ensures a 1:1 relationship and specifies which value to take.
In SQL this would be implemented with a ranked subquery that is one-sidedly joined under a condition.
Example
I merge the customer records table with a table of customer requests.
For each customer record, the result should show the latest request at or before the record's own timestamp.
table: Customer_records
+---------+------+------------+
| Cust_ID | Name | Timestamp |
+---------+------+------------+
| 1 | A | 2013-01-01 |
| 1 | A | 2014-01-01 |
| 1 | A | 2015-12-01 |
| 2 | B | 2014-01-01 |
| 3 | C | 2014-01-01 |
+---------+------+------------+
table: customer_request
+--------+---------+------------+
| Req_ID | Cust_ID | Timestamp |
+--------+---------+------------+
| 1 | 1 | 2013-01-01 |
| 2 | 1 | 2013-12-01 |
| 3 | 1 | 2015-01-01 |
| 4 | 2 | 2013-01-01 |
+--------+---------+------------+
table: merged
+---------+------+------------+--------+
| Cust_ID | Name | Timestamp | Req_ID |
+---------+------+------------+--------+
| 1 | A | 2013-01-01 | 1 |
| 1 | A | 2014-01-01 | 2 |
| 1 | A | 2015-12-01 | 3 |
| 2 | B | 2014-01-01 | 4 |
| 3 | C | 2014-01-01 | None |
+---------+------+------------+--------+
Use merge_asof; the only prerequisite is sorting both DataFrames by their Timestamp columns:
import pandas as pd

# merge_asof needs datetime keys and both frames sorted by the join key
Customer_records['Timestamp'] = pd.to_datetime(Customer_records['Timestamp'])
customer_request['Timestamp'] = pd.to_datetime(customer_request['Timestamp'])
Customer_records = Customer_records.sort_values('Timestamp')
customer_request = customer_request.sort_values('Timestamp')

# for each record, take the latest request at or before its Timestamp,
# matched within each Cust_ID
df = pd.merge_asof(Customer_records, customer_request, on='Timestamp', by='Cust_ID')
Cust_ID Name Timestamp Req_ID
0 1 A 2013-01-01 1.0
1 1 A 2014-01-01 2.0
2 2 B 2014-01-01 4.0
3 3 C 2014-01-01 NaN
4 1 A 2015-12-01 3.0
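The default direction='backward' is what implements "latest record at or before the timestamp"; merge_asof also accepts direction='forward' or direction='nearest' if the requirement flips, and a tolerance parameter if matches should only reach back a bounded distance.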

SQL window excluding current group?

I'm trying to produce rolled-up summaries of the following data, both including only the group in question and excluding that group. I think this can be done with a window function, but I'm having trouble getting the syntax right (in my case, Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but having trouble with creating the appropriate joining key.
You can do:
select year(date), month(date), product,
       count(*) as ct, avg(rating) as avg_rating,
       sum(count(*)) over (partition by year(date), month(date)) - count(*) as ct_other,
       ((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
        (sum(count(*)) over (partition by year(date), month(date)) - count(*))
       ) as avg_other
from t
group by year(date), month(date), product;
The rating for the "other" group is a bit tricky. You need to add everything up, subtract out the current group's contribution, and then calculate the average yourself as that remaining sum divided by the remaining count.
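As for the two-aggregate idea the question mentions, the joining key is just the month: aggregate once per (year, month, product), roll that up per (year, month), join the two, and subtract the group's own contribution. A sketch (Hive supports CTEs; it shares the window version's caveat that avg(rating) ignores NULL ratings while count(*) counts those rows, and Hive's division by zero yields NULL, matching the expected NULLs when other_ct is 0):
with per_product as (
    select year(date) as yr, month(date) as mon, product,
           count(*) as ct,
           avg(rating) as avg_rating,
           sum(rating) as sum_rating
    from t
    group by year(date), month(date), product
),
per_month as (
    select yr, mon,
           sum(ct) as total_ct,
           sum(sum_rating) as total_rating
    from per_product
    group by yr, mon
)
select p.yr, p.mon, p.product, p.ct, p.avg_rating,
       (m.total_rating - coalesce(p.sum_rating, 0)) / (m.total_ct - p.ct) as avg_rating_other,
       m.total_ct - p.ct as other_ct
from per_product p
join per_month m
  on m.yr = p.yr and m.mon = p.mon;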

Outer Join multiple tables keeping all rows in common columns

I'm quite new to SQL - hope you can help:
I have several tables that all have 3 columns in common: ObjNo, Date (year-month), Product.
Each table has 1 other column that represents an economic value (sales, count, netsales, plan, ...).
I need to join all tables on the 3 common columns. The outcome must have one row for each combination of the 3 common columns that exists in any table; not every combination exists in every table.
If I do full outer joins, I get ObjNo, Date, etc. once per table, but I only need them once.
How can I achieve this?
tblCount
+-------+--------+---------+-------+
| ObjNo | Date   | Product | count |
+-------+--------+---------+-------+
| 1     | 201601 | Snacks  | 22    |
| 2     | 201602 | Coffee  | 23    |
| 4     | 201605 | Tea     | 30    |
+-------+--------+---------+-------+

tblSalesPlan
+-------+--------+---------+-----------+
| ObjNo | Date   | Product | salesplan |
+-------+--------+---------+-----------+
| 1     | 201601 | Beer    | 2000      |
| 2     | 201602 | Snacks  | 2000      |
| 5     | 201605 | Tea     | 2000      |
+-------+--------+---------+-----------+

tblSales
+-------+--------+---------+-------+
| ObjNo | Date   | Product | Sales |
+-------+--------+---------+-------+
| 1     | 201601 | Beer    | 1000  |
| 2     | 201602 | Coffee  | 2000  |
| 3     | 201603 | Tea     | 3000  |
+-------+--------+---------+-------+
It sounds like you're using SELECT * FROM... which is giving you every field from every table. You probably only want to get the values from one table, so you should be explicit about which fields you want to include in the results.
If you're not sure which table is going to have a record for each case (i.e. there is not guaranteed to be a record in any particular table) you can use the COALESCE function to get the first non-null value in each case.
SELECT COALESCE(tbl1.ObjNo, tbl2.ObjNo, tbl3.ObjNo) AS ObjNo, ....
tbl1.Sales, tbl2.Count, tbl3.Netsales
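Put together for the three sample tables above, a complete version might look like the sketch below (identifiers such as Date and count may need quoting depending on the RDBMS):
SELECT COALESCE(c.ObjNo, p.ObjNo, s.ObjNo)       AS ObjNo,
       COALESCE(c.Date, p.Date, s.Date)          AS Date,
       COALESCE(c.Product, p.Product, s.Product) AS Product,
       c.count, p.salesplan, s.Sales
FROM tblCount c
FULL OUTER JOIN tblSalesPlan p
  ON  p.ObjNo   = c.ObjNo
  AND p.Date    = c.Date
  AND p.Product = c.Product
FULL OUTER JOIN tblSales s
  ON  s.ObjNo   = COALESCE(c.ObjNo, p.ObjNo)
  AND s.Date    = COALESCE(c.Date, p.Date)
  AND s.Product = COALESCE(c.Product, p.Product);
The COALESCE in the second join condition is the key move: once the first two tables are joined, either side may be NULL, so later tables have to match against whichever key values survived.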