Conditional Merge with nested Rank in Pandas - pandas

I’m trying to translate a conditional merge with a nested rank from SQL to pandas.
Specifically, I would like to merge two tables under a condition that ensures a 1:1 relationship and specifies which value to take.
In SQL this would be implemented as a ranked subquery that is left-joined with a condition.
Example
I merge the customer records table with a table of customer requests.
For each customer record, the result should show the latest request made at or before that record's timestamp.
table: Customer_records
+---------+------+------------+
| Cust_ID | Name | Timestamp |
+---------+------+------------+
| 1 | A | 2013-01-01 |
| 1 | A | 2014-01-01 |
| 1 | A | 2015-12-01 |
| 2 | B | 2014-01-01 |
| 3 | C | 2014-01-01 |
+---------+------+------------+
table: customer_request
+--------+---------+------------+
| Req_ID | Cust_ID | Timestamp |
+--------+---------+------------+
| 1 | 1 | 2013-01-01 |
| 2 | 1 | 2013-12-01 |
| 3 | 1 | 2015-01-01 |
| 4 | 2 | 2013-01-01 |
+--------+---------+------------+
table: merged
+---------+------+------------+--------+
| Cust_ID | Name | Timestamp | Req_ID |
+---------+------+------------+--------+
| 1 | A | 2013-01-01 | 1 |
| 1 | A | 2014-01-01 | 2 |
| 1 | A | 2015-12-01 | 3 |
| 2 | B | 2014-01-01 | 4 |
| 3 | C | 2014-01-01 | None |
+---------+------+------------+--------+

Use merge_asof. The only requirement is that both DataFrames are sorted by their Timestamp columns:
import pandas as pd
# merge_asof needs real datetimes and both frames sorted by the "on" key
Customer_records['Timestamp'] = pd.to_datetime(Customer_records['Timestamp'])
customer_request['Timestamp'] = pd.to_datetime(customer_request['Timestamp'])
Customer_records = Customer_records.sort_values('Timestamp')
customer_request = customer_request.sort_values('Timestamp')
# for each record, take the last request with the same Cust_ID at or before its Timestamp
df = pd.merge_asof(Customer_records, customer_request, on='Timestamp', by='Cust_ID')
Cust_ID Name Timestamp Req_ID
0 1 A 2013-01-01 1.0
1 1 A 2014-01-01 2.0
2 2 B 2014-01-01 4.0
3 3 C 2014-01-01 NaN
4 1 A 2015-12-01 3.0
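For anyone who wants to try this end to end, here is a minimal self-contained sketch that rebuilds the two sample tables from the question and runs the same merge. The lowercase variable names and the final re-sort by Cust_ID and Timestamp are my own choices for readability, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question's tables.
customer_records = pd.DataFrame({
    "Cust_ID": [1, 1, 1, 2, 3],
    "Name": ["A", "A", "A", "B", "C"],
    "Timestamp": pd.to_datetime(
        ["2013-01-01", "2014-01-01", "2015-12-01", "2014-01-01", "2014-01-01"]
    ),
})
customer_request = pd.DataFrame({
    "Req_ID": [1, 2, 3, 4],
    "Cust_ID": [1, 1, 1, 2],
    "Timestamp": pd.to_datetime(
        ["2013-01-01", "2013-12-01", "2015-01-01", "2013-01-01"]
    ),
})

# direction="backward" (the default) picks the last request whose
# Timestamp is <= the record's Timestamp, within the same Cust_ID.
merged = pd.merge_asof(
    customer_records.sort_values("Timestamp"),
    customer_request.sort_values("Timestamp"),
    on="Timestamp",
    by="Cust_ID",
    direction="backward",
)

# Re-sort only to make the result easier to compare with the expected table.
merged = merged.sort_values(["Cust_ID", "Timestamp"]).reset_index(drop=True)
print(merged)
```

Req_ID comes back as a float column because customer 3 has no matching request and gets NaN.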

Related

Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1 | 2019-07-31 |
| 1 | 2019-06-30 |
| 2 | 2019-10-31 |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1 | 2019-07-29 | 1 |
| 1 | 2019-07-30 | 1 |
| 1 | 2019-08-01 | 2 |
| 1 | 2019-08-02 | 2 |
| 1 | 2019-08-03 | 2 |
| 1 | 2019-06-29 | 0 |
| 1 | 2019-06-30 | 0 |
| 2 | 2019-10-30 | 5 |
| 2 | 2019-10-31 | NULL |
| 2 | 2019-11-01 | 6 |
| 2 | 2019-11-02 | 6 |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). For each monthly record, I want to keep only the record with the minimum daily_date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that seems not to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient that works in Spark.
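To make the intended matching rule concrete, here is a small pandas sketch of it on the sample data. This is not Spark SQL and says nothing about efficiency in Spark. It only illustrates the logic (drop null statuses, then take the first daily row on or after the month end, within the buffer). Using merge_asof with direction="forward" and passing the 5-day buffer as tolerance are my assumptions about how to express the rule.

```python
import pandas as pd

month_df = pd.DataFrame({
    "ID": [1, 1, 2],
    "month_end_date": pd.to_datetime(["2019-07-31", "2019-06-30", "2019-10-31"]),
})
daily_df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "daily_date": pd.to_datetime([
        "2019-07-29", "2019-07-30", "2019-08-01", "2019-08-02", "2019-08-03",
        "2019-06-29", "2019-06-30", "2019-10-30", "2019-10-31", "2019-11-01",
        "2019-11-02",
    ]),
    "new_status": [1, 1, 2, 2, 2, 0, 0, 5, None, 6, 6],
})

# Rows with a null new_status can never be the answer, so drop them first.
candidates = daily_df.dropna(subset=["new_status"]).sort_values("daily_date")

# direction="forward" picks the earliest daily_date >= month_end_date per ID;
# tolerance limits how far after the month end a match may lie.
result = pd.merge_asof(
    month_df.sort_values("month_end_date"),
    candidates,
    left_on="month_end_date",
    right_on="daily_date",
    by="ID",
    direction="forward",
    tolerance=pd.Timedelta(days=5),
).sort_values(["ID", "month_end_date"]).reset_index(drop=True)
print(result)
```

On this data the month end 2019-10-31 for ID 2 skips its own NULL row and picks up 2019-11-01 instead.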

Create results grid from database tables: SQL

I have a table which describes patients' medical symptoms which has the following structure.
Note that patient 1 and patient 2 have two symptoms.
| patientID | symptomName | SymptomStartDate | SymptomDuration |
|-----------|----------------|------------------|-----------------|
| 1 | Fever | 01/01/2020 | 10 |
| 1 | Cough | 02/01/2020 | 5 |
| 2 | ChestPain | 03/01/2020 | 6 |
| 2 | DryEyes | 04/01/2020 | 8 |
| 3 | SoreThroat | 05/01/2020 | 2 |
| 4 | AnotherSymptom | 06/01/2020 | 1 |
Using this data, I want to create a grid showing which symptoms each patient had, in the following format (with 1 indicating that the patient had that symptom and 0 indicating that the patient did not have that symptom)
| patientID | Fever | Cough | ChestPain | DryEyes | SoreThroat | AnotherSymptom | Headache |
|-----------|-------|-------|-----------|---------|------------|----------------|----------|
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Note that none of the patients in the first table has a headache, but the second table still has a Headache column, filled with 0s. The list of all symptoms I want to include as columns is kept in a separate table. Let's call that table symptom; it has only two columns: symptomName and symptomID.
Use a crosstab query:
TRANSFORM
Count(Symptoms.SymptomStartDate)
SELECT
Symptoms.PatientID
FROM
Symptoms
GROUP BY
Symptoms.PatientID
PIVOT
Symptoms.SymptomName
IN ('Fever','Cough','ChestPain','DryEyes','SoreThroat','AnotherSymptom','Headache');
Apply this format to the Format property of field SymptomStartDate:
0;;;0
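For comparison, the same grid can be sketched in pandas. This is not the crosstab query above, just an illustration of the same idea: count per patient and symptom, then force the full symptom list as columns so that Headache appears even though no patient has it. The hard-coded all_symptoms list stands in for the separate symptom table and is an assumption.

```python
import pandas as pd

symptoms = pd.DataFrame({
    "patientID": [1, 1, 2, 2, 3, 4],
    "symptomName": ["Fever", "Cough", "ChestPain", "DryEyes",
                    "SoreThroat", "AnotherSymptom"],
})

# Stand-in for the symptomName column of the separate symptom table.
all_symptoms = ["Fever", "Cough", "ChestPain", "DryEyes",
                "SoreThroat", "AnotherSymptom", "Headache"]

grid = (
    pd.crosstab(symptoms["patientID"], symptoms["symptomName"])
    .clip(upper=1)                               # 1 = has symptom, even if listed twice
    .reindex(columns=all_symptoms, fill_value=0)  # add missing symptoms as 0 columns
)
print(grid)
```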

SQL window excluding current group?

I'm trying to provide rolled up summaries of the following data including only the group in question as well as excluding the group. I think this can be done with a window function, but I'm having problems with getting the syntax down (in my case Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but I'm having trouble creating an appropriate join key.
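You can do:
select year(date), month(date), product,
count(*) as ct, avg(rating) as avg_rating,
sum(count(*)) over (partition by year(date), month(date)) - count(*) as other_ct,
-- "other" average: month total minus this product's total, divided by the matching count of non-null ratings
((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
(sum(count(rating)) over (partition by year(date), month(date)) - count(rating))
) as avg_rating_other
from t
group by year(date), month(date), product;
The rating for the "other" products is a bit tricky. Averages cannot be subtracted from each other directly, so you add up everything in the month, subtract the current product's sum, and divide by the count obtained the same way. Using count(rating) in the denominator keeps rows with a NULL rating out of the average, which matches what avg(rating) does.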
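The "total minus self" arithmetic can be checked quickly in pandas on a slice of the sample data (only the 2018 rows for products A and D). This is a sketch of the same idea, not Hive SQL. The helper column names rated and total are mine.

```python
import pandas as pd

rows = [
    ("2018-01-01", "A", 1), ("2018-01-02", "A", 3), ("2018-01-20", "A", 4),
    ("2018-01-27", "A", 5), ("2018-01-29", "A", 4), ("2018-02-01", "A", 5),
    ("2018-01-13", "D", 1), ("2018-01-14", "D", 2), ("2018-01-24", "D", 3),
    ("2018-01-31", "D", 4),
]
df = pd.DataFrame(rows, columns=["date", "product", "rating"])
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Per (year, month, product): row count, non-null rating count, rating sum.
g = df.groupby(["year", "month", "product"]).agg(
    ct=("rating", "size"),
    rated=("rating", "count"),
    total=("rating", "sum"),
).reset_index()

# Month-level totals broadcast back to every product row (the window part).
month_tot = g.groupby(["year", "month"])[["ct", "rated", "total"]].transform("sum")

g["avg_rating"] = g["total"] / g["rated"]
g["other_ct"] = month_tot["ct"] - g["ct"]
# 0 / 0 yields NaN here, which plays the role of NULL in the expected output.
g["avg_rating_other"] = (month_tot["total"] - g["total"]) / (month_tot["rated"] - g["rated"])
print(g[["year", "month", "product", "ct", "avg_rating", "avg_rating_other", "other_ct"]])
```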

Sql query for special record

First of all, please excuse my poor English.
I have two tables:
master table:
| product id | pr_name | remain_Qty |
+--------------+------------------+-------------------+
| 1 | x | 13 |
| 2 | y | 18 |
| 3 | z | 21 |
+--------------+------------------+-------------------+
Detail Table (This table contain detail data of bought product):
+--------------+------------------+----------+--------+
| date | pr_id | Qty |price |
+--------------+------------------+----------+--------+
| 2010-01-01 | 1 | 3 | 1000 |
| 2010-01-02 | 1 | 5 | 1200 |
| 2010-01-01 | 2 | 11 | 1100 |
| 2010-01-03 | 1 | 4 | 1400 |
| 2010-01-04 | 3 | 3 | 1300 |
| 2010-01-01 | 2 | 6 | 1600 |
| 2010-01-03 | 1 | 7 | 1700 |
| 2010-01-02 | 3 | 3 | 1300 |
| 2010-01-01 | 3 | 5 | 1500 |
| 2010-01-04 | 3 | 7 | 1700 |
| 2010-01-06 | 2 | 8 | 1800 |
| 2010-01-07 | 2 | 4 | 1400 |
| 2010-01-03 | 1 | 3 | 1300 |
| 2010-01-04 | 3 | 6 | 1600 |
| 2010-01-08 | 1 | 1 | 1100 |
+--------------+------------------+----------+--------+
sum Qty of product 1 = 23
sum Qty of product 2 = 29
sum Qty of product 3 = 24
As a result, I want a list of rows from the detail table, sorted by pr_id, date and price, in which the sum of Qty per pr_id does not exceed the remain_Qty of that product in the master table.
For example:
+--------------+------------------+----------+--------+
| date | pr_id | Qty |price |
+--------------+------------------+----------+--------+
| 2010-01-01 | 1 | 3 | 1000 |
| 2010-01-02 | 1 | 5 | 1200 |
| 2010-01-03 | 1 | 4 | 1400 |
| 2010-01-03 | 1 | 1 | 1700 |
| 2010-01-01 | 2 | 11 | 1100 |
| 2010-01-01 | 2 | 6 | 1600 |
| 2010-01-01 | 3 | 5 | 1500 |
| 2010-01-02 | 3 | 3 | 1300 |
| 2010-01-04 | 3 | 3 | 1300 |
| 2010-01-04 | 3 | 7 | 1700 |
+--------------+------------------+----------+--------+
This is more of a clarification than a direct SQL answer. What it LOOKS like they want is an inventory being depleted to fill orders from the known available quantity, but even the expected output falls short of that, as it may be missing a second Qty of 3 on 2010-01-03 for product 1. Looking at just ID=1 from his sample data, that would show...
| date | pr_id | Qty |price | Qty Available to fill order
+--------------+--------+-----+-------+
| 2010-01-01 | 1 | 3 | 1000 | 13 - 3 = 10 avail next order
| 2010-01-02 | 1 | 5 | 1200 | 10 - 5 = 5 avail next order
| 2010-01-03 | 1 | 3 | 1300 | 5 - 3 = 2 avail next order
| 2010-01-03 | 1 | 4 | 1400 | only 2 to PARTIALLY fill this order
| 2010-01-03 | 1 | 7 | 1700 | none available
| 2010-01-08 | 1 | 1 | 1100 | none available
With the extra sample record removed, would result in...
| date | pr_id | Qty |price | Qty Available to fill order
+--------------+--------+-----+-------+
| 2010-01-01 | 1 | 3 | 1000 | 13 - 3 = 10 avail next order
| 2010-01-02 | 1 | 5 | 1200 | 10 - 5 = 5 avail next order
| 2010-01-03 | 1 | 4 | 1400 | 5 - 4 = 1 avail for next order
| 2010-01-03 | 1 | 7 | 1700 | only 1 of the 7 available
| 2010-01-08 | 1 | 1 | 1100 | no more available...
So Aliasghar, does this better represent what you are trying to do??? Fill the available orders based on which order was entered into the system first, fill as many as possible based on inventory and stop there?
Please confirm by adding comment to this answer and maybe we can help resolve... Also, confirm WHICH Database are you using... SQL-Server, Oracle, MySQL, etc...
Here is a working query for pr_id=1, using MySQL:
select final.pr_date, final.pr_id, count(t_qty) as qty, final.price from
(select * FROM (select q.pr_date, q.pr_id, 1 as t_qty, q.price , @t := @t + t_qty total
FROM(
SELECT d.pr_date, d.pr_id, 1 as t_qty, d.price
FROM detail_table d
JOIN generator_4k i
ON i.n between 1 and d.qty
WHERE d.pr_id= 1
Order by d.id, d.pr_date) q
CROSS JOIN (SELECT @t := 0) i) c
WHERE c.total <= (select remain_qty from master_table WHERE product_id = 1)) final
group by final.pr_date , final.pr_id , final.price ;
Here SQL FIDDLE
You have to adapt your detail_table by adding a technical id as primary key, and create some views. I renamed the date column to pr_date. You'll find the schema on the SQL Fiddle.
Here is another query, using SQL Server:
select final.pr_date, final.pr_id, count(t_qty) as qty, final.price from
(SELECT top(select remain_qty from master_table WHERE product_id = 1) d.pr_date, d.pr_id, 1 as t_qty, d.price
FROM detail_table d
JOIN generator_4k i
ON i.n between 1 and d.qty
WHERE d.pr_id= 1
Order by d.id, d.pr_date) final
group by final.pr_date , final.pr_id , final.price ;
Here SQL FIDDLE
Daywise product by info
Below is a suggested statement.
select t2.date,t2.pr_id,t1.pr_name,sum(qty) as qty_buy,sum(price) as amount from master_table as t1
inner join detail_table as t2 on t1.product_id=t2.pr_id
group by t2.date,t2.pr_id,t1.pr_name
order by t2.date,t2.pr_id
I had a hard time understanding what you really wanted.
If I understood correctly, you want the detail rows for a product, but only up to the product's remaining quantity.
I have not yet found a way to take only the remaining part of the first row that goes over the limit.
So for now my query simply stops once the allowed remaining quantity is reached.
SQL FIDDLE
To do what you want, you first need to create a view that generates one row per unit of quantity.
For example, a row like
+--------------+------------------+----------+--------+
| date | pr_id | Qty |price |
+--------------+------------------+----------+--------+
| 2010-01-01 | 1 | 3 | 1000 |
turns into something like
+--------------+------------------+----------+--------+
| date | pr_id | Qty |price |
+--------------+------------------+----------+--------+
| 2010-01-01 | 1 | 1 | 1000 |
| 2010-01-01 | 1 | 1 | 1000 |
| 2010-01-01 | 1 | 1 | 1000 |
Then you count the rows until the remaining quantity is used up.
After that you regroup all of the rows by price, pr_id and date.
VOILA
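The row-expansion idea above can also be expressed with a running total instead of one row per unit. The pandas sketch below uses the product 1 sample rows (with the extra record removed, as in the clarification earlier in this thread) and partially fills the first row that goes over the limit. The column name filled and the partial-fill behaviour are my assumptions about what the asker wants.

```python
import pandas as pd

detail = pd.DataFrame({
    "date": ["2010-01-01", "2010-01-02", "2010-01-03", "2010-01-03", "2010-01-08"],
    "pr_id": [1, 1, 1, 1, 1],
    "Qty": [3, 5, 4, 7, 1],
    "price": [1000, 1200, 1400, 1700, 1100],
})
master = pd.DataFrame({"product_id": [1], "remain_Qty": [13]})

# Order the rows as requested, then attach each product's remaining quantity.
d = (
    detail.sort_values(["pr_id", "date", "price"])
    .merge(master, left_on="pr_id", right_on="product_id")
)

# Quantity already consumed by earlier rows of the same product.
used_before = d.groupby("pr_id")["Qty"].cumsum() - d["Qty"]

# What is still available for this row, limited to [0, Qty].
d["filled"] = (d["remain_Qty"] - used_before).clip(lower=0).clip(upper=d["Qty"])

result = d.loc[d["filled"] > 0, ["date", "pr_id", "filled", "price"]]
print(result)
```

For product 1 this yields quantities 3, 5, 4 and 1, which add up to the remain_Qty of 13.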

How to join 2 tables with some of transpose row to columns

For example, I have two tables:
info_table
id | Title | description
1 | title1 | dec1
2 | title2 | dec2
3 | title3 | dec3
Instance_Table
e_id | name | string
1 | date | 2015/01/19
2 | time | 10:00
3 | value | 10
1 | date | 2015/01/20
2 | time | 11:00
3 | value | 12
1 | date | 2015/01/21
2 | time | 12:00
3 | value | 13
Expected result:
id | Title | date | Time | value | Description
1 | title1 | 2015/01/19 | 10:00 | 10 | dec1
2 | title2 | 2015/01/20 | 11:00 | 12 | dec2
3 | title3 | 2015/01/21 | 12:00 | 13 | dec3
You should integrate a Foreign Key on the Instance_Table, and your e_id should be a Primary Key.
info_table
id | Title | description
1 | title1 | dec1
2 | title2 | dec2
3 | title3 | dec3
Instance_Table
e_id | name | string | FK_InfoTable
1 | date | 2015/01/19 | 1
2 | time | 10:00 | 1
3 | value | 10 | 1
4 | date | 2015/01/20 | 2
5 | time | 11:00 | 2
6 | value | 12 | 2
7 | date | 2015/01/21 | 3
8 | time | 12:00 | 3
9 | value | 13 | 3
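With a SQL statement like the following you can then relate each Instance_Table row to its info_table row. Note that this join alone returns one row per name/string pair, so the date, time and value rows still have to be pivoted into columns to get the expected layout.
SELECT * FROM info_table INNER JOIN Instance_Table ON info_table.id = Instance_Table.FK_InfoTable
You can read about relational databases here:
Relational database WIKI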
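To show how the foreign key makes the transposed layout reachable, here is a short pandas sketch on the restructured sample tables: pivot the name/string pairs into columns per FK_InfoTable, then join to info_table. It is an illustration in pandas rather than SQL, and the variable names are hypothetical.

```python
import pandas as pd

info = pd.DataFrame({
    "id": [1, 2, 3],
    "Title": ["title1", "title2", "title3"],
    "description": ["dec1", "dec2", "dec3"],
})
instance = pd.DataFrame({
    "e_id": list(range(1, 10)),
    "name": ["date", "time", "value"] * 3,
    "string": ["2015/01/19", "10:00", "10",
               "2015/01/20", "11:00", "12",
               "2015/01/21", "12:00", "13"],
    "FK_InfoTable": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# One row per info record, one column per attribute name.
wide = instance.pivot(index="FK_InfoTable", columns="name", values="string").reset_index()

# Join the pivoted attributes back to the info table via the foreign key.
result = (
    info.merge(wide, left_on="id", right_on="FK_InfoTable")
    .drop(columns="FK_InfoTable")
    [["id", "Title", "date", "time", "value", "description"]]
)
print(result)
```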