I have a DataFrame with the following (simulated) data.
Price  Date        InvoiceNumber  Product_Type
12.65  12/30/2021  INV_19984      AXN UN1234
18.78  1/23/2022   INV_200174     AXN UN1234
11.78  1/25/2022   INV_200173     AXN UN1234
11.1   3/2/2022    INV_9912       AXN UN1234
I am trying to find the minimum price of the product over the last 30, 60, and 90 days without filtering the data. The data is huge, so I am avoiding a join and using a window function instead. But it returns null.
wi = Window.partitionBy("Product_Type").orderBy("Date").rowsBetween(30 ,Window.currentRow)
f1 = df.withColumn("minPrice30",F.min("price").over(wi))
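The frame above is the likely culprit: `rowsBetween(30, Window.currentRow)` counts rows rather than days, and its start (30 following) lies after its end, so the frame is empty; `orderBy("Date")` on a string column also sorts lexically rather than chronologically. A trailing N-day window in Spark needs the date cast to a numeric value and a range frame, e.g. `Window.partitionBy("Product_Type").orderBy(F.col("Date").cast("timestamp").cast("long")).rangeBetween(-30 * 86400, 0)`. As a quick sanity check of that trailing-window logic on the sample rows, here is a sketch in pandas (not Spark, purely for illustration):

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "Price": [12.65, 18.78, 11.78, 11.1],
    "Date": pd.to_datetime(["12/30/2021", "1/23/2022", "1/25/2022", "3/2/2022"]),
    "InvoiceNumber": ["INV_19984", "INV_200174", "INV_200173", "INV_9912"],
    "Product_Type": ["AXN UN1234"] * 4,
})

# Sort so each group's rows are in date order, then take a trailing
# time-based window per product: a 30/60/90-day range, not 30 rows
df = df.sort_values(["Product_Type", "Date"]).reset_index(drop=True)
for days in (30, 60, 90):
    df[f"minPrice{days}"] = (
        df.set_index("Date")
          .groupby("Product_Type")["Price"]
          .rolling(f"{days}D")
          .min()
          .values
    )
print(df[["Date", "Price", "minPrice30", "minPrice60"]])
```

This reproduces the expected output below: each row's minimum is taken only over rows of the same product whose date falls within the trailing window.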
Expected output is:
minPrice60  minPrice30  Price  Date        InvoiceNumber  Product_Type
12.65       12.65       12.65  12/30/2021  INV_19984      AXN UN1234
12.65       12.65       18.78  1/23/2022   INV_200174     AXN UN1234
11.78       11.78       11.78  1/25/2022   INV_200173     AXN UN1234
11.1        11.1        11.1   3/2/2022    INV_9912       AXN UN1234
Is there any way we can achieve this within the date range? Kindly suggest.
I have 2 tables: Table A (stock prices with date and symbol) and Table B (adjustment factor and the effective date; an adjustment factor is applicable to all dates before its effective date).
Table A -
date        symbol     price
2021-07-23  IRCON      45
2021-07-23  TIDEWATER  14891
2021-07-22  TIDEWATER  15309
2021-07-22  IRCON      45
2020-04-03  IRCON      91
2020-04-03  TIDEWATER  3182
2020-04-01  IRCON      393
2020-04-01  TIDEWATER  3171
2020-03-31  IRCON      381
2020-03-31  TIDEWATER  3207
Table B -
symbol
effective_date
adjustment_factor
TIDEWATER
2021-07-26
3
IRCON
2021-07-26
2
IRCON
2020-04-03
5
Requirement is -
The adjustment_factor of a symbol needs to be applied (as a divisor) to the prices of that symbol for all dates less than the effective_date of that adjustment_factor.
E.g., for IRCON and the adjustment_factor of 2 dated 2021-07-26, all prices of IRCON in Table A earlier than 2021-07-26 need to be divided by 2.
Similarly, for IRCON and the adjustment_factor of 5 dated 2020-04-03, all prices of IRCON in Table A earlier than 2020-04-03 need to be divided by 5
(so, effectively, all prices of IRCON before 2020-04-03 need to be divided by 2x5=10).
Desired output -
date        symbol     price  adjustment_factor  adjusted_price
2021-07-23  IRCON      45     2                  22.38
2021-07-23  TIDEWATER  14891  3                  4963.75
2021-07-22  TIDEWATER  15309  3                  5103.00
2021-07-22  IRCON      45     2                  22.58
2020-04-03  IRCON      91     2                  45.43
2020-04-03  TIDEWATER  3182   3                  1060.50
2020-04-01  IRCON      393    10=2x5             39.30
2020-04-01  TIDEWATER  3171   3                  1057.13
2020-03-31  IRCON      381    10=2x5             38.10
2020-03-31  TIDEWATER  3207   3                  1069.13
I have been trying with an INNER JOIN; however, I am stuck on prices that need to be divided multiple times. Some prices need to be adjusted/divided by 5-6 factors combined/multiplied together. Is it possible to write a query for this in Postgres, maybe using window functions? Is there any scalable query to do this?
I am pretty sure there must be a better solution, but until you find one, you may try the solution below:
SELECT A.date, A.symbol, A.price,
       (SELECT COALESCE((SELECT adjustment_factor
                         FROM B B2
                         WHERE B1.symbol = B2.symbol
                           AND B1.effective_date < B2.effective_date), 1) * B1.adjustment_factor adjustment_factor
        FROM B B1
        WHERE A.symbol = B1.symbol
          AND A.date < B1.effective_date
        ORDER BY B1.effective_date
        LIMIT 1),
       ROUND(A.price :: DECIMAL / (SELECT COALESCE((SELECT adjustment_factor
                         FROM B B2
                         WHERE B1.symbol = B2.symbol
                           AND B1.effective_date < B2.effective_date), 1) * B1.adjustment_factor
        FROM B B1
        WHERE A.symbol = B1.symbol
          AND A.date < B1.effective_date
        ORDER BY B1.effective_date
        LIMIT 1), 2) adjusted_price
FROM A;
Fiddle Demo.
You can use a lateral join. The real trick is that Postgres (and SQL in general) does not offer a product() aggregation function. However, you can implement one using logs and exponentiation:
select a.*,
       b.net_adjustment_factor,
       a.price / coalesce(b.net_adjustment_factor, 1) as adjusted_price
from a left join lateral
     (select exp(sum(ln(adjustment_factor))) as net_adjustment_factor
      from b
      where b.symbol = a.symbol and
            b.effective_date > a.date
     ) b
on 1=1;
Here is a db-fiddle.
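To sanity-check the product-of-factors logic (every factor whose effective_date is strictly later than the price date gets multiplied in), here is a small pandas sketch over the IRCON rows from the sample; the frame names are illustrative, not part of either answer:

```python
import pandas as pd

# IRCON rows of Table A and Table B from the question
a = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-23", "2021-07-22", "2020-04-03",
                            "2020-04-01", "2020-03-31"]),
    "symbol": ["IRCON"] * 5,
    "price": [45, 45, 91, 393, 381],
})
b = pd.DataFrame({
    "symbol": ["IRCON", "IRCON"],
    "effective_date": pd.to_datetime(["2021-07-26", "2020-04-03"]),
    "adjustment_factor": [2, 5],
})

# Net factor for a price row = product of all factors of the same symbol
# whose effective_date is strictly later than the price date
def net_factor(row):
    mask = (b["symbol"] == row["symbol"]) & (b["effective_date"] > row["date"])
    return b.loc[mask, "adjustment_factor"].prod()  # empty product -> 1

a["adjustment_factor"] = a.apply(net_factor, axis=1)
a["adjusted_price"] = (a["price"] / a["adjustment_factor"]).round(2)
print(a)
```

The 2020-04-01 and 2020-03-31 rows pick up both factors (2x5=10), matching the 39.30 and 38.10 in the desired output, while the 2020-04-03 row is divided by 2 only, since a factor does not apply on its own effective date.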
How can I generate a date column shipping_date (either in 2019 or 2020) such that the difference between sale_date and shipping_date is exponentially distributed (and shipping_date comes before sale_date)?
Let's say this is my dataset, in YYYY-MM-DD format:
ID Sale Date Shipping Date
5464 2019-01-06
5423 2020-01-07
3490 2019-04-08
3945 2019-03-09
2387 2019-10-10
2393 2019-11-11
2395 2020-01-12
4331 2019-04-13
3982 2019-05-14
1875 2019-08-15
I suggest you use the scipy package for this problem; see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html
You can do something like the following:
from scipy.stats import expon
import pandas as pd

df['delta'] = pd.Series(expon.rvs(scale=1, size=len(df)))
Note the scale parameter you can set; please see the documentation link above. After you have done this, just use pandas to subtract the 'delta' column (as days) from 'Sale Date' using the usual time-difference techniques.
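Putting the two steps together, a minimal end-to-end sketch (column names taken from the question; `scale=3` is an arbitrary choice for the mean gap in days, and `random_state` is set only for reproducibility):

```python
import pandas as pd
from scipy.stats import expon

df = pd.DataFrame({
    "ID": [5464, 5423, 3490],
    "Sale Date": pd.to_datetime(["2019-01-06", "2020-01-07", "2019-04-08"]),
})

# Exponentially distributed gap in days; scale is the mean of the distribution
gap_days = expon.rvs(scale=3, size=len(df), random_state=0)

# Shipping precedes the sale per the question, so subtract the gap
df["Shipping Date"] = df["Sale Date"] - pd.to_timedelta(gap_days, unit="D")
print(df)
```

Since the exponential distribution is strictly positive, every generated shipping date lands before its sale date.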
I have two tables. The metal master table has the actual metal start and end timestamps. The metal interval table has the electricity reading for a specific timestamp. The power for a specific time interval is the cumulative difference in electricity consumed across all the readings within that interval in the metal interval data. I'm trying to get the total power consumed for a given metal in the last one hour of the process. Can you help me with this?
Here is the sample table:
Metal_Interval_Data
MetalID TreatID Plant ElctrctyConsumed ReadingTime
123456 8 MEAF 21500 2017-07-01 14:01:34.34
123456 8 MEAF 21650 2017-07-01 14:01:44.44
123456 8 MEAF 0 2017-07-01 09:54:53.53
123478 8 MEAF 0 2017-07-01 23:37:19.19
123478 8 MEAF 150 2017-07-02 00:32:08.08
Metal_Master_Data
MetalID MetalStartActual MetalStartPlanned MetalEndActual MetalEndPlanned
123456 2017-07-01 09:51:42.42 2017-06-30 08:59:35.35 2017-07-01 16:05:33.33 2017-06-30 14:59:35.35
123478 2017-07-01 23:30:31.31 2017-06-30 20:00:00.00 2017-07-02 00:33:25.25 2017-06-30 20:59:59.59
124302 2017-07-02 01:42:42.42 2017-07-01 20:51:47.47 2017-07-02 02:17:14.14 2017-07-01 21:51:47.47
Try the following script:
SELECT A.MetalID,SUM(B.ElctrctyConsumed) TotalConsumed
FROM Metal_Master_Data A
INNER JOIN Metal_Interval_Data B ON A.MetalID = B.MetalID
WHERE B.ReadingTime BETWEEN DATEADD(HH,-1,A.MetalEndActual) AND A.MetalEndActual
GROUP BY A.MetalID
#mkRabbani's answer is correct, but here it is restricted to a given metal id:
SELECT A.MetalID,SUM(B.ElctrctyConsumed) TotalConsumed
FROM Metal_Master_Data A
INNER JOIN Metal_Interval_Data B ON A.MetalID = B.MetalID
WHERE B.ReadingTime BETWEEN DATEADD(HH,-1,A.MetalEndActual)
AND A.MetalEndActual and A.MetalID=#givenmetalid
GROUP BY A.MetalID
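The join-plus-DATEADD approach above can be sanity-checked with a small pandas sketch of the same last-hour filter (the frames mirror the sample tables; only the columns actually used are included):

```python
import pandas as pd

# Sample readings from Metal_Interval_Data
intervals = pd.DataFrame({
    "MetalID": [123456, 123456, 123456, 123478, 123478],
    "ElctrctyConsumed": [21500, 21650, 0, 0, 150],
    "ReadingTime": pd.to_datetime([
        "2017-07-01 14:01:34.34", "2017-07-01 14:01:44.44",
        "2017-07-01 09:54:53.53", "2017-07-01 23:37:19.19",
        "2017-07-02 00:32:08.08",
    ]),
})
# Actual end times from Metal_Master_Data
master = pd.DataFrame({
    "MetalID": [123456, 123478],
    "MetalEndActual": pd.to_datetime(["2017-07-01 16:05:33.33",
                                      "2017-07-02 00:33:25.25"]),
})

# Inner join, then keep readings in the final hour before the actual end
m = intervals.merge(master, on="MetalID")
in_last_hour = m[
    (m["ReadingTime"] >= m["MetalEndActual"] - pd.Timedelta(hours=1))
    & (m["ReadingTime"] <= m["MetalEndActual"])
]
total = in_last_hour.groupby("MetalID")["ElctrctyConsumed"].sum()
print(total)
```

On this sample, metal 123478 has two readings in its final hour (summing to 150), while metal 123456 has none, so it drops out, just as with the INNER JOIN.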
I need to identify the members/providers (via a created key: PRV_MBR_KEY) who have 30 consecutive dates of service with no breaks. I've tested and validated this successfully on sample data in SQL Server. I need to re-code it to work on the Sybase IQ platform.
Sample Data:
PRV_MBR_KEY FDOS_LN
330800913-00369544518 10/10/2016
330800913-00369544518 10/11/2016
330800913-00369544518 10/12/2016
330800913-00369544518 10/13/2016
330800913-00369544518 10/14/2016
330800913-00369544518 10/15/2016
330800913-00369544518 10/16/2016
330800913-00369544518 10/17/2016
330800913-00369544518 10/18/2016
330800913-00369544518 10/19/2016
Here's the SQL Server code that works:
WITH CTE AS (
SELECT PRV_MBR_KEY,
[FDOS_LN],
RangeId = DATEDIFF(DAY, 0, [FDOS_LN])
- ROW_NUMBER() OVER (PARTITION BY PRV_MBR_KEY ORDER BY [FDOS_LN])
FROM TEST_table)
SELECT PRV_MBR_KEY,
StartDate=MIN([FDOS_LN]),
EndDate=MAX([FDOS_LN]),
DAYCOUNT=DATEDIFF(DAY,MIN([FDOS_LN]),MAX([FDOS_LN]))+1
FROM CTE
GROUP BY PRV_MBR_KEY,RangeID
HAVING DATEDIFF(DAY,MIN([FDOS_LN]),MAX([FDOS_LN]))+1>=30
ORDER BY PRV_MBR_KEY,MIN([FDOS_LN])
Expected Results:
PRV_MBR_KEY StartDate EndDate DAYCOUNT
330800913-00369544518 2016-10-10 2016-11-13 35
330800913-00565274557 2017-01-26 2017-02-24 30
The error I get when running the code in Sybase IQ:
[Sybase][ODBC Driver][Sybase IQ]Data exception - argument must be DATE or
DATETIME --(dflib/dfe_datepart.cxx 1450)
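The failing piece is almost certainly `DATEDIFF(DAY, 0, [FDOS_LN])`: SQL Server implicitly converts the literal 0 to the base date 1900-01-01, but Sybase IQ does not, hence the "argument must be DATE or DATETIME" error. Swapping in an explicit base date, e.g. `DATEDIFF(DAY, CAST('1900-01-01' AS DATE), FDOS_LN)`, should be the likely fix. The gaps-and-islands trick itself (date minus row number within the key is constant across a consecutive run) can be verified in pandas:

```python
import pandas as pd

# Hypothetical sample: one 3-day run, a gap, then a 2-day run
df = pd.DataFrame({
    "PRV_MBR_KEY": ["330800913-00369544518"] * 5,
    "FDOS_LN": pd.to_datetime([
        "2016-10-10", "2016-10-11", "2016-10-12",
        "2016-10-14", "2016-10-15",
    ]),
})
df = df.sort_values(["PRV_MBR_KEY", "FDOS_LN"])

# Date minus row-number-within-group is constant across a consecutive run
rn = df.groupby("PRV_MBR_KEY").cumcount()
df["RangeId"] = df["FDOS_LN"] - pd.to_timedelta(rn, unit="D")

runs = (
    df.groupby(["PRV_MBR_KEY", "RangeId"])["FDOS_LN"]
      .agg(StartDate="min", EndDate="max")
      .reset_index()
)
runs["DAYCOUNT"] = (runs["EndDate"] - runs["StartDate"]).dt.days + 1
print(runs[["PRV_MBR_KEY", "StartDate", "EndDate", "DAYCOUNT"]])
```

Filtering `runs` to `DAYCOUNT >= 30` then matches the HAVING clause in the SQL version.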
Here is a sample of what I have in my table (SQL Server):
patientID DateCreated StartOn EndOn
---------------------------------------------------
1234 2015-09-16 2015-09-01 2015-09-30
2345 2015-09-16 2015-09-01 2015-09-30
2346 2015-09-16 2015-09-01 2015-09-30
Currently, it counts the "days" as 30; that is, it looks at the days elapsed between StartOn and EndOn. I want to do this counting based on StartOn and DateCreated instead, so in my example the "days" should be 16: the days elapsed from StartOn to DateCreated.
You can use DateDiff(Day,StartOn,DateCreated)
So you can go with:
SELECT DATEDIFF(DAY, StartOn, DateCreated) + 1 AS "Days"
FROM Tablename
WHERE patientID = 1234;
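As a quick check of the inclusive day count the question expects (StartOn 2015-09-01 through DateCreated 2015-09-16 counted as 16 days, so an elapsed-days difference needs +1):

```python
from datetime import date

start_on = date(2015, 9, 1)
date_created = date(2015, 9, 16)

# Elapsed days plus 1 counts both endpoints, like DATEDIFF(DAY, ...) + 1
days = (date_created - start_on).days + 1
print(days)
```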