How can I compare dates from two dataframes with an if statement?

Table 1 contains the history of all the employee information but only captures the data every 90 days. Table 2 contains the current information of all employees and is updated weekly with a timestamp.
Every 90 days, Table 2 is appended to Table 1. I figured that by taking the timestamp in Table 1, adding 90 days to it, and comparing it to the timestamp in Table 2, I could use the logic below to execute the append, but I'm getting an error:
TypeError: '<' not supported between instances of 'DataFrame' and 'DataFrame'
Am I missing something?
# Let's say the max date in table 1 is 2023-01-15. Adding 90 days would put us on 2023-04-15
futr_date = spark.sql('SELECT date_add(MAX(tm_update), 90) AS future_date FROM tbl_one')
# Checking the date in the weekly refresh table, I have a timestamp of 2023-02-03
curr_date = spark.sql('SELECT DISTINCT tm_update AS current_date FROM tbl_two')
if curr_date > futr_date:
    print('execute block of code that transforms table 2 data and append to table 1')
else:
    print('ignore and check again next week')

The SELECT statement is not returning a value but a DataFrame, and that's why you are getting the error. If you want a value, you need to collect:
futr_date = spark.sql('SELECT date_add(MAX(tm_update), 90) AS future_date FROM tbl_one').collect()[0]
In the second query you are using DISTINCT to get the date, which may return a list of values; I am not sure if that's what you want. Maybe you should use MIN here? With only one timestamp value it may not matter, but with more values it may cause issues.
As I said, I am not sure if your logic is correct, but here is a working example which you can use for further changes:
import time
import pyspark.sql.functions as F

historicalData = [
    (1, time.mktime(time.strptime("24/10/2022", "%d/%m/%Y"))),
    (2, time.mktime(time.strptime("15/01/2023", "%d/%m/%Y"))),
    (3, time.mktime(time.strptime("04/11/2022", "%d/%m/%Y"))),
]
currentData = [
    (1, time.mktime(time.strptime("01/02/2023", "%d/%m/%Y"))),
    (2, time.mktime(time.strptime("02/02/2023", "%d/%m/%Y"))),
    (3, time.mktime(time.strptime("03/02/2023", "%d/%m/%Y"))),
]

oldDf = spark.createDataFrame(historicalData, schema=["id", "tm_update"]).withColumn(
    "tm_update", F.to_timestamp("tm_update")
)
oldDf.createOrReplaceTempView("tbl_one")

currentDf = spark.createDataFrame(currentData, schema=["id", "tm_update"]).withColumn(
    "tm_update", F.to_timestamp("tm_update")
)
currentDf.createOrReplaceTempView("tbl_two")

futr_date = spark.sql(
    "SELECT date_add(MAX(tm_update), 90) AS future_date FROM tbl_one"
).collect()[0]
curr_date = spark.sql(
    "SELECT cast(MIN(tm_update) as date) AS current_date FROM tbl_two"
).collect()[0]

print(futr_date)
print(curr_date)

if curr_date > futr_date:
    print("execute block of code that transforms table 2 data and append to table 1")
else:
    print("ignore and check again next week")
Output
Row(future_date=datetime.date(2023, 4, 15))
Row(current_date=datetime.date(2023, 2, 3))
ignore and check again next week
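The gating logic itself boils down to plain date arithmetic once the two scalar dates have been collected; a minimal sketch without Spark (function name and signature are illustrative, not from the original post):

```python
from datetime import date, timedelta

def should_append(max_history_date: date, current_date: date, period_days: int = 90) -> bool:
    """Return True when the weekly table's timestamp has passed the 90-day mark."""
    future_date = max_history_date + timedelta(days=period_days)
    return current_date > future_date

# With the dates from the question: 2023-01-15 + 90 days = 2023-04-15,
# and the weekly refresh is at 2023-02-03, so no append yet.
print(should_append(date(2023, 1, 15), date(2023, 2, 3)))  # False
```

This mirrors the answer's comparison of the two collected Row values, just with the dates extracted first.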

Related

How can I convert 1 record with a start and end date into multiple records for each day in DolphinDB?

So I have a table with the following columns:
For each record in the above table (e.g., stock A with an ENTRY_DT of 2011.08.22 and a REMOVE_DT of 2011.09.03), I'd like to replicate it for each day between the start and end date (excluding weekends). The converted records keep the same values of the fields S_INFO_WINDCODE and SW_IND_CODE as the original record.
Table after conversion should look like this:
(only records of stock A are shown)
As the data volume is not large, you can process each record with cj (cross join), then use the function unionAll to combine all records into the output table.
The table:
t = table(`A`B`C as S_INFO_WINDCODE, `6112010200`6112010200`6112010200 as SW_IND_CODE, 2011.08.22 1998.11.11 1999.05.27 as ENTRY_DT, 2011.09.03 2010.10.08 2011.09.30 as REMOVE_DT)
Solution:
def f(t, i) {
    windCode = t[i][`S_INFO_WINDCODE]
    code = t[i][`SW_IND_CODE]
    entryDate = t[i][`ENTRY_DT]
    removeDate = t[i][`REMOVE_DT]
    days = entryDate..removeDate
    days = days[weekday(days) between 1:5]
    return cj(table(windCode as S_INFO_WINDCODE, code as SW_IND_CODE), table(days as DT))
}
unionAll(each(f{t}, 0..(size(t) - 1)), false)
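The same per-record expansion can be sketched in plain Python (the column names mirror the DolphinDB table; the function itself is illustrative): each record becomes one row per weekday between its entry and remove dates.

```python
from datetime import date, timedelta

def expand_record(wind_code: str, ind_code: str, entry: date, remove: date):
    """Replicate one record for each weekday (Mon-Fri) in [entry, remove]."""
    rows = []
    day = entry
    while day <= remove:
        if day.weekday() < 5:  # 0=Monday .. 4=Friday, so < 5 excludes weekends
            rows.append((wind_code, ind_code, day))
        day += timedelta(days=1)
    return rows

rows = expand_record("A", "6112010200", date(2011, 8, 22), date(2011, 9, 3))
print(len(rows))  # 10 weekdays between 2011-08-22 and 2011-09-03
```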

SQL - How to get rows within a date period that are within another date period?

I have the following table in the DDBB:
On the other side, I have an interface with start and end filter parameters.
I want to understand how to query the table to get only the rows whose period is within the values introduced by the user.
Next I present the 3 possible scenarios. If I need to create one query per scenario, that's OK:
Scenario 1: if the user only defines start = 03/01/2021, then the expected output should be rows with id 3, 5 and 6.
Scenario 2: if the user only defines end = 03/01/2021, then the expected output should be rows with id 1 and 2.
Scenario 3: if the user defines start = 03/01/2021 and end = 05/01/2021, then the expected output should be rows with id 3 and 5.
Hope that makes sense.
Thanks
I will assume that start_date and end_date here are DateFields [Django-doc], and that you have a dictionary with 'start' and 'end' as (optional) keys that map to date objects, so a possible dictionary could be:
# scenario 3
from datetime import date

data = {
    'start': date(2021, 1, 3),
    'end': date(2021, 1, 5),
}
If you do not want to filter on start and/or end, then either the key is not in the dictionary data, or it maps to None.
You can make a filter with:
filtr = {
    lu: data[ky]
    for ky, lu in (('start', 'start_date__gte'), ('end', 'end_date__lte'))
    if data.get(ky)
}
result = MyModel.objects.filter(**filtr)
This will then filter the MyModel objects to only retrieve MyModels where the start_date and end_date are within bounds.
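The dictionary-building part can be exercised on its own without Django installed (MyModel is assumed from the question; here we only construct the lookup kwargs, wrapped in a hypothetical helper):

```python
from datetime import date

def build_filter(data: dict) -> dict:
    """Map optional 'start'/'end' dates to Django-style lookup kwargs."""
    return {
        lu: data[ky]
        for ky, lu in (('start', 'start_date__gte'), ('end', 'end_date__lte'))
        if data.get(ky)
    }

# scenario 1: only a start date, so only the __gte lookup is produced
print(build_filter({'start': date(2021, 1, 3), 'end': None}))
# {'start_date__gte': datetime.date(2021, 1, 3)}
```

Because `data.get(ky)` is falsy for both a missing key and None, either way of leaving a bound unset produces no lookup for it.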

Complex INSERT INTO SELECT statement in SQL

I have two tables in SQL. I need to add rows from one table to another. The table to which I add rows looks like:
timestamp, deviceID, value
2020-10-04, 1, 0
2020-10-04, 2, 0
2020-10-07, 1, 1
2020-10-08, 2, 1
But I have to add a row to this table if a state for a particular deviceID was changed in comparison to the last timestamp.
For example, the record "2020-10-09, 2, 1" won't be added because the value wasn't changed for deviceID = 2 since the last timestamp ("2020-10-08"). At the same time, the record "2020-10-09, 1, 0" will be added, because the value for deviceID = 1 was changed to 0.
I have a problem with writing a query for this logic. I have written something like this:
insert into output
select *
from values
where value != (
select value
from output
where timestamp = (select max(timestamp) from output) and output.deviceID = values.deviceID)
Of course it doesn't work because of the last part of the query "and output.deviceID = values.deviceID".
Actually the problem is that I don't know how to take the value from "output" table where deviceID is the same as in the row that I try to insert.
I would use order by and something to limit to one row:
insert into output
select *
from values v
where v.value <> (select o2.value
                  from output o2
                  where o2.deviceId = v.deviceId
                  order by o2.timestamp desc
                  fetch first 1 row only
                 );
The above is standard SQL. Specific databases may have other ways to express this, such as limit or top (1).
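The underlying rule — append a reading only when it differs from the device's most recent stored value — can be sketched in plain Python (hypothetical tuples of (timestamp, device_id, value); the helper name is illustrative):

```python
def rows_to_insert(output_rows, new_rows):
    """Keep only new rows whose value differs from the device's latest stored value."""
    latest = {}
    for ts, device, value in sorted(output_rows):
        latest[device] = value  # iterating in timestamp order, so the last write wins
    # A device with no history (latest.get returns None) is always inserted.
    return [r for r in new_rows if latest.get(r[1]) != r[2]]

output = [("2020-10-08", 2, 1), ("2020-10-07", 1, 1)]
new = [("2020-10-09", 2, 1), ("2020-10-09", 1, 0)]
print(rows_to_insert(output, new))  # [("2020-10-09", 1, 0)]
```

Note this compares against the latest value per device, matching the correlated subquery in the answer rather than the single global max(timestamp) of the original attempt.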

PowerBI Get Previous row value according to filters

I have a table with different objects and the objects evolve over time. One object is identified by object_number and we can track it with the object_line_number. And every evolution of the object has a status.
I want to calculate the time elapsed between some status.
Below is my table for one object_number "us1":
In yellow are the rows containing the starting date. They are found if (status_id = 0 and (old_status <> 0 or object_line_number = 1) and emergency_level = 1).
In green are the rows containing the ending date. They are found if (status_id in (2,3,4,5) and old_status = 0).
The column old_status does not exist in the table. It is the status of the previous row (according to the object_line_number). I am retrieving it thanks to the following measure:
old_status =
CALCULATE (
    MAX ( fact_object[status_id] ),
    FILTER (
        ALL ( fact_object ),
        fact_object[object_line_number]
            = IF (
                fact_object[object_line_number] = 1,
                fact_object[object_line_number],
                MAX ( fact_object[object_line_number] ) - 1
            )
    ),
    VALUES ( fact_object[object_number] )
)
I am in DirectQuery mode, so a lot of functions are not present for Calculated Columns, that's why I am using Measures.
Once that is done, I want then to be able to get for every green row the date_modification of the previous yellow row.
In this example, the result would be 4/4 then 1. So that I can calculate the time difference between the date_modification of the current green row and the date_modification of the previous yellow row.
So I was thinking of adding a new column named date_received, which is the date_modification of the previous yellow row;
From there, I just have to keep only the green rows and calculate the difference between date_modification and date_received.
My final calculation is actually this:
Result = (number of green rows whose date difference between date_modification and date_received is <= 15 min) / (number of green rows where DAY(date_modification) = DAY(date_received))
But I don't know how to do it.
I have tried in the same spirit of the old_status measure to do this:
date_received =
CALCULATE (
    MAX ( fact_object[date_modification] ),
    FILTER (
        ALL ( fact_object ),
        ( fact_object[object_line_number] = MAX ( fact_object[object_line_number] ) - 1 )
            && MY OTHER FILTERS
    ),
    VALUES ( fact_object[object_number] )
)
But didn't succeed.
In SQL, the equivalent would be like this:
SELECT
    SUM(CASE WHEN (DATEDIFF(MINUTE, T.date_received, T.date_planification) <= 15) THEN 1 ELSE 0 END) /
    SUM(CASE WHEN (DAY(T.date_received) = DAY(T.date_planification)) THEN 1 ELSE 0 END) AS result
FROM (
    SELECT *, T.status_id AS current_status,
        LAG(T.date_modification) OVER (PARTITION BY T.object_number ORDER BY T.object_line_number) AS date_received,
        T.date_modification AS date_planification
    FROM (
        SELECT *,
            LAG(status_id) OVER (PARTITION BY object_number ORDER BY object_line_number) AS old_status
        FROM dbo.fact_object
    ) AS T
    WHERE ((T.status_id = 0 AND (T.old_status <> 0 OR T.object_line_number = 1) AND T.emergency_level = 1)
        OR (T.old_status = 0 AND T.status_id IN (2, 3, 4, 5)))
) AS T
WHERE old_status = 0
(Well, maybe there is a better way to do it in SQL than what I've done.)
How can I achieve this?
In this case, I would first sort the table by Object then by Object Line number (or Date Modified - but these appear not to change for each row.)
Duplicate the table
Table1 - Add an index starting at 1
Table2 - Add an index starting at 0
Merge Table1 with Table2, using the new index fields.
Expand the merged Status ID (rename it to NextStatusID) and Object (rename it to NextObject) fields.
You can now create a new conditional column to compare the two status fields, e.g. Status2 = if Object = NextObject then [NextStatusID] else null.
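The duplicate-and-offset-index trick amounts to joining each row with its successor for the same object; a plain-Python sketch of the same idea (hypothetical row tuples of (object, line_number, status); the function name is illustrative):

```python
def add_next_status(rows):
    """Pair each row with the status of the next row for the same object."""
    # Sort by object then line number, as the answer recommends.
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    out = []
    for cur, nxt in zip(rows, rows[1:] + [None]):
        # The offset-index merge only matches within the same object;
        # the last row of each object gets no successor.
        next_status = nxt[2] if nxt and nxt[0] == cur[0] else None
        out.append((*cur, next_status))
    return out

rows = [("us1", 1, 0), ("us1", 2, 2), ("us2", 1, 0)]
print(add_next_status(rows))
# [('us1', 1, 0, 2), ('us1', 2, 2, None), ('us2', 1, 0, None)]
```

The conditional column described above plays the role of the `nxt[0] == cur[0]` check: it nulls out matches that cross an object boundary.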

SQL get values for only a few hours in given time period?

My first question here..!
I'm not an expert in SQL, so please bear with me! :)
I have a web page (not created by me) which gets report data from a MSSQL database, on the web page you enter start date and end date and data are fetched in this time interval from 00:00 on start date until 23:59 on end date.
I have managed to add more queries to the SQL, but now I would like to, for certain values only to return values which were logged in the time range 00:00:00 until 04:00:00 every day in the selected time interval.
Currently values are logged once an hour, but not always consistently. So far I have made a workaround in my web page which shows the first 4 values and skips the next 20; this loops for the selected interval. This method works 98% of the time, but occasionally there are more or fewer than 24 logged values per day, which can cause the shown values to be skewed one way or another.
What I would like to do is change my SQL query so that it only returns values in the time range I want (between midnight and 04:00) for every day in the selected period. I hope someone can help me achieve this or give me some hints! :)
This is the existing SQL query running with the variables which I do want all values for. There are more variables than this but I edited them out, all the Ren*Time variables is the ones I want to make a 4-hour-every-day version of.
SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#tmpValues') IS NOT NULL BEGIN DROP TABLE #tmpValues END;
CREATE TABLE #tmpValues(Id INT PRIMARY KEY IDENTITY(1,1),BatchId INT, TimePoint DATETIME, Ren1Time DECIMAL(10,2), Ren2Time DECIMAL(10,2), Ren3Time DECIMAL(10,2), RenTotTime DECIMAL(10,2));
INSERT INTO #tmpValues(BatchId)
SELECT BatchId
FROM Batch
WHERE Batch.LogTime BETWEEN <StartTime> AND <StopTime>;
CREATE UNIQUE INDEX I_BatcId ON #tmpValues(BatchId);
UPDATE #tmpValues SET
TimePoint = (SELECT LogTime FROM Batch WHERE Batch.BatchId = #tmpValues.BatchId),
Ren1Time = (SELECT SUM(_Float) FROM LogData WHERE LogData.BatchId = #tmpValues.BatchId AND LogData.TagId = 21),
Ren2Time = (SELECT SUM(_Float) FROM LogData WHERE LogData.BatchId = #tmpValues.BatchId AND LogData.TagId = 25),
Ren3Time = (SELECT SUM(_Float) FROM LogData WHERE LogData.BatchId = #tmpValues.BatchId AND LogData.TagId = 29),
RenTotTime = (SELECT SUM(_Float) FROM LogData WHERE LogData.BatchId = #tmpValues.BatchId AND (LogData.TagId = 25 OR LogData.TagId = 29 OR LogData.TagId = 33));
DECLARE
    @TimePoint DATETIME,
    @Ren1Time FLOAT,
    @Ren2Time FLOAT,
    @Ren3Time FLOAT,
    @RenTotTime FLOAT;
INSERT INTO #tmpValues(TimePoint, Ren1Time, Ren2Time, Ren3Time, RenTotTime)
VALUES(@TimePoint, @Ren1Time, @Ren2Time, @Ren3Time, @RenTotTime);
SET NOCOUNT OFF;
SELECT * FROM #tmpValues;
IF OBJECT_ID('tempdb..#tmpValues') IS NOT NULL BEGIN DROP TABLE #tmpValues END;
Don't mess around with temp tables and processing every column separately. I also have no idea what you're trying to do with those variables. You declare them, never set them, then do an INSERT with them, which will just insert a row of NULL values.
Assuming that you're using SQL Server, the DATEPART function will let you get the hour of the day.
SELECT
    B.BatchId,
    B.LogTime AS TimePoint,
    SUM(CASE WHEN LD.TagId = 21 THEN LD._Float ELSE 0 END) AS Ren1Time,
    SUM(CASE WHEN LD.TagId = 25 THEN LD._Float ELSE 0 END) AS Ren2Time,
    SUM(CASE WHEN LD.TagId = 29 THEN LD._Float ELSE 0 END) AS Ren3Time,
    SUM(CASE WHEN LD.TagId IN (25, 29, 33) THEN LD._Float ELSE 0 END) AS RenTotTime
FROM
    dbo.Batch B
    INNER JOIN LogData LD ON LD.BatchId = B.BatchId
WHERE
    B.LogTime BETWEEN <StartTime> AND <StopTime> AND
    DATEPART(HOUR, B.LogTime) BETWEEN 0 AND 4
GROUP BY
    B.BatchId,
    B.LogTime
Thank you very much for your swift reply!
I couldn't exactly get your suggestion to work, but I did solve my problem thanks to your suggestion!
Since the SQL query which should be limited to the 4 hours is used only from one webpage and there are no values in this web page which should be shown for all 24 hours, I thought I could copy the existing .SQL file to a new file and simply change the following:
WHERE Batch.LogTime BETWEEN <StartTime> AND <StopTime>;
To
WHERE Batch.LogTime BETWEEN <StartTime> AND <StopTime> AND DATEPART(HOUR, Batch.LogTime) BETWEEN 0 AND 3;
I changed the call in the web page from the old .SQL file to the new one and it works! Simple as that! I also changed 0 AND 4 to 0 AND 3, since I found that 03:59:59 would still be included, which is exactly what I want :)
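For reference, the hour-window rule can be checked outside SQL too; a minimal Python sketch (function name is illustrative) mirroring DATEPART(HOUR, ...) BETWEEN 0 AND 3:

```python
from datetime import datetime

def in_night_window(ts: datetime) -> bool:
    """True for timestamps from 00:00:00 up to and including 03:59:59."""
    return 0 <= ts.hour <= 3

print(in_night_window(datetime(2023, 5, 1, 3, 59, 59)))  # True
print(in_night_window(datetime(2023, 5, 1, 4, 0, 0)))    # False
```

Like the HOUR-based filter, this truncates to the hour, which is why BETWEEN 0 AND 3 still admits every timestamp before 04:00:00.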