I'm trying to calculate a boolean column based on a group and date range.
I have a table that records transactions with the following row structure:
Person GUID - Date - Payment Amount
There are multiple rows per person.
What I want is a new boolean column, called Recent, that indicates whether the person had a transaction within some time period, say the 3 days prior: True if they have, False if they have not.
Any idea for a query to do this?
It depends on what the reference point for "prior" is. If it's "now" (the current time), then it's quite easy: you want to find the max date per person and then filter on that being no more than some distance from the current time.
Take a look at window functions in Spark and how they can be used with time series.
To find the max date you'll use an expression such as
max(Date) over (partition by Person) as max_date
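Outside Spark, the same window expression can be sketched with SQLite (which also supports window functions) through Python's sqlite3 module. The table, column names, and sample rows below are hypothetical, and "now" is pinned to a fixed date so the result is deterministic:

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical stand-ins for the Person GUID / Date / Payment Amount columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (person TEXT, tx_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("a", "2024-01-01", 10.0),
        ("a", "2024-01-09", 20.0),
        ("b", "2023-12-01", 5.0),
    ],
)

# Pin "now" so the query result is deterministic in this sketch.
now = date(2024, 1, 10)
cutoff = (now - timedelta(days=3)).isoformat()

# Recent is true when the person's max date falls within the last 3 days.
rows = conn.execute(
    """
    SELECT person,
           tx_date,
           amount,
           MAX(tx_date) OVER (PARTITION BY person) >= ? AS recent
    FROM transactions
    ORDER BY person, tx_date
    """,
    (cutoff,),
).fetchall()
```

Person "a" transacted on 2024-01-09, within 3 days of the pinned "now", so all of their rows get recent = 1; person "b" gets 0.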
Hope this helps.
Related
I have the following query. It initially performs a sub-select by querying a table that is partitioned (sharded) by sample_date_time, filtering on a date range in the WHERE clause that is passed via parameters. Then the final SELECT selects the data to be returned.
The query currently returns data for the latest complete hour (from the beginning of the previous hour's boundary to the end of it). I want to adapt it to instead get the latest hour of data that contains any data sample, up to a maximum of approximately 5 hrs ago. The query can't use anything that invalidates the BigQuery cache within any given hour (e.g. I can't use a date function that gets the current date). The table data only updates every hour.
I'm thinking maybe I need to select the max sample_date_time in the initial sub-select, over a range of the last 5 hours. I could pass the hourly end boundary of the current time as a parameter, but I'm not seeing how I can limit the date range for which to retrieve the MAX, then use that max to get the start and end dates of the most recent hour that has any data.
WITH data AS (
  SELECT
    created_date_time,
    sample_date_time,
    station,
    channel,
    value
  FROM my.mart
  WHERE sample_date_time BETWEEN '2019-07-23 04:00:00.000000+00:00' AND '2019-07-23 04:59:59.999000+00:00'
    AND station = '[my_guid]'
)
SELECT sample_date_time, station, channel, value
FROM data
ORDER BY value desc, channel asc, sample_date_time desc
I am struggling with a DAX pattern to allow me to plot an average duration value on a chart.
Here is the problem: My dataset has a field called dtOpened which is a date value describing when something started, and I want to be able to calculate the duration in days since that date.
I then want to be able to create an average duration since that date over a time period.
It is very easy to do when thinking about the value as it is now, but I want to be able to show a chart that describes what that average value would have been over various time periods on the x-axis (month/quarter/year).
The problem that I am facing is that if I create a calculated column to find the current age (NOW() - [dtOpened]), then it always uses the NOW() function - which is no use for historic time spans. Maybe I need a Measure for this, rather than a calculated column, but I cannot work out how to do it.
I have thought about using LASTDATE (rather than NOW) to work out what the last date would be in the filter context of any single month/quarter/year, but if the current month is only halfway through, then it would probably need to use today's date as the value from which to subtract the dtOpened value.
I would appreciate any help or pointers that you can give me!
It looks like you have a table (let's call it Cases) storing your cases, one record per case, with fields like the following:
casename, dtOpened, OpenClosedFlag
You should create a date table with one record per day spanning your date range. The date table will have a month-ending date field identifying the last day of the month each date falls in (and similar fields for quarter & year). But this will be a disconnected date table: don't create a relationship between the Date on the date table and your case open date.
Then use the iterator function AVERAGEX to average the date differences.
Average Duration (days) :=
CALCULATE (
AVERAGEX ( Cases, MAX ( DateTable[Month Ending] ) - Cases[dtopened] ),
FILTER ( Cases, Cases[OpenClosedFlag] = "Open" ),
FILTER ( Cases, Cases[dtopened] <= MAX ( DateTable[Month Ending] ) )
)
Once you plot the measure against your Month you should see the average values represented correctly. You can do something similar for quarter & year.
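The measure's logic can also be checked in plain Python (standing in for DAX here, with made-up sample cases): for each month-ending date, average month_end - dtOpened over the cases that are flagged Open and were opened on or before that date.

```python
from datetime import date

# Hypothetical cases mirroring the Cases table: (casename, dtOpened, OpenClosedFlag).
cases = [
    ("c1", date(2024, 1, 10), "Open"),
    ("c2", date(2024, 1, 20), "Open"),
    ("c3", date(2024, 2, 5), "Closed"),
]

def avg_duration_days(month_ending):
    # Mirror the two FILTER() calls: Open cases, opened on/before the month end.
    ages = [
        (month_ending - opened).days
        for _, opened, flag in cases
        if flag == "Open" and opened <= month_ending
    ]
    return sum(ages) / len(ages) if ages else None

january = avg_duration_days(date(2024, 1, 31))   # (21 + 11) / 2 = 16.0
february = avg_duration_days(date(2024, 2, 29))  # (50 + 40) / 2 = 45.0
```

Note how the same open case grows older month over month, which is exactly what the disconnected date table lets the measure express.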
You're a genius, Rory. Thanks!
In my example, I had a dtClosed field rather than an Opened/Closed flag, so there was one extra piece of filtering to do to test if the Case was closed at that point in time. So my measure ended up looking like this:
Average Duration:=CALCULATE(
AVERAGEX(CasesOnly, MAX(DT[LastDateM]) - CasesOnly[Owner Opened dtOnly]),
FILTER(CasesOnly, OR(ISBLANK(CasesOnly[Owner Resolution dtOnly]),
CasesOnly[Owner Resolution dtOnly] > MAX(DT[LastDateM]))),
FILTER(CasesOnly, CasesOnly[Owner Opened dtOnly] <= MAX(DT[LastDateM]))
)
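That extra filter can be mimicked in plain Python too (hypothetical sample data; Python standing in for DAX): a case contributes to a month only if it was opened by the month end and was not yet resolved at that point.

```python
from datetime import date

# Hypothetical cases: (opened, resolved); None means still open.
cases = [
    (date(2024, 1, 10), None),
    (date(2024, 1, 20), date(2024, 2, 10)),
]

def avg_open_duration_days(month_ending):
    # Mirror OR(ISBLANK(resolved), resolved > month end), plus opened <= month end.
    ages = [
        (month_ending - opened).days
        for opened, resolved in cases
        if opened <= month_ending and (resolved is None or resolved > month_ending)
    ]
    return sum(ages) / len(ages) if ages else None

january = avg_open_duration_days(date(2024, 1, 31))   # both cases still open then
february = avg_open_duration_days(date(2024, 2, 29))  # second case resolved by then
```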
And to get the chart, I plotted the DT[Date] field on the x-axis.
Thanks very much again.
I would like to calculate Sum(QTY) until the start date of the month for a given date.
Basically I can calculate Sum(QTY) until given date in my measure like:
SumQTYTillDate:=CALCULATE(SUM([QTY]);FILTER(ALL(DimDateView[Date]);DimDateView[Date]<=MIN(DimDateView[Date])))
But I would also like to calculate Sum(QTY) for dates before 10/1/2015 - which is the first day of the selected date's month. I have changed the above measure and used the STARTOFMONTH function to find the first day of the month for a given date, like:
.......DimDateView[Date]<=STARTOFMONTH(MIN(DimDateView[Date]))))
but to no avail; it gives me
"A function ‘MIN’ has been used in a True/False expression that is
used as a table filter expression. This is not allowed."
What am I missing? How can I use STARTOFMONTH function in my measure?
Thanks.
STARTOFMONTH() must take a reference to a column of type Date/Time. MIN() returns a scalar value, not a column reference. Additionally, your measure wouldn't work anyway, because STARTOFMONTH() is evaluated in the row context of your FILTER(). The upshot of all this is that you would get a measure which just sums [QTY] across the first of every month in your data.
The built-in time intelligence functions tend to be unintuitive at best. I always suggest using your model and an appropriate FILTER() to get to what you want.
In your case, I'm not entirely sure what you're looking for, but I think you want the sum of [QTY] for all time before the start of the month that the date you've selected falls in. In this case it's really easy to do. Add a field to your date dimension, [MonthStartDate], which holds, for every date in the table, the date of the start of the month it falls in. Now you can write a measure as follows:
SumQTY:=SUM(FactQTY[QTY])
SumQTYTilStartOfMonth:=
CALCULATE(
[SumQTY]
;FILTER(
ALL(DimDateView)
;DimDateView[Date] < MIN(DimDateView[MonthStartDate])
)
)
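As a sanity check, the same filter logic can be sketched in plain Python (hypothetical fact rows; Python standing in for DAX): take MIN over the dates in the current context, derive that month's first day, and sum QTY over all strictly earlier dates.

```python
from datetime import date

# Hypothetical fact rows: (date, QTY).
facts = [
    (date(2015, 9, 15), 3),
    (date(2015, 9, 30), 4),
    (date(2015, 10, 1), 5),
    (date(2015, 10, 20), 6),
]

def sum_qty_til_start_of_month(selected_dates):
    # MIN(DimDateView[MonthStartDate]) for the selection, then filter Date < it.
    month_start = min(selected_dates).replace(day=1)
    return sum(qty for d, qty in facts if d < month_start)

# Selecting any October 2015 dates sums everything before 10/1/2015: 3 + 4.
total = sum_qty_til_start_of_month([date(2015, 10, 10), date(2015, 10, 20)])
```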
I'm currently working on a project in which I want to aggregate data (resolution = 15 minutes) to weekly values.
I have 4 weeks and the view should include a value for each week AND every station.
My dataset includes more than 50 stations.
What I have is this:
select name, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name
order by name
But it only displays the avg value of all weeks. What I need is avg values for each week and each station.
Thanks for your help!
The problem is that when you GROUP BY just name, you flatten the weeks and can only apply aggregate functions across all of them combined.
Your best option is to do a GROUP BY on both name and week so something like:
select name, week, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name, week
order by name
PS - It's not entirely clear whether you're suggesting that you need one set of results for stations and one for weeks, or whether you need a set of results for every week at every station (which this answer provides the solution for). If you require the former then separate queries are the way to go.
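A quick runnable check of the two-column grouping, using SQLite through Python's sqlite3 module with made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (name TEXT, week TEXT, parameter1 REAL, parameter2 REAL)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("s1", "29", 1.0, 10.0),
        ("s1", "29", 3.0, 30.0),
        ("s1", "30", 5.0, 50.0),
        ("s2", "29", 7.0, 70.0),
    ],
)

# One averaged row per (station, week) pair rather than one per station.
rows = conn.execute(
    """
    SELECT name, week, AVG(parameter1), AVG(parameter2)
    FROM data
    WHERE week IN ('29', '30', '31', '32')
    GROUP BY name, week
    ORDER BY name, week
    """
).fetchall()
```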
I have a table that contains something similar to the following columns:
infopath_form_id (integer)
form_type (integer)
approver (varchar)
event_timestamp (datetime)
This table contains the approval history for an InfoPath form; each form submitted in the system is given a unique infopath_form_id against which its history is stored. There is no consistent number of approvers for each form (it differs based on the value of the transaction), but there are always at least two approvers per form. Each approval task is written as another row to the table, and only the history of previous approvals is stored within this table.
What I need to find out is the average time that is taken between approvals for each form type. I've tried tackling this every which way using partitions but I'm getting stuck given that there isn't a fixed number of approvers for each form. How should I approach this problem?
I believe you want this:
SELECT infopath_form_id
, DATEDIFF(MINUTE, MIN(event_timestamp), MAX(event_timestamp)) / CAST(COUNT(*) - 1 AS FLOAT)
FROM Table
GROUP BY infopath_form_id
That will give you the average number of minutes between the first and last entry for each InfoPath_form_id.
Explanation of functions used:
MIN() returns the earliest date
MAX() returns the latest date
DATEDIFF() returns the difference between two dates in a given unit (MINUTE in this example)
COUNT() returns the number of rows per grouping item (ie InfoPath_form_id)
So simply divide the total minutes elapsed by one less than the number of records giving you the average number of minutes between events.
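The same arithmetic in plain Python, with hypothetical approval timestamps, to show the span / (rows - 1) division:

```python
from datetime import datetime

# Hypothetical approval history keyed by infopath_form_id.
events = {
    1: [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 10, 0)],
    2: [datetime(2024, 1, 2, 8, 0), datetime(2024, 1, 2, 9, 0)],
}

def avg_minutes_between(timestamps):
    # Total span between first and last event, divided by the number of gaps.
    span_minutes = (max(timestamps) - min(timestamps)).total_seconds() / 60
    return span_minutes / (len(timestamps) - 1)

averages = {form_id: avg_minutes_between(ts) for form_id, ts in events.items()}
# Form 1: 60 minutes across 2 gaps; form 2: 60 minutes across 1 gap.
```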