Want to get order cancellation rate per week for prior 12 months - sql

I have a table with one row per transaction, along with a field that shows how many units were cancelled in the order. If I filter the table on cancelled_units > 0, I can pull all transactions that were cancelled. There is also detailed date information for each transaction, but I think I only need the date. I need to create a rate calculation of total cancelled orders / total orders to get the cancellation rate, and then spread that out across every week for the past 12 months. I was thinking maybe a CASE statement with some sort of counter in place? Also, I am using Databricks, so maybe there is some built-in function or operator that would make this easier. Appreciate you taking a look at my question.

From the context provided, you have a data frame with the list of transactions. It is also clear that there is a transaction column, a timestamp indicating when the order was placed, and the number of units cancelled in each transaction. So, when you filter your data frame on the condition cancelled_units > 0 and count the rows, you get the number of cancelled orders.
Using Spark in Databricks:
Now, to find the rate of cancellation (cancelled_orders / total_orders) for every week in the past 12 months: I was able to find a way that calculates the rate of cancellation per week within a PARTICULAR YEAR (not a rolling past 12 months), so gather all the records for the year of interest first.
Since the timestamp indicating when the order was placed is already available in the data frame, we can use it to find out in which week of the year each transaction was made. You can use the following to achieve this (similar syntax for both PySpark and Spark with Scala):
df.withColumn("order_placed_week",date_format(col("transaction_date"), "w")).show()
Here transaction_date is a timestamp or date column. If it is stored as a string instead, parse it first with to_date(), for example:
df.withColumn("order_placed_week", date_format(to_date("transaction_date", "dd/MM/yyyy"), "w")).show()
The to_date() function lets you specify the format in which transaction_date is stored in your data frame (note that months are MM; mm would be minutes).
The required imports are:
For PySpark:
from pyspark.sql.functions import to_date, date_format, col
Reference: https://www.datasciencemadesimple.com/get-month-year-and-quarter-from-date-in-pyspark/
For Spark (Scala):
import org.apache.spark.sql.functions._
Reference: https://sparkbyexamples.com/spark/spark-how-to-get-a-day-and-week-of-year/
After completing this step, you can use the resulting data frame with the order_placed_week column to get the cancellation rate.
Get the count of orders for each week number, and then the count of orders with cancelled units, using groupBy and filter. Dividing count_of_cancelled / total_count for each week gives the desired result, as sketched below.
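Since the question mentions Databricks, the whole calculation can also be expressed directly in Databricks SQL. This is a minimal sketch, not a tested solution: the table name orders is an assumption, while transaction_date and cancelled_units are the column names used above; adjust all three to your schema.
SELECT
  date_trunc('WEEK', transaction_date)                        AS order_week,
  COUNT(*)                                                    AS total_orders,
  COUNT(CASE WHEN cancelled_units > 0 THEN 1 END)             AS cancelled_orders,
  -- Spark SQL's / performs fractional division, so this yields a rate between 0 and 1
  COUNT(CASE WHEN cancelled_units > 0 THEN 1 END) / COUNT(*)  AS cancellation_rate
FROM orders
WHERE transaction_date >= add_months(current_date(), -12)
GROUP BY date_trunc('WEEK', transaction_date)
ORDER BY order_week
Unlike the week-of-year approach, this covers a rolling 12-month window because it groups on each week's actual start date rather than its week number.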

Related

Creating custom timestamp buckets in BigQuery

I have an hourly (timestamp) dataset of events from the past month.
I would like to check the performance of events that occurred between certain hours, group them together and average the results.
For example: AVG income of the hours 23:00-02:00 per user:
So, given the data set below, I'd like to summarise the coloured rows and then average them (the result should be 218).
I tried NTILE but it couldn't divide the data properly while ignoring the irrelevant hours.
Is there a good way to create these custom buckets using SQL?
(dataset screenshot)
From the description I'm not exactly sure how you want to aggregate; if you provide an example dataset I can update the answer.
However, you can easily achieve this with AVG and an IF statement:
AVG(IF(EXTRACT(HOUR FROM timestamp_field) BETWEEN 0 AND 4, value, NULL)) AS avg_value
Using the above you can then group by either day or month to get the aggregation level you want.
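Put together, a hedged sketch for the 23:00-02:00 window from the question might look like the following; the table name events and the columns user_id, event_ts and income are assumptions, and because the window wraps around midnight the filter uses OR rather than a single BETWEEN.
SELECT
  user_id,
  DATE(event_ts) AS event_day,
  -- rows outside 23:00-02:00 become NULL and are ignored by AVG
  AVG(IF(EXTRACT(HOUR FROM event_ts) >= 23 OR EXTRACT(HOUR FROM event_ts) < 2,
         income, NULL)) AS avg_income_23_to_02
FROM events
GROUP BY user_id, event_day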

Query to find average stock ... with a twist

We are trying to calculate average stock from a movements table in a single sql sentence.
As far as we've got, no problem with what we thought was a standard approach: since we don't have daily stock, instead of adding up the daily stock and dividing by the number of days, we simply add up (movements * remaining days):
select sum(quantity*(END_DATE-move_date))/(END_DATE-START_DATE)
from move_table
where move_date<=END_DATE
This is a simplified example, in real life we already take care of the initial stock at the starting date. Let’s say there are no movements prior to start_date.
Quantity sign depends on move type (sale, purchase, inventory, etc).
Of course this is done grouping by product, warehouse, ... but you get the idea.
It works as expected and the calculation is correct.
But (there is always a "but"), our customer doesn't like counting days on which there is no stock (all stock sold out). So he doesn't like
Sum of (daily_stock) / number_of_days (which is what we calculate, just via different math)
Instead, he would like
Sum of (daily stock) / number_of_days_in_which_stock_is_not_zero
For sure we can do this in any programming language without much effort, but I was wondering how to do it using plain sql ... and wasn’t able to come up with a solution.
Any suggestion?
Consider creating a new table called something like Stock_EndOfDay_History that has the following columns.
stock#
date
stock_count_eod
This table would get a new row for each stock item at the start of a new day for the prior day. Rows could then be purged from this table once the applicable date value went outside the date window of interest.
To get the "number_of_days_in_which_stock_is_not_zero", use this.
SELECT COUNT(*) AS 'Not_Zero_Stock_Days' FROM Stock_EndOfDay_History
WHERE stock# = <stock#_value>
AND <date_window_clause>
Other approaches might attempt to just add a new column to the existing stock table to maintain a running count of the number_of_days_in_which_stock_is_not_zero. But inevitably, questions will be asked about how the non-zero stock days count was calculated, and the new-table approach answers those questions better than the new-column approach.
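With that history table in place, the customer's definition of average stock can be computed in plain SQL as well. A hedged sketch, reusing the placeholder style above (the <...> token is not literal SQL); only non-zero days enter the denominator:
SELECT stock#,
       SUM(stock_count_eod) * 1.0
         / NULLIF(COUNT(CASE WHEN stock_count_eod > 0 THEN 1 END), 0) AS avg_stock_nonzero_days
FROM Stock_EndOfDay_History
WHERE <date_window_clause>
GROUP BY stock#
The * 1.0 forces decimal division in dialects that would otherwise truncate, and NULLIF avoids a division-by-zero error for items that had no stock at all during the window.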

SQL GROUPING SETS averages with multiple many-to-many dimensions

I have a table of data with the following:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform,Dt,(Platform,Dt),()) aggregation to be able to find for each combination of Platform and Dt the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple as these can be achieved via a sum(Total_Purchases) and count(distinct user) respectively.
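For reference, a sketch of the query for those first two measures might look like this; the table name user_activity is an assumption, and the quoting of the reserved word User may need adjusting for your SQL dialect:
SELECT Platform,
       Dt,
       SUM(Total_Purchases)   AS Total_Purchases,
       COUNT(DISTINCT "User") AS Total_Unique_Users
FROM user_activity
GROUP BY GROUPING SETS ((Platform), (Dt), (Platform, Dt), ())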
The problem I have is with the last metric. The result set should look like this but I don't know how to get the last column to be calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,1,1,1.0
,08/05/2016,1,1,1.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows, the average purchases per user per day is a simple division of the first two columns, because each of those rows covers a single date. But for the final 3 rows that division does not give the desired result, because the metric has to average each day's rate in turn to get the overall per-day figure. For example, the Android-only row is (4/2 + 5/3) / 2 ≈ 1.8, not 9/4 = 2.25.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!

Advanced partitions query

I have a table that contains something similar to the following columns:
infopath_form_id (integer)
form_type (integer)
approver (varchar)
event_timestamp (datetime)
This table contains the approval history for an infopath form and each form that is submitted in the system is given a unique infopath_form_id for this to be stored against. There is no consistent number of approvers for each form (as it differs based on the value of the transaction) however there is always at least two approvers for a form. Each approval task is written as another row to the table and only history of previous approvals is stored within this table.
What I need to find out is the average time that is taken between approvals for each form type. I've tried tackling this every which way using partitions but I'm getting stuck given that there isn't a fixed number of approvers for each form. How should I approach this problem?
I believe you want this:
SELECT infopath_form_id
     , DATEDIFF(MINUTE, MIN(event_timestamp), MAX(event_timestamp)) / CAST(COUNT(*) - 1 AS FLOAT) AS avg_minutes_between_approvals
FROM Table
GROUP BY infopath_form_id
That will give you the average number of minutes between the first and last entry for each InfoPath_form_id.
Explanation of functions used:
MIN() returns the earliest date
MAX() returns the latest date
DATEDIFF() returns the difference between two dates in a given unit (MINUTE in this example)
COUNT() returns the number of rows per grouping item (ie InfoPath_form_id)
So simply divide the total minutes elapsed by one less than the number of records giving you the average number of minutes between events.
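Since the question asks for the average per form type rather than per form, one hedged way to extend this (the table name approval_history is a placeholder, and form_type is the column listed in the question) is to average the per-form figures:
SELECT form_type,
       AVG(avg_minutes_between_approvals) AS avg_minutes_between_approvals
FROM (
    SELECT infopath_form_id,
           form_type,
           DATEDIFF(MINUTE, MIN(event_timestamp), MAX(event_timestamp))
               / CAST(COUNT(*) - 1 AS FLOAT) AS avg_minutes_between_approvals
    FROM approval_history
    GROUP BY infopath_form_id, form_type
) AS per_form
GROUP BY form_type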

SQL queries for finding the sales trend

Suppose I have a table which has all the billing records. Now I want to see the sales trend for a user over a given time duration, grouped by each 3 days. What should the SQL query for this be?
Please help, otherwise I am gone ...
I can only give a vague suggestion given the question, but you may want to add a derived column with a standardised date (as per the MS date format, just a number per day) and then apply a modulus of 3 to it (e.g. day_number - day_number % 3) so that every 3 consecutive days share the same value. You can then group and aggregate over this column to get the values for each 3-day period. Obviously, to display the date nicely you would have to convert the bucket value back into a date.
Again, I'm not sure of the specifics, but I think this general idea could be used to get a result (it may well not be the best way, so it would help to add more detail to the question). A rough sketch follows.
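A hedged sketch of that idea in SQL Server syntax; the table name billing and the columns user_id, bill_date and amount are placeholders (as are the <...> tokens), and any fixed anchor date works for numbering the days:
SELECT user_id,
       -- day number since the anchor date, snapped down to a multiple of 3
       DATEADD(DAY,
               DATEDIFF(DAY, '20000101', bill_date)
                 - DATEDIFF(DAY, '20000101', bill_date) % 3,
               '20000101') AS bucket_start_date,
       SUM(amount) AS total_sales
FROM billing
WHERE user_id = <user_id>
  AND bill_date >= <start_date> AND bill_date < <end_date>
GROUP BY user_id,
         DATEADD(DAY,
                 DATEDIFF(DAY, '20000101', bill_date)
                   - DATEDIFF(DAY, '20000101', bill_date) % 3,
                 '20000101')
ORDER BY bucket_start_date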