identifying trends and classifying using sql - sql

i have a table xyz, with three columns rcvr_id,mth_id and tpv. rcvr_id is an id given to a customer, mth_id is a column which stores the month number( mth_id is calculated as (2012-1900) * 12 + 1,2,3.. ( depending on the month). So for example Dec 2011 will have month_id of 1344, Jan 2012 1345 etc. Tpv is a variable which shows the customers transaction amount.
Example table
rcvr_id mth_id tpv
1 1344 23
2 1344 27
3 1344 54
1 1345 98
3 1345 102
.
.
.
so on
P.S if a customer does not have a transaction in a given month, his row for that month wont exist.
Now, the question. Based on transactions for the months 1327 to 1350, i need to classify a customer as steady or sporadic.
Here is a description.
The above image is for 1 customer. i have millions of customers.
How do i go about it? I have no clue how to identify trends in sql .. or rather how to do it the best way possible.
ALSO i am working on teradata.
Ok i have found out how to get standard deviation. Now the important question is : How do i set a standard deviation limit on my own? i just cant randomly say "if standard dev is above 40% he is sporadic else steady". I thought of calculating average of standard deviation for all customers and if it is above that then he is sporadic else steady. But i feel there could be a better logic

I would suggest the STDDEV_POP function - a higher value indicates a greater variation in values.
select
rcvr_id, STDDEV_POP(tpv)
from yourtable
group by rcvr_id
STDDEV_POP is the function for Standard Deviation
If this doesn't differentiate enough, you may need to look at regression functions and variance.

Related

How to calculated on created fields? Why the calculation is wrong?

I am working on the workforce analysis project. And I did some case when conditional calculations in Google Data Studio. However, when I successfully conducted the creation of the new field, I couldn't do the calculation again based on the fields I created.
Based on my raw data, I generated the start_headcount, new_hires, terminated, end_headcount by applying the Case When conditional calculations. However, I failed in the next step to calculate the Turnover rate and Retention rate.
The formula for Turnover rate is
terms/((start_headcount+end_headcount)/2)
for retention is
end_headcount/start_headcount
However, the result is wrong. Part of my table is as below:
Supervisor sheadcount newhire terms eheadcount turnover Retention
A 1 3 1 3 200% 0%
B 6 2 2 6 200% 500%
C 6 1 3 4 600% 300%
So the result is wrong. The turnover rate for A should be 1/((1+3)/2)=50%; For B should be 2/((6+6)/2)=33.33%.
I don't know why it is going wrong. Can anyone help?
For example, I wrote below for start_headcount for each employee
CASE
WHEN Last Hire Date<'2018-01-01' AND Termination Date>= '2018-01-01'
OR Last Hire Date<'2018-01-01' AND Termination Date IS NULL
THEN 1
ELSE 0
END
which means if an employee meets the above standard, will get 1. And then they all grouped under a supervisor. I think it might be the problem why the turnover rate in sum is wrong since it is not calculated on the grouped date but on each record and then summed up.
Most likely you are trying to do both steps within the same query and thus newly created fields like start_headcount, etc. not visible yet within the same select statement - instead you need to put first calculation as a subquery as in example below
#standardSQL
SELECT *, terms/((start_headcount+end_headcount)/2) AS turnover
FROM (
<query for your first step>
)

Calculation for month number in time series data

The data I am working with is oil and gas production data. The production table uniquely identifies each well and contains a time series of production values. I want to be able to calculate a column that contains the month number occurrence of production for every well in the production table. This needs to be a calculation, so I can graph the production for various wells based on the production month, not the calendar month. (I want to compare well performance across wells over the life of wells.) Also note that there could be gaps in the production data so you can't depend on having twelve months of sequential production for each well.
I tried using the answer in this postRankValues but the calculation would never finish. I have over 4 million rows of production data.
In the table shown below, the values shown in ProdMonth is what I need to calculate based on their time occurrence shown in ProdDate. This needs to be performed as a row calculation for each unique WellId
Thanks.
WellID ProdDate ProdMonth
1 12/1/2011 1
1 1/1/2012 2
1 2/1/2012 3
1 3/1/2012 4
… … …
1 11/1/2012 12
2 3/1/2014 1
2 4/1/2014 2
2 5/1/2014 3
2 6/1/2014 4
2 7/1/2014 5
… … …
2 2/1/2014 12
I would create a new date table that has a row for each day (the granularity of your data). I would then add to that table the ProdMonth column. This will ensure you have dates for all days (even if there are gaps in the well reporting data). Then you can use a relationship between the well production data and the Date table on the ProdDate field. Then if you pull in the ProdMonth from the date table, you'll have a list of all of the ProdMonths (hint: you may need to select 'show values with no data' on the field right click menu in the fields well). Then if you add to the same visualization WellID you should be able to see which wells were active in which ProdMonth. If WellID is a number, you might need do use the 'do not summarize' feature on the WellID to get the result you desire.
I posted this question on the PowerPivotPro and Tom Allan provided the DAX formula I needed. First step was to calculate a field that concatenated Year and Month (YearMonth). Then utilized the RANKXX function as such:
= RANKX ( FILTER ( Data, [WellID] = EARLIER ( [WellID] ) ), [YearMonth], , 1, DENSE )
That did the trick and performed fairly quickly on 12mm rows.

Calculating difference in column value from one row to the next

I am using an Access DB to keep track of Utility Usage on hundreds of accounts. The meters in these accounts have consumption values only per month. I need to take the consumption value of this month and subtract it from the value of the previous month. to get the consumption for this month. I know that in SQL Server, there is a lead/lag function that can calculate those differences. Is there a similar function in access? or is there a simple way to subtract the value in one row from the one above it?
Ex.
The first Line is Billed Date
The Second Line is the Meter Reading
The Third Line is Consumption
1/26/2014
2/25/2014
3/27/2014
4/28/2014
5/26/2014
7/29/2014
0
3163
4567
5672
7065
8468
1538
1625
1404
1105
1393
1403
I do not quite get some of your results, but I think you want something like:
SELECT Meters.MeterDate,
Meters.MeterReading,
(SELECT TOP 1 MeterReading
FROM Meters m WHERE m.MeterDate <Meters.MeterDate
ORDER BY MeterDate DESC) AS LastReading,
[MeterReading]-Nz([LastReading],0) AS MonthResult
FROM Meters
ORDER BY Meters.MeterReading;

Rank in powerpivot

In Powerpivot, I have a problem in ranking in Table 1, based on Sales and Year. I want to have the result like that:
Year Store Sales **Rank**
2013 A 200 3
2013 B 250 2
2013 C 300 1
2014 A 350 2
2014 B 300 3
2014 C 400 1
Which rank function could I use to have this rank result?
Thanks in advance.
Tran,
Probably the smartest way to go is to use the 'X' functions. They can be a bit tricky and non intuitive, yet are extremely powerful.
First, create a simple measure to calculate the total sales:
TotalSales:=SUM(Stores[Sales])
Then, use this formula below to calculate the rank (per store per year):
Rank:=RANKX(ALL(Stores[Store]), [TotalSales])
That should do what you are looking for. Once those two measures are ready, create a new powerpivot table, dray Year and Store onto rows pane and add required values.
ALL function overwrites the applied rows filter and thus allows to calculate rank per year.
The result should look like this:
Hope this helps.

SQL YTD for previous years and this year

Wondering if anyone can help with the code for this.
I want to query the data and get 2 entries, one for YTD previous year and one for this year YTD.
Only way I know how to do this is as 2 separate queries with where clauses.. I would prefer to not have to run the query twice.
One column called DatePeriod and populated with 2011 YTD and 2012YTD, would be even better if I could get it to do 2011YTD, 2012YTD, 2011Total, 2012Total... though guessing this is 4 queries.
Thanks
EDIT:
In response to help clear a few things up:
This is being coded in MS SQL.
The data looks like so: (very basic example)
Date | Call_Volume
1/1/2012 | 4
What I would like is to have the Call_Volume summed up, I have queries that group it by week, and others that do it by month. I could pull all the dailies in and do this in Excel but the table has millions of rows so always best to reduce the size of my output.
I currently group by Week/Month and Year and union all so its 1 output. But that means I have 3 queries accessing the same table, large pain, very slow not efficient and that is fine but now I also need a YTD so its either 1 more query or if I could find a way to add it to the yearly query that would ideal:
So
DatePeriod | Sum_Calls
2011 Total | 40
2011 YTD | 12
2012 Total | 45
2012 YTD | 15
Hope this makes any sense.
SQL is built to do operations on rows, not columns (you select columns, of course, but aggregate operations are all on rows).
The most standard approach to this is something like:
SELECT SUM(your_table.sales), YEAR(your_table.sale_date)
FROM your_table
GROUP BY YEAR(your_table.sale_date)
Now you'll get one row for each year on record, with no limit to how many years you can process. If you're already grouping by another field, that's fine; you'll then get one row for each year in each of those groups.
Your program can then iterate over the rows and organize/render them however you like.
If you absolutely, positively must have columns instead, you'll be stuck with something like this:
SELECT SUM(IF(YEAR(date) = 2011, sales, 0)) AS total_2011,
SUM(IF(YEAR(date) = 2012, total_2012, 0)) AS total_2012
FROM your_table
If you're building the query programmatically you can add as many of those column criteria as you need, but I wouldn't count on this running very efficiently.
(These examples are written with some MySQL-specific functions. Corresponding functions exist for other engines but the syntax would be a little different.)