Iterate through table by date column for each common value of different column - sql

Below I have the following table structure:
CREATE TABLE StandardTable
(
RecordId varchar(50),
Balance float,
Payment float,
ProcDate date,
RecordIdCreationDate date,
-- multiple other columns used for calculations
)
And here is what a small sample of what my data might look like:
RecordId  Balance  Payment  ProcDate    RecordIdCreationDate
1         1000     100      2005-01-01  2005-01-01
2         5000     250      2008-01-01  2008-01-01
3         7500     350      2006-06-01  2006-06-01
1         900      100      2005-02-01  NULL
2         4750     250      2008-02-01  NULL
3         7150     350      2006-07-01  NULL
The table holds data on a transactional basis and has millions of rows. The ProcDate field indicates the month in which each transaction is processed. Regardless of when the transaction occurs during the month, ProcDate is hard-coded to the first of that month, so a transaction that occurred on 2009-01-17 has a ProcDate of 2009-01-01. I'm dealing with historical data that goes back as far as 2005-01-01. There are multiple instances of each RecordId in the table: a RecordId shows up in every month until its Balance reaches 0. Some RecordIds originate in the month the data starts (where ProcDate is 2005-01-01) and others don't originate until a later date. The RecordIdCreationDate field holds the date on which the RecordId originated, so that column has millions of NULL values: it is NULL in every month other than the one in which the RecordId originated.
I need to somehow look at each RecordId, and run a number of different calculations on a month to month basis. What I mean is I have to compare column values for each RecordId where the ProcDate might be something like 2008-01-01, and compare those values to the same column values where the ProcDate would be 2008-02-01. Then after I run my calculations for the RecordId in that month, I have to compare values from 2008-02-01 to values in 2008-03-01 and run my calculations again, etc. I'm thinking that I can do this all within one big WHILE loop, but I'm not entirely sure what that would look like.
The first thing I did was create another table in my database that had the same table design as my StandardTable and I called it ProcTable. In that ProcTable, I inserted all of the data where the RecordIdCreationDate was not equal to NULL. This gave me the first instance of each RecordId in the database. I was able to run my calculations for the first month successfully, but where I'm struggling is how I use the column values in the ProcTable, and compare those to the column values where the ProcDate is the month after that. Even if I could somehow do that, I'm not sure how I would repeat that process to compare the 2nd month's data to the 3rd month's data, and the 3rd month's data to the 4th month's data, etc.
Any suggestions? Thanks in advance.

It seems to me that all you need to do is JOIN the table to itself on this condition:
ON MyTable1.RecordId = MyTable2.RecordId
AND MyTable1.ProcDate = DATEADD(month, -1, MyTable2.ProcDate)
Then you will have all the rows in your table (MyTable1), joined to the same RecordId's row from the next month (MyTable2).
And in each row you can do whatever calculations you want between the two joined tables.
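As a rough illustration of the self-join described above, here is a minimal sketch that runs against an in-memory SQLite database from Python. SQLite is only a stand-in for the asker's database: it has no DATEADD, so date(x, '+1 month') is used in its place, and the aliases cur and nxt and the BalanceDrop column are hypothetical names chosen for the example.

```python
import sqlite3

# In-memory SQLite stand-in for StandardTable (an assumption for this sketch;
# the original question targets a different SQL dialect).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StandardTable ("
    "RecordId TEXT, Balance REAL, Payment REAL, ProcDate TEXT)"
)
rows = [
    ("1", 1000, 100, "2005-01-01"),
    ("2", 5000, 250, "2008-01-01"),
    ("3", 7500, 350, "2006-06-01"),
    ("1", 900, 100, "2005-02-01"),
    ("2", 4750, 250, "2008-02-01"),
    ("3", 7150, 350, "2006-07-01"),
]
conn.executemany("INSERT INTO StandardTable VALUES (?, ?, ?, ?)", rows)

# Pair each month's row (cur) with the same RecordId's row for the
# following month (nxt), then compute a value across the pair.
month_pairs = conn.execute(
    """
    SELECT cur.RecordId,
           cur.ProcDate,
           nxt.ProcDate,
           cur.Balance - nxt.Balance AS BalanceDrop
    FROM StandardTable AS cur
    JOIN StandardTable AS nxt
      ON nxt.RecordId = cur.RecordId
     AND nxt.ProcDate = date(cur.ProcDate, '+1 month')
    ORDER BY cur.RecordId, cur.ProcDate
    """
).fetchall()

for pair in month_pairs:
    print(pair)
```

Because every consecutive pair of months is produced by one set-based query, no WHILE loop over the months should be needed; a RecordId's last month simply has no partner row and drops out of the inner join.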

Related

Issues excluding data in SSAS cube (different output in SSMS and SSAS using EOMONTH())

I'm creating a cube whose stock values should only include each month's last-day balance and value.
Hence, I've created the following Query in SSMS:
Create table #theStockTable(
Stock int,
StockValue INT,
DateKey int
)
INSERT INTO #theStockTable
VALUES(3,5, 20170211),
(3,5,20170228),
(1,4,20170331),
(1,4,20170330)
SELECT CAST(CONVERT(varchar, DateKey, 112) AS numeric(8, 0)) AS DateKey, SUM(Stock) AS [CL Stock], SUM(StockValue) AS [CL Stock Value]
FROM #theStockTable
WHERE CONVERT(date, CONVERT(varchar(10), DateKey)) = eomonth(CONVERT(date, CONVERT(varchar(10), DateKey)))
GROUP BY DateKey
In SSMS this returns the correct values:
DateKey CL Stock CL Stock Value
20170228 3 5
20170331 1 4
However, when I create an OLAP cube in SSAS, use the query above both as the named query for my fact table #theStockTable and as the query for that fact table's only partition, and then deploy and execute the cube, I get different values for each day of every month, but I only want the values for each month's last day.
I have used New Project.. -> Import from Server (multidimensional model or data mining model) in SSAS. It is important that the users must be able to browse the cube as they presently do.
The cube whose metadata I have copied contains every day's values for the stock table. Is there some metadata change I need to make in addition to the query modification I have made under Edit Named Query.. in the Data Source View and replacing the old query in the partition with my new query?
Hopefully someone can shed some light on this.
EDIT
To clarify my request: some users of the cube have explained that it is rather slow to browse in, for instance, Excel, mainly because my Stock measure is much bigger than it needs to be. As it is now, it returns every StockValue and Stock of each product for each day. I only want to include the total balance of StockValue and Stock on the last day of the month. All other stock values are redundant.
For instance, browsing my DimDate dimension table with the measurements Stock and StockValue should have this return set:
DateKey Stock StockValue
20170131 0 0
rather than the whole return set which is returned now:
DateKey Stock StockValue
20170101 3 5
20170102 4 6
20170103 1 1
20170131 0 0
I think you already have a date dimension in your cube. If so, follow these steps:
1. Add an additional attribute [IsLastDay] with value 0/1 to the date dimension to indicate whether the current date record is the last day of its month.
2. Add a calculated measure [CalStock] with this formula:
([Measures].[StockValue],[Date].[IsLastDay].&[1])
3. Run this query to return the expected result:
select {[CalStock]} on 0,
non empty{[Date].[Date].[Date]} on 1
from [YourCube]
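For the [IsLastDay] attribute, the 0/1 flag has to be derived from the date key somehow. The following is only a sketch of that derivation in Python, assuming integer keys in yyyymmdd form as in the question; the function name is_last_day is hypothetical, and in practice the flag would more likely be computed in the date dimension's source query.

```python
import calendar


def is_last_day(date_key: int) -> int:
    """Return 1 if a yyyymmdd integer key is the last day of its month, else 0."""
    year, rest = divmod(date_key, 10000)
    month, day = divmod(rest, 100)
    # monthrange returns (weekday of the 1st, number of days in the month).
    return int(day == calendar.monthrange(year, month)[1])


# The four DateKey values from the question's sample data.
date_keys = [20170211, 20170228, 20170331, 20170330]
flags = {key: is_last_day(key) for key in date_keys}
print(flags)
```

Only 20170228 and 20170331 get the flag, which matches the two rows the question's EOMONTH filter keeps.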

Calculating the number of new ID numbers per month in powerpivot

My dataset provides a monthly snapshot of customer accounts. Below is a very simplified version:
Date_ID | Acc_ID
------- | -------
20160430| 1
20160430| 2
20160430| 3
20160531| 1
20160531| 2
20160531| 3
20160531| 4
20160531| 5
20160531| 6
20160531| 7
20160630| 4
20160630| 5
20160630| 6
20160630| 7
20160630| 8
Customers can open or close their accounts, and I want to calculate the number of 'new' customers every month. The number of 'exited' customers will also be helpful if this is possible.
So in the above example, I should get the following result:
Month | New Customers
------- | -------
20160430| 3
20160531| 4
20160630| 1
Basically I want to compare distinct account numbers in the selected and previous month, any that exist in the selected month and not previous are new members, any that were there last month and not in the selected are exited.
I've searched but I can't seem to find any similar problems, and I hardly know where to start myself - I've tried using CALCULATE and FILTER along with DATEADD to filter the data to get two months, and then count the unique values. My PowerPivot skills aren't up to scratch to solve this on my own however!
Getting the new users is relatively straightforward - I'd add a calculated column which counts rows for that user in earlier months and if they don't exist then they are a new user:
=IF(CALCULATE(COUNTROWS(data),
FILTER(data, [Acc_ID] = EARLIER([Acc_ID])
&& [Date_ID] < EARLIER([Date_ID]))) = BLANK(),
"new",
"existing")
Once this is in place you can simply write a measure for new_users:
=CALCULATE(COUNTROWS(data), data[customer_type] = "new")
Getting the cancelled users is a little harder because it means you have to be able to look backwards to the prior month - none of the time intelligence stuff in PowerPivot will work out of the box here as you don't have a true date column.
It's nearly always good practice to have a separate date table in your PowerPivot models and it is a good way to solve this problem - essentially the table should be 1 record per date with a unique key that can be used to create a relationship. Perhaps post back with a few more details.
This is an alternative to Jacob's method which also works. It avoids creating a calculated column, although I actually find the calculated column useful as a flag against other measures.
=CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATESBETWEEN(
'Dates'[Date], 0, LASTDATE('Dates'[Date])
)
) - CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATESBETWEEN(
'Dates'[Date], 0, FIRSTDATE('Dates'[Date]) - 1
)
)
It basically uses the dates table to make a distinct count of all Acc_ID from the beginning of time until the first day of the period of time selected, and subtracts that from the distinct count of all Acc_ID from the beginning of time until the last day of the period of time selected. This is essentially the number of new distinct Acc_ID, although you can't work out which Acc_ID's these are using this method.
I could then calculate 'exited accounts' by taking the previous month's total as 'existing accounts':
=CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATEADD('Dates'[Date], -1, MONTH)
)
Then adding the 'new accounts', and subtracting the 'total accounts':
=DISTINCTCOUNT('Accounts'[Acc_ID])
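The arithmetic behind these measures can be checked outside PowerPivot. Below is a minimal Python sketch using the sample data from the question; it treats an account as new when it is absent from the previous snapshot, which matches the "never seen before" count only under the assumption that a closed account never reappears. The names snapshots and summary are hypothetical.

```python
# Monthly snapshots from the question: Date_ID -> account IDs present.
snapshots = {
    20160430: [1, 2, 3],
    20160531: [1, 2, 3, 4, 5, 6, 7],
    20160630: [4, 5, 6, 7, 8],
}

previous = set()
summary = {}  # Date_ID -> (new accounts, exited accounts)
for date_id in sorted(snapshots):
    current = set(snapshots[date_id])
    new = current - previous      # in this month, not in the previous one
    exited = previous - current   # in the previous month, not in this one
    # Same identity as the measures: exited = previous total + new - current total.
    assert len(exited) == len(previous) + len(new) - len(current)
    summary[date_id] = (len(new), len(exited))
    previous = current

for date_id, (new_count, exited_count) in summary.items():
    print(date_id, new_count, exited_count)
```

For the sample data this gives 3, 4 and 1 new accounts, and the three accounts that disappear in 20160630 show up as exited.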

Calculation for month number in time series data

The data I am working with is oil and gas production data. The production table uniquely identifies each well and contains a time series of production values. I want to be able to calculate a column that contains the month number occurrence of production for every well in the production table. This needs to be a calculation, so I can graph the production for various wells based on the production month, not the calendar month. (I want to compare well performance across wells over the life of wells.) Also note that there could be gaps in the production data so you can't depend on having twelve months of sequential production for each well.
I tried using the answer in this post (RankValues), but the calculation would never finish. I have over 4 million rows of production data.
In the table shown below, the values in ProdMonth are what I need to calculate, based on their time occurrence shown in ProdDate. This needs to be performed as a row calculation for each unique WellID.
Thanks.
WellID ProdDate ProdMonth
1 12/1/2011 1
1 1/1/2012 2
1 2/1/2012 3
1 3/1/2012 4
… … …
1 11/1/2012 12
2 3/1/2014 1
2 4/1/2014 2
2 5/1/2014 3
2 6/1/2014 4
2 7/1/2014 5
… … …
2 2/1/2015 12
I would create a new date table that has a row for each day (the granularity of your data). I would then add the ProdMonth column to that table. This ensures you have dates for all days, even if there are gaps in the well reporting data. Then you can create a relationship between the well production data and the date table on the ProdDate field. If you then pull in ProdMonth from the date table, you'll have a list of all of the ProdMonths (hint: you may need to select 'show values with no data' on the field's right-click menu in the fields well). If you then add WellID to the same visualization, you should be able to see which wells were active in which ProdMonth. If WellID is a number, you might need to use the 'do not summarize' feature on WellID to get the result you want.
I posted this question on PowerPivotPro and Tom Allan provided the DAX formula I needed. The first step was to calculate a field that concatenates year and month (YearMonth). I then used the RANKX function as follows:
= RANKX ( FILTER ( Data, [WellID] = EARLIER ( [WellID] ) ), [YearMonth], , 1, DENSE )
That did the trick and performed fairly quickly on 12mm rows.
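To make the dense-rank idea concrete outside DAX, here is a small Python sketch. It assumes YearMonth is the integer year * 100 + month, uses a few rows from the question's table for wells 1 and 2, and adds a hypothetical well 3 with a gap to show that the rank counts months with production rather than calendar months elapsed.

```python
from collections import defaultdict
from datetime import date

# (WellID, ProdDate) rows; well 3 is hypothetical and skips May 2014.
production = [
    (1, date(2011, 12, 1)),
    (1, date(2012, 1, 1)),
    (1, date(2012, 2, 1)),
    (2, date(2014, 3, 1)),
    (2, date(2014, 4, 1)),
    (3, date(2014, 4, 1)),
    (3, date(2014, 6, 1)),
]


def year_month(d: date) -> int:
    """Sortable YearMonth key, e.g. 2011-12-01 -> 201112."""
    return d.year * 100 + d.month


# Collect the distinct producing months per well.
months_by_well = defaultdict(set)
for well_id, prod_date in production:
    months_by_well[well_id].add(year_month(prod_date))

# Dense ascending rank of each YearMonth within its well.
rank_by_well = {
    well_id: {ym: rank for rank, ym in enumerate(sorted(months), start=1)}
    for well_id, months in months_by_well.items()
}

prod_month = [
    (well_id, prod_date, rank_by_well[well_id][year_month(prod_date)])
    for well_id, prod_date in production
]
for row in prod_month:
    print(row)
```

Sorting each well's months once and looking ranks up in a dictionary avoids rescanning the whole table for every row.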

Cumulative sum function in hive

Is there any way to get the cumulative count(customer_id) for today's date and a number of days leading up to it? I am running the count in Hive. The date column is in this format:
20120907
I have 2 columns in my dataset, customer_id and date.
There are also partitions in my table and some of the values in the customer_id column are NULL. I am not sure if there are duplicates so I will use
count(distinct(customer_id))
Here is an example of my data.
customer_id date
10001 20140901
10003 20141001
NULL 20150101
10007 20150102
Please let me know if you need any more info.
I have the same problem, getting the cumulative count of distinct users per day.
The difficulty here is that you can hardly pre-aggregate the counts per day and sum them up, because the days might have "overlapping" users, so you would count those users multiple times instead of once.
But I stumbled upon this approach, which basically hashes all users into a "sketch_set" per day and later unions the different hash sets, to which it then applies a count estimation.
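The overlap problem is easy to see with exact sets. The sketch below is not the sketch-set approach itself: it is a plain Python illustration that keeps one exact set of customers seen so far, which a probabilistic sketch would approximate at scale. The last row is a hypothetical repeat customer added to show the overlap; the other rows come from the question.

```python
# (customer_id, date) rows; None stands in for NULL.
rows = [
    (10001, 20140901),
    (10003, 20141001),
    (None, 20150101),
    (10007, 20150102),
    (10001, 20150102),  # hypothetical repeat of an earlier customer
]

seen = set()      # every distinct customer_id observed up to the current day
cumulative = {}   # date -> cumulative distinct customers through that date
for customer_id, day in sorted(rows, key=lambda row: row[1]):
    if customer_id is not None:  # count(distinct ...) ignores NULLs
        seen.add(customer_id)
    cumulative[day] = len(seen)

for day, total in cumulative.items():
    print(day, total)
```

Summing per-day distinct counts would give 4 by 20150102 (1 + 1 + 0 + 2), whereas the union of the daily sets gives 3, because customer 10001 appears on two days.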

Number of absent rows in daterange

I have a table with following structure
transaction_id user_id date_column
1 1 01-08-2011
2 2 01-08-2011
3 1 02-08-2011
4 1 03-08-2011
There can be at most one entry for each user on each date.
How can I get all dates within a specific date range on which a given user_id is not present?
So for the above table with user_id = 2 and date range 01-08-2011 to 03-08-2011, I want
result
02-08-2011
03-08-2011
Right now, I am using a for loop to loop over all dates in the given date range.
This is working fine with small date range, but I think it will become resource heavy for large one.
As suggested in a comment, create a table with the dates of interest (I'll call it datesofinterest). Every date from your date range needs to be put into this table.
datesofinterest table
--------------
date
--------------
01-08-2011
02-08-2011
03-08-2011
Then the datesofinterest table needs to be joined with all the userids -- this is the set of all possible combinations of dates-of-interest and userids.
Now you have to remove all those dates-of-interest/userids that are currently in your original table to get your final answer.
In relational algebra, it'd be something like:
(datesofinterest[date] x transaction[user_id]) - (transaction[date_column, user_id])
This page may help with translating '-' to SQL. Generating dates to populate the datesofinterest table can be done in SQL, manually, or with a helper program (e.g. Perl's DateTime).
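One possible translation of that relational-algebra expression is a cross join followed by EXCEPT. The sketch below runs it on an in-memory SQLite database from Python. Several things are assumptions made for the example: the table is named transactions (TRANSACTION is a keyword in SQLite), dates are stored as ISO yyyy-mm-dd text so that they sort correctly, and the date range is generated in Python rather than in SQL.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "transaction_id INTEGER, user_id INTEGER, date_column TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 1, "2011-08-01"),
        (2, 2, "2011-08-01"),
        (3, 1, "2011-08-02"),
        (4, 1, "2011-08-03"),
    ],
)

# Populate datesofinterest with every day in the range, inclusive.
conn.execute("CREATE TABLE datesofinterest (date TEXT)")
start, end = date(2011, 8, 1), date(2011, 8, 3)
days = [(start + timedelta(days=n)).isoformat() for n in range((end - start).days + 1)]
conn.executemany("INSERT INTO datesofinterest VALUES (?)", [(d,) for d in days])

# (all dates x all users) minus the (date, user) pairs that actually exist.
missing = conn.execute(
    """
    SELECT d.date, u.user_id
    FROM datesofinterest AS d
    CROSS JOIN (SELECT DISTINCT user_id FROM transactions) AS u
    EXCEPT
    SELECT date_column, user_id FROM transactions
    ORDER BY 2, 1
    """
).fetchall()

for row in missing:
    print(row)
```

For the sample data only user 2 has gaps, on the second and third day of the range, which corresponds to the result the question asks for.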