PowerPivot - only newest values on current context

I have a problem with PowerPivot.
Let's have a look at only 3 columns in my data source:
date - clientid - category
Category can only be 1 or 2.
The same clientid often appears several times within a given time period, sometimes with different categories.
In my pivot table, I can see the distinct count of my clients for the chosen timeline.
But, of course, the sum of the client counts for cat=1 and cat=2 is bigger than the overall distinct count.
Is it possible to count only the newest entry for every clientid, so that the sum of the two categories equals the distinct count of my clients?
Thanks in advance to everybody who helps and spends their time on this.
Stefan

This was fun! Thanks for an interesting problem. Normally for this sort of thing we might flag the most recent entry for a given clientid in an extra field, but yours needs to be dynamic at runtime based on your date filter selection.
Here we go. Be warned, it's a doozy.
CountCat:=
COUNTROWS(
    FILTER(
        GENERATE(
            VALUES( ClientCats[clientid] )
            ,CALCULATETABLE(
                SAMPLE(
                    1
                    ,SUMMARIZE(
                        ClientCats
                        ,ClientCats[date]
                        ,ClientCats[category]
                    )
                    ,ClientCats[date]
                    ,DESC
                )
                ,ALL( ClientCats[category] )
            )
        )
        ,CONTAINS(
            VALUES( ClientCats[category] )
            ,ClientCats[category]
            ,ClientCats[category]
        )
    )
)
Let's work through it.
COUNTROWS() is trivial.
FILTER() takes a table as its first argument. It creates a row context by iterating row-by-row through this table. It evaluates a boolean expression in each row context and returns the rows for which the expression returns true. We're not getting to that expression for a little while here. Let's look at the table we'll be filtering.
GENERATE() takes a table as its first argument and creates a row context by iterating row-by-row through that table. In each row context it evaluates a second table expression, and cross joins the current row from the first table with the rows that the second expression returns in that row context.
Our first table is VALUES( ClientCats[clientid] ), which is simply a distinct list of all [clientid]s in context from the pivot table.
We then evaluate CALCULATETABLE() for each row context, aka for each [clientid]. CALCULATETABLE() evaluates a table expression in the filter context determined by its second and subsequent arguments.
SAMPLE() is the table we'll evaluate. SAMPLE() is like TOPN(), but with ties broken non-deterministically. SAMPLE( 1, ... ) always returns one row. TOPN( 1, ... ) returns all rows that are tied for first position.
SAMPLE(), here, will return one row from the table defined by SUMMARIZE(). SUMMARIZE() groups by the fields in a table that are named. Thus we have a table of all distinct values of [date] and [category] that are included based on the context determined by our CALCULATETABLE(). SAMPLE()'s third argument defines a sort-by column to determine which rows are first, and its fourth determines the sort order. Thus for each [clientid] we are returning the latest row in the SUMMARIZE() for that [clientid].
The ALL() in our CALCULATETABLE() strips the context from the field [category] that might be coming in from our pivot table. This means that every time we evaluate our GENERATE() (remember we're still in that function here), we get a table of all [clientid]s that exist in context, and their most recent [category], even when we're evaluating in a pivot cell that has filtered [category].
That sounds like a problem - we'd expect the same count now for every pivot cell. And that's what we'd get if we did COUNTROWS( GENERATE() ). But wait, we're still in FILTER()!
Now we get to the boolean expression which will filter the rows of that GENERATE(). CONTAINS() takes a table as its first argument, a reference to a column in that table as its second argument, and a scalar value as its third argument. It returns true if the column in argument 2, of the table in argument 1, contains the value in argument 3.
We are outside of the CALCULATETABLE(), and therefore context exists on [category]. VALUES() returns the unique rows in context. In any pivot cell filtered by [category], this will be a 1x1 table, but in our grand total, it will have multiple rows.
So, the column in that VALUES() we want to test is [category] (the only column that exists in that VALUES()).
The value we want to test for is referred to by ClientCats[category]. That third argument evaluates [category] in the row context determined by FILTER(). Thus we return true for every row that matches the current filter context (in a pivot cell) of ClientCats[category]. Mind-bending stuff here.
Anyway, the upshot is that in a [category]-filtered pivot cell, we get the number of distinct [clientid]s that have, for the time frame selected, that [category] value as their most recent category.
For the grand total we get every [clientid] in context.
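If it helps to see the logic outside of DAX, here is a minimal Python sketch of what the measure computes. The sample rows and the function name count_cat are made up for illustration, and ties on the same date are resolved by keeping the first row seen, whereas the measure picks one of the tied rows arbitrarily.

```python
from datetime import date

# Hypothetical sample rows: (date, clientid, category)
rows = [
    (date(2015, 1, 5), "A", 1),
    (date(2015, 2, 9), "A", 2),
    (date(2015, 1, 20), "B", 1),
    (date(2015, 3, 1), "C", 2),
    (date(2015, 3, 15), "C", 1),
]


def count_cat(rows, start, end, categories=None):
    """Count clients whose newest category within [start, end] is in `categories`.

    categories=None plays the role of the grand total (no category filter).
    """
    latest = {}  # clientid -> (date, category) of the newest row in the date range
    for d, client, cat in rows:
        if start <= d <= end:
            if client not in latest or d > latest[client][0]:
                latest[client] = (d, cat)
    return sum(
        1
        for _, cat in latest.values()
        if categories is None or cat in categories
    )


full_year = (date(2015, 1, 1), date(2015, 12, 31))
print(count_cat(rows, *full_year, categories={1}))  # clients whose newest category is 1
print(count_cat(rows, *full_year, categories={2}))  # clients whose newest category is 2
print(count_cat(rows, *full_year))                  # grand total
```

The two per-category counts add up to the grand total, and narrowing the date range changes which row counts as the newest for each client.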
This will probably not perform well on large tables, because the SAMPLE() is evaluated once for every [clientid] in context.
Here's a sample workbook with the working measure defined, to play with.
Edit
Based on replies below.
Do you need to maintain in the model all the rows that have [UseClient] <> 1? Deduping and flagging is always easier in tools other than Power Pivot.
I have no idea how you've determined the values for 1 in [UseClient]. None are the most recent entry for a given [ClientID]. If you want to just flag the most recent row, which is what it sounds like you want, but not what your workbook looks like, you can do a calculated column much more easily than doing this in a measure:
=SAMPLE(
    1
    ,CALCULATETABLE( // return all dates for the [clientid] on current row
        VALUES( ClientCats[date] )
        ,ALLEXCEPT( ClientCats, ClientCats[clientid] )
    )
    ,ClientCats[date]
    ,DESC
) = ClientCats[date] // row context in table
This will return true when the value of [date] on a given row is equal to the maximum [date] for the client on that row.
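For comparison, the same flagging logic as a rough Python sketch. The rows and the function name flag_latest are hypothetical; like the calculated column, it ignores any date selection and looks at all rows for a client.

```python
from datetime import date

# Hypothetical sample rows: (date, clientid, category)
rows = [
    (date(2015, 1, 5), "A", 1),
    (date(2015, 2, 9), "A", 2),
    (date(2015, 1, 20), "B", 1),
    (date(2015, 3, 1), "C", 2),
    (date(2015, 3, 15), "C", 1),
]


def flag_latest(rows):
    """Return one boolean per row: True when the row carries the client's max date."""
    max_date = {}  # clientid -> newest date over all rows for that client
    for d, client, _ in rows:
        if client not in max_date or d > max_date[client]:
            max_date[client] = d
    return [d == max_date[client] for d, client, _ in rows]


flags = flag_latest(rows)
print(flags)
```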
One thing you could easily do in Power Query is to group [clientid] and take the max date for each [clientid]. Then you have one row per client.
This is all different than your original question, though, because your original wants to find the maxes based on date selection. But a calculated column is not updated based on filter context. It's only recalculated at model refresh time. If you're willing to use a calculated column, then just deal with your data issues before bringing it into Power Pivot.

Related

How to add column to an existing table and calculate the value

Table info:
I want to add a new column that holds the difference between each alarmTime value and the previous one, using this code:
ALTER TABLE [DIALinkDataCenter].[dbo].[DIAL_deviceHistoryAlarm]
ADD dif AS (DATEDIFF(HOUR, LAG((alarmTime)) OVER (ORDER BY (alarmTime)), (alarmTime)));
How can I add this calculation to the table? I always get an error like this:
Windowed functions can only appear in the SELECT or ORDER BY clauses.
You are using the syntax for a computed column, which shows a calculated value (ADD columnname AS expression).
This, however, only works on values found in the same row. You cannot have a computed column that looks at other rows.
If you are now considering creating a normal column and filling it with calculated values: this is something you shouldn't do. Don't store values redundantly. You can always get the difference in an ad-hoc query. If you store it redundantly instead, you will have to account for it in every insert, update, and delete. And if at some point you find rows where the difference doesn't match the time values, which column holds the correct value and which the incorrect one? alarmtime or dif? You won't be able to tell.
What you can do instead is create a view for convenience:
create view v_dial_devicehistoryalarm as
select
dha.*,
datediff(hour, lag(alarmtime) over (order by alarmtime), alarmtime) as dif
from dial_devicehistoryalarm dha;
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=b7f9b5eef33e72955c7f135952ef55b5
Remember, though, that your view will probably read and sort the whole table every time you access it. If you only query a certain time range, it will therefore be faster to calculate the differences in your query instead.
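To try the view idea without a SQL Server instance, here is a small sketch using Python's built-in sqlite3 module as a stand-in. It assumes the bundled SQLite is version 3.25 or later (needed for window functions). SQLite has no DATEDIFF, so the hour difference is approximated with julianday(), which rounds elapsed time rather than counting hour boundaries the way DATEDIFF(HOUR, ...) does. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dial_devicehistoryalarm (
        id INTEGER PRIMARY KEY,
        alarmTime TEXT
    );
    INSERT INTO dial_devicehistoryalarm (alarmTime) VALUES
        ('2021-01-01 00:00:00'),
        ('2021-01-01 03:00:00'),
        ('2021-01-02 03:00:00');

    -- The window function lives in the view, not in a table column.
    CREATE VIEW v_dial_devicehistoryalarm AS
    SELECT
        dha.*,
        CAST(ROUND((julianday(alarmTime)
                    - julianday(LAG(alarmTime) OVER (ORDER BY alarmTime))) * 24)
             AS INTEGER) AS dif
    FROM dial_devicehistoryalarm dha;
""")

diffs = [
    row[0]
    for row in conn.execute(
        "SELECT dif FROM v_dial_devicehistoryalarm ORDER BY alarmTime"
    )
]
print(diffs)  # the first row has no predecessor, so its dif is NULL (None)
```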

Querying a table from a parameter in a BigQuery UDF

I am trying to create a UDF that will find the maximum value of a field called 'DatePartition' for each table that is passed through to the UDF as a parameter. The UDF I have created looks like this:
CREATE TEMP FUNCTION maxDatePartition(x STRING) AS ((
SELECT MAX(DatePartition) FROM x WHERE DatePartition >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(),INTERVAL 7 DAY)
));
but I am getting the following error: "Table name "x" missing dataset while no default dataset is set in the request."
The table names will get passed to the UDF in the format:
my-project.my-dataset.my-table
EDIT: Adding more context: I have multiple tables that are meant to update every morning with yesterday's data. Sometimes the tables are updated later than expected so I am creating a view which will allow users to quickly see the most recent data in each table. To do this I need to calculate MAX(DatePartition) for all of these tables in one statement. The list of tables will be stored in another table but it will change from time to time so I can't hardcode them in.
I tried to do this in a single statement, but found I needed a common table expression as a sorting mechanism. I haven't had success using the MAX() function on TIMESTAMPs. Here is the most concise method that has worked for me. No UDF needed. Try something like this:
WITH
  DATA AS (
    SELECT
      ROW_NUMBER() OVER (PARTITION BY your_group_by_fields ORDER BY DatePartition DESC) AS _row,
      *
    FROM
      `my-project.my-dataset.my-table`
    WHERE
      DatePartition >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  )
SELECT
  * EXCEPT(_row)
FROM
  DATA
WHERE
  _row = 1;
This creates a new field holding a row number within each partition of the grouped field(s), for groups that have multiple records with different timestamps. Within each group, the records are ordered by most recent DatePartition first and numbered, with "1" being the most recent because we sorted by DatePartition DESC.
The outer query then takes the common table expression, returns every column except the "_row" number you assigned, and keeps only the rows where "_row = 1", which are your most recent records.
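The row-number pattern described above can be tried locally with a rough sketch in Python's sqlite3 module (assuming SQLite 3.25+ for window functions). This is not BigQuery: the table my_table, the column grp, and the rows are hypothetical, SQLite has no "* EXCEPT", so the columns are listed explicitly, and the 7-day filter is left out so the fixed sample data always matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (grp TEXT, DatePartition TEXT, val INTEGER);
    INSERT INTO my_table VALUES
        ('a', '2024-01-01', 10),
        ('a', '2024-01-03', 30),
        ('b', '2024-01-02', 20);
""")

latest = conn.execute("""
    WITH data AS (
        SELECT
            -- 1 = most recent row within each group
            ROW_NUMBER() OVER (PARTITION BY grp ORDER BY DatePartition DESC) AS _row,
            grp, DatePartition, val
        FROM my_table
    )
    SELECT grp, DatePartition, val
    FROM data
    WHERE _row = 1
    ORDER BY grp
""").fetchall()

print(latest)
```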

Return the row includes the maximum value of specific column if two rows have the same values.

I have a result from a SQL query, which is displayed below.
I want to build a SQL query that, whenever two or more rows have the same number in the first column, returns only the row with the maximum number in the last column.
For instance, in the table the top two rows have the same number in the first column, 2195333. The query should return the first row and all the remaining rows, discarding only the 2nd row, since its last column is 1, which is smaller than the 2 in the 1st row.
I was thinking about using a WHILE loop in SQL: loop from the first row to the last and, whenever several rows share the same value in the first column, return the one with the maximum value in the last column. Since I am new to SQL, I have no idea how to implement it. Please help me. Thanks
The question, sample data, and desired results are lacking a bit.
But if I understand your question, you can use the WITH TIES clause in concert with Row_Number()
Example
Select Top 1 with ties *
From YourTable
Order By Row_Number() over (Partition By YourCol1 Order By YourLastCol Desc)
Edit: Use Dense_Rank() instead of Row_Number() if you want to see ties.
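To see the difference between the two ranking functions, here is a rough sketch using Python's sqlite3 module (SQLite 3.25+ assumed). SQLite has no TOP ... WITH TIES, so the rank is computed in a subquery and filtered to 1, which is a swapped-in equivalent rather than the query above. The Label column, the rows, and the helper top_per_group are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (YourCol1 INTEGER, Label TEXT, YourLastCol INTEGER);
    INSERT INTO YourTable VALUES
        (2195333, 'first', 2),
        (2195333, 'second', 1),
        (4000000, 'tie-a', 7),
        (4000000, 'tie-b', 7);
""")


def top_per_group(rank_fn):
    """Return the top-ranked row(s) per YourCol1 using the given ranking function."""
    # rank_fn is interpolated into the SQL text, so only allow known names.
    if rank_fn not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError(f"unsupported ranking function: {rank_fn}")
    sql = f"""
        SELECT YourCol1, Label, YourLastCol
        FROM (
            SELECT *,
                   {rank_fn}() OVER (PARTITION BY YourCol1
                                     ORDER BY YourLastCol DESC) AS rnk
            FROM YourTable
        )
        WHERE rnk = 1
        ORDER BY YourCol1, Label
    """
    return conn.execute(sql).fetchall()


one_per_group = top_per_group("ROW_NUMBER")  # exactly one row per YourCol1
with_ties = top_per_group("DENSE_RANK")      # keeps both tied rows for 4000000
print(one_per_group)
print(with_ties)
```

With ROW_NUMBER, which of the two tied rows survives is arbitrary unless you add a tie-breaking column to the ORDER BY.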

Get latest data for all people in a table and then filter based on some criteria

I am attempting to return the row of the highest value for timestamp (an integer) for each person (that has multiple entries) in a table. Additionally, I am only interested in rows with the field containing ABCD, but this should be done after filtering to return the latest (max timestamp) entry for each person.
SELECT table."person", max(table."timestamp")
FROM table
WHERE table."type" = 1
HAVING table."field" LIKE '%ABCD%'
GROUP BY table."person"
For some reason, I am not receiving the data I expect. The returned table is nearly twice the size of expectation. Is there some step here that I am not getting correct?
You can first compute max(timestamp) per person in a subquery, then join it back to the table to pick up the matching row and filter on it. The query is as follows:
SELECT t."person", t."timestamp"
FROM table t
JOIN (SELECT "person", MAX("timestamp") AS max_ts FROM table GROUP BY "person") m
  ON m."person" = t."person" AND m.max_ts = t."timestamp"
WHERE t."type" = 1 AND t."field" LIKE '%ABCD%'
Direct answer: as I understand your end goal, just move the HAVING clause to the WHERE section:
SELECT
table."person", MAX(table."timestamp")
FROM table
WHERE
table."type" = 1
AND table."field" LIKE '%ABCD%'
GROUP BY table."person";
This should return no more than 1 row per table."person", with their associated maximum timestamp.
As an aside, I'm surprised your query worked at all. Your HAVING clause referenced a column not in your query. From the documentation (and my experience):
The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed.
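Whether you filter before or after picking the latest row changes the result, which a small sketch with Python's sqlite3 module can show. The table name events and the rows are hypothetical, and SQLite stands in for the real database here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (person TEXT, "timestamp" INTEGER, type INTEGER, field TEXT);
    INSERT INTO events VALUES
        ('p1', 1, 1, 'xxABCDxx'),
        ('p1', 2, 1, 'other'),
        ('p2', 5, 1, 'ABCD');
""")

# Filter first (WHERE), then aggregate: the max is taken over matching rows only.
filter_first = conn.execute("""
    SELECT person, MAX("timestamp")
    FROM events
    WHERE type = 1 AND field LIKE '%ABCD%'
    GROUP BY person
    ORDER BY person
""").fetchall()

# Find each person's latest row first, then filter that row.
latest_first = conn.execute("""
    SELECT e.person, e."timestamp"
    FROM events e
    JOIN (SELECT person, MAX("timestamp") AS max_ts
          FROM events
          GROUP BY person) m
      ON m.person = e.person AND m.max_ts = e."timestamp"
    WHERE e.type = 1 AND e.field LIKE '%ABCD%'
    ORDER BY e.person
""").fetchall()

print(filter_first)  # p1 is kept, with its older matching row
print(latest_first)  # p1 is dropped because its latest row does not match
```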

What value is selected into parameter in SQL query without where clause

For example, I have this query
SELECT @param = column from table
What value is pulled into @param?
I tried this and can't figure out which value is being pulled. It is neither the oldest record nor the newest one.
The documentation states:
the variable is assigned the last value that is returned
But without a WHERE clause that uniquely identifies a row, or an ORDER BY clause that specifies a unique ordering, the row chosen for the variable assignment is undefined and not deterministic when the table has more than one row.
You could add ORDER BY to the query to return the last ordered row. A more efficient way to do that would be to use SELECT TOP(1)...ORDER BY...DESC. Conversely, SELECT TOP(1)...ORDER BY...ASC will return the first ordered row. Again, the ORDER BY column(s) need to be unique for a deterministic value.
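As a rough illustration of the ordering advice, here is a sketch using Python's sqlite3 module. SQLite has neither T-SQL variables nor TOP, so LIMIT 1 stands in for TOP(1) and the value is read into a Python variable; the table t and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT);
    INSERT INTO t VALUES (1, 'old'), (2, 'middle'), (3, 'new');
""")

# Ordering by a unique column makes the picked row deterministic.
last_value = conn.execute(
    "SELECT col FROM t ORDER BY id DESC LIMIT 1"
).fetchone()[0]
first_value = conn.execute(
    "SELECT col FROM t ORDER BY id ASC LIMIT 1"
).fetchone()[0]

print(last_value, first_value)
```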
This is the value in the column referenced. It seems like it should have a TOP 1 in it, with a WHERE Clause designed to fetch 1 row only.