Querying a table from a parameter in a BigQuery UDF - google-bigquery

I am trying to create a UDF that will find the maximum value of a field called 'DatePartition' for each table that is passed through to the UDF as a parameter. The UDF I have created looks like this:
CREATE TEMP FUNCTION maxDatePartition(x STRING) AS ((
SELECT MAX(DatePartition) FROM x WHERE DatePartition >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(),INTERVAL 7 DAY)
));
but I am getting the following error: "Table name "x" missing dataset while no default dataset is set in the request."
The table names will get passed to the UDF in the format:
my-project.my-dataset.my-table
EDIT: Adding more context: I have multiple tables that are meant to update every morning with yesterday's data. Sometimes the tables are updated later than expected so I am creating a view which will allow users to quickly see the most recent data in each table. To do this I need to calculate MAX(DatePartition) for all of these tables in one statement. The list of tables will be stored in another table but it will change from time to time so I can't hardcode them in.

I have tried to do this in a single statement, but found I needed a common table expression as a sorting mechanism; I haven't had success using the MAX() function on TIMESTAMPs. Here is the most concise method that has worked best for me, and no UDF is needed. Try something like this:
WITH
DATA AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY your_group_by_fields ORDER BY DatePartition DESC) AS _row,
*
FROM
`my-project.my-dataset.my-table`
WHERE
DatePartition >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
SELECT
* EXCEPT(_row)
FROM
DATA
WHERE
_row = 1;
This creates a new field with a row number within each partition of whatever grouping field has multiple records with different timestamps. For each record in a given group, it orders the records by most recent DatePartition and assigns a row number, with "1" being the most recent since we sorted DatePartition DESC.
The outer query then takes your common table expression of sorted values and returns everything in your table (EXCEPT the "_row" number you assigned), filtering on "_row = 1", which gives you the most recent records.
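As a rough sanity check, here is a small Python sketch (made-up rows, with a hypothetical `group` field standing in for `your_group_by_fields`) of the keep-the-latest-row-per-partition logic the query implements:

```python
# Simulate ROW_NUMBER() OVER (PARTITION BY group ORDER BY DatePartition DESC) = 1:
# for each group, keep only the row with the largest DatePartition.
from itertools import groupby
from operator import itemgetter

rows = [
    {"group": "a", "DatePartition": "2024-01-01", "value": 1},
    {"group": "a", "DatePartition": "2024-01-03", "value": 2},
    {"group": "b", "DatePartition": "2024-01-02", "value": 3},
]

rows.sort(key=itemgetter("group"))  # groupby requires sorted input
latest = [
    max(g, key=itemgetter("DatePartition"))  # the row that would get _row = 1
    for _, g in groupby(rows, key=itemgetter("group"))
]
print(latest)  # one row per group, each with the most recent DatePartition
```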

Related

How to add column to an existing table and calculate the value

Table info:
I want to add new column and calculated the different of the alarmTime column with this code:
ALTER TABLE [DIALinkDataCenter].[dbo].[DIAL_deviceHistoryAlarm]
ADD dif AS (DATEDIFF(HOUR, LAG((alarmTime)) OVER (ORDER BY (alarmTime)), (alarmTime)));
How to add the calculation on the table? Because always there's error like this:
Windowed functions can only appear in the SELECT or ORDER BY clauses.
You are using the syntax for a generated virtual column that shows a calculated value (ADD columnname AS expression).
This, however, only works on values found in the same row. You cannot have a generated column that looks at other rows.
If you are now considering creating a normal column and filling it with calculated values: don't. Don't store values redundantly. You can always get the difference in an ad-hoc query. If you store it redundantly instead, you will have to account for it in every insert, update, and delete. And if at some point you find rows where the difference doesn't match the time values, which column holds the correct value and which the incorrect one, alarmtime or dif? You won't be able to tell.
What you can do instead is create a view for convenience:
create view v_dial_devicehistoryalarm as
select
dha.*,
datediff(hour, lag(alarmtime) over (order by alarmtime), alarmtime) as dif
from dial_devicehistoryalarm dha;
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=b7f9b5eef33e72955c7f135952ef55b5
Remember, though, that your view will probably read and sort the whole table every time you access it. If you query only a certain time range, it will therefore be faster to calculate the differences in that query instead.
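For intuition, here is a Python sketch (hypothetical alarm times) of what the view's LAG/DATEDIFF expression computes per row. Note that SQL Server's DATEDIFF(HOUR, ...) counts hour-boundary crossings, which can differ by one from the floored elapsed hours used below:

```python
# For rows ordered by alarmTime, dif is the hour difference from the
# previous row's alarmTime; the first row gets None (LAG returns NULL).
from datetime import datetime

alarm_times = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 11, 30),
    datetime(2024, 1, 2, 0, 30),
]

difs = [None] + [
    int((b - a).total_seconds() // 3600)  # rough DATEDIFF(HOUR, a, b) analogue
    for a, b in zip(alarm_times, alarm_times[1:])
]
print(difs)  # [None, 3, 13]
```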

I need to retrieve a column from already retrieved column from a table

I have a table with a column like this, which I retrieved with this query:
select distinct HDD_WP_RPTNG_AS_OF_SID
from wcadbo.WCA_MDW_D_HLDNGS_DATE
order by HDD_WP_RPTNG_AS_OF_SID desc;
Table:
HDD_WP_RPTNG_AS_OF_SID
20210501
20210430
20210429
20210428
It contains dates in integer format.
I wrote a query to retrieve another column with these dates in date format, and I named the column AS_OF_DATE, like this:
SELECT DISTINCT
HDD_WP_RPTNG_AS_OF_SID,
to_date(HDD_WP_RPTNG_AS_OF_SID,'YYYYMMDD') AS_OF_DATE
FROM
WCADBO.WCA_MDW_D_HLDNGS_DATE
ORDER BY
HDD_WP_RPTNG_AS_OF_SID DESC;
Result set:
HDD_WP_RPTNG_AS_OF_SID AS_OF_DATE
----------------------------------
20210501 01-MAY-21
20210430 30-APR-21
20210429 29-APR-21
20210428 28-APR-21
Now I need another column, Display_Date, in char type, which gives 'LastAvailableDate' for the latest date in the previous column and gives the date in char type for all other dates, like this.
I wrote this query, but it is not working:
SELECT
HDD_WP_RPTNG_AS_OF_SID,
AS_OF_DATE,
Display_date
FROM
(SELECT DISTINCT
HDD_WP_RPTNG_AS_OF_SID,
to_date(HDD_WP_RPTNG_AS_OF_SID,'YYYYMMDD') AS_OF_DATE
FROM
WCADBO.WCA_MDW_D_HLDNGS_DATE
ORDER BY
HDD_WP_RPTNG_AS_OF_SID DESC)
WHERE
Display_Date = (CASE
WHEN AS_OF_DATE = '01-MAY-21'
THEN 'Last_Available_date'
ELSE TO_CHAR(AS_OF_DATE, 'MON DD YYYY')
END);
Finally, I need three columns: one is already in the table (modified a bit), and the other two are derived ones (AS_OF_DATE and Display_Date) that I need to retrieve.
I'm a beginner in SQL and couldn't figure out how to derive a column from another computed column.
Kindly help, thank you.
BTW, I was doing this in Oracle SQL Developer.
It looks like you want something like this:
SELECT
subQ.HDD_WP_RPTNG_AS_OF_SID,
subQ.AS_OF_DATE,
(CASE
WHEN subQ.AS_OF_DATE = date '2021-05-01'
THEN 'Last_Available_date'
ELSE TO_CHAR(subQ.AS_OF_DATE, 'MON DD YYYY')
END) Display_date
FROM
(SELECT DISTINCT
tbl.HDD_WP_RPTNG_AS_OF_SID,
to_date(tbl.HDD_WP_RPTNG_AS_OF_SID,'YYYYMMDD') AS_OF_DATE
FROM
WCADBO.WCA_MDW_D_HLDNGS_DATE tbl) subQ
ORDER BY
subQ.HDD_WP_RPTNG_AS_OF_SID DESC
Comments
If you want to add computed columns, that is done in the projection (the select list)
Always compare dates to dates and strings to strings. So in your case statement, compare the date as_of_date against another date. In this case I'm using a date literal. You could also call to_date on a string parameter.
If you want the results of the query ordered by a particular column, you want that order by applied at the outermost layer of the query, not in an inline view.
You basically always want to use aliases when referring to any column in a query. It's less critical in situations where everything is coming from one table but as soon as you start referencing multiple tables in a query, it becomes annoying to look at a query and not sure where a column is coming from. Even in a query like this where there is an inline view and an outer query, it makes it easier to read the query if you're explicit about where the columns are coming from.
Do you really need the DISTINCT? I kept it because it was part of the original query, but I get antsy whenever I see a DISTINCT, particularly from people learning SQL. DISTINCT is a relatively expensive operation, and it is very commonly used to cover up an underlying issue in the query (i.e. that you're getting multiple rows because some other column you aren't showing has multiple values) that ought to be addressed properly (i.e. by adding an additional predicate to ensure that you're only getting each hdd_wp_rptng_as_of_sid once).
Storing dates as strings or numbers in tables (as is apparently done with hdd_wp_rptng_as_of_sid) is a really bad practice. If one person writes one row to the table where the value isn't in the right format, your query will suddenly stop working and start throwing errors, for example.
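The overall logic can be sketched in Python (hypothetical data, and computing the latest date dynamically rather than hard-coding it as the answer's date literal does): convert the YYYYMMDD keys to dates, then label the most recent one and format the rest:

```python
# Convert YYYYMMDD integer keys to dates, then build the display column:
# 'Last_Available_date' for the latest date, 'MON DD YYYY' for the others.
from datetime import datetime

sids = [20210501, 20210430, 20210429, 20210428]
as_of = {s: datetime.strptime(str(s), "%Y%m%d").date() for s in sids}
latest = max(as_of.values())

display = {
    s: "Last_Available_date" if d == latest else d.strftime("%b %d %Y").upper()
    for s, d in as_of.items()
}
print(display[20210501], display[20210430])
```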

Return All Historical Records for Accounts with Change in Specific Associated Value

I am trying to select all records in a time-variant Account table for each account with a change in an associated value (e.g. the maturity date). A change in the value will result in the most recent record for an account being end-dated and a new record (containing a new effective date of the following day) being created. The most recent records for accounts in this table have an end-date of 12/31/9000.
For instance, in the below illustration, account 44444444 would not be included in my query result set since it hasn't had a change in the value (and thus also has no additional records aside from the original); however, the other accounts have multiple changes in values (and multiple records), so I would want to see those returned.
Also, the table has a number of other fields (columns) not included below but for which changes in the values for these fields can trigger a new record being created; however, I only want to retrieve all records for those accounts where the figure in the “value” column has changed. What are some ways to obtain the results I need?
Note: The primary key for this table includes the acct_id and eff_dt, and I'm using PostgreSQL within a Greenplum environment.
Here are two types of queries I tried to use but which produced problematic results:
Query 1
Query 2
I think you want window functions to compare the value:
select t.*
from (select t.*,
min(t.value) over (partition by t.acct_id) as min_value,
max(t.value) over (partition by t.acct_id) as max_value
from t
) t
where min_value <> max_value;
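The idea behind the window functions can be sketched in Python (hypothetical rows): an account's value has changed at least once exactly when its minimum and maximum values differ, so we keep every row for those accounts:

```python
# Keep all rows for accounts where min(value) != max(value) per acct_id,
# i.e. accounts whose value changed at least once.
from collections import defaultdict

rows = [
    ("11111111", "2020-01-01", 100),
    ("11111111", "2021-01-01", 200),
    ("44444444", "2020-01-01", 300),
]

values_by_acct = defaultdict(list)
for acct, eff_dt, value in rows:
    values_by_acct[acct].append(value)

changed = {a for a, vals in values_by_acct.items() if min(vals) != max(vals)}
result = [r for r in rows if r[0] in changed]
print(result)  # all rows for 11111111, none for 44444444
```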

Get latest data for all people in a table and then filter based on some criteria

I am attempting to return the row of the highest value for timestamp (an integer) for each person (that has multiple entries) in a table. Additionally, I am only interested in rows with the field containing ABCD, but this should be done after filtering to return the latest (max timestamp) entry for each person.
SELECT table."person", max(table."timestamp")
FROM table
WHERE table."type" = 1
HAVING table."field" LIKE '%ABCD%'
GROUP BY table."person"
For some reason, I am not receiving the data I expect. The returned table is nearly twice the size of expectation. Is there some step here that I am not getting correct?
You can first rank each person's rows by timestamp in a subquery and then filter on the other columns in an outer query; a window function keeps type and field available alongside the latest timestamp:
SELECT sub."person", sub."timestamp"
FROM
(SELECT table."person", table."timestamp", table."type", table."field",
        ROW_NUMBER() OVER (PARTITION BY table."person" ORDER BY table."timestamp" DESC) AS rn
 FROM table) sub
WHERE sub.rn = 1 AND sub."type" = 1 AND sub."field" LIKE '%ABCD%'
Direct answer: as I understand your end goal, just move the HAVING clause to the WHERE section:
SELECT
table."person", MAX(table."timestamp")
FROM table
WHERE
table."type" = 1
AND table."field" LIKE '%ABCD%'
GROUP BY table."person";
This should return no more than 1 row per table."person", with their associated maximum timestamp.
As an aside, I'm surprised your query worked at all. Your HAVING clause referenced a column that was neither grouped nor aggregated in your query. From the documentation (and my experience):
The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed.
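To see concretely why the ordering matters, here is a Python sketch with made-up rows: filtering on the field before taking each person's latest row (what the WHERE version does) can return a different set of people than taking each person's latest row first and filtering afterwards (what the question literally asks for):

```python
# Compare "filter then take latest" against "take latest then filter".
rows = [
    ("alice", 10, 1, "xxABCDxx"),
    ("alice", 20, 1, "nope"),
    ("bob",   15, 1, "ABCD"),
]

def latest_per_person(rs):
    # keep, per person, the type-1 row with the highest timestamp
    best = {}
    for person, ts, typ, field in rs:
        if typ == 1 and (person not in best or ts > best[person][1]):
            best[person] = (person, ts, typ, field)
    return best

# (a) WHERE-style: filter on the field first, then aggregate
a = latest_per_person(r for r in rows if "ABCD" in r[3])
# (b) latest-then-filter: aggregate first, keep only matching latest rows
b = {p: r for p, r in latest_per_person(rows).items() if "ABCD" in r[3]}
print(sorted(a), sorted(b))  # ['alice', 'bob'] ['bob']
```

Alice's latest row does not contain 'ABCD', so she appears in (a) but not (b); which behavior is correct depends on what the asker actually wants.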

PowerPivot - only newest values on current context

I have a problem with PowerPivot.
Let's have a look at only 3 columns in my data source:
date - clientid - category
Category can only be 1 or 2.
In the data source you can find often the same clientid for a given time period, sometimes with different category.
So in my pivot table, I can see the distinct count of my clients depending on the chosen timeline.
But, of course, the sum of clients for cat=1 and cat=2 is bigger than the distinct count.
Is it possible to count only the newest entries for every clientid, so that the sum of the two cats is the same as the distinct count of my clients?
Thanks in advance to everybody who helps and spends their time on this.
Stefan
This was fun! Thanks for an interesting problem. Normally for this sort of thing we might flag the most recent entry for a given clientid in an extra field, but yours needs to be dynamic at runtime based on your date filter selection.
Here we go. Be warned, it's a doozy.
CountCat:=
COUNTROWS(
FILTER(
GENERATE(
VALUES( ClientCats[clientid] )
,CALCULATETABLE(
SAMPLE(
1
,SUMMARIZE(
ClientCats
,ClientCats[date]
,ClientCats[category]
)
,ClientCats[date]
,DESC
)
,ALL( ClientCats[category] )
)
)
,CONTAINS(
VALUES( ClientCats[category] )
,ClientCats[category]
,ClientCats[category]
)
)
)
Let's work through it.
COUNTROWS() is trivial.
FILTER() takes a table as its first argument. It creates a row context by iterating row-by-row through this table. It evaluates a boolean expression in each row context and returns the rows for which the expression returns true. We're not getting to that expression for a little while here. Let's look at the table we'll be filtering.
GENERATE() takes a table as its input and creates a row context by iterating row-by-row through that table. For each row context it evaluates a second table, and cross joins the rows that exist in the second table expression in the current row context from the first table with the row from the first table.
Our first table is VALUES( ClientCats[clientid] ), which is simply a distinct list of all [clientid]s in context from the pivot table.
We then evaluate CALCULATETABLE() for each row context, aka for each [clientid]. CALCULATETABLE() evaluates a table expression in the filter context determined by its second and subsequent arguments.
SAMPLE() is the table we'll evaluate. SAMPLE() is like TOPN(), but with ties broken non-deterministically. SAMPLE( 1, ... ) always returns one row. TOPN( 1, ... ) returns all rows that are tied for first position.
SAMPLE(), here, will return one row from the table defined by SUMMARIZE(). SUMMARIZE() groups by the fields in a table that are named. Thus we have a table of all distinct values of [date] and [category] that are included based on the context determined by our CALCULATETABLE(). SAMPLE()'s third argument defines a sort-by column to determine which rows are first, and its fourth determines the sort order. Thus for each [clientid] we are returning the latest row in the SUMMARIZE() for that [clientid].
The ALL() in our CALCULATETABLE() strips the context from the field [category] that might be coming in from our pivot table. This means that every time we evaluate our GENERATE() (remember we're still in that function here), we get a table of all [clientid]s that exist in context, and their most recent [category], even when we're evaluating in a pivot cell that has filtered [category].
That sounds like a problem - we'd expect the same count now for every pivot cell. And that's what we'd get if we did COUNTROWS( GENERATE() ). But wait, we're still in FILTER()!
Now we get to the boolean expression which will filter the rows of that GENERATE(). CONTAINS() takes a table as its first argument, a reference to a column in that table as its second argument, and a scalar value as its third argument. It returns true if the column in argument 2, of the table in argument 1, contains the value in argument 3.
We are outside of the CALCULATETABLE(), and therefore context exists on [category]. VALUES() returns the unique rows in context. In any pivot cell filtered by [category], this will be a 1x1 table, but in our grand total, it will have multiple rows.
So, the column in that VALUES() we want to test is [category] (the only column that exists in that VALUES()).
The value we want to test for is referred to by ClientCats[category]. That third argument evaluates [category] in the row context determined by FILTER(). Thus we return true for every row that matches the current filter context (in a pivot cell) of ClientCats[category]. Mind-bending stuff here.
Anyway, the upshot is that in a [category]-filtered pivot cell, we get the number of distinct [clientid]s that have, for the time frame selected, that [category] value as their most recent category.
For the grand total we get every [clientid] in context.
This will probably not have a very good performance curve.
Here's a sample workbook to play with the functioning measure defined.
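For intuition, here is a rough Python analogue (hypothetical rows) of what the measure computes: each client contributes to exactly one category, its most recent one, so the per-category counts sum to the distinct client count:

```python
# For each clientid, find its most recent category within the date range;
# a category's count is the number of clients whose latest category matches.
rows = [  # (date, clientid, category)
    (1, "c1", 1),
    (2, "c1", 2),
    (1, "c2", 1),
    (3, "c3", 2),
]

latest = {}
for date, client, cat in sorted(rows):  # sorted by date, so later rows win
    latest[client] = cat

counts = {cat: sum(1 for c in latest.values() if c == cat)
          for cat in set(latest.values())}
print(counts, len(latest))  # per-category counts sum to the distinct count
```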
Edit
Based on replies below.
Do you need to maintain in the model all the rows that have [UseClient] <> 1? Deduping and flagging is always easier in tools other than Power Pivot.
I have no idea how you've determined the values for 1 in [UseClient]. None of them are the most recent entry for a given [ClientID]. If you just want to flag the most recent row, which is what it sounds like you want (though not what your workbook looks like), you can do it with a calculated column much more easily than with a measure:
=SAMPLE(
1
,CALCULATETABLE( // return all dates for the [clientid] on current row
VALUES( ClientCats[date] )
,ALLEXCEPT( ClientCats, ClientCats[clientid] )
)
,ClientCats[date]
,DESC
) = ClientCats[date] // row context in table
This will return true when the value of [date] on a given row is equal to the maximum [date] for the client on that row.
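The calculated-column logic can be sketched in Python (hypothetical data): a row is flagged True exactly when its date equals the maximum date for that row's clientid:

```python
# Flag each (clientid, date) row as True when its date is the client's max date.
rows = [("c1", 1), ("c1", 3), ("c2", 2)]

max_date = {}
for client, date in rows:
    max_date[client] = max(max_date.get(client, date), date)

flags = [(client, date, date == max_date[client]) for client, date in rows]
print(flags)  # only each client's latest row is flagged True
```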
One thing you could easily do in Power Query is to group by [clientid] and take the max date for each [clientid]. Then you have one row per client.
This is all different from your original question, though, because the original wants to find the maxes based on the date selection, and a calculated column is not updated based on filter context; it's only recalculated at model refresh time. If you're willing to use a calculated column, then just deal with your data issues before bringing the data into Power Pivot.