This is what my data model looks like:
Id  Status   StartDate
1   StatusA  01/01/2015
1   StatusB  01/03/2015
1   StatusC  01/05/2015
2   StatusA  01/04/2015
2   StatusB  01/08/2015
I am trying to get the max StartDate for StatusB per Id.
This is my calculated dimension:
=If(Match(Status,'StatusB'),Timestamp(StartDate))
It works, but it also gives me an additional duplicate row with an empty max date.
My straight table chart contains only these two columns. If I remove the max-date dimension, it shows one record per Id.
What am I missing here?
No need to put the filter in the dimension. QlikView allows calculated dimensions, but they can cause a lot of performance issues (basically, when calculated dimensions are used, QlikView creates a new "virtual" table in memory with the new dimension; with big datasets this can drain your RAM/CPU).
For cases like this it's much better to control the logic through the expression. In your case, just keep Id as the dimension and use the following expression:
max( {< Status = {'StatusB'} >} StartDate)
Then, on the Numbers tab, change the format setting to Timestamp.
Stefan
You can also use Aggr, grouping by Id:
=Aggr(Max(If(Status = 'StatusB', StartDate)), Id)
I have a BigQuery dataset that updates at irregular times (it can be once or twice a week, or less). The data is structured as follows:
id  Column1     Column2      data_date (timestamp)
0   Datapoint0  Datapoint00  2022-01-01
1   Datapoint1  Datapoint01  2022-01-01
2   Datapoint2  Datapoint02  2022-01-03
3   Datapoint3  Datapoint03  2022-01-03
4   Datapoint4  Datapoint04  2022-02-01
5   Datapoint5  Datapoint05  2022-02-01
6   Datapoint6  Datapoint06  2022-02-15
7   Datapoint7  Datapoint07  2022-02-15
Timestamp is a string in 'YYYY-MM-DD' format.
I want to make a chart and a pivot table in Google DataStudio that automatically filters by the latest datapoints ('2022-02-15' in the example). All the solutions I tried are either sub-optimal or just don't work:
Creating a support column doesn't work, because I would need to mix aggregated and non-aggregated fields (data_date and the latest data_date).
Adding a filter to the charts only lets me specify a fixed day, so I would need to edit the chart every time the underlying data is updated.
Using a dropdown filter lets me dynamically pick whatever date I need, but I consider it suboptimal because I can't have it automatically select the latest date. A date-range filter would make it dynamic, but since the update schedule is irregular it may select a range with multiple timestamps or none at all, so it's also sub-optimal.
Honestly, I'm out of ideas. I stupidly thought it was possible to add a column saying data_date = (select max(data_date) from dataset), but it seems not to be possible, since max needs to work on aggregated data.
One possible solution could be creating a view that contains only the latest data points, and referencing that view from Data Studio.
CREATE OR REPLACE VIEW `project_id.dataset_id.view_name` AS
SELECT *
FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide`
WHERE date = (  # keeps every row carrying the latest date, not just one row
  SELECT MAX(date)
  FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide`
)
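Applied to the table from the question (the project, dataset, and table names here are hypothetical), the same pattern might look like the sketch below; since data_date is a 'YYYY-MM-DD' string, MAX(data_date) returns the chronologically latest value.
CREATE OR REPLACE VIEW `my_project.my_dataset.latest_data` AS
SELECT *
FROM `my_project.my_dataset.source_table`
WHERE data_date = (
  # 'YYYY-MM-DD' strings sort lexicographically, which matches chronological order
  SELECT MAX(data_date)
  FROM `my_project.my_dataset.source_table`
)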
I am working on a workforce analysis project, and I did some CASE WHEN conditional calculations in Google Data Studio. However, after I successfully created the new fields, I couldn't do further calculations based on the fields I created.
Based on my raw data, I generated start_headcount, new_hires, terminated, and end_headcount by applying CASE WHEN conditional calculations. However, I failed at the next step: calculating the turnover rate and the retention rate.
The formula for the turnover rate is
terms / ((start_headcount + end_headcount) / 2)
and for retention it is
end_headcount / start_headcount
However, the result is wrong. Part of my table is as below:
Supervisor  sheadcount  newhire  terms  eheadcount  turnover  Retention
A           1           3        1      3           200%      0%
B           6           2        2      6           200%      500%
C           6           1        3      4           600%      300%
The turnover rate for A should be 1/((1+3)/2) = 50%; for B it should be 2/((6+6)/2) = 33.33%.
I don't know why it is going wrong. Can anyone help?
For example, here is what I wrote for start_headcount for each employee:
CASE
  WHEN (Last Hire Date < '2018-01-01' AND Termination Date >= '2018-01-01')
    OR (Last Hire Date < '2018-01-01' AND Termination Date IS NULL)
  THEN 1
  ELSE 0
END
which means that if an employee meets the above criteria, they get a 1, and the records are then grouped under each supervisor. I think this might be why the summed turnover rate is wrong: it is calculated on each record and then summed up, rather than on the grouped data.
Most likely you are trying to do both steps within the same query, so newly created fields like start_headcount, etc. are not yet visible within the same SELECT statement. Instead, you need to put the first calculation in a subquery, as in the example below:
#standardSQL
SELECT *, terms/((start_headcount+end_headcount)/2) AS turnover
FROM (
<query for your first step>
)
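For example, here is a sketch of the full two-step query, assuming a hypothetical employees table and supervisor/date columns; only the start_headcount definition comes from the question, the CASE definitions for terms and end_headcount are illustrative guesses. The key point is that the per-employee 0/1 flags are summed per supervisor first, and the ratios are computed only on those aggregates.
#standardSQL
SELECT
  supervisor,
  # ratios computed only after the per-employee flags are summed per supervisor
  SAFE_DIVIDE(terms, (start_headcount + end_headcount) / 2) AS turnover,
  SAFE_DIVIDE(end_headcount, start_headcount) AS retention
FROM (
  SELECT
    supervisor,
    SUM(CASE WHEN last_hire_date < '2018-01-01'
              AND (termination_date >= '2018-01-01' OR termination_date IS NULL)
             THEN 1 ELSE 0 END) AS start_headcount,
    SUM(CASE WHEN termination_date BETWEEN '2018-01-01' AND '2018-12-31'
             THEN 1 ELSE 0 END) AS terms,
    SUM(CASE WHEN last_hire_date < '2019-01-01'
              AND (termination_date >= '2019-01-01' OR termination_date IS NULL)
             THEN 1 ELSE 0 END) AS end_headcount
  FROM `my_project.my_dataset.employees`
  GROUP BY supervisor
)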
I am trying to run a select on a table where the data spans multiple days, so it does not conform to the single-day data that the documentation alludes to.
Applying xbar across multiple days obviously results in data that is not ordered; i.e. select last size, last price by 1 xbar time.second on data that includes 2 days would result in:
second | size price
====================
00:00:01 | 400 555.5
00:00:01 | 600 606.0
00:00:02 | 400 555.5
00:00:02 | 600 606.0
How can one include the current date in the selection so that, as is done in pandas, the result stays ordered across multiple days, e.g. 2019-09-26 16:34:40?
Furthermore, how does one achieve this while maintaining a date format that is compatible with pandas once stored in CSV?
NB: It is easiest for us to assist you if you provide code that can replicate a sample of the kind of table that you are working with. Otherwise we need to make assumptions about your data.
Assuming that your time column is of timestamp type (e.g. 2019.09.03D23:11:54.711811000), a simple solution is to xbar by one as a timespan, rather than using the time.second syntax:
select last size, last price by 0D00:00:01 xbar time from data
Using xbar keeps the time column as a timestamp rather than casting it to second type.
If your time column is of some other temporal type then you can still use this method if you have a date column in your table that you can use to cast time to a timestamp. This would look something like:
select last size, last price by 0D00:00:01 xbar date+time from data
I would suggest grouping by both date and second, and then summing them:
update time: date+time from
select last size, last price
by date: `date$time, time: 1 xbar `second$time from data
Or, the shorter and more efficient option is to sum date and second directly in the by clause:
select last size, last price by time: (`date$time) + 1 xbar `second$time from data
I have the following table structure:
CREATE TABLE StandardTable
(
RecordId varchar(50),
Balance float,
Payment float,
ProcDate date,
RecordIdCreationDate date,
-- multiple other columns used for calculations
)
And here is what a small sample of what my data might look like:
RecordId  Balance  Payment  ProcDate    RecordIdCreationDate
1         1000     100      2005-01-01  2005-01-01
2         5000     250      2008-01-01  2008-01-01
3         7500     350      2006-06-01  2006-06-01
1         900      100      2005-02-01  NULL
2         4750     250      2008-02-01  NULL
3         7150     350      2006-07-01  NULL
The table holds data on a transactional basis and has millions of rows. The ProcDate field indicates the month in which each transaction was processed: regardless of when a transaction occurs during the month, ProcDate is hard-coded to the first of that month. So if a transaction occurred on 2009-01-17, its ProcDate would be 2009-01-01. I'm dealing with historical data going back as early as 2005-01-01. There are multiple instances of each RecordId in the table; a RecordId shows up in each month until its Balance reaches 0. Some RecordIds originate in the month the data starts (ProcDate 2005-01-01) and others don't originate until later. The RecordIdCreationDate field holds the date a RecordId originated, so that column has millions of NULL values, because it is NULL in every month other than the one in which the RecordId originated.
I need to somehow look at each RecordId and run a number of different calculations on a month-to-month basis. That is, for each RecordId I have to take the column values where the ProcDate is, say, 2008-01-01 and compare them to the same column values where the ProcDate is 2008-02-01. Then, after running my calculations for that RecordId in that month, I have to compare the values from 2008-02-01 to those from 2008-03-01 and run my calculations again, and so on. I'm thinking I could do this all within one big WHILE loop, but I'm not entirely sure what that would look like.
The first thing I did was create another table in my database with the same design as my StandardTable, called ProcTable, and insert into it all the rows where RecordIdCreationDate was not NULL. This gave me the first instance of each RecordId. I was able to run my calculations for the first month successfully, but I'm struggling with how to take the column values in the ProcTable and compare them to the column values where the ProcDate is the following month. And even if I could do that, I'm not sure how I would repeat the process to compare the 2nd month's data to the 3rd month's, the 3rd month's to the 4th month's, etc.
Any suggestions? Thanks in advance.
Seems to me all you need to do is JOIN the table to itself on this condition:
ON MyTable1.RecordId = MyTable2.RecordId
AND MyTable1.ProcDate = DATEADD(month, -1, MyTable2.ProcDate)
Then you will have every row in your table (MyTable1) joined to the same RecordId's row from the next month (MyTable2), and in each row you can do whatever calculations you want between the two joined rows.
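As a minimal sketch, using the StandardTable definition from the question (the balance difference is just a placeholder for your own calculations):
SELECT
  cur.RecordId,
  cur.ProcDate,
  prev.Balance AS PrevBalance,
  cur.Balance AS CurBalance,
  prev.Balance - cur.Balance AS BalanceReduction  -- example month-over-month calculation
FROM StandardTable AS prev
JOIN StandardTable AS cur
  ON cur.RecordId = prev.RecordId
 AND prev.ProcDate = DATEADD(month, -1, cur.ProcDate)
ORDER BY cur.RecordId, cur.ProcDate;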
I'm trying to create what seems like it should be a pretty simple matrix report, and I'm hoping someone can help. I have a dataset that returns sales region, date, and sales amount. The requirement is to compare sales for various time periods to the current date. I'm looking to get my matrix to look something like this:
Region  CurrentSales  Date2Sales  CurrentVSDate2  Date3Sales  CurrentVSDate3
1       1000          1500        -500            800         200
2       1200          1000        200             900         300
3       1500          1100        400             1400        100
I can get the difference from one column to the next, but I need all columns to reference the CurrentSales column. Any help would be greatly appreciated.
Currently my dataset pulls in a date, region, product, and sales amount. I then have three parameters: CurrentDate, PreviousMonth, and PreviousQuarter. The regions and products are my row groups and the dates are the column groups. Next I added a column inside the group with the following expression: =Sum(Fields!SalesAmount.Value)-Previous(Sum(Fields!SalesAmount.Value),"BookingDate"). I know this isn't correct, because it compares each value to the previous date in the column group, and I need the comparison to be to the first date in the column group.
Example: using expressions you can control what a cell displays, e.g. hide zero values:
=IIf(Sum(Fields!EndBalance.Value) = 0, Nothing, Sum(Fields!EndBalance.Value))
You can also use Switch.
The easiest way to get this result would probably be in your query: add a field to every row returned, maybe called "CurrentSales", using a correlated subquery to get the right value for comparison. Then your comparison can be as simple as =Fields!Sales.Value - Fields!CurrentSales.Value or similar.
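A sketch of that correlated subquery, assuming a hypothetical Sales table and a @CurrentDate query parameter supplied by the report:
SELECT
  s.Region,
  s.SalesDate,
  s.SalesAmount,
  (SELECT SUM(c.SalesAmount)
     FROM Sales AS c
    WHERE c.Region = s.Region
      AND c.SalesDate = @CurrentDate) AS CurrentSales  -- same value repeated on every row of the region
FROM Sales AS s;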
There are some ways to do this at the report level, but they are more of a pain: my current favorite of those is to use custom code embedded in the report. Another approach is to use Aggregates of aggregates.