I am building a data model with PowerPivot for Excel 2013 and need to be able to identify the maximum number of emails sent per person. The DAX formula below gives me the result I'm looking for, but performance is incredibly slow. Is there an alternative that will compute a maximum by group without the performance hit?
Maximum Emails per Constituent:
=MAXX(SUMMARIZE('Email Data','Email Data'[person_id],"MAX Value",
([Emails Sent]/[Unique Count People])),[MAX Value])
Without the measure definitions for [Emails Sent] or [Unique Count People], it is not possible to give definitive advice on performance. Based on their names, I'm going to assume they are trivial measures - note that this is an assumption, and whether it holds affects the rest of my post. That being said, there is an obvious optimization to start with.
Maximum Emails per Constituent:=
MAXX(
ADDCOLUMNS(
VALUES('Email Data'[person_id])
,"MAX Value"
,[Emails Sent] / [Unique Count People]
)
,[MAX Value]
)
I used ADDCOLUMNS() rather than SUMMARIZE() to calculate the new column. See this post for an explanation of the performance implications.
Additionally, since you're not grouping by multiple columns, there's no need to use SUMMARIZE(). The performance impact of using VALUES() instead should be minimal.
The other question that comes to mind is whether this needs to be a measure at all. Are you going to be slicing by other dimensions? If not, this becomes a static attribute of [person_id] which could be calculated during ETL, or in a calculated column.
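If the ETL route makes sense, here is a minimal sketch of the pre-aggregation in SQL (the table and column names are assumptions, not taken from your model, and it assumes [Emails Sent] is essentially a row count):
SELECT
    person_id,
    COUNT(*) AS emails_sent   -- assumes one source row per email sent
FROM email_data               -- hypothetical staging table behind 'Email Data'
GROUP BY person_id;
The result could be loaded as a small person-level table, or joined back as a column, so the model only has to display a pre-computed value.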
A final note - I've been assuming that your model itself is optimal. Again, we'd need to see it to comment on whether something there could be causing the performance issues.
This requires some more advanced MDX knowledge than I have.
I need to get the RepoRate_MAX for repo products at book and instrument level; also, looking at the Java code I'm replacing, that code always uses the max MurexId.
How can I perform the below (I've placed MAX in here on the dimension, but this is wrong)? I need the combination of the dimensions together with the MAX MurexId:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],MAX([Deal].[MurexId])),[Measures].[RepoRate_MAX])
I'm sure it's a simple one, but my mind is currently partway between the Java OO and MDX worlds.
So after some experimenting I found out about the TAIL() and Item() MDX functions.
I think at one point I did get it working, but I didn't make a note of what worked. I was playing around with this and variants of it, but most versions ended up with unusable query times:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],TAIL(EXISTING([Deal].[MurexId].[MurexId])).Item(0)),[Measures].[RepoRate_MAX])
So I then decided to push the RepoRate calculation back to the SQL data preparation script. Cleaner, smoother data is always better, and it leaves you with simple calculated members.
I used SQL to determine the RepoRate at trade level with MAX(MurexId) and a GROUP BY on Book and Instrument, then updated my main fact table to ensure that the correct RepoRate was set at Book/Instrument level.
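For reference, a sketch of that data-prep step in SQL (all table and column names here are illustrative, not the real schema):
-- pick, per Book/Instrument, the RepoRate of the trade with the highest MurexId,
-- then write it back to the fact table
UPDATE f
SET f.RepoRate = t.RepoRate
FROM FactDeal AS f
JOIN (
    SELECT tl.Book, tl.Instrument, tl.RepoRate
    FROM TradeLevel AS tl
    JOIN (
        SELECT Book, Instrument, MAX(MurexId) AS MaxMurexId
        FROM TradeLevel
        GROUP BY Book, Instrument
    ) AS m
        ON m.Book = tl.Book
       AND m.Instrument = tl.Instrument
       AND m.MaxMurexId = tl.MurexId
) AS t
    ON t.Book = f.Book
   AND t.Instrument = f.Instrument
WHERE f.ProductType = 'REPO';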
Thus the calculated member is then:
[Measures].[RepoRate_VAL] = (([Deal].[Book],[Deal].[Instrument]),[Measures].[RepoRate_MAX])
Fast data prep and a fast calculated member on the Excel/Pivot/UI layer.
I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following:
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_VALUE(PARSE_DATE('%m/%d/%Y', dob)) dob,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried a very basic GROUP BY operation casting an item to a string vs. not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster).
Other than it being bad practice to keep this all-string table rather than converting it to proper types, what are some of the limitations (functional or performance-wise) that I would encounter by keeping the table all strings? I know there would be a slight increase in size due to storing strings instead of numbers/dates/bools/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or causing a huge performance hit in various cases?
First of all, I don't really see any bigger show-stoppers than those you already know and listed.
Meantime, based on this excerpt from your question:
though wouldn't really matter if using a query-builder ...
I wanted to touch on one aspect of this approach (storing everything as strings).
While we are usually concerned about CASTing from string to a native type in order to apply the relevant functions and so on, I realized that building a complex, generic query with some sort of query builder in some cases requires the opposite - casting a native type to a string, for example to apply a function like STRING_AGG().
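A quick illustration of that opposite direction (the table and column names are hypothetical): STRING_AGG() only accepts strings, so a native INT64 column has to be cast back before it can be aggregated.
SELECT
    name,
    STRING_AGG(CAST(age AS STRING), ', ') AS ages   -- native INT64 must be cast to STRING here
FROM my_dataset.people                              -- hypothetical table with a native INT64 age column
GROUP BY name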
So, my thoughts are:
When a table is designed for direct user access with trivial or even complex queries, having native types is beneficial both performance-wise and in being friendlier for users to understand.
Meantime, if you are developing your own query builder, and you design the table so that it will be available to users for querying via that query builder with some generic logic implemented, having all fields as strings can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance, but you can win in being able to implement a better generic query builder. And that balance depends on the nature of your business - both from a data perspective and in terms of what kind of queries you envision supporting.
Note: your question is quite broad and opinion-based (which, by the way, is not much respected on SO), so obviously my answer is totally my opinion, but it is based on quite a lot of experience with BigQuery.
Are you OK with storing the string "33/02/2000" as a date in one row, "21st of December 2012" in another row, and "22ое октября 2013" in yet another row?
Are you OK with storing the string "45" as an age in one row and "young" in another row?
Are you OK with age "10" being less than age "9"?
Data types provide some basic data validation mechanism at the database level.
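To make the age example concrete - string comparison is lexicographic, so the two comparisons below give opposite answers:
SELECT
    '10' < '9' AS string_comparison,   -- TRUE, because '1' sorts before '9'
    10 < 9     AS numeric_comparison   -- FALSE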
Do BigQuery databases have a notion of indexes?
If yes, then most likely those indexes become useless as soon as you start casting your strings to the proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
SAFE_CAST(age AS INT64) > 10
AND SAFE_CAST(age AS INT64) < 30
It is normal that with fewer columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in the future, you'll need to convert the string to a date to get the age before you can do the manipulation.
Data size: The data size has broader impacts that cannot be seen at the start. For example, if you have N parallel test teams that require their own test systems, you'll need to allocate more disk space.
Read performance: When you have more bytes to read in huge tables, it will cost you considerable time. For example, telco operators typically have a couple of billion rows of data per month.
If your code complexity increases, you'll need to replicate conversions in multiple places.
Even a single one of the above items should push you away from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data. For instance, if someone is trying to write reports with it and do calculations, charts, or date ranges, it could be a big headache to always have to cast or convert the data with whatever tool they are using. You, or someone, would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
With the approach you describe, you might face some storage and performance problems; you can find some guidance in the official documentation.
The main performance problem will come from the CAST operation: remember that the BigQuery engine will have to perform a CAST for each value in every row.
In order to test the compute cost of this operation, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details, we can see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit, and Write operations are required. However, if we execute the same query adding the CAST operator:
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, which might cause problems as the size of the operation scales up.
Also, remember that each time you want to use the properties of a proper data type, you will have to cast your value and pay the compute time required.
Finally, regarding storage: as you mentioned, strings do not have a fixed size, and that can cause an increase in table size.
Scenario
I am designing a star schema for an OLAP environment for the Incident Management process. Management requests being able both to filter on SLA status (breached, achieved, or in progress) and to calculate the percentage of SLAs achieved vs. breached. Reporting will be done in Excel/SSRS through SSAS (Tabular).
Question
I'm reasonably inexperienced in designing for an OLAP environment. I know my idea will work, but I'm concerned it is not the best approach.
My idea:
SLA needs to be both a measure and a dimension.
DimSLA
…
(Nullable bool) Sla Achieved -> Yes=True, No=False, and InProgress=NULL
…
FactIncident
…
(Nullable Integer) Sla Achieved Yes=1,No=0 and In Progress=NULL
…
Then, in SSAS, publish a calculated percentage field which averages FactIncident's Sla Achieved column.
Is this the right/advisable way to do it?
As you describe it, "SLA achieved" should be an attribute, as you want to classify by it, not sum it. The only thing you would sum or aggregate would be other measures (maybe an incident count) under the condition that the "SLA achieved" attribute has certain values like "achieved" or "not achieved". This is the main rule in dimensional design: things you use for classifying or breaking down are attributes, and things that you calculate are measures. There are a few cases where you need a column for both, but not many.
Do not just use a boolean value. Use string values that users can easily understand, like the texts "SLA achieved", "SLA not achieved", and "In progress". This makes the cube much easier for non-technical users to use. If you use this in a dimension table, there would be just three records with those strings, and the fact table would reference them with maybe a one-byte foreign key, so the more meaningful texts do not use up millions of bytes.
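As a sketch of what that small dimension could look like (the names and key values below are illustrative, not prescriptive):
CREATE TABLE DimSlaStatus (
    SlaStatusKey  TINYINT     NOT NULL PRIMARY KEY,
    SlaStatusName VARCHAR(20) NOT NULL
);

INSERT INTO DimSlaStatus (SlaStatusKey, SlaStatusName)
VALUES (1, 'SLA achieved'),
       (2, 'SLA not achieved'),
       (3, 'In progress');

-- FactIncident then carries the small surrogate key instead of a nullable bool:
-- SlaStatusKey TINYINT NOT NULL REFERENCES DimSlaStatus (SlaStatusKey)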
I know MDX is used for much more sophisticated math, so please forgive the simplistic scenario; this is one of my first calculated members.
When I multiply Price x Quantity, the AS cube's data browser has the correct information in the leaf elements, but not in any of the parents. The reason seems to be that I want something like (1 * 2) + (2 * 3) + (4 * 5) and not (7 * 10), which I think I am getting as a result of how the sum is done on the columns.
Is the IsLeaf expression intended to be used in these circumstances? Or is there another way? If so, are there any examples as simple as this I can see?
The calculated member that I tried to create is just this:
[Measures].[Price]*[Measures].[Quantity]
The result for a particular line item (the leaf) is correct. But the result for, say, all of April is an incredibly high number.
Edit:
I am now considering that this might be an issue with bad data. It would be helpful, though, if someone could confirm that the above calculated member should work under normal circumstances.
Here is a blog post dealing with this particular problem: Aggregating the Result of an MDX Calculation Using Scoped Assignments.
For leaf level computations resulting in something that can then be summed, MDX is rather complex and slow.
The simplest way to do what you want to achieve would be to make this a normal measure, based on the Price x Quantity calculation defined in the data source view.
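For example, the view or query behind the measure group in the data source view could expose the multiplication as an ordinary column (the names below are only illustrative):
SELECT
    s.OrderDateKey,
    s.ProductKey,
    s.Price,
    s.Quantity,
    s.Price * s.Quantity AS SalesAmount   -- aggregate this with a plain SUM in the cube
FROM dbo.FactSales AS s;
The cube then sums SalesAmount at the leaf level, so the parents roll up correctly without any MDX.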
OK, I'm just curious what the formula would be for calculating expected income over the next X weeks/months/etc., if the only data I have in my MySQL DB is all past transactions (transaction dates, amounts, etc.).
I am thinking of taking some averages and whatnot, but I can't think of a specific formula (there must be something along those lines) to take, say, the average rise of income over time (weekly/monthly), apply it to a selected future period, and display it weekly/monthly/etc.
Any suggestions?
Use AVG() on the past income and divide it into the proper weekly/monthly amounts if necessary.
See http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_avg for more info on AVG().
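For example, a minimal sketch assuming a transactions table with amount and created_at columns (both names are assumptions):
SELECT AVG(monthly_income) AS avg_monthly_income
FROM (
    SELECT DATE_FORMAT(created_at, '%Y-%m') AS yr_month,
           SUM(amount) AS monthly_income
    FROM transactions
    GROUP BY DATE_FORMAT(created_at, '%Y-%m')
) AS monthly;   -- average of past monthly totals; project it forward as a flat estimate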
Linear regression + simple integration is probably sufficient for your needs. I leave sorting out the exact implementation for your DB up to you, but follow that link to the "Estimation Methods" section, and probably use Ordinary Least Squares.
Alternatively, you can always slurp your data into something like R where the details are already implemented.
EDIT:
For more detail: you're trying to model INCOME = BASE + SCALING*T, where we assume that a linear model is "good" (it's probably not great, but it's probably good enough on a short time scale). For two-variable linear regression, you're pretty much just taking averages; follow that link to "Fitting the Regression Line" and you'll see which quantities you need to average (y = INCOME and x = T). There are some tricks you can play to simplify the calculation for the computer if you can enforce some other conditions (e.g., equally spaced time periods and no missing data), but you'll need to do a bit more math yourself first if you want that (and you'll be less flexible in the face of changing DB assumptions).
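As a sketch of the "just taking averages" part, assuming the same hypothetical transactions table (amount, created_at): OLS gives slope = (avg(x*y) - avg(x)*avg(y)) / (avg(x*x) - avg(x)^2) and intercept = avg(y) - slope*avg(x), which you can compute directly over monthly totals:
SELECT
    (AVG(x * y) - AVG(x) * AVG(y)) / (AVG(x * x) - AVG(x) * AVG(x)) AS slope,
    AVG(y) - (AVG(x * y) - AVG(x) * AVG(y)) / (AVG(x * x) - AVG(x) * AVG(x)) * AVG(x) AS intercept
FROM (
    SELECT TIMESTAMPDIFF(MONTH, '2000-01-01', created_at) AS x,   -- month index from an arbitrary origin
           SUM(amount) AS y                                       -- income in that month
    FROM transactions
    GROUP BY TIMESTAMPDIFF(MONTH, '2000-01-01', created_at)
) AS monthly;
-- expected income for a future month with index T is then intercept + slope * T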