I am trying to run a query that aggregates data, groups the results by several different fields, and extract all relevant "SubTotal" permutations. (similar to CUBE() in MSSQL)
When Using Group By Rollup(), I get only permutations according to the order of the Group By fields in the Rollup function.
For example the query below (runs on a public dataset), it returns subtotal by year, or by year and month, or by year, month and medallion... but it doesn't subtotal by medallion.
SELECT
trip_year,
trip_month,
medallion,
SUM(trip_count) AS Sum_trip_count
FROM
[nyc-tlc:yellow.Trips_ByMonth_ByMedallion]
WHERE
medallion IN ("2R76", "8J82", "3B85", "4L79", "5D59", "6H75", "7P60", "8V48", "1H12", "2C69", "2F38", "5Y86", "5j90", "8A75", "8V41", "9J24", "9J55", "1E13", "1J82")
GROUP BY
ROLLUP(trip_year,
trip_month,
medallion)
My question is:
What should I do in order to get all different permutations of "Sub Totals" in a single query results.
Already tried: Union with similar query but with different order, it works, but not elegant (it would require too many unions).
Thanks
You are correct on both counts. In BigQuery, ROLLUP respects the hierarchy treating the listed fields as a strictly ordered list. Their order will not be changed during aggregation.
The CUBE aggregate commonly found in other SQL environments is unordered and in fact aggregates every possible order/subset of its listed fields. At this time, CUBE has not been implemented in BigQuery. The workaround you suggest is also what I would suggest. UNION all result sets from ROLLUP using each permutation of its contained fields. Albeit not ideal, you should get the same results.
In short, UNIONs of several queries with different permutations of ROLLUP fields is the only way to achieve this at the moment. The downsides are as you state that this may be difficult to maintain and can be more expensive in queries.
If you would like to see CUBE implemented in BigQuery, I strongly encourage you to file a feature request on the Big Query public issue tracker. Be sure to include a thorough use case in this request.
UPDATE: To support the feature request filed by the OP, please star it and you'll receive notifications with updates.
Related
I'm trying to optimize a 2M row SSAS query into Power BI by using MDX prior to the Power Query. I have experience in T-SQL and found a website to help translate T-SQL experience into MDX, which was successful for some queries (basic rows/column selects, crossjoins, non empty, order by, filter, where). So now I want to get in my sales data which contains three dimensions and four measures but I get the following error:
Executing the query ...
Query (3, 1) The 'Measures' hierarchy appears more than once in the tuple.
Run complete
I attempted a few variations related to crossjoining the measures and the dimensions, only selecting one measure (which still took too long), and specifying members vs children.
'''
select
([Date].[OrderDate].children, [Customer].[CustID].children, [ProdLevel].[ProdNumber].children) on rows,
([Measures].[Revenue], [Measures].[Units], [Measures].[ASP], [Measures].[Profit]) on columns
from [RepProdDB]
where [ProdLevel].[Prod Description].[MyBusinessUnit]
'''
Looking up the error: "The 'Measures' hierarchy appears more than once in the tuple." is a bit vague to me as I have slight but probably incomplete understanding of tuples.
My hope is to have something that I can easily get in PivotTable OLAP, Power Pivot, and Power Query but using the actual MDX code. Thoughts?
So you need to understand the diffrence between tuples and sets.
select
non empty
(
[Date].[OrderDate].children,
[Customer].[CustID].children,
[ProdLevel].[ProdNumber].children
)
on rows,
{
[Measures].[Revenue],
[Measures].[Units],
[Measures].[ASP],
[Measures].[Profit]
}
on columns
from [RepProdDB]
where
[ProdLevel].[Prod Description].[MyBusinessUnit]
UPDATE
Thanks to the posted answer, I found a much simpler way to formulate the problem. The original question can be seen in the revision history.
The problem
I am trying to translate an SQL query into Django, but am getting an error that I don't understand.
Here is the Django model I have:
class Title(models.Model):
title_id = models.CharField(primary_key=True, max_length=12)
title = models.CharField(max_length=80)
publisher = models.CharField(max_length=100)
price = models.DecimalField(decimal_places=2, blank=True, null=True)
I have the following data:
publisher title_id price title
--------------------------- ---------- ------- -----------------------------------
New Age Books PS2106 7 Life Without Fear
New Age Books PS2091 10.95 Is Anger the Enemy?
New Age Books BU2075 2.99 You Can Combat Computer Stress!
New Age Books TC7777 14.99 Sushi, Anyone?
Binnet & Hardley MC3021 2.99 The Gourmet Microwave
Binnet & Hardley MC2222 19.99 Silicon Valley Gastronomic Treats
Algodata Infosystems PC1035 22.95 But Is It User Friendly?
Algodata Infosystems BU1032 19.99 The Busy Executive's Database Guide
Algodata Infosystems PC8888 20 Secrets of Silicon Valley
Here is what I want to do: introduce an annotated field dbl_price which is twice the price, then group the resulting queryset by publisher, and for each publisher, compute the total of all dbl_price values for all titles published by that publisher.
The SQL query that does this is as follows:
SELECT SUM(dbl_price) AS total_dbl_price, publisher
FROM (
SELECT price * 2 AS dbl_price, publisher
FROM title
) AS A
GROUP BY publisher
The desired output would be:
publisher tot_dbl_prices
--------------------------- --------------
Algodata Infosystems 125.88
Binnet & Hardley 45.96
New Age Books 71.86
Django query
The query would look like:
Title.objects
.annotate(dbl_price=2*F('price'))
.values('publisher')
.annotate(tot_dbl_prices=Sum('dbl_price'))
but gives an error:
KeyError: 'dbl_price'.
which indicates that it can't find the field dbl_price in the queryset.
The reason for the error
Here is why this error happens: the documentation says
You should also note that average_rating has been explicitly included
in the list of values to be returned. This is required because of the ordering of the values() and annotate() clause.
If the values() clause precedes the annotate() clause, any annotations
will be automatically added to the result set. However, if the
values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column.
So, the dbl_price could not be found in aggregation, because it was created by a prior annotate, but wasn't included in values().
However, I can't include it in values either, because I want to use values (followed by another annotate) as a grouping device, since
If the values() clause precedes the annotate(), the annotation will be computed using the grouping described by the values() clause.
which is the basis of how Django implements SQL GROUP BY. This means that I can't include dbl_price inside values(), because then the grouping will be based on unique combinations of both fields publisher and dbl_price, whereas I need to group by publisher only.
So, the following query, which only differs from the above in that I aggregate over model's price field rather than annotated dbl_price field, actually works:
Title.objects
.annotate(dbl_price=2*F('price'))
.values('publisher')
.annotate(sum_of_prices=Count('price'))
because the price field is in the model rather than being an annotated field, and so we don't need to include it in values to keep it in the queryset.
The question
So, here we have it: I need to include annotated property into values to keep it in the queryset, but I can't do that because values is also used for grouping (which will be wrong with an extra field). The problem essentially is due to the two very different ways that values is used in Django, depending on the context (whether or not values is followed by annotate) - which is (1) value extraction (SQL plain SELECT list) and (2) grouping + aggregation over the groups (SQL GROUP BY) - and in this case these two ways seem to conflict.
My question is: is there any way to solve this problem (without things like falling back to raw sql)?
Please note: the specific example in question can be solved by moving all annotate statements after values, which was noted by several answers. However, I am more interested in solutions (or discussion) which would keep the annotate statement(s) before values(), for three reasons: 1. There are also more complex examples, where the suggested workaround would not work. 2. I can imagine situations, where the annotated queryset has been passed to another function, which actually does GROUP BY, so that the only thing we know is the set of names of annotated fields, and their types. 3. The situation seems to be pretty straightforward, and it would surprise me if this clash of two distinct uses of values() has not been noticed and discussed before.
Update: Since Django 2.1, everything works out of the box. No workarounds needed and the produced query is correct.
This is maybe a bit too late, but I have found the solution (tested with Django 1.11.1).
The problem is, call to .values('publisher'), which is required to provide grouping, removes all annotations, that are not included in .values() fields param.
And we can't include dbl_price to fields param, because it will add another GROUP BY statement.
The solution in to make all aggregation, which requires annotated fields firstly, then call .values() and include that aggregations to fields param(this won't add GROUP BY, because they are aggregations).
Then we should call .annotate() with ANY expression - this will make django add GROUP BY statement to SQL query using the only non-aggregation field in query - publisher.
Title.objects
.annotate(dbl_price=2*F('price'))
.annotate(sum_of_prices=Sum('dbl_price'))
.values('publisher', 'sum_of_prices')
.annotate(titles_count=Count('id'))
The only minus with this approach - if you don't need any other aggregations except that one with annotated field - you would have to include some anyway. Without last call to .annotate() (and it should include at least one expression!), Django will not add GROUP BY to SQL query. One approach to deal with this is just to create a copy of your field:
Title.objects
.annotate(dbl_price=2*F('price'))
.annotate(_sum_of_prices=Sum('dbl_price')) # note the underscore!
.values('publisher', '_sum_of_prices')
.annotate(sum_of_prices=F('_sum_of_prices')
Also, mention, that you should be careful with QuerySet ordering. You'd better call .order_by() either without parameters to clear ordering or with you GROUP BY field. If the resulting query will contain ordering by any other field, the grouping will be wrong.
https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Also, you might want to remove that fake annotation from your output, so call .values() again.
So, final code looks like:
Title.objects
.annotate(dbl_price=2*F('price'))
.annotate(_sum_of_prices=Sum('dbl_price'))
.values('publisher', '_sum_of_prices')
.annotate(sum_of_prices=F('_sum_of_prices'))
.values('publisher', 'sum_of_prices')
.order_by('publisher')
This is expected from the way group_by works in Django. All annotated fields are added in GROUP BY clause. However, I am unable to comment on why it was written this way.
You can get your query to work like this:
Title.objects
.values('publisher')
.annotate(total_dbl_price=Sum(2*F('price'))
which produces following SQL:
SELECT publisher, SUM((2 * price)) AS total_dbl_price
FROM title
GROUP BY publisher
which just happens to work in your case.
I understand this might not be the complete solution you were looking for, but some even complex annotations can also be accommodated in this solution by using CombinedExpressions(I hope!).
Your problem comes from values() follow by annotate(). Order are important.
This is explain in documentation about [order of annotate and values clauses](
https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#order-of-annotate-and-values-clauses)
.values('pub_id') limit the queryset field with pub_id. So you can't annotate on income
The values() method takes optional positional arguments, *fields,
which specify field names to which the SELECT should be limited.
This solution by #alexandr addresses it properly.
https://stackoverflow.com/a/44915227/6323666
What you require is this:
from django.db.models import Sum
Title.objects.values('publisher').annotate(tot_dbl_prices=2*Sum('price'))
Ideally I reversed the scenario here by summing them up first and then doubling it up. You were trying to double it up then sum up. Hope this is fine.
I trying to calculate a new field in Tableau similar to a sub query in SQL. However, my numbers not matching up when I try to do this. I am stuck at this point and I am trying to see what others have done.
To provide reference, below is the subquery that I am trying to duplicate in Tableau.
select
((sum(n.non_influenced_sales*n.calls)/sum(n.calls))-(sum(n.average_sales*calls)/sum(calls))))/
(sum(n.non_influenced_sales*n.calls)/sum(n.calls)) as impact
from (select
count(d.id) as calls
,avg(d.sale) as average_sales
,avg(case when non_influenced=1 then d.sale else null end) as non_influenced_sales
from data d
group by skill) n
When I try to build the same calculation in Tableau, I am able to get the same results as long as I comment out the group by skill. However, when I try to group by skill, my attempts to match the number have not been working.
The closest I have come is when I try to fix the level of detail expression by using include. Tableau code:
(include [skill]:([non_influenced_sales]-[average_sales])/[non_influenced_sales]}
However, doing this or using fixed has not worked and I can't match the numbers I am getting from SQL.
FYI, Impact is an aggregated measure. I built the sub-query part in tableau by just creating separate fields for the calculation I needed. So for example
Non Influenced Sales calculated in Tableau:
avg(if [non_influenced]=1 then [non_influenced_sales] end)
However, I am not sure if this matters or not.
I have also tried creating custom sql. I am able to get a rolled up version using all of the dates correct. But when I want to get down to different dates/use other filters, things get messy real quick. I am trying to build relationships on a date level, but that hasn't worked either.
Is there an easier way to do this?
I am not sure this is a duplicated question or not (I don't think so) but its very interesting question for me:
In SQL we can create custom field and put it in the result:
SELECT *.p, totalOrder=(SELECT sum(price) from orders where id=p.id)
FROM products p;
so the result is a list of products with totalSales value.
What is best approach in NoSQL(MongoDB),
I am sure we should have two types of socuments(products and orders) so I know we don't have Join but the question is do we have custom field assignment in finding queries?
When you use aggregation, you have the $project operation which is exactly that. It is used to rename fields or derive field values through some simple operators. But as usual with MongoDB, you can not get any data from another collection.
When you need to do something which is too complex to express with aggregation, you can use MapReduce and build your output-documents with Javascript. But again, no breaking out of the collection.
In analysis services, I have cube that is based on hospitalization data. For each hospitalization there are potentially 9 icd codes and these are each stored in their own field in the view on which the cube is based. These are stored in a child table in the relational database on which the SSAS database is based.
I would like to query the cube to return all rows that have a certain ICD code in any one or more of the 9 icd code fields. It seems as if it should be simple to have this sort of "OR" in the WHERE or the Filter clause, but I'm not finding the correct method.
Thanks in advance,
Jeremy Schrader
As far as i understand, you are an SQL guy and new to MDX, so that's why you have difficulties for the query.
it would be better if you tell us what are the measures you want to select with ICD codes but i am going to try to show you an mdx query sample as simple as possible. Your query should like below;
select {Measure1,Measure2,...} on columns
ICDCodeDimension.Children on rows
//{ICDCodeDimension.ICDCode1,ICDCodeDimension.ICDCode5,...} on rows
from Cube
MDX is highly advanced query language and there are many more concept you should know/learn to use it effectively.
Hope this help.
I am guessing that you'd have a dimension called [ICD Codes] with a single level called [Codes] and 9 members called [Code A] and [Code B] or whatever. Perhaps even a member for [No code] too?
In that case your query would be able to tell you the total number of hospitalisation cases for each code, for a certain time period, across all hospitals:
SELECT {[ICD Codes].[Codes].members} ON ROWS,
{[Measures].[Number of Cases]} ON COLUMNS
FROM [CubeName]
WHERE ([Time].[2010].[Quarter 1])
Thanks for both of your feedback. After I researched further (particularly an article here: http://sqlblog.com/blogs/mosha/default.aspx that uses the SUBCUBE method to give some "OR" functionality but with very poor performance) I realized that the OR construct that I was looking for requires record-level information and so doesn't work after the aggregation that SSAS performs. Thus, I need to create a field on the fact table that has the result of the SQL "OR" statement that I need.
In this case I will just create a flag for any record that has a certain range of ICD codes in any of the 9 ICD code fields. Then, I'll create a measure that gives a count of these. Luckily, the requirements of my app are that only a limited number of diagnoses need to be looked at in this way(i.e. any hospitalization that is diabetes-related,tobacco-related,etc.). I'm still curious how one would approach this if you needed to allow the user to choose any ICD code. My understanding at this point is that you would then need to revert back to plain SQL.
Jeremy