I was just going through the MDX documentation.
One clause there I found a bit tricky, or rather I did not understand it clearly. It reads as follows:
The order of tuples in a set is important; it affects, for example,
the nesting order in an axis dimension. The first tuple represents the
first, or outermost, dimension, the second tuple represents the next
outermost dimension, and so on
{ (Time.[2nd half], Route.nonground.air), (Route.nonground.air,
Time.[2nd half]) }
Also, is it OK to use a cross join in a tuple?
(Time.[2nd half] * Route.nonground.air * Route.nonground.air * Time.[2nd half])
Can anyone elaborate on this with a simple example?
Thank you.
The specification refers to the order of tuples within a set, not the order of hierarchies within the tuples of a set (which by the way must be the same across all tuples of the set, but that is not of concern for this part of the specification).
This is important, as mathematical sets do not have any specific order; i.e., mathematically the sets
{a, b, c}
and
{b, a, c}
are equal.
But as MDX is meant for reporting, where the display order in a report may be relevant, it is convenient that an MDX set always has a specific order. Another difference between mathematical sets and MDX sets is that MDX sets can have duplicates, while in the mathematical sense, one element is either contained or not contained in a set, but never contained multiple times.
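For illustration, here is a minimal sketch against a hypothetical cube, with member names borrowed from the question's example (the measure and the [1st half]/rail members are assumptions): the query asks for two tuples on the rows axis, and the rows are displayed in exactly the order the tuples are listed in the set.
SELECT
{ [Measures].[Units Shipped] } ON COLUMNS,
{ (Time.[2nd half], Route.nonground.air),
(Time.[1st half], Route.nonground.rail) } ON ROWS
FROM [Warehouse]
Swapping the two tuples in the set returns exactly the same cell values, just with the two rows displayed in the opposite order.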
If you compare that to SQL: SQL result sets are by definition unordered, like mathematical sets, but may contain duplicate records. You can, however, get an SQL result set ordered in some situations, but you have to request that with an explicit ORDER BY clause. And some SQL dialects do not allow ORDER BY in, e.g., subselects, as these are never returned directly to the final user. Technically, guaranteeing a certain order of the result set only when it is explicitly requested has the advantage that the optimizer has more freedom to build an efficient execution plan than if a specific order of the result always had to be delivered.
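As a small illustration of that last point, with hypothetical table and column names:
-- Row order here is an implementation detail and may differ between runs:
SELECT route, volume FROM shipments;
-- Only this form guarantees an order, and only for the outermost SELECT:
SELECT route, volume FROM shipments ORDER BY volume DESC;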
I have a list of proper names (in a table), and another table with a free-text field. I want to check whether that field contains any of the proper names. If it were just one, I could do
WHERE free_text LIKE '%proper_name%'
but how do you do that for an entire list? Is there a better string function I can use with a list?
Thanks
No, LIKE does not have that capability.
Many databases support regular expressions, which enable you to do what you want. For instance, in Postgres this is phrased as:
where free_text ~ 'name1|name2|name3'
Many databases also have full-text search capabilities that speed such searches.
Both capabilities are highly specific to the database you are using.
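To avoid hard-coding the pattern, the Postgres example above can also build it from the names table itself. Here is a sketch, assuming hypothetical tables free_text_table(free_text) and proper_names(proper_name); note that names containing regex metacharacters would need escaping:
SELECT f.*
FROM free_text_table f
WHERE f.free_text ~ (SELECT string_agg(proper_name, '|') FROM proper_names);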
Well, you can use LIKE in a standard JOIN, but the query will most likely be slow, because it will search for each proper name in each free_text.
For example, if you have 10 proper names in the list and a certain free_text value contains the first name, the server will still continue processing the remaining 9 names.
Here is the query:
SELECT -- DISTINCT
    free_text_table.*
FROM
    free_text_table
    INNER JOIN proper_names_table
        -- the '%' wildcards assume a "contains" match, as in the question;
        -- string concatenation syntax varies by dialect
        ON free_text_table.free_text LIKE '%' + proper_names_table.proper_name + '%'
;
If a certain free_text value contains several proper names, that row will be returned several times, so you may need to add DISTINCT to the query. It depends on what you need.
It is possible to use a LATERAL JOIN to avoid the Cartesian product (where each row in free_text_table is compared to every row in proper_names_table). The end result may be faster than the simple variant; it depends on your data distribution.
Here it is in SQL Server syntax:
SELECT
    free_text_table.*
FROM
    free_text_table
    CROSS APPLY
    (
        SELECT TOP(1)
            proper_names_table.proper_name
        FROM proper_names_table
        -- same "contains" match as above
        WHERE free_text_table.free_text LIKE '%' + proper_names_table.proper_name + '%'
        -- ORDER BY proper_names_table.frequency
    ) AS A
;
Here we don't need DISTINCT: there will be at most one row in the result for each row from free_text_table (one or zero). The optimiser should be smart enough to stop reading and processing proper_names_table as soon as the first match is found, thanks to the TOP(1) clause.
If you can also somehow order your proper names so that those most likely to be found come first, the query is more likely to be faster than a simple JOIN (add a suitable ORDER BY clause to the subquery).
Given a BigQuery table with the schema: target:STRING,evName:STRING,evTime:TIMESTAMP, consider the following subselect:
SELECT target,
NEST(evName) AS evNames,
NEST(evTime) AS evTimes,
FROM [...]
GROUP BY target
This will group events by target into rows with two repeated fields evNames and evTimes. I understand that the values within each of the repeated fields are not ordered in any predictable way, but is the ordering guaranteed to be consistent between the two repeated fields?
In other words, if I pick N-th value from evNames and N-th value from evTimes within a given row, will they form a proper pair from the original table?
What I would really like to do is to create a nested repeated record, something like:
SELECT target, NEST(RECORD(evName, evTime)) AS events FROM [...] GROUP BY target
but I believe creating RECORDs on the fly like this is currently not supported.
By the way, this question is motivated by the desire to use recently introduced BigQuery user defined functions to implement state machines, as an alternative to window functions tricks.
Note: I realize that an alternative is to emulate record by serializing multiple fields into a single string representation, e.g.:
SELECT target, NEST(CONCAT(evName, ',', STRING(evTime))) ...
and then deserialize the "record" in later stages, but I'd like to avoid that if I can.
Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns, which are effectively key-value pairs, are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row, which act as placeholders for the Supervisor piece that is missing when the unique identifier is (2, 1).
Towards this end, I started putting together this Pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
    sorted = ORDER data BY key, value;
    GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems to be just a matter of another projection to get rid of the redundant columns (the first two, which are replicated multiple times, once per key-value pair).
Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if, for most rows, most of the columns are not filled in, a better solution IMO would be to use a map. The built-in TOMAP UDF combined with a custom UDF to merge maps would enable you to do this.
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect it would be no better than organizing your data in some other way.
You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.
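To make the first, map-based suggestion a bit more concrete, here is a rough sketch reusing the LOAD from the question; TOMAP is built in, while MergeMaps is a hypothetical custom UDF that would merge the per-row maps within each group:
data = LOAD 'key_value' USING PigStorage(',') AS (i1:int, i2:int, key:chararray, value:chararray);
-- one single-entry map per input row
pairs = FOREACH data GENERATE i1, i2, TOMAP(key, value) AS kv;
grouped = GROUP pairs BY (i1, i2);
-- merged = FOREACH grouped GENERATE group, MergeMaps(pairs.kv);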
I have a situation where I have a product and a time dimension, with a fact table of sales volume. Over time, various details about the product change, with the exception of the business key for the product. In my flat reporting from the cube, I want to include some aggregation at the 'business key' level, regardless of what other parts of the product dimension are shown.
In SQL this would be trivial, something like:
select sum(volume) over (partition by productKey,year) as Total
Regardless of whatever else I had selected, the Total column would be aggregated only on those two fields.
In MDX I have managed to achieve the same result, but it seems like there must be a simpler way.
WITH MEMBER Measures.ProductKeyTotal AS
'SUM(([Product].[ProductKey],[Time].[Year]
,[Product].[Product Name].[Product Name].ALLMEMBERS
,[Volume Type].[Volume Type Id].[Volume Type Id].ALLMEMBERS)
,[Measures].[Volume])'
SELECT {[Measures].[Volume],[Measures].[ProductKeyTotal]} ON COLUMNS,
NONEMPTYCROSSJOIN ([Product].[ProductKey].[ProductKey].ALLMEMBERS
,[Time].[Time].[Year].ALLMEMBERS
,[Product].[Product Name].[Product Name].ALLMEMBERS
,[Volume Type].[Volume Type Id].[Volume Type Id].ALLMEMBERS) ON ROWS
FROM [My Cube]
WHERE ([Product].[Include In Report].&[True])
1) If I don't include the ALLMEMBERS for the row hierarchies I don't want in the calculated member, the total is not correct. Is there a shortcut to force it to ignore all dimensions other than the ones you specify?
Part of the reason I ask is that I need to add a bunch of other calculated members, some of which will use parameters, and if I use the method from the example above I am going to need to duplicate the same stuff in multiple places and the code will get weighty.
Well, first of all, don't use NonEmptyCrossJoin; it's been deprecated. Use NON EMPTY and the cross join operator (*) instead.
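For example, keeping the WITH MEMBER definition from the question, the ROWS axis could be rewritten along these lines (a sketch reusing the names from the question):
SELECT {[Measures].[Volume],[Measures].[ProductKeyTotal]} ON COLUMNS,
NON EMPTY
[Product].[ProductKey].[ProductKey].ALLMEMBERS
* [Time].[Time].[Year].ALLMEMBERS
* [Product].[Product Name].[Product Name].ALLMEMBERS
* [Volume Type].[Volume Type Id].[Volume Type Id].ALLMEMBERS
ON ROWS
FROM [My Cube]
WHERE ([Product].[Include In Report].&[True])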
It's important to understand how tuples and tuple sets work to answer your question. Essentially, any hierarchy not explicitly stated will always contribute its CurrentMember. Typically this is the DefaultMember, but if you have set it to something else in your query, that will change things. The reason you have to specify ALLMEMBERS for those dimensions is that CurrentMember would be used otherwise. You could just use the [All] member in lieu of trying to sum up ALLMEMBERS (especially if they're not flat!), which will give you a bit better performance.
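A sketch of the calculated member along those lines (the exact names of the [All] members depend on how the attribute hierarchies are defined in your dimension):
WITH MEMBER [Measures].[ProductKeyTotal] AS
( [Measures].[Volume]
, [Product].[Product Name].[All]
, [Volume Type].[Volume Type Id].[All] )
This returns the Volume at the All level of Product Name and Volume Type Id, while ProductKey and Year still come from the current row context.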
The most performant way to do this is to add another Measure Group to your cube, and then remove the keys that don't apply to the measure from that Measure Group. This way, you get a native calculation for these rather than a run-time calculation (which tends to be slow, especially when you're adding up everything in your cube). Moreover, you can even set up some aggregation design on that Measure Group, and it will be very performant.
I have a many-to-many dimension in my cube (next to other regular dimensions). When I want to exclude fact rows in my row count measure, I usually do something like the following in MDX
SELECT [Measures].[Row Count] on 0
FROM cube
WHERE ([dimension].[attribute].Children - [dimension].[attribute].&[value])
This might seem more complicated than needed in this simple example, but in this case the WHERE clause can sometimes grow, also including UNIONs.
So this works for regular dimensions, but now I have a many-to-many dimension. If I perform the trick above, it does not produce the desired result: I want to exclude all rows that have that specific attribute value in the many-to-many dimension.
Actually, it does exactly what the MDX asks: count all rows, but ignore the specified attribute. Since a row in the fact table can have multiple attribute values in a many-to-many dimension, the row will still be counted.
That's not what I need; I need it to explicitly exclude rows that have that dimension attribute value. Also, I might want to exclude multiple values. So what I need is something similar to T-SQL's WHERE ... NOT IN (...).
I realize that I can just subtract the resulting value for [attribute].&[value] from the one for [attribute].[All], but that won't work any more when UNIONing multiple WHERE statements.
Anybody got a good idea on how to solve this?
Thanks in advance,
Delta
I have not tested this, but I think you could do this if you had an attribute that was at the same level of granularity as the rows (so probably implemented as a fact relationship).
So if you wanted to count the number of orders that did NOT have a product category of bikes (assuming an M2M relationship between OrderID and Category), then something like the following should work (you can find more info on the EXISTS function in Books Online):
[Orders].[Order ID].[Order ID].Members
- EXISTS([Orders].[Order ID].[Order ID].Members
, [Product].[Category].&[Bikes]
, "Order Facts")
Although it could be quite slow, as this sort of query forces the SSAS engine to add up a lot of facts at a low level.
Have you tried the EXCEPT function? Its syntax is like the following:
EXCEPT({the set I want}, {a set of members I don't want})
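Applied to the question's example, it could look something like this (a sketch reusing the placeholder names from the question):
SELECT [Measures].[Row Count] ON 0
FROM [cube]
WHERE EXCEPT([dimension].[attribute].Children,
{[dimension].[attribute].&[value1], [dimension].[attribute].&[value2]})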
You could use the Filter function:
SELECT [Measures].[Row Count] on 0
FROM [cube]
WHERE Filter([dimension].[attribute].Children, [dimension].[attribute].CurrentMember.MemberValue <> value)