Explanation of Set vs Tuple in MDX - mdx

It seems I am struggling to understand the difference between a set and a tuple in MDX. I've read very fancy definitions comparing the two, but the only difference to me seems that 'A set has the same-type members' and 'A tuple has non-same-type members'. Other than that, any definition I read or come across (talking about dimensional space or what-not) seems to make no sense. The 'one-item' I get:
# Tuple
[Team].[Hierarchy].[Code].[DET]
And then multiple items with that same type (dimensionality) is a set, ok:
{[Team].[Hierarchy].[Code].[DET], [Team].[Hierarchy].[Code].[DAL]}
But here are a few examples that don't make sense to me:
# How is this a set? It just has two exact same items!
{[Team].[Hierarchy].[Code].[DET], [Team].[Hierarchy].[Code].[DET]}
And another example:
# Tuple (again, same thing -- now adding a duplicate attribute)
(
{[Team].[Hierarchy].[Code].[DET],[Team].[Hierarchy].[Code].[DET]},
[Team].[Name].[Name].[Detroit Lions]
)
Now since both of these are almost doing the same thing (and neither references a measure, so neither would be self-sufficient to pull a 'value'), what is the actual difference between a tuple and a set? These seem to be so loosely defined in the language (for example, above I can have duplicate members in a set, which is usually not allowed in a set).
A related question (some of the answers cover the basics of a one-level set/tuple difference but don't go into too much detail on nesting): Difference between tuple and set in mdx. Also, most of the links on that page are broken.

An MDX set is an ordered collection of 0 or more tuples with the same dimensionality (note that a member is considered to be a tuple containing a single element). Unlike a mathematical set, an MDX set may contain duplicates; it is more like a list of elements. More details here.
And, as a refresher on MDX concepts, here is a gentle introduction to MDX.
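To make the definition above concrete, here is a toy model in Python (not MDX, and not any engine's real implementation). The names `make_tuple`, `make_set` and `dimensionality` are hypothetical helpers invented for illustration; members are modeled as (hierarchy, name) pairs.

```python
# Toy model of the definitions above; all helper names are hypothetical.
DET = ("[Team].[Hierarchy]", "DET")
DAL = ("[Team].[Hierarchy]", "DAL")
LIONS = ("[Team].[Name]", "Detroit Lions")


def dimensionality(tup):
    """The ordered list of hierarchies a tuple draws its members from."""
    return tuple(hierarchy for hierarchy, _ in tup)


def make_tuple(*members):
    """A tuple holds at most one member per hierarchy."""
    hierarchies = [hierarchy for hierarchy, _ in members]
    if len(set(hierarchies)) != len(hierarchies):
        raise ValueError("a tuple may hold at most one member per hierarchy")
    return tuple(members)


def make_set(*tuples):
    """An ordered collection of tuples that share one dimensionality.

    Stored as a list, so order is kept and duplicates are allowed.
    """
    if len({dimensionality(t) for t in tuples}) > 1:
        raise ValueError("all tuples in a set must share the same dimensionality")
    return list(tuples)


# Two identical one-member tuples still form a valid (two-element) set.
duplicates = make_set(make_tuple(DET), make_tuple(DET))

# A tuple combines members from different hierarchies.
cell = make_tuple(DET, LIONS)
```

In this model the "set" with two identical items from the question is simply a list of length two, which is why duplicates are not an error.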


How to constrain dtw from dtw-python library?

Here is what I want to do:
keep a reference curve unchanged (only shift and stretch a query curve)
constrain how many elements are duplicated
keep both start and end open
I tried:
dtw(ref_curve,query_curve,step_pattern=asymmetric,open_end=True,open_begin=True)
but I cannot constrain how the query curve is stretched
dtw(ref_curve,query_curve,step_pattern=mvmStepPattern(10))
it didn’t do anything to the curves!
dtw(ref_curve,query_curve,step_pattern=rabinerJuangStepPattern(4, "c"),open_end=True, open_begin=True)
I liked this one the most but in some cases it shifts the query curve more than needed...
I read the paper (https://www.jstatsoft.org/article/view/v031i07) and the API but still don't quite understand how to achieve what I want. Any other options to constrain number of elements that are duplicated? I would appreciate your help!
To clarify: we are talking about functions provided by the DTW suite packages at dynamictimewarping.github.io. The question is in fact language-independent (and may be better suited to the Cross Validated Stack Exchange).
The pattern rabinerJuangStepPattern(4, "c") you have found does in fact satisfy your requirements:
it's asymmetric, and each step advances the reference by exactly one step
it's slope-limited between 1/2 and 2
it's type "c", so can be normalized in a way that allows open-begin and open-end
If you haven't already, check out dtw.rabinerJuangStepPattern(4, "c").plot().
It goes without saying that in all cases what you are getting is the optimal alignment, i.e. the one with the least accumulated distance among all allowed paths.
As an alternative, you may consider the simpler asymmetric recursion -- as in your first attempt above -- constrained with a global warping window: see dtw.window and the window_type argument. This provides constraints of a different shape (and flexible size), which might suit your specific case.
PS: edited to add that the asymmetricP2 recursion is also similar to RJ-4c, but with a more constrained slope.
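To illustrate what a global warping window does, here is a minimal pure-Python sketch. It is not the dtw package's implementation: the function name `windowed_asymmetric_dtw` is hypothetical, both ends are closed for simplicity, and the band is a plain |i - j| <= window check. In this sketch every step advances `x` by exactly one element while `y` advances by 0, 1 or 2.

```python
def windowed_asymmetric_dtw(x, y, window):
    """Accumulated distance of the best path from (0, 0) to (len(x)-1, len(y)-1).

    Hypothetical helper for illustration only. Cells with |i - j| > window
    are forbidden; float('inf') is returned when no allowed path exists.
    """
    inf = float("inf")
    n, m = len(x), len(y)
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = abs(x[0] - y[0])
    for i in range(1, n):
        for j in range(m):
            if abs(i - j) > window:
                continue  # outside the band: not an allowed cell
            # Predecessors: y index advanced by 0, 1 or 2 while x advanced by 1.
            best = min(cost[i - 1][k] for k in (j, j - 1, j - 2) if k >= 0)
            if best < inf:
                cost[i][j] = abs(x[i] - y[j]) + best
    return cost[n - 1][m - 1]


stretched = [0, 0, 1, 2, 3, 3]
target = [0, 1, 2, 3]
wide = windowed_asymmetric_dtw(stretched, target, window=2)
narrow = windowed_asymmetric_dtw(stretched, target, window=0)
```

With a band of 2 the stretched curve can be aligned at zero cost; with a band of 0 only the diagonal is allowed, so the two curves of different length cannot be aligned at all. Widening or narrowing the band is therefore one way to limit how far the warping may drift.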

how to get surrogate variables in rpart

I have looked everywhere I can, but I couldn't find an answer to my question regarding the rpart package.
I have built a regression tree using rpart with around 700 variables. I want to get the variables actually used to build the tree, including the surrogates. I can find the actual variables used using tree$variable.importance, but I also have to get the surrogates because I need them to predict on my test set. I do not want to keep all 700 variables in the test set because the data is very large (20 million observations) and I am running out of memory.
The list variable.importance in an rpart object does show the surrogate variables, but it only shows the top variables limited by a minimum importance value.
The matrix splits in an rpart object lists all of the split variables and their surrogate variables, along with some other data such as the index, the value on which it splits (for a continuous variable) or the categories that are split (for a categorical variable), and a count of how many observations the split applies to. It doesn't give a hierarchy of which surrogates apply to which split, but it does list every variable. To get the hierarchy, you have to call summary(rpart_object).

Genetic algorithms for guillotine cut optimization

I've been revisiting genetic algorithms with encoding, optimizing and decoding. My first attempt was the travelling salesman problem with ordered crossover, which worked great. I then found an article that optimizes a more complex genome for a 2D packing problem.
The author encodes the problem using reverse Polish notation, which made sense. It uses a combination of parts and either V or H as operators.
E.g. 34H5V
When decoding, the stack has to resolve to a single stack element, which is my final layout. That means that at any point the number of operators so far must be at most one less than the number of parts so far, and exactly one less at the end. The author then states that he used a mixed crossover: an ordered crossover on the parts and a binary crossover on the operators.
I mulled this over, but I cannot understand how he separates the parts and operators before crossing over and then recombines them before evaluating performance, and the article offers few details. I considered a binary crossover that replaces the parts with an "X" to keep the relative positions so that they can be recombined afterwards, but then the relationship between operators and parts no longer holds.
Does anyone have a resource that deals with a similar scenario, or has anyone used this approach successfully?
This looked way more difficult than it actually was. When the original population is generated, you need to adhere to the limitations set out by postfix notation. When a crossover occurs you simply build a mask of the parent.
E.g. xxxxooxoxx
Here x is an object and o is an operator. Once you have the mask holding the positions, you can create one string containing only the operators and one containing only the objects. The operators can be recombined with a binary crossover and the objects with a partial-map crossover. Once done, you fill the mask with the values in the order they appear in each group. Since the mask was valid, the progeny is valid too.
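The mask idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the article's code: all function names are hypothetical, parts are integers, the operators are the strings "H" and "V", and a simple order-preserving crossover stands in for the partial-map crossover mentioned above (both keep the parts a valid permutation).

```python
import random

OPERATORS = {"H", "V"}


def is_valid_postfix(tokens):
    """True if the token list reduces to exactly one stack element."""
    depth = 0
    for token in tokens:
        if token in OPERATORS:
            if depth < 2:
                return False  # an operator needs two operands on the stack
            depth -= 1
        else:
            depth += 1
    return depth == 1


def order_crossover(parts1, parts2, rng):
    """Keep a random slice of parts1, fill the rest in parts2's order."""
    a, b = sorted(rng.sample(range(len(parts1) + 1), 2))
    kept = set(parts1[a:b])
    filler = iter(p for p in parts2 if p not in kept)
    return [parts1[i] if a <= i < b else next(filler) for i in range(len(parts1))]


def one_point_crossover(ops1, ops2, rng):
    """Binary (one-point) crossover on two equally long operator lists."""
    cut = rng.randint(0, len(ops1))
    return ops1[:cut] + ops2[cut:]


def mask_crossover(parent1, parent2, rng):
    """Cross parts and operators separately, then refill parent1's mask."""
    mask = [token in OPERATORS for token in parent1]
    parts = iter(order_crossover(
        [t for t in parent1 if t not in OPERATORS],
        [t for t in parent2 if t not in OPERATORS], rng))
    ops = iter(one_point_crossover(
        [t for t in parent1 if t in OPERATORS],
        [t for t in parent2 if t in OPERATORS], rng))
    return [next(ops) if is_operator else next(parts) for is_operator in mask]


parent1 = [3, 4, "H", 5, "V", 1, 2, "V", "H"]
parent2 = [1, 2, 3, "V", 4, 5, "H", "H", "V"]
child = mask_crossover(parent1, parent2, random.Random(0))
```

Because the child reuses parent1's operator/part positions, it is a valid postfix expression whenever parent1 is, regardless of which parts and operators end up in the slots.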
The only issue is reaching all the possible arrangements, because without extra work the offspring are limited to the existing masks. He solves this with a swap mutation governed by the mutation rate:
Select an item at random.
If the item is an operator, then either
A. Switch the operator to the other kind, or
B. Select another item. If it is an object, make sure the postfix requirements are still met and, if so, swap the two.

Additional PlanningEntity in CloudBalancing - bounded-space situation

I successfully amended the nice CloudBalancing example to include the fact that I may only have a limited number of computers open at any given time (thanks, OptaPlanner team - easy to do). I believe this is referred to as a bounded-space problem. It works dandy.
The processes come in groupwise, say 20 processes in a given order per group. I would like to amend the example to have optaplanner also change the order of these groups (not the processes within one group). I have therefore added a class ProcessGroup in the domain with a member List<Process>, the instances of ProcessGroup being stored in a List<ProcessGroup>. The desired optimisation would shuffle the members of this List, causing the instances of ProcessGroup to be placed at different indices of the List List<ProcessGroup>. The index of ProcessGroup should be ProcessGroup.index.
The documentation states that "if in doubt, the planning entity is the many side of the many-to-one relationship." This would mean that ProcessGroup is the planning entity, the member index being a planning variable, getting assigned to (hopefully) different integers. After every new assignment of indices, I would have to re-sort the list List<ProcessGroup> in ascending order of ProcessGroup.index. This seems very odd and cumbersome. Any better ideas?
Thank you in advance!
Philip.
The current design has a few disadvantages:
It requires 2 (genuine) entity classes (each with 1 planning variable): this probably increases the search space (= longer to solve, more difficult to find a good or even feasible solution), and it increases configuration complexity. Don't use multiple genuine entity classes if you can reasonably avoid it.
The Integer variables of ProcessGroup need to be all different and somehow sequential. That smells like a chained planning variable (see the docs about chained variables and the Vehicle Routing example), in which case the entire problem could be represented as a simple VRP with just 1 variable, but does that really apply here?
Train of thought: there's something off in this model:
ProcessGroup has an Integer variable: What does that Integer represent? Shouldn't that Integer variable be on Process instead? Are you ordering Processes or ProcessGroups? If it should be on Process instead, then both of Process's variables can be replaced by a chained variable (like VRP), which will be far more efficient.
ProcessGroup has a list of Processes, but that is a problem property, which means it doesn't change during planning. I suspect that's correct for your use case, but do assert it.
If none of the reasoning above applies (which would surprise me), then the original model might be valid nonetheless :)

Grammatically correct double-noun identifiers, plural versions

Consider compounds of two nouns, which in natural English would most often appear in the form "noun of noun", e.g. "direction of light", "output of a filter". When programming, we usually write "LightDirection" and "FilterOutput".
Now, I have a problem with plural nouns. There are two cases:
1) singular of plural
e.g. "union of (two) sets", "intersection of (two) segments"
Which is correct, SetUnion and SegmentIntersection or SetsUnion and SegmentsIntersection?
2) plural of plural
There are two subcases:
(a) Many elements, each having many related elements, e.g. "outputs of filters"
(b) Many elements, each having single related element, e.g. "directions of vectors"
Shall I use FilterOutputs and VectorDirections or FiltersOutputs and VectorsDirections?
I suspect the first version is correct (FilterOutputs, VectorDirections), but I think it may lead to ambiguities, e.g.
FilterOutputs - many outputs of a single filter or many outputs of many filters?
LineSegmentProjections - projections of many segments or many projections of a single segment?
What are the general rules I should follow?
There's a grammatical misunderstanding lying behind this question. When we turn a phrase of form:
1. X of Y
into
2. Y X
the Y changes grammatical role from a noun in the possessive (1) to an adjective in the attributive (2). So while one may pluralise both X and Y in (1), one may only pluralise X in (2), because Y in (2) is an adjective, and adjectives do not have grammatical number.
Hence, e.g., SetsUnion is not in accordance with English. You're free to use it if it suits you, but you are courting unreadability, and I advise against it.
Postscript
In particular, consider two other possessive constructions, first the old-fashioned construction using the possessive pronoun "its", singular:
3a. Y, its X
the equivalent plural:
4a. Ys, their X
and their contractions, with 4b much less common than 3b:
3b. Y's X
4b. Ys' X
Here, SetsUnion suggests it is a rendering of the singular possessive type (3) Set's Union (=Set, its Union), where you intended to communicate the plural possessive (4) Sets, their Union (contracted to the less common Sets' Union).
So it's actively misleading.
Unless you're getting hamstrung by a convention driven system (ruby on rails, cakePHP etc), why not use OutputsOfFilters, UnionOfSets etc? They may not be conventional but they may be clearer.
For example, it's pretty clear that ProjectionOfLineSegments and ProjectionsOfLineSegment are different things, or even ProjectionsOfLineSegments....
Using plural forms of nouns can make them more difficult to read.
When you have a number of things, they are usually stored in a data structure - an array, a list, a map, a set, etc. - generically called a collection or abstract data type. The interface to a collection of items is typically part of the programming environment (e.g. Collections in Java and .NET, the STL in C++) and is well understood by developers to involve quantities of items.
You can avoid pluralizing your nouns, and make the fact that you are dealing with multiple quantities explicit, and indicate how they are accessed by incorporating the name of the collection. For example,
VectorDirectionList - the vectors and their directions are listed, e.g. some kind of Pair type. Works particularly well if you have a VectorDirection, combining a Vector and a Direction.
VectorDirectionMap - if the vector directions are mapped from vector.
Because it's a collection type, dealing with multiple objects is understood, as it is inherent in a collection type. That puts it in the same class as SetUnion - a union always involves at least 2 sets, and a VectorDirectionList makes it clear there can be more than one VectorDirection.
I agree about avoiding homonyms where the word has more than one word class, e.g. Filter (and actually Set, although to my mind Set would not really be used in a class name as a verb, so I interpret it as a noun). I originally wrote this using FilterOutput as an example, but it didn't read well. Using a compound for Filter may help disambiguate - e.g. ImageFilterOutputs (or, applying my own advice, ImageFilterOutputList).
Avoiding plural forms with class names seems natural when you consider that an instance of a class is itself always one item - "an instance". If we use a plural name, then we get a mismatch - an instance trying to imply that it is multiple things - it itself is just one thing, even if it references multiple other things. The collection naming above builds on this - you have an instance which is a list, a map etc so there is no mismatch.
I'm assuming you are talking about programming language constructs, although the same thinking applies to tables/views. These are understood to involve quantities of items, and table names are consequently often singular (Customer, Order, Item) even though they store multiple rows. Many-to-many mapping tables are usually compounds of the entities being related, e.g. relating orders to items - OrderItem. In my experience, using plurals for table names makes the SQL difficult to read.
To sum up, I would avoid plural forms as they make reading harder. There are sure to be cases where they are unavoidable - where using the plural form is more readable than creating a huge name of nested entities and collections - but these are the exception rather than the rule.
What are the general rules I should follow?
Make it Clear -- for both visual and aural thinkers.
Make it Specific but Accurate.
Make it pass the "crowded room" or "emergency phone call" test.
To illustrate with the SetsUnion example:
"SetsUnion" is right out; It's easily confused for a typo and speaking it (even in your head) will confuse it for "Set's Union" (Or worse).
The plural is also implied, so the 2nd 's' is redundant.
SetUnion is better but still ambiguous.
UnionOfSets is clearer and should be the bare minimum standard.
But all of these, so far, are uselessly vague (unless you are working with pure mathematical theory).
The term really should be specific. For example, "Red cars", "Programmers who spent too much time on esoterica", etc.
These are all unions of sets, but they tell you something useful. ;-)
Finally, Phil Factor had the right of it. To paraphrase:
Can you shout a (term) out across a crowded room and have it keyed in, and successfully (used), by a listener at the other side?
Try yelling, "SetsUnion," or even, "UnionOfSets," across a packed Irish bar. ;-)
1) I would use SetUnion and SegmentIntersection because I think in this case the plurality is implied anyway and it just looks nicer that way.
2) Again, I would use FilterOutputs and VectorDirections, for the same reason. You could always use MultipleFilterOutputs if you want to be more specific.
But ultimately it's entirely down to your personal preference.
I think that while general naming conventions and consistency are important, in a very, very tight/tricky algorithm clarity should trump convention. If it helps, use veryLongAndDescriptiveIdentifiers.
What's wrong with Union()?
Moreover, "union of sets" turns into "sets' union" (the two sets' union is ...); I'm sure I'm not the only person who's okay with CamelCase but not CamelsCaseMinusApostrophes. If it needs an apostrophe to make sense, don't use it. Set.Union() reads exactly like "union of set(s)".
Mathematicians will also say "the (set) union of A and B", or rarely "A and B's (set) union". "The sets' union of A and B" makes no sense!
Most people will also see Vector[] vectors and Directions[] vectorDirections and assume that vectors[i] corresponds to vectorDirections[i]. If things really get ambiguous, I use something like vector_by_index and vectorDirection_by_index. Then you can have Map<Filter,Output> output_by_filter or Map<Filter,Output[]> outputs_by_filter, which makes it very obvious what the key is (this is very important in Objective-C where it's completely non-obvious what type the keys or values are).
If you really want, you can add an s and get vectors_by_index, but then consistency gives you the silly outputss_by_filter.
The right thing is, of course, something like struct FilterState { Filter filter; Output[] outputs; }; FilterState[] filterStates;.
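As a rough Python rendering of that last suggestion (the field types and sample values are assumptions made up for illustration), grouping each filter with its own outputs removes the question of which plural refers to what:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FilterState:
    """One filter together with the outputs that belong to it."""
    filter_name: str                       # hypothetical stand-in for a Filter object
    outputs: List[float] = field(default_factory=list)


# Many filters, each with its own (possibly many) outputs.
filter_states = [
    FilterState("blur", [0.1, 0.2]),
    FilterState("sharpen", [0.7]),
]
```

Here `filter_states[0].outputs` can only mean "the outputs of that one filter", so no identifier has to encode both pluralities at once.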
I'd suggest singular for the first word: SetUnion, VectorDirections, etc.
Do a quick class search in your IDE, for: Strings*, Sets*, Vectors*, Collections*
Anyway, whatever you choose, be consistent throughout the whole application.