What is the best order of arguments for spatially indexed functions of sf (st_intersects, etc.)? - indexing

It is explained on https://www.r-spatial.org/r/2017/06/22/spatial-index.html that "The R tree is constructed on the first argument (x), and used to match all geometries on the second argument (y) of binary functions". However, it is finally not explained what are the criteria that could guide the user in the choice of the sf object to put in the first argument. Are there some rules of thumb for this choice?

Generally, with creating spatial indexes you want to start with a larger envelopes that decompose to smaller envelopes. So, if you are extending from that then starting with a geometry with a larger extent then building smaller envelopes from that would match the general schema of how these selections are optimized.

Related

Does it make sense to use big-O to describe the best case for a function?

I have an extremely pedantic question on big-O notation that I would like some opinions on. One of my uni subjects states “Best O(1) if their first element is the same” for a question on checking if two lists have a common element.
My qualm with this is that it does not describe the function on the entire domain of large inputs, rather the restricted domain of large inputs that have two lists with the same first element. Does it make sense to describe a function by only talking about a subset of that function’s domain? Of course, when restricted to that domain, the time complexity is omega(1), O(1) and therefore theta(1), but this isn’t describing the original function. From my understanding it would be more correct to say the entire function is bounded by omega(1). (and O(m*n) where m, n are the sizes of the two input lists).
What do all of you think?
It is perfectly correct to discuss cases (as you correctly point out, a case is a subset of the function's domain) and bounds on the runtime of algorithms in those cases (Omega, Oh or Theta). Whether or not it's useful is a harder question and one that is very situation-dependent. I'd generally think that Omega-bounds on the best case, Oh-bounds on the worst case and Theta bounds (on the "universal" case of all inputs, when such a bound exists) are the most "useful". But calling the subset of inputs where the first elements of each collection are the same, the "best case", seems like reasonable usage. The "best case" for bubble sort is the subset of inputs which are pre-sorted arrays, and is bound by O(n), better than unmodified merge sort's best-case bound.
Fundamentally, big-O notation is a way of talking about how some quantity scales. In CS we so often see it used for talking about algorithm runtimes that we forget that all the following are perfectly legitimate use cases for it:
The area of a circle of radius r is O(r2).
The volume of a sphere of radius r is O(r3).
The expected number of 2’s showing after rolling n dice is O(n).
The minimum number of 2’s showing after rolling n dice is O(1).
The number of strings made of n pairs of balanced parentheses is O(4n).
With that in mind, it’s perfectly reasonable to use big-O notation to talk about how the behavior of an algorithm in a specific family of cases will scale. For example, we could say the following:
The time required to sort a sequence of n already-sorted elements with insertion sort is O(n). (There are other, non-sorted sequences where it takes longer.)
The time required to run a breadth-first search of a tree with n nodes is O(n). (In a general graph, this could be larger if the number of edges were larger.)
The time required to insert n items in sorted order into an initially empty binary heap is On). (This can be Θ(n log n) for a sequence of elements that is reverse-sorted.)
In short, it’s perfectly fine to use asymptotic notation to talk about restricted subcases of an algorithm. More generally, it’s fine to use asymptotic notation to describe how things grow as long as you’re precise about what you’re quantifying. See #Patrick87’s answer for more about whether to use O, Θ, Ω, etc. when doing so.

Explanation of Set vs Tuple in MDX

It seems I am struggling to understand the difference between a set and a tuple in MDX. I've read very fancy definitions comparing the two, but the only difference to me seems that 'A set has the same-type members' and 'A tuple has non-same-type members'. Other than that, any definition I read or come across (talking about dimensional space or what-not) seems to make no sense. The 'one-item' I get:
# Tuple
[Team].[Hierarchy].[Code].[DET]
And then multiple items with that same type (dimensionality) is a set, ok:
{[Team].[Hierarchy].[Code].[DET], [Team].[Hierarchy].[Code].[DAL]}
But here are a few examples that don't make sense to me:
# How is this a set? It just has two exact same items!
{[Team].[Hierarchy].[Code].[DET], [Team].[Hierarchy].[Code].[DET]}
And another example:
# Tuple (again, same thing -- now adding a duplicate attribute
(
{[Team].[Hierarchy].[Code].[DET],[Team].[Hierarchy].[Code].[DET]},
[Team].[Name].[Name].[Detroit Lions]
)
Now since both of these are almost doing the same thing (and neither references a measure, so neither would be self-sufficient to pull a 'value'), what is the actual difference between a tuple and a set? These seem to be so loosely defined in the language (for example, above I can have duplicate members in a set, which is usually not allowed in a set).
A related question (some of the answers cover the basics of a one-level set/tuple difference but don't go into too much detail on nesting): Difference between tuple and set in mdx. Also, most of the links on that page are broken.
MDX sets are an ordered collection of 0 or more tuples (note that a member is considered to be a tuple containing a single element) with the same dimensionality. Unlike a mathematical set, an MDX set may contain duplicates, it is more of a list of elements. More details here.
And perhaps as a refresh for MDX concepts here is a gentle introduction of MDX.

How to constrain dtw from dtw-python library?

Here is what I want to do:
keep a reference curve unchanged (only shift and stretch a query curve)
constrain how many elements are duplicated
keep both start and end open
I tried:
dtw(ref_curve,query_curve,step_pattern=asymmetric,open_end=True,open_begin=True)
but I cannot constrain how the query curve is stretched
dtw(ref_curve,query_curve,step_pattern=mvmStepPattern(10))
it didn’t do anything to the curves!
dtw(ref_curve,query_curve,step_pattern=rabinerJuangStepPattern(4, "c"),open_end=True, open_begin=True)
I liked this one the most but in some cases it shifts the query curve more than needed...
I read the paper (https://www.jstatsoft.org/article/view/v031i07) and the API but still don't quite understand how to achieve what I want. Any other options to constrain number of elements that are duplicated? I would appreciate your help!
to clarify: we are talking about functions provided by the DTW suite packages at dynamictimewarping.github.io. The question is in fact language-independent (and may be more suited to the Cross-validated Stack Exchange).
The pattern rabinerJuangStepPattern(4, "c") you have found does in fact satisfy your requirements:
it's asymmetric, and each step advances the reference by exactly one step
it's slope-limited between 1/2 and 2
it's type "c", so can be normalized in a way that allows open-begin and open-end
If you haven't already, check out dtw.rabinerJuangStepPattern(4, "c").plot().
It goes without saying that in all cases you are getting is the optimal alignment, i.e. the one with the least accumulated distance among all allowed paths.
As an alternative, you may consider the simpler asymmetric recursion -- as your first attempt above -- constrained with a global warping window: see dtw.window and the window_type argument. This provides constraints of a different shape (and flexible size), which might suit your specific case.
PS: edited to add that the asymmetricP2 recursion is also similar to RJ-4c, but with a more constrained slope.

Genetic algorithms for guillotine cut optimization

Ive been revisiting genetic algorithms with encoding, optimizing and decoding. My first attempt was the travelling salesman with ordered cross over which worked great. I found an article that tried to optimize a more complex genome while optimizing a 2d packing problem.
The author encodes the problem using reverse polish notation that made sense. It uses a combination of parts and either V Or H as opertors.
Ie 34H5V
With decoding the stack having to be resolved to one stack element that is my final layout. That being said, the number of operater up until a certain point must be 1 less than the number of parts up until the same point. The author then states that he used a mixed cross over by using an ordered cross over on the parts and binary crossover for the operators.
I mulled this over but i cannot understand how he seperates the parts and operators before crossing over and then recombines them before evaluating performance and they offer little details. If a binary cross over occured replacing parts with an "X" to keep the relative positions so they can be recombined after crossover but the relationship between operator and parts doesnt hold true.
Does anyone perhaps have a resource that has dealt with a similar scenario or perhaps has used this successfully.
This looked way more difficult than it actually was. When the original population is generated, you need to adhere to the limitations set out by postfix notation. When a crossover occurs you simply build a mask of the parent
Ie xxxxooxoxx
Where x is an object and o is an operaror. Once you have the mask holding the positions you can create a sting only of operators and one only of objects. The operators can be done with a binary cross over and the objects as partial map crossover. Once done you fill the mask with the value in the order they appear in each group. Since the mask was valid, the progeny is valid too.
The only issue ia getting all the possible arrangements because without it, it will all be limited to the masks. He solves this by doing a swap mutation dictated by the mutation rates.
Select an item at random.
If the item is an operator then
A. Swithc the operator to another kind
B. Select another. If its an object then make sure the requirementa are met and if so then switch.

SQL Server 2008+ : Best method for detecting if two polygons overlap?

We have an application that has a database full of polygons (currently stored as points) that a .net app pulls out and checks if they overlap.
I occurred to me that it would be much nicer to convert these point arrays to polygon / polyline objects within the database and use sql to get a bool of weather they overlap or not.
I have seen different methods suggested to do this but non of the examples given were quite in-line with my needs.
I would be very happy to receive input from those kind enough to offer their experience.
Additional:
In response to questions: It is indeed 2D. and yes any crossover of the two is considered true. The polygons have n points and can be concave. The polygons will be saved as 1 per row (after data conversion task) as polygons (i.e. the polygon type .. it might be called something else spatial / geom my memory is not on my side right now)
You can use .STIntersection with .STAsText() to test for overlapping polygons. (I really hate the terminology Microsoft has used (or whoever set the standard terms). "Touching," in my mind, should be a test for whether or not two geometry/geography shapes overlap at all, not just share a border.)
Anyway....
If #RadiusGeom is a geometry representing a radius from a point, the following will return a list of any two polygons where an intersection (a geometry that represents the area where two geometries overlap) is not empty.
SELECT CT.ID AS CTID, CT.[Geom] AS CensusTractGeom
FROM CensusTracts CT
WHERE CT.[Geom].STIntersection(#RadiusGeom).STAsText() <> 'GEOMETRYCOLLECTION EMPTY'
If your geometry field is spatially indexed, this runs pretty quickly. I ran this on 66,000 US CT records in about 3 seconds. There may be a better way, but since no one else had an answer, this was my attempt at an answer for you. Hope it helps!
Calculate and store the bounding rectangle of each polygon in a set of new fields within the row which is associated with that polygon. (I assume you have one; if not, create one.) When your dotnet app has a polygon and is looking for overlapping polygons, it can fetch from the database only those polygons whose bounding rectangles overlap, using a relatively simple SQL SELECT statement. Those polygons should be relatively few, so this will be efficient. Then, your dotnet app can perform the finer polygon overlap calculations in order to determine which ones of those really overlap.
Okay, I got another idea, so I am posting it as a different answer. I think my previous answer with the bounding polygons probably has some merit on its own, even if it was to reduce the number of polygons fetched from the database by a small percentage, but this one is probably better.
MSSQL supports integration with the CLR since version 2005. This means that you can define your own data type in an assembly, register the assembly with MSSQL, and from that moment on MSSQL will be accepting your user-defined data type as a valid type for a column, and it will be invoking your assembly to perform operations with your user-defined data type.
An example article for this technique on the CodeProject: Creating User-Defined Data Types in SQL Server 2005
I have never used this mechanism, so I do not know details about it, but I presume that you should be able to either define a new operation on your data type, or perhaps overload some existing operation like "less-than", so that you can check if one polygon intersects another. This is likely to speed things up a lot.