I am sure there must be something in the library I could use, but I couldn't find anything. Any help will be appreciated.
Thanks.
In version 4.8 there is no equality operator for Polygon_with_holes_2 or General_polygon_with_holes_2. However, you can compute their symmetric difference.
Alternatively, if you care about performance, you can compare the outer boundaries (if they exist); then obtain, say, the leftmost vertex of every hole of both polygons with holes, place them in two sequences, sort both, and compare.
If you care a lot about performance, you can precede the full comparisons with comparisons of the bounding boxes.
Notice that Polygon_2, though, does have equality ('=='), left(), and bbox() operators.
First off, at the moment I'm not looking for alternate suggestions, just a yes or a no, and if it's a yes, what the name is.
Is there any SQL DBMS that allows you to create "spatial" indexes using arbitrary (i.e. non-geometric) data types like integers, dates, etc.? While spatial indexes are most commonly used for location data, they can also be used to properly index queries where you need to search within two or more ranges.
For example (and this is just a made-up example): say you had a database of customer receipts, and you wanted to find all transactions between $10 and $1000 which took place between 2000-01-01 and 2005-03-01 (see the sketch below). The fact that you're searching within multiple ranges means that regular b-tree indexes cannot be used to perform this lookup efficiently, at least not in a way that's scalable.
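In SQL terms, that made-up example would look something like this (the receipts table and its columns are hypothetical):

SELECT *
FROM receipts
WHERE amount BETWEEN 10.00 AND 1000.00
  AND transaction_date BETWEEN '2000-01-01' AND '2005-03-01';

-- A b-tree on (amount, transaction_date) can narrow only the leading range;
-- every amount in [10, 1000] must still be scanned for matching dates.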
Now yes, for the specific example I provided, and probably many other cases, you could likely come up with some tricks to do it efficiently using b-tree indexes, or at the very least narrow things down; I'm well aware. But again, I'm not looking for alternate suggestions, just a no, or a yes and the name.
Appreciate any help you all can provide
EDIT: Just to clarify: I'm using the term "spatial index" because it's the most common term for this kind of index and the most commonly implemented use case. I am, however, referring to any index which uses quadtrees, R-trees, etc. to achieve the same or similar effect. The premise is illustrated in the sketch below.
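To illustrate that premise (all names and the date encoding here are made up; MySQL-style syntax, whose spatial indexes only accept geometry types): the two ranges of the receipt example can be mapped onto synthetic 2-D points, turning the search into a rectangle containment test that a single R-tree index serves on both dimensions at once.

CREATE TABLE receipt_points (
  receipt_id INT NOT NULL PRIMARY KEY,
  amount_date_pt POINT NOT NULL,  -- x = dollar amount, y = TO_DAYS(transaction_date)
  SPATIAL INDEX (amount_date_pt)
) ENGINE=MyISAM;

-- $10-$1000 and 2000-01-01..2005-03-01 as a bounding box
-- (730485 = TO_DAYS('2000-01-01'), 732371 = TO_DAYS('2005-03-01')):
SELECT receipt_id
FROM receipt_points
WHERE MBRContains(
  GeomFromText('POLYGON((10 730485, 1000 730485, 1000 732371, 10 732371, 10 730485))'),
  amount_date_pt);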
I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" T001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parenthesis as shown above change the evaluation?
In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees. Nor are there guarantees about the order of evaluation. So, DB2 could decide to test the middle one, then the last, and then the first. Still, these conditions are pretty equivalent, so I would expect the ordering to be either first-to-last or last-to-first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describes the result set, but not the steps used to generate it.
You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.
Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...
Just as an indication of how it really works in this simple case.
select
  field1
  , case
      when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%'
      then 1 else 0
    end as expr
from
(
  values
      'abcT001xyz'
    --, '***'
) t (field1);
The query returns the expected result for the statement above as is.
But if you uncomment the commented-out line, you get SQLCODE=-438.
This means that the 2nd OR operand is not evaluated if the 1st returns true.
Note that this is just a demonstration. There is no guarantee that it will work this way in every case.
Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in the sense that they work with arrays in a concise fashion that has not been typical of general-purpose or procedural languages. As with all operators (whether array-based or not), there is typically more than one algorithm capable of implementing each one.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want, not how to get them. But this is true for every language. It is true even for machine-code operators, if you account for how the implementation in circuitry can be varied - and does vary, between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine-code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable not just of accommodating past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative and commutative. That is, the order in which a series of them is applied does not matter - it behaves like the addition operator, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.
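To illustrate the kind of rewrite meant here (a minimal sketch with made-up table and column names): a guard like quantity <> 0 in the same WHERE clause as amount / quantity > 10 is not guaranteed to be evaluated first, but most engines do evaluate the WHEN branches of a CASE expression in order, so the guard can be forced:

SELECT *
FROM orders
WHERE CASE
        WHEN quantity = 0 THEN 0            -- guard: tested first
        WHEN amount / quantity > 10 THEN 1  -- only reached when quantity <> 0
        ELSE 0
      END = 1;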
Do we always need to remove a column for one-hot encoding to prevent multicollinearity?
In the solution here (https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic/comments#138896) it mentions
#Kevin Chang You need to delete one column of the dummy variables to avoid the state of multicollinearity. It's a state of very high correlations among the columns (independent variables), meaning that one can be predicted from the others. It is therefore a type of disturbance in the data, and if present, the statistical conclusions made about the data may not be reliable.
In the solution here, multicollinearity is not catered for:
https://www.kaggle.com/sharmasanthosh/allstate-claims-severity/exploratory-study-on-ml-algorithms
May I know: is it a must, or in what situations do we need to cater for that?
If I have to answer your question "Do we always need to remove a column for one-hot encoding to prevent multicollinearity?", the answer is yes.
The common way to prevent multicollinearity is to remove highly correlated predictors from the model. If you have two or more factors with a high VIF, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't reduce the R-squared.
Or you could use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number of predictors to a smaller set of uncorrelated components.
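To make the collinearity concrete with a small worked example: one-hot encode a three-level category into dummy columns d1, d2, d3. Every row then satisfies d1 + d2 + d3 = 1, and that constant column of ones is exactly the intercept of a regression design matrix, so each dummy is a perfect linear combination of the intercept and the other two (its VIF is infinite). Dropping any single dummy breaks the exact dependency.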
I am writing some Java unit tests and I need to compare two SQL strings where the SQL statements are semantically equivalent but can be syntactically different. I can't do string comparison, since the ordering within the FROM clause and WHERE clause can differ while the two queries remain equivalent.
Is there any way to do this in Java without having to write my own Oracle SQL parser? :)
P.S. The query can be very complicated!
Thank you!
The general answer is NO, because you can always call some kind of stored procedure which hides a Turing machine.
The fact that you can do arithmetic in a SQL statement I think pushes you over the Turing cliff, too.
Of course, theorists always tell us everything is impossible, so we should all roll over and die.
Nah.
So what can you do? Well, a "simple" possibility is to normalize the SQL queries, much as you simplify algebraic equations. If you could somehow "normalize" (convert) a SQL statement into the absolutely shortest equivalent SQL that did the same thing, then you could normalize both SQL statements and compare the resulting statements; if they are equal modulo identifier renaming, then they have the same "semantics". For every operator in SQL, there is some semantics behind it, and some set of equivalent operations, just as there is in algebra. So, if you can determine the set of algebraic equivalences for each SQL operator, you can replace each algebraic computation by the shortest algebraic equivalent which does the same thing.
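For a tiny illustration of that idea (made-up table and column names), the two queries below differ syntactically but not semantically, and a normalizer could rewrite both to one canonical form, e.g. by sorting the operands of commutative operators:

-- Query A
SELECT name FROM employees WHERE salary > 100 AND dept = 'R&D';

-- Query B: same semantics, different syntax
SELECT name FROM employees WHERE dept = 'R&D' AND salary > 100;

-- Canonical form (AND operands sorted lexicographically):
SELECT name FROM employees WHERE dept = 'R&D' AND salary > 100;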
To do this, you have to be able to parse SQL, and apply SQL rewrites to the parsed SQL, which means you need a program transformation engine. (You can see an analog of this at Parsing and Rewriting Algebra)
This doesn't work in all cases. First, there may be several same-length "shortest" SQL statements that are equivalent (2+X is the same as X+2, but that isn't obvious to a tool). Now you've got a theorem-proving problem (for our X+2 example, use the commutative law to prove they are equal), and we're back to being stuck in theory. Second, you may not know how to generate the shortest possible sequence using your rewrites; even math equations sometimes have to swell up before they can get small again. Technically you have to search all possible algebraic equivalences to find the shortest, and that space is impossibly large.
So, hard to do in practice, too. So, NO.
Not a direct solution for your problem, but you may want to look into JSqlParser which may already cover part of what you need.
Performance question ...
I have a database of houses that have geolocation data (longitude & latitude).
What I want to do is find the best way to store the location data in my MySQL (v5.0.24a) database, using the InnoDB engine, so that I can perform a lot of queries that return all the home records between x1 and x2 latitude and y1 and y2 longitude.
Right now, my database schema is
---------------------
Homes
---------------------
geolat - Float (10,6)
geolng - Float (10,6)
---------------------
And my query is:
SELECT ...
WHERE geolat BETWEEN x1 AND x2
AND geolng BETWEEN y1 AND y2
1. Is what I described above the best way to store the latitude and longitude data in MySQL: using Float(10,6) and separating out the longitude/latitude? If not, what is? There exist Float, Decimal, and even Spatial as data types.
2. Is this the best way to perform the SQL from a performance standpoint? If not, what is?
3. Does using a different MySQL database engine make sense?
UPDATE: Still Unanswered
I have 3 different answers below. One person says to use Float. One person says to use INT. One person says to use Spatial.
So I used the MySQL "EXPLAIN" statement to measure the SQL execution speed. It appears that absolutely no difference in SQL execution (result-set fetching) exists between using INT and FLOAT for the longitude and latitude data type.
It also appears that using the "BETWEEN" statement is SIGNIFICANTLY faster than using the ">" and "<" operators; it's nearly 3x faster.
With that being said, I am still uncertain what the performance impact of using Spatial would be, since it's unclear to me whether it's supported in my running version of MySQL (v5.0.24), as well as how I enable it if it is.
Any help would be greatly appreciated.
float(10,6) is just fine.
Any other convoluted storage schemes will require more translation in and out, and floating-point math is plenty fast.
I know you're asking about MySQL, but if spatial data is important to your business, you might want to reconsider. PostgreSQL + PostGIS are also free software, and they have a great reputation for managing spatial and geographic data efficiently. Many people use PostgreSQL only because of PostGIS.
I don't know much about the MySQL spatial system though, so perhaps it works well enough for your use-case.
The problem with using any other data type than "spatial" here is that your kind of "rectangular selection" can (usually, this depends on how bright your DBMS is - and MySQL certainly isn't generally the brightest) only be optimised in one single dimension.
The system can pick either the longitude index or the latitude index, and use that to reduce the set of rows to inspect. But after it has done that, there is a choice of: (a) fetching all found rows and scanning over those, testing for the "other dimension", or (b) doing the similar process on the "other dimension" and then matching those two result sets to see which rows appear in both. This latter option may not be implemented as such in your particular DBMS engine.
Spatial indexes sort of do the latter "automatically", so I think it's safe to say that a spatial index will give the best performance in any case, but it may also be the case that it doesn't significantly outperform the other solutions, and that it's just not worth the bother. This depends on all sorts of things like the volume of and the distribution in your actual data etc. etc.
It is certainly true that float (tree) indexes are by necessity slower than integer indexes, because of the longer time it usually takes to execute '>' on floats than it does on integers. But I would be surprised if this effect were actually noticeable.
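To make the spatial option above concrete, here is a minimal sketch, assuming an engine with spatial index support (MyISAM in the 5.0/5.1 era) and using the pre-5.6 function names; table and column names are made up:

CREATE TABLE homes_spatial (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  location POINT NOT NULL,
  SPATIAL INDEX (location)
) ENGINE=MyISAM;

-- The rectangle becomes a bounding-box containment test; the R-tree index
-- prunes on both dimensions at once. x1..y2 are placeholder coordinates.
SELECT id
FROM homes_spatial
WHERE MBRContains(
  GeomFromText('POLYGON((x1 y1, x2 y1, x2 y2, x1 y2, x1 y1))'),
  location);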
Google uses float(10,6) in their "Store locator" example. That's enough for me to go with that.
https://stackoverflow.com/a/5994082/1094271
Also, starting with MySQL 5.6.x, spatial extension support is much better and comparable to PostGIS in features and performance.
I would store it as integers (INT, 4 bytes) representing 1/1,000,000th of a degree. That would give you a resolution of a few inches.
I don't think there is a usable intrinsic spatial datatype in MySQL at that version (spatial indexes in 5.0 are MyISAM-only).
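A minimal sketch of that scaled-integer approach (table and column names are made up; the BETWEEN bounds are example microdegree values):

-- Microdegrees: degrees * 1,000,000, e.g. latitude 40.712776 -> 40712776.
CREATE TABLE homes_int (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  lat_e6 INT NOT NULL,
  lng_e6 INT NOT NULL,
  INDEX idx_lat_lng (lat_e6, lng_e6)
);

-- The rectangle query is unchanged apart from scaling the bounds the same way.
SELECT id
FROM homes_int
WHERE lat_e6 BETWEEN 40700000 AND 40800000
  AND lng_e6 BETWEEN -74050000 AND -73900000;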
Float (10,6)
Where is latitude or longitude 5555.123456?
Don't you mean Float(9,6) instead?
I have the exact same schema (float(10,6)) and query (selecting inside a rectangle) and I found that switching the db engine from innoDB to myisam doubled the speed for a "point in rectangle look-up" in a table with 780,000 records.
Additionally, I converted all lng/lat values to Cartesian integers (x, y) and created a two-column index on (x, y); my speed went from ~27 ms to 1.3 ms for the same look-up.
It really depends on how you are using the data. But as a gross over-simplification of the facts: decimal is faster but less accurate in approximations. More info here:
http://msdn.microsoft.com/en-us/library/aa223970(SQL.80).aspx
Also, the standard for GPS coordinates is specified in ISO 6709:
http://en.wikipedia.org/wiki/ISO_6709
I know you have probably moved past this problem by now. I just wanted to add another approach, in case someone is looking to store geolocation data.
You could encode the latitude and longitude information into a geohash. Geohashes are prefix-searchable to a required degree of precision, so it seems you can convert your query into a start and end prefix and do a prefix search with a LIKE query.
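A minimal sketch, assuming the application computes and stores a geohash string per row; 'dr5ru' is just an example five-character prefix (each additional character narrows the cell):

-- A prefix pattern with no leading wildcard can use an ordinary b-tree index on geohash.
SELECT id
FROM homes
WHERE geohash LIKE 'dr5ru%';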