Efficient sorted bounding box query - sql

How would I create indexes in PostgreSQL 8.3 that would make a sorted bounding box query efficient? The table I'm querying has quite a few rows.
That is, I want to create indexes that make the following query as efficient as possible:
SELECT * FROM features
WHERE lat BETWEEN ? AND ?
AND lng BETWEEN ? AND ?
ORDER BY score DESC
The features table looks like this:
   Column   |          Type          |
------------+------------------------+
 id         | integer                |
 name       | character varying(255) |
 type       | character varying(255) |
 lat        | double precision       |
 lng        | double precision       |
 score      | double precision       |
 html       | text                   |

To create a GiST index on a point attribute so that we can efficiently use box operators on the result of the conversion function:
CREATE INDEX pointloc
ON points USING gist (box(location,location));
SELECT * FROM points
WHERE box(location,location) && '(0,0),(1,1)'::box;
http://www.postgresql.org/docs/9.0/static/sql-createindex.html
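This is the example from the 9.0 docs. It should also work on 8.3, since these features have been around for ages. Applied to the features table, the indexed expression would presumably be box(point(lng, lat), point(lng, lat)), with the bounding box on the right-hand side of &&; the ORDER BY score is then a sort over the rows that match.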

You could try using a GiST index to implement an R-Tree. This type of index is poorly documented, so you might have to trawl through example code in the source distribution.
(Note: My prior advice to use R-Tree indexes appears to be out of date; they are deprecated.)

Sounds like you'd want to take a look at PostGIS, a PostgreSQL module for spatial data types and queries. It supports quick lookups using GiST indexes. Unfortunately I can't guide you further as I haven't used PostGIS myself.


Normalisation - best way to clean duplicate misspelled values in a sql column

+---------+
| Language|
+---------+
|Spanish |
|spanish |
|venezla |
|venezuala|
|irish |
|Irish |
+---------+
What is the best approach for normalising data in a SQL column? I was thinking of converting to lower case and then using multiple replace functions. Is this the only way? Any insight appreciated, thanks :)
There are many ways to do this in SQL; which one fits depends on the scenario and on how you want to use the result.
For the data above, you can apply the LOWER function and then select the DISTINCT values. That removes the case-only duplicates without adding another REPLACE call every time a new variant shows up; genuine misspellings such as venezla still need an explicit mapping to the correct value.
Or, if you want to work within a single table, you can delete the duplicate rows by combining ROW_NUMBER with LOWER.
Let me know whether this helps and I can follow up with more detail.
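As a rough illustration of the LOWER plus DISTINCT idea, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for whatever database is actually in use. The table names people and language_fix, and the canonical spelling 'venezuela', are assumptions made up for the example, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (language TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Spanish",), ("spanish",), ("venezla",), ("venezuala",), ("irish",), ("Irish",)],
)

# Step 1: case folding alone collapses Spanish/spanish and irish/Irish.
before = [r[0] for r in conn.execute(
    "SELECT DISTINCT LOWER(language) FROM people ORDER BY 1")]

# Step 2: misspellings need an explicit mapping; a small lookup table
# (hypothetical) keeps that mapping in data instead of nested REPLACE calls.
conn.execute("CREATE TABLE language_fix (bad TEXT PRIMARY KEY, good TEXT)")
conn.executemany(
    "INSERT INTO language_fix VALUES (?, ?)",
    [("venezla", "venezuela"), ("venezuala", "venezuela")],  # assumed target spelling
)
conn.execute("""
    UPDATE people
    SET language = COALESCE(
        (SELECT good FROM language_fix WHERE bad = LOWER(people.language)),
        LOWER(people.language))
""")
after = [r[0] for r in conn.execute(
    "SELECT DISTINCT language FROM people ORDER BY language")]

print(before)
print(after)
```

A new misspelling then only needs one more row in the lookup table rather than a change to the query.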

PostgreSQL Reverse LIKE

I need to test if any part of a column value is in a given string, instead of whether the string is part of a column value.
For instance, this way I can find out whether any of the rows in my table contains the string 'bricks' in column:
SELECT column FROM table
WHERE column ILIKE '%bricks%';
But what I'm looking for is a way to find out whether any part of the sentence "The ships hung in the sky in much the same way that bricks don’t" is in any of the rows.
Something like:
SELECT column FROM table
WHERE 'The ships hung in the sky in much the same way that bricks don’t' ILIKE '%' || column || '%';
So the row from the first example, where the column contains 'bricks', will show up as result.
I've looked through some suggestions here and some other forums but none of them worked.
Your simple case can be solved with a simple query using the ANY construct and ~*:
SELECT *
FROM tbl
WHERE col ~* ANY (string_to_array('The ships hung in the sky ... bricks don’t', ' '));
~* is the case insensitive regular expression match operator. I use that instead of ILIKE so we can use original words in your string without the need to pad % for ILIKE. The result is the same - except for words containing special characters: %_\ for ILIKE and !$()*+.:<=>?[\]^{|}- for regular expression patterns. You may need to escape special characters either way to avoid surprises. Here is a function for regular expressions:
Escape function for regular expression or LIKE patterns
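If the words are prepared on the client side before they reach the query, the escaping could look roughly like the sketch below. It is written in Python purely for illustration: escape_like is a hypothetical helper, backslash is assumed to be the LIKE escape character, and re.escape follows Python's regular expression rules, which are close to but not identical to PostgreSQL's.

```python
import re

def escape_like(term: str, escape: str = "\\") -> str:
    """Escape the LIKE/ILIKE wildcards % and _ (and the escape character itself)."""
    return (
        term.replace(escape, escape * 2)  # must come first, or later escapes get doubled
            .replace("%", escape + "%")
            .replace("_", escape + "_")
    )

# A word containing LIKE wildcards is matched literally after escaping.
like_pattern = "%" + escape_like("50%_off") + "%"

# For regular expressions, Python's own escaping shows the same idea:
# "a.b" should match the literal dot only, not any character.
regex_pattern = re.escape("a.b")
matches_literal = re.search(regex_pattern, "xa.by") is not None
matches_other = re.search(regex_pattern, "xaxby") is not None
```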
But I have nagging doubts that this will be all you need. See my comment. I suspect you need Full Text Search with a matching dictionary for your natural language to provide useful word stemming ...
Related:
IN vs ANY operator in PostgreSQL
PostgreSQL LIKE query performance variations
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
This query:
SELECT
regexp_split_to_table(
'The ships hung in the sky in much the same way that bricks don’t',
'\s' );
gives the following result:
| regexp_split_to_table |
|-----------------------|
| The |
| ships |
| hung |
| in |
| the |
| sky |
| in |
| much |
| the |
| same |
| way |
| that |
| bricks |
| don’t |
Now just do a semijoin against the result of this query to get the desired rows:
SELECT * FROM table t
WHERE EXISTS (
SELECT * FROM (
SELECT
regexp_split_to_table(
'The ships hung in the sky in much the same way that bricks don’t',
'\s' ) x
) x
WHERE t.column LIKE '%'|| x.x || '%'
)
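The semijoin can be tried out end to end with a small sketch. This one uses Python's sqlite3 module rather than PostgreSQL, so the sentence is split in Python instead of with regexp_split_to_table, and SQLite's LIKE (case-insensitive for ASCII) plays the role of ILIKE. The table tbl, its rows, and the words table are invented for the example.

```python
import sqlite3

sentence = "The ships hung in the sky in much the same way that bricks don’t"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col TEXT)")  # hypothetical table and data
conn.executemany("INSERT INTO tbl VALUES (?)",
                 [("red bricks",), ("glass",), ("Sky blue",)])

# One row per word of the sentence, standing in for regexp_split_to_table.
conn.execute("CREATE TEMP TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in sentence.split()])

# Semijoin: keep a row if at least one word occurs somewhere in its column value.
rows = [r[0] for r in conn.execute("""
    SELECT col FROM tbl t
    WHERE EXISTS (
        SELECT 1 FROM words x
        WHERE t.col LIKE '%' || x.w || '%'
    )
    ORDER BY col
""")]
print(rows)
```

Short words such as "in" match inside many unrelated values (for example "window"), so a real query would probably filter out very short words or stop words first.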

Incorrect results returned by postgres

I ran the following commands in PostgreSQL 9.6:
./bin/createdb testSpatial
./bin/psql -d testSpatial -c "CREATE EXTENSION postgis;"
create table test(name character varying(250), lat_long character varying(90250), the_geom geometry);
\copy test(name,lat_long) FROM 'test.csv' DELIMITERS E'\t' CSV HEADER;
CREATE INDEX spatial_gist_index ON test USING gist (the_geom );
UPDATE test SET the_geom = ST_GeomFromText(lat_long,4326);
On running: select * from test; I get the following output:
 name | lat_long | the_geom
------+----------+----------
 A    | POLYGON((-0.061225 -128.427791,-0.059107 -128.428264,-0.056311 -128.428911,-0.054208 -128.426510,-0.055431 -128.426324,-0.057363 -128.426124,-0.059315 -128.425843,-0.061225 -128.427791)) | 0103000020E61000000100000008000000D42B6519E258AFBFBE50C076B00D60C07DE9EDCF4543AEBFBC41B456B40D60C08063CF9ECBD4ACBFA1BC8FA3B90D60C07BF65CA626C1ABBF58AD4CF8A50D60C0BF805EB87361ACBFFFAF3A72A40D60C0B83A00E2AE5EADBF4D81CCCEA20D60C01F115322895EAEBF60C77F81A00D60C0D42B6519E258AFBFBE50C076B00D60C0
 B    | POINT(1.978165 -128.639779) | 0101000020E61000002D78D15790A6FF3F5D35CF11791460C0
(2 rows)
After this I ran a query to find all "name" values that are within 5 meters of each other. To do so, I wrote the following command:
testSpatial=# select s1.name, s2.name from test s1, test s2 where ST_DWithin(s1.the_geom, s2.the_geom, 5);
name | name
------+------
A | A
A | B
B | A
B | B
(4 rows)
To my surprise, the output is incorrect: "A" and "B" are 227.301 km away from each other (as calculated using the haversine distance here: http://andrew.hedges.name/experiments/haversine/). Can someone please help me understand where I am going wrong?
You have defined your geometry as follows
the_geom geometry
i.e., it's not geography. But the ST_DWithin docs say
For Geometries: The distance is specified in units defined by the
spatial reference system of the geometries. For this function to make
sense, the source geometries must both be of the same coordinate
projection, having the same SRID.
For geography units are in meters and measurement is defaulted to
use_spheroid=true, for faster check, use_spheroid=false to measure
along sphere.
So you are actually searching for places that are within 5 degrees of each other. A degree is roughly equal to 111 km, so you are looking for places that are within about 555 km of each other rather than 5 meters.
Additionally, it doesn't make much sense to store strings like POINT(1.978165 -128.639779) in your table. It's completely redundant: that text can be generated quite easily from the geometry (or geography) column, for example with ST_AsText.
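The difference between the two units can be checked with a few lines of arithmetic. The sketch below is plain Python, not PostGIS: it takes the first polygon vertex of A and the point B, reads each pair as (lat, long) the way the question's column name suggests (an assumption), and compares the planar distance in degrees with a haversine distance in kilometres. The Earth radius of 6371 km is the usual mean-radius approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# First vertex of polygon A and point B, read as (lat, long).
a_lat, a_lon = -0.061225, -128.427791
b_lat, b_lon = 1.978165, -128.639779

# What a geometry-typed distance check effectively compares: raw coordinate units.
planar_degrees = math.hypot(b_lat - a_lat, b_lon - a_lon)
within_5_units = planar_degrees <= 5

# What the asker had in mind: a distance on the Earth's surface.
great_circle_km = haversine_km(a_lat, a_lon, b_lat, b_lon)

print(round(planar_degrees, 3), within_5_units, round(great_circle_km, 1))
```

The vertex-to-point distance is an upper bound on the polygon-to-point distance, so if it is already under 5 coordinate units, the pair (A, B) passes a "within 5" test on geometry even though the two features are hundreds of kilometres apart.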

How to sort string data that represents numbers

My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database. The client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on.)
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new |
+------+------------+
| 3 | 0000000003 |
| 44 | 0000000044 |
| 105 | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This converts the string representation to a number and uses that for ordering (in MySQL the cast target is spelled SIGNED or UNSIGNED rather than int). There is no need for an additional column. If you do add one, I would recommend a numeric column that contains the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some form.
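To see why left padding also handles rows added later, here is a small Python sketch of the same idea. pad_key is a hypothetical helper that mirrors right(concat('0000000000', col), 10) for values of up to ten digits; unlike right(), zfill does not truncate longer input.

```python
def pad_key(value: str, width: int = 10) -> str:
    """Left-pad a numeric string with zeros so string order equals numeric order."""
    return value.strip().zfill(width)

existing = ["105", "3", "44"]
keys = {v: pad_key(v) for v in existing}

# Sorting by the padded key gives numeric order.
ordered = sorted(existing, key=pad_key)

# A row added later gets its key computed on its own; no other key changes.
keys["20"] = pad_key("20")
ordered_after_insert = sorted(keys, key=keys.get)

print(ordered, ordered_after_insert)
```

Each key depends only on its own value, which is what makes the scheme safe for incremental inserts.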
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack, but it works as long as the "numbers" in your string column are natural, non-negative numbers with no leading zeros or padding.
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.
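The length-then-value ordering is easy to check outside the database. This Python sketch applies the same two-level sort key as ORDER BY LENGTH(column1), column1 and also shows one input where it breaks; the sample values are made up.

```python
def length_then_text(value: str):
    """Sort key equivalent to ORDER BY LENGTH(col) ASC, col ASC."""
    return (len(value), value)

values = ["105", "3", "44", "1000", "9"]
ordered = sorted(values, key=length_then_text)

# Caveat: a leading zero makes a small number look "long", so it sorts too late.
with_leading_zero = sorted(["44", "007", "105"], key=length_then_text)

print(ordered)
print(with_leading_zero)
```

In the second list "007" lands after "44" although 7 is smaller than 44, which is the kind of case where padding or a numeric cast is the safer choice.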

Homoiconicity and SQL

I'm currently using emacs sql-mode as my sql shell, a (simplified) query response is below:
my_db=# select * from visit limit 4;
num | visit_key | created | expiry
----+-----------------------------+----------------------------+------------
1 | 0f6fb8603f4dfe026d88998d81a | 2008-03-02 15:17:56.899817 | 2008-03-02
2 | 7c389163ff611155f97af692426 | 2008-02-14 12:46:11.02434 | 2008-02-14
3 | 3ecba0cfb4e4e0fdd6a8be87b35 | 2008-02-14 16:33:34.797517 | 2008-02-14
4 | 89285112ef2d753bd6f5e51056f | 2008-02-21 14:37:47.368657 | 2008-02-21
(4 rows)
If I want to then formulate another query based on that data, e.g.
my_db=# select visit_key, created from visit where expiry = '2008-03-02'
and num > 10;
You'll see that I have to add the comma between visit_key and created, and surround the expiry value with quotes.
Is there a SQL DB shell that shows its contents more homoiconically, so that I could minimise this sort of editing? e.g.
num, visit_key, created, expiry
(1, '0f6fb8603f4dfe026d88998d81a', '2008-03-02 15:17:56.899817', '2008-03-02')
or
(num=1, visit_key='0f6fb8603f4dfe026d88998d81a',
created='2008-03-02 15:17:56.899817', expiry='2008-03-02')
I'm using postgresql btw.
Here's one idea, which is similar to what I do sometimes, though I'm not sure that it's exactly what you're asking for:
Run a Lisp compiler (like SBCL) in SLIME. Then load CLSQL. It has a "Functional Data Manipulation Language" (SELECT documentation) which might help you do something like you want, perhaps in conjunction with SLIME's autocompletion capabilities. If not, it's easy to define Lisp functions and macros (assuming you know Lisp, but you're already an Emacser!).
Out-of-the-box, it doesn't give the nicely formatted tables that most SQL interfaces have, but even that isn't too hard to add. And Lisp is certainly powerful enough to let one easily come up with ways to make your common operations easier.
I've found the following changes in psql go some way to giving me homoiconicity:
=# select remote_ip, referer, http_method, time from hit limit 1;
remote_ip | referer | http_method | time
-----------------+---------+-------------+---------------------------
213.233.132.148 | | GET | 2013-08-27 08:01:42.38808
(1 row)
=# \a
Output format is unaligned.
=# \f ''', '''
Field separator is "', '".
=# \t
Showing only tuples.
=# select remote_ip, referer, http_method, time from hit limit 1;
213.233.132.148', '', 'GET', '2013-08-27 08:01:42.38808
Caveats: everything is a string, and the output is missing the opening and closing quotes.
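Both caveats go away if the rows are formatted by a small script instead of by psql's separator settings. The following Python sketch is one way that could look; sql_literal and format_row are hypothetical helpers, not part of psql or any driver, and the quoting covers only the simple cases shown (NULL, booleans, numbers, and strings with single quotes doubled).

```python
def sql_literal(value) -> str:
    """Render a Python value the way it could be pasted back into a query."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def format_row(columns, row) -> str:
    """Format one result row as (col=value, ...) with SQL-style literals."""
    pairs = (f"{c}={sql_literal(v)}" for c, v in zip(columns, row))
    return "(" + ", ".join(pairs) + ")"

columns = ["num", "visit_key", "expiry"]
row = [1, "0f6fb8603f4dfe026d88998d81a", "2008-03-02"]
line = format_row(columns, row)
print(line)
```

Numbers stay unquoted and strings come out fully quoted, so a value can be pasted straight into a WHERE clause.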