How can I efficiently make a range query on two columns in Cassandra? - indexing

I would like to save millions of locations into a Cassandra column family and then run a range query on this data.
For example:
Attributes: LocationName, latitude, longitude
Query: SELECT LocationName FROM ColumnFamily WHERE latitude > 10 AND latitude < 20 AND longitude > 30 AND longitude < 40;
What structure and indexes should I use so that the query will be efficient?

Depending on the granularity you need in your queries (and the variability of that granularity), one way to handle this would be to slice up your map into a grid, where all your locations belong inside a grid square with a defined lat/lon bounding box. You can then do your initial query for grid square IDs, followed by locations inside those squares, with a representation something like this:
GridSquareLat {
    key: [very_coarse_lat_value] {
        [square_lat_boundary]: [GridSquareIDList]
        [square_lat_boundary]: [GridSquareIDList]
    }
    ...
}
GridSquareLon {
    key: [very_coarse_lon_value] {
        [square_lon_boundary]: [GridSquareIDList]
        [square_lon_boundary]: [GridSquareIDList]
    }
    ...
}
Location {
    key: [locationID] {
        GridSquareID: [GridSquareID] <-- put a secondary index on this col
        Lat: [exact_lat]
        Lon: [exact_lon]
        ...
    }
    ...
}
You can then give Cassandra the GridSquareLat/Lon keys representing the very coarse-grained lat/lon values, along with a column slice range that reduces the columns returned to only those squares within your boundaries. You'll get two lists: one of grid square IDs for lat and one for lon. The intersection of these lists is the set of grid squares in your range.
To get the locations in these squares, query the Location CF, filtering on GridSquareID (using a secondary index, which will be efficient as long as your total grid square count is reasonable). You now have a reasonably sized list of locations with only a few very efficient queries, and you can easily reduce them to your exact list inside your application.
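The original answer is written in Thrift-era column-family terms; purely as an illustration, a rough CQL 3 rendering of the same layout might look like this (table and column names are my own, not from the answer):

-- Grid-square index tables, one per dimension (sketch; names are illustrative)
CREATE TABLE grid_square_lat (
    coarse_lat   int,      -- very coarse latitude bucket (row key)
    lat_boundary double,   -- latitude boundary of a grid square
    square_ids   text,     -- the GridSquareIDList for that boundary
    PRIMARY KEY (coarse_lat, lat_boundary)
);
CREATE TABLE grid_square_lon (
    coarse_lon   int,
    lon_boundary double,
    square_ids   text,
    PRIMARY KEY (coarse_lon, lon_boundary)
);
CREATE TABLE location (
    location_id    text PRIMARY KEY,
    grid_square_id text,
    lat            double,
    lon            double
);
CREATE INDEX ON location (grid_square_id);

-- Column-slice query for the latitude dimension; run the analogous query
-- against grid_square_lon, then intersect the two ID lists client-side:
SELECT lat_boundary, square_ids
FROM grid_square_lat
WHERE coarse_lat = 10 AND lat_boundary > 10 AND lat_boundary < 20;

-- Fetch the locations inside one of the surviving grid squares via the
-- secondary index (the ID value here is made up):
SELECT location_id, lat, lon FROM location WHERE grid_square_id = 'sq_12_34';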

Let's pretend you are going to grow into the billions (I will cover the millions case below). If you were using something like PlayOrm on Cassandra (or you can do this yourself instead of using PlayOrm), you would need to partition by something. Let's say you choose to partition by longitude, so that anything >= 20 and < 30 is in partition 20 and anything >= 30 and < 40 is in partition 30. Then in PlayOrm you use its scalable SQL to write the same query you wrote, but you need to query the proper partitions, which in some cases means multiple partitions unless you limit your result set size.
In PlayOrm, or in your own data model, it would look like this (no other tables needed):
Location {
    key: [locationID] {
        LonBottom: [partitionKey]
        Lat: [exact_lat] <- #NoSqlIndexed
        Lon: [exact_lon] <- #NoSqlIndexed
        ...
    }
    ...
}
That said, if you are in the millions, you would not need partitions, so just remove the LonBottom column above and do no partitioning. Of course, if you are only in the millions, why use NoSQL at all? Millions of rows is not that big, and an RDBMS can easily handle it.
If you want to do it yourself, in the millions case there are two rows, one for Lat and one for Lon (the wide-row pattern), that hold the indexed lat and lon values to query. In the billions case it would be two such rows per partition, since each partition gets its own index and you don't want indices that grow too large.
An indexing row is simple to create. It is simply rowkey = "index name", and each column name is a compound name of the longitude and the row key of the location. There is NO value for each column, just the compound name (so that each column name is unique).
So your row might look like:
longindex = 32.rowkey1, 32.rowkey45, 32.rowkey56, 33.rowkey87, 33.rowkey89
where 32 and 33 are longitudes and the rowkeys are pointing to the locations.
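If it helps to see that wide-row index as a table, here is a minimal CQL 3 sketch of the same pattern, assuming the compound column name maps to two clustering columns (the names are illustrative):

-- The row key is the index name; each (lon, location_key) pair is one
-- value-less column of the wide row.
CREATE TABLE lon_index (
    index_name   text,    -- e.g. 'longindex', or one row per partition in the billions case
    lon          double,  -- first half of the compound column name
    location_key text,    -- second half: the row key of the Location row
    PRIMARY KEY (index_name, lon, location_key)
);

-- Range scan over the indexed longitudes (a column slice on the wide row):
SELECT lon, location_key
FROM lon_index
WHERE index_name = 'longindex' AND lon > 30 AND lon < 40;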

Related

Compare values with BETWEEN from same columns as two different columns

I'm storing myriad attributes from user-uploaded image files in a single table. The basic structure of the table is like this:
attrib_id | image_id | attrib_name | attrib_value
Alongside attributes like TAG and CAPTION, I'm storing LATITUDE and LONGITUDE of the image's location in the same manner. All the columns are of type varchar.
I'm trying to query for images associated with locations within a given bounding box - the inputs are upper and lower latitude, start and end longitude. The output of the query should be a list of image_ids that have a row with name=LATITUDE and value BETWEEN upper and lower latitude, as well as the same for longitude.
Since all the values are strings, and in the same columns, I don't really know where to start with this one.
While I'm willing to consider restructuring the table, my intuition tells me there's a way to accomplish this in SQL.
My database is currently MySql, but I'm likely to switch over to Postgres in the future, so I would prefer non-vendor specific solutions.
You can do something like this:
SELECT * FROM Table lat
INNER JOIN Table lon ON lat.image_id = lon.image_id
WHERE (lon.attrib_name = 'LONGITUDE' AND lon.attrib_value BETWEEN start AND end)
  AND (lat.attrib_name = 'LATITUDE' AND lat.attrib_value BETWEEN lower AND upper)
Note that BETWEEN expects the smaller bound first. It is standard SQL, so it is supported by both MySQL and Postgres; if you ever migrate to something that lacks it, you can always fall back to the >= and <= operators.
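Since attrib_value is a varchar, BETWEEN as written compares strings lexicographically rather than numerically. A hedged variant that casts the values to numbers first (the table name, bind parameters and precision are assumptions on my part) would be:

SELECT lat.image_id
FROM image_attribs lat
INNER JOIN image_attribs lon ON lat.image_id = lon.image_id
WHERE lat.attrib_name = 'LATITUDE'
  AND CAST(lat.attrib_value AS DECIMAL(10,6)) BETWEEN :lat_lower AND :lat_upper
  AND lon.attrib_name = 'LONGITUDE'
  AND CAST(lon.attrib_value AS DECIMAL(10,6)) BETWEEN :lon_start AND :lon_end;

CAST(... AS DECIMAL) is accepted by both MySQL and Postgres, so this stays vendor-neutral.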

Find nearest lines to large number of points in an oracle spatial database

The problem I have is simple:
I have a set of datasets. Each dataset has within it a set of points. Each set of points is an identical, 6 km-spaced grid (this grid never changes). Each point has an associated value. Each dataset is unrelated, so the problem can be seen as a single set of points.
If the value of a point exceeds a predefined threshold value then the point has to be queried against an oracle spatial database to find all line segments within a certain distance of the point.
Which is a simple enough problem to solve.
The line segments have a non-unique ID, which allow them to be grouped together into features of size 1 to 700 segments (it's all predefined topology).
Ultimately I need to know which feature IDs match against which points as well as the number of line segments for each feature match against each point.
In terms of dataset sizes:
There are around 200 datasets.
There are 56,000 points per dataset.
There are a little over 180,000 line segments in the spatially indexed database.
The line segments can be grouped into a total of 1900 features.
Usually no more than on the order of 10^3 points exceed the threshold per dataset.
I have created a solution and it works adequately; however, I'm unhappy with the overall run time - it takes around 3 minutes per dataset.
Normally I wouldn't mind a precomputation task taking that long, but due to constraints this task cannot take more than an hour to run, and ideally would take only half an hour.
Currently I use SDO_WITHIN_DISTANCE to do the query, and I run this query for each and every point that exceeds the threshold:
SELECT id, COUNT(shape) AS segments, SUM(length) AS length
FROM (
    SELECT shape, id, length FROM lines_1
    UNION ALL
    SELECT shape, id, length FROM lines_2
)
WHERE SDO_WITHIN_DISTANCE(
          shape,
          SDO_GEOMETRY(
              3001,
              8307,
              SDO_POINT_TYPE(:lng, :lat, 0),
              NULL,
              NULL
          ),
          'distance=4 unit=km'
      ) = 'TRUE'
GROUP BY id
This query takes around 0.4s to execute, which isn't all that bad, but it adds up for a single dataset, and is compounded over all of the datasets.
I am not overly experienced with Oracle spatial databases, so I'm not sure how to improve the speed.
Note that I cannot change the format of the incoming set of points, nor can I change the format of the database.
The only way to speed it up that I can think of is by pre computing the query for each point and storing that in a separate table, but I'd rather not do that as it more or less creates another copy of the data.
So the question is - is there a better way to do query?
I ended up precomputing my query into the following table.
+---------+---------+
| Column  | Type    |
+---------+---------+
| LINE_ID | VARCHAR |
| LAT     | FLOAT   |
| LNG     | FLOAT   |
+---------+---------+
There were just too many multiline segments for it to be efficient.
By precomputing it I can just lookup in the table for the relevant IDs (which ultimately was all I cared about).
The query takes less than 1/10th of the time, so it works out a lot faster.
Ultimately the tradeoff of having to recompute the point to ID mapping every week (takes about 2 hours) was worth the speed up.
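For context, the lookup against the precomputed table is just a plain filter on the grid point's coordinates; a minimal sketch, assuming a table name and bind variables, is:

-- The 6 km grid never changes, so the precomputed rows can be matched on the
-- grid point's stored coordinates (add a small tolerance if float rounding is
-- a concern):
SELECT line_id
FROM point_line_lookup
WHERE lat = :lat
  AND lng = :lng;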

transform rows into columns in a sql table

Suppose I would like to store a table with 440 rows and 138,672 columns. Since the SQL limit is 1024 columns, I would like to transform rows into columns, i.e. convert the 440 rows and 138,672 columns into 138,672 rows and 440 columns.
Is this possible?
The SQL Server limit is actually 30,000 columns; see Sparse Columns.
But creating a query that returns 30k columns (never mind 138k+) would be basically unmanageable; the sheer size of the metadata on each query result would slow the client to a crawl. One simply does not design databases like that. Go back to the drawing board: when you reach 10 columns, stop and think; when you reach 100 columns, erase the board and start anew.
And read this: Best Practices for Semantic Data Modeling for Performance and Scalability.
The description of the data is as follows....
Each attribute describes the measurement of the occupancy rate (between 0 and 1) of a captor location as recorded by a measuring station, at a given timestamp during the day. The ID of each station is given in the stations_list text file. For more information on the location (GPS, Highway, Direction) of each station please refer to the PEMS website. There are 963 (stations) x 144 (timestamps) = 138,672 attributes for each record.
This is perfect for normalisation.
You can have a stations table and a measurements table. Two nice long thin tables.
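A rough sketch of that normalisation (column names and types are illustrative, not from the original post):

CREATE TABLE stations (
    station_id INT PRIMARY KEY,
    name       VARCHAR(50) NOT NULL
);

CREATE TABLE measurements (
    record_id      INT NOT NULL,            -- one of the 440 original records
    station_id     INT NOT NULL REFERENCES stations(station_id),
    time_slot      INT NOT NULL,            -- 1..144 timestamps within the day
    occupancy_rate DECIMAL(5,4) NOT NULL,   -- value between 0 and 1
    PRIMARY KEY (record_id, station_id, time_slot)
);

That turns the 440 x 138,672 grid into roughly 61 million short rows, which is routine for any mainstream engine.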

INT vs VARCHAR in search

Which one of the following queries will be faster and more optimal (and why):
SELECT * FROM items WHERE w = 320 AND h = 200 (w and h are INT)
SELECT * FROM items WHERE dimensions = '320x200' (dimensions is VARCHAR)
Here are some actual measurements. (Using SQLite; may try it with MySQL later.)
Data = All 1,000,000 combinations of w, h ∈ {1...1000}, in randomized order.
CREATE TABLE items (id INTEGER PRIMARY KEY, w INTEGER, h INTEGER)
Average time (of 20 runs) to execute SELECT * FROM items WHERE w = 320 and h = 200 was 5.39±0.29 µs.
CREATE TABLE items (id INTEGER PRIMARY KEY, dimensions TEXT)
Average time to execute SELECT * FROM items WHERE dimensions = '320x200' was 5.69±0.23 µs.
There is no significant difference, efficiency-wise.
But
There is a huge difference in terms of usability. For example, if you want to calculate the area and perimeter of the rectangles, the two-column approach is easy:
SELECT w * h, 2 * (w + h) FROM items
Try to write the corresponding query for the other way.
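For illustration, here is roughly what that query is forced to look like with the single VARCHAR column (SQLite syntax, matching the measurements above; SUBSTR and INSTR also exist in MySQL - this is a sketch of the pain, not a recommendation):

SELECT CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER)
     * CAST(substr(dimensions, instr(dimensions, 'x') + 1)    AS INTEGER) AS area,
       2 * ( CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER)
           + CAST(substr(dimensions, instr(dimensions, 'x') + 1)    AS INTEGER) ) AS perimeter
FROM items;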
Intuitively, if you do not create indexes on those columns, integer comparison seems faster.
An integer comparison directly compares two 32-bit values in a single operation; strings, on the other hand, are character arrays and have to be compared character by character.
However, another point is that in the second query you have one field to compare, while in the first query you have two. If you have 1,000,000 records and no indexes on the columns, that may mean 1,000,000 string comparisons in the worst case (when the last row is the one you're looking for, or there is no match at all).
On the other hand, if you have 1,000,000 records and all of them have w = 320, you will also be comparing them on h, which means 2,000,000 comparisons. If you create indexes on those fields, though, IMHO the two will be almost identical: a hashed VARCHAR lookup takes O(1) constant time, while the INT comparison in a tree index takes O(log n) time.
Conclusion: it depends. Prefer indexes on searchable columns, and use ints.
Probably the only way to know that is to run it. I would suspect that if all columns used are indexed, there would be basically no difference. If INT is 4 bytes, it will be almost the same size as the string.
The one wrinkle is in how VARCHAR is stored. If you used a fixed string size (a CHAR column), it might be faster than VARCHAR, but mostly because your SELECT * needs to go fetch it.
The huge advantage of using INT is that you can do much more sophisticated filtering. That alone should be a reason to prefer it. What if you need a range, or just width, or you want to do math on width in the filtering? What about constraints based on the columns, or aggregates?
Also, when you get the values into your programming language, you won't need to parse them before using them (which takes time).
EDIT: Some other answers are mentioning string compares. If indexed, there won't be many string compares done. And it's possible to implement very fast compare algorithms that don't need to loop byte-by-byte. You'd have to know the details of what mysql does to know for sure.
The second query, as the chance of matching the exact string is smaller (which means a smaller set of matching records, i.e. greater cardinality).
The first query: the chance of matching the first column is higher, so more rows are potentially matched (lower cardinality).
Of course, this assumes indexes are defined for both scenarios.
The first one, because it is faster to compare numeric data.
It depends on the data and the available indexes. But it is quite possible for the VARCHAR version to be faster, because searching a single index can be faster than searching two. If the combination of values provides a unique (or "mostly" unique) result while each individual H/W value has multiple entries, then it could narrow things down to a much smaller set using the single index.
On the other hand, if you have a multi-column index on the two integer columns, that would likely be the most efficient.
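For reference, the two indexing options being compared would look something like this (index names are illustrative):

-- Composite index for the two-column INT version:
CREATE INDEX idx_items_w_h ON items (w, h);

-- Single-column index for the VARCHAR version:
CREATE INDEX idx_items_dimensions ON items (dimensions);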

Optimizing Sqlite query for INDEX

I have a table of 320,000 rows which contains lat/lon coordinate points. When a user selects a location, my program gets the coordinates of the selected location and executes a query which brings back all the points from the table that are near it. This is done by calculating the distance between the selected point and each coordinate point in my table. This is the query I use:
select street from locations
where ( ( (lat - (-34.594804)) *(lat - (-34.594804)) ) + ((lon - (-58.377676 ))*(lon - (-58.377676 ))) <= ((0.00124)*(0.00124)))
group by street;
As you can see the WHERE clause is a simple Pythagoras formula to calculate the distance between two points.
Now my problem is that I can not get an INDEX to be usable. I've tried with
CREATE INDEX indx ON locations(lat, lon)
also with
CREATE INDEX indx ON locations(street, lat, lon)
with no luck. I've noticed that when there is a math operation involving lat or lon, the index is not used. Is there any way I can optimize this query so that it uses an INDEX and gains speed?
Thanks in advance!
The problem is that the SQL engine needs to evaluate every record to do the comparison (WHERE ... <= ...) and filter the points, so the indexes don't speed up the query.
One approach to solving the problem is to compute minimum and maximum latitude and longitude bounds to restrict the number of records, as in the sketch below.
Here is a good link to follow: Finding Points Within a Distance of a Latitude/Longitude
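A rough sketch of that bounding-box pre-filter, reusing the 0.00124 radius from the question (the :lat and :lon parameters stand for the selected point):

SELECT street
FROM locations
WHERE lat BETWEEN :lat - 0.00124 AND :lat + 0.00124
  AND lon BETWEEN :lon - 0.00124 AND :lon + 0.00124
  -- the exact circle test now only runs on the rows the index range scan lets through:
  AND ((lat - :lat) * (lat - :lat) + (lon - :lon) * (lon - :lon)) <= 0.00124 * 0.00124
GROUP BY street;

With CREATE INDEX indx ON locations(lat, lon), SQLite can use the index for the range on lat and only apply the arithmetic to the surviving rows.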
Did you try adjusting the page size? A table like this might gain from having a different (i.e. the largest?) available page size.
PRAGMA page_size = 32768;
Or any power of 2 between 512 and 32768. If you change the page_size, don't forget to VACUUM the database (assuming you are using SQLite 3.5.8 or later; otherwise you can't change it and will need to start a fresh new database).
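In practice (on SQLite 3.5.8 or later) the sequence would be:

PRAGMA page_size = 32768;
VACUUM;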
Also, running the operation on floats might not be as fast as running it on integers (a big maybe), so you might gain speed if you store all your coordinates multiplied by 1,000,000.
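A hedged sketch of that scaled-integer layout (table and column names are my own):

CREATE TABLE locations_int (
    street TEXT,
    lat_e6 INTEGER,   -- latitude  * 1,000,000
    lon_e6 INTEGER    -- longitude * 1,000,000
);
CREATE INDEX idx_locations_int ON locations_int(lat_e6, lon_e6);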
Finally, Euclidean distance will not yield very accurate proximity results. The further you get from the equator, the more the circle around your point flattens into an ellipse. There are fast approximations which are not as calculation-intensive as a great-circle distance calculation (avoid that at all costs!).
You should search in a square instead of a circle. Then you will be able to optimize.
Surely you have a primary key in locations? Probably called id?
Why not just select the id along with the street?
select id, street from locations
where ( ( (lat - (-34.594804)) *(lat - (-34.594804)) ) + ((lon - (-58.377676 ))*(lon - (-58.377676 ))) <= ((0.00124)*(0.00124)))
group by street;