I am building an app where the server needs to select rows based on some criteria/filters. One of them is the location of the user and the radius within which the user wants to see posts; others are filters such as a date range and a filter on the value of another column. This is going to be for an ad-hoc event discovery app.
I have read about PostGIS and I know there is a point datatype. Based on this answer I understood that it is better to order index columns from equality to range, even though I feel like the geo point column should come first. But the main question is: how is it possible to create such an index? I thought about a GiST index, but I'm not sure that helps.
Suppose the following simplified events table (disregard the validity of the position data):
id | event_title                 | event_position  | event_type | is_public | start_date
   |  (varchar)                  | (point lat/lon) | (smallint) | (boolean) | (timestamptz)
---+-----------------------------+-----------------+------------+-----------+--------------
 1 | "John's Party"              | (122,35)        | 0          | 0         | 2020-07-05
 2 | "Revolution then Starbucks" | (123,30)        | 1          | 1         | 2020-07-06
 3 | "Study for math exam"       | (120,36)        | 2          | 1         | 2020-07-07
 4 | "Party after exam"          | (120,36)        | 1          | 1         | 2020-07-08
 5 | "Hiking next to the city"   | (95,40)         | 3          | 1         | 2020-07-09
 6 | "Football match"            | (-42,31)        | 4          | 1         | 2020-07-10
So in this table a user would be able to query public events within 100 km of (122,34) (suppose the first three rows fall into this area) that are of event types 0, 1 or 2 and fall between the dates 2020-07-05 and 2020-07-07. The user would get the rows with ID 2 and 3.
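Here is a sketch of that query, assuming event_position is stored as a PostGIS geography(Point) column and the table is called events (coordinates written as lon/lat, radius in metres):

SELECT *
FROM events
WHERE is_public
  AND event_type IN (0, 1, 2)
  AND start_date BETWEEN '2020-07-05' AND '2020-07-07'
  AND ST_DWithin(event_position,
                 ST_MakePoint(122, 34)::geography,  -- user's position; lon/lat order assumed
                 100000);                           -- 100 km radius in metres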
This is the query I want to optimize with an appropriate index. Thank you!
With the help of the btree_gist extension, you can include the event_type and start_date columns into the GiST index along with the event_position. However, the event_type won't be very useful in the index as long as the restriction clause is something like event_type in (0, 1, 2). (But if the list only contains one element, it will be rewritten into an equality, and in that case it can use that column in the index efficiently.) So using the other two columns would be my starting point. I would put the usually more selective one first. If you are going to leave past events in the table rather than clearing them out, then the date is probably going to eventually be the more selective one.
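A sketch of what that could look like, assuming the events table above with event_position as a PostGIS geography column (the date-first order just reflects the selectivity guess above):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX events_date_position_gist
    ON events USING gist (start_date, event_position);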
If the location condition is not very selective at all (e.g. most of your events are in NYC, and most of your users are in NYC, and almost everything is within 100km of almost everything else), then you might want a different approach. Just make a default BTREE index on (event_type, start_date). Unlike GiST, such a BTREE can effectively use a condition like event_type in (0, 1, 2) along with AND start_date between x and y.
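A minimal sketch of that fallback, with the same assumed table name:

CREATE INDEX events_type_date_btree
    ON events (event_type, start_date);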
I don't think that there is a fundamental reason GiST can't use the in-list effectively while BTREE can. Maybe that will be fixed for GiST in some future version.
I want to use redis to store data that is sourced from a sql db. In the db, each row has an ID, date, and value, where the ID and date make up a composite key (there can only be one value for a particular ID and date). An example is below:
ID Date Value
1 01/01/2001 1.2
1 02/01/2001 1.5
1 04/23/2002 1.5
2 05/05/2009 0.4
Users should be able to query this data in redis given a particular ID and date range. For example, they might want all values for 2019 with ID 45. If the user does not specify a start or end time, we use the system's Date.Min or Date.Max respectively. We also want to support refreshing redis data from the database using the same parameters (ID and date range).
Initially, I used a zset:
zset key zset member score
1 01/01/2001_1.2 20010101
1 02/01/2001_1.5 20010201
1 04/23/2002_1.5 20020423
2 05/05/2009_0.4 20090505
But what happens if the value field changes in the db? For instance, the value for ID 1 and date 01/01/2001 might change to 1.3 later on. I would want the original member to be updated, but instead a new member will be inserted. To avoid that, I would need to first check whether a member for that particular score already exists, and delete it if it does before inserting the new member. I imagine this could get expensive when refreshing, for example, 10 years' worth of data.
I thought of two possible fixes to this:
1.) Use a zset and string key-value:
zset key zset value score
1 1_01/01/2001 20010101
1 1_02/01/2001 20010201
1 1_04/23/2002 20020423
2 2_05/05/2009 20090505
string key string value
1_01/01/2001 1.2
1_02/01/2001 1.5
1_04/23/2002 1.5
2_05/05/2009 0.4
This allows me to easily update the value, and query for a date range, but adds some complexity as now I need to use two redis data structures instead of 1.
2.) Use a hash table:
hash key sub-key value
1 01/01/2001 1.2
1 02/01/2001 1.5
1 04/23/2002 1.5
2 05/05/2009 0.4
This is nice because I only have to use 1 data structure and although it would be O(N) to get all values for a particular hash key, solution 1 would have the same drawback when getting values for all string keys returned from the zset.
However, with this solution, I now need to generate all sub-keys between a given start and end date in my calling code, and not every date may have a value. There are also some edge cases that I now need to handle (what if the user wants all values up until today? Do I use HGETALL and just remove the ones in the future I don't care about? At what date range size should I use HGETALL rather than HMGET?)
In my view, there are pros and cons to each solution, and I'm not sure which one will be easier to maintain in the long term. Does anyone have thoughts as to which structure they would choose in this situation?
I need to select x unique rows randomly from a table with n rows, while excluding 1 specific row (x is small, 3 for example). This can be done in several queries if needed, and I can also compute anything in a programming language (Java). The one important requirement is that it must be done faster than O(n) and with O(x) memory, and indefinite looping (retrying) is also undesirable.
Probability of selection should be equal for all rows (except the one which is excluded, of course)
Example:
| id | some data |
|----|-----------|
| 1 | … |
| 2 | … |
| 3 | … |
| 4 | … |
| 5 | … |
The algorithm is run with arguments (x = 3, exclude_id = 4), so:
it should select 3 different random rows from rows with id in 1,2,3,5.
(end of example)
I tried the following approach (a SQL sketch follows the steps):
get row count (= n);
get the position of the excluded row with something like select count(*) from table where id < excluded_id, assuming id is monotonically increasing;
pick the x numbers from 0..n, obeying all the rules, by using some "clever" algorithm; it's something like O(x), in other words fast enough;
select these x rows one by one by using limit(index, 1) SQL clause.
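A rough sketch of those steps (MySQL-style LIMIT syntax; the items table name and the literal values are just placeholders):

-- step 1: total row count (= n)
SELECT COUNT(*) FROM items;

-- step 2: position of the excluded row, assuming id is monotonically increasing
SELECT COUNT(*) FROM items WHERE id < 4;          -- exclude_id = 4

-- step 3 happens in application code: pick x distinct random positions
-- in 0..n, skipping the excluded position

-- step 4: fetch each chosen position, one query per position
SELECT * FROM items ORDER BY id LIMIT 17, 1;      -- 17 = one chosen position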
However, it turned out that it's possible for rows to change positions (I'm not sure why), so the auto-generated ids are not monotonically increasing. And in this case the second step (get the position of the excluded row) produces wrong result and the algorithm fails to do its job correctly.
How can this be improved?
Also, if this is vastly easier with a different SQL-like database system, it would be interesting, too. (the DB is on a server and I can install any software there as long as it's compatible with Ubuntu 14.04 LTS)
(I'm sorry for a bit of confusion ;) The algorithm I used is actually correct if the id is monotonically increasing; I just forgot that it was not itself auto-generated, it was taken from another table where it's auto-generated, and it was possible to add these rows in a different order.
So I added another id for this table, which is auto-generated, and used it for row selection, and now it works as it should.
I've struggled with this for a while now trying to figure out how to do this most efficiently.
The problem is as follows: I have items in a database to be marketed to specific age groups, such as ages 10 to 20 or ages 16+, and I need to be able to make a query like "find items that are suitable for a 17-year-old".
Here are my two best ideas (but I don't like either, as I think they're both inefficient).
Have a CSV column with values like 10-20 and 16+, retrieve the entire list, and parse through it (bad idea, I know; I'm fresh out of ideas here though).
Have a CSV column with values like 10,11,12,13...20 for ranges, so I can look for a match using WHERE ages LIKE "%17%"; for cases like 16+ I'd have to retrieve those special cases using something like WHERE ages LIKE "%+%" and parse through those.
I'm of course leaning towards the second option, but even in the best scenario I'm running two queries: one for regular items, and one for things like 16+.
Is there a better way? If not, do you think you could make either of my models more efficient? Thanks.
You can do it like this:
1. Add lower_age and upper_age columns to your table, both integers that allow NULLs.
2. If lower_age is NULL then there is no lower bound.
3. If upper_age is NULL then there is no upper bound.
4. Combine COALESCE and BETWEEN for your queries.
To clarify (4), you want to say things like this:
select *
from your_table
where $n between coalesce(lower_age, $n) and coalesce(upper_age, $n)
where $n is the age you're looking for. BETWEEN uses inclusive bounds so coalesce(lower_age, $n) ignores $n if lower_age is not NULL and gives you $n >= $n (i.e. an automatic true on that bound) if lower_age is NULL; similarly for the upper_age.
If something is suitable for only 11 year olds, then your [lower_age,upper_age] closed interval would be [11, 11], 16+ would be [16, NULL], six and lower would be [NULL, 6], everyone would be [NULL, NULL], and no one would be [23, 11] or anything else with lower_age > upper_age (or, more likely, invalid data that a CHECK constraint would throw a hissy fit over).
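A minimal sketch of those columns plus such a check, assuming the table is called items:

ALTER TABLE items
    ADD COLUMN lower_age integer NULL,
    ADD COLUMN upper_age integer NULL,
    ADD CONSTRAINT items_age_range_chk
        CHECK (lower_age IS NULL OR upper_age IS NULL OR lower_age <= upper_age);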
You can do this a number of ways. If you store the age of the user (or whatever) in the row, then you can query the age with > 16, < 30, or BETWEEN 10 AND 20, whatever. The other option is to store this as a bitwise value. Have a reference table and store your different ranges; if a row can belong to multiple groups, you just add the corresponding values together.
1 = 10
2 = 16+
4 = 10-20
8 = 20-30
16 = 20+
32 = 30+
...
Then, in the table that stores the person's info, you can set the column to an int or bigint (take your preference), and you can determine which groups they belong to from the number (see the query sketch after the table). For example:
Table of Users
ID Name BitWise
1 test 2
2 something 6 (2+4)
3 blah 24 (8+16)
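Given the reference table above, a membership check could look something like this (a sketch; the exact bitwise syntax may vary slightly per database):

-- everyone flagged with the "16+" group (flag value 2)
SELECT *
FROM Users
WHERE BitWise & 2 <> 0;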
However, I think the bitwise approach may be a bit of overkill; you might be best off just storing the age as a number and running queries against that. More than likely this will be the most efficient.
You have a range of options (no pun intended). For age recommendations, the easiest way is to store a min_age and max_age and query like this:
select * from item where :age between min_age and max_age
where you have to decide whether you allow NULLs for these columns (then you need to use coalesce() or nvl() or whatever function your database provides for dealing with comparisons against NULL), or set boundary values for these columns so that you can be sure :age will always fall in between.
Alternatively, you can use an m:n table
create table item_ages (item_id int not null, age int not null, constraint item_ages_pk primary key (item_id, age));
and fill it with explicit values:
item_id | age
-------------
1 | 16
1 | 17
1 | 18
and so on. This is more cumbersome than using a range, but also more flexible, and since your database can index the table and probably keep that index in memory, queries should be fast. You only have to touch this table when a new item is entered or the age range for a particular item changes.
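Looking up items suitable for a 17-year-old against that table would then be something like this (the item table and its item_id column are assumptions):

SELECT i.*
FROM item i
JOIN item_ages ia ON ia.item_id = i.item_id
WHERE ia.age = 17;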
Note that CBRRacer's answer has similar properties: both share the idea that you prepare a data structure that can easily be indexed, and answer the filter question from that index. This is a popular method for storing marketing data in ecommerce applications. The extreme end of that range would be to use a dedicated package for storing inverted indexes for that purpose. But for a simple age recommendation that's of course overkill.
Something like this:
SELECT *
FROM tablename
WHERE 17 BETWEEN start_age AND end_age
I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse this categorization that I use in the post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will allow for available values in the application when managing an employee and for management of these values when adding/deleting departments and such. For instance, the Locations table may look like,
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find the use of start/end dates perfectly easy for people to understand, allowing Agent and Location fact tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth taking note of.
If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August?
Dealing with overlapping date periods can give messy queries, but more importantly, slow queries.
The first one seems simply a matter of convention, but it can have certain implications when dealing with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact you don't need to know whether the date is DAY bound, or can include a time component or not. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
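For example, a join onto finer-grained event data needs no adjustment at all with exclusive end markers (a sketch; the EmployeeLocations and AgentDailyData names and columns are assumptions based on the examples above):

SELECT el.EmployeeId, el.LocationId, d.EventTimeStamp
FROM EmployeeLocations el
JOIN AgentDailyData d
  ON  d.EmployeeId = el.EmployeeId
  AND d.EventTimeStamp >= el.EffectiveDate   -- inclusive start
  AND d.EventTimeStamp <  el.EndDate;        -- exclusive end: no "+ 1" needed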
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
TableA
INNER JOIN
TableB
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index. But it could give a massive range of dates. And then, for each of those records, the ExclusiveFrom dates could be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But instead we've limited the number of rows that can be returned by searching TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade off; using Disk Space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc, I actually found it Very powerful, adaptable, and maintainable. I actually saw people making fewer coding mistakes, writing code in less time, the code ending up running faster, and being much more able to adapt to clients' changing needs.
The only two down sides were:
1. More disk space taken (but trivial compared to the size of the fact tables)
2. Inserts and Updates to this mapping were slower
The actual slow-down for Inserts and Updates only actually mattered once, where this model was being used to represent a constantly changing process net; the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity; if you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit flag and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value, or no records if a value isn't currently set. Have the foreign key tie back to the employee and tie to the location.
Create a column in the employee table that points to the current active location in the history. When you need to get the employee's location, you join to the history table based on this ID. When you need to get all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
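A rough sketch of that layout (all names are illustrative; Employees and Locations are assumed to exist as in the question):

CREATE TABLE EmployeeLocations (
    EmployeeLocationId int  PRIMARY KEY,
    EmployeeId         int  NOT NULL,   -- FK to Employees
    LocationId         int  NOT NULL,   -- FK to Locations
    EffectiveDate      date NOT NULL
);

-- Employees gains a CurrentEmployeeLocationId column pointing at the
-- active history row, so the current location needs no date comparison:
SELECT e.EmployeeId, l.LocationName
FROM Employees e
JOIN EmployeeLocations el
  ON el.EmployeeLocationId = e.CurrentEmployeeLocationId
JOIN Locations l
  ON l.LocationId = el.LocationId;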
As far as using the word history, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps around the old item. As such you can name it something like EmployeeLocations.
For ten years we've been using the same custom sorting on our tables, and I'm wondering if there is another solution which involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like to have our replication replicate unnecessary entries. I had a look into nested sets, but it doesn't seem to do the job for us.
Base table:
id | a_sort
---+-------
1 10
2 20
3 30
After inserting:
insert into table (a_sort) values(15)
An entry at the second position.
id | a_sort
---+-------
1 10
2 20
3 30
4 15
Ordering the table with:
select * from table order by a_sort
and resorting all the a_sort entries, updating at least id=(2,3,4)
will of course produce the desired output:
id | a_sort
---+-------
1 10
4 20
2 30
3 40
The column names, the column count, datatypes, a possible join, possible triggers, or the way the resorting is done is/are irrelevant to the problem. Also, we've found some pretty neat ways to do this task fast.
The only question is: how the heck can we reduce the updates in the db to 1 or 2 max?
Seems like an awfully common problem.
The captain obvious in me thought once: "use an a_sort float(53), insert using a fixed value of ordervaluefirstentry+abs(ordervaluefirstentry-ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)
You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
See:
http://en.wikipedia.org/wiki/Linked_list
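For concreteness, a minimal sketch of the singly linked variant (table and column names are assumptions): each row stores the id of the row that follows it in the custom order, so an insert touches at most the new row and its predecessor.

CREATE TABLE sortable (
    id      int PRIMARY KEY,
    next_id int NULL                                -- NULL marks the last row in the order
);

-- insert a new row (id 4) between rows 1 and 2:
INSERT INTO sortable (id, next_id) VALUES (4, 2);   -- new row points at its successor
UPDATE sortable SET next_id = 4 WHERE id = 1;       -- predecessor now points at the new row

Reading the rows back in order then means walking the chain (e.g. with a recursive CTE), which is the trade-off against a simple ORDER BY a_sort.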