I've struggled with this for a while now trying to figure out how to do this most efficiently.
The problem is as follows: I have items in a database that are marketed to specific age groups, such as ages 10 to 20 or ages 16+, and I need to be able to make a query like "find items suitable for a 17 year old".
Here are my two best ideas (but I don't like either, as I think they're both inefficient).
Have a CSV column with values like 10-20 and 16+, retrieve the entire list, and parse through it (bad idea, I know; I'm fresh out of ideas here though).
Have a CSV column with values like 10,11,12,13...20 for ranges, so I can look for a specific age using WHERE ages LIKE "%17%", and for cases like 16+ retrieve those special cases using something like WHERE ages LIKE "%+%" and parse through them.
I'm of course leaning towards the second option, but even in the best scenario I'm running two queries: one for regular items, and one for things like 16+.
Is there a better way? If not, do you think you could make either of my models more efficient? Thanks.
You can do it like this:
1. Add lower_age and upper_age columns to your table, both integers that allow NULLs.
2. If lower_age is NULL then there is no lower bound.
3. If upper_age is NULL then there is no upper bound.
4. Combine COALESCE and BETWEEN for your queries.
To clarify (4), you want to say things like this:
select *
from your_table
where $n between coalesce(lower_age, $n) and coalesce(upper_age, $n)
where $n is the age you're looking for. BETWEEN uses inclusive bounds so coalesce(lower_age, $n) ignores $n if lower_age is not NULL and gives you $n >= $n (i.e. an automatic true on that bound) if lower_age is NULL; similarly for the upper_age.
If something is suitable for only 11 year olds, then your [lower_age,upper_age] closed interval would be [11, 11], 16+ would be [16, NULL], six and lower would be [NULL, 6], everyone would be [NULL, NULL], and no one would be [23, 11] or anything else with lower_age > upper_age (or, more likely, invalid data that a CHECK constraint would throw a hissy fit over).
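For illustration, a minimal sketch of that layout (the items table and its name column are made up here; only lower_age and upper_age matter):

create table items (
    item_id   int primary key,
    name      varchar(100) not null,
    lower_age int null,  -- NULL = no lower bound
    upper_age int null,  -- NULL = no upper bound
    check (lower_age is null or upper_age is null or lower_age <= upper_age)
);

-- everything suitable for a 17 year old
select *
from items
where 17 between coalesce(lower_age, 17) and coalesce(upper_age, 17);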
You can do this in a number of ways. If you store the age in the row, then you can query it directly with > 16, < 30, BETWEEN 10 AND 20, or whatever you need. The other option is to store this as a bitwise value: have a reference table that stores your different ranges, and if an item can belong to multiple ranges, you just add the flag values together.
1 = 10
2 = 16+
4 = 10-20
8 = 20-30
16 = 20+
32 = 30+
...
Then in the table that stores the person's info you can set the column to an int or bigint (take your preference), and for whatever groups they belong to you can determine this from the number, for example:
Table of Users
ID Name BitWise
1 test 2
2 something 6 (2+4)
3 blah 24 (8+16)
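A lookup against that column is then a bitwise AND, roughly like this (assuming the Users table above and that your database supports the & operator):

-- everyone who belongs to the 10-20 group (bit value 4)
select ID, Name
from Users
where (BitWise & 4) > 0;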
However, I think the bitwise approach may be a bit overkill; you might be best just storing the age as a number and running queries against that. More than likely this will be the most efficient.
You have a range of options (no pun intended). For age recommendations, the easiest way is to store a min_age and max_age and query like this:
select * from item where :age between min_age and max_age
where you have to decide whether you allow NULLs for these columns (then you need to use coalesce() or nvl() or whatever function your database provides for dealing with comparisons with NULLs), or set boundary values for these columns so that you can be sure :age will always fall in between.
Alternatively, you can use an m:n table
create table item_ages (item_id int not null, age int not null, constraint item_ages_pk primary key (item_id, age));
and fill it with explicit values:
item_id | age
-------------
1 | 16
1 | 17
1 | 18
and so on. This is more cumbersome than using a range, but also more flexible, and since your database can index the table and probably keep that index in memory, queries should be fast. You only have to touch this table when a new item is entered or the age range for a particular item changes.
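Finding items for a 17 year old then becomes a plain join (the item table and item_id column names are assumed):

select i.*
from item i
join item_ages ia on ia.item_id = i.item_id
where ia.age = 17;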
Note that CBRRacer's answer has similar properties: both share the idea that you prepare a data structure that can easily be indexed, and answer the filter question from that index. This is a popular method for storing marketing data in e-commerce applications. The extreme end of that range would be to use a dedicated package for storing inverted indexes for that purpose. But for a simple age recommendation that's of course overkill.
Something like this:
SELECT *
FROM tablename
WHERE 17 BETWEEN start_age AND end_age
Related
I am building an app where the server needs to select rows based on some criteria/filters. One of them is the location of the user and the radius at which the user wants to see posts, plus other filters such as date range and a filter on the value of another column. This is going to be for an ad-hoc event discovery app.
I have read about PostGIS and I know there is a point datatype. Based on this answer I understood that it is better to order from equality to range columns, even though I feel like the geo point column should come first. But the main question is: how is it possible to create such an index? I thought about a GiST index but am not sure that it helps.
Suppose the following simplified events table (disregard the validity of the position data):
id event_title event_position event_type is_public start_date
(varchar) (point lat/lon) (smallint) (boolean) (timestamptz)
-- --------------------------- --------------- --------- --------- ----
1 "John's Party" (122,35) 0 0 2020-07-05
2 "Revolution then Starbucks" (123,30) 1 1 2020-07-06
3 "Study for math exam" (120,36) 2 1 2020-07-07
4 "Party after exam" (120,36) 1 1 2020-07-08
5 "Hiking next to the city" (95,40) 3 1 2020-07-09
6 "Football match" (-42,31) 4 1 2020-07-10
So in this table a user would be able to query public events close to (122,34) by 100km (suppose first three rows fall into this area) and of event types 0, 1 or 2 falling between dates 2020-07-05 and 2020-07-07. The user would get the rows with ID 2 and 3.
This is the query I want to optimize with an appropriate index. Thank you!
With the help of the btree_gist extension, you can include the event_type and start_date columns into the GiST index along with the event_position. However, the event_type won't be very useful in the index as long as the restriction clause is something like event_type in (0, 1, 2). (But if the list only contains one element, it will be rewritten into an equality, and in that case it can use that column in the index efficiently.) So using the other two columns would be my starting point. I would put the usually more selective one first. If you are going to leave past events in the table rather than clearing them out, then the date is probably going to eventually be the more selective one.
If the location condition is not very selective at all (e.g. most of your events are in NYC, and most of your users are in NYC, and almost everything is within 100km of almost everything else), then you might want a different approach. Just make a default BTREE index on (event_type, start_date). Unlike GiST, such a BTREE can effectively use a condition like event_type in (0, 1, 2) along with AND start_date between x and y.
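As a rough sketch of both options (assuming the table is called events and a reasonably recent PostgreSQL):

-- GiST index covering position plus date; needs btree_gist for the timestamptz column
create extension if not exists btree_gist;
create index events_pos_date_gist on events using gist (event_position, start_date);

-- plain B-tree alternative when the location filter isn't selective
create index events_type_date_btree on events (event_type, start_date);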
I don't think that there is a fundamental reason GiST can't use the in-list effectively while BTREE can. Maybe that will be fixed for GiST in some future version.
My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database. The client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on.)
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new |
+------+------------+
| 3 | 0000000003 |
| 44 | 0000000044 |
| 105 | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This will convert the string representation to a number and use that for ordering. There is no need for an additional column. If you add one, I would recommend adding a numeric column to contain the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some form.
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack but works as long as the "numbers" in your string column are natural, non-negative numbers only.
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.
Suppose I have a large table to store ranges of integers. I can do this with two fields:
start|end
10 |210 (represents 10 to 210)
5 |55 (represents 5 to 55)
(quick to select by end column), or:
start|length
10 | 200 (represents 10 to 210)
5 | 50 (represents 5 to 55)
(quick to select by length column).
What if sometimes I need to select by end, and sometimes by length, and both queries need to be fast? I could store both:
start|length|end
10 | 200 |210
5 | 50 |55
But then this is not normalised, everyone has to remember to update both fields, and it is just bad design.
I know I can select by start + length or end - start but for a very large table, isn't this extremely slow?
How can I query by calculated values quickly without storing redundant data - or should I just store the extra column?
Depending on the database type you are using, you might want to use a trigger to calculate the derived field. That way, they can never get out of synch.
This means that the field (length) could be re-calculated every time start or end changes.
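For example, in MySQL a pair of single-statement triggers could keep a stored length in step with the other two columns (table and column names are assumptions; start and end are quoted because they are keywords):

create trigger ranges_bi before insert on ranges
for each row set new.length = new.`end` - new.`start`;

create trigger ranges_bu before update on ranges
for each row set new.length = new.`end` - new.`start`;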
I'd store the length, but I'd make sure the calculation was done in my insert and update sprocs so that as long as everyone uses your sprocs there is no more overhead for them.
Unfortunately neither of your target databases support computed columns. I would do the following:
First, determine whether you really have a performance problem. It is true that WHERE end - start = ? will perform more slowly than WHERE length = ?, but you don't define what a "really big table" is in your application, nor what the required performance is. No need to optimize away a problem that may not exist.
Determine whether you can support any latency in your searches. If so, you can add the calculated column to the table but dedicate a separate task, running every five minutes, each hour, or whatever, to fill in the values.
In PostgreSQL you could consider a materialized view, which I believe is supported at the engine level (see Catcall's comment below, and the sketch after this list).
Finally, if all else fails, consider using a trigger to maintain the calculated column.
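In PostgreSQL, for instance, the materialized-view route might look roughly like this (table and column names are assumptions, and the data is only as fresh as the last refresh):

create materialized view ranges_with_length as
select r.*, r."end" - r.start as length
from ranges r;

create index on ranges_with_length (length);

-- rerun on whatever schedule your latency requirements allow
refresh materialized view ranges_with_length;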
I've been beating my head on the desk trying to figure this one out. I have a table that stores job information, and reasons for a job not being completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending that stores what those values actually represent: 01 = Not Enough Info, 02 = Not Enough Time, 03 = Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reason. I can break out the first value, but can't figure out how to do the second. What my limited skills have so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find( ',' || fld || ',', ',02,' ) > 0
assuming your SQL dialect has a string search function (find in this case, but I think charindex for SQLServer).
This will ensure all sub-columns begin and end with a comma (comma plus field plus comma) and look for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
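In SQL Server terms that would be something like this (using the column from the question):

select job_number, user_assigned, pendinginfo
from jobid
where charindex(',02,', ',' + pendinginfo + ',') > 0;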
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is only paid on row insert/update, not on every single select, amortising that cost efficiently.
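A rough SQL Server-flavoured sketch of such a trigger; the pending1 and pending2 column names are hypothetical:

create trigger trg_jobid_split_pending on jobid
after insert, update
as
update j
set pending1 = left(i.pendinginfo, charindex(',', i.pendinginfo + ',') - 1),
    pending2 = nullif(stuff(i.pendinginfo, 1, charindex(',', i.pendinginfo + ','), ''), '')
from jobid j
join inserted i on i.job_number = j.job_number;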
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach with support tables like this:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOINs or subqueries.
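For example, to list every job with its pending reasons (or only jobs with one particular reason), a join over the tables sketched above does the work:

select j.jobID, j.userID, p.pendingID, p.pendingText
from jobs j
join job_pending jp on jp.jobID = j.jobID
join pending p on p.pendingID = jp.pendingID
where p.pendingID = '02';  -- drop the WHERE to list all reasons per job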
If you need backward compatibility in your software, you can add a view to reach this goal.
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write 2 procedures in my site code, not SQL code.
One procedure converts the table field (eventTypeIds) value like "3,4,15,6" into a ViewState array, so I can use it anywhere in code.
The other procedure does the opposite: it collects whichever options you checked and converts them back into the comma-separated value to store.
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
SELECT JOB_NUMBER, USER_ASSIGNED, SUBSTRING(PENDING_INFO, Numbers.N, CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs, then extract the appropriate part of the string.
While I agree with the DBA perspective of not storing multiple values in a single field, it is doable, as below, and can be practical for application logic despite some performance issues. Let's say you have 10000 user groups, each having on average 1000 members. You may want to have a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,'), each number being a memberID, all separated with commas. Also add a comma at the start and end of the data. Then your SELECT would use LIKE '%,someID,%'.
If you cannot change your data ('01,02,03' or similar), and say you want rows containing 01, you can still use SELECT ... WHERE col LIKE '01,%' OR col LIKE '%,01' OR col LIKE '%,01,%', which will ensure it matches at the start, at the end, or inside, while avoiding similar numbers (i.e. 101).
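A concrete sketch of both variants (MySQL-style CONCAT; the names come from the examples above):

-- data stored with leading/trailing commas, as in user_groups above
select groupID
from user_groups
where membersID like '%,2001,%';

-- data stored without them ('01,02,03'): pad on the fly instead of three ORs
select *
from jobid
where concat(',', pendinginfo, ',') like '%,01,%';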
For ten years we've been using the same custom sorting on our tables, and I'm wondering if there is another solution which involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like our replication to replicate unnecessary entries. I had a look into nested sets, but it doesn't seem to do the job for us.
Base table:
id | a_sort
---+-------
1 10
2 20
3 30
After inserting:
insert into table (a_sort) values(15)
An entry at the second position.
id | a_sort
---+-------
1 10
2 20
3 30
4 15
Ordering the table with:
select * from table order by a_sort
and resorting all the a_sort entries, updating at least id=(2,3,4)
will of course produce the desired output:
id | a_sort
---+-------
1 10
4 20
2 30
3 40
The column names, the column count, the datatypes, a possible join, possible triggers or the way the resorting is done are irrelevant to the problem. Also, we've found some pretty neat ways to do this task fast.
The only question is: how the heck can we reduce the updates in the db to 1 or 2 max?
Seems like an awfully common problem.
The captain obvious in me once thought: "use an a_sort float(53) and insert using a fixed value of ordervaluefirstentry + abs(ordervaluefirstentry - ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)
You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
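A minimal sketch of the singly linked variant (table and column names are made up):

-- items linked in display order via next_id (NULL marks the last entry)
create table sorted_items (
    id      int primary key,
    next_id int null references sorted_items(id)
);

-- insert a new row 4 between rows 1 and 2: only two statements, two rows touched
insert into sorted_items (id, next_id) values (4, 2);  -- new row points at the old successor
update sorted_items set next_id = 4 where id = 1;      -- predecessor now points at the new row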
See:
http://en.wikipedia.org/wiki/Linked_list