I am following the tutorial at https://redis.io/commands/geosearch/ and have successfully migrated ~300k records (from an existing Postgres database) into a key named testkey (sorry for the unfortunate name, but I am just testing this out!).
However, a query for items within 5 km returns thousands of items. I'd like to limit the results to 10 at a time and be able to load the next 10 using some sort of keyset pagination.
So, to limit the results I am using:
GEOSEARCH testkey FROMLONLAT -122.2612767 37.7936847 BYRADIUS 5 km WITHDIST COUNT 10
How can I execute GEOSEARCH queries with pagination?
Some context: I have a Postgres + PostGIS database with ~3m records. I have a service that fetches items within a radius, and even with the right indexes it is starting to get sluggish. My other endpoints can handle 3-8k rps, while this one can barely handle 1500 (8 ms average query execution time). I am exploring moving the items into a Redis cache, either the entire payload or just the IDs, and then running an IN query against Postgres (<1 ms query time).
I am struggling to find any articles on this via Google search.
You can use GEOSEARCHSTORE to create a sorted set with the results from your search. You can then paginate this sorted set with ZRANGE. This is shown as an example on the GEOSEARCHSTORE page:
redis> GEOSEARCHSTORE key2 Sicily FROMLONLAT 15 37 BYBOX 400 400 km ASC COUNT 3 STOREDIST
(integer) 3
redis> ZRANGE key2 0 -1 WITHSCORES
1) "Catania"
2) "56.441257870158204"
3) "Palermo"
4) "190.44242984775784"
5) "edge2"
6) "279.7403417843143"
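Applied to your data, a minimal sketch might look like the following (the destination key name searchresults and the 300-second EXPIRE are my own assumptions, not part of the original example; the coordinates and radius are the ones from your GEOSEARCH call):
GEOSEARCHSTORE searchresults testkey FROMLONLAT -122.2612767 37.7936847 BYRADIUS 5 km ASC STOREDIST
EXPIRE searchresults 300
ZRANGE searchresults 0 9 WITHSCORES
ZRANGE searchresults 10 19 WITHSCORES
The first ZRANGE returns items 1-10 with their distances as scores (thanks to STOREDIST), the second returns items 11-20, and so on. The EXPIRE is just there so the materialized result set doesn't live forever.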
With the CKAN API query below I get count = 47 (that's correct) but only 10 results.
How do I get all 47 results with the API query?
CKAN API Query:
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc
From the source (note: for me the page loads very slowly, so be patient):
https://suche.transparenz.hamburg.de/dataset?q=hvv-fahrplandaten+gtfs&sort=score+desc%2Ctitle_sort+asc&esq_not_all_versions=true&limit=50&esq_not_all_versions=true
The count only shows the total number of results found. You can change the number of results returned by setting the rows (and start) parameters, e.g. https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc&rows=100. The limit is 1000 rows per query. You can find more info in the CKAN API documentation.
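If you want to page through the results rather than raise the limit, package_search also accepts a start offset alongside rows (standard CKAN/Solr parameters; I'm assuming this instance hasn't restricted them). For example, pages of 10:
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc&rows=10&start=0
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc&rows=10&start=10
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc&rows=10&start=20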
I know that Redis doesn't really have the concept of secondary indexes, but that you can use the Z* commands to simulate one. I have a question about the best way to handle the following scenario.
We are using Redis to keep track of orders. But we also want to be able to find those orders by phone number or email ID. So here is our data:
> set 123 7245551212:dlw#email.com
> set 456 7245551212:dlw#email.com
> set 789 7245559999:kdw#email.com
> zadd phone-index 0 7245551212:123:dlw#email.com
> zadd phone-index 0 7245551212:456:dlw#email.com
> zadd phone-index 0 7245559999:789:kdw#email.com
I can see all the orders for a phone number via the following (is there a better way to get the range other than adding a 'Z' to the end?):
> zrangebylex phone-index [7245551212 (7245551212Z
1) "7245551212:123:dlw#dcsg.com"
2) "7245551212:456:dlw#dcsg.com"
My question is, is this going to perform well? Or should we just create a list that is keyed by phone number, and add an order ID to that list instead?
> rpush phone:7245551212 123
> rpush phone:7245551212 456
> rpush phone:7245559999 789
> lrange phone:7245551212 0 -1
1) "123"
2) "456"
Which would be the preferred method, especially related to performance?
RE: is there a better way to get the range other than adding a 'Z' to the end?
Yes, use the next immediate character instead of adding Z:
zrangebylex phone-index [7245551212 (7245551213
But certainly the second approach offers better performance.
Using a sorted set for lexicographical indexing, you need to consider that:
The addition of elements, ZADD, is O(log(N))
The query, ZRANGEBYLEX, is O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned
In contrast, using lists:
The addition, RPUSH, is O(1)
The query, LRANGE, is O(S+N); since you start at offset zero, it is effectively O(N) in the number of elements returned.
You can also use sets (SADD and SMEMBERS); the difference is that lists allow duplicates and preserve insertion order, while sets ensure uniqueness but do not preserve insertion order.
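For completeness, a sketch of the set-based variant using the same keys and order IDs as the list example above (note that SMEMBERS does not guarantee any particular order):
> sadd phone:7245551212 123
> sadd phone:7245551212 456
> sadd phone:7245559999 789
> smembers phone:7245551212
1) "123"
2) "456"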
A ZSET uses a skip list for ordering plus a dict for member lookup. If you add all elements with the same score, the skip list falls back to ordering members lexicographically, and it behaves like a balanced search tree, giving O(log N) time complexity for lexicographical range searches.
So if you don't need range queries over phone numbers, use a list of orders keyed by phone number for exact lookups. The same approach works for email (you could use a hash to tie the two lists together). Query performance this way will be much better than with a ZSET.
For example:
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue LIMIT 10
The only difference I can see between using and not using LIMIT 10 is the amount of data displayed to the user.
The system still orders all the data first before applying the LIMIT.
The below is applicable to BigQuery.
This is not necessarily 100% technically correct, but it is close enough that I hope it gives you an idea of why LIMIT N is extremely important to consider in BigQuery.
Assume you have 1,000,000 rows of data and 8 workers to process a query like the one below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field
Round 1: To sort this data each worker gets 125,000 rows – so now you have 8 sorted sets of 125,000 rows each
Round 2: Worker #1 sends its sorted data (125,000 rows) to worker #2, worker #3 sends to worker #4, and so on. So now we have 4 workers, each producing an ordered set of 250,000 rows
Round 3: The above logic is repeated, and now we have just 2 workers, each producing an ordered list of 500,000 rows
Round 4: And finally, just one worker produces the final ordered set of 1,000,000 rows
Of course, depending on the number of rows and the number of available workers, the number of rounds can differ from the above example
In summary, what we have here:
a. We have a huge amount of data being transferred between workers – this can be a significant factor in performance degradation
b. And there is a chance that one of the workers will not be able to process the amount of data distributed to it. This can happen earlier or later and usually manifests as a “Resources exceeded …” type of error
Now, if you have LIMIT as part of the query, as below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field LIMIT 10
Round 1 is going to be the same. But starting with Round 2, ONLY the top 10 rows are sent to the next worker – thus in each round after the first, only 20 rows are processed and only the top 10 are sent on for further processing
Hopefully you can see how different these two processes are in terms of the volume of data being sent between workers and how much work each worker needs to do to sort its data
To Summarize:
Without LIMIT 10:
• Initial rows moved (Round 1): 1,000,000;
• Initial rows ordered (Round 1): 1,000,000;
• Intermediate rows moved (Round 2 - 4): 1,500,000
• Overall merged ordered rows (Round 2 - 4): 1,500,000;
• Final result: 1,000,000 rows
With LIMIT 10:
• Initial rows moved (Round 1): 1,000,000;
• Initial rows ordered (Round 1): 1,000,000;
• Intermediate rows moved (Round 2 - 4): 70
• Overall merged ordered rows (Round 2 - 4): 140;
• Final result: 10 rows
Hopefully the above numbers clearly show the performance difference you gain by using LIMIT N, and in some cases even the ability to run the query successfully at all without a "Resources exceeded ..." error
This answer assumes you are asking about the difference between the following two variants:
ORDER BY totalRevenue
ORDER BY totalRevenue LIMIT 10
In many databases, if a suitable index existed involving totalRevenue, the LIMIT query could stop sorting after finding the top 10 records.
In the absence of any index, as you pointed out, both versions would have to do a full sort, and therefore should perform the same.
Also, there is a potentially major performance difference between the two if the table is large: in the LIMIT version, BigQuery only has to send 10 records across, while in the non-LIMIT version potentially much more data has to be sent.
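To make that concrete: in an engine with conventional B-tree indexes (e.g. PostgreSQL or MySQL; BigQuery itself has no user-defined indexes), a sketch would look something like the following, where the index name is made up:
CREATE INDEX idx_companyperformance_totalrevenue ON companyperformance (totalRevenue);

SELECT company_ID, totalRevenue
FROM companyperformance
ORDER BY totalRevenue
LIMIT 10;
With that index in place, the planner can walk the index in order and stop after 10 rows instead of sorting the whole table.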
There is no performance gain. BigQuery still has to go through all the records in the table.
You can partition your data in order to cut down the number of records that BigQuery has to read, which will improve performance. You can read more here:
https://cloud.google.com/bigquery/docs/partitioned-tables
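As a rough sketch (the table name and the report_date column, assumed here to be a DATE column, are made up for illustration; the actual DDL is documented at the link above), a date-partitioned copy of the table and a query that prunes partitions might look like:
CREATE TABLE BQdataset.companyperformance_partitioned
PARTITION BY report_date
AS SELECT * FROM BQdataset.companyperformance;

SELECT company_ID, totalRevenue
FROM BQdataset.companyperformance_partitioned
WHERE report_date BETWEEN DATE '2018-01-01' AND DATE '2018-01-31';
Because the WHERE clause filters on the partitioning column, BigQuery only reads the matching partitions rather than the whole table.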
See the statistics difference in the BigQuery UI between the two queries below:
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 1000
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 10000
As you can see, BQ returns to the UI immediately after the limit criterion is reached; this results in better performance and less traffic on the network.
I'm using paging in my app but I've noticed that paging has gone very slow and the line below is the culprit:
SELECT COUNT (*) FROM MyTable
On my table, which only has 9 million rows, it takes 43 seconds to return the row count. I read another article which states that returning the row count for 1.4 billion rows takes over 5 minutes. This obviously can't be used with paging, as it is far too slow, and the only reason I need the row count is to calculate the number of available pages.
After a bit of research I found out that I get the row count pretty much instantly (and accurately) using the following:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
But the above returns me the count for the entire table which is fine if no filters are applied but how do I handle this if I need to apply filters such as a date range and/or a status?
For example, what is the row count for MyTable when the DateTime field is between 2013-04-05 and 2013-04-06 and status='warning'?
Thanks.
UPDATE-1
In case I wasn't clear, I need the total number of rows available so that I can determine the number of pages required for the rows matching my query when using the 'paging' feature. For example, if a page shows 20 records and the total number of records matching my query is 235, I know I'll need to display 12 buttons below my grid.
01 - (row 1 to 20) - 20 rows displayed in grid.
02 - (row 21 to 40) - 20 rows displayed in grid.
...
11 - (row 201 to 220) - 20 rows displayed in grid.
12 - (row 221 to 235) - 15 rows displayed in grid.
There will be additional logic added to handle a large amount of pages but that's a UI issue, so this is out of scope for this topic.
My problem with using "SELECT COUNT(*) FROM MyTable" is that it was taking 40+ seconds on 9 million records (though it isn't anymore and I need to find out why!), but with that method I was able to add the same filters as my main query in order to determine the page count. For example:
SELECT COUNT(*) FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning'
Once I determine the page count, I then run the same query but select the fields instead of COUNT(*), using CurrentPageNo and PageSize to filter my results by page number via the row IDs and to navigate to a specific page if needed.
SELECT RowId, DateTime, Status, Message FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning' AND
RowId BETWEEN (CurrentPageNo * PageSize) AND ((CurrentPageNo + 1) * PageSize)
Now, if I use the other mentioned method to get the row count i.e.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
It returns the count instantly but how do I filter this so that I can include the same filters as if I was using the SELECT COUNT(*) method, so I could end up with something like:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable') AND
(index_id=0 or index_id=1) AND
([DateTime] BETWEEN '2018-04-05' AND '2018-04-06') AND
([Status] = 'Warning')
The above clearly won't work, as I'm querying dm_db_partition_stats, but I would like to know whether I can somehow perform a join or something similar that gives me the total number of rows instantly, filtered by my criteria rather than applying to the entire table.
Thanks.
Have you ever asked for directions to alpha centauri? No? Well the answer is, you can't get there from here.
Adding indexes, re-orgs/re-builds, updating stats will only get you so far. You should consider changing your approach.
sp_spaceused will typically return the record count instantly. You may be able to use this; however, depending on what you are using the count for (you haven't quite given us enough information), it might not be adequate.
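For reference, a minimal example of that call against the table from the question:
EXEC sp_spaceused N'MyTable';
The rows column it returns is a metadata-based count for the whole table, which is why it comes back instantly; like sys.dm_db_partition_stats, it cannot be filtered by your date/status predicates.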
I am not sure if you are trying to use this count as a means to short circuit a larger operation or how you are using the count in your application. When you start to highlight 1.4 billion records and you're looking for a window in said set, it sounds like you might be a candidate for partitioned tables.
This allows you to combine several smaller tables, typically separated by date (years/months), that act as a single table. When you query a date range over 1.4+ billion records, SQL Server can then meet performance expectations. This does depend on the SQL Server edition, and there is also view partitioning as well.
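For a flavour of what that looks like, here is a rough sketch of monthly partitioning in T-SQL (every name and boundary date below is made up; the column list is borrowed from the question's query):
CREATE PARTITION FUNCTION pf_MyTable_Monthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01');

CREATE PARTITION SCHEME ps_MyTable_Monthly
    AS PARTITION pf_MyTable_Monthly ALL TO ([PRIMARY]);

-- The table (or its clustered index) is then created on the partition scheme:
CREATE TABLE dbo.MyTablePartitioned
(
    RowId      bigint        NOT NULL,
    [DateTime] datetime      NOT NULL,
    [Status]   varchar(20)   NOT NULL,
    [Message]  nvarchar(max) NULL
) ON ps_MyTable_Monthly ([DateTime]);
Queries (and counts) that filter on [DateTime] can then be satisfied by touching only the relevant partitions.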
Kimberly Tripp has a blog and some videos out there, and Kendra Little also has some good content on how they are used and how to set them up. This would be a design change; it is a bit complex and not something you would want to implement on a whim.
Here is a link to Kimberly's Blog: https://www.sqlskills.com/blogs/kimberly/sqlskills-sql101-partitioning/
Dev banter:
Also, I hear you blaming SQL; are you using Entity Framework, by chance?
In my postgres database, I have the following relationships (simplified for the sake of this question):
Objects (currently has about 250,000 records)
-------
n_id
n_store_object_id (references store.n_id, 1-to-1 relationship, some objects don't have store records)
n_media_id (references media.n_id, 1-to-1 relationship, some objects don't have media records)
Store (currently has about 100,000 records)
-----
n_id
t_name,
t_description,
n_status,
t_tag
Media
-----
n_id
t_media_path
So far, so good. When I need to query the data, I run this (note the limit 2 at the end, as part of the requirement):
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
limit
2
This works fine and gives me two entries back, as expected. The execution time on this is about 20 ms - just fine.
Now I need to get 2 random entries every time the query runs. I thought I'd add order by random(), like so:
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
order by
random()
limit
2
While this gives the right results, the execution time is now about 2,500 ms (over 2 seconds). This is clearly not acceptable, as it's one of a number of queries to be run to get data for a page in a web app.
So, the question is: how can I get random entries, as above, but still keep the execution time within some reasonable amount of time (i.e. under 100 ms is acceptable for my purpose)?
Of course it needs to sort the whole thing by the random criterion before taking the first rows. Maybe you can work around this by using random() in OFFSET instead?
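A sketch of that workaround (the 250,000 multiplier is just the approximate size of objects from the question; if the random offset lands past the end of the joined result you get zero rows back, and the two rows you get are always adjacent in whatever order the planner happens to produce):
select
  o.n_id,
  s.t_name,
  s.t_description,
  me.t_media_path
from
  objects o
  join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
  join media me on o.n_media_id = me.n_id
offset floor(random() * 250000)::int
limit 2;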
Here's some previous work done on the topic which may prove helpful:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
I'm thinking you'll be better off selecting random objects first, then performing the join to those objects after they're selected. I.e., query once to select random objects, then query again to join just those objects that were selected.
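A minimal sketch of that idea as a single statement (the inner limit 20 is my own over-sample so the outer limit 2 still has candidates after the join conditions filter some rows out; you could equally run the inner select on its own and feed the IDs into a second query):
select
  o.n_id,
  s.t_name,
  s.t_description,
  me.t_media_path
from
  (select n_id, n_store_object_id, n_media_id
   from objects
   where n_store_object_id is not null
     and n_media_id is not null
   order by random()
   limit 20) o
  join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
  join media me on o.n_media_id = me.n_id
limit 2;
Whether this actually beats the original depends on how much cheaper it is to randomly order just the three ID columns than the fully joined rows.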
It seems like your problem is this: You have a table with 250,000 rows and need two random rows. Thus, you have to generate 250,000 random numbers and then sort the rows by their numbers. Two seconds to do this seems pretty fast to me.
The only real way to speed up the selection is to avoid generating 250,000 random numbers and instead look up rows through an index.
I think you'd have to change the table schema to optimize for this case. How about something like:
1) Create a new column with a sequence starting at 1.
2) Every row will then have a number.
3) Create an index on: number % 1000
4) Query for rows where number % 1000 equals a random number between 0 and 999 (this should hit the index and load a random portion of your database)
5) You can probably then add RANDOM() to your ORDER BY clause; it will then just sort that chunk of your database and be 1,000x faster.
6) Then select the first two of those rows.
If this still isn't random enough (since rows with the same "hash" will always be selected together), you could probably do a union of two random rows, or have an OR clause in the query and generate two random keys.
Hopefully something along these lines could be very fast and decently random.
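A rough SQL sketch of steps 1 through 6 (the column name n_seq, the sequence name, the index name, and the literal bucket value 123 are all made up for illustration; in practice the application would pick the bucket as a random integer between 0 and 999):
-- steps 1-2: give every row a number
alter table objects add column n_seq bigint;
create sequence objects_n_seq;
update objects set n_seq = nextval('objects_n_seq');

-- step 3: index on the bucket expression
create index idx_objects_n_seq_bucket on objects ((n_seq % 1000));

-- steps 4-6: fetch one random ~250-row bucket, shuffle it, take two rows
select
  o.n_id,
  s.t_name,
  s.t_description,
  me.t_media_path
from
  objects o
  join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
  join media me on o.n_media_id = me.n_id
where
  o.n_seq % 1000 = 123
order by
  random()
limit 2;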