How to set up the current database for accurate asynchronous pagination - sql

I am currently working on a way to do accurate pagination over an asynchronously written database. Basically, a request comes in and is broken down into X jobs (placed in a queue), which get pulled by another app that batch-writes Y ResponseItems per job to a PostgreSQL table.
Two tables:
RequestMetaData: RequestId, ..., ..., ...
ResponseItemData: Id, RequestId, ..., ...
Each ResponseItemData is part of a Request.
Let's say request A comes in and gets split into 700 jobs. The application pulls a job and writes 1,000 items. But another request B could have come in as well, so the next batch of 1,000 items written could belong to B instead:
ids    1 - 1000 ==> Job 1 of Request A
ids 1001 - 2000 ==> Job 1 of Request B
ids 2001 - 3000 ==> Job 2 of Request A
So my question is: how do I do accurate pagination here? I already know how to do relative (keyset) cursors
and can retrieve X items per RequestId by using the last item seen on an auto-incrementing id column. Something like this:
SELECT *
FROM "ResponseItemData"
WHERE "ResponseItemData"."Id" > $Last_Item_Seen  -- NULL / omitted on the first GET request
  AND "ResponseItemData"."RequestId" = $RequestId
ORDER BY "ResponseItemData"."Id" ASC
LIMIT X
My question:
A) How do I know when a request is done? If I keep paginating 1,000 items at a time for a RequestId, after 699 pages how do I know the 700th is the last one? How do I keep track of when my request is done? In my RequestMetaData table I store amount_of_jobs_request_split_into as well as expected_number_of_response_items.
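For reference, the kind of completion check I have in mind against those columns would look roughly like this (just a sketch; it assumes expected_number_of_response_items is exact and that $1 carries the RequestId):
-- A request is "done" once the number of ResponseItemData rows written
-- for it matches the expected total stored in RequestMetaData.
SELECT r."RequestId",
       r.expected_number_of_response_items                  AS expected,
       COUNT(i."Id")                                         AS written,
       COUNT(i."Id") >= r.expected_number_of_response_items  AS is_complete
FROM "RequestMetaData" r
LEFT JOIN "ResponseItemData" i ON i."RequestId" = r."RequestId"
WHERE r."RequestId" = $1
GROUP BY r."RequestId", r.expected_number_of_response_items;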
Thanks for the help

Related

How can I paginate the results of GEOSEARCH?

I am following the tutorial on https://redis.io/commands/geosearch/ and I have successfully migrated ~300k records (from an existing pg database) into a key named testkey (sorry for the unfortunate name, but I am just testing it out!).
However, executing a query to return items within 5 km results in 1000s of items. I'd like to limit the results to 10 items at a time, and be able to load the next 10 using some sort of keyset pagination.
So, to limit the results I am using
GEOSEARCH testkey FROMLONLAT -122.2612767 37.7936847 BYRADIUS 5 km WITHDIST COUNT 10
How can I execute GEOSEARCH queries with pagination?
Some context: I have a postgres + postgis database with ~3m records. I have a service that fetches items within a radius, and even with the right indexes it is starting to get sluggish. For comparison, my other endpoints can handle 3-8k rps, while this one can barely handle 1500 (8ms average query execution time). I am exploring moving items into a redis cache, either the entire payload or just the IDs, and running an IN query (<1ms query time).
I am struggling to find any articles using google search.
You can use GEOSEARCHSTORE to create a sorted set with the results from your search. You can then paginate this sorted set with ZRANGE. This is shown as an example on the GEOSEARCHSTORE page:
redis> GEOSEARCHSTORE key2 Sicily FROMLONLAT 15 37 BYBOX 400 400 km ASC COUNT 3 STOREDIST
(integer) 3
redis> ZRANGE key2 0 -1 WITHSCORES
1) "Catania"
2) "56.441257870158204"
3) "Palermo"
4) "190.44242984775784"
5) "edge2"
6) "279.7403417843143"
redis>
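From there, each page is just an index range over the stored sorted set (ZRANGE indices are 0-based and inclusive), for example:
redis> ZRANGE key2 0 9 WITHSCORES
redis> ZRANGE key2 10 19 WITHSCORES
Keep in mind the stored set is a snapshot of the search at the moment GEOSEARCHSTORE ran, so it may need a TTL or a periodic refresh if the underlying geo data changes.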

CKAN - Why do I only get the first 10 results when the count is higher?

With the CKAN API query I get a count = 47 (that's correct) but only 10 results.
How do I get all (=47) results with the API query?
CKAN API Query:
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc
From the source (*note: for me the page loads very slowly, patience):
https://suche.transparenz.hamburg.de/dataset?q=hvv-fahrplandaten+gtfs&sort=score+desc%2Ctitle_sort+asc&esq_not_all_versions=true&limit=50&esq_not_all_versions=true
The count only shows the total number of results found. You can change the number of results returned by setting the rows parameter, e.g. https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc&rows=100. The rows limit is 1000 per query. You can find more info here
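Should a result set ever exceed what fits in one response, package_search also accepts a start offset that can be combined with rows to page through the results, e.g.:
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&rows=20&start=0
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&rows=20&start=20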

Is there any performance benefit when we use LIMIT?

for example
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue LIMIT 10
The only difference I can see between using and not using LIMIT 10 is the amount of data returned for display to the user.
The system still orders all the data first before applying the LIMIT.
The below applies to BigQuery.
It is not necessarily 100% technically correct, but it is close enough that I hope it gives you an idea of why LIMIT N is extremely important to consider in BigQuery.
Assume you have 1,000,000 rows of data and 8 workers to process a query like the one below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field
Round 1: To sort this data each worker gets 125,000 rows – so now you have 8 sorted sets of 125,000 rows each
Round 2: Worker #1 sends its sorted data (125,000 rows) to worker #2, #3 sends to #4, and so on. So now we have 4 workers, each producing an ordered set of 250,000 rows
Round 3: The above logic is repeated and now we have just 2 workers, each producing an ordered list of 500,000 rows
Round 4: And finally, just one worker produces the final ordered set of 1,000,000 rows
Of course, based on the number of rows and the number of available workers, the number of rounds can differ from the above example
In summary, what we have here is:
a. A huge amount of data being transferred between workers - this can be a major factor in performance going down
b. A chance that one of the workers will not be able to process the amount of data distributed to it. This can happen earlier or later and usually manifests as a "Resources exceeded ..." type of error
Now, if you have LIMIT as part of the query, as below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field LIMIT 10
Round 1 is going to be the same. But starting with Round 2, ONLY the top 10 rows will be sent to the next worker, so in each round after the first one only 20 rows get merged and only the top 10 get sent on for further processing.
Hopefully you can see how different these two processes are in terms of the volume of data sent between workers and how much work each worker needs to do to sort its share of the data.
To Summarize:
Without LIMIT 10:
• Initial rows moved (Round 1): 1,000,000
• Initial rows ordered (Round 1): 1,000,000
• Intermediate rows moved (Rounds 2-4): 1,500,000
• Overall merged ordered rows (Rounds 2-4): 1,500,000
• Final result: 1,000,000 rows
With LIMIT 10:
• Initial rows moved (Round 1): 1,000,000
• Initial rows ordered (Round 1): 1,000,000
• Intermediate rows moved (Rounds 2-4): 70
• Overall merged ordered rows (Rounds 2-4): 140
• Final result: 10 rows
Hopefully the above numbers clearly show the performance difference you gain by using LIMIT N, and in some cases even the ability to successfully run the query at all without a "Resources exceeded ..." error.
This answer assumes you are asking about the difference between the following two variants:
ORDER BY totalRevenue
ORDER BY totalRevenue LIMIT 10
In many databases, if a suitable index existed involving totalRevenue, the LIMIT query could stop sorting after finding the top 10 records.
In the absence of any index, as you pointed out, both versions would have to do a full sort, and therefore should perform the same.
Also, there is a potentially major performance difference between the two if the table is large. In the LIMIT version, BigQuery only has to send across 10 records, while in the non-LIMIT version potentially much more data has to be sent.
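As an illustration of the index point, in an engine that supports user-defined indexes (PostgreSQL syntax here; BigQuery itself does not use indexes this way), the "suitable index" might look like this:
-- With an index on totalRevenue, the planner can read the first 10
-- entries in order instead of sorting the whole table.
CREATE INDEX idx_companyperformance_totalrevenue
    ON companyperformance (totalRevenue);

SELECT company_ID, totalRevenue
FROM companyperformance
ORDER BY totalRevenue
LIMIT 10;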
There is no performance gain. BigQuery still has to go through all the records in the table.
You can partition your data in order to cut down the number of records that BigQuery has to read. That will improve performance. You can read more information here:
https://cloud.google.com/bigquery/docs/partitioned-tables
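For instance, a rough sketch assuming an ingestion-time partitioned copy of the table (the table name here is hypothetical), where the partition filter restricts the scan to a single day's partition:
-- Only the 2020-01-01 partition is scanned, instead of the whole table.
SELECT company_ID, totalRevenue
FROM `BQdataset.companyperformance_partitioned`
WHERE _PARTITIONDATE = DATE '2020-01-01'
ORDER BY totalRevenue
LIMIT 10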
See the difference in the query statistics in the BigQuery UI between the two queries below:
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 1000
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 10000
As you can see, BQ returns to the UI immediately after the LIMIT criterion is reached, which results in better performance and less traffic on the network.

T-SQL query for SQL Server 2008: how to query X # of rows where X is a value in a query while matching on another column

Summary:
I have a list of work items that I am attempting to assign to a list of workers. Each worker is allowed to have a max of only 100 work items assigned to them. Each work item specifies the user that should work it (associated as an owner).
For example:
Jim works a total of 5 accounts, each with multiple work items. In total, Jim already has 50 items assigned to him, so I am allowed to assign only 50 more.
My plight/goal:
I am using a temp table and a select statement to get the number of items each owner currently has assigned to them, and I calculate the available slots for new items and store the values in a new column. I need to be able to select from the items table where the owner matches my list of owners and their available slots (in the temp table), retrieving for each user only the number of rows equal to that user's available slots. The query would return only 50 rows for Jim even though there may be 200 matching the criteria, while Sam may get 0 rows because he has no available slots even though there are 30 items for him to work in the items table.
I realize I may be approaching this problem wrong. I want to avoid using a cursor.
Edit: Adding some example code
SELECT
    nUserID_Owner
    , CASE
        WHEN COUNT(c.nWorkID) >= 100 THEN 0
        ELSE 100 - COUNT(c.nWorkID)
      END
    , COUNT(c.nWorkID)
FROM tblAccounts cic
LEFT JOIN tblWorkItems c
    ON c.sAccountNumber = cic.sAccountNumber
    AND c.nUserID_WorkAssignedTo = cic.nUserID_Owner
    AND c.nTeamID_WorkAssignedTo = cic.nTeamID_Owner
WHERE cic.nUserID_Collector IS NOT NULL
    AND nUserID_CurrentOwner = 5288
    AND c.bCompleted = 0
GROUP BY nUserID_Owner
This provides output values of 5288, 50, 50 (in Jim's scenario).
It took longer than I wanted it to but I found a solution.
I did use a sub-query, as suggested above, to produce the work items with a unique row number per user.
I used ROW_NUMBER() with PARTITION BY to produce a unique row number for each worker's items and required in my HAVING clause that the row number be < the count of available slots. I'd post the code, but it's beyond the character limit and I'd also have a lot of things to change to anonymize the system properly.
Originally I was approaching the problem incorrectly, focusing on limiting the results rather than thinking about creating the necessary data to relate the result sets.
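In outline, the approach described above might look something like this (a sketch only; tblUnassignedItems and the exact column list are placeholders, not the real schema):
;WITH slots AS (
    -- available slots per owner, as in the aggregate query from the question
    SELECT cic.nUserID_Owner,
           CASE WHEN COUNT(c.nWorkID) >= 100 THEN 0
                ELSE 100 - COUNT(c.nWorkID)
           END AS nAvailableSlots
    FROM tblAccounts cic
    LEFT JOIN tblWorkItems c
        ON c.sAccountNumber = cic.sAccountNumber
        AND c.nUserID_WorkAssignedTo = cic.nUserID_Owner
    GROUP BY cic.nUserID_Owner
),
candidates AS (
    -- unassigned items, numbered per owner
    SELECT u.nWorkID,
           u.nUserID_Owner,
           ROW_NUMBER() OVER (PARTITION BY u.nUserID_Owner
                              ORDER BY u.nWorkID) AS nRowNum
    FROM tblUnassignedItems u
)
SELECT c.nWorkID, c.nUserID_Owner
FROM candidates c
JOIN slots s
    ON s.nUserID_Owner = c.nUserID_Owner
WHERE c.nRowNum <= s.nAvailableSlots;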

How to get the next page of Google CSE RESTful API results & confusion about the daily request limit?

I am using the Google CSE RESTful API, and my code to get results is:
Google.Apis.Customsearch.v1.CseResource.ListRequest listRequest = svc.Cse.List(query);
listRequest.Cx = cx;
Google.Apis.Customsearch.v1.Data.Search search = listRequest.Fetch();
foreach (Google.Apis.Customsearch.v1.Data.Result result in search.Items)
{
    // do something with items
}
It returns me 10 results out of a total of 100. To see the next 10 records I have to do:
listRequest.Start = 11;
search = listRequest.Fetch();
And now my search.Items has results 11-20.
Now I have 2 questions:
1. Is this the right way to get the results of the next page (the next 10 records)?
2. And in doing so, does it mean that I have consumed 2 requests out of the 100 allowed requests per day?
If this is correct, then effectively a user can only get a total of 1,000 results per day from the Google CSE API.
So it means that if I want to see all 100 results of my first query, I would have to make 10 requests.
Thanks,
Wasim
Yes, it's the right way: setting the start parameter to the next index will request the next page of results for your query.
You are also right on the second question: each request (paginated or not) counts against the maximum of 100 allowed per day, resulting in a maximum of 1,000 results per day.
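For completeness, a minimal paging sketch using the same objects as in the question (each Fetch() counts against the daily quota, so walking all 100 results consumes 10 of the 100 requests):
// Assumes svc, cx and query are set up as in the question.
var listRequest = svc.Cse.List(query);
listRequest.Cx = cx;

// The API exposes at most 100 results for a query; Start is 1-based.
for (int start = 1; start <= 91; start += 10)
{
    listRequest.Start = start;
    Google.Apis.Customsearch.v1.Data.Search search = listRequest.Fetch();

    if (search.Items == null)
        break; // fewer results than expected

    foreach (Google.Apis.Customsearch.v1.Data.Result result in search.Items)
    {
        // do something with each result
    }
}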