How to optimize this SQL query for a rectangular region? - sql

I'm trying to optimize the following query, but it's not clear to me what index or indexes would be best. I'm storing tiles in a two-dimensional plane and querying for rectangular regions of that plane. The table has, for the purposes of this question, the following columns:
id: a primary key integer
world_id: an integer foreign key which acts as a namespace for a subset of tiles
tileY: the Y-coordinate integer
tileX: the X-coordinate integer
value: the contents of this tile, a varchar if it matters.
I have the following indexes:
"ywot_tile_pkey" PRIMARY KEY, btree (id)
"ywot_tile_world_id_key" UNIQUE, btree (world_id, "tileY", "tileX")
"ywot_tile_world_id" btree (world_id)
And this is the query I'm trying to optimize:
ywot=> EXPLAIN ANALYZE SELECT * FROM "ywot_tile" WHERE ("world_id" = 27685 AND "tileY" <= 6 AND "tileX" <= 9 AND "tileX" >= -2 AND "tileY" >= -1 );
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ywot_tile (cost=11384.13..149421.27 rows=65989 width=168) (actual time=79.646..80.075 rows=96 loops=1)
Recheck Cond: ((world_id = 27685) AND ("tileY" <= 6) AND ("tileY" >= (-1)) AND ("tileX" <= 9) AND ("tileX" >= (-2)))
-> Bitmap Index Scan on ywot_tile_world_id_key (cost=0.00..11367.63 rows=65989 width=0) (actual time=79.615..79.615 rows=125 loops=1)
Index Cond: ((world_id = 27685) AND ("tileY" <= 6) AND ("tileY" >= (-1)) AND ("tileX" <= 9) AND ("tileX" >= (-2)))
Total runtime: 80.194 ms
So the world is fixed, and we are querying for a rectangular region of tiles. Some more information that might be relevant:
All the tiles for a queried region may or may not be present
The height and width of a queried rectangle are typically about 10x10-20x20
For any given (world, X) or (world, Y) pair, there may be an unbounded number of matching tiles, but the worst case is currently around 10,000, and typically there are far fewer.
New tiles are created far less frequently than existing ones are updated (changing the 'value'), and that itself is far less frequent than just reading as in the query above.
The only thing I can think of would be to index on (world, X) and (world, Y). My guess is that the database would be able to take those two sets and intersect them. The problem is that there is a potentially unbounded number of matches for either of those. Is there some other kind of index that would be more appropriate?

Cluster the table on "ywot_tile_world_id_key"; the primary key seems to be simply an artificial id. If you have more unique vertical values than horizontal ones, you might want to reverse the order (world-id, y, x). Also remove the lone index on world-id; it is duplicated by the compound index.

Use a GiST index for your X,Y, much the same as PostGIS does. As a matter of fact, you could even use the PostGIS extension for PostgreSQL and get quite a bit of bang for your buck.

Here's what I ended up doing. The queries are now ~20ms instead of 80, which is a decent improvement, though not amazing.
Loaded the btree_gist contrib module
Created the following index:
CREATE INDEX CONCURRENTLY ywot_tile_boxes ON ywot_tile USING gist (world_id, box(point("tileX","tileY"),point("tileX","tileY")));
Switched the queries to look like this:
SELECT * FROM "ywot_tile" WHERE world_id = 27685 AND box(point("tileX","tileY"),point("tileX","tileY")) && box(point(-2,-1),point(9,6));
Any further suggestions would be much appreciated.
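To see why the rewritten query returns the same tiles as the original range predicates, here is a minimal Python sketch. It assumes that the `&&` operator on boxes is an inclusive overlap test and that each tile is stored as a degenerate box whose two corners are the same point, as in the index above. The names `Box`, `overlaps`, `tile_box` and `in_range` are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box given by two opposite corners, like box(point, point)."""
    x1: int
    y1: int
    x2: int
    y2: int


def overlaps(a: Box, b: Box) -> bool:
    """Inclusive overlap test; assumed to mirror what && checks for boxes."""
    return (
        min(a.x1, a.x2) <= max(b.x1, b.x2)
        and min(b.x1, b.x2) <= max(a.x1, a.x2)
        and min(a.y1, a.y2) <= max(b.y1, b.y2)
        and min(b.y1, b.y2) <= max(a.y1, a.y2)
    )


def tile_box(x: int, y: int) -> Box:
    """A tile is a zero-area box: both corners are the tile's own point."""
    return Box(x, y, x, y)


def in_range(x: int, y: int) -> bool:
    """The original WHERE clause: -2 <= tileX <= 9 and -1 <= tileY <= 6."""
    return -2 <= x <= 9 and -1 <= y <= 6


query = Box(-2, -1, 9, 6)

# Compare both formulations on a grid that extends past the query rectangle.
mismatches = [
    (x, y)
    for x in range(-5, 13)
    for y in range(-4, 10)
    if overlaps(tile_box(x, y), query) != in_range(x, y)
]
print(mismatches)
```

Under these assumptions the two predicates agree on every tile, so the speedup comes from the GiST index handling both dimensions at once rather than from a change in the result set.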

Related

PostgreSQL - Need to extract lat/lon from geography data type column

This topic seems to have quite a bit of traction here, but I have only been able to get one example to work, and I'm not sure it will work for my use case. Please note I am new to PostgreSQL, so go easy. I've learned a ton from this and other sites, which is how I have been getting things done thus far.
The ultimate goal is to extract lat/lon information so it can be plotted on a Google Map. There is a column called "outline" that is a "geography" data type that contains the information needed to extract a lat/lon. Every example I have found here doesn't seem to work and from what I have read, it needs to be cast to geometry, which I have tried. Here is a really simple example I have tried just to test if it works:
SELECT i.site_id, ST_X(outline::geometry), ST_Y(outline::geometry) FROM images i
The result is "Argument to ST_X() must be a point".
For reference, here is an example of the data that is in this particular outline column:

The only example I have found that does seem to work is:
SELECT
i.customer_id,
i.captured_at,
i.name,
i.site_id,
i.outline,
ST_AsText(ST_Centroid(outline))
FROM images i
This does not error, and the result gives me this format:
POINT(-121.080244930964 36.2187349133648)
I am hoping for a bit of a push towards whatever the best solution may be to produce a lat/lon. The only problem with the output above is that I would need to take it and make a new column which trims POINT( and the ending ). I would also need to reverse the values, meaning the new column would need to result in 36.2187349133648, -121.080244930964. I also have no idea if the fact that it's a text field would hurt me in the end. For what it's worth, I use Google Data Studio for reporting and would use the Google Map control and have the lat/lon column feed the points. I have read that it requires the coordinates to be fed in the format shown above.
Sorry for such a long note, and I would appreciate any advice you may have. I am using PostgreSQL v10.
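For illustration, the text transformation described above (strip `POINT(` and `)`, then swap the two values) could be sketched on the client side as follows. This is only a sketch: the function name `wkt_point_to_lat_lon` is hypothetical, it assumes plain decimal coordinates without exponent notation, and extracting the numbers directly in SQL with ST_X/ST_Y avoids the string handling altogether.

```python
import re

# Matches "POINT(<lon> <lat>)" with plain decimal numbers (no exponent notation).
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def wkt_point_to_lat_lon(wkt: str) -> str:
    """Turn 'POINT(lon lat)' into 'lat, lon', keeping the digits as text."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    lon, lat = match.groups()
    # WKT stores x (longitude) first; the target format wants latitude first.
    return f"{lat}, {lon}"


print(wkt_point_to_lat_lon("POINT(-121.080244930964 36.2187349133648)"))
```

The coordinates are kept as strings on purpose, so no digits are lost to floating-point conversion.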
Although the accepted answer works and has its merits, it comes with some performance issues. Extracting the centroid in a subquery and then extracting the x and y values in an outer query means that you're reading the same data set twice - in a table full scan btw. Try to keep things simple and avoid subqueries like this whenever possible, as it might slow things down significantly. Here is an example of how to do it using LATERAL JOIN instead:
SELECT
gid as id,
ST_X(centroid) as lon,
ST_Y(centroid) as lat
FROM images
CROSS JOIN LATERAL ST_Centroid(outline::geometry) AS centroid
Demo: db<>fiddle
Table containing 1000 polygons
Using a LATERAL JOIN
EXPLAIN (ANALYSE,BUFFERS)
SELECT id,
ST_X(centroid) as lon,
ST_Y(centroid) as lat
FROM images
CROSS JOIN LATERAL ST_Centroid(outline::geometry) AS centroid;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.25..289.84 rows=6731 width=20) (actual time=0.113..6.517 rows=1000 loops=1)
Buffers: shared hit=53
-> Seq Scan on images (cost=0.00..120.31 rows=6731 width=36) (actual time=0.015..0.236 rows=1000 loops=1)
Buffers: shared hit=53
-> Function Scan on st_centroid centroid (cost=1.25..1.26 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1000)
Planning Time: 0.161 ms
Execution Time: 6.658 ms
(7 rows)
Using a subquery as proposed in the other answer:
EXPLAIN (ANALYSE,BUFFERS)
SELECT
i.id as id,
ST_X(i.centroid) as lon,
ST_Y(i.centroid) as lat
FROM (
SELECT
id,
ST_Centroid(outline)::geometry AS centroid
FROM images
) i;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on images (cost=0.00..2573.00 rows=1000 width=20) (actual time=0.750..127.119 rows=1000 loops=1)
Buffers: shared hit=53
Planning Time: 0.136 ms
Execution Time: 127.210 ms
(4 rows)
Note: Keep in mind that PostgreSQL 10 is reaching EOL in a few months. Consider upgrading your system asap.
Your outline is a multipolygon geography. ST_Centroid returns a point geography, which you'll need to cast to geometry and feed into st_x/st_y, so if you just need those values for lon and lat, you can try
SELECT
i.id as id,
ST_X(i.centroid) as lon,
ST_Y(i.centroid) as lat
FROM (
SELECT
id,
ST_Centroid(outline)::geometry AS centroid
FROM images
) i;
Just modify that to suit your needs.
NOTE: There are a few different ways to write this exact query (Check out WITH clauses; they're super useful in specific scenarios), but this is just the way that came to mind first for me.
EDIT: Definitely take a look at the other solution here too. Lateral joins on a function can be a bit less intuitive at first glance than subqueries but are clearly more optimized if you're running into performance issues.

Extract integer array from jsonb faster in Postgres 11+

I am designing a table that has a jsonb column realizing permissions with the following format:
[
{"role": 5, "perm": "view"},
{"role": 30, "perm": "edit"},
{"role": 52, "perm": "view"}
]
TL;DR
How do I convert such jsonb value into an SQL array of integer roles? In this example, it would be '{5,30,52}'::int[]. I have some solutions but none are fast enough. Keep reading...
Each logged-in user has some roles (one or more). The idea is to filter the records using the overlap operator (&&) on int[].
SELECT * FROM data WHERE extract_roles(access) && '{1,5,17}'::int[]
I am looking for the extract_roles function/expression that can also be used in the definition of an index:
CREATE INDEX data_roles ON data USING gin ((extract_roles(access)))
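To make the intended semantics concrete, here is a small Python sketch of what `extract_roles` and the `&&` overlap test are meant to compute. It only models the logic, not the SQL function or its performance; the Python helper `overlaps` is an illustrative name.

```python
import json
from typing import Iterable, List


def extract_roles(access_json: str) -> List[int]:
    """Collect the integer "role" of every permission entry, in document order."""
    return [int(entry["role"]) for entry in json.loads(access_json)]


def overlaps(a: Iterable[int], b: Iterable[int]) -> bool:
    """True if the two collections share at least one element (like && on int[])."""
    return not set(a).isdisjoint(b)


access = (
    '[{"role": 5, "perm": "view"},'
    ' {"role": 30, "perm": "edit"},'
    ' {"role": 52, "perm": "view"}]'
)

roles = extract_roles(access)
visible = overlaps(roles, [1, 5, 17])  # the user holds role 5, so the row matches
print(roles, visible)
```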
jsonb in Postgres seems to have broad support for building and transforming but less for extracting values - SQL arrays in this case.
What I tried:
create or replace function extract_roles(access jsonb) returns int[]
language sql
strict
parallel safe
immutable
-- with the following bodies:
-- (0) 629ms
select translate(jsonb_path_query_array(access, '$.role')::text, '[]', '{}')::int[]
-- (1) 890ms
select array_agg(r::int) from jsonb_path_query(access, '$.role') r
-- (2) 866ms
select array_agg((t ->> 'role')::int) from jsonb_array_elements(access) as x(t)
-- (3) 706ms
select f1 from jsonb_populate_record(row('{}'::int[]), jsonb_build_object('f1', jsonb_path_query_array(access, '$.role'))) as x (f1 int[])
When the index is used, the query is fast. But there are two problems with these expressions:
some of the functions are only stable and not immutable; this also applies to cast. Am I allowed to mark my function as immutable? The immutability is required by the index definition.
they are slow; the planner does not use the index in some scenarios, and then the query can become really slow (times above are on a table with 3M records):
explain (analyse)
select id, access
from data
where extract_roles(access) && '{-3,99}'::int[]
order by id
limit 100
with the following plan (same for all variants above; prefers scanning the index associated with the primary key, gets sorted results and hopes that it finds 100 of them soon):
Limit (cost=1000.45..2624.21 rows=100 width=247) (actual time=40.668..629.193 rows=100 loops=1)
-> Gather Merge (cost=1000.45..476565.03 rows=29288 width=247) (actual time=40.667..629.162 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using data_pkey on data (cost=0.43..472184.44 rows=12203 width=247) (actual time=25.522..513.463 rows=35 loops=3)
Filter: (extract_roles(access) && '{-3,99}'::integer[])
Rows Removed by Filter: 84918
Planning Time: 0.182 ms
Execution Time: 629.245 ms
Removing the LIMIT clause is paradoxically fast:
Gather Merge (cost=70570.65..73480.29 rows=24938 width=247) (actual time=63.263..75.710 rows=40094 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=69570.63..69601.80 rows=12469 width=247) (actual time=59.870..61.569 rows=13365 loops=3)
Sort Key: id
Sort Method: external merge Disk: 3744kB
Worker 0: Sort Method: external merge Disk: 3232kB
Worker 1: Sort Method: external merge Disk: 3160kB
-> Parallel Bitmap Heap Scan on data (cost=299.93..68722.36 rows=12469 width=247) (actual time=13.823..49.336 rows=13365 loops=3)
Recheck Cond: (extract_roles(access) && '{-3,99}'::integer[])
Heap Blocks: exact=9033
-> Bitmap Index Scan on data_roles (cost=0.00..292.44 rows=29926 width=0) (actual time=9.429..9.430 rows=40094 loops=1)
Index Cond: (extract_roles(access) && '{-3,99}'::integer[])
Planning Time: 0.234 ms
Execution Time: 77.719 ms
Is there any better and faster way to extract int[] from a jsonb? Because I cannot rely on the planner always using the index. Playing with COST of the extract_roles function helps a bit (planner starts using the index for LIMIT 1000) but even an insanely high value does not force the index for LIMIT 100.
Comments:
If there is not, I will probably store the information in another column roles int[], which is fast but takes extra space and requires extra treatment (can be solved using generated columns on Postgres 12+, which Azure still does not provide, or a trigger, or in the application logic).
Looking into the future, will there be any better support in Postgres 15? Maybe JSON_QUERY but I don’t see any immediate improvement because its RETURNING clause probably refers to the whole result and not its elements.
Maybe jsonb_populate_record could also consider non-composite types (its signature allows it) such as:
select jsonb_populate_record(null::int[], '[123,456]'::jsonb)
The two closest questions are:
Extract integer array from jsonb within postgres 9.6
Cast postgresql jsonb value as array of int and remove element from it
Reaction to suggested normalization:
Normalization is probably not viable. But let's follow the train of thought.
I assume that the extra table would look like this: *_perm (id, role, perm). There would be an index on id and another index on role.
Because a user has multiple roles, it could join multiple records for the same id, which would cause multiplication of the records in the data table and force a group by aggregation.
A group by is bad for performance because it prevents some optimizations. I am designing a building block. So there can be for example two data tables at play:
select pd.*, jsonb_agg(to_jsonb(pp))
from posts_data pd
join posts_perm pp on pd.id = pp.id
where exists(
select 1
from comments_data cd
join comments_perm cp on cp.id = cd.id
where cd.post_id = pd.id
and cd.reputation > 100
and cp.role in (3,34,52)
-- no group by needed due to semi-join
)
and pp.role in (3,34,52)
group by pd.id
order by pd.title
limit 10
If I am not mistaken, this query will require the aggregation of all records before they are sorted. No index can help here. That will never be fast with millions of records. Moreover, there is non-trivial logic behind group by usage - it is not always needed.
What if we did not need to return the permissions but only cared about its existence?
select pd.*
from posts_data pd
where exists(
select 1
from posts_perm pp
where pp.id = pd.id
and pp.role in (3,34,52)
)
and exists(
select 1
from comments_data cd
where cd.post_id = pd.id
and exists(
select 1
from comments_perm cp
where cp.id = cd.id
and cp.role in (3,34,52)
)
and cd.reputation > 100
)
order by pd.title
limit 10
Then we don't need any aggregation - the database will simply issue a SEMI-JOIN. If there is an index on title, the database may consider using it. We can even fetch the permissions in the projection; something like this:
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp where pd.id = pp.id) perm
...
Where a nested-loop join will be issued for only the few (10) records. I will test this approach.
Another option is to keep the data in both tables - the data table would only store an int[] of roles. Then we save a JOIN and only fetch from the permission table at the end. Now we need an index that supports array operations - GIN.
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp where pd.id = pp.id) perm
from posts_data pd
where pd.roles && '{3,34,52}'::int[]
and exists(
select 1
from comments_data cd
where cd.post_id = pd.id
and cd.roles && '{3,34,52}'::int[]
and cd.reputation > 100
)
order by pd.title
limit 10
Because we always aggregate all permissions for the returned records (their interpretation is in the application and does not matter that we return all of them), we can store the post_perms as a json. Because we never need to work with the values in SQL, storing them directly in the data table seems reasonable.
We will need to support some bulk-sharing operations later that update the permissions for many records, but that is much rarer than selects. Because of this we could favor jsonb instead.
The projection does not need the select of permissions anymore:
select pd.*
...
But now the roles column is redundant - we have the same information in the same table, just in JSON format. If we can write a function that extracts just the roles, we can directly index it.
And we are back at the beginning. But it looks like the extract_roles function is never going to be fast, so we need to keep roles column.
Another reason for keeping permissions in the same table is the possibility of combining multiple indices using Bitmap And and avoiding a join.
There will be a huge bias in the roles. Some are going to be present on almost all rows (admin can edit everything), others will be rare (John Doe can only access these 3 records that were explicitly shared with him). I am not sure how well statistics will work on the int[] approach but so far my tests show that the GIN index is used when the role is infrequent (high selectivity).
It looks like the core problem here is the classic one with WHERE...ORDER BY...LIMIT, that the planner assumes all of the qualifying rows are scattered evenly throughout the ordering. But that isn't the case here: rows meeting your && condition are selectively deficient in low-numbered "id". So it has to walk that index far farther than it thought it would need to before it catches the LIMIT.
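The size of the misestimate can be approximated from numbers in the question. The sketch below assumes the table holds roughly 3M rows (the question's own figure) and reads the scan counters in the LIMIT plan as per-worker averages over 3 loops; the function name is illustrative.

```python
TOTAL_ROWS = 3_000_000      # "3M records" from the question; approximate
ESTIMATED_MATCHES = 29_288  # planner's row estimate for the && filter
LIMIT = 100


def expected_rows_scanned(total_rows: int, matching_rows: int, limit: int) -> float:
    """Rows an ordered scan expects to read before it has `limit` matches,
    if the matching rows were spread evenly through the ordering."""
    return total_rows * limit / matching_rows


expected = expected_rows_scanned(TOTAL_ROWS, ESTIMATED_MATCHES, LIMIT)

# Parallel Index Scan node: rows=35 kept and 84918 removed per loop, loops=3.
actual = (35 + 84_918) * 3

ratio = actual / expected
print(round(expected), actual, round(ratio, 1))
```

Under these assumptions the scan read roughly 25 times more rows than an even distribution would require, which is consistent with the matching rows being rare among the low ids.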
There is nothing you can do (in any version) to get the planner to estimate that better. You could just prevent that index from being used by rewriting it to order by id+0. But then it wouldn't use that plan even when it would truly be faster, like the admin who is on everything. (Which by the way seems like a bad idea--an exceptional user should probably be handled exceptionally, not shoehorned into the normal system).
The immutable extraction function certainly is slow, but if the above planning problem were fixed that wouldn't matter. Making the function faster would probably require some compiled code, and Azure surely would not let you link the .so file into their managed server.
Because the JSON has a regular structure (int, text), I also considered two alternative storages:
create a composite type role of (int, text) and store the array role[]; extract_roles function is still needed;
store two arrays int[] and text[].
The latter one won for the following reasons:
smallest disk space (important for queries that require seq scan);
no need for extract_roles function - the int array is stored directly;
no need for functional index;
easy append (but same is true for JSON);
the library that I am using (jOOQ) has a good binding for arrays so working with them may even be more pleasant than with a JSON.
Disadvantages are:
harder remove - need to unnest and reaggregate.

Postgres preferring costly ST_Intersects() over cheap index

I'm executing a rather simple query on a full planet dump of OSM using Postgres 9.4. What I want to do is fetch all ways which belong to the A8 autobahn in Germany. In a preparation step, I've created multipolygons for all administrative boundary relations and stored them in the table polygons so I can do an easier spatial intersection test. To allow for fast query processing, I also created an index for the 'ref' hstore tags:
CREATE INDEX idx_ways_tags_ref ON planet_20141222.ways USING btree (lower(tags->'ref'));
Additionally, I have already obtained the id of the administrative boundary of Germany by a previous query (result id = 51477).
My db schema is the normal API 0.6 schema, the data was imported via the dump approach into Postgres (using the pgsnapshot_schema_0.6*.sql scripts which come with osmosis). VACUUM ANALYZE was also performed for all tables.
The problematic query looks like this:
SELECT DISTINCT wy.id FROM planet_20141222.ways wy, planet_20141222.polygons py
WHERE py.id=51477 AND ST_Intersects(py.geom,wy.linestring) AND ((wy.tags->'highway') is not null) AND (lower(wy.tags->'ref') like lower('A8'));
The runtime of this query is terrible because Postgres prefers the costly ST_Intersects() test over the cheap (and highly selective) index on 'ref'. When the intersection test is removed, the query returns within a few milliseconds.
What can I do so that Postgres first evaluates the parts of the query where an index exists instead of testing each way in the entire planet for an intersection with Germany?
My current solution is to split the SQL query in two separate queries. The first does the index-supported tag tests and the second does the spatial intersection test. I suppose that Postgres can do better, but how?
Edit:
a) the OSM 0.6 import scripts create the following indexes on the ways table:
CREATE INDEX idx_ways_bbox ON ways USING gist (bbox);
CREATE INDEX idx_ways_linestring ON ways USING gist (linestring);
b) Additionally, I created another index on polygons:
CREATE INDEX polygons_geom_tags on polygons using gist(geom, tags);
c) The EXPLAIN ANALYZE output of the query without ST_Intersects() looks like this:
"Index Scan using ways_tags_ref on ways (cost=0.57..4767.61 rows=1268 width=467) (actual time=0.064..0.267 rows=60 loops=1)"
" Index Cond: (lower((tags -> 'ref'::text)) = 'a8'::text)"
" Filter: (((tags -> 'highway'::text) IS NOT NULL) AND (lower((tags -> 'ref'::text)) ~~ 'a8'::text))"
" Rows Removed by Filter: 5"
"Total runtime: 0.300 ms"
The runtime of the query with ST_Intersects() is more than 15 minutes, so I cancelled it.
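Maybe try something like this? In Postgres 9.4 a CTE acts as an optimization fence, so the tag filter is evaluated first and ST_Intersects() is only run against the few ways that match: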
WITH wy AS (
SELECT * FROM planet_20141222.ways
WHERE ((tags->'highway') IS NOT null)
AND (lower(tags->'ref') LIKE lower('A8'))
)
SELECT DISTINCT wy.id
FROM wy, planet_20141222.polygons py
WHERE py.id=51477
AND ST_Intersects(py.geom,wy.linestring);

PostgreSql Select query performance issue

I have a simple select query:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id
Then I want to get the first 100 results, so I use this:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id LIMIT 100
The problem is that the second query works much slower than the first one. It takes less than a second to execute the first query and more than a minute to execute the second one.
These are execution plans for the queries:
without limit:
Sort (cost=26201.43..26231.42 rows=11994 width=72)
Sort Key: entity_id
-> Index Scan using entity_type_id_idx on entities (cost=0.00..24895.34 rows=11994 width=72)
Index Cond: (entity_type_id = 1)
with limit:
Limit (cost=0.00..8134.39 rows=100 width=72)
-> Index Scan using xpkentities on entities (cost=0.00..975638.85 rows=11994 width=72)
Filter: (entity_type_id = 1)
I don't understand why these two plans are so different and why the performance decreases so much. How should I tweak the second query to make it work faster?
I use PostgreSQL 9.2.
You want the 100 smallest entity_id's matching your condition. Now - if those were numbers 1..100 then clearly using the entity_id index is the best way to handle this - everything is pre-sorted. In fact, if the 100 you wanted were in the range 1..200 then it still makes sense. Probably 1..1000 would.
So - PostgreSQL thinks it will find lots of entity_type_id=1 values at the "start" of the table. It estimates a cost of 8134 vs 26231 to filter by type then sort. In your case it is wrong.
Now - either there is some correlation which isn't obvious (that's bad - we can't tell the planner about that at present), or we don't have up-to-date or sufficient stats.
Does an ANALYZE entities make any difference? You can see what values the planner knows about by reading the planner-stats page in the manuals.
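The 8134 figure can be reproduced from the plan itself. As a rough sketch, assuming the planner scales the cost of the input node linearly by the fraction of rows the LIMIT needs (the helper name `limited_cost` is made up for this example):

```python
def limited_cost(startup_cost: float, total_cost: float,
                 estimated_rows: int, limit: int) -> float:
    """Cost of fetching only `limit` rows from a node, assuming the matching
    rows are spread evenly over the node's output."""
    fraction = min(1.0, limit / estimated_rows)
    return startup_cost + (total_cost - startup_cost) * fraction


# Numbers from the "with limit" plan: Index Scan using xpkentities
# (cost=0.00..975638.85 rows=11994), LIMIT 100.
index_scan_with_limit = limited_cost(0.00, 975638.85, 11994, 100)

# Total cost of the "without limit" plan (filter by type, then sort).
sort_plan_total = 26231.42

print(round(index_scan_with_limit, 2), index_scan_with_limit < sort_plan_total)
```

The result matches the upper cost of the Limit node in the plan, which is why the primary-key scan looks cheaper on paper even though the matching rows are not actually near the start of the index.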

Postgres min function performance

I need the lowest value for runnerId.
This query:
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ;
takes 80 ms (1968 result rows).
This:
SELECT min("runnerId") FROM betlog WHERE "marketId" = '107416794' ;
takes 1600 ms.
Is there a faster way to find the minimum, or should I calculate the min in my Java program?
"Result (cost=100.88..100.89 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Limit (cost=0.00..100.88 rows=1 width=9)"
" -> Index Scan using runneridindex on betlog (cost=0.00..410066.33 rows=4065 width=9)"
" Index Cond: ("runnerId" IS NOT NULL)"
" Filter: ("marketId" = 107416794::bigint)"
CREATE INDEX marketidindex
ON betlog
USING btree
("marketId" COLLATE pg_catalog."default");
Another idea:
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId" LIMIT 1; -- >1600ms
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId"; -- >>100ms
How can a LIMIT slow the query down?
What you need is a multi-column index:
CREATE INDEX betlog_mult_idx ON betlog ("marketId", "runnerId");
If interested, you'll find in-depth information about multi-column indexes in PostgreSQL, links and benchmarks under this related question on dba.SE.
How did I figure?
In a multi-column index, rows are ordered (and thereby clustered) by the first column of the index ("marketId"), and each cluster is in turn ordered by the second column of the index - so the first row matches the condition min("runnerId"). This makes the index scan extremely fast.
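The ordering argument can be illustrated with a sorted list standing in for the index. This is only a conceptual sketch, not how a B-tree is implemented; the sample rows and the name `min_runner_id` are invented for the example.

```python
import bisect

# (marketId, runnerId) pairs in arbitrary table order; sample data only.
rows = [
    ("107416794", 42),
    ("107416794", 7),
    ("200000001", 3),
    ("107416794", 19),
    ("100000000", 99),
]

# A two-column index keeps entries sorted by marketId, then by runnerId.
index = sorted(rows)


def min_runner_id(index, market_id):
    """Jump to the first entry for market_id; it already holds the minimum."""
    # (market_id,) sorts before every (market_id, runner_id) pair.
    pos = bisect.bisect_left(index, (market_id,))
    if pos < len(index) and index[pos][0] == market_id:
        return index[pos][1]
    return None  # no rows for this market


print(min_runner_id(index, "107416794"))
```

One binary search lands on the answer, with no need to visit the other rows of the market or to scan entries belonging to other markets.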
Concerning the paradoxical effect of LIMIT slowing down a query: the Postgres query planner has a weakness there. The common workaround is to use a CTE (not necessary in this case). Find more information under this recent, closely related question:
PostgreSQL query taking too long
The min statement will be executed by PostgreSQL using a sequential scan of the entire table. You could optimize the query using the following approach:
SELECT col FROM sometable ORDER BY col ASC LIMIT 1;
When you had an index on ("runnerId") (or at least with "runnerId" as the high order column) but did not have the index on ("marketId", "runnerId") it compared the cost of passing all rows with a matching "marketId" using the index on that column and picking out the minimum "runnerId" from that set to the cost of scanning using the index on "runnerId" and stopping when it found the first row with a matching "marketId". Based on available statistics and the assumption that "marketId" values would be randomly distributed within the index entries for the index on "runnerId" it estimated a lower cost for the latter approach.
It also estimated the cost of scanning the whole table and picking the minimum from matching rows as well as probably a number of other alternatives. It does not always use a certain type of plan, but compares costs of all the alternatives.
The problem is that the assumption that values will be randomly distributed in the range is not necessarily true (as in this example), leading to a scan of a high percentage of the range to find the rows lurking at the end. For some values of "marketId", where the chosen value is available near the beginning of the "runnerId" index, this plan should be very fast.
There has been discussion in the PostgreSQL developer community of how we might bias against plans which are "risky" in terms of running long if the data distribution is not what was assumed, and there has been work on tracking multi-column statistics so that correlated values don't run into such problems. Expect improvements in this area in the next few releases. Until then, Erwin's suggestions are on target for how to work around the issue.
Basically it comes down to making a more attractive plan available or introducing an optimization barrier. In this case you can provide a more attractive option by adding the index on ("marketId", "runnerId") -- which allows a very direct way to get straight to the answer. The planner assigns a very low cost to that alternative, causing it to be chosen. If you preferred not to add the index, you could force an optimization barrier by doing something like this:
SELECT min("runnerId")
FROM (SELECT "runnerId" FROM betlog
WHERE "marketId" = '107416794'
OFFSET 0) x;
When there is an OFFSET clause (even for an offset of zero) it forces the subquery to be planned separately and its results fed to the outer query. I would expect this to run in 80 ms rather than the 1600 ms you get without the optimization barrier. Of course, if you can add the index, the speed of the query when data is cached should be less than 1 ms.