Why does this simple query not use the index in postgres? - sql

In my PostgreSQL database I have a table named "product". In this table I have a column named "date_touched" of type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column & index definitions):
Table "public.product"
Column | Type | Modifiers
---------------------------+--------------------------+-------------------
id | integer | not null default nextval('product_id_seq'::regclass)
date_touched | timestamp with time zone | not null
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)
The table has ~300,000 rows and I want to get the n-th element from the table ordered by "date_touched". When I want to get the 1,000th element, it takes 0.2s, but when I want to get the 100,000th element, it takes about 6s. My question is: why does it take so much time to retrieve the 100,000th element, even though I've defined a btree index?
Here is my query with explain analyze, which shows that PostgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:
First query (1,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Limit (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
-> Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
Total runtime: 160.395 ms
Second query (100,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
-> Sort (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
Sort Key: date_touched
Sort Method: external merge Disk: 8376kB
-> Seq Scan on product (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
Total runtime: 6629.903 ms

It is actually a good thing that a Seq Scan is used here. Your OFFSET 100000 is not a good fit for an Index Scan.
A bit of theory
Btree indexes contain two structures inside:
a balanced tree, and
a doubly-linked list of keys.
The first structure allows fast key lookups; the second is responsible for ordering. For bigger tables, the linked list cannot fit into a single page, so it becomes a list of linked pages, where each page's entries maintain the ordering specified during index creation.
It is wrong to think, though, that such pages sit together on the disk. In fact, it is more probable that they are spread across different locations. So, in order to read pages in the index's order, the system has to perform random disk reads. Random disk IO is expensive compared to sequential access, which is why a good optimizer will prefer a Seq Scan instead.
I highly recommend the book “SQL Performance Explained” to better understand indexes. It is also available on-line.
What is going on?
Your OFFSET clause causes the database to read the index's linked list of keys (causing lots of random disk reads) and then discard all those results until it hits the wanted offset. It is good, in fact, that Postgres decided to use Seq Scan + Sort here; this should be faster.
You can check this assumption by:
running EXPLAIN (analyze, buffers) of your big-OFFSET query,
then doing SET enable_seqscan TO 'off';
and running EXPLAIN (analyze, buffers) again, comparing the results (see the session sketch after this list).
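A minimal session sketch of that check, using the big-OFFSET query from the question (SET enable_seqscan only affects the current session):
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;

SET enable_seqscan TO 'off';  -- strongly discourage sequential scans

EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;

RESET enable_seqscan;  -- restore the planner default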
In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?)
Here's a comparison of how it performs for different offset values.
EDIT: In order to avoid OFFSET, one has to base pagination on real data that exists in the table and is part of the index. For this particular case, the following might be possible:
show first N (say, 20) elements
include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do the same for the “Previous” links, except include the minimal date_touched for those.
on the server side you will receive the limiting value. Then, for the “Next” case, you can run a query like this:
SELECT id
FROM product
WHERE date_touched > $max_date_seen_on_the_page
ORDER BY date_touched ASC
LIMIT 20;
This query makes best use of the index.
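For completeness, a sketch of the matching “Previous” query (the parameter name is made up; rows come back newest-first and need re-sorting ascending on the application side):
SELECT id
FROM product
WHERE date_touched < $min_date_seen_on_the_page
ORDER BY date_touched DESC
LIMIT 20;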
Of course, you can adjust this example to your needs. I used pagination as it is a typical case for the OFFSET.
One more note: querying one row many times, increasing the offset by 1 for each query, will be much more time consuming than doing a single batch query that returns all those records, which are then iterated over on the application side.
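A sketch of the batch alternative, using the same table:
-- one round trip instead of many single-row OFFSET queries;
-- iterate over the result set on the application side
SELECT id
FROM product
ORDER BY date_touched ASC
LIMIT 1000;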

Related

Can't improve SQL join speed with indexes

I'm totally new to SQL and I am trying to speed up join queries for very large data. I started adding indexes (but to be honest, I don't have a deep understanding of them) and, not seeing much change, I decided to benchmark on a simpler, simulated example. I'm using the psql interface of PostgreSQL 11.5 on macOS 10.14.6. The data server is hosted locally on my computer. I apologize for any lack of relevant information; this is my first time posting about SQL.
Databases' Structures
I created two initially identical databases, db and db_idx. I never put any index or key on tables in db, while I put indexes and keys on tables in db_idx. I then run simple join queries within db and db_idx separately and compare the performance. Specifically, db_idx is made of two tables:
A client table with 100,000 rows and the following structure:
Table "public.client"
Column | Type | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
client_id | integer | | not null |
client_name | text | | |
Indexes:
"pkey_c" PRIMARY KEY, btree (client_id)
A client_additional table with 70,000 rows and the following structure:
Table "public.client_additional"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
client_id | integer | | not null |
client_age | integer | | |
Indexes:
"pkey_ca" PRIMARY KEY, btree (client_id)
"cov_idx" btree (client_id, client_age)
The client_id column in the client_additional table contains a subset of client's client_id values. Note the primary keys, and the other index I created on client_additional. I thought these would increase the benchmark query speed (see below), but they did not.
Importantly the db database is exactly the same (same structure, same values) except that it has no index or key.
Side note: the client and client_additional tables should perhaps be a single table, since they give information at exactly the same level (client level). However, the database I'm using in real life came structured this way: some tables are split into several tables by "topic" even though they give information at the same level. I don't know if that matters for my issue.
Benchmark Query
I'm using the following query, which mimics a lot what I need to do with real data:
SELECT
client_additional.client_id,
client_additional.client_age,
client.client_name
FROM client
INNER JOIN client_additional
ON client.client_id = client_additional.client_id;
Benchmark Results
On both databases, the benchmark query takes about 630 ms. Removing the keys and/or indexes in db_idx does not change anything. These benchmark results carry over to larger data sizes: speed is identical in the indexed and non-indexed cases.
That's where I am. How do I explain these results? Can I improve the join speed and how?
Use the EXPLAIN command to see how the SQL engine intends to resolve the query. (Different SQL engines present this in different ways.) It will show you conclusively whether the index will be used.
Also, you'll first need to load the tables with a lot of test data, because EXPLAIN will tell you what the SQL engine intends to do right now, and this decision is based in part on the size of the table and various other statistics. If the table is virtually empty, the SQL engine might decide that the index wouldn't be helpful now.
SQL engines use all kinds of very clever tricks to optimize performance, so it's actually rather difficult to get a useful timing test. But, if EXPLAIN tells you that the index is being used, that's pretty much the answer that you're looking for.
Setting up a small test DB, adding some rows and running your query:
CREATE TABLE client
(
client_id integer PRIMARY KEY,
client_name text
);
CREATE TABLE client_additional
(
client_id integer PRIMARY KEY,
client_age integer
);
INSERT INTO client (client_id, client_name) VALUES (generate_series(1,100000),'Phil');
INSERT INTO client_additional (client_id, client_age) VALUES (generate_series(1,70000),21);
ANALYZE;
EXPLAIN ANALYZE SELECT
client_additional.client_id,
client_additional.client_age,
client.client_name
FROM
client
INNER JOIN
client_additional
ON
client.client_id = client_additional.client_id;
gave me this plan:
Hash Join  (cost=1885.00..3590.51 rows=70000 width=11) (actual time=158.958..441.222 rows=70000 loops=1)
  Hash Cond: (client.client_id = client_additional.client_id)
  ->  Seq Scan on client  (cost=0.00..1443.00 rows=100000 width=7) (actual time=0.019..100.318 rows=100000 loops=1)
  ->  Hash  (cost=1010.00..1010.00 rows=70000 width=8) (actual time=158.785..158.786 rows=70000 loops=1)
        Buckets: 131072  Batches: 1  Memory Usage: 3759kB
        ->  Seq Scan on client_additional  (cost=0.00..1010.00 rows=70000 width=8) (actual time=0.016..76.507 rows=70000 loops=1)
Planning Time: 0.357 ms
Execution Time: 506.739 ms
What you can see from this is that both tables were sequentially scanned, the values from each table were hashed, and a hash join was done. Postgres determined this was the optimal way to execute this query.
If you recreate the tables without the primary keys (thereby removing the implicit index on the PK column of each), you get exactly the same plan: Postgres has determined that the quickest way to execute this query is to ignore the indexes, hash each table's values, and do a hash join on the two sets of hashed values.
After changing the number of rows in the client table like so:
TRUNCATE Client;
INSERT INTO client (client_id, client_name) VALUES (generate_series(1,200000),'phil');
ANALYZE;
Then I re-ran the same query and I see this plan instead:
Merge Join  (cost=1.04..5388.45 rows=70000 width=13) (actual time=0.050..415.503 rows=70000 loops=1)
  Merge Cond: (client.client_id = client_additional.client_id)
  ->  Index Scan using client_pkey on client  (cost=0.42..6289.42 rows=200000 width=9) (actual time=0.022..86.897 rows=70001 loops=1)
  ->  Index Scan using client_additional_pkey on client_additional  (cost=0.29..2139.29 rows=70000 width=8) (actual time=0.016..86.818 rows=70000 loops=1)
Planning Time: 0.517 ms
Execution Time: 484.264 ms
Here you can see that index scans were done, as Postgres has determined that this plan is a better one based on the current number of rows in the tables.
The point is that Postgres will use the indexes when it feels they will produce a faster result, but the thresholds before they are used are somewhat higher than you may have expected.
You have a primary key on the two tables which will be used for the joins. If you want to really see the queries slow down, remove the primary keys.
What is happening? Well, my guess is that the execution plans are the same with or without the secondary indexes. You would need to look at the plans themselves.
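For instance, a sketch of that experiment, using the constraint names from the schema above:
ALTER TABLE client DROP CONSTRAINT pkey_c;
ALTER TABLE client_additional DROP CONSTRAINT pkey_ca;
-- now re-run the EXPLAIN ANALYZE from above and compare the plans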
Unlike most other databases, Postgres gets less benefit from covering indexes, because row visibility information is stored only in the data pages. So the data pages typically need to be accessed anyway.

Sequential scan for column indexed with varchar_pattern_ops

I have a table users and it contains a location column. I have indexed the location column using varchar_pattern_ops. But the query planner tells me it is doing a sequential scan:
EXPLAIN ANALYZE
SELECT * FROM USERS
WHERE lower(location) like '%nepa%'
ORDER BY location desc;
It gives following result:
Sort (cost=12.41..12.42 rows=1 width=451) (actual time=0.084..0.087 rows=8 loops=1)
Sort Key: location
Sort Method: quicksort Memory: 27kB
-> Seq Scan on users (cost=0.00..12.40 rows=1 width=451) (actual time=0.029..0.051 rows=8 loops=1)
Filter: (lower((location)::text) ~~ '%nepa%'::text)
Planning time: 0.211 ms
Execution time: 0.147 ms
I have searched through Stack Overflow. Most answers are along the lines of "Postgres performs a sequential scan on a large table when an index scan would be slower". But my table is not big either.
The index in my users table is:
"index_users_on_lower_location_varchar_pattern_ops" btree (lower(location::text) varchar_pattern_ops)
What is going on?
*_pattern_ops indexes are good for prefix matching: LIKE patterns anchored to the start, without a leading wildcard. But not for your predicate:
WHERE lower(location) like '%nepa%'
I suggest you create a trigram index instead. And you do not need lower() in the index (or the query), since trigram indexes support case-insensitive ILIKE (or ~*) at practically the same cost.
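A minimal sketch of that approach (the index name is made up; requires the pg_trgm extension):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX users_location_trgm_idx
ON users
USING gin (location gin_trgm_ops);

-- case-insensitive, and can use the trigram index even with a leading wildcard:
SELECT * FROM users
WHERE location ILIKE '%nepa%'
ORDER BY location DESC;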
Follow instructions here:
PostgreSQL LIKE query performance variations
Also:
But my table is not big either.
You seem to have that backwards. If your table is not big, it may be faster for Postgres to just read it sequentially and not bother with the index. You would not create any indexes for this at all. The tipping point depends on many factors.
Aside: your index definition does not make sense to begin with:
(lower(location::text) varchar_pattern_ops)
For a varchar column, use the varchar_pattern_ops operator class.
But if you cast to text, use text_pattern_ops. Since lower() returns text even for varchar input, use text_pattern_ops here. Except that you probably do not need this (or any?) index at all, as advised.
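If a left-anchored prefix index were wanted after all, a sketch (the index name is made up):
CREATE INDEX users_lower_location_idx
ON users (lower(location) text_pattern_ops);

-- usable by left-anchored patterns such as:
SELECT * FROM users WHERE lower(location) LIKE 'nepa%';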

Index doesn't improve performance

I have a simple table structure in my postgres database:
CREATE TABLE device
(
id bigint NOT NULL,
version bigint NOT NULL,
device_id character varying(255),
date_created timestamp without time zone,
last_updated timestamp without time zone,
CONSTRAINT device_pkey PRIMARY KEY (id )
)
I often query data based on the device_id column. The table has 3.5 million rows, which leads to performance issues:
"Seq Scan on device (cost=0.00..71792.70 rows=109 width=8) (actual time=352.725..353.445 rows=2 loops=1)"
" Filter: ((device_id)::text = '352184052470420'::text)"
"Total runtime: 353.463 ms"
Hence I've created index on device_id column:
CREATE INDEX device_device_id_idx
ON device
USING btree
(device_id );
However my problem is, that database still uses sequential scan, not index scan. The query plan after creating the index is the same:
"Seq Scan on device (cost=0.00..71786.33 rows=109 width=8) (actual time=347.133..347.508 rows=2 loops=1)"
" Filter: ((device_id)::text = '352184052470420'::text)"
"Total runtime: 347.538 ms"
The query returns 2 rows, so I'm not selecting a big portion of the table. I don't really understand why the index is disregarded. What can I do to improve the performance?
edit:
My query:
select id from device where device_id ='357560051102491A';
I've run analyse on the device table, which didn't help.
device_id also contains characters, not only digits.
You may need to look at the queries. To use an index, the queries need to be sargable. That means certain ways of constructing queries are better than others. I am not familiar with Postgres, but in SQL Server this would include things such as (a very small sample of the bad constructs):
Not doing data transformations in the join; instead, store the data properly.
Not using correlated subqueries; use derived tables or temp tables instead.
Not using OR conditions; use UNION ALL instead.
Your first step should be to get a good book on performance tuning for your specific database. It will talk about which constructions to avoid for your particular database engine.
Indexes are not used when you cast a column to a different type:
((device_id)::text = '352184052470420'::text)
Instead, you can write it this way:
(device_id = ('352184052470420'::character varying))
(or maybe you can change device_id to TEXT in the original table, if you wish.)
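Spelled out as a full query, a sketch using the value from the plan above:
SELECT id
FROM device
WHERE device_id = '352184052470420'::character varying;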
Also, remember to run analyze device after index has been created, or it will not be used anyway.
It seems like time resolved everything. I'm not sure what happened, but it's currently working fine.
Since the time I posted this question I didn't change anything, and now I get this query plan:
"Bitmap Heap Scan on device (cost=5.49..426.77 rows=110 width=166)"
" Recheck Cond: ((device_id)::text = '357560051102491'::text)"
" -> Bitmap Index Scan on device_device_id_idx (cost=0.00..5.46 rows=110 width=0)"
" Index Cond: ((device_id)::text = '357560051102491'::text)"
Time breakdown (timezone GMT+2):
~15:50 I've created the index
~16:00 I've dropped and recreated the index several times, since it was not working
16:05 I've run analyse device (didn't help)
16:44:49 from app server request_log, I can see that the requests executing the query are still taking around 500 ms
16:56:59 I can see first request, which takes 23 ms (the index started to work!)
The question remains: why did it take around 1 hour 10 minutes for the index to be applied? When I created indexes in the same database a few days ago, the changes were immediate.

PostgreSQL does not use a partial index

I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added a partial index to try to filter out the null values, but that did not help (with or without text_pattern_ops; I do not need text_pattern_ops, considering no LIKE conditions appear in my queries, but it also matches equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequential scans using set enable_seqscan = off; still makes the planner pick the seq scan over an index scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
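For example, a sketch with a prepared statement (the statement name is made up, and the table and column names are the question's placeholders):
PREPARE find_by_text(text) AS
SELECT col1, col2
FROM table
WHERE text_col = $1
AND text_col IS NOT NULL;  -- constant predicate: lets the planner match the partial index

EXECUTE find_by_text('my_value');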
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper, and the query planner will more readily choose such plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE condition matches. Thus an index with WHERE text_col IS NOT NULL can only be used if your SELECT repeats that condition. A collation mismatch could also cause harm.
Try the following:
Make the simplest possible btree index: CREATE INDEX foo ON table (text_col)
ANALYZE table
Query
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt in the documentation.
Correlation
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a < operator.)
On my local box the correlation number is 0.97, and on production it was 0.05. Thus the planner estimates that it is cheaper to go through all those rows sequentially instead of looking up the index each time and diving into random access on disk blocks. This is the query I used to peek at the correlation number:
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If an update stores a large value in a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering were slowly moving apart with each update. To fix that, I executed the following queries:
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is to give each disk block a 20% buffer and to VACUUM FULL the table to reclaim lost space and restore physical and logical order. After I did this, the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor five NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and vacuuming fully helped, but I'm unsure why.

SQL indexes for "not equal" searches

An SQL index makes it fast to find a string which matches my query. Now, I have to search a big table for the strings which do not match. Of course, a normal index does not help and I have to do a slow sequential scan:
essais=> \d phone_idx
Index "public.phone_idx"
Column | Type
--------+------
phone | text
btree, for table "public.phonespersons"
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using phone_idx on phonespersons (cost=0.00..8.41 rows=1 width=4)
Index Cond: (phone = '+33 1234567'::text)
(2 rows)
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on phonespersons (cost=0.00..18621.00 rows=999999 width=4)
Filter: (phone <> '+33 1234567'::text)
(2 rows)
I understand (see Mark Byers' very good explanations) that PostgreSQL can decide not to use an index when it sees that a sequential scan would be faster (for instance if almost all the tuples match). But here, "not equal" searches are really slower.
Any way to make these "is not equal to" searches faster?
Here is another example, to address Mark Byers' excellent remarks. The index is used for the '=' query (which returns the vast majority of tuples) but not for the '!=' query:
essais=> \d tld_idx
Index "public.tld_idx"
Column | Type
-----------------+------
pg_expression_1 | text
btree, for table "public.emailspersons"
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tld_idx on emailspersons (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
Index Cond: (tld(email) = 'fr'::text)
Total runtime: 444.800 ms
(3 rows)
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on emailspersons (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
Filter: (tld(email) <> 'fr'::text)
Total runtime: 1037.278 ms
(3 rows)
DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).
Possibly it would help to write:
SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
or simply
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
OR phone < '+33 1234567'
PostgreSQL should be able to determine that the selectivity of the range operation is very high and consider using an index for it.
I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could rewrite the not-equals as above (when that helps) during planning. If it works, suggest it to the developers ;)
Rationale: searching an index for all values not equal to a given one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards; similarly, searching for all elements greater than a certain key scans in the opposite direction. These operations are easy to fulfil using b-tree structures.
Also, the statistics that PostgreSQL collects should point out that '+33 1234567' is a known frequent value. By subtracting the frequency of that value and of nulls from 1, we get the proportion of rows left to select; the histogram bounds will indicate whether those are skewed to one side or not. If excluding nulls and that frequent value pushes the proportion of remaining rows low enough (I seem to recall about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportion was actually calculated.
Update: I tried this on a local table with a vaguely similar distribution, and both forms above produced something other than a plain seq scan. The latter (using OR) was a bitmap scan that may in fact devolve to just a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an internal "Append(Index Scan, Index Scan)". Turning enable_bitmapscan off just made it revert to a seq scan.
PS: indexing a text column and using the inequality operators can be an issue if your database locale is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%' predicates.
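A sketch of such an extra index (the index name is made up; phone is a text column per the \d output above):
CREATE INDEX phone_pattern_idx
ON PhonesPersons (phone text_pattern_ops);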
Alternative: you could create a partial index:
CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'
this will make the <>-using select statement just scan through that partial index: since it excludes most of the entries in the table, it should be small.
The database is able to use the index for this query, but it chooses not to because it would be slower. Update: This is not quite right; you have to rewrite the query slightly. See Araqnid's answer.
Your WHERE clause selects almost all rows in your table (rows=999999). The database can see that a table scan would be faster in this case and therefore ignores the index. The scan is faster because the column person is not in your index, so the index route would need two lookups for each row: once in the index to check the WHERE clause, and again in the main table to fetch the person column.
If you had different data, where most values were foo and just a few were bar, and you said WHERE col <> 'foo', then it probably would use the index.
Any way to make these "is not equal to" searches faster?
Any query that selects almost 1 million rows is going to be slow. Try adding a LIMIT clause.
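For instance, a sketch:
SELECT person
FROM PhonesPersons
WHERE phone <> '+33 1234567'
LIMIT 100;  -- cap the number of rows returned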