TITLE: K-Nearest-Neighbor Indexing

SUBHEADER: KNN indexes provide an innovative method to avoid expensive table scans. They enhance PostgreSQL's query capabilities by using mathematical “distance” for indexing and searches. These indexes can be used to improve common text searches, text-similarity searches, geospatial location comparisons and other queries. Text search indexes can now be configured to provide indexing support for LIKE '%string%' queries without changing any SQL. PostgreSQL is among the first database systems to have KNN.


TEXT:

GiST indexes can now be used to return sorted rows, if a 'distance' has a meaning and can be defined for the data type. For now, this work has been done for the point datatype, the pg_trgm contrib, and many btree_gist datatypes. This feature is available for all datatypes to use, so there will probably be more in the near future.

Here is an example with pg_trgm, pg_trgm uses trigrams to compare strings. Here are the trigrams for the 'hello' string:

SELECT show_trgm('hello');
            show_trgm            
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}

Trigrams are used to evaluate similarity (between 0 and 1) between strings. So there is a notion of distance, with distance defined as '1-similarity'.

For our example, we need the pg_trgm extension and a indexed table. The table contains 5 million text records, for 750MB.

CREATE TABLE test_trgm (text_data text);
CREATE INDEX test_trgm_idx ON test_trgm USING gist (text_data extensions.gist_trgm_ops);

Until 9.0, if we wanted the two closest text_data to hello from the table, here was the query:

SELECT text_data, similarity(text_data, 'hello')
FROM test_trgm 
WHERE text_data % 'hello'
ORDER BY similarity(text_data, 'hello')
LIMIT 2;

On the test database, it takes around 2 seconds to complete.


SUBHEADER: You can now create index on “distance” for faster location and text-search queries


With 9.1 and KNN, one can write:

SELECT text_data, text_data <-> 'hello'
FROM test_trgm 
ORDER BY text_data <-> 'hello'
LIMIT 2;

The ↔ operator is the distance operator available in the btree_gist extension. It runs in 20ms, using the index to directly retrieve the 2 best records.

While we're talking about pg_trgm and KNN, another new feature is that the LIKE and ILIKE operators can now automatically make use of a trgm index. Still using the same table:

SELECT text_data
FROM test_trgm
WHERE text_data LIKE '%hello%';

This query will use the test_trgm_idx index instead of scanning the whole table.