The pg_trgm extension
Improve Postgres text searches with the pg_trgm extension
The pg_trgm
extension enhances Postgres' ability to perform text searches by using trigram matching. Trigrams are groups of three consecutive characters taken from a string. By breaking down text into trigrams, Postgres can perform more efficient and flexible searches, such as similarity and proximity searches.
This extension is particularly useful for applications requiring fuzzy string matching or searching within large bodies of text.
In this guide, we'll explore the pg_trgm
extension, covering how to enable it, use it for text searches, and optimize queries. This extension has applications in data retrieval, text analysis, and anywhere robust text search capabilities are needed.
note
The pg_trgm
extension is open-source and can be installed on any Postgres setup. Detailed information about the extension is available in the PostgreSQL Documentation.
Version availability
Please refer to the list of all extensions available in Neon for up-to-date information.
Currently, Neon uses version 1.6
of the pg_trgm
extension for all Postgres versions.
pg_trgm
extension
Enable the Activate pg_trgm
by running the CREATE EXTENSION
statement in your Postgres client:
For information about using the Neon SQL Editor, see Query with Neon's SQL Editor. For information about using the psql
client with Neon, see Connect with psql.
Example usage
Let's say you're developing a database of books and you want to find books with similar titles. We first create a test table and insert some sample data, using the query below.
Basic string matching
The pg_trgm
extension can help you do fuzzy matches on strings.
For example, the query below looks for titles that are similar to the misspelled phrase "Grate Expectation". The %
operator, provided by pg_trgm
, measures similarity between two strings based on trigrams, and returns results if the similarity is above a certain threshold.
This query returns the following:
The similarity threshold can be adjusted by setting the pg_trgm.similarity_threshold
parameter (default value is 0.3
).
Trigrams
Counting trigrams
The pg_trgm
module makes these assumptions about how to count trigrams in a text string:
- Only alphanumeric characters are considered.
- The string is lowercased before counting trigrams.
- Each word is assumed to be prefixed with two spaces and suffixed with one space.
- The set of trigrams output is deduplicated.
We can use the show_trgm
function to see how pg_trgm
counts trigrams in a string. Here is an example:
Computing similarity
Given the set of trigrams for two strings A
and B
, pg_trgm
computes the similarity score as the size of the intersection of the two sets divided by the size of the union of the two sets.
Here is an example.
This query returns the following:
There are 7 distinct trigrams across the two input strings and 1 trigram in common. So the similarity score comes out to be 1/7 (0.14285715).
Advanced text searching
pg_trgm
offers powerful tools for more complex text search requirements.
Proximity search
The similarity
function provided by pg_trgm
, returns a number between 0 and 1, representing how similar the two strings are. By filtering on the similarity score, you can search for strings that are within the specified threshold.
This query returns the following:
Substring matching
pg_trgm
also provides functionality to match the input text value against substrings within the target string. The query below illustrates this:
This query returns the following:
The word_similarity
function returns the maximum similarity score between the input string and any substring of the target string. The similarity score is still computed using trigrams. In this example, the first string apple
matches with the substring apple
in the target.
In contrast, the strict_word_similarity
function only considers a subset of substrings from the target, namely only sequences of full words in the target string. That is, the first string apple
matches the substring apples
in the target, hence the lower score.
Distance scores
There are operators to calculate the distance
between two strings, i.e., one minus the similarity score.
This query returns the following:
Similarly, there are operators to compute the distance based on the word_similarity
and strict_word_similarity
functions.
Performance considerations
While pg_trgm
enhances text search capabilities, computing similarity can get expensive when matching against a large set of strings. Here are a couple of tips to improve performance:
-
Indexing: Using
pg_trgm
, you can create aGiST
orGIN
index to speed up similarity search queries. This also helps regular expression-based searches, such as withLIKE
andILIKE
operators. -
Limiting results: Use
LIMIT
to restrict the number of rows returned for more efficient querying.
Conclusion
pg_trgm
offers a versatile set of tools for text processing and searching in Postgres. We went over the basics of the extension, including how to enable it and how to use it for fuzzy string matching and proximity searches.
Resources
Need help?
Join our Discord Server to ask questions or see what others are doing with Neon. Users on paid plans can open a support ticket from the console. For more detail, see Getting Support.