I had to build a system that needed full-text search. Two choices immediately came to mind: Postgres and Elasticsearch.
Postgres
I usually work with Postgres on a day-to-day basis but have not explored full-text search.
To do the full-text search in Postgres, you need to do the following:
- create a column with the tsvector type
- populate that column from the search target column
- create a GIN index on that column
The code looks something like this:

ALTER TABLE articles ADD COLUMN document_tokens tsvector;

UPDATE articles a
SET document_tokens = to_tsvector('english', a.title);

CREATE INDEX clever_stuff ON articles USING gin(document_tokens);
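One caveat with the snippet above: the UPDATE is a one-off backfill, so document_tokens will go stale as rows change. On Postgres 12 and later you can let the database maintain the column itself with a generated column instead (a sketch, assuming the search target is the title column as above):

```sql
-- Postgres 12+: document_tokens is recomputed automatically on INSERT/UPDATE,
-- so no backfill or trigger is needed to keep it in sync with title
ALTER TABLE articles
  ADD COLUMN document_tokens tsvector
  GENERATED ALWAYS AS (to_tsvector('english', title)) STORED;

CREATE INDEX clever_stuff ON articles USING gin(document_tokens);
```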
The search query looks something like this:

SELECT id,
       ts_rank(document_tokens, to_tsquery('english', 'synergize & integrated')) AS rank
FROM articles
WHERE document_tokens @@ to_tsquery('english', 'synergize & integrated')
ORDER BY rank DESC
LIMIT 10;
The results were pretty satisfactory: on a database of 350k records, a search takes around 30 ms.
Advantage:
It returns relevant results even for past and future tenses of a word, since the tsvector stores each word's root form.
Disadvantage:
It doesn't understand the context of the search: if the words are jumbled, or only a similar form of the sentence is present, it fails to match.
Elasticsearch
To index documents in Elasticsearch, well, you need to create an index.
While indexing, Elasticsearch analyzes the text through the following stages:
- character filters
- tokenizers
- token filters
There's an in-depth article about these concepts here
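To see those stages in action, Elasticsearch exposes an _analyze API. For example, you can run the built-in english analyzer over a sample sentence (the sample text here is just an illustration):

```json
POST /_analyze
{
  "analyzer": "english",
  "text": "Synergized integrations"
}
```

The response lists each token the analyzer would index after filtering and stemming, which makes it easy to debug why a document does or doesn't match.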
For my use case of indexing documents containing special characters, the ability to customize all of the above made Elasticsearch the winner.
With character filters, I was able to replace the special characters with custom words; otherwise, these special characters would never make it into the index.
With token filters, you can specify whether synonyms should be added to individual terms at index time, whether words should be stored in lowercase, and so on.
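Putting those pieces together, the index settings looked roughly like this. The specific character mappings and synonym list below are illustrative, not my actual configuration:

```json
PUT /articles
{
  "settings": {
    "analysis": {
      "char_filter": {
        "special_chars": {
          "type": "mapping",
          "mappings": ["& => and", "% => percent"]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["integrated, unified"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["special_chars"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```

The custom analyzer chains the three stages in order: character filters run first, then the tokenizer, then the token filters.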
The ability to customize helped me produce a better system with much better relevance.
The results are also snappier than Postgres, though that's a micro-optimization: all searches complete in under 10 ms.
But the real reason we are going ahead with Elasticsearch is that it understands context better and returns higher-quality results.
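For comparison with the Postgres query earlier, the equivalent Elasticsearch search is a simple match query; the index and field names follow the examples above:

```json
GET /articles/_search
{
  "size": 10,
  "query": {
    "match": { "title": "synergize integrated" }
  }
}
```

A match query runs the query text through the same analyzer as the field, so lowercasing, stemming, and synonyms are applied to the search terms too, and results come back ranked by relevance score.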