12h ago

Looking for advice: adding a search feature able to match different phrasings of the same question

I want to build a small site which acts as a broad, searchable FAQ for a certain topic.

Consider I have the FAQ:

What is the approximate mass of Earth?
It's 5.9722 × 10^24 kilograms, wow!

I want the user to have a chance at finding this FAQ by asking How heavy is our planet

Looking at this basically, the two similar questions have only one shared word, "is", which is an extremely common word. So using something really simple like word comparison or even stemming/lemmatization alone won't help.

On the very other end of the spectrum, a search engine's AI feature can interpret this effectively, rephrase the question and give a similar answer. So, what strategies are are in-between these two extremes?

A few people will be adding questions to the site regularly.
If possible, no external services, just self-hosting on an affordable server.
Simpler and lighter solutions are preferred.

Are any of the features in OpenSearch (ElasticSearch/Lucene fork) able to do this? Is it overkill?

Since the site will have new questions to match regularly, will a solution require the repeated, wasteful retraining of NLP models to to create weights? Or is training so efficient for small-scale text datasets that it's responsible and reasonable to do on a cheap low-end server?

edit: Just spitballing here, I could try a solution which does the bulk work at insert-time rather than runtime, by asking a general pre-trained language model to rephrase the question many different ways, or generate keywords, then use those responses to generate tags for a basic keyword search to match. This would avoid making a heavy search function or retraining any model on the server.

Example result:

txt

    
GPT-4o mini

Here’s a list of synonyms for the keywords in "What is the approximate mass of Earth?" formatted as an array of strings:

json

[
  "weight",
  "heaviness",
  "bulk",
  "load",
  "volume",
  "estimated",
  "rough",
  "approximal",
  "near",
  "close to",
  "planet Earth",
  "the globe",
  "the world",
  "Terra",
  "our planet"
]