Swift: What is the best way to quickly search through a huge database to find relevant results?

I’m trying to implement a search algorithm that can search through hundreds of thousands of products and display the most relevant results. My current process is:

  1. Get the user’s input and filter out prepositions and punctuation to arrive at keywords

  2. Break the keywords into an array

  3. For each keyword, find all the products whose descriptions contain the keyword and add each product to a RawProductDictionary.

  4. Calculate the Levenshtein distance between the keywords and each product description.

  5. Create an array of products based on the Levenshtein distance.
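A minimal sketch of steps 1–2 in Swift (the stop-word list and the function name are illustrative assumptions, not part of my actual code):

```swift
import Foundation

// Assumed stop-word list; a real one would be much larger.
let stopWords: Set<String> = ["A", "AN", "THE", "OF", "IN", "ON", "FOR", "TO", "AND"]

// Normalize the query, split on anything that is not alphanumeric,
// and drop empty tokens and stop words to arrive at keywords.
func extractKeywords(from query: String) -> [String] {
    let tokens = query.uppercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
    return tokens.filter { !$0.isEmpty && !stopWords.contains($0) }
}

let keywords = extractKeywords(from: "Butter for the pancakes!")
// keywords == ["BUTTER", "PANCAKES"]
```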

This question builds on top of this question:

Swift: How can the dictionary values be arranged based on each item's Levenshtein Distance number

This is my Levenshtein distance function:

    func levenshteinDist(test: String, key: String) -> Int {
        let empty = Array<Int>(repeating: 0, count: key.count)
        var last = [Int](0...key.count)

        for (i, testLetter) in test.enumerated() {
            var cur = [i + 1] + empty
            for (j, keyLetter) in key.enumerated() {
                cur[j + 1] = testLetter == keyLetter ? last[j] : min(last[j], last[j + 1], cur[j]) + 1
            }
            last = cur
        }
        return last.last!
    }

This is the function that implements step 5:

    func getProductData() {
        Global.displayProductArry = []
        var pIndexVsLevNum = [String: Int]()

        for product0 in Global.RawSearchDict {
            let generatedString = product0.value.name.uppercased()
            let productIndex = product0.key
            let relevanceNum = levenshteinDist(test: generatedString, key: self.userWordSearch)
            pIndexVsLevNum[productIndex] = relevanceNum
        }
        print(pIndexVsLevNum)

        for (k, v) in pIndexVsLevNum.sorted(by: { $0.value < $1.value }) {
            print("\(k): \(v)")
            Global.displayProductArry.append(Global.RawSearchDict[k]!)
        }
    }

The code works, but the products are not that relevant to the user input.

  • The Levenshtein distance is not always indicative of relevance. Products with shorter descriptions are usually disadvantaged and missed.

What is the best way to implement searching through hundreds of thousands of products quickly in Swift?

Asked on August 31, 2020 in Swift.
1 Answer

According to Wikipedia:

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

You should be using Levenshtein distance to compare individual words with each other, not entire product descriptions with a single word. The reason to compare individual words is to determine whether the user has made a typo and actually meant to type something else. Hence the first part of your problem is to clean up the user's query.

  • First check for perfect matches against your keyword database.
  • For words which do not match perfectly, run Levenshtein distance to create a list of the most closely matching words.
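As a sketch, the Levenshtein function from the question can be reused to correct each query word against a keyword vocabulary. The vocabulary, the tolerance, and the helper names here are illustrative assumptions:

```swift
import Foundation

// Same algorithm as the question's levenshteinDist, included so the
// sketch is self-contained.
func levenshtein(_ a: String, _ b: String) -> Int {
    let empty = [Int](repeating: 0, count: b.count)
    var last = [Int](0...b.count)
    for (i, ca) in a.enumerated() {
        var cur = [i + 1] + empty
        for (j, cb) in b.enumerated() {
            cur[j + 1] = ca == cb ? last[j] : min(last[j], last[j + 1], cur[j]) + 1
        }
        last = cur
    }
    return last.last!
}

// Perfect matches pass through unchanged; otherwise pick the vocabulary
// word with the smallest edit distance, within an assumed tolerance.
func correct(_ word: String, vocabulary: Set<String>, maxDistance: Int = 2) -> String? {
    if vocabulary.contains(word) { return word }   // perfect match
    let best = vocabulary
        .map { ($0, levenshtein(word, $0)) }
        .min { $0.1 < $1.1 }
    guard let (candidate, distance) = best, distance <= maxDistance else { return nil }
    return candidate
}

let vocab: Set<String> = ["CHOCOLATE", "BUTTER", "PANCAKES"]
let corrected = correct("BUTER", vocabulary: vocab)
// corrected == "BUTTER" (distance 1)
```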

Let's step back for a moment and look at the big picture: simply using Levenshtein distance by itself is not the best way to determine the most relevant product by comparing with the entire product description, since a product description is normally much larger than a user's query and describes a variety of features. Let us assume that the words are correctly spelled and forget spell-checking for a moment so we can focus on relevancy.

You will have to use a combination of techniques to determine which is the most relevant document to display:

  • First, create a tf-idf database to determine how important each word is in a product description. Words like and, is, and the are very common and usually will not help you determine which document is most relevant to a user's query.
  • The longer a product description is, the more often a word is likely to occur. This is why we need to compute inverse document frequency: to determine how rare a word is across the entire database of documents.
  • By creating a tf-idf database, you can rank the most important words in a product description, as well as determine how common a word is across all documents. This will help you assign weights to the value of each word.

A high weight in tf-idf is reached by a high term frequency in a given document, and a low document frequency of the term in the whole collection of documents.

Hence, for each word in a query, you must compute a relevancy score for every document in your product-description database. This should ideally be done in advance, so that you can quickly retrieve results. There are multiple variants of TF-IDF, so pick one and compute it for every unique word in your documents.
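A minimal sketch of precomputing such a table in Swift, assuming the product descriptions are already tokenized into keywords and using the common tf × log(N/df) weighting (one of several TF-IDF variants):

```swift
import Foundation

// Build one TF-IDF dictionary per document, assuming `documents` is an
// array of already-tokenized product descriptions.
func tfIdf(for documents: [[String]]) -> [[String: Double]] {
    let n = Double(documents.count)

    // Document frequency: in how many documents each term appears.
    var df = [String: Double]()
    for doc in documents {
        for term in Set(doc) { df[term, default: 0] += 1 }
    }

    return documents.map { doc in
        // Raw term counts within this document.
        var tf = [String: Double]()
        for term in doc { tf[term, default: 0] += 1 }

        let length = Double(doc.count)
        var scores = [String: Double]()
        for (term, count) in tf {
            // Length-normalized term frequency times inverse document frequency.
            scores[term] = (count / length) * log(n / df[term]!)
        }
        return scores
    }
}

let docs = [["CHOCOLATE", "PANCAKES"], ["BUTTER"], ["CHOCOLATE", "BUTTER", "BUTTER"]]
let table = tfIdf(for: docs)
// "PANCAKES" appears in only 1 of 3 documents, so it gets the highest IDF.
```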

Now how will you use TF-IDF to produce relevant results?

Here is an example:

Query: "Chocolate Butter Pancakes"

You should have already computed the TF, and IDFs for each of the three words in the query. A simplistic formula for computing relevance is:

Simplistic Product Description Score: TF-IDF(Chocolate) + TF-IDF(Butter) + TF-IDF(Pancakes) 

Compute the Product Description score for every single product description (for the words in the query), and sort the results from highest score to lowest score to get the most relevant result.
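The scoring-and-sorting step above can be sketched like this, where the TF-IDF table and product IDs are illustrative inputs assumed to have been precomputed:

```swift
// Score every product for the query by summing the TF-IDF weights of the
// query words, then sort from highest to lowest score.
func rankProducts(query: [String],
                  productIDs: [String],
                  tfIdfTable: [[String: Double]]) -> [String] {
    let scored = zip(productIDs, tfIdfTable).map { (id, weights) -> (String, Double) in
        // Simplistic score: words missing from a description contribute 0.
        let score = query.reduce(0.0) { $0 + (weights[$1] ?? 0) }
        return (id, score)
    }
    return scored.sorted { $0.1 > $1.1 }.map { $0.0 }
}

let ranked = rankProducts(
    query: ["CHOCOLATE", "BUTTER", "PANCAKES"],
    productIDs: ["p1", "p2"],
    tfIdfTable: [["CHOCOLATE": 0.9, "BUTTER": 0.4],
                 ["FLOUR": 0.7]])
// ranked == ["p1", "p2"] since p1 scores 1.3 and p2 scores 0
```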

The above example is a very simple explanation of how to compute relevancy, since the question you asked is actually a huge topic. To improve relevancy, you would have to do several additional things:

  • Apply stemming, lemmatization, and other text-normalization techniques before computing TF-IDF of your product descriptions.
  • Likewise, you may need to do the same for your search queries.

As you can imagine, the above algorithm to provide sorted relevant results would perform poorly if you have a large database of product descriptions. To improve performance, you may have to do a number of things:

  • Cache the results of previous queries. If new products are not added / removed, and the product descriptions do not change often, then it becomes much easier.
  • If descriptions change, or products are added or removed, you need to compute TF-IDF for the entire database again to get the most relevant results. You will also need to trash your previous cache and cache new results instead. This means you would have to periodically recompute TF-IDF for your entire database, depending on how often it is updated.
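A minimal sketch of such a cache, under the assumption that the whole cache is simply discarded whenever the catalog changes and TF-IDF is recomputed (the class and method names are illustrative):

```swift
// Caches ranked product IDs per normalized query string.
final class SearchCache {
    private var results = [String: [String]]()

    func lookup(_ query: String) -> [String]? {
        results[query.uppercased()]
    }

    func store(_ query: String, _ productIDs: [String]) {
        results[query.uppercased()] = productIDs
    }

    // Call whenever products are added, removed, or re-described,
    // after TF-IDF has been recomputed.
    func invalidate() {
        results.removeAll()
    }
}

let cache = SearchCache()
cache.store("Butter", ["p1", "p7"])
// cache.lookup("BUTTER") now returns ["p1", "p7"] until invalidate() is called
```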

As you can see, even this simple example is already getting complicated to implement, and we haven't even started talking about more advanced techniques in natural language processing, even something as simple as handling synonyms in a document.

Hence this question is simply too broad for anyone to provide a complete answer on Stack Overflow.

Rather than implement a solution yourself, I would recommend searching for a ready-made solution and incorporating it into your project instead. Search is a common feature nowadays, and there are many solutions available for different platforms. Perhaps you could offload search to a web service, so you are not limited to Swift and can use a ready-made solution like Solr, Lucene, or Elasticsearch.

Answered on August 31, 2020.
