NGrams and Edge NGrams – computational complexity

In my current project, we do a bunch of work revolving around Elasticsearch and enabling our customers to quickly access the relevant portions of our large data set. A couple of weeks ago I was asked to come up with a method to compare the costs of working with NGrams and Edge NGrams. I tried to make my life easier and look around the Internet for somebody else’s breakdown but I didn’t find anything I would like. So I decided to bite the bullet and do the work myself. In this post and the follow up one, I would like to present my way of reasoning about NGrams and Edge NGrams.

I won’t bother with the basic of what an NGram or Edge NGram is. If you need to familiarize yourself with these terms, please check out the official documentation for their respective tokenizers. Going forward, basic level of familiarity with Elasticsearch or the concepts it is built on is expected. There are many cases when the choice of one over the other is directly dictated by the use-case the team is trying to achieve. However, even if you know exactly which way to go and why, it makes sense to reason about how your choice impacts performance and storage requirements. In this post, we will look at the impact on computational and time complexity.

Computational and time complexity

When I am talking about computational complexity, what I focus on is just simply how many NGrams (of varying sizes) will be created and subsequently processed should we decide to go with one of these approaches. This directly impacts not only the number of operations required, but also the overall processing time. Good way to reason about this is in terms of how many extra tokens do we get when the length of a word grows by 1 character (marginal change). Or the scalability of each approach if you will.

For the purposes of these estimations, we will be interested only in NGrams of size 2 and up. By simulating the work of a tokenizer on a group of words varying in length, I have arrived at two functions that describe just that. Let’s drop some math.

Edge NGrams

Edge NGrams are simple. For a 5 characters long word such as apple, you get 4 NGrams, namely: ap, app, appl, apple. The same can be done going backwards from the end of the word.

The following function describes a change in the number of NGrams when the original word grows by one character:

f(x): y = x - 1; x > 1; x \in Z

Note: X axis represents the length of the word being processed / Y axis represents the overall number of NGrams being processed

NGrams

NGrams, on the other hand, require more processing (especially as the size of the words processed grows). Using NGrams breakdown on a five character word as apple renders 4 bi-grams (ap, pp, pl, le), 3 tri-grams (app, ppl, ple) and so on.

The following function describes a change in the number of NGrams when the original word grows by one character:

f(x): y = \frac{(x - 1)^2 + (x - 1)}{2}; x > 1; x \in Z

Note: X axis represents the length of the word being processed / Y axis represents the overall number of NGrams being processed

How do they compare?

It is apparent that when we make the switch from Edge NGrams to NGrams things do escalate quickly. There are certainly many cases when you won’t really bother even considering this impact, however when you start dealing with couple of terabytes worth of data to be ingested, there is no way you can afford to simply ignore it. This blog post focuses on time and computational complexity so in case of continuous data delivery, the different approach renders sizable difference in how soon can you make the most recent data available to your customers.

Note: X axis represents the length of the word being processed / Y axis represents the overall number of NGrams being processed

To illustrate this point further, consider the following table showing the difference in the concrete cases with number of NGrams that need to be processed.

Length of ingested word # of Edge NGrams processed # of NGrams processed
5 4 10
10 9 45
15 14 105
20 19 190
25 24 300
30 29 435
40 39 780
50 49 1 225

Conclusion

Understanding both, your data and the features you are making available to your customers, is crucial. It not only makes your live as a developer easier, but also the team discussions and the support of the applications that are deployed much easier. Depending on the SLAs and the architecture employed within your product, correct choice will impact how soon and how expensive it will be to keep those vending marts up-to-date, how big your data ingesting clusters need to be and so on. But this is only one side of the coin. In the next blog post titled NGrams and Edge NGrams – space complexity, we will take a look at the data size and storage implications. Stay tuned 😉

If you enjoyed this post, then make sure you subscribe to my Newsletter Newsletter and/or Feed Feed

Tagged with: , ,

Leave a Reply