NGrams and Edge NGrams – space complexity

In my previous post, NGrams and Edge NGrams – computational complexity, I explored the runtime implications of working with both NGrams and Edge NGrams. This is a follow-up post that builds on the previous one, so if you are not familiar with these terms, I suggest reading it first. In this post, I would like to take a look at the space requirements of NGrams and Edge NGrams and how to estimate the storage needed for the data intended for your indices.

Space complexity

While the previous post answered the question of how much work needs to be carried out to process ingested data, this post will try to answer the question of how much storage you will need to hold said data. Before we dive in, it is important to keep in mind that the structure of your data will hugely affect the resulting sizes. It is not only about the size of your data set, but also about many other aspects, such as the variability of the data, the similarity between words, and even the character set/encoding.

Once again, I am interested in the scalability of these two approaches and in their marginal rate of change. For the purposes of these estimations, we will only consider NGrams of size 2 and up. By simulating the work of a tokenizer on a group of words of varying length, I have arrived at two functions that describe just that. In this case, however, things get a bit more complicated, and proper parametrization (the way we apply the formula to a given data set) will influence how accurate this model gets. The issue stems from how Elasticsearch stores characters in memory. It uses UTF-8 encoding, which means a single character can take anywhere from 1 to 4 bytes. For this reason, you may want to look more closely at how these formulas apply to your specific data set to make the estimates more precise.
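As an illustration of how one might estimate the per-character byte width s for a concrete data set (this sketch is my own and not part of the original estimation; the sample_words list is purely hypothetical), you can encode a sample of your data as UTF-8 and divide the total byte count by the total character count:

    # Sketch: estimate the average UTF-8 bytes per character (the parameter s)
    # by encoding a sample of your own data. sample_words is a placeholder.
    sample_words = ["search", "sûreté", "日本語"]  # hypothetical sample; use your own corpus

    total_bytes = sum(len(word.encode("utf-8")) for word in sample_words)
    total_chars = sum(len(word) for word in sample_words)

    avg_bytes_per_char = total_bytes / total_chars
    print(f"estimated s = {avg_bytes_per_char:.2f} bytes per character")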

Edge NGrams

Once again, Edge NGrams are simpler when it comes to describing their properties. Let’s look at the size required to store all Edge NGrams (of varying sizes) created by ingesting a word of length x, summed over each group of Edge NGrams of size n, given that it takes s bytes to store a single character in Elasticsearch.

The following function describes how the size required to store all Edge NGrams changes as the original word grows by one character:

s \in [1, 4];\ 2 \leq n \leq x;\ x, n, s \in \mathbb{Z}
f(x) = \sum_{n=2}^{x} s \cdot n
Note: the X axis represents the length of the word being processed; the Y axis represents the amount of memory in bytes required to store all Edge NGrams
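To make the formula more tangible, here is a small sketch of my own (not taken from the original estimation) that enumerates the Edge NGrams of sizes 2 up to the word length and sums their UTF-8 sizes; for a purely ASCII word (s = 1) it matches the formula above:

    # Sketch: total bytes needed to store all Edge NGrams (sizes 2..len(word))
    # of a single word, assuming each Edge NGram is stored as UTF-8.
    def edge_ngram_bytes(word: str) -> int:
        total = 0
        for n in range(2, len(word) + 1):
            # the Edge NGram of size n is simply the first n characters
            total += len(word[:n].encode("utf-8"))
        return total

    print(edge_ngram_bytes("hello"))  # 14 bytes for a 5-character ASCII word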

NGrams

Not only do NGrams require more processing in terms of compute, they also require more space to store. This is to be expected given the nature of this type of processing.

The following function describes how the size required to store all NGrams changes as the original word grows by one character:

s \in [1, 4];\ 2 \leq n \leq x;\ x, n, s \in \mathbb{Z}
f(x) = \sum_{n=2}^{x} s \cdot n \cdot (x + 1 - n)
Note: the X axis represents the length of the word being processed; the Y axis represents the amount of memory in bytes required to store all NGrams
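Again as a sketch of my own rather than anything from the original post, enumerating every NGram of size 2 and up and summing their UTF-8 sizes gives the same result as the formula, since a word of length x contains x + 1 - n NGrams of size n:

    # Sketch: total bytes needed to store all NGrams (sizes 2..len(word))
    # of a single word, assuming each NGram is stored as UTF-8.
    def ngram_bytes(word: str) -> int:
        total = 0
        for n in range(2, len(word) + 1):
            for start in range(len(word) - n + 1):  # x + 1 - n NGrams of size n
                total += len(word[start:start + n].encode("utf-8"))
        return total

    print(ngram_bytes("hello"))  # 30 bytes for a 5-character ASCII word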

How do they compare?

It is apparent that both of these functions exhibit rather steep polynomial growth given the parameters set out in the previous paragraphs. However, it should also be obvious at this point that the growth rates differ significantly between the two cases (quadratic for Edge NGrams, cubic for NGrams). While the distinction might not seem that important from an asymptotic point of view, these growth rates do have a significant impact on the hardware requirements of the cluster as well as on its recommended settings. Keeping them in mind will help reduce the cost of running and maintaining your application.

Note: the X axis represents the length of the word being processed; the Y axis represents the amount of memory in bytes required to store all NGrams
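For completeness, here is a quick derivation of my own (not part of the original post): both sums have simple closed forms, which makes the difference in growth explicit.

\sum_{n=2}^{x} s \cdot n = s \left( \frac{x(x+1)}{2} - 1 \right) \in O(x^2)
\sum_{n=2}^{x} s \cdot n \cdot (x+1-n) = s \left( \frac{x(x+1)(x+2)}{6} - x \right) \in O(x^3)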

To illustrate this point further, consider the following table showing, for concrete word lengths, the number of bytes required to store all Edge NGrams and all NGrams.

Length of ingested word | Size of all Edge NGrams in bytes (s = 1) | Size of all NGrams in bytes (s = 1)
5                       | 14                                       | 30
10                      | 54                                       | 210
15                      | 119                                      | 665
20                      | 209                                      | 1,520
25                      | 324                                      | 2,900
30                      | 464                                      | 4,930
40                      | 819                                      | 11,440
50                      | 1,274                                    | 22,050
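The numbers above can be reproduced with a short script like the following (a sketch of my own that simply evaluates both sums with s = 1):

    # Sketch: reproduce the comparison table by evaluating both sums with s = 1.
    s = 1
    for x in (5, 10, 15, 20, 25, 30, 40, 50):
        edge_total = sum(s * n for n in range(2, x + 1))
        ngram_total = sum(s * n * (x + 1 - n) for n in range(2, x + 1))
        print(f"{x:>3}  Edge NGrams: {edge_total:>6}  NGrams: {ngram_total:>6}")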

Conclusion

As expected, NGrams tend to be more expensive than Edge NGrams, given the generalizations I made when modeling this experiment. The real-world data and conditions you run into will no doubt differ and touch on edge cases not covered here. I believe, however, that it is worth taking the time to think about the storage and computational costs when considering which approach to take and why. Hopefully, these blog posts inspired you to do just that.
