In my previous post, NGrams and Edge NGrams – computational complexity, I explored the runtime implications of working with both NGrams and Edge NGrams. This is a follow-up that builds on the previous post, so if you are not familiar with these terms, I suggest reading it first. In this post, I would like to look at the space requirements of NGrams and Edge NGrams and at how to estimate the storage needed for the data intended for your indices.
While the previous post answered the question of how much work needs to be carried out to process ingested data, this post will try to answer how much storage you will need to store that data. Before we dive in, it is important to keep in mind that the structure of your data will hugely impact the resulting sizes. It is not only about the size of your data set, but also about many other aspects, such as the variability of the data, the similarity between words, and even the character set/encoding.
Once again, I will be interested in the scalability of these two approaches and in their marginal rates of change. For the purposes of these estimations, we will only consider NGrams of size 2 and up. By simulating the work of a tokenizer on a group of words of varying length, I arrived at two functions that describe just that. In this case, however, things get a bit more complicated, and proper parametrization (the way we apply the formula to a given data set) is going to influence how accurate this model gets. The issue stems from how Elasticsearch stores characters: it uses UTF-8 encoding, which means a single character can occupy anywhere from 1 to 4 bytes. For this reason, you might want to research how these formulas apply to your specific data set to gain more precision in the estimates.
Once again, Edge NGrams are the simpler of the two to describe. Let’s look at the size required to store all NGrams (of varying sizes) created by ingesting a word of length x, summed over each group of NGrams of size n, given that it takes s bytes to store a single character in Elasticsearch.
The following function describes a change in the reference size of Edge NGrams when the original word grows by one character:
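A formula consistent with this description, reconstructed from first principles and matching the table further below (for x = 50 and s = 1 it yields 1 274 bytes), would be:

```latex
\Delta_{\text{edge}}(x) = s\,(x + 1),
\qquad
S_{\text{edge}}(x) = \sum_{n=2}^{x} s\,n
  = s\left(\frac{x(x+1)}{2} - 1\right)
```

Intuitively, growing a word of length x by one character adds exactly one new Edge NGram (the new, longest prefix), costing s(x + 1) bytes, so the cumulative size S grows quadratically in the word length.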
Not only do NGrams require more processing in terms of compute spent, but they also require more space to be stored. This is to be expected given the nature of this type of processing.
The following function describes a change in the reference size of NGrams when the original word grows by one character:
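Reconstructed in the same spirit, and matching the table further below (22 050 bytes for x = 50 and s = 1), the NGram counterpart would be:

```latex
\Delta_{\text{ngram}}(x) = s \sum_{n=2}^{x+1} n
  = s\left(\frac{(x+1)(x+2)}{2} - 1\right),
\qquad
S_{\text{ngram}}(x) = s \sum_{n=2}^{x} n\,(x - n + 1)
  = s\left(\frac{x(x+1)(x+2)}{6} - x\right)
```

Here, growing the word by one character adds one new gram of every size from 2 to x + 1 (all the grams ending at the new character), so the marginal cost itself is quadratic and the cumulative size S grows cubically.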
How do they compare?
It is apparent that both of these functions exhibit polynomial growth given the parameters set out in the previous paragraphs, with the NGram function growing a full polynomial degree faster than the Edge NGram one. While that distinction might not seem dramatic from the asymptotic notation point of view, these growth rates do have a significant impact on the hardware requirements of the cluster as well as on its recommended settings. Keeping them in mind will help reduce the cost of running and maintaining your application.
To illustrate this point further, consider the following table showing the storage sizes, in bytes, for a concrete case.
| Length of ingested word | Size of Edge NGrams processed in bytes (with s = 1) | Size of NGrams processed in bytes (with s = 1) |
|---|---|---|
| 50 | 1 274 | 22 050 |
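The table values can be reproduced with a short Python simulation of the two tokenizers. This is a sketch of the model used in this post (grams of size 2 up to the full word length, counted with multiplicity), not Elasticsearch's actual implementation:

```python
def edge_ngram_bytes(word: str) -> int:
    """Bytes to store all edge n-grams (prefixes) of sizes 2..len(word)."""
    return sum(len(word[:n].encode("utf-8")) for n in range(2, len(word) + 1))

def ngram_bytes(word: str) -> int:
    """Bytes to store all n-grams of sizes 2..len(word), at every offset."""
    total = 0
    for n in range(2, len(word) + 1):
        for i in range(len(word) - n + 1):
            total += len(word[i:i + n].encode("utf-8"))
    return total

word = "a" * 50  # 50 ASCII characters, so s = 1 byte per character
print(edge_ngram_bytes(word))  # 1274, matching the table
print(ngram_bytes(word))       # 22050, matching the table
```

Because the byte counting goes through `str.encode("utf-8")`, feeding the same functions words from your own data set (including multi-byte characters) gives estimates parametrized for your actual s.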
As expected, NGrams tend to be more expensive than Edge NGrams given the generalizations I made when modeling this experiment. The real-world data and conditions you run into will no doubt differ and touch on edge cases not covered here. I believe, however, that it is worth taking the time to think about the spatial and computational costs when considering which approach to take and why. Hopefully, these blog posts have inspired you to do just that.