Summer Research Recap: Analyzing the Information Density of Various Tokenizations for the Optimization of Natural Language Processing Models

Shivam Syal
10 min read · Aug 31, 2021
Example of NLP tokenizations. Photo by Mehul Gupta.

This summer, I participated in an internship at Stanford University through the STEM to SHTEM program. Under the guidance of Dr. Dmitri Pavlichin, my group (Riya Bhatia, Angela Yuan, Komal Keesara, Tawshia Choudhary) and I conducted a research project focused on improving NLP model tokenization for educational applications that help intellectually disabled people. The paper will be posted on Stanford’s website soon, and I’ll link it here when it is published. In the meantime, here’s a quick recap of the project: our methods, results, conclusions, and future work. Here’s the GitHub repository with our code for this paper.

Current Issues

Machine Translation

  • Current methods of machine translation are quite slow and inaccurate
  • When it comes to word-for-word accuracy from one language to another, a machine won’t deliver a high percentage of accurate translations

Education

  • Only 27.8% of people with a disability obtain a college-level education
  • Numerous studies across Europe have shown that NLP applications significantly improve the learning skills of intellectually disabled students

Cost

  • The cost to train NLP models is extremely high
  • With a token-to-word ratio of roughly 1.4, the average cost is ~$0.04–0.08 per 1,000 tokens

Methods

N-Gram Analysis for Genomes

We wanted to test how tokenization works on the most basic of texts, and genomes were a great starting point. In our analysis, we treated the n-grams of a genome as its tokens. Our goal was to determine whether different genomes differ in their repetitiveness, measured by the number of distinct tokens.

To analyze genome repeatability, we compared the number of distinct n-grams (substrings of length n) to the length of the sequence scanned. We began by examining six genomes, including Aaosphaeria arxii (Eukaryota), Abiotrophia defectiva (Bacteria), Abditibacterium utsteinense (Bacteria), SARS-CoV-2, and Homo sapiens chromosome 21 (open reading frame 91). After downloading these as FASTA files, we removed the line breaks and header descriptions and normalized the capitalization so that we could analyze the base pairs alone.

The first algorithm we created plots these two values on the x-axis and y-axis, respectively. Some of the downloaded genomes, such as Aaosphaeria arxii, had large file sizes when imported into the Jupyter Notebook, so we examined both full and partial genome lengths in base pairs. We then modified the initial algorithm to display the data more distinctly: the new algorithm returns the fraction of non-unique 8-grams. With NumPy and Matplotlib, we created a superimposed plot of the six genomes showing the increasing repetition found in the 8-grams of each genome sequence. We mainly analyzed the first 90,000 base pairs to focus on the sections of the sequence where the slope of the plot changes the most.
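As a concrete illustration of this second algorithm, here is a minimal Python sketch, not our exact notebook code: it strips FASTA headers, line breaks, and case, then records the fraction of non-unique 8-grams as more of the sequence is scanned. The file name is hypothetical.

```python
# A minimal sketch (illustrative, not the exact notebook implementation).
import matplotlib.pyplot as plt

def read_fasta(path):
    """Concatenate the sequence lines of a FASTA file, dropping header lines,
    line breaks, and case differences so only base pairs remain."""
    seq = []
    with open(path) as f:
        for line in f:
            if line.startswith(">"):      # description / header line
                continue
            seq.append(line.strip().upper())
    return "".join(seq)

def non_unique_fraction(seq, n=8, limit=90_000, step=1_000):
    """At growing prefix lengths, return the fraction of n-grams seen so far
    that are repeats, i.e. 1 - (distinct n-grams) / (total n-grams)."""
    seen = set()
    xs, ys = [], []
    total = 0
    for i in range(min(limit, len(seq) - n + 1)):
        seen.add(seq[i:i + n])
        total += 1
        if total % step == 0:
            xs.append(total)
            ys.append(1 - len(seen) / total)
    return xs, ys

xs, ys = non_unique_fraction(read_fasta("aaosphaeria_arxii.fasta"))  # hypothetical file
plt.plot(xs, ys, label="Aaosphaeria arxii")
plt.xlabel("base pairs scanned")
plt.ylabel("fraction of non-unique 8-grams")
plt.legend()
plt.show()
```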

Analysis of Various Natural Languages with Byte Pair Encoding

In information theory, Byte-Pair Encoding (BPE) is a data-compression method that iteratively replaces the most frequent pair of consecutive bytes with a byte that does not occur in the data. For example, in the string “aaabdaaabac” the pair “aa” occurs most often, so it can be replaced by an unused byte Z to give “ZabdZabac”. In this work, BPE is used as a method to analyze the repeatability of texts, especially texts written in different languages and language families.

To allow for consistency across languages, we initially chose one text, “The Monkey’s Paw” by W.W. Jacobs, translated into seven languages. Specifically, we tested the seven most popular languages used in NLP: English, Chinese (Traditional), Urdu, Persian, Arabic, French, and Spanish. The tokenization length and token dictionary size were recorded after each iteration of BPE and plotted for easier visualization.
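For readers who want the flavor of this measurement, here is a minimal byte-level BPE sketch. It is a generic implementation rather than our exact code, and the file names in the usage comment are hypothetical.

```python
# Generic byte-level BPE sketch: repeatedly merge the most frequent adjacent
# pair, recording tokenization length and token-dictionary size per merge.
from collections import Counter

def bpe_curve(text, num_merges=200):
    tokens = list(text.encode("utf-8"))        # start from raw bytes
    vocab = set(tokens)
    lengths, vocab_sizes = [len(tokens)], [len(vocab)]
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:                          # nothing repeats any more
            break
        new_symbol = ("merge", best)           # stands in for an unused byte
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        vocab.add(new_symbol)
        lengths.append(len(tokens))
        vocab_sizes.append(len(vocab))
    return lengths, vocab_sizes

# Example usage with hypothetical file names for two translations of the text:
# for lang, path in [("English", "monkeys_paw_en.txt"), ("Arabic", "monkeys_paw_ar.txt")]:
#     lengths, vocab_sizes = bpe_curve(open(path, encoding="utf-8").read())
```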

Pre-Trained Tokenizers for the English Language

We aimed to test how pre-trained tokenizers tokenize English texts of similar sizes. We used English texts because pre-trained tokenizers are historically trained on large English corpora, so testing them on texts in other languages would have been ineffective.

For testing, the texts were first converted from “.txt” files into strings after parsing and removing all control characters (\n, \t, etc.). Second, we ran each tokenizer on the same system and recorded the length of the tokenization alongside the size of the token dictionary it generated. Finally, to visualize the results, we used NumPy and Matplotlib to compare the pre-trained tokenizers with a dot plot.
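A sketch of this measurement loop, using the standard Hugging Face transformers API, looks like the following; the checkpoint names and the text file name are illustrative assumptions rather than the exact models and texts we used.

```python
# Measure tokenization length and token-dictionary size per tokenizer.
from transformers import AutoTokenizer

tokenizers = {
    "BERT (WordPiece)": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "GPT-2 (byte-level BPE)": AutoTokenizer.from_pretrained("gpt2"),
}

def measure(text, tokenizer):
    """Return (tokenization length, token-dictionary size) for one text."""
    tokens = tokenizer.tokenize(text)          # subword strings, no special tokens
    return len(tokens), len(set(tokens))

# Hypothetical input file; control characters are replaced before tokenizing.
text = open("text1.txt", encoding="utf-8").read().replace("\n", " ").replace("\t", " ")
for name, tok in tokenizers.items():
    length, dict_size = measure(text, tok)
    print(f"{name}: {length} tokens, {dict_size} distinct tokens")
```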

Results

We first found that all genomes tend to be more repetitive toward the beginning of their sequences, then become less repetitive, and finally more repetitive again, as seen in Figure 1. The purple line, the human genome, is the most repetitive of the six genomes, meaning that it has more frequent repetitions than the bacteria or viruses on long length scales. This could be because human cells reproduce infrequently and are comparatively large.

Figure 1. 8-gram repetition analysis applied to the six filtered genomes, plotted on a logarithmic scale. The point x = 0, where log(x) = -∞, is not shown; only positive inputs appear in the plot.

Next, we visualized the results of applying Byte-Pair Encoding to our selected texts, as seen in Figure 2, and found that Arabic has the smallest tokenization length at the byte level, meaning that it has the smallest number of individual characters in the text. We speculate that this occurs because in Arabic, short vowels are not written but long vowels are. We also noticed that the final token dictionary size is smallest for Arabic. This could be for several reasons: articles such as a/an are not written, and verbs are usually single words. This suggests that the language is written more compactly, allowing NLP systems to process it more efficiently when given as input.

Figure 2. Analyzing the repeatability of the seven most popular languages used in NLP.

We then applied the same algorithm to various other languages, starting with the Indo-Iranian family, whose languages show generally similar characteristics, as seen in Figure 3. However, Pashto has a smaller tokenization length and token dictionary size, implying that information can be conveyed in fewer words, while Urdu had the largest values for both metrics.

Figure 3. BPE applied to five Indo-Iranian languages. Hindi, Bengali, and Urdu belong to the Indo-Aryan branch, while Persian and Pashto are part of the Iranian branch.

Similarly, we applied the algorithm to the Romance languages. As seen in Figure 4, Spanish, Portuguese, and Romanian are especially close in their initial and final tokenization lengths and token dictionary sizes. French and Italian, on the other hand, have the largest initial tokenization length and final token dictionary size, and nearly overlap each other. This is likely because the lexical similarity between the two languages is around 85–90% based on prior analysis; that is, almost 90% of their words have close counterparts in both languages.

Figure 4. BPE applied to five Romance languages.

On the flip side, we found very interesting results for the pre-trained tokenizers we tested (BERT, WordPiece, GPT-2). After running each tokenizer on the English texts, a clear pattern emerged in the efficiency of each pre-trained tokenizer. Overall, BERT proved more efficient than the others, as its tokenization length and token dictionary size were smaller for nearly every text, as shown in Figure 5. There were some instances where the other two tokenizers were slightly more efficient than BERT; in general, however, GPT-2 performed the worst, and WordPiece performed almost the same as BERT while sometimes being less efficient.

Figure 5. Pre-trained tokenizers applied to various English texts, with their tokenization length and token dictionary size recorded.

GPT-2 likely came in last because its tokenizer was built for a purpose quite different from what we were testing. GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence, but it is less effective at producing the fewest tokens when breaking up English text.

BERT and WordPiece ran quite closely, as BERT’s tokenizer is itself based on WordPiece tokenization. There were odd cases, such as Text 4 and Text 6, where the two tokenizers landed on opposite ends of the spectrum. However, this could reflect bias in the texts, as BERT handles simpler text better, while WordPiece can span larger, more complex texts more easily.

Conclusions/Discussion

Genomics

Comparing the number of repeated n-grams, which stand in for tokens, to the length of the sequence scanned offers a measure of how unique a subset of a genome, or the genome as a whole, is. All genomes eventually reach a limit in the number of unique n-grams, causing the plots to level off at a particular value. It was particularly insightful to examine the different rates at which different types of genomes exhibited this behavior.

From the genomes we analyzed, we found that the human genome, looking at chromosome 21, is more repetitive than the viral genomes. The viruses, in turn, are much more repetitive toward the beginning of their sequences than the bacteria and eukaryotes.

The foundation of our analysis of genome repeatability can be broadly useful to practitioners in genomics and bioinformatics who are interested in the uniqueness spectra of genomes and in how unique one section is compared to another. In particular, for certain engineering practices it is important to target unique sites that have no homologous base pairs in other genetic regions.

BPE and Various Languages

For this study, Byte-Pair Encoding was used to measure the repeatability of a variety of languages and language families. By visualizing and comparing the token dictionary size and tokenization length, we found that Arabic was the most compact and efficient language for NLP tasks when compared with the six other most commonly used languages in NLP. This is a result of short vowels and articles not being written, and of the language’s reliance on a root system.

These findings about languages and language families can help make NLP systems more efficient. Our results, such as French being less repetitive than English and several other languages, suggest that languages that convey less information in more words may be less effective for NLP tasks. By analyzing the information density of text, ambiguity issues in NLP models can be reduced.

English-Specific Pre-trained Tokenizers

Although each pre-trained tokenizer has its own purpose and can be used for a variety of applications, BERT seemed to surpass the other tokenizers in efficiency on short English texts. Knowing that BERT is more efficient at tokenizing small texts, we can use this tokenizer for more specialized applications (NLP apps, machine translation, etc.) until a better algorithm, such as an improved BPE, can be widely implemented.

Specifically, there are many potential applications for improving NLP tools for disabled people. In a recent study by Svensson et al. (2019), 149 students from various urban and rural schools in Sweden took part in a study “to investigate if an intervention using assistive technology for students with reading disabilities affected their reading ability, ability to assimilate text, and motivation for schoolwork.” Over the four weeks of testing, students used NLP applications such as SayHi (speech-to-text, STT), VoiceDreamReader (text-to-speech, TTS), Prizmo (scanning written text into digitized text), Skolstil-2 (a simple word processor and text-to-speech app that pronounces each letter sound, word, sentence, and the whole text while writing), Legimus (an audiobook reader), and Ruzzle (a word game). [8] The results show gains in reading ability despite the students using nothing but assistive technology during the intervention period, and these gains were comparable to the improvement a control group made over a full year. Approximately 50% of the students and their parents reported an increase in motivation for schoolwork after finishing the interventions.

With tokenizers like BERT, we can improve the situation for children in schools, and even in the workforce. NLP applications can become more efficient with better-implemented tokenizers, and this research is promising for future work that brings such technology into the mainstream consumer market for NLP apps.

Future Work

NLP Applications for the Disabled

We hope to apply our findings to devise new teaching methods for intellectually disabled and hearing-impaired students. This could include a new hybrid between an improved BPE algorithm and a historically efficient tokenizer like BERT to optimize the NLP models that developers currently use for NLP assistive technology.

Evaluation Framework for Developers

We would also like to construct an evaluation framework that provides real-time recommendations to NLP model users on which tokenizer to use for the NLP task at hand. In this study, we determined that the BERT tokenizer is the most effective for English texts, but we would like to expand these findings and look for trends between language families and the behavior of such tokenizers. Because so many people use and continually improve NLP models, it would be valuable for users to understand how to best utilize their resources and which tokenizer is best suited to their task, whether that is translating between languages, working with English texts, or working with specific texts translated into another language.

Acknowledgments

I’d like to thank our mentor, Dr. Dmitri Pavlichin, for providing guidance throughout the project and helping us with technical material. I’d also like to thank Cindy Nguyen and Professor Tsachy Weissman for organizing the STEM to SHTEM program. I’d like to thank my team for making this all possible.

Citations

Thank you so much for reading!

Shivam Syal is a 17 y/o disruptive innovator, computer science enthusiast, and emerging entrepreneur. Currently, he is looking for ways to use management and technology to address social inequalities arising out of unequal opportunities that are caused by disabilities, socioeconomic status, and global disparities.

Connect with him here 👇

shivamsyal.com | linkedin.com/in/shivamsyal | github.com/shivamsyal
