Can a language model achieve human-like accuracy in word prediction?

Although ‘enough data’ leads to higher accuracy, ‘human-like’ performance in word prediction requires a comprehensive modification of the language model’s underlying mechanism.

Introduction

The statement to be discussed in this article is that ‘with enough corpus data a language model can achieve human-like accuracy in word prediction’. In computational linguistics, word prediction is the calculation of the likelihood of a word given the preceding word(s). It is widely used in applications such as spelling/grammar checking, machine translation, and speech-to-text recognition. Such a model benefits from large amounts of linguistic data, whose size positively correlates with prediction accuracy. However, it is necessary to reconsider whether ‘enough’ corpus data can lead to ‘human-like’ proficiency. To measure human-like accuracy, these models are normally compared with human performance on various test data, such as complex language tasks. Another aspect of human-like accuracy is the precise imitation of human language processing mechanisms. In both senses, I will discuss how current language models fall short of human-like accuracy and examine whether these failures could be addressed with sufficient data.
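To make this concrete, the following is a minimal sketch, using an invented toy corpus, of how a simple bigram model estimates the likelihood of a word given the prior word as a ratio of counts (all data below are illustrative, not from any cited study):

```python
from collections import Counter

# Toy corpus (invented); a real model would train on millions of sentences.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("sat", "on"))   # 1.0  -- 'sat' is always followed by 'on'
print(bigram_prob("the", "cat"))  # 0.25 -- one of four continuations of 'the'
```

Any word pair absent from the corpus receives probability zero under this estimate, which is exactly the data sparsity problem discussed next.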

Data Sparsity

Data sparsity (i.e., perfectly acceptable words or sequences being absent from the corpus data) is the first issue that prevents a language model from achieving human-like proficiency. In simple (unsmoothed) n-gram models, increasing the corpus data can, to some extent, enhance accuracy because it reduces the risk of encountering unknown tokens. For instance, by comparing the predictive performance of n-gram models across 21 conditions (i.e., 3 n-gram orders from unigram to trigram × 7 training datasets of different corpus sizes), Lesher et al. (1999) identified a general improvement in accuracy with more corpus data. For deep neural models, model and data size also correlate positively with word-prediction accuracy (Kaplan et al., 2020), with some recent models (e.g., Liu et al., 2019) exceeding non-expert human performance on specific language tasks (e.g., GLUE; Wang et al., 2020).
Nevertheless, the problem of data sparsity will never be adequately addressed by simply adding extra data without modifying the model, because human creativity motivates the constant evolution of language (e.g., formulating new expressions based on morphological rules and beyond). However large the corpus, the n-gram matrix will therefore contain ‘zero-probability’ cases that may nonetheless occur in human communication. To address this, ‘smoothing’ methods have been proposed that redistribute probability mass, sharing some of the probability of high-count events with zero-probability ones (Jurafsky & Martin, 2009). With interpolation, it is claimed that a language model can achieve lower perplexity and better predict unfamiliar (e.g., out-of-domain) text (Shareghi et al., 2016). Neural models similarly construct semantic vector spaces in which associated lexical items are mapped closer together, which can better account for unseen sequences (Arisoy et al., 2012). Even with these smoothing techniques, however, language models still fail on complex tasks such as LAMBADA, which requires comprehension of a broad discourse context (Paperno et al., 2016).
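As an illustration of the general smoothing and interpolation idea (a deliberately simplified sketch, not the modified Kneser-Ney method of Shareghi et al., 2016), the following toy code assigns non-zero probability to a bigram never seen in the corpus:

```python
from collections import Counter

# Same invented toy corpus as in the Introduction sketch.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams, unigrams = Counter(zip(corpus, corpus[1:])), Counter(corpus)

def laplace_prob(prev, word):
    """Add-one smoothing: every bigram, seen or not, gets a non-zero count."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def interpolated_prob(prev, word, lam=0.7):
    """Linear interpolation: back off from the bigram to the unigram estimate."""
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / sum(unigrams.values())
    return lam * p_bi + (1 - lam) * p_uni

print(laplace_prob("cat", "mat"))       # non-zero although 'cat mat' never occurred
print(interpolated_prob("cat", "mat"))  # backs off to the unigram frequency of 'mat'
```

The design choice is the same in both cases: probability mass is shifted from observed events toward unseen ones, trading a little accuracy on frequent sequences for coverage of novel ones.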

Context

Context is therefore another important factor affecting whether a language model can achieve human-like performance, regardless of corpus size. One solution to the context-related failure of n-gram models is to increase the n-gram order from unigrams to bigrams and beyond. Higher-order n-grams (e.g., 5-grams) were found to have greater predictive power, but the effect becomes negligible beyond the 6-gram model (a .02-bit improvement; Goodman, 2001). This suggests that such models are particularly problematic for sentences with long-distance dependencies, in which the target word can only be predicted from information far away from it (Jurafsky & Martin, 2009). Handling these requires a higher level of knowledge of the syntactic features of the language, so as to imitate human processing better.
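An invented example makes the limitation concrete: the token that licenses the human prediction can fall entirely outside even a 5-gram window.

```python
# A long-distance dependency (hypothetical sentence): the word that licenses
# the continuation 'tune' is 'violin', far from the prediction point.
sentence = ("the violin that the old man in the shop repaired "
            "last week was badly out of").split()

n = 5
context = tuple(sentence[-(n - 1):])  # all a 5-gram model can condition on
print(context)  # ('was', 'badly', 'out', 'of')
# 'violin' lies ten-plus tokens back, outside the window, so the n-gram model
# cannot use it, while a human reader predicts 'tune' effortlessly.
```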
In addition to the long-dependency issue, the LAMBADA test reveals the limited awareness of broader discourse in language models (Paperno et al., 2016). In this test, the word to be predicted is difficult to guess from a single sentence but easy to guess when the whole context is available. For instance, to complete the sentence ‘Do you honestly think that I would want you to have a ___?’, prior knowledge of the discourse is necessary. On this task, n-gram models achieve an accuracy of only approximately .1%. Rather than collecting more data, the solution lies in substantially adjusting the model. A first possible refinement is caching, which assumes that words appearing in adjacent sentences are likely to appear again (Soutner et al., 2012). However, cache-based n-grams still fail the LAMBADA test, without significantly enhancing word-prediction accuracy (see Table 1 in Paperno et al., 2016 for the comparison). A further step is an attention-based mechanism, which exploits the fact that some words in the context bear a stronger relationship to the target word than others (Mei et al., 2016). These techniques help the model better imitate human-like language processing, something that cannot be accomplished by simply adding more corpus data.
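The core of the attention idea can be sketched in a few lines; this is a toy illustration with random stand-in vectors, not the architecture of Mei et al. (2016). Each context word is weighted by how strongly its representation relates to the current prediction state, and the weighted sum feeds the prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # embedding dimension (toy value)
context_words = ["do", "you", "honestly", "think", "I", "would", "want"]
E = rng.normal(size=(len(context_words), d))    # stand-in word embeddings
query = rng.normal(size=d)                      # current prediction state

scores = E @ query                               # relevance of each context word
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the context
summary = weights @ E                            # attention-weighted context vector

for w, a in zip(context_words, weights):
    print(f"{w:>8}: {a:.2f}")  # higher weight = stronger link to the target
```

Unlike a fixed n-gram window, the weights here can in principle single out a relevant word anywhere in the context, which is what the LAMBADA-style examples require.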

World Knowledge and Human Cognition

However, even if a language model perfectly accounts for broader context, it will still fail to achieve human-like accuracy where extralinguistic world knowledge is required (Mahesh & Nirenburg, 1995). GPT-3, a state-of-the-art model trained on approximately a trillion words, shows high accuracy on the LAMBADA test despite its long word dependencies and contextual constraints (86.4% for the few-shot model; Brown et al., 2020). Even so, Brown et al. (2020) acknowledged that GPT-3 cannot formulate coherent and logical arguments when generating passages, one of the core qualities of human cognition. Marcus and Davis (2020) further pointed out that GPT-3 is limited in processing extralinguistic knowledge such as psychological or social reasoning. For example, in the following task, GPT-3 inclined towards the bathing suit rather than the more socially appropriate choice:
‘You are a defense lawyer ... suit pants are badly stained. However, your bathing suit is clean and very stylish ... it’s expensive French couture … You decide that you should wear __’
Additionally, for domain-specific applications, rather than increasing the data size, the language model should be constrained or primed according to the extralinguistic scenario. In the field of eHealth, although GPT-3 presents relatively high accuracy in word prediction, it has been suggested that some few-shot training is necessary to avoid destructive, discriminating, or otherwise unsuitable expressions in health-related situations, which humans avoid for empathic reasons (Korngiebel & Mooney, 2021). Language models should therefore imitate not only human responses in specific domains but also the socio-psychological attributes governing the discourse.
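A sketch of what such few-shot priming might look like in practice follows; the prompt format and example exchanges are invented for illustration and are not drawn from Korngiebel and Mooney (2021):

```python
# Hypothetical few-shot prompt for a health-related completion task:
# a handful of in-domain, empathically phrased exchanges precede the new
# query, steering the model's continuation toward the desired register.
few_shot_examples = [
    ("Patient: I'm scared about my diagnosis.",
     "Assistant: That sounds really difficult. Let's go through what it means together."),
    ("Patient: I keep forgetting my medication.",
     "Assistant: That happens to many people. A daily reminder might help; shall we set one up?"),
]

def build_prompt(new_query: str) -> str:
    blocks = [f"{q}\n{a}" for q, a in few_shot_examples]
    blocks.append(f"{new_query}\nAssistant:")
    return "\n\n".join(blocks)

print(build_prompt("Patient: The side effects are getting worse."))
```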
Another vital issue is that world knowledge can sometimes (or even always) be diverse and evolving, especially in volatile areas. After conducting sentiment analysis on the phrase ‘Donald Trump’ using the Tencent NLP API (see Appendix for the code), I found that this proper noun was coded as ‘negative’ (negativity = .523), a rating that is prone to change along with political developments. Although rapid data collection can partly address this issue, it is still essential to examine how a language model imitates human-like processing of old and new information: new tokens will always be outnumbered by old ones, so the model may rely on probabilities dominated by outdated information. A more refined mechanism for allocating different sensitivities to parts of large linguistic datasets is required (Sankar et al., 2019).
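One simple way such differing sensitivities could be allocated, sketched here as an assumption rather than as Sankar et al.'s (2019) proposal, is to decay the weight of each observation with its age, so that a small amount of recent evidence can outvote a larger mass of stale evidence:

```python
# Invented (label, age_in_days) sentiment observations for one term.
observations = [("negative", 900)] * 60 + [("positive", 10)] * 15

def decayed_score(obs, half_life_days=180.0):
    """Sum sentiment evidence, halving each observation's weight
    every `half_life_days` of age."""
    score = 0.0
    for label, age in obs:
        weight = 0.5 ** (age / half_life_days)
        score += weight if label == "positive" else -weight
    return score

print(decayed_score(observations))  # > 0: 15 fresh positives outweigh 60 stale negatives
```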
Furthermore, corpus-based register analysis emphasises that current language models also encounter difficulties when speaker-audience relationships matter in conversation (Biber & Conrad, 2019). Situational variables impose additional constraints on human language processing, such as politeness, information density, and the use of profanity (Crocker et al., 2016). In such applications, therefore, the accuracy of word prediction should be enhanced through parameter settings or semantic priming, in addition to a large dataset.
The above world-knowledge-related issues are inherently caused by reliance on word-word relationships (Marcus & Davis, 2020) and by inadequate top-down language processing. Although clustering-algorithm-based methods (Wu, 2014) and cross-modal models (e.g., picture-word interaction in You et al., 2016) have attempted to overcome this limitation, such modifications of the language model are far from complete. Moreover, as discussed, most of the aforementioned failures in word prediction can be confronted by either adjusting the model's training method or integrating more extralinguistic processing. In the following paragraphs, however, I examine another issue with data-driven models that is more challenging to tackle: the collected data sometimes cannot support the required language processing at all.

The Gap between Frequency-Based Data and Language Processing

The first phenomenon to be discussed in this section is the use of taboo language. Taboo or forbidden language echoes the adage that ‘rules are made to be broken’: people utilise forbidden expressions precisely because the exclusion of taboos is itself one of the social norms to be breached (Allan & Burridge, 2006; Steiner, 2013). Therefore, if certain taboos are strictly forbidden, they will be employed (1) less frequently in daily life and (2) more frequently when a speaker is trying to achieve a specific social function. In corpus terms, this means that a less frequent taboo word should sometimes enjoy a higher priority in word prediction. Some preliminary efforts have been made to predict taboos in anonymous emotional disclosures (Paul et al., 2021); however, domain-general development of such language models remains difficult. In contrast to expanding the amount of data, it is essential to reflect on the circumstances under which human beings intentionally break social norms (though ethics-related issues matter here).
Second, the active adaptation of habitual collocations is another problem that must be addressed, and it is also one of the causes of data sparsity. For probability-based models, the typical co-occurrence of words is the core evidence for word prediction, which privileges established idioms. However, even for non-compositional idioms (e.g., kick the bucket), human beings may transform them into new expressions (e.g., bucket list; see Titone & Connine, 1999 for a discussion of compositionality). Some scholars have developed latent semantic models to account for multi-word expressions (Katz & Giesbrecht, 2006; King & Cook, 2018), though whether novel expressions derived from such phrases can be accurately predicted has rarely been studied. This could be addressed by incorporating a more comprehensive account of the relationship between symbolic and statistical features, rather than by simple accumulation of data (Sag et al., 2002).
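The latent-semantic approach can be caricatured in a few lines (invented toy vectors, not the actual LSA models of Katz & Giesbrecht, 2006): compare a phrase's own distributional vector with a naive composition of its parts, and treat low similarity as a sign of non-compositionality:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy distributional vectors (invented); dimensions might correspond to
# contexts such as 'funeral', 'farm', 'water', 'list', 'football'.
vec = {
    "kick":            np.array([0.1, 0.2, 0.0, 0.0, 0.9]),
    "bucket":          np.array([0.0, 0.6, 0.8, 0.1, 0.0]),
    "kick_the_bucket": np.array([0.9, 0.0, 0.0, 0.1, 0.1]),  # 'die'-like contexts
}

composed = vec["kick"] + vec["bucket"]           # naive additive composition
sim = cosine(composed, vec["kick_the_bucket"])   # low similarity => idiomatic
print(f"compositionality score: {sim:.2f}")
```

A model with such a flag knows that the idiom's behaviour should not be predicted from its parts; predicting creative spin-offs such as bucket list, however, would require modelling the relationship between the parts and the whole rather than merely detecting its absence.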
The final problem is the multilingual norm of today's world, which is rarely considered in language models. Although sufficient data could still bring about reasonable accuracy, multilingual language tasks sometimes require meta-linguistic awareness that is, to some extent, independent of corpus size. For instance, in computer-aided translation and interpretation, some words should be left untranslated depending on the audience's background (Vogler et al., 2019), which requires the integration of higher-level knowledge. Similarly, although the translation of complicated material (e.g., humour) has been considered (Chiaro, 2020), the application of NLP to tasks requiring meta-linguistic awareness (e.g., bilingual puns) was ‘given no coverage’ (O’Reilly, 2019).

Conclusion

This essay has reviewed the limitations of current language models, focusing on the role of corpus data. It has argued that, although ‘enough data’ can lead to higher accuracy, ‘human-like’ performance in word prediction requires a comprehensive modification of the underlying mechanism of the language model rather than merely enlarging the corpus. Possible solutions to current failures were discussed, such as attention-based training and latent semantic neural models, calling for future efforts in this field. I also examined more challenging aspects, such as the need to incorporate extralinguistic knowledge, imitate human-like socio-psychological attributes, and understand meta-linguistic elements. This again leads to the conclusion that, besides ‘enough data’, many other factors are necessary for achieving human-like accuracy in word prediction.

References

Allan, K., & Burridge, K. (2006). Forbidden words: Taboo and the censoring of language. Cambridge University Press.
Arisoy, E., Sainath, T. N., Kingsbury, B., & Ramabhadran, B. (2012). Deep Neural Network Language Models. Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT, 20–28. https://aclanthology.org/W12-2703
Biber, D., & Conrad, S. (2019). Register, genre, and style. Cambridge University Press.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. ArXiv:2005.14165 [Cs]. http://arxiv.org/abs/2005.14165
Chiaro, D. (2020). Humour translation in the digital age. In Humour Translation in the Age of Multimedia. Routledge.
Crocker, M. W., Demberg, V., & Teich, E. (2016). Information Density and Linguistic Encoding (IDeaL). KI - Künstliche Intelligenz, 30(1), 77–81. https://doi.org/10.1007/s13218-015-0391-y
Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech & Language, 15(4), 403–434. https://doi.org/10.1006/csla.2001.0174
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed). Pearson Prentice Hall.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. ArXiv:2001.08361 [Cs, Stat]. http://arxiv.org/abs/2001.08361
Katz, G., & Giesbrecht, E. (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. Proceedings of the Workshop on Multiword Expressions Identifying and Exploiting Underlying Properties - MWE ’06, 12. https://doi.org/10.3115/1613692.1613696
King, M., & Cook, P. (2018). Leveraging distributed representations and lexico-syntactic fixedness for token-level prediction of the idiomaticity of English verb-noun combinations. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 345–350. https://doi.org/10.18653/v1/P18-2055
Korngiebel, D. M., & Mooney, S. D. (2021). Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. Npj Digital Medicine, 4(1), 93. https://doi.org/10.1038/s41746-021-00464-x
Lesher, G. W., Moulton, B. J., & Higginbotham, D. J. (1999). Effects of n-gram order and training text size on word prediction. Proceedings of the RESNA ’99 Annual Conference, 52–54.
Liu, X., He, P., Chen, W., & Gao, J. (2019). Multi-Task Deep Neural Networks for Natural Language Understanding. ArXiv:1901.11504 [Cs]. http://arxiv.org/abs/1901.11504
Mahesh, K., & Nirenburg, S. (1995). A situated ontology for practical NLP. Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, International Joint Conference on Artificial Intelligence (IJCAI-95).
Marcus, G., & Davis, E. (2020). GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about. MIT Technology Review. https://www.technologyreview.com/2020/08/22/1007539/gpt3-openai-language-generator-artificial-intelligence-ai-opinion/
Mei, H., Bansal, M., & Walter, M. R. (2016). Coherent Dialogue with Attention-based Language Models. ArXiv:1611.06997 [Cs]. http://arxiv.org/abs/1611.06997
O’Reilly, D. (2019). Review of Veale, T., Shutova, E., & Beigman Klebanov, B. (2016), Metaphor: A computational perspective. Metaphor and the Social World, 9(1), 131–138. https://doi.org/10.1075/msw.18033.ore
Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R. (2016). The LAMBADA dataset: Word prediction requiring a broad discourse context. ArXiv:1606.06031 [Cs]. http://arxiv.org/abs/1606.06031
Paul, A., Liao, W., Choudhary, A., & Agrawal, A. (2021). Harnessing Psycho-lingual and Crowd-Sourced Dictionaries for Predicting Taboos in Written Emotional Disclosure in Anonymous Confession Boards. Journal of Healthcare Informatics Research, 1–23.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing (pp. 1–15). Springer. https://doi.org/10.1007/3-540-45715-1_1
Sankar, C., Subramanian, S., Pal, C., Chandar, S., & Bengio, Y. (2019). Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study. https://arxiv.org/abs/1906.01603v2
Shareghi, E., Cohn, T., & Haffari, G. (2016). Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 944–949. https://doi.org/10.18653/v1/D16-1094
Soutner, D., Loose, Z., Müller, L., & Pražák, A. (2012). Neural Network Language Model with Cache. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, Speech and Dialogue (Vol. 7499, pp. 528–534). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_64
Steiner, F. (2013). Taboo (Vol. 15). Routledge.
Titone, D. A., & Connine, C. M. (1999). On the compositional and noncompositional nature of idiomatic expressions. Journal of Pragmatics, 31(12), 1655–1674. https://doi.org/10.1016/S0378-2166(99)00008-9
Vogler, N., Stewart, C., & Neubig, G. (2019). Lost in Interpretation: Predicting Untranslated Terminology in Simultaneous Interpretation. ArXiv:1904.00930 [Cs]. http://arxiv.org/abs/1904.00930
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2020). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. ArXiv:1905.00537 [Cs]. http://arxiv.org/abs/1905.00537
Wu, Y.-C. (2014). A top-down information theoretic word clustering algorithm for phrase recognition. Information Sciences, 275, 213–225. https://doi.org/10.1016/j.ins.2014.02.033
You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4651–4659. https://openaccess.thecvf.com/content_cvpr_2016/html/You_Image_Captioning_With_CVPR_2016_paper.html