By: Rebecca Bilbro
Siri, Cortana, and Alexa may seem like novelties today, but in fact, they’re a signal of a new era of language-aware applications that are rapidly becoming the new norm. Language-aware applications are ones that can:
“Leverage natural language processing techniques to understand human-generated text and audio data… [and] curate the myriad of human-generated information on the web specifically on our behalf, offering new and personalized mechanisms of human-computer interaction” (Bengfort, Bilbro and Ojeda, 2016).
Rather than being the purview only of companies like Apple, Microsoft, and Amazon, such applications are increasingly being democratized and applied in a range of contexts to derive business value. In this post, we’ll explore a bit about what natural language processing is, how it works, and how it can be applied to help solve everyday business problems.
What is Natural Language Processing?
Natural language is what people use to communicate with each other. Unlike formal languages (e.g. programming languages), which are defined by strict rules, natural language is flexible, contextual, and evolving. As a result, natural language is not as straightforward for a computer program to process as a script written in a language like Java, Python, or SQL. When we talk about Natural Language Processing (or “NLP” for short), we’re talking about the ways in which we can use computers to process and interact with human language.
Because it has to do with how humans communicate, NLP is as much influenced by fields like computational linguistics as it is by computer science. In fact, one of the most popular NLP libraries in Python, NLTK, grew out of computational linguistics teaching and research; it was originally developed by Steven Bird and Edward Loper. In the accompanying book, “Natural Language Processing with Python,” often affectionately referred to as the “whale book,” Bird, Loper, and their co-author Ewan Klein explain:
“Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish; text analysis enables us to detect sentiment in tweets and blogs. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society.” (Bird, Klein, and Loper, 2009)
How does Natural Language Processing Work?
Under the hood, the majority of NLP-based applications work in the same fundamental way: they take in text data as input, parse it into component parts, compute upon those components, and then recombine them to deliver a meaningful and tailored end result.
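To make that flow concrete, here is a minimal sketch using only the Python standard library. The function names, the tiny stopword list, and the sample review are all made up for illustration; a real application would use a proper tokenizer and a far richer model than word counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "but", "a"}

def tokenize(text):
    """Parse raw text into lowercase word tokens with a naive regex."""
    return re.findall(r"[a-z']+", text.lower())

def top_terms(text, n=2):
    """Count the non-stopword tokens, then recombine the counts into a ranked summary."""
    tokens = [tok for tok in tokenize(text) if tok not in STOPWORDS]
    return Counter(tokens).most_common(n)

# Input text -> component parts -> computation -> recombined result.
review = "The battery is great. The screen is great, but the battery charger is slow."
print(top_terms(review))
```

Even in this toy version, the four stages are visible: ingest a string, break it into tokens, compute on the tokens, and assemble something a person can read.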
In other words, language-aware applications are not “automagic.” Rather than beautiful, bespoke algorithms, the best applications tend to use language models trained on domain-specific corpora (collections of related documents containing natural language). The reason for this is that language is highly contextual, and words can mean different things in different contexts. For example, depending on context, the word “bank” can refer to a place where you put your money, the side of a river, the surface of a mine shaft, or the cushion of a pool table! With domain-specific corpora, we can reduce ambiguity and shrink the prediction space to make results more intelligible.
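As a rough illustration of why context and domain matter, the sketch below guesses a sense of “bank” by counting how many context words overlap with a hand-written set of cue words for each sense, loosely in the spirit of gloss-overlap disambiguation. The cue words, sense labels, and function name are assumptions invented for this example, not taken from any real lexicon or trained model.

```python
# Hypothetical cue words for three senses of "bank" (illustrative only).
SENSES = {
    "financial institution": {"money", "deposit", "account", "loan", "teller"},
    "river edge": {"river", "water", "shore", "fishing", "muddy"},
    "pool table cushion": {"pool", "table", "cue", "ball", "shot"},
}

def guess_sense(sentence, senses=SENSES):
    """Return the sense whose cue words overlap most with the sentence's words."""
    cleaned = sentence.lower().replace(".", " ").replace(",", " ")
    context = set(cleaned.split())
    return max(senses, key=lambda sense: len(senses[sense] & context))

print(guess_sense("She opened an account at the bank to deposit money."))
print(guess_sense("The cue ball bounced off the bank and dropped in."))

# Restricting the candidate senses to one domain shrinks the prediction space:
billiards_only = {"pool table cushion": SENSES["pool table cushion"]}
print(guess_sense("He aimed for the bank.", senses=billiards_only))
```

The last call has no helpful context words at all, yet the answer is unambiguous because the domain leaves only one candidate. That is the same effect, in miniature, that a domain-specific corpus has on a trained model.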
How NLP is Different from Standard (Numeric) Machine Learning
While we can perform machine learning on text in much the same way as on a numeric dataset, an important caveat is that working with text data is substantially different from working with numeric data.
Ingesting a raw text corpus in a form that will support systematic parsing is non-trivial, and you must anticipate a range of problems from HTTP response errors, to streaming for memory safety, to unpredictable text encoding. Once ingested, you must also establish a standardized method of preprocessing, normalizing, and transforming your raw ingested text into a corpus that is ready for computation and modeling. This process must include content extraction, paragraph blocking, sentence segmentation, word tokenization, and part-of-speech tagging. Finally, it’s important to acknowledge that vectorizing text will result in an extremely high-dimensional decision space, which can lead to modeling problems associated with the curse of dimensionality.
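Here is a simplified sketch of a few of those steps (sentence segmentation, word tokenization, and bag-of-words vectorization) in plain Python. The regex rules and helper names are assumptions chosen for brevity; they will mishandle abbreviations and many other cases, and part-of-speech tagging is left out because it needs a trained tagger.

```python
import re

def segment_sentences(paragraph):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Lowercase the sentence and pull out runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

def bag_of_words(docs):
    """Build one count vector per document, with one dimension per vocabulary word."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok in tokenize(doc):
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors

text = "Text is messy. Numbers are tidy! Is text harder than numbers?"
sentences = segment_sentences(text)
vocab, vectors = bag_of_words(sentences)
print(len(sentences), "sentences ->", len(vocab), "dimensions")
```

Each distinct word becomes its own dimension, so three short sentences already need eight. A real corpus with tens of thousands of distinct words produces vectors that are correspondingly wide and mostly zeros.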
Fortunately, there are some great open source libraries and tools that can help you get started; for ingestion, check out Baleen, a production-grade corpus ingestion engine, and for normalization and transformation, check out Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
Solving Everyday Problems with NLP
When done right, NLP-based applications can lead to a markedly improved customer experience. For example, an NLP feature like auto-completion for free-entry text can ensure that customers are more likely to find the product they are looking for, even if they are unsure as to how it is spelled. Machine translation is another example of an NLP-based application that can allow speakers from different countries to communicate across language boundaries. Integrating named entity recognition tools can make it easier to identify and link entities across data sources (e.g. rapidly connecting a patient’s medical history across many hospital records in order to lessen the chances of misdiagnosis).
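As a small sketch of the auto-completion idea, the example below tries a prefix match first and falls back to fuzzy matching with `difflib.get_close_matches` from the Python standard library when the query looks misspelled. The product catalog, the `suggest` function, and the 0.5 similarity cutoff are assumptions for illustration; a production feature would more likely rely on a search index and real query logs.

```python
import difflib

# Hypothetical product catalog (illustrative only).
CATALOG = [
    "espresso machine",
    "espresso cups",
    "electric kettle",
    "french press",
    "milk frother",
]

def suggest(query, catalog=CATALOG, limit=3):
    """Suggest catalog entries by prefix, falling back to fuzzy matching."""
    query = query.lower().strip()
    prefix_hits = [item for item in catalog if item.startswith(query)]
    if prefix_hits:
        return prefix_hits[:limit]
    # No prefix match: rank entries by string similarity, best match first.
    return difflib.get_close_matches(query, catalog, n=limit, cutoff=0.5)

print(suggest("esp"))               # prefix match on a partial query
print(suggest("expresso machine"))  # misspelled query, fuzzy fallback
```

The fallback is what lets a customer who types “expresso” still land on the espresso machine.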
At ByteCubed we’re currently implementing NLP tools for a range of different business problems — from matching companies across disparate data sources, to better understanding how messages travel through social networks. Our main takeaway has been that while working with unstructured text data can present unique challenges — from having to ingest and manage special domain-specific corpora, to experimentation with multiple normalization or vectorization methods — it is also often the only way to answer certain kinds of questions. Importantly, natural language data encodes complex things about the human experience that simply cannot be found in traditional numeric datasets — and while that complexity doesn’t come easily, the added insight is often well worth the effort.