Natural Language Processing is concerned with the exploration of computational techniques to learn, understand and produce human language content. NLP technologies can assist both human-human communication and human-machine communication, and can analyse and learn from the vast amount of textual data available online.
However, a few obstacles stand in the way of this still largely unexplored area of technology.
The first is that we humans don’t consciously understand how we process language to begin with. The second major difficulty is ambiguity.
Computers are extremely good at manipulating syntax, for example counting how many times the word “and” appears in a 120-page document, but they are extremely weak at manipulating concepts. In fact, concepts are entirely foreign to computer processes. Natural language, on the other hand, is all about concepts, and it uses syntax only as a transient means of getting to them.
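To make the "good at syntax" half of that claim concrete, here is a minimal sketch of the word-counting task in Python. The function name `count_word` and the sample text are made up for illustration, and the tokenization rule (lowercase runs of letters and straight apostrophes) is a simplifying assumption, not a standard.

```python
import re
from collections import Counter

def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # Assumption: a "word" is a run of letters or straight apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)[word.lower()]

# Hypothetical sample text standing in for the 120-page document.
sample = "Salt and pepper, and then sand. And that is all."
print(count_word(sample, "and"))  # prints 3 ("sand" is not counted)
```

Note that the program gets the count right without any notion of what "and" means, which is exactly the gap between syntax and concepts.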
Because a computer is unaware of this conceptual dimension, it has difficulty processing natural language, whose purpose is to convey concepts and which uses syntax only as a transient means.
Such a limitation can be alleviated by making computer processes more aware of the conceptual dimension.
This is almost a philosophical question. In natural language, syntax is a means, and concept is the goal. Take transportation as an analogy: a road is the means, and getting from point A to point B is the goal. If extraterrestrials came to Earth long after we were gone and found roads all over the place, would they be able to make sense of transportation just by analyzing the means? Probably not! You can’t fully understand an object of knowledge by analyzing the means alone.
When you think of a linguistic concept like a word or a sentence, those seem like simple, well-formed ideas. But in reality, there are many borderline cases that can be quite difficult to figure out.
For instance, is “won’t” one word, or two? (Most systems treat it as two words.) In languages like Chinese or (especially) Thai, native speakers disagree about word boundaries, and in Thai, there isn’t really even the concept of a sentence in the way that there is in English. And words and sentences are incredibly simple compared to finding meaning in text.
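The “won’t” question can be shown with two toy tokenizers. This is a sketch under stated assumptions: both function names are hypothetical, and the second one only imitates the Penn Treebank habit of splitting off "n't" as its own token; it is not any particular library's tokenizer.

```python
import re

def split_on_whitespace(text: str) -> list[str]:
    """Treat every whitespace-separated chunk as one word."""
    return text.split()

def split_contractions(text: str) -> list[str]:
    """Split a trailing "n't" off as its own token (Penn Treebank style)."""
    tokens = []
    for chunk in text.split():
        # Assumption: only straight apostrophes and only "n't" contractions.
        match = re.fullmatch(r"(\w+)(n't)", chunk, flags=re.IGNORECASE)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(chunk)
    return tokens

sentence = "It won't work"
print(split_on_whitespace(sentence))  # ['It', "won't", 'work']
print(split_contractions(sentence))   # ['It', 'wo', "n't", 'work']
```

Neither answer is wrong; they are different conventions, and a system has to pick one and apply it consistently.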
The thing is, many, many words are ambiguous. “Ground” has tons of meanings as a verb, and even more as a noun. To understand what a sentence means, you have to understand the meaning of the words, and that’s no simple task.
The crazy thing is, for humans, all this stuff is effortless. When you read a web page with lists, tables, run-on sentences, newly made-up words, nouns used as verbs, and sarcasm, you get it immediately, usually without having to work at it.
Puns and wordplay are constructs people use for fun, but they’re also exactly what you’d create if you were trying your best to baffle an NLP system. The reason is that computers process language in a way totally unlike humans, so once you move away from the kind of text they were trained on, they are likely to be hopelessly confused. Humans, by contrast, happily learn the new rules of communicating on Twitter without having to think about it.
If we really understood how people understand language, we could maybe make a computer system do something similar. But because it’s so deeply buried and unconscious, we resort to approximations and statistical techniques, which are at the mercy of their training data and may never be as flexible as a human.
Natural language processing is the art of solving engineering problems that require analyzing or generating natural language text. The metric of success is not whether you designed a better scientific theory or proved that languages X and Y were historically related. Rather, the metric is whether you got good solutions to the engineering problem.
For example, you don’t judge Google Translate on whether it captures what translation “truly is” or explains how human translators do their job. You judge it on whether it produces reasonably accurate and fluent translations for people who need to translate certain things in practice. The machine translation community has ways of measuring this, and they focus strongly on improving those scores.
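As a rough idea of what such a score can look like, the sketch below computes clipped unigram precision: the fraction of words in a system translation that also appear in a human reference, with each word credited at most as often as the reference contains it. This is one simplified ingredient of overlap metrics such as BLEU, chosen here for illustration. The text above does not say which metrics are used, and the function name and example sentences are made up.

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words found in the reference, with clipping."""
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_words)
    # Each candidate word earns credit at most as many times as it
    # occurs in the reference, so repeating a common word does not help.
    matched = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return matched / len(cand_words)

# Hypothetical system output and human reference translation.
score = clipped_unigram_precision("the the the cat", "the cat sat on the mat")
print(score)  # prints 0.75
```

A score like this says nothing about how translation "really" works; it only gives engineers a number to push upward, which is the point being made.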
NLP is mainly used to help people navigate and digest large quantities of information that already exist in text form. It is also used to produce better user interfaces so that humans can better communicate with computers and with other humans.
In saying that NLP is engineering, we don’t mean that it is always focused on developing commercial applications. NLP may be used for scientific ends within other academic disciplines such as political science (blog posts), economics (financial news and reports), medicine (doctors’ notes), digital humanities (literary works, historical sources), etc.
In those cases, though, it is used as a tool within computational X-ology to answer the scientific questions of X-ologists, rather than the scientific questions of linguists.
That said, NLP professionals often get away with relatively superficial linguistics. They look at the errors made by their current system, and learn only as much linguistics as they need to understand and fix the most prominent types of errors. After all, their goal is not a full theory but rather the simplest, most efficient approach that will get the job done.
NLP is a growing field, and despite many obstacles it has shown a tremendous capacity to abstract and use data. It teaches us that, at the end of the day, simplicity is key.