
You were probably debugging a BiLSTM that day. Maybe wrestling with attention weights. Or writing yet another paper on named entity recognition with a 0.3% improvement over baseline.
Then Google dropped BERT. Everything you knew about NLP became obsolete in about 48 hours. Your carefully crafted feature engineering, your domain-specific models, your years of expertise in linguistic rules—all replaced by a pre-trained transformer that a junior developer could fine-tune in an afternoon.
That was just the beginning. The journey from word2vec to ChatGPT isn't just a story of technical progress. It's a story of how LLMs changed NLP from a research discipline into mostly solved infrastructure.
Back in 2013, Tomas Mikolov wasn't trying to revolutionize anything. He just wanted word embeddings that didn't suck. The Google researcher published two papers that year introducing word2vec, and suddenly "king - man + woman = queen" became the party trick that launched a thousand startups.
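To make the party trick concrete, here is a minimal sketch of the vector arithmetic behind it. The three-dimensional vectors and their axis meanings are invented for illustration; real word2vec embeddings are learned from text and have hundreds of dimensions with no labeled axes. The `analogy` helper is a hypothetical name, not part of any word2vec release.

```python
import math

# Toy vectors made up for this example (assumed axes: royalty, maleness, femaleness).
# Learned embeddings would not have interpretable axes like these.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman"))
```

Note that the input words are excluded from the candidates. Without that exclusion, the nearest vector to the result is often one of the inputs, which is part of why the trick looks cleaner in demos than it is in practice.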
Word2vec was beautifully simple. Skip-gram and CBOW architectures that could learn word representations from raw text. No hand-crafted features. No linguistic expertise required. Just feed it Wikipedia and watch it learn that "Paris" and "France" belong together.
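What "learning from raw text" means for skip-gram is easiest to see in how the training data gets built. The sketch below is an assumption-laden simplification: it only generates (center, context) pairs from a sliding window and leaves out the actual training step, subsampling, and negative sampling. The function name `skipgram_pairs` and the `window` parameter are illustrative choices, not the original implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "paris is the capital of france".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

Skip-gram trains a model to predict the context word from the center word in each pair. CBOW flips the direction and predicts the center word from its surrounding context. Either way, the only supervision is which words show up near each other.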
The real disruption wasn't the math. It was the mindset shift. For the first time, you could download pre-trained embeddings and get decent results on almost any NLP task. The bitter lesson was knocking at the door: general methods that leverage computation beat clever algorithms every time.
But we didn't listen. Not yet.
Between 2013 and 2017, NLP researchers went wild with architectures. LSTMs, GRUs, bidirectional everything, attention mechanisms bolted onto RNNs like spoilers on a Honda Civic. Each paper proudly announced a 1-2% improvement on some benchmark nobody cared about.
The 2017 "Attention Is All You Need" paper from Google should have been the wake-up call. Vaswani and team threw away recurrence entirely. Just attention layers stacked on attention layers. The Transformer architecture was born.
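For a sense of how little machinery is involved, here is a rough pure-Python sketch of scaled dot-product attention, the core operation of the paper. It is deliberately stripped down: no learned projections, no multiple heads, no masking, no positional encodings. Those omissions are simplifications for this example, and the list-of-lists representation is an assumption chosen to avoid dependencies.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```

Each query's output depends only on the keys and values, never on another position's result. That independence is what lets every position be computed at once, where an RNN has to step through the sequence one token at a time.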
Here's what most people missed: the paper wasn't really about translation. It was about proving that domain expertise, clever architectures, and task-specific solutions were all headed for extinction. Scale and compute would eat everything.
The Transformer didn't just beat RNNs at translation. It beat them while being easier to parallelize, simpler to implement, and more general purpose. Still, most of the field kept publishing incremental improvements to old methods. Careers were built on technologies that were already dead.
October 2018. Google releases BERT. Pre-trained on 3.3 billion words. Fine-tunable for any task. State-of-the-art results on 11 different NLP tasks with nothing more than fine-tuning.
This wasn't an incremental improvement. BERT beat specialized models by margins that made previous research look like rounding errors. Question answering, sentiment analysis, named entity recognition—it dominated everything. The paper's casual tone almost mocked the field: "Hey, we pre-trained a big model and it beats all your fancy task-specific architectures. Weird, right?"
The aftermath was brutal. Entire research groups pivoted overnight. Years of work on specialized architectures became worthless. That novel approach to dependency parsing you spent two years on? BERT beats it with three lines of code.
By then it was crystal clear how LLMs had changed NLP: they made expertise in NLP almost irrelevant. You didn't need to understand linguistics anymore. You needed to understand transfer learning and have access to GPUs.
If BERT was a warning shot, GPT-3 was the asteroid that killed the dinosaurs. OpenAI's 2020 paper still reported numbers on traditional NLP benchmarks, but they were no longer the point. Why would they be? The model could write poetry, answer questions, translate languages, and generate code without any fine-tuning at all.
175 billion parameters. Trained on hundreds of billions of tokens, most of them scraped from the web. The bitter lesson wasn't knocking anymore. It had kicked down the door and was raiding your fridge.
Traditional NLP tasks became almost comically trivial. Sentiment analysis? Prompt the model with "Is this review positive or negative?" Named entity recognition? "List all the people mentioned in this text." No training required. No dataset curation. No feature engineering. Just ask nicely.
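In code, "just ask nicely" can look something like the sketch below. Everything here is hypothetical scaffolding: `complete` stands for whatever function sends a prompt to a model and returns its text reply, and `fake_complete` is a stub so the example runs without any API. No real provider's client library is shown or implied.

```python
from typing import Callable


def sentiment_prompt(review: str) -> str:
    """Wrap a review in the zero-shot sentiment question."""
    return f"Is this review positive or negative?\n\nReview: {review}\nAnswer:"


def people_prompt(text: str) -> str:
    """Wrap a passage in the zero-shot 'list the people' instruction."""
    return f"List all the people mentioned in this text:\n\n{text}\nPeople:"


def classify_sentiment(review: str, complete: Callable[[str], str]) -> str:
    """Send the prompt to `complete` and normalize the reply to a label."""
    reply = complete(sentiment_prompt(review)).strip().lower()
    if reply.startswith("positive"):
        return "positive"
    if reply.startswith("negative"):
        return "negative"
    return "unknown"  # the model answered in some other form


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call, used only so the sketch runs."""
    return " Positive."


print(classify_sentiment("Loved every minute of it.", fake_complete))
```

The entire "model" for the task is a string template plus a few lines that parse the reply. The `unknown` branch is there because a free-text reply is not guaranteed to start with either label.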
The research community's response was telling. Papers started focusing on what LLMs couldn't do—reasoning, math, staying consistent across long texts. But each new model release fixed half the problems from the last round of criticism. It was like watching someone bail water from a sinking ship with a teaspoon.
Rich Sutton's "Bitter Lesson" essay from 2019 predicted all of this. The history of AI is researchers believing that human knowledge and specialized methods matter, only to be steamrolled by general methods that scale with compute.
Chess fell to deep search. Go fell to self-play and neural networks. Now NLP has fallen to transformers and scale. The pattern is always the same. Clever algorithms work for a while. Then someone throws 10x more compute at a simpler method and wins.
For NLP, the bitter lesson hit harder than most fields. Language was supposed to be special. It required understanding meaning, context, human knowledge. Surely you couldn't just throw parameters at it until it worked?
Turns out you can. And it works better than everything else we tried.
Here's the uncomfortable truth about how LLMs changed NLP research: they solved most NLP problems as a side effect of learning to predict the next token. Not perfectly. Not in ways that satisfy theoretical computer scientists. But well enough that the engineering problems are basically done.
Text classification? Solved. Machine translation? Solved. Question answering? Solved. Summarization? Solved. Not solved like "100% accurate." Solved like "good enough that further improvements don't matter for most applications."
The problems that remain—hallucination, consistency, reasoning—aren't really NLP problems anymore. They're general intelligence problems. And they'll probably be solved the same way everything else was: bigger models, better data, more compute.
This kills traditional NLP research. Why spend years on a 2% improvement to some component when the next GPT version will beat it by 20%? Why develop task-specific architectures when a general model plus prompting works better?
LLMs solved most NLP problems, but they created new ones. How do you control these models? How do you make them reliable? How do you deploy them efficiently? These aren't traditional NLP questions. They're engineering and alignment problems.
The field is splitting into two camps. One group chases scale—bigger models, better training techniques, more efficient architectures. The other group tries to understand and control what we've built. Neither looks much like NLP research from five years ago.
For practitioners, this is mostly good news. You can build applications that would have been impossible before. Real-time translation, human-like chatbots, content generation at scale—all accessible through an API call.
For researchers, it's an identity crisis. If you spent your career becoming an expert in parse trees and dependency grammar, what do you do when a model trained on Reddit comments outperforms your methods? You adapt or you become irrelevant.
The bitter lesson keeps being right because it's not really about AI. It's about humility. Every generation of researchers thinks their clever insights matter more than raw compute and data. Every generation is wrong. NLP just learned this lesson harder than most.
Want to know what's next? Whatever requires 10x more compute than we have today. The details don't matter. They never did.