What AI Can Actually Do Today
The gap between what AI companies claim and what their products reliably deliver is the single most important thing to understand about the field in 2026. Not because the technology is bad — it is genuinely remarkable — but because the marketing has outpaced the engineering at nearly every major lab.
Large language models can generate coherent text across virtually any domain, translate between languages with near-professional quality, summarize complex documents, write functional code in dozens of programming languages, and engage in multi-turn reasoning about novel problems. These capabilities are real and commercially valuable.
What they cannot reliably do: guarantee factual accuracy, perform precise mathematical reasoning without tooling, maintain perfect consistency across long contexts, understand causation rather than correlation, or exercise genuine judgment about ambiguous situations. When a model appears to do these things, it is pattern-matching successfully — not reasoning from first principles.
The test: If a company describes an AI capability without specifying the failure rate, the claim is marketing. Every production AI system has a failure rate. The ones worth using are the ones that tell you what it is.
Image generation has reached photorealistic quality for many subjects but still struggles with hands, text rendering, spatial relationships, and consistency across a series of related images. Video generation is advancing rapidly but remains unreliable for anything requiring physical accuracy. Audio generation — voice cloning, music, sound effects — is arguably the most mature generative domain, with results that are frequently indistinguishable from human-produced content.
How Large Language Models Work
A large language model is a prediction machine. Given a sequence of text, it predicts what comes next. That single capability — next-token prediction — scaled to billions of parameters and trained on trillions of words, produces behavior that looks remarkably like understanding.
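To make that concrete, here is a minimal sketch of autoregressive generation: sample a token from the model's predicted distribution, append it to the context, and repeat. The next_token_distribution function is a hypothetical stand-in for the real model, which would compute this distribution with billions of learned parameters.

```python
import numpy as np

def next_token_distribution(tokens):
    # Hypothetical stand-in for a real model: returns a probability
    # distribution over a toy 1,000-token vocabulary given the context.
    # A production LLM computes this with billions of parameters.
    rng = np.random.default_rng(seed=len(tokens))
    logits = rng.normal(size=1000)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(prompt_tokens, n_new_tokens):
    """Autoregressive generation: sample one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        probs = next_token_distribution(tokens)
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens.append(next_token)
    return tokens

print(generate([17, 42, 7], n_new_tokens=5))
```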
The architecture that makes modern LLMs possible is called the Transformer, published in 2017 in a paper titled "Attention Is All You Need." The key innovation is the attention mechanism — a way for the model to weigh which parts of its input are most relevant to predicting the next output. Before Transformers, models processed text sequentially, word by word. Attention lets the model consider the entire context simultaneously, which is why modern models can maintain coherence across long passages.
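A minimal NumPy sketch of scaled dot-product attention shows the core operation: each position's query is compared against every key, and the resulting weights blend the value vectors. This is a single head with no masking and no learned projection matrices; the full Transformer adds those on top.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as described in "Attention Is All You Need".

    Q, K, V: arrays of shape (sequence_length, d_k).
    Each output position is a weighted average of all value vectors,
    with weights set by how well its query matches every key.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # blend value vectors by attention weight

# Toy example: 4 tokens, one 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```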
Training happens in two phases. Pre-training exposes the model to vast amounts of text — books, websites, code repositories, scientific papers — and teaches it the statistical structure of language. Fine-tuning then shapes the model for specific behaviors: following instructions, refusing harmful requests, maintaining a helpful tone. The pre-trained model knows language; the fine-tuned model knows how to be useful.
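The sketch below illustrates the difference with a masked cross-entropy loss, under one common supervised fine-tuning convention (an assumption here, not any particular lab's recipe): pre-training penalizes the prediction of every next token in raw text, while instruction fine-tuning penalizes only the tokens of the desired response. The arrays are toy placeholders.

```python
import numpy as np

def cross_entropy(logits, targets, loss_mask):
    """Token-level cross-entropy, averaged over positions where loss_mask is 1.

    logits:    (seq, vocab) model predictions for the next token
    targets:   (seq,)       the actual next tokens
    loss_mask: (seq,)       1 where the loss counts, 0 where it is ignored
    """
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    token_losses = -np.log(probs[np.arange(len(targets)), targets])
    return (token_losses * loss_mask).sum() / loss_mask.sum()

seq_len, vocab = 6, 50
logits = np.random.default_rng(1).normal(size=(seq_len, vocab))
targets = np.array([3, 14, 9, 27, 0, 41])

# Pre-training: predict every next token in raw text.
pretrain_mask = np.ones(seq_len)

# Instruction fine-tuning (one common convention): the first 3 tokens are the
# prompt, the rest are the desired response; only the response is penalized.
finetune_mask = np.array([0, 0, 0, 1, 1, 1])

print(cross_entropy(logits, targets, pretrain_mask))
print(cross_entropy(logits, targets, finetune_mask))
```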
Parameters are the adjustable weights inside the network — think of them as knobs that get tuned during training. GPT-4 is estimated to have over a trillion parameters. More parameters generally means more capacity to encode patterns, but the relationship between parameter count and capability is not linear. Architecture, training data quality, and training methodology all matter as much as scale.
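A back-of-envelope calculation shows where those parameters live in a plain decoder-only Transformer: roughly 12·d² weights per layer for the attention and feed-forward blocks, plus the token embedding table. This is an approximation that ignores biases, layer norms, and architectural variations such as mixture-of-experts, so treat the results as order-of-magnitude estimates.

```python
def approx_parameter_count(n_layers, d_model, vocab_size):
    """Back-of-envelope parameter count for a plain decoder-only Transformer.

    Per layer: ~4*d^2 for the attention projections (Q, K, V, output)
               ~8*d^2 for a feed-forward block with a 4x hidden expansion
    Plus vocab_size * d_model for the token embeddings.
    Ignores biases, layer norms, and positional embeddings (small by comparison).
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# GPT-2 small configuration: lands near the published ~124M figure.
print(f"{approx_parameter_count(12, 768, 50257):,}")
# Roughly the GPT-3 175B configuration, for a sense of scale.
print(f"{approx_parameter_count(96, 12288, 50257):,}")
```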
The Scaling Hypothesis
The scaling hypothesis is the most consequential bet in AI: that making models bigger — more parameters, more data, more compute — will continue producing qualitatively new capabilities. This hypothesis drove the investment of hundreds of billions of dollars in GPU clusters, training runs, and power infrastructure.
The evidence is genuinely mixed. Scaling laws — mathematical relationships between compute, data, and model performance — held remarkably well from GPT-2 through GPT-4. Each order-of-magnitude increase in compute produced predictable improvements in benchmark performance and often surprising emergent capabilities that nobody explicitly programmed.
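Those scaling laws typically take the form of a power law in parameters and data. The sketch below uses a Chinchilla-style parametric form; the coefficients are illustrative values loosely based on published fits, not figures from this article, and real fits depend heavily on the data mix and architecture.

```python
def chinchilla_style_loss(n_params, n_tokens,
                          E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric scaling law of the form L = E + A/N^alpha + B/D^beta.

    N = model parameters, D = training tokens, E = irreducible loss.
    Coefficients are illustrative, loosely based on published
    Chinchilla-style fits; real values depend on data and architecture.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Each 10x jump in parameters at fixed data still helps, but the gain
# shrinks as the irreducible term starts to dominate.
for n in (1e9, 1e10, 1e11):
    print(f"{n:.0e} params: predicted loss {chinchilla_style_loss(n, 1e12):.3f}")
```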
But there are signs of diminishing returns. Several labs have reported that their largest training runs produced smaller capability gains than predicted. Benchmark saturation is real — models score so highly on many tests that the tests stop being useful discriminators. And the cost curve is brutal: each generation of frontier model costs roughly 3-5x more to train than the last.
Where TexTak stands: Pure parameter scaling is likely approaching its ceiling for text-only models. The next capability gains will come from architectural innovation (mixture-of-experts, test-time compute), multimodal integration, and better training data — not just bigger runs of the same approach.
What Alignment Research Has and Hasn't Solved
Alignment is the problem of making AI systems do what humans actually want, not just what they were literally told. It is both more solved and more unsolved than most coverage suggests.
What works today: RLHF (reinforcement learning from human feedback) and its variants are remarkably effective at making models helpful, harmless, and honest in typical interactions. Constitutional AI provides a scalable framework for encoding behavioral principles. Instruction following is largely a solved problem for well-specified tasks. The models available in 2026 are dramatically safer and more controllable than those from even two years ago.
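At the core of RLHF is a reward model trained on human preference comparisons. The sketch below shows the pairwise, Bradley-Terry style loss commonly used for that step; reward_model and its feature inputs are hypothetical stand-ins, since the real reward model is itself a language model with a scalar output head.

```python
import numpy as np

def reward_model(response_features):
    # Hypothetical stand-in: in a real pipeline this is a language model
    # with a scalar head, trained on human preference data.
    hypothetical_weights = np.array([0.7, -0.2, 1.1])
    return float(response_features @ hypothetical_weights)

def preference_loss(chosen_features, rejected_features):
    """Pairwise loss: push the reward of the response humans preferred
    above the reward of the one they rejected."""
    margin = reward_model(chosen_features) - reward_model(rejected_features)
    return float(np.log1p(np.exp(-margin)))  # -log(sigmoid(margin)), computed stably

chosen = np.array([0.9, 0.1, 0.8])    # features of the human-preferred answer
rejected = np.array([0.2, 0.7, 0.1])  # features of the rejected answer
print(preference_loss(chosen, rejected))
```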
What remains unsolved: we do not have reliable methods for ensuring that a model's stated goals match its actual optimization target. We cannot verify whether a model that appears aligned is genuinely aligned or has learned to appear aligned because that produces a better training signal. We lack robust techniques for maintaining alignment as models become more capable — the problem gets harder precisely as the stakes get higher.
Mechanistic interpretability — the ability to look inside a model and understand what it is actually computing — is the field most likely to produce a breakthrough. If researchers can reliably read a model's internal representations, they can verify alignment rather than just testing for it. This work is progressing but remains far from production-ready.
The Measurement Problem
AI benchmarks are broken, and the industry knows it. MMLU, HumanEval, GSM8K — the tests that labs use to compare models have become optimization targets rather than measurement tools. When you train specifically to score well on a test, the test stops measuring what it was designed to measure. This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure.
The contamination problem is equally serious. Models trained on internet-scale data have almost certainly seen benchmark questions during training. A model that scores 95% on a test it memorized the answers to is not demonstrating 95% capability — it is demonstrating recall. Some labs have begun using held-out evaluation sets, but the arms race between benchmark creation and data contamination is ongoing.
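Decontamination efforts typically rely on n-gram overlap between benchmark items and the training corpus. The sketch below shows the core heuristic with a 13-word window, one commonly used size; real pipelines add normalization, fuzzy matching, and answer-string checks on top of this.

```python
def ngrams(text, n=13):
    """Lowercased word n-grams; a 13-word window is one commonly used size."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_documents, n=13):
    """Flag a benchmark question if any training document shares a verbatim
    n-gram with it. This is the core idea only; production decontamination
    also normalizes text and checks for paraphrased or answer-only leaks."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_documents)
```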
Newer evaluations like ARC-AGI, SWE-bench, and GPQA attempt to test capabilities that are harder to game — novel reasoning, real-world software engineering, graduate-level domain expertise. These are better signals, but they still reduce complex capability to a single number. The actual question practitioners care about — "will this model work well for my specific use case?" — cannot be answered by any benchmark.
What this means for you: Ignore benchmark leaderboards when choosing a model. Run your actual workload against two or three candidates and measure the results that matter to your use case. The best model on paper is frequently not the best model in practice.
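In practice that means building a small harness of your own: the same real test cases run against each candidate, scored by whatever criterion actually matters to you. In the sketch below, call_model and passes are hypothetical placeholders for your inference call and your success criterion.

```python
def call_model(model_name, prompt):
    # Hypothetical placeholder: swap in whatever API or local inference
    # call you actually use for each candidate model.
    raise NotImplementedError

def passes(output, expected):
    # Placeholder success criterion: replace with whatever "good" means
    # for your workload (exact match, a rubric, a downstream check, ...).
    return expected in output

def compare_models(candidates, test_cases):
    """Run the same real test cases against every candidate and report the
    pass rate that matters to you, not a public leaderboard score."""
    results = {}
    for model in candidates:
        passed = sum(
            passes(call_model(model, case["prompt"]), case["expected"])
            for case in test_cases
        )
        results[model] = passed / len(test_cases)
    return results
```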
Genuine Unknowns
There are questions about AI that nobody has the answer to — not the labs, not the researchers, not the policymakers. Honest engagement with the field requires acknowledging these unknowns rather than pretending confidence where none exists.
Do LLMs understand anything? The "stochastic parrot" debate remains genuinely unresolved. Models produce outputs that are functionally indistinguishable from understanding in many contexts, but whether the internal representations constitute anything like comprehension is an open philosophical and empirical question. The honest answer is: we do not know, and we do not yet have the tools to determine the answer definitively.
Where is the capability ceiling? Nobody knows whether current architectures can produce AGI-level capability or whether a fundamentally different approach is required. Reasonable experts disagree by decades on timelines, which tells you the uncertainty is genuine, not a communication failure.
What happens to labor markets? Historical technology transitions suggest that new jobs emerge to replace displaced ones, but the speed and breadth of AI capability development have no historical precedent. The labor economists who say "it will be fine" and the ones who say "this time is different" are both extrapolating from insufficient data.
Can AI systems become genuinely dangerous? The existential risk debate is polarized and unproductive. The measured position is: current systems are not dangerous in the way science fiction imagines, but the trajectory of capability development creates real risks that deserve serious institutional attention. The risk is not a rogue AI — it is powerful systems deployed faster than our ability to understand their failure modes.