Parsing

What is a parser?

VersionDude

A parser turns a stream of characters into structured data a program can work with - the step between raw text and meaning.

A parser is the component that reads input text and builds a structured representation of it that a program can actually work with. Raw text is just a sequence of characters with no inherent meaning to software; a parser is the step that turns that flat stream into something with shape and hierarchy. It sits at the boundary between human-written or machine-generated text and the structured data the rest of a program needs.

Two stages: tokenize, then build a tree

Many parsers work in two distinct stages. The first stage, a tokenizer or lexer, groups characters into meaningful units - tags, words, numbers, operators, symbols - so that the input is no longer individual characters but a sequence of tokens. The second stage, tree construction, arranges those tokens into a hierarchy according to a grammar, producing a structured tree. Separating the two jobs keeps each one simpler, and the same split shows up in compilers, browsers and data-format libraries alike.
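
The two stages can be sketched in a few lines of Python. This is a minimal illustration for a toy arithmetic language, not how any particular compiler or browser does it; the names tokenize, parse and TOKEN_RE, and the choice of nested tuples for the tree, are assumptions made for the example.

```python
import re

# Hypothetical token pattern: integers, four operators and parentheses.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Stage 1: group characters into (kind, lexeme) tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        lexeme = match.group(1)
        tokens.append(("NUM" if lexeme.isdigit() else "OP", lexeme))
        pos = match.end()
    return tokens

def parse(tokens):
    """Stage 2: arrange tokens into a tree of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():  # expr := term (("+" | "-") term)*
        node = term()
        while peek()[1] in ("+", "-"):
            op = advance()[1]
            node = (op, node, term())
        return node

    def term():  # term := atom (("*" | "/") atom)*
        node = atom()
        while peek()[1] in ("*", "/"):
            op = advance()[1]
            node = (op, node, atom())
        return node

    def atom():  # atom := NUM | "(" expr ")"
        kind, value = advance()
        if kind == "NUM":
            return int(value)
        if value == "(":
            node = expr()
            if advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    tree = expr()
    if peek()[0] != "EOF":
        raise SyntaxError("trailing tokens")
    return tree

tokens = tokenize("1 + 2 * 3")
tree = parse(tokens)
# tree == ("+", 1, ("*", 2, 3))
```

Note that the tokenizer knows nothing about operator precedence and the tree builder never looks at raw characters; each stage only has to solve its own problem.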

Tokenization alone resolves a lot of ambiguity. In an opening tag such as <a href="/docs">, the characters that form the tag name, the attribute name and the quoted value must be recognised as separate things before any structure can be built; the tokenizer is what makes those distinctions. Only once the input is a clean stream of tokens can the tree-construction stage reason about how they nest and relate to one another.
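
To see tokenization on its own, Python's standard-library html.parser module is a convenient stand-in: it reports start tags, text and end tags as events and builds no tree. The sketch below simply records those events; the TokenLogger class and the sample markup are made up for the illustration.

```python
from html.parser import HTMLParser

class TokenLogger(HTMLParser):
    """Records each token the parser reports, without building any tree."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already split apart.
        self.tokens.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))

logger = TokenLogger()
logger.feed('<a href="/docs">Read the docs</a>')
logger.close()

for token in logger.tokens:
    print(token)
# ('start', 'a', [('href', '/docs')])
# ('text', 'Read the docs')
# ('end', 'a')
```

The tag name, the attribute name and the quoted value come out as separate pieces, which is exactly the distinction a later tree-building stage relies on.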

Parsers are everywhere

Parsers turn up everywhere once you start looking. Compilers parse source code into an abstract syntax tree before generating machine instructions; browsers parse HTML and CSS to build the structures they render; and ordinary applications parse JSON, XML, YAML and configuration files whenever they load settings or exchange data. Almost any program that accepts text input has a parser somewhere inside it, even if the author never thinks of it by that name.
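
As a small illustration of how close to hand these parsers are, the Python standard library ships several of them. The snippet below calls two: json.loads for a JSON document and ast.parse for Python source. The input strings are invented for the example.

```python
import ast
import json

# A JSON parser: text in, dictionaries and lists out.
config = json.loads('{"name": "demo", "ports": [80, 443]}')
print(config["ports"][1])  # 443

# A source-code parser: text in, abstract syntax tree out.
tree = ast.parse("total = price * 2")
assign = tree.body[0]
print(type(assign).__name__)           # Assign
print(type(assign.value).__name__)     # BinOp
print(type(assign.value.op).__name__)  # Mult
```

In both calls the caller hands over a string and gets back a structure it can navigate, without ever touching individual characters.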

Why the output is a tree

The output of a parser is typically a tree, and the kind of tree depends on the domain. For a programming language it is an abstract syntax tree that captures the structure of expressions and statements; for a web page it is the DOM, the tree of elements and text nodes the browser builds from HTML. In both cases the tree, not the original text, is what subsequent stages operate on.

The reason a tree is the natural output is that most languages and document formats are inherently nested. An expression contains sub-expressions, an element contains child elements, a configuration block contains nested blocks. A flat list of tokens cannot capture that nesting, but a tree expresses it directly, which is why parsing is so often described as turning a one-dimensional stream into a multi-level structure.
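
The step from a flat token list to a nested structure can be sketched with a stack. In this toy example, "(" and ")" tokens stand in for whatever opens and closes a nested block in a real format; the build_tree function and the sample tokens are assumptions made for the illustration.

```python
def build_tree(tokens):
    """Turn a flat token list with "(" and ")" markers into nested lists."""
    root = []
    stack = [root]  # stack[-1] is the list currently being filled
    for token in tokens:
        if token == "(":
            child = []
            stack[-1].append(child)
            stack.append(child)  # descend one level
        elif token == ")":
            if len(stack) == 1:
                raise SyntaxError("unmatched ')'")
            stack.pop()  # climb back up one level
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise SyntaxError("unclosed '('")
    return root

flat = ["html", "(", "head", "body", "(", "p", ")", ")"]
tree = build_tree(flat)
print(tree)  # ['html', ['head', 'body', ['p']]]
```

The flat list and the nested list contain the same names, but only the nested one records which item sits inside which.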

What makes HTML parsing unusual

What makes HTML parsing unusual among all these examples is its standardised error recovery. Real-world markup is full of mistakes - unclosed tags, misplaced elements, attributes in odd positions - and yet pages still need to display. The HTML specification responds to this by defining exactly how a parser must handle each kind of malformed input, rather than leaving it to each implementation to guess.

The consequence of that precise specification is remarkable interoperability. Because the error-recovery rules are spelled out in detail, browsers that follow the specification build the same DOM tree from the same broken input, instead of each one improvising differently. This is why a malformed page renders consistently across browsers, and it stands in contrast to stricter formats like XML, where a parser must stop with an error when input is not well-formed rather than trying to repair it.
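
The contrast is easy to observe with two standard-library parsers in Python. One caveat matters here: html.parser is only a lenient tokenizer and does not implement the HTML specification's tree-construction or error-recovery rules, so this shows the difference in attitude rather than what a browser actually builds. The sample markup and the TagCollector class are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = "<p>first<p>second"  # two paragraphs, neither one closed

# The XML parser refuses input that is not well-formed.
try:
    ET.fromstring(broken)
    xml_result = "accepted"
except ET.ParseError:
    xml_result = "rejected"

# The lenient HTML tokenizer reports what it can and raises nothing.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

collector = TagCollector()
collector.feed(broken)
collector.close()

print(xml_result)            # rejected
print(collector.start_tags)  # ['p', 'p']
```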

Forgiving versus strict, by design

That contrast highlights a deliberate design trade-off. A forgiving parser like HTML's prioritises resilience, so the user always sees something, at the cost of letting errors pass silently. A strict parser like XML's prioritises correctness and predictability, at the cost of rejecting imperfect input outright. Neither is universally better; each suits the job it was designed for, and recognising which philosophy a format follows tells you a lot about how it behaves.
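
The trade-off can be made concrete with a toy tree builder that supports both philosophies behind one flag. The recovery rules here are invented for the illustration (close any still-open elements when an outer close tag arrives, ignore stray close tags) and are far simpler than the ones the HTML specification defines; the build function and the token format are likewise assumptions.

```python
def build(tokens, strict):
    """Build a tree from ("open" | "text" | "close", value) tokens."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "close":
            open_tags = [node["tag"] for node in stack[1:]]
            if strict:
                # Strict: the close tag must match the innermost open element.
                if not open_tags or open_tags[-1] != value:
                    raise ValueError(f"unexpected </{value}>")
                stack.pop()
            elif value in open_tags:
                # Forgiving: implicitly close everything up to the match.
                while stack.pop()["tag"] != value:
                    pass
            # Forgiving with no match: silently ignore the stray close tag.
    if strict and len(stack) > 1:
        raise ValueError(f"unclosed <{stack[-1]['tag']}>")
    return root

# <div><b>hi</div> : the <b> element is never closed.
tokens = [("open", "div"), ("open", "b"), ("text", "hi"), ("close", "div")]

forgiving = build(tokens, strict=False)
print(forgiving["children"][0]["tag"])  # div

try:
    build(tokens, strict=True)
except ValueError as error:
    print(error)  # unexpected </div>
```

The forgiving call returns a usable tree and says nothing about the missing close tag; the strict call reports the problem and returns nothing at all.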

A mental model for the whole stack

Once you start seeing software as a series of parsers turning text into structure, a great deal of the web stack becomes easier to reason about. The browser parsing HTML into the DOM, the engine parsing JavaScript into a syntax tree, the server parsing a JSON request body - they are all the same idea applied in different places. That single lens, of text becoming structured data through a parser, is one of the most useful mental models in all of programming.
