How does \expandafter work: An introduction to TeX tokens
Background for \expandafter: TeX tokens and token lists
As a first step towards understanding how \expandafter really works, we’ll take a look at two components of TeX that are fundamental to its operation: TeX tokens (integer numbers) and token lists (lists of integers). Readers who would like to explore those topics in much more detail may be interested in the related articles published by Overleaf.
Where did the token data come from?
Throughout this article we use actual token values calculated by TeX—data that is not usually accessible to users. For readers curious to know how this token-value data was obtained, Overleaf has custom builds of several TeX engines which we use for research. Those engines are modified to output information on TeX’s inner processing activities—helping to provide additional background material for some of the articles we produce. By showing/discussing numerical token values, our aim is to include details which, hopefully, help readers to better understand “TeX tokens”, making this important concept feel a little less opaque.
TeX Tokens 101 (and notions of expansion)
When TeX processes your input file it reads the text and converts individual characters and sequences of characters (commands) into so-called tokens. A TeX token is simply an integer value, calculated by TeX, which is used to “encode” data TeX needs to store about an item read-in from its current input source. Think of tokens as small parcels of information which “package together” data that TeX needs to record, ready for passing on to the next stage of processing. Internally, TeX operates on those integer token values—it does not use the actual letters, symbols, digits etc. originally contained in your input file: everything is converted to a token (an integer) and TeX works with those.
How TeX calculates token values
Here we look at the token calculations used in Knuth’s original TeX, e-TeX and pdfTeX. Other TeX engines, particularly XeTeX and LuaTeX, need slightly different calculations to account for the use of Unicode, but their methods are similar to those described below.
Character tokens (non-active characters)
Calculation of token values for non-active characters is straightforward:
\[\text{character token} = 256\times \text{(category code)} + \text{character (ASCII) code}\]
Example: The letter A with category code 11, character code 65 is represented by TeX as the character token value \(256\times 11 + 65 = 2881\).
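The arithmetic above can be reproduced from inside TeX itself. The following is a minimal plain TeX sketch, assuming an engine with e-TeX extensions (such as pdfTeX) so that \numexpr is available. It does not read TeX's internal token value; it only repeats the calculation using the category code that TeX currently assigns to A.

```tex
% Plain TeX sketch; \numexpr needs e-TeX extensions (e.g. pdftex).
% character token = 256 * (category code) + (character code)
\message{Token for A: \number\numexpr 256*\catcode`A + `A\relax}
\bye
```

With the default category code of 11 for letters, the message written to the terminal and log should read "Token for A: 2881".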
You might encounter descriptions in TeX literature noting that once TeX has input a character, its category code value becomes “permanently bound” to that character: the above token value calculation shows why that is true. However, later in TeX’s processing it can, and does, “unpackage” character tokens to reveal the constituent (character code, category code) pair from which the token was constructed—when TeX does that “unpackaging” it still won’t alter that character’s category code, it merely uses that information during its subsequent processing.
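The "unpackaging" step is simply integer division by 256 plus a remainder. A minimal plain TeX sketch follows; the register names \catpart and \charpart are hypothetical, introduced only for this illustration, and the code mimics the arithmetic rather than accessing TeX's internals.

```tex
% Plain TeX sketch: recover (category code, character code) from 2881.
% \catpart and \charpart are illustrative names, not part of TeX.
\newcount\catpart
\newcount\charpart
\catpart=2881
\divide\catpart by 256        % integer division: 2881 div 256 = 11
\charpart=\catpart
\multiply\charpart by -256    % -(11 * 256) = -2816
\advance\charpart by 2881     % 2881 - 2816 = 65
\message{category code \the\catpart, character code \the\charpart}
\bye
```

The message reports category code 11 and character code 65, the pair from which the token for the letter A was built.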
Command tokens
TeX’s input processing and token generation recognize two types of command:
- commands constructed from one or more characters that have category code 11;
- single-character commands where that character’s category code is not 11, such as \$ or \#.
In both cases, TeX excludes the leading \ character and uses the character code of each remaining character to calculate an integer that TeX calls curcs (current control sequence). TeX then uses the value of curcs to calculate a token value for the command.
Commands made from characters with category code 11
Suppose our command (minus the leading \ character) is composed of a sequence of characters: \(\mathrm{C_1C_2C_3...C_N}\) where \(\mathrm{C}_i\) is the character code of each character; for example, the character code of A is 65. TeX uses all of the character codes \(\mathrm{C}_i\) to calculate the integer curcs (using a hash function). Once TeX has calculated the value of curcs it simply adds 4095 to that value to give the token value:
\[\text{command token} = \text{curcs + 4095}\]
Note that the variable curcs plays an extremely important role in TeX’s inner processing activities.
Single-character commands
Tokens to represent commands such as \$, \# etc. are subject to a slightly different calculation: the integer curcs is now given by the simpler formula:
\[\text{curcs} = 257 + \text{character (ASCII) code}\]
For example, with \$, \(\text{curcs}=257 + 36 = 293\). TeX again adds 4095 to this value (using \(\text{command token} = \text{curcs} + 4095\)) resulting in \$ having a token value \(293 + 4095 = 4388\).
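As a quick check of this calculation, here is a minimal plain TeX sketch, again assuming an e-TeX-enabled engine for \numexpr. The notation `\$ is TeX's standard way of obtaining the character code of $, so the code only repeats the arithmetic from the formulas above.

```tex
% Plain TeX sketch; \numexpr needs e-TeX extensions (e.g. pdftex).
% curcs = 257 + character code; command token = curcs + 4095.
\message{curcs for \string\$: \number\numexpr 257 + `\$\relax}
\message{token for \string\$: \number\numexpr 257 + `\$ + 4095\relax}
\bye
```

The two messages should report 293 and 4388 respectively, matching the values derived in the text.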
Compared to commands composed of characters with category code 11, the only difference here is the way that TeX calculates the value for curcs.
Note: the integer curcs is not calculated for character tokens: it is always set to 0 when TeX is creating, or working with, character tokens.
Active-character tokens
TeX has the concept of so-called active characters: any character assigned to have category code 13. Tokens for this special class of characters are subject to a different calculation compared to regular characters.
The active-character mechanism allows TeX to create what are, in effect, single-character macros that you can use without having to prefix the active character with an escape character (typically \): the isolated character will trigger its macro behaviour. The canonical example is the tilde character (~) that TeX/LaTeX use for non-breaking spaces, which can be defined/enabled as follows:
\catcode`~=13 % assign category code 13 to ~
\def~{\penalty100000\ } % define ~ to act as a macro
When TeX subsequently reads a ~ character it will detect its category code is 13 and process it as a “mini macro”. To calculate a token representing an active character TeX applies another variation for calculating curcs:
\[ \begin{align*} \text{curcs} &= \text{character code} + 1\\ \text{active character token} &= \text{curcs} + 4095\\ \end{align*} \]
For example, the ~ character has character code 126, meaning its active-character token value representation is calculated as follows:
\[ \begin{align*} \text{curcs} &= 126 + 1\\ \text{active character token} &= 127 + 4095\\ &=4222\\ \end{align*} \]
Note that, like commands, tokens representing active characters are > 4095.
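The same check can be made for the active tilde. This is a minimal plain TeX sketch, assuming an e-TeX-enabled engine and the plain TeX format, which already makes ~ active; it repeats the arithmetic rather than inspecting TeX's internal token.

```tex
% Plain TeX sketch; \numexpr needs e-TeX extensions (e.g. pdftex).
% Plain TeX already makes ~ active, so its category code should be 13.
\message{catcode of \string~: \the\catcode`\~}
% curcs = character code + 1; active-character token = curcs + 4095.
\message{token for \string~: \number\numexpr `\~ + 1 + 4095\relax}
\bye
```

The second message should report 4222, in agreement with the worked calculation above.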
Consequences/notes
- Any token whose value exceeds 4095 is immediately identifiable as a command token—hence TeX can very easily detect whether a particular token represents a character or a command.
- For any token value, TeX can, when it needs to, “unpackage” that token to reveal the character (and its category code), or the command, originally present in your .tex file, stored in a macro definition or contained in some other token list.
- The “intermediate” quantity called curcs, which TeX uses to calculate command token values, plays an important role in TeX’s low-level processing. curcs acts as an “index value” that TeX uses to store/lookup the current meaning of a command. Given any command token, \(\mathrm{T}\), TeX simply subtracts 4095 to access the value of curcs: \[\text{curcs} = \mathrm{T}-4095\]
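The test "is this token value greater than 4095?" is easy to mimic. The sketch below is plain TeX and assumes an e-TeX-enabled engine; the macro name \classify is hypothetical and exists only for this illustration. It applies the threshold to a number you supply and, for command tokens, recovers curcs by subtracting 4095.

```tex
% Plain TeX sketch; \numexpr needs e-TeX extensions (e.g. pdftex).
% \classify is an illustrative macro, not something TeX provides.
\def\classify#1{%
  \ifnum#1>4095
    command token with curcs = \number\numexpr#1-4095\relax
  \else
    character token
  \fi}
\message{7840: \classify{7840}}
\message{2881: \classify{2881}}
\bye
```

For 7840 the macro reports a command token with curcs equal to 3745; for 2881 it reports a character token.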
Incidentally, TeX does store the human-readable string of characters from which a command token is generated: this is essential for error reporting and for commands such as \string, whose expansion is the human-readable version of a token value. However, those human-readable strings of characters stored inside TeX are only used/output when requested: for all other processing the token integer value is used.
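A short plain TeX sketch of \string in action: it asks TeX to write out the stored names of a multi-letter command, a single-character command and an active character. The command \hello does not need to be defined for \string to work on it.

```tex
% Plain TeX sketch: \string turns a token back into readable characters.
\message{[\string\hello] [\string\$] [\string~]}
\bye
```

The terminal and log should show the three names as ordinary characters: [\hello] [\$] [~].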
Looking at some real tokens
Just to make the notion of tokens feel a little less opaque, we’ll define the following simple macro and take a look at the tokens TeX produces:
\def\hello{Greetings, from \TeX. \hskip 10pt}
For the \hello macro, TeX uses the characters h, e, l, l, o to calculate a value of 3745 for curcs; TeX then adds 4095 to create a token value of \(3745 + 4095 = 7840\) (for Knuth’s TeX, e-TeX or pdfTeX).
After creating a token to represent \hello, the \def command causes TeX to read the subsequent tokens and use them to create a token list which is stored as the definition of the \hello command. That stored definition (token list) can then be retrieved whenever you tell TeX to use the \hello command.
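Although the integer token values themselves are not normally visible, standard TeX can display a stored definition in readable form. The following plain TeX sketch uses the \meaning primitive to write the token list stored for \hello to the terminal and log.

```tex
% Plain TeX sketch: inspect the token list stored as the meaning of \hello.
\def\hello{Greetings, from \TeX. \hskip 10pt}
\message{\meaning\hello}
\bye
```

The output begins with "macro:->" followed by a text rendering of the stored tokens. Note that \TeX appears by name: the definition stores the single command token, not the tokens of the \TeX macro itself.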
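The \hello macro definition is stored in TeX’s memory as a list of tokens (integers), held in a data structure known as a linked list. Readers wishing to understand token lists in more detail are referred to the Overleaf article What is a TeX token list? The token values created for each item (character, macro or primitive) in the definition are listed below, with the item each token represents shown in parentheses:
2887 (G), 2930 (r), 2917 (e), 2917 (e), 2932 (t), 2921 (i), 2926 (n), 2919 (g), 2931 (s), 3116 (,), 2592 (<space>), 2918 (f), 2930 (r), 2927 (o), 2925 (m), 2592 (<space>), 5235 (\TeX), 3118 (.), 2592 (<space>), 7943 (\hskip), 3121 (1), 3120 (0), 2928 (p), 2932 (t)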
In the token list above, the characters have category codes of 10, 11 or 12. For example:
- <space> characters have category code 10 and character code 32, giving a token value of \(256\times 10 + 32 = 2592\)
- , and . have category code 12 and character codes 44 and 46 respectively, giving tokens:
  - token for , \(= 256 \times 12 + 44 = 3116\)
  - token for . \(= 256\times 12+ 46 = 3118\)
Whenever TeX subsequently encounters the token value 7840 (representing \hello) it can, if required, “unpackage” that token to extract curcs through the simple calculation \(\text{curcs} = \text{token value} - 4095\) (see above). Using the value of curcs TeX can consult its inner data tables to determine that command token 7840 represents a macro command. In addition, again via curcs, TeX can also look up and retrieve the stored definition of \hello.
When TeX needs to fully process token 7840, i.e., to run the \hello macro, it no longer needs token 7840: that token has done its job by triggering TeX to run the macro \hello. TeX can now discard token 7840 and fetch the tokens which represent the definition (token list) stored in memory. In effect, the \hello macro command (token 7840) has been removed from TeX’s current input source and replaced by tokens contained in the definition of \hello. What we have just described is one form of token expansion.
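This single replacement step is exactly what \expandafter lets you trigger ahead of time. The plain TeX sketch below uses a common idiom to expand \hello once before \def reads its replacement text; the macro name \onestep is hypothetical and chosen only for this example.

```tex
% Plain TeX sketch: expand \hello exactly once, then store the result.
% \onestep is an illustrative name, not part of TeX.
\def\hello{Greetings, from \TeX. \hskip 10pt}
% Each \expandafter skips one token and expands the token after it,
% so \hello is replaced by its definition before \def starts work.
\expandafter\def\expandafter\onestep\expandafter{\hello}
\message{\meaning\onestep}
\bye
```

The meaning reported for \onestep should be the same as that of \hello. In particular, \TeX is still present as a single unexpanded command, because only one expansion step was performed.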
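The \TeX command (token value 5235) used within \hello is itself a macro constructed from more tokens, so its definition is also stored as a token list:
2900 (T), 19598 (\kern), 3117 (-), 3118 (.), 3121 (1), 3126 (6), 3126 (6), 3127 (7), 2917 (e), 2925 (m), 19597 (\lower), 3118 (.), 3125 (5), 2917 (e), 2936 (x), 6175 (\hbox), 379 ({), 2885 (E), 637 (}), 19598 (\kern), 3117 (-), 3118 (.), 3121 (1), 3122 (2), 3125 (5), 2917 (e), 2925 (m), 2904 (X)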
If we were to replace the \hello command with the complete list of tokens from which it is built, including the expansion of the \TeX macro, the result would be a rather long list. Essentially, the single token value 7840 (for \hello) would, when fully expanded, produce a total of 51 tokens (integers) representing characters and primitive commands. In the following list the character or command represented by each token is enclosed in parentheses “(...)”; these annotations are not stored in TeX’s token lists and are shown only to assist the reader:
2887 (G), 2930 (r), 2917 (e), 2917 (e), 2932 (t), 2921 (i), 2926 (n), 2919 (g), 2931 (s), 3116 (,), 2592 (<space>), 2918 (f), 2930 (r), 2927 (o), 2925 (m), 2592 (<space>), 2900 (T), 19598 (\kern), 3117 (-), 3118 (.), 3121 (1), 3126 (6), 3126 (6), 3127 (7), 2917 (e), 2925 (m), 19597 (\lower), 3118 (.), 3125 (5), 2917 (e), 2936 (x), 6175 (\hbox), 379 ({), 2885 (E), 637 (}), 19598 (\kern), 3117 (-), 3118 (.), 3121 (1), 3122 (2), 3125 (5), 2917 (e), 2925 (m), 2904 (X), 3118 (.), 2592 (<space>), 7943 (\hskip), 3121 (1), 3120 (0), 2928 (p), 2932 (t)
To a human reader this is just a series of integers but to TeX it encodes a great deal of information.
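A readable view of this fully expanded list can be obtained with \edef, which expands macros until only characters and unexpandable primitives remain. The sketch below assumes the plain TeX format, where \TeX is an ordinary macro; under LaTeX, \TeX is defined differently and the result would not match. The name \full is hypothetical.

```tex
% Plain TeX sketch: fully expand \hello and inspect the result.
% \full is an illustrative name; assumes plain TeX's definition of \TeX.
\def\hello{Greetings, from \TeX. \hskip 10pt}
\edef\full{\hello}
\message{\meaning\full}
\bye
```

The reported meaning should contain \kern, \lower and \hbox in place of \TeX, corresponding to the primitive command tokens in the list above.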
Read tokens now and save them for later
As TeX reads your input there may be times when it needs (or is instructed) to delay fully processing some particular set of tokens. If directed to do so, TeX will, until it is told to stop, continue to create tokens from the input but store them for use later on—subsequently retrieving and processing them as part of its typesetting activities. Those stored tokens are saved as so-called token lists which are, in effect, TeX’s only (internal) token-data storage mechanism.
We’ve already seen examples of token lists in the \hello and \TeX macros above: the definitions of those macros are stored in TeX’s memory as lists of tokens. TeX will only process (action) such token lists when you decide to call those macros. Remember too that each token (integer value) encodes sufficient information for TeX to easily work out whether each token stored in a macro definition represents a character or a command.
Saving tokens with token registers
Another example of token storage is the explicit creation of lists of tokens that are saved in so-called token registers: dedicated internal storage areas that TeX provides for users to store token lists. The TeX primitive \toksdef is one way to use token registers; for example, to use token register 100 and reference it using the command \mylist:
\toksdef\mylist=100
\mylist={some \TeX{} tokens here}
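\mylist is, in effect, just a name that you assign to a list of tokens stored in register location 100. Similar to a macro definition, \mylist contains the following token list:
2931 (s), 2927 (o), 2925 (m), 2917 (e), 2592 (<space>), 5235 (\TeX), 379 ({), 637 (}), 2592 (<space>), 2932 (t), 2927 (o), 2923 (k), 2917 (e), 2926 (n), 2931 (s), 2592 (<space>), 2920 (h), 2917 (e), 2930 (r), 2917 (e)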
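Note: to terminate the \TeX macro and prevent it from absorbing the following <space> character we used a pair of braces {} immediately after \TeX; the tokens for { (379) and } (637) are stored in the token list. Another option is to use a “control space” token \<space>, which would replace the brace pair and the following <space> in the token list:
2931 (s), 2927 (o), 2925 (m), 2917 (e), 2592 (<space>), 5235 (\TeX), 4384 (\<space>), 2932 (t), 2927 (o), 2923 (k), 2917 (e), 2926 (n), 2931 (s), 2592 (<space>), 2920 (h), 2917 (e), 2930 (r), 2917 (e)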
Note that the <space> character is represented as a character token with value \(256\times 10 + 32 = 2592 \) but \<space> is treated as a single-character command token (value 4384) which is calculated using the formulae given above:
\[ \begin{align*} \text{curcs} & = 257 + \text{character (ASCII) code}\\ & = 257 + 32\\ &=289\\ \text{command token for } \backslash\langle\text{space}\rangle & = \text{curcs} + 4095\\ & = 289+4095\\ &=4384\\ \end{align*} \]
In essence \mylist={some \TeX{} tokens here} says to TeX: please scan my input file to convert the following characters/commands to tokens and save them for use later on. TeX will oblige and store those tokens in a memory location you can access by writing \the\mylist, instructing TeX to insert a copy of the tokens contained in token register \mylist. TeX engines include a number of primitive commands that explicitly generate and store token lists, such as \everyjob, \everypar, \mark, and many others.
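Putting the pieces together, here is a minimal plain TeX sketch that stores the tokens in register 100 and then retrieves them twice: once to write them to the terminal and log, and once to typeset them.

```tex
% Plain TeX sketch: store tokens in a register, then retrieve copies.
\toksdef\mylist=100
\mylist={some \TeX{} tokens here}
\message{\the\mylist}  % writes a readable copy of the stored tokens
\the\mylist            % inserts a copy of the tokens for typesetting
\bye
```

In the \message output \TeX appears by name, because tokens delivered by \the from a token register are not expanded further in that context. The second \the\mylist feeds the same tokens to TeX's normal processing, where \TeX is expanded and typeset.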