BCP 47, ISO 639 & HTML lang Best Practices
Language codes are the unsung heroes behind every accessible, SEO‑friendly, and correctly rendered multilingual website. By telling browsers, search engines, and assistive technologies exactly which human language and regional variant your text uses, you enable proper hyphenation, screen‑reader pronunciation, and font selection, and you help search engines match pages to the right audience. A simple `<html lang="en">` can improve readability for users with dyslexia, help search engines associate the page with the correct regional market, and prevent mismatches where French text is read aloud as English.
However, modern localization demands go far beyond two‑letter abbreviations. Sites that serve both French and English in Canada (`fr-CA` vs `en-CA`), support both Traditional and Simplified Chinese (`zh-Hant` vs `zh-Hans`), or cater to Spanish‑speaking Latin America (`es-419`) need precise tagging. And when you layer in script subtags (`sr-Latn`) or extensions such as phonebook collation (`de-DE-u-co-phonebk`), the syntax becomes richer but more complex.
In this comprehensive guide, we’ll explore:
- Why language codes matter for SEO, accessibility, and localization tooling.
- ISO 639 two‑ vs three‑letter codes, when to use each.
- BCP 47 structure: building full tags with language, script, region, variants, and extensions.
- Script (ISO 15924), region (ISO 3166), variant & extension subtags explained.
- Practical HTML usage: `lang`, `xml:lang`, inheritance, and proper casing.
- Best practices & pitfalls: the “golden rules,” `und`/`mul`/`zxx`, and avoiding common errors.
- Language codes in HTTP & SEO: `Accept-Language`, `hreflang`, and `x-default`.
- Automation & tooling for validation and runtime locale detection.
- FAQs addressing everyday questions like when to include scripts or region subtags.
- Conclusion: auditing, migrating, and maintaining accurate language tagging.
Armed with this knowledge, you’ll be able to implement robust, future‑proof localization on any web project, ensuring content reaches the right audience in the right way.
1. What Are Language Codes & Why They Matter
Language codes are standardized identifiers (strings like `en`, `fr-CA`, or `zh-Hant-TW`) that indicate the language, script, region, and optional variant of text. They appear in:
- The HTML `<html lang="…">` attribute
- `<p lang="…">` or other elements for nested content
- HTTP headers, e.g., `Accept-Language: en-US,en;q=0.9,fr-CA;q=0.8`
- `<link rel="alternate" hreflang="…">` tags for SEO
- Localization files and translation workflows
Why They Matter
- Accessibility
  - Screen readers use `lang` to choose pronunciation rules (e.g., French nasal vowels vs English).
- Typography & Hyphenation
  - Browser text shaping engines apply the correct line breaks and hyphens per language.
- SEO & Regional Targeting
  - Search engines use `hreflang`, and to varying degrees `lang`, to match content to specific markets (e.g., Switzerland vs France).
- Localization Tooling
  - Build systems and translation platforms match locale codes to resource bundles (e.g., `de-CH` for German as used in Switzerland).
Without precise codes, text may be mispronounced (“bonjour” spoken with English phonemes), search engines may show the wrong country version, and translators may assign wrong language files.
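As a rough illustration of how tooling can match a locale code to a resource bundle, the sketch below tries progressively shorter tags until it finds one. The `bundles` table and the helper names `fallbackChain` and `pickBundle` are hypothetical, not taken from any particular framework.

```javascript
// Hypothetical helper: list the tags to try, from most to least specific.
function fallbackChain(tag) {
  const parts = tag.split('-');
  const chain = [];
  for (let end = parts.length; end > 0; end--) {
    // Skip candidates that would end in a dangling singleton such as "u" or "x".
    if (parts[end - 1].length === 1) continue;
    chain.push(parts.slice(0, end).join('-'));
  }
  return chain;
}

// Hypothetical bundle table; a real project would load these from files.
const bundles = {
  de: { greeting: 'Guten Tag' },
  'de-CH': { greeting: 'Grüezi' },
};

// Return the most specific bundle available, or the default bundle.
function pickBundle(tag, defaultTag = 'de') {
  const match = fallbackChain(tag).find((candidate) => candidate in bundles);
  return bundles[match ?? defaultTag];
}

console.log(fallbackChain('zh-Hant-TW')); // most specific tag first
console.log(pickBundle('de-AT').greeting); // falls back to the plain "de" bundle
```

Truncating from the right mirrors the common "lookup" style of matching, so a missing `de-AT` bundle degrades to `de` rather than to an unrelated language.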
2. ISO 639: Two vs Three‑Letter Codes
The ISO 639 family defines the core language identifiers:
ISO 639‑1 (Two‑Letter Codes)
Code | Language | Notes |
---|---|---|
en | English | Widely supported |
fr | French | |
de | German | |
es | Spanish | |
zh | Chinese | Macrolanguage; add a script or region subtag to disambiguate |
- Use Case: Common languages in browser UIs, HTML `lang`, and HTTP headers.
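ISO 639‑2/T and ‑B (Three‑Letter Codes)
- Terminology (T) vs Bibliographic (B) variants exist for about 20 languages (e.g., German `deu` vs `ger`, French `fra` vs `fre`). BCP 47 uses only the T form, and only when no two‑letter code exists.
- Covers ~500 languages and language groups, including historical and regional ones.
Code | Language | Notes |
---|---|---|
ara | Arabic | Two‑letter ar exists and is preferred in tags |
grc | Ancient Greek | No two‑letter code |
roh | Romansh | Two‑letter rm exists and is preferred in tags |
ISO 639‑3 (Extended)
- Extends the individual‑language codes of ISO 639‑2 to ~7,000 languages, adding codes such as `arb` (Standard Arabic) and `yue` (Cantonese) that ISO 639‑2 lacks.
- Rarely needed in HTML, but common in linguistic research.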
When to Use Three‑Letter Codes
- No Two‑Letter Equivalent: e.g., `tzm` for Central Atlas Tamazight.
- Private or Unlisted Languages: the range `qaa`–`qtz` is reserved for local use.
- Interoperability: Some APIs or archives require ISO 639‑3.
Fallback Strategy
- Always prefer ISO 639‑1 if available; BCP 47 requires the two‑letter code whenever one exists.
- If not, use the ISO 639‑2/T code.
- Use an ISO 639‑3 code only for languages that have neither.
<!-- Two-letter -->
<html lang="en">English page</html>
<!-- Three-letter, for Tamazight -->
<html lang="tzm">ⵜⴰⵎⴰⵣⵉⵖⵜ</html>
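The fallback strategy above can be sketched as a small lookup that swaps a three‑letter code for its two‑letter equivalent when one exists. The `THREE_TO_TWO` table is a tiny illustrative subset assumed for this example, and `htmlLanguageSubtag` is a hypothetical helper; the IANA Language Subtag Registry remains the authority.

```javascript
// Illustrative subset only; includes both T and B forms where they differ.
const THREE_TO_TWO = {
  eng: 'en',
  fra: 'fr', fre: 'fr',
  deu: 'de', ger: 'de',
  spa: 'es',
  zho: 'zh', chi: 'zh',
};

// Hypothetical helper: return the shortest code known for a language.
function htmlLanguageSubtag(isoCode) {
  const code = isoCode.toLowerCase();
  if (code.length === 2) return code; // already ISO 639-1
  return THREE_TO_TWO[code] ?? code;  // keep the three-letter code if no short form is known
}

console.log(htmlLanguageSubtag('ger')); // two-letter form for German
console.log(htmlLanguageSubtag('tzm')); // stays three-letter: no ISO 639-1 code
```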
3. BCP 47: Building Full Language Tags
BCP 47 (Best Current Practice 47) is the IETF’s standard for language tags, currently defined by RFC 5646 and RFC 4647. A tag combines multiple subtags in a fixed order:
[language]-[extlang]-[script]-[region]-[variant]-[extension]-[privateuse]
Core Components
Subtag | Examples | Purpose |
---|---|---|
language | en, fr, sr, zh | ISO 639 code |
extlang | zh-yue (rare) | Extended language subtag; the standalone language subtag (yue) is preferred |
script | Latn, Cyrl, Hans | ISO 15924 script codes |
region | US, GB, CA, 419 | ISO 3166‑1 country codes or UN M49 region |
variant | valencia, polyton | Registered variants (e.g., Valencian, polytonic Greek) |
extension | u-co-phonebk | Single‑letter extensions: Unicode locale (u) and transformed content (t) |
privateuse | x-internal, x-pig | Private‑use subtags (each at most 8 characters) |
Examples
en → English (generic)
fr-CA → Canadian French
zh-Hans → Simplified Chinese (generic region)
zh-Hant-TW → Traditional Chinese, Taiwan
sr-Latn → Serbian in Latin script
es-419 → Latin American Spanish (UN M49)
de-DE-u-co-phonebk → German (Germany), phonebook collation
de-DE-x-internal → German (Germany) plus a private‑use subtag (at most 8 characters; meaning set by private agreement)
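To see how a tag decomposes into its subtags, the standard `Intl.Locale` object can parse it. This is a minimal sketch; `describeTag` is a hypothetical helper name.

```javascript
// Split a BCP 47 tag into its main components using the built-in Intl.Locale.
function describeTag(tag) {
  const locale = new Intl.Locale(tag);
  return {
    language: locale.language,
    script: locale.script, // undefined when the tag has no script subtag
    region: locale.region, // undefined when the tag has no region subtag
  };
}

console.log(describeTag('zh-Hant-TW'));
console.log(describeTag('es-419'));
console.log(describeTag('sr-Latn'));
```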
Building Tags
- Language (required): `en`
- Script (optional; add when the language is written in more than one script): `zh-Hans`
- Region (optional; add when regional differences matter): `fr-CA`
- Variants & Extensions (advanced): `de-DE-u-ca-gregory` (calendar = Gregorian)
<html lang="sr-Cyrl-RS">…</html> <!-- Serbian, Cyrillic, Serbia -->
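Going the other way, a tag can be assembled from its parts. The sketch below leans on `Intl.Locale` to put the subtags in the right order; `buildTag` is a hypothetical helper and assumes the inputs are already valid subtags.

```javascript
// Assemble a tag from components; omitted components are simply left out.
function buildTag({ language, script, region }) {
  return new Intl.Locale(language, { script, region }).toString();
}

console.log(buildTag({ language: 'sr', script: 'Cyrl', region: 'RS' }));
console.log(buildTag({ language: 'fr', region: 'CA' }));
console.log(buildTag({ language: 'en' }));
```

`Intl.Locale` throws a `RangeError` for structurally invalid input, which makes this a convenient place to catch typos early.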
Private‑Use and Extensions
- Extensions (single-letter):
  - `u-` for Unicode locale attributes (e.g., `u-nu-latn`).
  - `t-` for transformed content.
- Private-Use (`x-`):
  - Not standardized; used for organization‑specific tags.
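As a small illustration of the `u-` extension, `Intl.Locale` exposes the common Unicode keywords as properties. `readExtensions` is a hypothetical helper; the property names are the standard ones.

```javascript
// Read the Unicode extension keywords carried by a tag.
function readExtensions(tag) {
  const locale = new Intl.Locale(tag);
  return {
    baseName: locale.baseName,               // the tag without extensions
    collation: locale.collation,             // from u-co-*
    numberingSystem: locale.numberingSystem, // from u-nu-*
    calendar: locale.calendar,               // from u-ca-*
  };
}

console.log(readExtensions('de-DE-u-co-phonebk'));
console.log(readExtensions('ar-EG-u-nu-latn'));
```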
4. Script, Region, Variant & Extensions Explained
Script Subtags (ISO 15924)
Code | Script |
---|---|
Latn | Latin |
Cyrl | Cyrillic |
Hans | Han (Simplified) |
Hant | Han (Traditional) |
- Capitalize the first letter (`Latn`) by convention, though browsers are case‑insensitive.
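When a tag omits the script, runtimes can often infer the likely one from the language and region. The sketch below uses `Intl.Locale.prototype.maximize()`, which relies on the likely‑subtags data shipped with the JavaScript engine; `likelyScript` is a hypothetical helper name.

```javascript
// Ask the runtime which script it considers most likely for a tag.
function likelyScript(tag) {
  return new Intl.Locale(tag).maximize().script;
}

console.log(likelyScript('zh-TW')); // Traditional Han is inferred from the region
console.log(likelyScript('zh-CN')); // Simplified Han is inferred from the region
console.log(likelyScript('sr'));
```

An inferred script is only a default, which is why spelling it out (for example `zh-Hant`) is safer when the distinction matters.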
Region Subtags (ISO 3166‑1 & UN M49)
Code | Meaning |
---|---|
US | United States |
GB | United Kingdom |
CA | Canada |
419 | Latin America & Caribbean (UN M49) |
- Numeric codes (`419`) cover supra‑national regions.
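To check what a region subtag stands for, `Intl.DisplayNames` can turn both the two‑letter and the numeric codes into readable names. This is a minimal sketch; `regionLabel` is a hypothetical helper, and the exact wording of names comes from the runtime's locale data.

```javascript
// English display names for region subtags (ISO 3166-1 and UN M49 alike).
const regionNames = new Intl.DisplayNames(['en'], { type: 'region' });

function regionLabel(code) {
  return regionNames.of(code);
}

for (const code of ['US', 'GB', 'CA', '419']) {
  console.log(code, '=>', regionLabel(code));
}
```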
Variant Subtags
Variant | Description |
---|---|
valencia | Valencian dialect |
polyton | Greek polytonic orthography |
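- Lowercase by convention; 5–8 alphanumeric characters, or 4 characters when the first is a digit (e.g., `1996` for the German orthography reform of 1996).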
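The shape rule for variant subtags is easy to express as a regular expression. The sketch below checks shape only and does not confirm that a variant is actually registered; `looksLikeVariant` is a hypothetical helper.

```javascript
// 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
const VARIANT_SHAPE = /^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/i;

function looksLikeVariant(subtag) {
  return VARIANT_SHAPE.test(subtag);
}

console.log(looksLikeVariant('valencia')); // true
console.log(looksLikeVariant('1996'));     // true
console.log(looksLikeVariant('Latn'));     // false: four letters is a script subtag
```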
Extension Subtags
- Unicode extensions (`u-`):
  - Collation: `u-co-phonebk`
  - Numbering system: `u-nu-latn`
- Transform extensions (`t-`): Transliteration, etc.
- Private-use (`x-`): Implementation‑specific tags.
<p lang="de-DE-u-co-phonebk">…</p>
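In HTML the collation keyword has no visible effect, but script that sorts text can honour it. The sketch below sorts the same words with the default German collation and with phonebook collation; the word list and `sortWith` helper are assumptions for this example, and the resulting order depends on the collation data bundled with the runtime.

```javascript
const words = ['Afrika', 'Ärzte', 'Aeon'];

// Sort a copy of the list using the collation implied by the tag.
function sortWith(tag, list) {
  return [...list].sort(new Intl.Collator(tag).compare);
}

const standard = sortWith('de', words);
const phonebook = sortWith('de-u-co-phonebk', words);

console.log('standard: ', standard.join(', '));
console.log('phonebook:', phonebook.join(', '));
```

Phonebook collation is designed to treat umlauts like their two‑letter expansions (`Ä` as `AE`), so the position of `Ärzte` may differ between the two lists.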
5. Practical HTML Usage (lang Attribute)
Declaring on <html>
<!DOCTYPE html>
<html lang="en">
<head>…</head>
<body>…</body>
</html>
- Applies to the entire document.
- Is independent of the character encoding; declare the encoding separately (e.g., `<meta charset="utf-8">`).
Nested Language Segments
<p lang="en">Welcome!</p>
<p lang="fr-CA">Bienvenue au Canada !</p>
- Browsers and screen readers switch pronunciation and hyphenation mid‑document.
- In HTML5, `lang` is inherited: child elements default to the parent's language unless they override it.
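The inheritance rule can be modelled as a walk up the tree until some ancestor declares a language. The sketch below assumes a simplified node shape, `{ lang, parent }`, rather than real DOM elements, so it runs outside a browser; `effectiveLang` is a hypothetical helper.

```javascript
// Assumed node shape for this sketch: { lang?: string, parent?: node }.
function effectiveLang(node) {
  for (let current = node; current; current = current.parent) {
    // The nearest declaration wins, including an explicit empty value (unknown language).
    if (current.lang !== undefined) return current.lang;
  }
  return ''; // nothing declared anywhere: language unknown
}

const html = { lang: 'en' };
const body = { parent: html };
const paragraph = { lang: 'fr-CA', parent: body };
const span = { parent: paragraph };

console.log(effectiveLang(span)); // inherits from the paragraph
console.log(effectiveLang(body)); // inherits from the root element
```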
XHTML & xml:lang
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
- In XML serializations, use `xml:lang` in addition to `lang`, with the same value.
Casing Conventions (Ignored by Parsers)
- Language: lowercase (`en`, `fr`)
- Script: Title case (`Latn`, `Cyrl`)
- Region: UPPERCASE (`US`, `CA`)
- Extensions/Variants: lowercase (`u-co-phonebk`, `valencia`)
Browsers treat tags case‑insensitively, but consistent casing aids readability.
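Rather than fixing casing by hand, the built‑in `Intl.getCanonicalLocales` applies the conventional casing for you. A minimal sketch follows; `canonicalCasing` is a hypothetical wrapper, and note that canonicalization may also replace deprecated subtags, not only adjust case.

```javascript
// Return the canonical form of a single tag; throws RangeError if malformed.
function canonicalCasing(tag) {
  const [canonical] = Intl.getCanonicalLocales(tag);
  return canonical;
}

console.log(canonicalCasing('EN-us'));      // en-US
console.log(canonicalCasing('zh-hant-tw')); // zh-Hant-TW
console.log(canonicalCasing('SR-LATN'));    // sr-Latn
```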
6. Best Practices & Common Pitfalls
- Use the Shortest Valid Tag
  - Prefer `ja` over `ja-JP`, since Japanese is used almost exclusively in Japan.
- Casing for Readability
  - Follow BCP 47’s recommended casing, even though HTML ignores it.
- Avoid Missing Subtags
  - If the site targets Simplified Chinese, explicitly use `zh-Hans` to differentiate it from Traditional.
- Special Tags
  - `und`: Undetermined language (`lang="und"`).
  - `mul`: Multiple languages in one element.
  - `zxx`: No linguistic content (e.g., `<audio lang="zxx">`).
- No Spaces
  - Use hyphens (`-`), never underscores (`_`) or spaces.
- Validate Against IANA Registry
  - Ensure subtags appear in the IANA Language Subtag Registry.
- Don’t Mix Formats
  - Avoid `en_us` (invalid in BCP 47) and `EN-US` (valid, but unconventional casing); write `en-US`.
Pitfall Example:
<p lang="zh">…</p> <!-- Ambiguous: could be Hans or Hant -->
<p lang="zh-Hant">…</p> <!-- Clear Traditional Chinese -->
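Several of the rules above can be checked mechanically. The sketch below flags separators, malformed tags, non‑canonical casing, and a bare `zh`; `lintLangTag` and its messages are hypothetical, and the check covers structure only, not registry membership.

```javascript
// Return a list of human-readable problems for one lang value.
function lintLangTag(tag) {
  const problems = [];
  if (/[_\s]/.test(tag)) problems.push('use hyphens, not underscores or spaces');

  let canonical;
  try {
    // Normalize underscores first so one mistake is not reported twice.
    [canonical] = Intl.getCanonicalLocales(tag.replace(/_/g, '-').trim());
  } catch {
    problems.push('not a well-formed BCP 47 tag');
    return problems;
  }

  if (problems.length === 0 && canonical !== tag) {
    problems.push(`prefer canonical form "${canonical}"`);
  }
  if (canonical === 'zh') problems.push('ambiguous: add Hans or Hant');
  return problems;
}

for (const tag of ['zh', 'zh-Hant', 'en_us', 'EN-US']) {
  console.log(tag, '=>', lintLangTag(tag));
}
```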
7. Language Codes in HTTP & SEO (hreflang)
Accept-Language Header
Browsers send user preferences:
Accept-Language: en-US,en;q=0.9,fr-CA;q=0.8,fr;q=0.7
- Server can serve localized content dynamically.
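One way a server might act on this header is to parse the quality values and pick the first supported locale. The following is a minimal sketch, not a full implementation of HTTP content negotiation; `parseAcceptLanguage` and `chooseLocale` are hypothetical helpers.

```javascript
// Parse "en-US,en;q=0.9" into [{ tag, q }] sorted by descending preference.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((item) => {
      const [tag, ...params] = item.trim().split(';');
      const qParam = params.map((p) => p.trim()).find((p) => p.startsWith('q='));
      const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
      return { tag: tag.trim(), q: Number.isNaN(q) ? 0 : q };
    })
    .filter((entry) => entry.tag !== '' && entry.q > 0)
    .sort((a, b) => b.q - a.q);
}

// Pick the first supported locale, trying shorter prefixes of each preference.
function chooseLocale(header, supported, fallback) {
  const lowerSupported = supported.map((s) => s.toLowerCase());
  for (const { tag } of parseAcceptLanguage(header)) {
    const parts = tag.toLowerCase().split('-');
    for (let end = parts.length; end > 0; end--) {
      const index = lowerSupported.indexOf(parts.slice(0, end).join('-'));
      if (index !== -1) return supported[index];
    }
  }
  return fallback;
}

const header = 'en-US,en;q=0.9,fr-CA;q=0.8,fr;q=0.7';
console.log(chooseLocale(header, ['fr', 'de'], 'de')); // picks "fr" via the fr-CA preference
```

Matching is case‑insensitive here because clients are not obliged to send conventional casing.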
hreflang Links for SEO
<link rel="alternate" hreflang="en-US" href="https://example.com/en/">
<link rel="alternate" hreflang="fr-CA" href="https://example.com/fr-ca/">
<link rel="alternate" hreflang="x-default" href="https://example.com/">
- `x-default`: Fallback when no language or region matches.
- Search Engines: Google uses these annotations to serve the correct language version in SERPs.
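Hand‑maintained `hreflang` lists tend to drift, so they are often generated from a single table. The sketch below renders the link elements from a map of tag to URL; `hreflangLinks`, `escapeAttribute`, and the example URLs are assumptions for illustration.

```javascript
// Escape characters that would break a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;');
}

// Render one <link rel="alternate"> per entry, in insertion order.
function hreflangLinks(alternates) {
  return Object.entries(alternates)
    .map(([tag, url]) =>
      `<link rel="alternate" hreflang="${escapeAttribute(tag)}" href="${escapeAttribute(url)}">`)
    .join('\n');
}

console.log(hreflangLinks({
  'en-US': 'https://example.com/en/',
  'fr-CA': 'https://example.com/fr-ca/',
  'x-default': 'https://example.com/',
}));
```

Emitting the same full set on every language version keeps the annotations reciprocal, which search engines generally expect.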
Best Practices
- Consistent Tags: Match `lang` in `<html>` to `hreflang`.
- Canonicalization: Use `rel="alternate"` and `rel="canonical"` appropriately.
- Sitemap Entries: Include `hreflang` annotations in XML sitemaps if needed.
Accurate `hreflang` implementations can improve user engagement and reduce bounce rates by directing searchers to the right language version.
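For the sitemap option, each `<url>` entry lists its alternates as `xhtml:link` children. The sketch below builds one such entry as a string; `sitemapUrlEntry` is a hypothetical helper, the URLs are placeholders assumed to be XML‑escaped already, and the enclosing `<urlset>` must declare `xmlns:xhtml="http://www.w3.org/1999/xhtml"`.

```javascript
// Build one <url> element with its hreflang alternates.
function sitemapUrlEntry(loc, alternates) {
  const links = Object.entries(alternates)
    .map(([tag, href]) => `    <xhtml:link rel="alternate" hreflang="${tag}" href="${href}"/>`)
    .join('\n');
  return `  <url>\n    <loc>${loc}</loc>\n${links}\n  </url>`;
}

console.log(sitemapUrlEntry('https://example.com/en/', {
  'en-US': 'https://example.com/en/',
  'fr-CA': 'https://example.com/fr-ca/',
}));
```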
8. Automation & Tooling
- IANA Language Subtag Registry: Source of truth for valid subtags.
- Linting: Integrate eslint-plugin-i18n or HTML validators to flag invalid `lang` values.
- Intl API (JavaScript):
// Use the browser's preferred locale, falling back to US English.
const locale = navigator.language || 'en-US';
const formatter = new Intl.DateTimeFormat(locale, { dateStyle: 'long' });
- Localization Frameworks (i18next, Vue I18n): Use BCP 47 tags to load correct resource bundles.
- CI Checks: Fail builds when new pages lack `lang` or use unrecognized codes.
Automated validation ensures consistent, accurate language tagging across large codebases.
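A CI check along these lines could be as small as the sketch below, which scans page sources for a missing or malformed `lang` on `<html>`. The `pages` object stands in for files read from disk, `findLangProblems` is a hypothetical helper, and the regular expressions are a deliberate simplification rather than a real HTML parser. The check validates tag structure only, not registry membership.

```javascript
// pages: { [fileName]: htmlSource }. Returns [{ file, reason }].
function findLangProblems(pages) {
  const problems = [];
  for (const [file, html] of Object.entries(pages)) {
    const htmlTag = html.match(/<html\b[^>]*>/i);
    // Require whitespace before "lang" so that xml:lang alone does not count.
    const langAttr = htmlTag ? htmlTag[0].match(/\slang\s*=\s*["']([^"']*)["']/i) : null;
    if (!langAttr || langAttr[1].trim() === '') {
      problems.push({ file, reason: 'missing or empty lang on <html>' });
      continue;
    }
    try {
      Intl.getCanonicalLocales(langAttr[1]);
    } catch {
      problems.push({ file, reason: `malformed language tag "${langAttr[1]}"` });
    }
  }
  return problems;
}

const report = findLangProblems({
  'index.html': '<!DOCTYPE html><html lang="en"><head></head></html>',
  'about.html': '<html><head></head></html>',
  'shop.html': '<html lang="en_US"></html>',
});
console.log(report);
```

In a build script, a non‑empty report would typically be printed and followed by a non‑zero exit code so the pipeline fails.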
9. FAQs
Q1. What’s the difference between ISO 639‑1 and ISO 639‑3?
ISO 639‑1 uses two‑letter codes for ~180 common languages. ISO 639‑3 uses three‑letter codes for ~7,000 languages, including obscure and historical ones.
Q2. When should I include script or region subtags?
Include script when the same language uses multiple scripts (e.g., `sr-Latn` vs `sr-Cyrl`). Include region when regional differences affect content (e.g., `en-GB` vs `en-US`).
Q3. Is casing important in language tags?
No. HTML parsers ignore case. However, the BCP 47 convention (lowercase language, Titlecase script, UPPERCASE region) improves human readability.
Q4. Why use `zh-Hant-TW` vs `zh`?
`zh` is ambiguous and is treated as Simplified in many contexts. Use `zh-Hant-TW` to specify Traditional Chinese as used in Taiwan, so that fonts, glyph variants, and collation are chosen correctly.
Q5. What if my content mixes multiple languages?
Wrap individual segments in elements with their respective `lang`, e.g., `<span lang="es">hola</span>`. Reserve `lang="mul"` for content whose languages cannot be marked up separately.
Conclusion
Well‑formed language tags are the backbone of accessible, localized, and SEO‑optimised web content. By mastering ISO 639 two‑ and three‑letter codes, constructing robust BCP 47 tags with scripts, regions, and extensions, and implementing them correctly in HTML, HTTP headers, and sitemaps, you make it far more likely that your audience and search engines interpret your pages as intended. Audit your templates for missing or ambiguous `lang` declarations, automate validation in your build pipeline, and adopt precise tagging (e.g., `zh-Hans-CN`, `es-419`) to reflect real linguistic and regional nuances. With these practices in place, your site will deliver true multilingual support, enhanced accessibility, and stronger regional SEO performance, empowering users around the globe to engage with your content in their native language.