Module:languages: Difference between revisions

Line 1:

--[==[ ~~intro:~~

--[=[

This module implements fetching of language-specific information and processing text in a given language.

~~===Types of languages===~~

There are two types of languages: full languages and etymology-only languages. The essential difference is that only

Line 9:

Line 7:

their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only

language as their parent, a full language can always be derived by following the parent links upwards. For example,

"Canadian French", code `fr-CA`, is an etymology-only language whose parent is the full language "French", code `fr`.

"Canadian French", code 'fr-CA', is an etymology-only language whose parent is the full language "French", code 'fr'.

An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code

`ang-nor`, which has "Anglian Old English", code `ang-ang` as its parent; this is an etymology-only language whose

'ang-nor', which has "Anglian Old English", code 'ang-ang' as its parent; this is an etymology-only language whose

parent is "Old English", code `ang`, which is a full language. (This is because Northumbrian Old English is considered

parent is "Old English", code "ang", which is a full language. (This is because Northumbrian Old English is considered

a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code `und`; this is the case,

a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code 'und'; this is the case,

for example, for "substrate" languages such as "Pre-Greek", code `qsb-grc`, and "the BMAC substrate", code `qsb-bma`.

for example, for "substrate" languages such as "Pre-Greek", code 'qsb-grc', and "the BMAC substrate", code 'qsb-bma'.

It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one

of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant

relationship is one of descent in time. For example, "Classical Latin", code `la-cla`, and "Late Latin", code `la-lat`,

relationship is one of descent in time. For example, "Classical Latin", code 'la-cla', and "Late Latin", code 'la-lat',

are both etymology-only languages with "Latin", code `la`, as their parents, because both of the former are varieties

are both etymology-only languages with "Latin", code 'la', as their parents, because both of the former are varieties

of Latin. However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of

Classical Latin; rather, it is a descendant. There is in fact a separate `ancestors` field that is used to express the

Classical Latin; rather, it is a descendant. There is in fact a separate 'ancestors' field that is used to express the

ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note

that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens,

for example, with "Old Italian" (code `roa-oit`), which is an etymology-only variant of full language "Italian" (code

for example, with "Old Italian" (code 'roa-oit'), which is an etymology-only variant of full language "Italian" (code

`it`), and with "Old Latin" (code `itc-ola`), which is an etymology-only variant of Latin. In both cases, the full

'it'), and with "Old Latin" (code 'itc-ola'), which is an etymology-only variant of Latin. In both cases, the full

language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin

using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance

Line 50:

Line 48:

functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a

{Language} object (depending on whether the canonical name refers to a full or etymology-only language).
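As a rough illustration of the parent/ancestor distinction described earlier, here is a minimal sketch. The `languages` table and the `full`, `parent` and `ancestors` fields are a hypothetical layout chosen for the example and are not the real data-module format; only the codes and relationships come from the text above.

```lua
-- Hypothetical, simplified language table; field names are assumptions.
local languages = {
	["fr"] = {name = "French", full = true},
	["fr-CA"] = {name = "Canadian French", parent = "fr"},
	["ang"] = {name = "Old English", full = true},
	["ang-ang"] = {name = "Anglian Old English", parent = "ang"},
	["ang-nor"] = {name = "Northumbrian Old English", parent = "ang-ang"},
	["la"] = {name = "Latin", full = true},
	["la-cla"] = {name = "Classical Latin", parent = "la"},
	-- Parent (containment) is Latin; ancestor (descent in time) is Classical Latin.
	["la-lat"] = {name = "Late Latin", parent = "la", ancestors = {"la-cla"}},
}

-- Follow parent links upwards until a full language is reached.
-- Returns the code of that full language, or nil if the chain is broken.
local function get_full(code)
	local lang = languages[code]
	while lang and not lang.full do
		code = lang.parent
		lang = languages[code]
	end
	return lang and code or nil
end

print(get_full("ang-nor"), get_full("la-lat"))
```

Note that the walk only ever uses `parent`; the `ancestors` field plays no part in finding the full language.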

~~===Textual representations===~~

Textual strings belonging to a given language come in several different ''text variants'':

# The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}},

{{tl|t}}, {{tl|lang}} and the like.

~~# The ''corrected input text'' is the input text with some corrections and/or normalizations applied, such as~~

# The ''display text'' is the text in the form as it will be displayed to the user. This can include accent marks that

~~bad-character replacements for certain languages, like replacing `l` or `1` to [[palochka]] in some languages written~~

are stripped to form the entry text (see below), as well as embedded bracketed links that are variously processed

~~in Cyrillic. (FIXME: This currently goes under the name ''display text'' but that will be repurposed below. Also,~~

further. The display text is generated from the input text by applying language-specific transformations; for most

~~[[User:Surjection]] suggests renaming this to ''normalized input text'', but "normalized" is used in a different sense~~

languages, there will be no such transformations. Examples of transformations are bad-character replacements for

~~in [[Module:usex]].)~~

certain languages (e.g. replacing 'l' or '1' with [[palochka]] in certain languages in Cyrillic); and for Thai and

# The ''display text'' is the text in the form as it will be displayed to the user~~. This is what appears in headwords,~~

Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as [กรีน/กฺรีน],

~~in usexes, in displayed internal links, etc~~. This can include accent marks that are ~~removed~~ to form the ~~stripped~~

which indicate how to transliterate given words.

~~display~~ text (see below), as well as embedded bracketed links that are variously processed further. The display text

# The ''entry text'' is the text in the form used to generate a link to a Wiktionary entry. This is usually generated

is generated from the ~~corrected~~ input text by applying language-specific transformations; for most languages, there

from the display text by stripping certain sorts of diacritics on a per-language basis, and sometimes doing other

will be no such transformations~~. The general reason for having a difference between input and display text is to allow~~

transformations. The concept of ''entry text'' only really makes sense for text that does not contain embedded links,

~~for extra information in the input text that is not displayed to the user but is sent to the transliteration module.~~

meaning that display text containing embedded links will need to have the links individually processed to get

~~Note that having different display and input text is only supported currently through special-casing but will be~~

per-link entry text in order to generate the resolved display text (see below).

~~generalized~~. Examples of transformations are~~: (1) Removing the {{cd|^}} that is used in~~ certain ~~East Asian (and~~

# The ''resolved display text'' is the result of resolving embedded links in the display text (e.g. converting them to

~~possibly other unicameral)~~ languages ~~to indicate capitalization of the transliteration~~ (~~which is currently~~

two-part links where the first part has entry-text transformations applied, and adding appropriate language-specific

~~special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for~~

fragments) and adding appropriate language and script tagging. This text can be passed directly to MediaWiki for

~~Arabic, removing a~~ ''~~sukūn'' diacritic placed over a ''tāʔ marbūṭa~~'' ~~(like this: ةْ)~~ to ~~indicate that the~~

display.

~~''tāʔ marbūṭa'' is pronounced and transliterated as /t/ instead of being silent~~ [~~NOTE, NOT IMPLEMENTED YET~~]; ~~(4)~~ for

# The ''source translit text'' is the text as supplied to the language-specific {transliterate()} method. The form of

Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as

the source translit text may need to be language-specific, e.g. Thai and Khmer will need the full unprocessed input

`[กรีน/กฺรีน]`, which indicate how to transliterate given words ~~[NOTE, NOT IMPLEMENTED YET except in language-specific~~

text, whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded

~~templates like {{tl|th-usex}}]~~.

bracketed links are handled in the existing code.] In general, embedded links need to be removed (i.e. converted to

## The ''~~right-resolved display~~ text'' is the ~~result of removing brackets around one-part embedded links and resolving~~

their "bare display" form by taking the right part of two-part links and removing double brackets), but when this

~~two-part embedded links into their right-hand components (i.e. converting two-part links into~~ the ~~displayed~~ form).

happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the

~~The process of right-resolution~~ is ~~what happens when you call {{cd|remove_links()}} in [[Module:links]] on some text.~~

text through the transliterate mechanism, and for others (those listed with "cont" in {substitution} in

~~When applied to~~ the display text, ~~it produces exactly what the user sees, without any link markup~~.

[[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is

# The ''~~stripped display~~ text'' ~~is the result of applying diacritic-stripping to the display~~ text.

still unclear to me.)

~~## The ''left-resolved stripped~~ display text~~'' [NEED BETTER NAME] is the result of applying left-resolution~~ to the

# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text.

~~stripped display text, i.e. similar~~ to ~~right-resolution but resolving two-part embedded links into their left-hand~~

Unlike for all the other text variants except the transcribed text, it is always in the Latin script.

~~components (i.e. the linked~~-to ~~page). If~~ the display text ~~refers to a single page, the resulting of applying~~

~~diacritic stripping and left-resolution produces the ''logical pagename''~~.

# The ''~~physical pagename~~ text'' is the result of ~~converting the stripped display text into physical page~~ links~~. If~~ the

~~stripped~~ display text ~~contains embedded links, the left side of those links is converted into physical page links;~~

~~otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three~~

~~things:~~ (~~1) converts characters not allowed in pagenames into their "unsupported title" representation,~~ e.g.

~~{{cd|Unsupported titles/`gt`}} in place of~~ the ~~logical name {{cd|>}}; (2) handles certain special~~-~~cased~~

~~unsupported-title logical pagenames~~, ~~such as {{cd|Unsupported titles/Space}} in place of {{cd|[space]}}~~ and

~~{{cd|Unsupported titles/Ancient Greek dish}} in place of a very long Greek name for a gourmet dish as found in~~

~~Aristophanes; (3~~) ~~converts "mammoth" pagenames such as [[a]] into their~~ appropriate ~~split component, e.g~~.

~~[[a/languages A to L]]~~.

# The ''source translit text'' is the text as supplied to the language-specific {~~{cd|~~transliterate()}} method. The form

of the source translit text may need to be language-specific, e.g Thai and Khmer will need the ~~corrected~~ input text,

whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed

links are handled in the existing code.] In general, embedded links need to be right-~~resolved (see above~~), but when

~~this~~ happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the

text through the transliterate mechanism, and for others (those listed with "cont" in {~~{cd|substitution}~~} in

[[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is

still unclear to me.)

# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text. Unlike

for all the other text variants except the transcribed text, it is always in the Latin script.

# The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where

"transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian,

Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form.

Unlike for all the other text variants other than the transliterated text, it is always in the Latin script.

Currently, the transcribed text is always supplied manually by the user; there is no such thing as a

{~~{cd~~|transcribe()}} method on language objects.

{lua|transcribe()} method on language objects.

# The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The

sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific

transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it

needs to be converted to display text, have embedded links removed ~~through~~ right-~~resolution~~ and have

needs to be converted to display text, have embedded links removed (i.e. resolving them to their right side if they

~~diacritic-stripping~~ applied.

are two-part links) and have entry text transformations applied.

# There are other text variants that occur in usexes (specifically, there are normalized variants of several of the

above text variants), but we can skip them for now.
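Two of the steps in the list above are easy to sketch in isolation: reducing embedded links to their "bare display" form (taking the right part of two-part links and removing double brackets), and building a sort key by lowercasing, transforming and uppercasing. The function names `to_bare_display` and `toy_sort_key` are hypothetical, and the sketch is ASCII-only; it makes no claim about how [[Module:links]] or {makeSortKey} are actually implemented.

```lua
-- Reduce embedded links to the text the reader would see.
local function to_bare_display(text)
	-- Two-part links first: keep only the right-hand (displayed) part.
	text = text:gsub("%[%[[^%[%]|]*|([^%[%]]*)%]%]", "%1")
	-- Then one-part links: just drop the double brackets.
	text = text:gsub("%[%[([^%[%]|]*)%]%]", "%1")
	return text
end

-- Lowercase, apply an optional transformation, then uppercase.
-- `transform` stands in for language-specific processing (an assumption).
local function toy_sort_key(sort_base, transform)
	local key = sort_base:lower()
	if transform then
		key = transform(key)
	end
	return key:upper()
end

print(to_bare_display("[[foo|bar]] baz [[qux]]"))
```

A real implementation would need Unicode-aware case conversion rather than `string.lower` and `string.upper`.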

The following methods exist on {Language} objects to convert between different text variants:

# ~~{correctInputText} (currently called~~ {makeDisplayText}): This converts input text to ~~corrected input~~ text.

# {makeDisplayText}: This converts input text to display text.

# {~~stripDiacritics~~}: This converts to ~~stripped display~~ text. [FIXME: This needs some rethinking. In particular,

# {lua|makeEntryName}: This converts input or display text to entry text. [FIXME: This needs some rethinking. In

{~~stripDiacritics~~} is sometimes called on ~~input text, corrected input text or~~ display text (in ~~various~~ paths inside of

particular, {lua|makeEntryName} is sometimes called on display text (in some paths inside of [[Module:links]]) and

[[Module:links]], and, in ~~the case~~ of ~~input text~~, usually from other modules). We need to make sure we don't try to

sometimes called on input text (in other paths inside of [[Module:links]], and usually from other modules). We need

convert input text to display text twice, but at the same time we need to support calling it directly on input text

to make sure we don't try to convert input text to display text twice, but at the same time we need to support

since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input,

calling it directly on input text since so many modules do this. This means we need to add a parameter indicating

~~corrected input,~~ or display text; if ~~the~~ former ~~two~~, we call {~~correctInputText~~} ourselves.]

whether the passed-in text is input or display text; if the former, we call {lua|makeDisplayText} ourselves.]

# {~~logicalToPhysical}: This converts logical pagenames to physical pagenames.~~

# {lua|transliterate}: This appears to convert input text with embedded brackets removed into a transliteration.

~~# {~~transliterate}: This appears to convert input text with embedded brackets removed into a transliteration.

[FIXME: This needs some rethinking. In particular, it calls {lua|processDisplayText} on its input, which won't work

[FIXME: This needs some rethinking. In particular, it calls {processDisplayText} on its input, which won't work

for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the

language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code;

a lot of callers remove the links themselves before calling {lua|transliterate()}, which I assume is wrong.]

a lot of callers remove the links themselves before calling {transliterate()}, which I assume is wrong.]

# {lua|makeSortKey}: This converts entry text (?) to a sort key. [FIXME: Clarify this.]

# {makeSortKey}: This converts ~~display~~ text (?) to a sort key. [FIXME: Clarify this.]

]=]

]==]

local export = {}

local etymology_languages_data_module = "Module:etymology languages/data"

local families_module = "Module:families"

~~local headword_page_module = "Module:headword/page"~~

local json_module = "Module:JSON"

local language_like_module = "Module:language-like"

Line 145:

Line 118:

local links_data_module = "Module:links/data"

local load_module = "Module:load"

local patterns_module = "Module:patterns"

local scripts_module = "Module:scripts"

local scripts_data_module = "Module:scripts/data"

local string_encode_entities_module = "Module:string/encode entities"

~~local string_pattern_escape_module = "Module:string/patternEscape"~~

~~local string_replacement_escape_module = "Module:string/replacementEscape"~~

local string_utilities_module = "Module:string utilities"

local table_module = "Module:table"

Line 188:

Line 160:

local Hant_chars

--[==[

Loaders for functions in other modules, which overwrite themselves with the target function when called. This ensures modules are only loaded when needed, retains the speed/convenience of locally-declared pre-loaded functions, and has no overhead after the first call, since the target functions are called directly in any subsequent calls.]==]

local function check_object(...)

check_object = require(utilities_module).check_object

return check_object(...)

end
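The self-overwriting loader pattern can be tried outside of any wiki module. In the sketch below, `expensive_load` is a hypothetical stand-in for a `require(...)` call, and `load_count` exists only to show that the load happens once.

```lua
local load_count = 0

-- Hypothetical stand-in for require(some_module).some_function.
local function expensive_load()
	load_count = load_count + 1
	return function(s)
		return s:upper()
	end
end

-- `local function` declares the name before the body is compiled, so the
-- assignment inside the body replaces this very local with the target.
local function shout(...)
	shout = expensive_load()
	return shout(...)
end

print(shout("a"), shout("b"), load_count)
```

After the first call, `shout` refers to the loaded function itself, so later calls skip the loader entirely.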

local function decode_entities(...)

decode_entities = require(string_utilities_module).decode_entities

return decode_entities(...)

end

local function decode_uri(...)

decode_uri = require(string_utilities_module).decode_uri

return decode_uri(...)

end

local function deep_copy(...)

deep_copy = require(table_module).deepCopy

return deep_copy(...)

end

local function encode_entities(...)

encode_entities = require(string_encode_entities_module)

return encode_entities(...)

end

local function ~~get_L2_sort_key~~(...)

local function get_script(...)

~~get_L2_sort_key~~ = require(~~headword_page_module~~).~~get_L2_sort_key~~

get_script = require(scripts_module).getByCode

return ~~get_L2_sort_key~~(...)

return get_script(...)

end

local function ~~get_script~~(...)

local function find_best_script_without_lang(...)

~~get_script~~ = require(scripts_module).~~getByCode~~

find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang

return ~~get_script~~(...)

return find_best_script_without_lang(...)

end

local function ~~find_best_script_without_lang~~(...)

local function get_family(...)

~~find_best_script_without_lang~~ = require(~~scripts_module~~).~~findBestScriptWithoutLang~~

get_family = require(families_module).getByCode

return ~~find_best_script_without_lang~~(...)

return get_family(...)

end

local function ~~get_family~~(...)

local function get_plaintext(...)

~~get_family~~ = require(~~families_module~~).~~getByCode~~

get_plaintext = require(utilities_module).get_plaintext

return ~~get_family~~(...)

return get_plaintext(...)

end

local function ~~get_plaintext~~(...)

local function get_wikimedia_lang(...)

~~get_plaintext~~ = require(~~utilities_module~~).~~get_plaintext~~

get_wikimedia_lang = require(wikimedia_languages_module).getByCode

return ~~get_plaintext~~(...)

return get_wikimedia_lang(...)

end

local function ~~get_wikimedia_lang~~(...)

local function keys_to_list(...)

~~get_wikimedia_lang~~ = require(~~wikimedia_languages_module~~).~~getByCode~~

keys_to_list = require(table_module).keysToList

return ~~get_wikimedia_lang~~(...)

return keys_to_list(...)

end

local function ~~keys_to_list~~(...)

local function list_to_set(...)

~~keys_to_list~~ = require(table_module).~~keysToList~~

list_to_set = require(table_module).listToSet

return ~~keys_to_list~~(...)

return list_to_set(...)

end

local function ~~list_to_set~~(...)

local function load_data(...)

~~list_to_set~~ = require(~~table_module~~).~~listToSet~~

load_data = require(load_module).load_data

return ~~list_to_set~~(...)

return load_data(...)

end

local function ~~load_data~~(...)

local function make_family_object(...)

~~load_data~~ = require(~~load_module~~).~~load_data~~

make_family_object = require(families_module).makeObject

return ~~load_data~~(...)

return make_family_object(...)

end

local function ~~make_family_object~~(...)

local function pattern_escape(...)

~~make_family_object~~ = require(~~families_module~~).~~makeObject~~

pattern_escape = require(patterns_module).pattern_escape

return ~~make_family_object~~(...)

return pattern_escape(...)

end

local function ~~pattern_escape~~(...)

local function remove_duplicates(...)

~~pattern_escape~~ = require(~~string_pattern_escape_module~~)

remove_duplicates = require(table_module).removeDuplicates

return ~~pattern_escape~~(...)

return remove_duplicates(...)

end

local function replacement_escape(...)

replacement_escape = require(~~string_replacement_escape_module~~)

replacement_escape = require(patterns_module).replacement_escape

return replacement_escape(...)

end

local function safe_require(...)

safe_require = require(load_module).safe_require

return safe_require(...)

end

local function shallow_copy(...)

shallow_copy = require(table_module).shallowCopy

return shallow_copy(...)

end

local function split(...)

split = require(string_utilities_module).split

return split(...)

end

local function to_json(...)

to_json = require(json_module).toJSON

return to_json(...)

end

local function u(...)

u = require(string_utilities_module).char

return u(...)

end

local function ugsub(...)

ugsub = require(string_utilities_module).gsub

return ugsub(...)

end

local function ulen(...)

ulen = require(string_utilities_module).len

return ulen(...)

end

local function ulower(...)

ulower = require(string_utilities_module).lower

return ulower(...)

end

local function umatch(...)

umatch = require(string_utilities_module).match

return umatch(...)

end

local function uupper(...)

uupper = require(string_utilities_module).upper

return uupper(...)

end

local function normalize_code(code)

Line 381:

Line 355:

end

-- Pre-substitution, of "[[" and "]]", which makes pattern matching more accurate.

text = gsub(text, "%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2")

text = gsub(text, "%f[%[]%[%[", "\1")

:gsub("%f[%]]%]%]", "\2")

local i = #subbedChars

for _, pattern in ipairs(patterns) do

Line 405:

Line 380:

end)

end

text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")

text = gsub(text, "\1", "%[%[")

:gsub("\2", "%]%]")

return text, subbedChars

end
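The bracket pre-substitution used above can be sketched on its own as follows. The helper names `protect_links` and `restore_links` are hypothetical. The restore step uses a plain "[[" replacement string rather than "%[%[", on the assumption that the sketch should also run on Lua versions newer than 5.1, which reject "%[" in a replacement string.

```lua
-- Swap "[[" and "]]" for single control characters so that later patterns
-- can treat each link delimiter as one character.
local function protect_links(text)
	-- The frontier pattern anchors the match at the start of a run of brackets.
	text = text:gsub("%f[%[]%[%[", "\1")
	text = text:gsub("%f[%]]%]%]", "\2")
	return text
end

-- Put the original delimiters back once processing is finished.
local function restore_links(text)
	text = text:gsub("\1", "[[")
	text = text:gsub("\2", "]]")
	return text
end

print(restore_links(protect_links("x [[a|b]] y")))
```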

Line 415:

Line 391:

local byte3 = floor(i / 64) % 64 + 128

local byte4 = i % 64 + 128

text = gsub(text, "\244[" .. char(byte2) .. char(byte2+8) .. "]" .. char(byte3) .. char(byte4),

text = gsub(text, "\244[" .. char(byte2) .. char(byte2+8) .. "]" .. char(byte3) .. char(byte4), replacement_escape(subbedChars[i]))

replacement_escape(subbedChars[i]))

end

text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")

text = gsub(text, "\1", "%[%[")

:gsub("\2", "%]%]")

return text

end

Line 445:

Line 421:

end

~~-- Subfunction of iterateSectionSubstitutions(). Process an individual chunk of text according to the specifications in~~

local function doSubstitutions(self, text, sc, substitution_data, function_name, recursed)

~~-- `substitution_data`. The input parameters are all as in the documentation of iterateSectionSubstitutions() except for~~

local fail, cats = nil, {}

~~-- `recursed`, which is set to true if we called ourselves recursively to process a script-specific setting or~~

~~-- script-wide fallback. Returns two values: the processed text and the actual substitution data used to do the~~

~~-- substitutions (same as the `actual_substitution_data` return value to iterateSectionSubstitutions()).~~

local function doSubstitutions(self, text, sc, substitution_data~~, data_field~~, function_name, recursed)

~~-- BE CAREFUL in this function because the value at any level can be `false`~~, ~~which causes no processing to be done~~

~~-- and blocks any further fallback processing.~~

~~local actual_substitution_data~~ = ~~substitution_data~~

-- If there are language-specific substitutes given in the data module, use those.

if type(substitution_data) == "table" then

-- If a script is specified, run this function with the script-specific data before continuing.

local sc_code = sc:getCode()

~~local has_substitution_data = false~~

if substitution_data[sc_code] then

if substitution_data[sc_code] ~~~= nil~~ then

text, fail, cats = doSubstitutions(self, text, sc, substitution_data[sc_code], function_name, true)

~~has_substitution_data = true~~

-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one separately.

~~if substitution_data[sc_code] then~~

elseif sc_code:match("^Han") and substitution_data.Hani then

text, ~~actual_substitution_data~~ = doSubstitutions(self, text, sc, substitution_data[sc_code], ~~data_field,~~

text, fail, cats = doSubstitutions(self, text, sc, substitution_data.Hani, function_name, true)

function_name, true)

~~end~~

-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one

-- separately.

elseif sc_code:match("^Han") and substitution_data.Hani ~~~= nil~~ then

~~has_substitution_data = true~~

~~if substitution_data.Hani then~~

text, ~~actual_substitution_data~~ = doSubstitutions(self, text, sc, substitution_data.Hani, ~~data_field,~~

function_name, true)

~~end~~

-- Substitution data with key 1 in the outer table may be given as a fallback.

elseif substitution_data[1] ~~~= nil~~ then

elseif substitution_data[1] then

~~has_substitution_data = true~~

text, fail, cats = doSubstitutions(self, text, sc, substitution_data[1], function_name, true)

~~if substitution_data[1] then~~

text, ~~actual_substitution_data~~ = doSubstitutions(self, text, sc, substitution_data[1], ~~data_field,~~

function_name, true)

~~end~~

end

-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with

-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with the NFD decomposed forms, as this simplifies many substitutions.

-- the NFD decomposed forms, as this simplifies many substitutions.

if substitution_data.from then

~~has_substitution_data = true~~

for i, from in ipairs(substitution_data.from) do

-- Normalize each loop, to ensure multi-stage substitutions work correctly.

Line 493:

Line 446:

if substitution_data.remove_diacritics then

~~has_substitution_data = true~~

text = sc:toFixedNFD(text)

-- Convert exceptions to PUA.

Line 516:

Line 468:

text = text:gsub("\242[\128-\191]*", substitutes)

end

~~end~~

~~if not has_substitution_data and sc._data[data_field] then~~

~~-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).~~

~~text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,~~

~~function_name, true)~~

end

elseif type(substitution_data) == "string" then

Line 529:

Line 476:

-- TODO: translit functions should be called with form NFD.

if function_name == "tr" then

~~if not module[function_name] then~~

text, fail, cats = module[function_name](text, self._code, sc:getCode())

~~error(("Internal error: Module [[%s]] has no function named 'tr'"):format(substitution_data))~~

~~end~~

text = module[function_name](text, self._code, sc:getCode())

~~elseif function_name == "stripDiacritics" then~~

~~-- FIXME, get rid of this arm after renaming makeEntryName -> stripDiacritics.~~

~~if module[function_name] then~~

~~text = module[function_name](sc:toFixedNFD(text), self, sc)~~

~~elseif module.makeEntryName then~~

~~text = module.makeEntryName(sc:toFixedNFD(text), self, sc)~~

~~else~~

~~error(("Internal error: Module [[%s]] has no function named 'stripDiacritics' or 'makeEntryName'"~~

~~):format(substitution_data))~~

~~end~~

else

~~if not module[function_name] then~~

text, fail, cats = module[function_name](sc:toFixedNFD(text), self, sc)

~~error(("Internal error: Module [[%s]] has no function named '%s'"):format(~~

~~substitution_data~~, ~~function_name))~~

~~end~~

~~text~~ = module[function_name](sc:toFixedNFD(text), self, sc)

end

else

error("Substitution data '" .. substitution_data .. "' does not match an existing module.")

end

~~elseif substitution_data == nil and sc._data[data_field] then~~

~~-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).~~

~~text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,~~

~~function_name, true)~~

end

-- Don't normalize to NFC if this is the inner loop or if a module returned nil.

if recursed or not text then

return text, ~~actual_substitution_data~~

return text, fail, cats

end

-- Fix any discouraged sequences created during the substitution process, and normalize into the final form.

return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), ~~actual_substitution_data~~

return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), fail, cats

end
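The order in which the newer revision of doSubstitutions() looks for nested data (script-specific entry, then the shared Hani entry for Han scripts, then the fallback at key 1) can be summarised in a small sketch. `select_nested_data` is a hypothetical helper, not a function in the module. It only picks the nested table and does not perform any substitutions, and it uses the plain truthiness checks of the newer revision.

```lua
-- Return the nested substitution data that would be processed first for the
-- given script code, or nil if the outer table has no applicable entry.
local function select_nested_data(substitution_data, sc_code)
	if type(substitution_data) ~= "table" then
		return nil
	end
	if substitution_data[sc_code] then
		-- Script-specific data takes priority.
		return substitution_data[sc_code]
	elseif sc_code:match("^Han") and substitution_data.Hani then
		-- Hant, Hans and Hani share the Hani entry.
		return substitution_data.Hani
	elseif substitution_data[1] then
		-- Generic fallback stored under key 1.
		return substitution_data[1]
	end
	return nil
end

local data = {
	Cyrl = {from = {"a"}, to = {"b"}},
	Hani = {remove_diacritics = "x"},
	[1] = {from = {"c"}, to = {"d"}},
}
print(select_nested_data(data, "Hant") == data.Hani)
```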

-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate

-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate over each one to apply substitutions. This avoids putting PUA characters through language-specific modules, which may be unequipped for them.

-- over each ~~section~~ to apply substitutions ~~(e.g. transliteration or diacritic stripping)~~. This avoids putting PUA

local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, function_name)

-- characters through language-specific modules, which may be unequipped for them~~. This function is passed the following~~

local fail, cats, sections = nil, {}

~~-- values:~~

~~-- * `self` (the Language object);~~

~~-- * `text` (the text to process);~~

~~-- * `sc` (the script of the text, which must be specified; callers should call checkScript() as needed to autodetect the~~

~~-- script of the text if not given explicitly by the user);~~

~~-- * `subbedChars` (an array of the same length as the text, indicating which characters have been substituted and by~~

~~-- what, or {nil} if no substitutions are to happen);~~

~~-- * `keepCarets` (DOCUMENT ME);~~

~~-- * `substitution_data` (the data indicating which substitutions to apply, taken directly from `data_field` in the~~

~~-- language's data structure in a submodule of [[Module:languages/data]]);~~

~~-- * `data_field` (the data field from which `substitution_data` was fetched, such as "sort_key" or "strip_diacritics");~~

~~-- * `function_name` (the name of the function to call to do the substitution, in case `substitution_data` specifies a~~

~~-- module to do the substitution);~~

~~-- * `notrim` (don't trim whitespace at the edges of `text`; set when computing the sort key, because whitespace at the~~

~~-- beginning of a sort key is significant and causes the resulting page to be sorted at the beginning of the category~~

~~-- it's in).~~

~~-- Returns three values:~~

~~-- (1) the processed text;~~

~~-- (2) the value of `subbedChars` that was passed in, possibly modified with additional character substitutions; will be~~

~~-- {nil} if {nil} was passed in;~~

~~-- (3) the actual substitution data that was used to apply substitutions to `text`; this may be different from the value~~

~~-- of `substitution_data` passed in if that value recursively specified script-specific substitutions or if no~~

~~-- substitution data could be found in the language-specific data (e.g. {nil} was passed in or a structure was passed~~

~~-- in that had no setting for the script given in `sc`), but a script-wide fallback value was set; currently it is~~

~~-- only used by makeSortKey()~~.

local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, ~~data_field,~~

function_name~~, notrim~~)

local sections

-- See [[Module:languages/data]].

if not find(text, "\244") or load_data(languages_data_module).substitution[self._code] == "cont" then

if not find(text, "\244") or (load_data(languages_data_module).substitution[self._code] == "cont") then

sections = {text}

else

sections = split(text, "\244[\128-\143][\128-\191]*", true)

end

~~local actual_substitution_data~~

for _, section in ipairs(sections) do

-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated

-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated modules).

-- modules).

if gsub(section, "%s+", "") ~= "" then

local sub, ~~this_actual_substitution_data~~ = doSubstitutions(self, section, sc, substitution_data, ~~data_field,~~

local sub, sub_fail, sub_cats = doSubstitutions(self, section, sc, substitution_data, function_name)

function_name)

-- Second round of temporary substitutions, in case any formatting was added by the main substitution process. However, don't do this if the section contains formatting already (as it would have had to have been escaped to reach this stage, and therefore should be given as raw text).

~~actual_substitution_data = this_actual_substitution_data~~

-- Second round of temporary substitutions, in case any formatting was added by the main substitution

-- process. However, don't do this if the section contains formatting already (as it would have had to have

-- been escaped to reach this stage, and therefore should be given as raw text).

if sub and subbedChars then

local noSub

Line 626:

Line 518:

end

if not sub then

if (not sub) or sub_fail then

text = sub

fail = sub_fail

cats = sub_cats or {}

break

end

text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text

if type(sub_cats) == "table" then

for _, cat in ipairs(sub_cats) do

insert(cats, cat)

end

~~if not notrim then~~

-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.

text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2")

~~-- Do not trim sort keys because spaces at the beginning are significant.~~

:gsub("(%S)%s+([\128-\191\244]*)$", "%1%2")

text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2"):gsub("(%S)%s+([\128-\191\244]*)$", "%1%2") or

~~nil~~

-- Remove duplicate categories.

if #cats > 1 then

cats = remove_duplicates(cats)

end

return text, subbedChars~~, actual_substitution_data~~

return text, fail, cats, subbedChars

end
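
The section-splitting idea described in the comments above can be sketched in plain Lua. This is a simplified illustration under stated assumptions, not the module's implementation: `substitute_sections`, `PLACEHOLDER` and the `substitute` callback are hypothetical names, and the sketch rebuilds the string piece by piece instead of replacing each section inside the original text as the module does.

```lua
-- One temporarily substituted formatting character (the same byte pattern the module splits on).
local PLACEHOLDER = "\244[\128-\143][\128-\191]*"

-- Hypothetical sketch: apply `substitute` to every stretch of text between placeholders,
-- leaving the placeholders themselves and whitespace-only stretches untouched.
local function substitute_sections(text, substitute)
	local out, pos = {}, 1
	while pos <= #text do
		local s, e = text:find(PLACEHOLDER, pos)
		-- The section runs up to the next placeholder, or to the end of the text.
		local section = text:sub(pos, (s or #text + 1) - 1)
		if section:find("%S") then
			section = substitute(section)
		end
		out[#out + 1] = section
		if not s then
			break
		end
		-- Copy the placeholder through unchanged.
		out[#out + 1] = text:sub(s, e)
		pos = e + 1
	end
	return table.concat(out)
end

local marker = "\244\128\128\128"
print(substitute_sections("abc" .. marker .. " def", string.upper))
```

Because the callback never sees the placeholder bytes, a language-specific routine that is not prepared for PUA characters cannot mangle them.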

Line 650:

Line 551:

text, rep = gsub(text, "\\\\(\\*^)", "\3%1")

until rep == 0

return (text:gsub("\\^", "\4")

return text:gsub("\\^", "\4")

:gsub(pattern or "%^", repl or "")

:gsub("\3", "\\")

:gsub("\4", "^"))

:gsub("\4", "^")

end
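
The two sides of this hunk differ only in whether the `gsub` chain is wrapped in parentheses. In Lua, `string.gsub` returns two values (the new string and the number of substitutions), and wrapping a call in parentheses truncates it to the first value. A minimal sketch of the difference, using hypothetical function names:

```lua
-- Returns both results of gsub: the new string and the substitution count.
local function without_parens(s)
	return s:gsub("a", "b")
end

-- The extra parentheses keep only the first result.
local function with_parens(s)
	return (s:gsub("a", "b"))
end

print(select("#", without_parens("banana"))) -- number of values returned
print(select("#", with_parens("banana")))
```

The distinction matters when the result is passed on as the last argument of another call, since a stray count would then arrive as an extra argument.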

Line 807:

Line 708:

Language.hasType = require(language_like_module).hasType

return self:hasType(...)

end

function Language:getMainCategoryName()

return self._data.main_category or "lemma"

end

Line 865:

Line 770:

function Language:makeWikipediaLink()

return make_link(self, (self:hasType("conlang") and self:getCanonicalName() or "w:" .. self:getWikipediaArticle()), self:getCanonicalName())

~~end~~

~~function Language:getMainCategoryName()~~

~~return self._data.main_category or "lemma"~~

end

Line 980:

Line 881:

local t, s, found = 0, 0

-- This is faster than using mw.ustring.gmatch directly.

for ch in gmatch((ugsub(text, "[" .. Hani.characters .. "]", "\255%0")), "\255(.[\128-\191]*)") do

for ch in gmatch(ugsub(text, "[" .. Hani.characters .. "]", "\255%0"), "\255(.[\128-\191]*)") do

found = true

if Hant_chars[ch] then

Line 1,009:

Line 910:

-- Count characters by removing everything in the script's charset and comparing to the original length.

local charset = sc.characters

local count = charset and length - ulen((ugsub(text, "[" .. charset .. "]+", ""))) or 0

local count = charset and length - ulen(ugsub(text, "[" .. charset .. "]+", "")) or 0

if count >= length then

Line 1,279:

Line 1,180:

local ancestorsParents = {}

for _, ancestor in ipairs(ancestors) do

-- When checking the parents of the other language, and the ancestor is also a parent, skip to the next ancestor, so that we exclude any etymology-only children of that parent that are not directly related (see below).

local ret = func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)

local ret = ~~(parent_check or not node:hasParent(ancestor)) and~~

func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)

if ret then

return ret

Line 1,550:

Line 1,449:

function Language:getStandardCharacters(sc)

local standard_chars = self._data.~~standard_chars~~

local standard_chars = self._data.standardChars

if type(standard_chars) ~= "table" then

return standard_chars

Line 1,569:

Line 1,468:

end

--[==[

--[==[Make the entry name (i.e. the correct page name).]==]

~~Strip diacritics from display text `text`~~ (~~in a language-specific fashion), which is in the script `sc`~~. ~~If `sc` is~~

function Language:makeEntryName(text, sc)

~~omitted or {nil}, the script is autodetected~~. ~~This also strips certain punctuation characters from the end and (in~~ the

~~case of Spanish upside-down question mark and exclamation points~~) ~~from the beginning; strips any whitespace at the~~

~~end of the text or between the text and final stripped punctuation characters; and applies some language-specific~~

~~Unicode normalizations to replace discouraged characters with their prescribed alternatives. Return the stripped text~~.

]==]

function Language:~~stripDiacritics~~(text, sc)

if (not text) or text == "" then

return text

return text, nil, {}

end

~~sc = checkScript(text, self, sc)~~

-- Set `unsupported` as true if certain conditions are met.

local unsupported

~~text = normalize(text, sc)~~

-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for relative paths in links, as are 3 or more consecutive tildes.

~~-- FIXME, rename makeEntryName to stripDiacritics and get rid of second and third return values~~

-- Note: match is faster with magic characters/charsets; find is faster with plaintext.

~~-- everywhere~~

~~text, _, _ = iterateSectionSubstitutions(self, text, sc, nil, nil,~~

~~self._data.strip_diacritics or self._data.entry_name, "strip_diacritics", "stripDiacritics")~~

~~text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟？！︖︕।॥။၊་།]?$") or text~~

~~return text~~

~~end~~

~~--[==[~~

~~Convert a ''logical'' pagename (the pagename as it appears to the user, after diacritics and punctuation have been~~

~~stripped) to a ''physical'' pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference~~

~~between the two are (a) unsupported titles such as `[ ]` (with square brackets in them), `#` (pound/hash sign) and~~

~~`¯\_(ツ)_/¯` (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into~~

~~parts (e.g. `a`, which is split into physical pagenames `a/languages A to L` and `a/languages M to Z`). For almost all~~

~~purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical~~

~~pagenames, such as checking the existence of a page or retrieving a page's contents.~~

~~`pagename` is the logical pagename to be converted. `is_reconstructed_or_appendix` indicates whether the page is in the~~

~~`Reconstruction` or `Appendix` namespaces. If it is omitted or has the value {nil}, the pagename is checked for an~~

~~initial asterisk, and if found, the page is assumed to be a `Reconstruction` page. Setting a value of `false` or `true`~~

~~to `is_reconstructed_or_appendix` disables this check and allows for mainspace pagenames that begin with an asterisk.~~

~~]==]~~

~~function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)~~

~~-- FIXME: This probably shouldn't happen but it happens when makeEntryName() receives nil.~~

~~if pagename == nil then~~

~~return nil~~

~~end~~

~~local initial_asterisk~~

~~if is_reconstructed_or_appendix == nil then~~

~~local pagename_minus_initial_asterisk~~

~~initial_asterisk, pagename_minus_initial_asterisk = pagename:match("^(%*)(.*)$")~~

~~if pagename_minus_initial_asterisk then~~

~~is_reconstructed_or_appendix = true~~

~~pagename = pagename_minus_initial_asterisk~~

~~elseif self:hasType("appendix-constructed") then~~

~~is_reconstructed_or_appendix = true~~

~~end~~

~~if not is_reconstructed_or_appendix then~~

~~-- Check if the pagename is a listed unsupported title.~~

~~local unsupportedTitles = load_data(links_data_module).unsupported_titles~~

~~if unsupportedTitles[pagename] then~~

~~return "Unsupported titles/" .. unsupportedTitles[pagename]~~

~~end~~

-- Set `unsupported` as true if certain conditions are met.

local unsupported

-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed

-- directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for

-- relative paths in links, as are 3 or more consecutive tildes. Note: match is faster with magic

-- characters/charsets; find is faster with plaintext.

if (

match(~~pagename~~, "[#<>%[%]_{|}]") or

match(text, "[#<>%[%]_{|}]") or

find(~~pagename~~, "\239\191\189") or

find(text, "\239\191\189") or

match(~~pagename~~, "%f[^%z/]%.%.?%f[%z/]") or

match(text, "%f[^%z/]%.%.?%f[%z/]") or

find(~~pagename~~, "~~~")

find(text, "~~~")

) then

unsupported = true

-- If it looks like an interwiki link.

elseif find(~~pagename~~, ":") then

elseif find(text, ":") then

local prefix = gsub(~~pagename~~, "^:*(.-):.*", ulower)

local prefix = gsub(text, "^:*(.-):.*", ulower)

if (

load_data("Module:data/namespaces")[prefix] or

Line 1,656:

Line 1,496:

end

-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of

-- Check if the text is a listed unsupported title.

-- it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported

local unsupportedTitles = load_data(links_data_module).unsupported_titles

-- titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded

if unsupportedTitles[text] then

-- from the earlier test).

return "Unsupported titles/" .. unsupportedTitles[text], nil, {}

end

sc = checkScript(text, self, sc)

local fail, cats

text = normalize(text, sc)

text, fail, cats = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.entry_name, "makeEntryName")

text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟？！︖︕।॥။၊་།]?$") or text

-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded from the earlier test).

if unsupported then

~~-- FIXME: This conversion needs to be different for reconstructed pages with unsupported characters. There~~

~~-- aren't any currently, but if there ever are, we need to fix this e.g. to put them in something like~~

~~-- Reconstruction:Proto-Indo-European/Unsupported titles/`lowbar``num`.~~

local unsupported_characters = load_data(links_data_module).unsupported_characters

~~pagename~~ = ~~pagename~~:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)

text = text:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)

:gsub("%f[^%z/]%.%.?%f[%z/]", function(m)

return (gsub(m, "%.", "`period`"))

return gsub(m, "%.", "`period`")

end)

:gsub("~~~+", function(m)

return (gsub(m, "~", "`tilde`"))

return gsub(m, "~", "`tilde`")

end)

~~pagename~~ = "Unsupported titles/" .. ~~pagename~~

text = "Unsupported titles/" .. text

~~elseif not is_reconstructed_or_appendix then~~

end

~~-- Check if this is a mammoth page. If so, which subpage should we link to?~~

~~local m_links_data = load_data(links_data_module)~~

~~local mammoth_page_type = m_links_data.mammoth_pages[pagename]~~

~~if mammoth_page_type then~~

~~local canonical_name = self:getFullName()~~

~~if canonical_name ~= "Translingual" and canonical_name ~= "English" then~~

~~local this_subpage~~

~~local L2_sort_key = get_L2_sort_key(canonical_name)~~

~~for _, subpage_spec in ipairs(m_links_data.mammoth_page_subpage_types[mammoth_page_type]) do~~

~~-- unpack() fails utterly on data loaded using mw.loadData() even if offsets are given~~

~~local subpage, pattern = subpage_spec[1], subpage_spec[2]~~

~~if pattern == true or L2_sort_key:match(pattern) then~~

~~this_subpage = subpage~~

~~break~~

~~end~~

~~if not this_subpage then~~

~~error(("Internal error: Bad data in mammoth_page_subpage_pages in [[Module:links/data]] for mammoth page %s, type %s; last entry didn't have 'true' in it"):format(~~

~~pagename, mammoth_page_type))~~

~~end~~

~~pagename = pagename .. "/" .. this_subpage~~

~~end~~

~~return (initial_asterisk or "") .. pagename~~

end

~~--[==[~~

~~Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename.~~

~~This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated~~

~~and will be going away. It is a simple composition of `self:stripDiacritics` and `self:logicalToPhysical`; most callers~~

~~only want the former, and if you need both, call them both yourself.~~

~~`text` and `sc` are as in `self:stripDiacritics`, and `is_reconstructed_or_appendix` is as in `self:logicalToPhysical`.~~

return text, fail, cats

~~]==]~~

~~function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)~~

return ~~self:logicalToPhysical(self:stripDiacritics(~~text, ~~sc)~~, ~~is_reconstructed_or_appendix)~~

end
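
The escaping step for unsupported titles can be sketched as follows. This is an illustration only: the two-entry `unsupported_characters` table is a hypothetical, partial stand-in for the data in [[Module:links/data]] (which is not shown in this diff), the function name `escape_unsupported` is invented, and the dot-slash handling is left out.

```lua
-- Hypothetical, partial mapping; the real table is loaded from [[Module:links/data]].
local unsupported_characters = {
	["#"] = "`num`",
	["_"] = "`lowbar`",
}

-- Sketch: replace each listed character with its backtick-delimited name,
-- and spell out every tilde in a run of three or more.
local function escape_unsupported(title)
	local escaped = title
		:gsub("[#_]", unsupported_characters)
		:gsub("~~~+", function(m)
			return (m:gsub("~", "`tilde`"))
		end)
	return "Unsupported titles/" .. escaped
end

print(escape_unsupported("a_b#c"))
```

Runs of fewer than three tildes are left alone, matching the `"~~~+"` pattern used in the code above.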

--[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==]

Line 1,725:

Line 1,537:

end

--[==[Creates a sort key for the given ~~stripped text~~, following the rules appropriate for the language. This removes

--[==[Creates a sort key for the given entry name, following the rules appropriate for the language. This removes diacritical marks from the entry name if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.

diacritical marks from the ~~stripped text~~ if they are not considered significant for sorting, and may perform some other

The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the entry name and returns a sortkey.]==]

changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.

The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the ~~stripped text~~ and returns a sortkey.]==]

function Language:makeSortKey(text, sc)

if (not text) or text == "" then

return text

return text, nil, {}

end

-- Remove directional characters~~, bold, italics~~, soft hyphens, strip markers and HTML tags.

-- Remove directional characters, soft hyphens, strip markers and HTML tags.

~~-- FIXME: Partly duplicated with remove_formatting() in [[Module:links]]~~.

text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "")

~~text = text:gsub("('*)'''(.-'*)'''", "%1%2"):gsub("('*)''(.-'*)''", "%1%2")~~

text = gsub(unstrip(text), "<[^<>]+>", "")

Line 1,756:

Line 1,564:

text = sc:toFixedNFD(text)

end

-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is

-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is necessary so as to prevent "i" and "ı" both being sorted as "I".

-- usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as

-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive to changes in capitalization (as it changes the target page).

-- conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the

local fail, cats

-- sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is

-- necessary so as to prevent "i" and "ı" both being sorted as "I".

--

-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive

-- to changes in capitalization (as it changes the target page).

if not sc:sortByScraping() then

text = ulower(text)

end

local ~~actual_substitution_data~~

local sort_key = self._data.sort_key

~~-- Don't trim whitespace here because it's significant at the beginning of a sort key or sort base~~.

text, fail, cats = iterateSectionSubstitutions(self, text, sc, nil, nil, sort_key, "makeSortKey")

text, _, ~~actual_substitution_data~~ = iterateSectionSubstitutions(self, text, sc, nil, nil, ~~self._data.sort_key,~~

"sort_key", "makeSortKey~~", "notrim~~")

if not sc:sortByScraping() then

if self:hasDottedDotlessI() and not ~~actual_substitution_data~~ then

if self:hasDottedDotlessI() and not sort_key then

text = ~~text:~~gsub("ı", "I")~~:gsub(~~"i", "İ")

text = gsub(gsub(text, "ı", "I"), "i", "İ")

text = sc:toFixedNFC(text)

end

Line 1,782:

Line 1,583:

-- Remove parentheses, as long as they are either preceded or followed by something.

text = gsub(text, "(.)[()]+", "%1"):gsub("[()]+(.)", "%1")

text = gsub(text, "(.)[()]+", "%1")

:gsub("[()]+(.)", "%1")

text = escape_risky_characters(text)

return text

return text, fail, cats

end
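
The dotted/dotless i handling described in the comments above can be illustrated in isolation. This is a sketch under assumptions: `sort_base_without_substitutions` is a hypothetical name, and it uses the byte-oriented `string.upper` (which only uppercases ASCII letters in the default C locale) in place of the Unicode-aware uppercasing the module relies on.

```lua
-- Sketch: when a language distinguishes dotted and dotless i and has no sort key
-- substitutions, map each lowercase letter to its own capital before uppercasing,
-- so that "i" and "ı" do not both end up as "I".
local function sort_base_without_substitutions(text)
	text = text:gsub("ı", "I"):gsub("i", "İ")
	-- ASCII-only uppercasing; the multibyte "İ" passes through unchanged.
	return (text:upper())
end

print(sort_base_without_substitutions("ısı"))
print(sort_base_without_substitutions("isi"))
```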

--[==[Create the form used as a basis for display text and transliteration~~. FIXME: Rename to correctInputText()~~.]==]

--[==[Create the form used as a basis for display text and transliteration.]==]

local function processDisplayText(text, self, sc, keepCarets, keepPrefixes)

local subbedChars = {}

Line 1,797:

Line 1,599:

sc = checkScript(text, self, sc)

local fail, cats

text = normalize(text, sc)

text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text,

text, fail, cats, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text, "makeDisplayText")

~~"display_text"~~, "makeDisplayText")

text = removeCarets(text, sc)

Line 1,812:

Line 1,614:

while true do

local prefix = gsub(text, "^(.-):.+", function(m1)

return (gsub(m1, "\244[\128-\191]*", ""))

return gsub(m1, "\244[\128-\191]*", "")

end)

-- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace.

Line 1,827:

Line 1,629:

end)

end

text = gsub(text, "\3", "\\"):gsub("\4", ":")

text = gsub(text, "\3", "\\")

:gsub("\4", ":")

end

--[[if not self:hasType("conlang") then

text = gsub(text,"^%*", "")

end

text = gsub(text,"^%*%*", "*")]]

return text, subbedChars

return text, fail, cats, subbedChars

end

--[==[Make the display text (i.e. what is displayed on the page).]==]

function Language:makeDisplayText(text, sc, keepPrefixes)

if not text or text == "" then

if (not text) or text == "" then

return text

return text, nil, {}

end

local subbedChars

local fail, cats, subbedChars

text, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)

text, fail, cats, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)

text = escape_risky_characters(text)

return undoTempSubstitutions(text, subbedChars)

return undoTempSubstitutions(text, subbedChars), fail, cats

end

--[==[Transliterates the text from the given script into the Latin script (see

--[==[Transliterates the text from the given script into the Latin script (see [[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to work; if it is not present, {{code|lua|nil}} is returned.

[[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to

Returns three values:

work; if it is not present, {{code|lua|nil}} is returned.

# The transliteration.

# A boolean which indicates whether the transliteration failed for an unexpected reason. If {{code|lua|false}}, then the transliteration either succeeded, or the module is returning nothing in a controlled way (e.g. the input was {{code|lua|"-"}}). Generally, this means that no maintenance action is required. If {{code|lua|true}}, then the transliteration is {{code|lua|nil}} because either the input or output was defective in some way (e.g. [[Module:ar-translit]] will not transliterate non-vocalised inputs, and this module will fail partially-completed transliterations in all languages). Note that this value can be manually set by the transliteration module, so make sure to cross-check to ensure it is accurate.

The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that

# A table of categories selected by the transliteration module, which should be in the format expected by {{code|lua|format_categories}} in [[Module:utilities]].

module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the

The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the possible scripts that the module can transliterate, and will show an error if it's not one of them. For this reason, the <code>sc</code> parameter should always be provided when writing non-language-specific code.

possible scripts that the module can transliterate, and will ~~throw~~ an error if it's not one of them. For this reason,

The <code>module_override</code> parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by [[Wiktionary:Tracking/languages/module_override]].

the <code>sc</code> parameter should always be provided when writing non-language-specific code.

The <code>module_override</code> parameter is used to override the default module that is used to provide the

transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no

default module yet, or you want to demonstrate an alternative version of a transliteration module before making it

official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked

by [[~~wikt:~~Wiktionary:Tracking/languages/module_override]].

'''Known bugs''':

* This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assertion fails, wikitext markups like <nowiki>'''</nowiki> can cause wrong transliterations.

* HTML entities like <code>&apos;</code>, often used to escape wikitext markups, do not work.

* HTML entities like <code>&apos;</code>, often used to escape wikitext markups, do not work.]==]

]==]

function Language:transliterate(text, sc, module_override)

-- If there is no text, or the language doesn't have transliteration data and there's no override, return nil.

if not text or text == "" or text == "-" then

if not (self._data.translit or module_override) then

return text

return nil, false, {}

elseif (not text) or text == "" or text == "-" then

return text, false, {}

end

-- If the script is not transliteratable (and no override is given), return nil.

sc = checkScript(text, self, sc)

if not (sc:isTransliterated() or module_override) then

return nil

return nil, true, {}

end

Line 1,882:

Line 1,686:

-- Get the display text with the keepCarets flag set.

local subbedChars

local fail, cats, subbedChars

if processed then

text, subbedChars = processDisplayText(text, self, sc, true)

text, fail, cats, subbedChars = processDisplayText(text, self, sc, true)

end

-- Transliterate (using the module override if applicable).

text, fail, cats, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or self._data.translit, "tr")

if not text then

return nil, true, cats

end

~~-- Transliterate (using the module override if applicable).~~

-- Incomplete transliterations return nil.

~~text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or~~

local charset = sc.characters

~~self._data.translit, "translit", "tr")~~

if charset and umatch(text, "[" .. charset .. "]") then

-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are false positives), as well as any PUA substitutions. Anything remaining should only be script code "None" (e.g. numerals).

~~if not text then~~

local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "􀀀-􏿽]+", "")

~~return nil~~

-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be returned.

~~end~~

if find_best_script_without_lang(check_text, true):getCode() ~= "None" then

return nil, true, cats

-- Incomplete transliterations return nil.

end

local charset = sc.characters

end

if charset and umatch(text, "[" .. charset .. "]") then

-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are

if processed then

-- false positives), as well as any PUA substitutions. Anything remaining should only be script code "None"

text = escape_risky_characters(text)

-- (e.g. numerals).

text = undoTempSubstitutions(text, subbedChars)

local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "􀀀-􏿽]+", "")

end

-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be

-- returned.

-- If the script does not use capitalization, then capitalize any letters of the transliteration which are immediately preceded by a caret (and remove the caret).

if find_best_script_without_lang(check_text, true):getCode() ~= "None" then

if text and not sc:hasCapitalization() and text:find("^", 1, true) then

return nil

text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)

end

return m1 .. uupper(m2)

end

end)

if processed then

text = escape_risky_characters(text)

text = undoTempSubstitutions(text, subbedChars)

end

~~-- If the script does not use capitalization, then capitalize any letters of the transliteration which are~~

fail = text == nil and (not not fail) or false

~~-- immediately preceded by a caret (and remove the caret).~~

if text and not ~~sc:hasCapitalization() and text:find("^", 1, true) then~~

~~text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)~~

~~return m1 .. uupper(m2)~~

~~end~~)

~~end~~

return text

return text, fail, cats

end
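
On the side of the diff that documents three return values, a caller would be expected to check the failure flag before using the text. A minimal sketch of that calling pattern follows; `mock_lang` is a hypothetical stand-in for a Language object, and the sample inputs, the category string and the `describe` helper are invented for illustration.

```lua
-- Hypothetical stand-in; only the (text, fail, cats) return shape follows the documentation above.
local mock_lang = {}

function mock_lang:transliterate(text, sc)
	if text == "-" then
		return text, false, {} -- nothing to transliterate, returned in a controlled way
	elseif text == "да" then
		return "da", false, {} -- success
	end
	return nil, true, {"example tracking category"} -- unexpected failure
end

-- Sketch of a caller that distinguishes failure from a controlled nil.
local function describe(lang, text, sc)
	local tr, fail, cats = lang:transliterate(text, sc)
	if fail then
		return "failed (" .. #cats .. " categories)"
	elseif tr == nil then
		return "no transliteration available"
	end
	return tr
end

print(describe(mock_lang, "да", "Cyrl"))
```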

Line 1,961:

Line 1,762:

function Language:toJSON(opts)

local ~~strip_diacritics~~, ~~strip_diacritics_patterns~~, ~~strip_diacritics_remove_diacritics~~ = self._data.~~strip_diacritics~~

local entry_name, entry_name_patterns, entry_name_remove_diacritics = self._data.entry_name

if ~~strip_diacritics~~ then

if entry_name then

if ~~strip_diacritics~~.from then

if entry_name.from then

~~strip_diacritics_patterns~~ = {}

entry_name_patterns = {}

for i, from in ipairs(~~strip_diacritics~~.from) do

for i, from in ipairs(entry_name.from) do

insert(~~strip_diacritics_patterns~~, {from = from, to = ~~strip_diacritics~~.to[i] or ""})

insert(entry_name_patterns, {from = from, to = entry_name.to[i] or ""})

end

~~strip_diacritics_remove_diacritics~~ = ~~strip_diacritics~~.remove_diacritics

entry_name_remove_diacritics = entry_name.remove_diacritics

end

-- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object().

~~-- props should either contain zero-argument functions to compute the value, or the value itself.~~

local ret = {

local ~~props~~ = {

ancestors = self:getAncestorCodes(),

ancestors = ~~function() return~~ self:getAncestorCodes() ~~end~~,

canonicalName = self:getCanonicalName(),

canonicalName = ~~function() return~~ self:getCanonicalName() ~~end~~,

categoryName = self:getCategoryName("nocap"),

categoryName = ~~function() return~~ self:getCategoryName("nocap") ~~end~~,

code = self._code,

mainCode = self._mainCode,

parent = ~~function() return~~ self:getParentCode() ~~end~~,

parent = self:getParentCode(),

full = ~~function() return~~ self:getFullCode() ~~end~~,

full = self:getFullCode(),

~~stripDiacriticsPatterns~~ = ~~strip_diacritics_patterns~~,

entryNamePatterns = entry_name_patterns,

~~stripDiacriticsRemoveDiacritics~~ = ~~strip_diacritics_remove_diacritics~~,

entryNameRemoveDiacritics = entry_name_remove_diacritics,

family = ~~function() return~~ self:getFamilyCode() ~~end~~,

family = self:getFamilyCode(),

aliases = ~~function() return~~ self:getAliases() ~~end~~,

aliases = self:getAliases(),

varieties = ~~function() return~~ self:getVarieties() ~~end~~,

varieties = self:getVarieties(),

otherNames = ~~function() return~~ self:getOtherNames() ~~end~~,

otherNames = self:getOtherNames(),

scripts = ~~function() return~~ self:getScriptCodes() ~~end~~,

scripts = self:getScriptCodes(),

type = ~~function() return~~ keys_to_list(self:getTypes()) ~~end~~,

type = keys_to_list(self:getTypes()),

wikimediaLanguages = ~~function() return~~ self:getWikimediaLanguageCodes() ~~end~~,

wikimediaLanguages = self:getWikimediaLanguageCodes(),

wikidataItem = ~~function() return~~ self:getWikidataItem() ~~end~~,

wikidataItem = self:getWikidataItem(),

wikipediaArticle = ~~function() return~~ self:getWikipediaArticle(true) ~~end~~,

wikipediaArticle = self:getWikipediaArticle(true),

}

~~local ret = {}~~

~~for prop, val in pairs(props) do~~

~~if not opts.skip_fields or not opts.skip_fields[prop] then~~

~~if type(val) == "function" then~~

~~ret[prop] = val()~~

~~else~~

~~ret[prop] = val~~

~~end~~

-- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`.

return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts)
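
The revision that stores `props` entries as zero-argument functions means a field listed in `opts.skip_fields` is never computed at all. A minimal sketch of that idea follows; the helper name `resolve_props` and the sample field values are hypothetical and are not taken from the module.

```lua
-- Hypothetical helper (not part of Module:languages): build the result table
-- from `props`, calling function-valued entries only for fields that are kept.
local function resolve_props(props, skip_fields)
	local ret = {}
	for prop, val in pairs(props) do
		if not (skip_fields and skip_fields[prop]) then
			if type(val) == "function" then
				ret[prop] = val()
			else
				ret[prop] = val
			end
		end
	end
	return ret
end

-- Illustrative data only; `calls` counts how often the lazy field is evaluated.
local calls = 0
local props = {
	code = "xx",
	ancestors = function()
		calls = calls + 1
		return {"xx-old"}
	end,
}

-- Skipping `ancestors` leaves its function uncalled.
local result = resolve_props(props, {ancestors = true})
```

With eager values (the other side of this diff), every getter runs when the table is constructed, whether or not the field ends up in the output.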

Line 2,134:

Line 1,923:

--[==[

<span style="color: ~~var(--wikt-palette-red,~~#BA0000)">This function is not for use in entries or other content pages.</span>

<span style="color: #BA0000">This function is not for use in entries or other content pages.</span>

Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects.

-- Do NOT use these methods!

@@ Line 1: / Line 1: @@
---[==[ intro:
+--[=[
 This module implements fetching of language-specific information and processing text in a given language.
-===Types of languages===
 There are two types of languages: full languages and etymology-only languages. The essential difference is that only
@@ Line 9: / Line 7: @@
 their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only
 language as their parent, a full language can always be derived by following the parent links upwards. For example,
-"Canadian French", code `fr-CA`, is an etymology-only language whose parent is the full language "French", code `fr`.
+"Canadian French", code 'fr-CA', is an etymology-only language whose parent is the full language "French", code 'fr'.
 An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code
-`ang-nor`, which has "Anglian Old English", code `ang-ang` as its parent; this is an etymology-only language whose
+'ang-nor', which has "Anglian Old English", code 'ang-ang' as its parent; this is an etymology-only language whose
-parent is "Old English", code `ang`, which is a full language. (This is because Northumbrian Old English is considered
+parent is "Old English", code "ang", which is a full language. (This is because Northumbrian Old English is considered
-a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code `und`; this is the case,
+a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code 'und'; this is the case,
-for example, for "substrate" languages such as "Pre-Greek", code `qsb-grc`, and "the BMAC substrate", code `qsb-bma`.
+for example, for "substrate" languages such as "Pre-Greek", code 'qsb-grc', and "the BMAC substrate", code 'qsb-bma'.
 It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one
 of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant
-relationship is one of descent in time. For example, "Classical Latin", code `la-cla`, and "Late Latin", code `la-lat`,
+relationship is one of descent in time. For example, "Classical Latin", code 'la-cla', and "Late Latin", code 'la-lat',
-are both etymology-only languages with "Latin", code `la`, as their parents, because both of the former are varieties
+are both etymology-only languages with "Latin", code 'la', as their parents, because both of the former are varieties
 of Latin. However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of
-Classical Latin; rather, it is a descendant. There is in fact a separate `ancestors` field that is used to express the
+Classical Latin; rather, it is a descendant. There is in fact a separate 'ancestors' field that is used to express the
 ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note
 that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens,
-for example, with "Old Italian" (code `roa-oit`), which is an etymology-only variant of full language "Italian" (code
+for example, with "Old Italian" (code 'roa-oit'), which is an etymology-only variant of full language "Italian" (code
-`it`), and with "Old Latin" (code `itc-ola`), which is an etymology-only variant of Latin. In both cases, the full
+'it'), and with "Old Latin" (code 'itc-ola'), which is an etymology-only variant of Latin. In both cases, the full
 language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin
 using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance
@@ Line 50: / Line 48: @@
 functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a
 {Language} object (depending on whether the canonical name refers to a full or etymology-only language).
-===Textual representations===
 Textual strings belonging to a given language come in several different ''text variants'':
 # The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}},
-  {{tl|t}}, {{tl|lang}} and the like.
+{{tl|t}}, {{tl|lang}} and the like.
-# The ''corrected input text'' is the input text with some corrections and/or normalizations applied, such as
+# The ''display text'' is the text in the form as it will be displayed to the user. This can include accent marks that
-  bad-character replacements for certain languages, like replacing `l` or `1` with [[palochka]] in some languages written
+are stripped to form the entry text (see below), as well as embedded bracketed links that are variously processed
-  in Cyrillic. (FIXME: This currently goes under the name ''display text'' but that will be repurposed below. Also,
+further. The display text is generated from the input text by applying language-specific transformations; for most
-  [[User:Surjection]] suggests renaming this to ''normalized input text'', but "normalized" is used in a different sense
+languages, there will be no such transformations. Examples of transformations are bad-character replacements for
-  in [[Module:usex]].)
+certain languages (e.g. replacing 'l' or '1' with [[palochka]] in certain languages in Cyrillic); and for Thai and
-# The ''display text'' is the text in the form as it will be displayed to the user. This is what appears in headwords,
+Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as [กรีน/กฺรีน],
-  in usexes, in displayed internal links, etc. This can include accent marks that are removed to form the stripped
+which indicate how to transliterate given words.
-  display text (see below), as well as embedded bracketed links that are variously processed further. The display text
+# The ''entry text'' is the text in the form used to generate a link to a Wiktionary entry. This is usually generated
-  is generated from the corrected input text by applying language-specific transformations; for most languages, there
+from the display text by stripping certain sorts of diacritics on a per-language basis, and sometimes doing other
-  will be no such transformations. The general reason for having a difference between input and display text is to allow
+transformations. The concept of ''entry text'' only really makes sense for text that does not contain embedded links,
-  for extra information in the input text that is not displayed to the user but is sent to the transliteration module.
+meaning that display text containing embedded links will need to have the links individually processed to get
-  Note that having different display and input text is only supported currently through special-casing but will be
+per-link entry text in order to generate the resolved display text (see below).
-  generalized. Examples of transformations are: (1) Removing the {{cd|^}} that is used in certain East Asian (and
+# The ''resolved display text'' is the result of resolving embedded links in the display text (e.g. converting them to
-  possibly other unicameral) languages to indicate capitalization of the transliteration (which is currently
+two-part links where the first part has entry-text transformations applied, and adding appropriate language-specific
-  special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for
+fragments) and adding appropriate language and script tagging. This text can be passed directly to MediaWiki for
-  Arabic, removing a ''sukūn'' diacritic placed over a ''tāʔ marbūṭa'' (like this: ةْ) to indicate that the
+display.
-  ''tāʔ marbūṭa'' is pronounced and transliterated as /t/ instead of being silent [NOTE, NOT IMPLEMENTED YET]; (4) for
+# The ''source translit text'' is the text as supplied to the language-specific {transliterate()} method. The form of
-  Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as
+the source translit text may need to be language-specific, e.g. Thai and Khmer will need the full unprocessed input
-  `[กรีน/กฺรีน]`, which indicate how to transliterate given words [NOTE, NOT IMPLEMENTED YET except in language-specific
+text, whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded
-  templates like {{tl|th-usex}}].
+bracketed links are handled in the existing code.] In general, embedded links need to be removed (i.e. converted to
-## The ''right-resolved display text'' is the result of removing brackets around one-part embedded links and resolving
+their "bare display" form by taking the right part of two-part links and removing double brackets), but when this
-   two-part embedded links into their right-hand components (i.e. converting two-part links into the displayed form).
+happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the
-   The process of right-resolution is what happens when you call {{cd|remove_links()}} in [[Module:links]] on some text.
+text through the transliterate mechanism, and for others (those listed with "cont" in {substitution} in
-   When applied to the display text, it produces exactly what the user sees, without any link markup.
+[[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is
-# The ''stripped display text'' is the result of applying diacritic-stripping to the display text.
+still unclear to me.)
-## The ''left-resolved stripped display text'' [NEED BETTER NAME] is the result of applying left-resolution to the
+# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text.
-   stripped display text, i.e. similar to right-resolution but resolving two-part embedded links into their left-hand
+Unlike for all the other text variants except the transcribed text, it is always in the Latin script.
-   components (i.e. the linked-to page). If the display text refers to a single page, the result of applying
-   diacritic stripping and left-resolution produces the ''logical pagename''.
-# The ''physical pagename text'' is the result of converting the stripped display text into physical page links. If the
-  stripped display text contains embedded links, the left side of those links is converted into physical page links;
-  otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three
-  things: (1) converts characters not allowed in pagenames into their "unsupported title" representation, e.g.
-  {{cd|Unsupported titles/`gt`}} in place of the logical name {{cd|>}}; (2) handles certain special-cased
-  unsupported-title logical pagenames, such as {{cd|Unsupported titles/Space}} in place of {{cd|[space]}} and
-  {{cd|Unsupported titles/Ancient Greek dish}} in place of a very long Greek name for a gourmet dish as found in
-  Aristophanes; (3) converts "mammoth" pagenames such as [[a]] into their appropriate split component, e.g.
-  [[a/languages A to L]].
-# The ''source translit text'' is the text as supplied to the language-specific {{cd|transliterate()}} method. The form
-  of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the corrected input text,
-  whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed
-  links are handled in the existing code.] In general, embedded links need to be right-resolved (see above), but when
-  this happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the
-  text through the transliterate mechanism, and for others (those listed with "cont" in {{cd|substitution}} in
-  [[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is
-  still unclear to me.)
-# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text. Unlike
-  for all the other text variants except the transcribed text, it is always in the Latin script.
 # The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where
-  "transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian,
+"transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian,
-  Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form.
+Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form.
-  Unlike for all the other text variants other than the transliterated text, it is always in the Latin script.
+Unlike for all the other text variants other than the transliterated text, it is always in the Latin script.
-  Currently, the transcribed text is always supplied manually by the user; there is no such thing as a
+Currently, the transcribed text is always supplied manually by the user; there is no such thing as a
-  {{cd|transcribe()}} method on language objects.
+{lua|transcribe()} method on language objects.
 # The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The
-  sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific
+sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific
-  transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it
+transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it
-  needs to be converted to display text, have embedded links removed through right-resolution and have
+needs to be converted to display text, have embedded links removed (i.e. resolving them to their right side if they
-  diacritic-stripping applied.
+are two-part links) and have entry text transformations applied.
 # There are other text variants that occur in usexes (specifically, there are normalized variants of several of the
-  above text variants), but we can skip them for now.
+above text variants), but we can skip them for now.
 The following methods exist on {Language} objects to convert between different text variants:
-# {correctInputText} (currently called {makeDisplayText}): This converts input text to corrected input text.
+# {makeDisplayText}: This converts input text to display text.
-# {stripDiacritics}: This converts to stripped display text. [FIXME: This needs some rethinking. In particular,
+# {lua|makeEntryName}: This converts input or display text to entry text. [FIXME: This needs some rethinking. In
-  {stripDiacritics} is sometimes called on input text, corrected input text or display text (in various paths inside of
+particular, {lua|makeEntryName} is sometimes called on display text (in some paths inside of [[Module:links]]) and
-  [[Module:links]], and, in the case of input text, usually from other modules). We need to make sure we don't try to
+sometimes called on input text (in other paths inside of [[Module:links]], and usually from other modules). We need
-  convert input text to display text twice, but at the same time we need to support calling it directly on input text
+to make sure we don't try to convert input text to display text twice, but at the same time we need to support
-  since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input,
+calling it directly on input text since so many modules do this. This means we need to add a parameter indicating
-  corrected input, or display text; if the former two, we call {correctInputText} ourselves.]
+whether the passed-in text is input or display text; if the former, we call {lua|makeDisplayText} ourselves.]
-# {logicalToPhysical}: This converts logical pagenames to physical pagenames.
+# {lua|transliterate}: This appears to convert input text with embedded brackets removed into a transliteration.
-# {transliterate}: This appears to convert input text with embedded brackets removed into a transliteration.
+[FIXME: This needs some rethinking. In particular, it calls {lua|processDisplayText} on its input, which won't work
-  [FIXME: This needs some rethinking. In particular, it calls {processDisplayText} on its input, which won't work
+for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the
-  for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the
+language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code;
-  language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code;
+a lot of callers remove the links themselves before calling {lua|transliterate()}, which I assume is wrong.]
-  a lot of callers remove the links themselves before calling {transliterate()}, which I assume is wrong.]
+# {lua|makeSortKey}: This converts entry text (?) to a sort key. [FIXME: Clarify this.]
-# {makeSortKey}: This converts display text (?) to a sort key. [FIXME: Clarify this.]
+]=]
-]==]
 local export = {}
 local etymology_languages_data_module = "Module:etymology languages/data"
 local families_module = "Module:families"
-local headword_page_module = "Module:headword/page"
 local json_module = "Module:JSON"
 local language_like_module = "Module:language-like"
@@ Line 145: / Line 118: @@
 local links_data_module = "Module:links/data"
 local load_module = "Module:load"
+local patterns_module = "Module:patterns"
 local scripts_module = "Module:scripts"
 local scripts_data_module = "Module:scripts/data"
 local string_encode_entities_module = "Module:string/encode entities"
-local string_pattern_escape_module = "Module:string/patternEscape"
-local string_replacement_escape_module = "Module:string/replacementEscape"
 local string_utilities_module = "Module:string utilities"
 local table_module = "Module:table"
@@ Line 188: / Line 160: @@
 local Hant_chars
-local function check_object(...)
+--[==[
-	check_object = require(utilities_module).check_object
+Loaders for functions in other modules, which overwrite themselves with the target function when called. This ensures modules are only loaded when needed, retains the speed/convenience of locally-declared pre-loaded functions, and has no overhead after the first call, since the target functions are called directly in any subsequent calls.]==]
-	return check_object(...)
+	local function check_object(...)
-end
+		check_object = require(utilities_module).check_object
+		return check_object(...)
+	end
-local function decode_entities(...)
+	local function decode_entities(...)
-	decode_entities = require(string_utilities_module).decode_entities
+		decode_entities = require(string_utilities_module).decode_entities
-	return decode_entities(...)
+		return decode_entities(...)
-end
+	end
-local function decode_uri(...)
+	local function decode_uri(...)
-	decode_uri = require(string_utilities_module).decode_uri
+		decode_uri = require(string_utilities_module).decode_uri
-	return decode_uri(...)
+		return decode_uri(...)
-end
+	end
-local function deep_copy(...)
+	local function deep_copy(...)
-	deep_copy = require(table_module).deepCopy
+		deep_copy = require(table_module).deepCopy
-	return deep_copy(...)
+		return deep_copy(...)
-end
+	end
-local function encode_entities(...)
+	local function encode_entities(...)
-	encode_entities = require(string_encode_entities_module)
+		encode_entities = require(string_encode_entities_module)
-	return encode_entities(...)
+		return encode_entities(...)
-end
+	end
-local function get_L2_sort_key(...)
+	local function get_script(...)
-	get_L2_sort_key = require(headword_page_module).get_L2_sort_key
+		get_script = require(scripts_module).getByCode
-	return get_L2_sort_key(...)
+		return get_script(...)
-end
+	end
-local function get_script(...)
+	local function find_best_script_without_lang(...)
-	get_script = require(scripts_module).getByCode
+		find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang
-	return get_script(...)
+		return find_best_script_without_lang(...)
-end
+	end
-local function find_best_script_without_lang(...)
+	local function get_family(...)
-	find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang
+		get_family = require(families_module).getByCode
-	return find_best_script_without_lang(...)
+		return get_family(...)
-end
+	end
-local function get_family(...)
+	local function get_plaintext(...)
-	get_family = require(families_module).getByCode
+		get_plaintext = require(utilities_module).get_plaintext
-	return get_family(...)
+		return get_plaintext(...)
-end
+	end
-local function get_plaintext(...)
+	local function get_wikimedia_lang(...)
-	get_plaintext = require(utilities_module).get_plaintext
+		get_wikimedia_lang = require(wikimedia_languages_module).getByCode
-	return get_plaintext(...)
+		return get_wikimedia_lang(...)
-end
+	end
-local function get_wikimedia_lang(...)
+	local function keys_to_list(...)
-	get_wikimedia_lang = require(wikimedia_languages_module).getByCode
+		keys_to_list = require(table_module).keysToList
-	return get_wikimedia_lang(...)
+		return keys_to_list(...)
-end
+	end
-local function keys_to_list(...)
+	local function list_to_set(...)
-	keys_to_list = require(table_module).keysToList
+		list_to_set = require(table_module).listToSet
-	return keys_to_list(...)
+		return list_to_set(...)
-end
+	end
-local function list_to_set(...)
+	local function load_data(...)
-	list_to_set = require(table_module).listToSet
+		load_data = require(load_module).load_data
-	return list_to_set(...)
+		return load_data(...)
-end
+	end
-local function load_data(...)
+	local function make_family_object(...)
-	load_data = require(load_module).load_data
+		make_family_object = require(families_module).makeObject
-	return load_data(...)
+		return make_family_object(...)
-end
+	end
-local function make_family_object(...)
+	local function pattern_escape(...)
-	make_family_object = require(families_module).makeObject
+		pattern_escape = require(patterns_module).pattern_escape
-	return make_family_object(...)
+		return pattern_escape(...)
-end
+	end
-local function pattern_escape(...)
+	local function remove_duplicates(...)
-	pattern_escape = require(string_pattern_escape_module)
+		remove_duplicates = require(table_module).removeDuplicates
-	return pattern_escape(...)
+		return remove_duplicates(...)
-end
+	end
-local function replacement_escape(...)
+	local function replacement_escape(...)
-	replacement_escape = require(string_replacement_escape_module)
+		replacement_escape = require(patterns_module).replacement_escape
-	return replacement_escape(...)
+		return replacement_escape(...)
-end
+	end
-local function safe_require(...)
+	local function safe_require(...)
-	safe_require = require(load_module).safe_require
+		safe_require = require(load_module).safe_require
-	return safe_require(...)
+		return safe_require(...)
-end
+	end
-local function shallow_copy(...)
+	local function shallow_copy(...)
-	shallow_copy = require(table_module).shallowCopy
+		shallow_copy = require(table_module).shallowCopy
-	return shallow_copy(...)
+		return shallow_copy(...)
-end
+	end
-local function split(...)
+	local function split(...)
-	split = require(string_utilities_module).split
+		split = require(string_utilities_module).split
-	return split(...)
+		return split(...)
-end
+	end
-local function to_json(...)
+	local function to_json(...)
-	to_json = require(json_module).toJSON
+		to_json = require(json_module).toJSON
-	return to_json(...)
+		return to_json(...)
-end
+	end
-local function u(...)
+	local function u(...)
-	u = require(string_utilities_module).char
+		u = require(string_utilities_module).char
-	return u(...)
+		return u(...)
-end
+	end
-local function ugsub(...)
+	local function ugsub(...)
-	ugsub = require(string_utilities_module).gsub
+		ugsub = require(string_utilities_module).gsub
-	return ugsub(...)
+		return ugsub(...)
-end
+	end
-local function ulen(...)
+	local function ulen(...)
-	ulen = require(string_utilities_module).len
+		ulen = require(string_utilities_module).len
-	return ulen(...)
+		return ulen(...)
-end
+	end
-local function ulower(...)
+	local function ulower(...)
-	ulower = require(string_utilities_module).lower
+		ulower = require(string_utilities_module).lower
-	return ulower(...)
+		return ulower(...)
-end
+	end
-local function umatch(...)
+	local function umatch(...)
-	umatch = require(string_utilities_module).match
+		umatch = require(string_utilities_module).match
-	return umatch(...)
+		return umatch(...)
-end
+	end
-local function uupper(...)
+	local function uupper(...)
-	uupper = require(string_utilities_module).upper
+		uupper = require(string_utilities_module).upper
-	return uupper(...)
+		return uupper(...)
-end
+	end
 local function normalize_code(code)
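
The loader functions in the hunk above all follow the same self-overwriting pattern. A minimal sketch of how it behaves, assuming a hypothetical stand-in `fake_require` in place of a real `require` call so that the number of loads can be counted:

```lua
-- Hypothetical stand-in for require(); `load_count` records how many times
-- the "module" is loaded.
local load_count = 0
local function fake_require()
	load_count = load_count + 1
	return {upper = string.upper}
end

-- Loader: `local function` puts the name in scope inside its own body, so the
-- first call can replace the local with the target function and then call it.
local function uupper(...)
	uupper = fake_require().upper
	return uupper(...)
end

local first = uupper("abc")   -- triggers the load, then calls the target
local second = uupper("def")  -- calls the target directly, no further load
```

After the first call the local name refers to the target function itself, so later calls carry no wrapper overhead.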
@@ Line 381: / Line 355: @@
 	end
 	-- Pre-substitution, of "[[" and "]]", which makes pattern matching more accurate.
-	text = gsub(text, "%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2")
+	text = gsub(text, "%f[%[]%[%[", "\1")
+		:gsub("%f[%]]%]%]", "\2")
 	local i = #subbedChars
 	for _, pattern in ipairs(patterns) do
@@ Line 405: / Line 380: @@
 		end)
 	end
-	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
+	text = gsub(text, "\1", "%[%[")
+		:gsub("\2", "%]%]")
 	return text, subbedChars
 end
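
The pre-substitution step in this hunk swaps `[[` and `]]` for the control characters `\1` and `\2` before the other patterns run, and swaps them back afterwards. A minimal sketch of the round trip, using hypothetical function names `protect_links` and `restore_links`; the replacement strings here are written as plain `"[["` and `"]]"`, which is an assumption made for portability across Lua versions and not what the module itself writes:

```lua
-- Replace "[[" and "]]" with single placeholder bytes. The frontier patterns
-- (%f[...]) only match at the start of a run of brackets, as in the module.
local function protect_links(text)
	text = text:gsub("%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2")
	return text
end

-- Put the double brackets back once the other substitutions are finished.
local function restore_links(text)
	text = text:gsub("\1", "[["):gsub("\2", "]]")
	return text
end

local protected = protect_links("see [[foo|bar]] here")
local restored = restore_links(protected)
```

Each function assigns the result to `text` before returning so that only the string is returned and the match count from `gsub` is dropped.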
@@ Line 415: / Line 391: @@
 		local byte3 = floor(i / 64) % 64 + 128
 		local byte4 = i % 64 + 128
-		text = gsub(text, "\244[" .. char(byte2) .. char(byte2+8) .. "]" .. char(byte3) .. char(byte4),
+		text = gsub(text, "\244[" .. char(byte2) .. char(byte2+8) .. "]" .. char(byte3) .. char(byte4), replacement_escape(subbedChars[i]))
-			replacement_escape(subbedChars[i]))
 	end
-	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
+	text = gsub(text, "\1", "%[%[")
+		:gsub("\2", "%]%]")
 	return text
 end
@@ Line 445: / Line 421: @@
 end
--- Subfunction of iterateSectionSubstitutions(). Process an individual chunk of text according to the specifications in
+local function doSubstitutions(self, text, sc, substitution_data, function_name, recursed)
--- `substitution_data`. The input parameters are all as in the documentation of iterateSectionSubstitutions() except for
+	local fail, cats = nil, {}
--- `recursed`, which is set to true if we called ourselves recursively to process a script-specific setting or
--- script-wide fallback. Returns two values: the processed text and the actual substitution data used to do the
--- substitutions (same as the `actual_substitution_data` return value to iterateSectionSubstitutions()).
-local function doSubstitutions(self, text, sc, substitution_data, data_field, function_name, recursed)
-	-- BE CAREFUL in this function because the value at any level can be `false`, which causes no processing to be done
-	-- and blocks any further fallback processing.
-	local actual_substitution_data = substitution_data
 	-- If there are language-specific substitutes given in the data module, use those.
 	if type(substitution_data) == "table" then
 		-- If a script is specified, run this function with the script-specific data before continuing.
 		local sc_code = sc:getCode()
-		local has_substitution_data = false
+		if substitution_data[sc_code] then
-		if substitution_data[sc_code] ~= nil then
+			text, fail, cats = doSubstitutions(self, text, sc, substitution_data[sc_code], function_name, true)
-			has_substitution_data = true
+		-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one separately.
-			if substitution_data[sc_code] then
+		elseif sc_code:match("^Han") and substitution_data.Hani then
-				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[sc_code], data_field,
+			text, fail, cats = doSubstitutions(self, text, sc, substitution_data.Hani, function_name, true)
-					function_name, true)
-			end
-		-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one
-		-- separately.
-		elseif sc_code:match("^Han") and substitution_data.Hani ~= nil then
-			has_substitution_data = true
-			if substitution_data.Hani then
-				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data.Hani, data_field,
-					function_name, true)
-			end
 		-- Substitution data with key 1 in the outer table may be given as a fallback.
-		elseif substitution_data[1] ~= nil then
+		elseif substitution_data[1] then
-			has_substitution_data = true
+			text, fail, cats = doSubstitutions(self, text, sc, substitution_data[1], function_name, true)
-			if substitution_data[1] then
-				text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[1], data_field,
-					function_name, true)
-			end
 		end
-		-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with
+		-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with the NFD decomposed forms, as this simplifies many substitutions.
-		-- the NFD decomposed forms, as this simplifies many substitutions.
 		if substitution_data.from then
-			has_substitution_data = true
 			for i, from in ipairs(substitution_data.from) do
 				-- Normalize each loop, to ensure multi-stage substitutions work correctly.
@@ Line 493: / Line 446: @@
 		if substitution_data.remove_diacritics then
-			has_substitution_data = true
 			text = sc:toFixedNFD(text)
 			-- Convert exceptions to PUA.
@@ Line 516: / Line 468: @@
 				text = text:gsub("\242[\128-\191]*", substitutes)
 			end
-		end
-		if not has_substitution_data and sc._data[data_field] then
-			-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
-			text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
-				function_name, true)
 		end
 	elseif type(substitution_data) == "string" then
@@ Line 529: / Line 476: @@
 			-- TODO: translit functions should be called with form NFD.
 			if function_name == "tr" then
-				if not module[function_name] then
+				text, fail, cats = module[function_name](text, self._code, sc:getCode())
-					error(("Internal error: Module [[%s]] has no function named 'tr'"):format(substitution_data))
-				end
-				text = module[function_name](text, self._code, sc:getCode())
-			elseif function_name == "stripDiacritics" then
-				-- FIXME, get rid of this arm after renaming makeEntryName -> stripDiacritics.
-				if module[function_name] then
-					text = module[function_name](sc:toFixedNFD(text), self, sc)
-				elseif module.makeEntryName then
-					text = module.makeEntryName(sc:toFixedNFD(text), self, sc)
-				else
-					error(("Internal error: Module [[%s]] has no function named 'stripDiacritics' or 'makeEntryName'"
-						):format(substitution_data))
-				end
 			else
-				if not module[function_name] then
+				text, fail, cats = module[function_name](sc:toFixedNFD(text), self, sc)
-					error(("Internal error: Module [[%s]] has no function named '%s'"):format(
-						substitution_data, function_name))
-				end
-				text = module[function_name](sc:toFixedNFD(text), self, sc)
 			end
 		else
 			error("Substitution data '" .. substitution_data .. "' does not match an existing module.")
 		end
-	elseif substitution_data == nil and sc._data[data_field] then
-		-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
-		text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
-			function_name, true)
 	end
 	-- Don't normalize to NFC if this is the inner loop or if a module returned nil.
 	if recursed or not text then
-		return text, actual_substitution_data
+		return text, fail, cats
 	end
 	-- Fix any discouraged sequences created during the substitution process, and normalize into the final form.
-	return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), actual_substitution_data
+	return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), fail, cats
 end
--- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate
+-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate over each one to apply substitutions. This avoids putting PUA characters through language-specific modules, which may be unequipped for them.
--- over each section to apply substitutions (e.g. transliteration or diacritic stripping). This avoids putting PUA
+local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, function_name)
--- characters through language-specific modules, which may be unequipped for them. This function is passed the following
+	local fail, cats, sections = nil, {}
--- values:
--- * `self` (the Language object);
--- * `text` (the text to process);
--- * `sc` (the script of the text, which must be specified; callers should call checkScript() as needed to autodetect the
---   script of the text if not given explicitly by the user);
--- * `subbedChars` (an array of the same length as the text, indicating which characters have been substituted and by
---   what, or {nil} if no substitutions are to happen);
--- * `keepCarets` (DOCUMENT ME);
--- * `substitution_data` (the data indicating which substitutions to apply, taken directly from `data_field` in the
---   language's data structure in a submodule of [[Module:languages/data]]);
--- * `data_field` (the data field from which `substitution_data` was fetched, such as "sort_key" or "strip_diacritics");
--- * `function_name` (the name of the function to call to do the substitution, in case `substitution_data` specifies a
---   module to do the substitution);
--- * `notrim` (don't trim whitespace at the edges of `text`; set when computing the sort key, because whitespace at the
---   beginning of a sort key is significant and causes the resulting page to be sorted at the beginning of the category
---   it's in).
--- Returns three values:
--- (1) the processed text;
--- (2) the value of `subbedChars` that was passed in, possibly modified with additional character substitutions; will be
---     {nil} if {nil} was passed in;
--- (3) the actual substitution data that was used to apply substitutions to `text`; this may be different from the value
---     of `substitution_data` passed in if that value recursively specified script-specific substitutions or if no
---     substitution data could be found in the language-specific data (e.g. {nil} was passed in or a structure was passed
---     in that had no setting for the script given in `sc`), but a script-wide fallback value was set; currently it is
---     only used by makeSortKey().
-local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, data_field,
-	function_name, notrim)
-	local sections
 	-- See [[Module:languages/data]].
-	if not find(text, "\244") or load_data(languages_data_module).substitution[self._code] == "cont" then
+	if not find(text, "\244") or (load_data(languages_data_module).substitution[self._code] == "cont") then
 		sections = {text}
 	else
 		sections = split(text, "\244[\128-\143][\128-\191]*", true)
 	end
-	local actual_substitution_data
 	for _, section in ipairs(sections) do
-		-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated
+		-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated modules).
-		-- modules).
 		if gsub(section, "%s+", "") ~= "" then
-			local sub, this_actual_substitution_data = doSubstitutions(self, section, sc, substitution_data, data_field,
+			local sub, sub_fail, sub_cats = doSubstitutions(self, section, sc, substitution_data, function_name)
-				function_name)
+			-- Second round of temporary substitutions, in case any formatting was added by the main substitution process. However, don't do this if the section contains formatting already (as it would have had to have been escaped to reach this stage, and therefore should be given as raw text).
-			actual_substitution_data = this_actual_substitution_data
-			-- Second round of temporary substitutions, in case any formatting was added by the main substitution
-			-- process. However, don't do this if the section contains formatting already (as it would have had to have
-			-- been escaped to reach this stage, and therefore should be given as raw text).
 			if sub and subbedChars then
 				local noSub
@@ Line 626: / Line 518: @@
 				end
 			end
-			if not sub then
+			if (not sub) or sub_fail then
 				text = sub
+				fail = sub_fail
+				cats = sub_cats or {}
 				break
 			end
 			text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text
+			if type(sub_cats) == "table" then
+				for _, cat in ipairs(sub_cats) do
+					insert(cats, cat)
+				end
+			end
 		end
 	end
-	if not notrim then
+	-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.
-		-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.
+	text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2")
-		-- Do not trim sort keys because spaces at the beginning are significant.
+		:gsub("(%S)%s+([\128-\191\244]*)$", "%1%2")
-		text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2"):gsub("(%S)%s+([\128-\191\244]*)$", "%1%2") or
-			nil
+	-- Remove duplicate categories.
+	if #cats > 1 then
+		cats = remove_duplicates(cats)
 	end
-	return text, subbedChars, actual_substitution_data
+	return text, fail, cats, subbedChars
 end
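The category handling on the `+` side of this hunk (append each section's `sub_cats`, then de-duplicate when more than one category was collected) can be sketched in isolation as follows. This is a minimal illustration, not the module's code: `merge_categories` is a hypothetical name, and the `remove_duplicates` defined here is only a stand-in for whatever helper the module actually imports.

```lua
-- Hypothetical stand-in for the module's remove_duplicates helper:
-- keeps the first occurrence of each value, preserving order.
local function remove_duplicates(list)
	local seen, out = {}, {}
	for _, v in ipairs(list) do
		if not seen[v] then
			seen[v] = true
			out[#out + 1] = v
		end
	end
	return out
end

-- Hypothetical helper mirroring the loop in the hunk: collect the categories
-- returned for each section, skipping any non-table value.
local function merge_categories(per_section_cats)
	local cats = {}
	for _, sub_cats in ipairs(per_section_cats) do
		if type(sub_cats) == "table" then
			for _, cat in ipairs(sub_cats) do
				table.insert(cats, cat)
			end
		end
	end
	-- As in the hunk, only de-duplicate when there is more than one entry.
	if #cats > 1 then
		cats = remove_duplicates(cats)
	end
	return cats
end

-- The middle entry is not a table, so it is skipped; "B" is collected twice
-- and then reduced to one entry.
local merged = merge_categories({ {"A", "B"}, false, {"B", "C"} })
```

The `type(sub_cats) == "table"` guard matters because a substitution module is free to return nothing, or a non-table value, in the third position.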
@@ Line 650: / Line 551: @@
 		text, rep = gsub(text, "\\\\(\\*^)", "\3%1")
 	until rep == 0
-	return (text:gsub("\\^", "\4")
+	return text:gsub("\\^", "\4")
 		:gsub(pattern or "%^", repl or "")
 		:gsub("\3", "\\")
-		:gsub("\4", "^"))
+		:gsub("\4", "^")
 end
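Several hunks in this diff differ only in whether a `gsub` call is wrapped in an extra pair of parentheses. In Lua that wrapping is not cosmetic: `string.gsub` returns two values (the new string and the number of substitutions), and parentheses around a call truncate the result to the first value. A minimal sketch, using hypothetical helper names:

```lua
-- string.gsub returns the new string AND the substitution count.
local function strip_dots_bare(s)
	return s:gsub("%.", "")     -- passes both values to the caller
end

local function strip_dots_wrapped(s)
	return (s:gsub("%.", ""))   -- parentheses keep only the string
end

local a, b = strip_dots_bare("a.b.c")
local c, d = strip_dots_wrapped("a.b.c")

-- The extra value leaks into argument lists when the call comes last.
local function count_args(...)
	return select("#", ...)
end
local n_bare = count_args(strip_dots_bare("a.b.c"))
local n_wrapped = count_args(strip_dots_wrapped("a.b.c"))
```

The count is silently dropped when the call is not the last expression in a list, so the unwrapped form is harmless in some positions. It does change behaviour when the call is the last argument or the whole `return` expression.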
@@ Line 807: / Line 708: @@
 	Language.hasType = require(language_like_module).hasType
 	return self:hasType(...)
+end
+function Language:getMainCategoryName()
+	return self._data.main_category or "lemma"
 end
@@ Line 865: / Line 770: @@
 function Language:makeWikipediaLink()
 	return make_link(self, (self:hasType("conlang") and self:getCanonicalName() or "w:" .. self:getWikipediaArticle()), self:getCanonicalName())
-end
-function Language:getMainCategoryName()
-	return self._data.main_category or "lemma"
 end
@@ Line 980: / Line 881: @@
 			local t, s, found = 0, 0
 			-- This is faster than using mw.ustring.gmatch directly.
-			for ch in gmatch((ugsub(text, "[" .. Hani.characters .. "]", "\255%0")), "\255(.[\128-\191]*)") do
+			for ch in gmatch(ugsub(text, "[" .. Hani.characters .. "]", "\255%0"), "\255(.[\128-\191]*)") do
 				found = true
 				if Hant_chars[ch] then
@@ Line 1,009: / Line 910: @@
 			-- Count characters by removing everything in the script's charset and comparing to the original length.
 			local charset = sc.characters
-			local count = charset and length - ulen((ugsub(text, "[" .. charset .. "]+", ""))) or 0
+			local count = charset and length - ulen(ugsub(text, "[" .. charset .. "]+", "")) or 0
 			if count >= length then
@@ Line 1,279: / Line 1,180: @@
 		local ancestorsParents = {}
 		for _, ancestor in ipairs(ancestors) do
-			-- When checking the parents of the other language, and the ancestor is also a parent, skip to the next ancestor, so that we exclude any etymology-only children of that parent that are not directly related (see below).
+			local ret = func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)
-			local ret = (parent_check or not node:hasParent(ancestor)) and
-				func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)
 			if ret then
 				return ret
@@ Line 1,550: / Line 1,449: @@
 function Language:getStandardCharacters(sc)
-	local standard_chars = self._data.standard_chars
+	local standard_chars = self._data.standardChars
 	if type(standard_chars) ~= "table" then
 		return standard_chars
@@ Line 1,569: / Line 1,468: @@
 end
---[==[
+--[==[Make the entry name (i.e. the correct page name).]==]
-Strip diacritics from display text `text` (in a language-specific fashion), which is in the script `sc`. If `sc` is
+function Language:makeEntryName(text, sc)
-omitted or {nil}, the script is autodetected. This also strips certain punctuation characters from the end and (in the
-case of Spanish upside-down question mark and exclamation points) from the beginning; strips any whitespace at the
-end of the text or between the text and final stripped punctuation characters; and applies some language-specific
-Unicode normalizations to replace discouraged characters with their prescribed alternatives. Return the stripped text.
-]==]
-function Language:stripDiacritics(text, sc)
 	if (not text) or text == "" then
-		return text
+		return text, nil, {}
 	end
-	sc = checkScript(text, self, sc)
+	-- Set `unsupported` as true if certain conditions are met.
+	local unsupported
-	text = normalize(text, sc)
+	-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for relative paths in links, as are 3 or more consecutive tildes.
-	-- FIXME, rename makeEntryName to stripDiacritics and get rid of second and third return values
+	-- Note: match is faster with magic characters/charsets; find is faster with plaintext.
-	-- everywhere
-	text, _, _ = iterateSectionSubstitutions(self, text, sc, nil, nil,
-		self._data.strip_diacritics or self._data.entry_name, "strip_diacritics", "stripDiacritics")
-	text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟？！︖︕।॥။၊་།]?$") or text
-	return text
-end
---[==[
-Convert a ''logical'' pagename (the pagename as it appears to the user, after diacritics and punctuation have been
-stripped) to a ''physical'' pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference
-between the two are (a) unsupported titles such as `[ ]` (with square brackets in them), `#` (pound/hash sign) and
-`¯\_(ツ)_/¯` (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into
-parts (e.g. `a`, which is split into physical pagenames `a/languages A to L` and `a/languages M to Z`). For almost all
-purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical
-pagenames, such as checking the existence of a page or retrieving a page's contents.
-`pagename` is the logical pagename to be converted. `is_reconstructed_or_appendix` indicates whether the page is in the
-`Reconstruction` or `Appendix` namespaces. If it is omitted or has the value {nil}, the pagename is checked for an
-initial asterisk, and if found, the page is assumed to be a `Reconstruction` page. Setting a value of `false` or `true`
-to `is_reconstructed_or_appendix` disables this check and allows for mainspace pagenames that begin with an asterisk.
-]==]
-function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)
-	-- FIXME: This probably shouldn't happen but it happens when makeEntryName() receives nil.
-	if pagename == nil then
-		return nil
-	end
-	local initial_asterisk
-	if is_reconstructed_or_appendix == nil then
-		local pagename_minus_initial_asterisk
-		initial_asterisk, pagename_minus_initial_asterisk = pagename:match("^(%*)(.*)$")
-		if pagename_minus_initial_asterisk then
-			is_reconstructed_or_appendix = true
-			pagename = pagename_minus_initial_asterisk
-		elseif self:hasType("appendix-constructed") then
-			is_reconstructed_or_appendix = true
-		end
-	end
-	if not is_reconstructed_or_appendix then
-		-- Check if the pagename is a listed unsupported title.
-		local unsupportedTitles = load_data(links_data_module).unsupported_titles
-		if unsupportedTitles[pagename] then
-			return "Unsupported titles/" .. unsupportedTitles[pagename]
-		end
-	end
-	-- Set `unsupported` as true if certain conditions are met.
-	local unsupported
-	-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed
-	-- directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for
-	-- relative paths in links, as are 3 or more consecutive tildes. Note: match is faster with magic
-	-- characters/charsets; find is faster with plaintext.
 	if (
-		match(pagename, "[#<>%[%]_{|}]") or
+		match(text, "[#<>%[%]_{|}]") or
-		find(pagename, "\239\191\189") or
+		find(text, "\239\191\189") or
-		match(pagename, "%f[^%z/]%.%.?%f[%z/]") or
+		match(text, "%f[^%z/]%.%.?%f[%z/]") or
-		find(pagename, "~~~")
+		find(text, "~~~")
 	) then
 		unsupported = true
 	-- If it looks like an interwiki link.
-	elseif find(pagename, ":") then
+	elseif find(text, ":") then
-		local prefix = gsub(pagename, "^:*(.-):.*", ulower)
+		local prefix = gsub(text, "^:*(.-):.*", ulower)
 		if (
 			load_data("Module:data/namespaces")[prefix] or
@@ Line 1,656: / Line 1,496: @@
 	end
-	-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of
+	-- Check if the text is a listed unsupported title.
-	-- it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported
+	local unsupportedTitles = load_data(links_data_module).unsupported_titles
-	-- titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded
+	if unsupportedTitles[text] then
-	-- from the earlier test).
+		return "Unsupported titles/" .. unsupportedTitles[text], nil, {}
+	end
+	sc = checkScript(text, self, sc)
+	local fail, cats
+	text = normalize(text, sc)
+	text, fail, cats = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.entry_name, "makeEntryName")
+	text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟？！︖︕।॥။၊་།]?$") or text
+	-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded from the earlier test).
 	if unsupported then
-		-- FIXME: This conversion needs to be different for reconstructed pages with unsupported characters. There
-		-- aren't any currently, but if there ever are, we need to fix this e.g. to put them in something like
-		-- Reconstruction:Proto-Indo-European/Unsupported titles/`lowbar``num`.
 		local unsupported_characters = load_data(links_data_module).unsupported_characters
-		pagename = pagename:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)
+		text = text:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)
 			:gsub("%f[^%z/]%.%.?%f[%z/]", function(m)
-				return (gsub(m, "%.", "`period`"))
+				return gsub(m, "%.", "`period`")
 			end)
 			:gsub("~~~+", function(m)
-				return (gsub(m, "~", "`tilde`"))
+				return gsub(m, "~", "`tilde`")
 			end)
-		pagename = "Unsupported titles/" .. pagename
+		text = "Unsupported titles/" .. text
-	elseif not is_reconstructed_or_appendix then
+	end
-		-- Check if this is a mammoth page. If so, which subpage should we link to?
-		local m_links_data = load_data(links_data_module)
-		local mammoth_page_type = m_links_data.mammoth_pages[pagename]
-		if mammoth_page_type then
-			local canonical_name = self:getFullName()
-			if canonical_name ~= "Translingual" and canonical_name ~= "English" then
-				local this_subpage
-				local L2_sort_key = get_L2_sort_key(canonical_name)
-				for _, subpage_spec in ipairs(m_links_data.mammoth_page_subpage_types[mammoth_page_type]) do
-					-- unpack() fails utterly on data loaded using mw.loadData() even if offsets are given
-					local subpage, pattern = subpage_spec[1], subpage_spec[2]
-					if pattern == true or L2_sort_key:match(pattern) then
-						this_subpage = subpage
-						break
-					end
-				end
-				if not this_subpage then
-					error(("Internal error: Bad data in mammoth_page_subpage_pages in [[Module:links/data]] for mammoth page %s, type %s; last entry didn't have 'true' in it"):format(
-						pagename, mammoth_page_type))
-				end
-				pagename = pagename .. "/" .. this_subpage
-			end
-		end
-	end
-	return (initial_asterisk or "") .. pagename
-end
---[==[
-Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename.
-This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated
-and will be going away. It is a simple composition of `self:stripDiacritics` and `self:logicalToPhysical`; most callers
-only want the former, and if you need both, call them both yourself.
-`text` and `sc` are as in `self:stripDiacritics`, and `is_reconstructed_or_appendix` is as in `self:logicalToPhysical`.
+	return text, fail, cats
-]==]
-function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)
-	return self:logicalToPhysical(self:stripDiacritics(text, sc), is_reconstructed_or_appendix)
 end
 --[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==]
@@ Line 1,725: / Line 1,537: @@
 end
---[==[Creates a sort key for the given stripped text, following the rules appropriate for the language. This removes
+--[==[Creates a sort key for the given entry name, following the rules appropriate for the language. This removes diacritical marks from the entry name if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.
-diacritical marks from the stripped text if they are not considered significant for sorting, and may perform some other
+The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the entry name and returns a sortkey.]==]
-changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.
-The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the stripped text and returns a sortkey.]==]
 function Language:makeSortKey(text, sc)
 	if (not text) or text == "" then
-		return text
+		return text, nil, {}
 	end
-	-- Remove directional characters, bold, italics, soft hyphens, strip markers and HTML tags.
+	-- Remove directional characters, soft hyphens, strip markers and HTML tags.
-	-- FIXME: Partly duplicated with remove_formatting() in [[Module:links]].
 	text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "")
-	text = text:gsub("('*)'''(.-'*)'''", "%1%2"):gsub("('*)''(.-'*)''", "%1%2")
 	text = gsub(unstrip(text), "<[^<>]+>", "")
@@ Line 1,756: / Line 1,564: @@
 		text = sc:toFixedNFD(text)
 	end
-	-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is
+	-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is necessary so as to prevent "i" and "ı" both being sorted as "I".
-	-- usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as
+	-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive to changes in capitalization (as it changes the target page).
-	-- conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the
+	local fail, cats
-	-- sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is
-	-- necessary so as to prevent "i" and "ı" both being sorted as "I".
-	--
-	-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive
-	-- to changes in capitalization (as it changes the target page).
 	if not sc:sortByScraping() then
 		text = ulower(text)
 	end
-	local actual_substitution_data
+	local sort_key = self._data.sort_key
-	-- Don't trim whitespace here because it's significant at the beginning of a sort key or sort base.
+	text, fail, cats = iterateSectionSubstitutions(self, text, sc, nil, nil, sort_key, "makeSortKey")
-	text, _, actual_substitution_data = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.sort_key,
-		"sort_key", "makeSortKey", "notrim")
 	if not sc:sortByScraping() then
-		if self:hasDottedDotlessI() and not actual_substitution_data then
+		if self:hasDottedDotlessI() and not sort_key then
-			text = text:gsub("ı", "I"):gsub("i", "İ")
+			text = gsub(gsub(text, "ı", "I"), "i", "İ")
 			text = sc:toFixedNFC(text)
 		end
@@ Line 1,782: / Line 1,583: @@
 	-- Remove parentheses, as long as they are either preceded or followed by something.
-	text = gsub(text, "(.)[()]+", "%1"):gsub("[()]+(.)", "%1")
+	text = gsub(text, "(.)[()]+", "%1")
+		:gsub("[()]+(.)", "%1")
 	text = escape_risky_characters(text)
-	return text
+	return text, fail, cats
 end
---[==[Create the form used as a basis for display text and transliteration. FIXME: Rename to correctInputText().]==]
+--[==[Create the form used as a basis for display text and transliteration.]==]
 local function processDisplayText(text, self, sc, keepCarets, keepPrefixes)
 	local subbedChars = {}
@@ Line 1,797: / Line 1,599: @@
 	sc = checkScript(text, self, sc)
+	local fail, cats
 	text = normalize(text, sc)
-	text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text,
+	text, fail, cats, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text, "makeDisplayText")
-		"display_text", "makeDisplayText")
 	text = removeCarets(text, sc)
@@ Line 1,812: / Line 1,614: @@
 		while true do
 			local prefix = gsub(text, "^(.-):.+", function(m1)
-				return (gsub(m1, "\244[\128-\191]*", ""))
+				return gsub(m1, "\244[\128-\191]*", "")
 			end)
 			-- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace.
@@ Line 1,827: / Line 1,629: @@
 			end)
 		end
-		text = gsub(text, "\3", "\\"):gsub("\4", ":")
+		text = gsub(text, "\3", "\\")
+			:gsub("\4", ":")
+	end
+	--[[if not self:hasType("conlang") then
+		text = gsub(text,"^%*", "")
 	end
+	text = gsub(text,"^%*%*", "*")]]
-	return text, subbedChars
+	return text, fail, cats, subbedChars
 end
 --[==[Make the display text (i.e. what is displayed on the page).]==]
 function Language:makeDisplayText(text, sc, keepPrefixes)
-	if not text or text == "" then
+	if (not text) or text == "" then
-		return text
+		return text, nil, {}
 	end
-	local subbedChars
+	local fail, cats, subbedChars
-	text, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)
+	text, fail, cats, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)
 	text = escape_risky_characters(text)
-	return undoTempSubstitutions(text, subbedChars)
+	return undoTempSubstitutions(text, subbedChars), fail, cats
 end
---[==[Transliterates the text from the given script into the Latin script (see
+--[==[Transliterates the text from the given script into the Latin script (see [[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to work; if it is not present, {{code|lua|nil}} is returned.
-[[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to
+Returns three values:
-work; if it is not present, {{code|lua|nil}} is returned.
+# The transliteration.
+# A boolean which indicates whether the transliteration failed for an unexpected reason. If {{code|lua|false}}, then the transliteration either succeeded, or the module is returning nothing in a controlled way (e.g. the input was {{code|lua|"-"}}). Generally, this means that no maintenance action is required. If {{code|lua|true}}, then the transliteration is {{code|lua|nil}} because either the input or output was defective in some way (e.g. [[Module:ar-translit]] will not transliterate non-vocalised inputs, and this module will fail partially-completed transliterations in all languages). Note that this value can be manually set by the transliteration module, so make sure to cross-check to ensure it is accurate.
-The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that
+# A table of categories selected by the transliteration module, which should be in the format expected by {{code|lua|format_categories}} in [[Module:utilities]].
-module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the
+The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the possible scripts that the module can transliterate, and will show an error if it's not one of them. For this reason, the <code>sc</code> parameter should always be provided when writing non-language-specific code.
-possible scripts that the module can transliterate, and will throw an error if it's not one of them. For this reason,
+The <code>module_override</code> parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by [[Wiktionary:Tracking/languages/module_override]].
-the <code>sc</code> parameter should always be provided when writing non-language-specific code.
-The <code>module_override</code> parameter is used to override the default module that is used to provide the
-transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no
-default module yet, or you want to demonstrate an alternative version of a transliteration module before making it
-official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked
-by [[wikt:Wiktionary:Tracking/languages/module_override]].
 '''Known bugs''':
 * This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assertion fails, wikitext markups like <nowiki>'''</nowiki> can cause wrong transliterations.
-* HTML entities like <code>&amp;apos;</code>, often used to escape wikitext markups, do not work.
+* HTML entities like <code>&amp;apos;</code>, often used to escape wikitext markups, do not work.]==]
-]==]
 function Language:transliterate(text, sc, module_override)
 	-- If there is no text, or the language doesn't have transliteration data and there's no override, return nil.
-	if not text or text == "" or text == "-" then
+	if not (self._data.translit or module_override) then
-		return text
+		return nil, false, {}
+	elseif (not text) or text == "" or text == "-" then
+		return text, false, {}
 	end
 	-- If the script is not transliteratable (and no override is given), return nil.
 	sc = checkScript(text, self, sc)
 	if not (sc:isTransliterated() or module_override) then
-		return nil
+		return nil, true, {}
 	end
@@ Line 1,882: / Line 1,686: @@
 	-- Get the display text with the keepCarets flag set.
-	local subbedChars
+	local fail, cats, subbedChars
 	if processed then
-		text, subbedChars = processDisplayText(text, self, sc, true)
+		text, fail, cats, subbedChars = processDisplayText(text, self, sc, true)
+	end
+	-- Transliterate (using the module override if applicable).
+	text, fail, cats, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or self._data.translit, "tr")
+	if not text then
+		return nil, true, cats
 	end
-	-- Transliterate (using the module override if applicable).
+	-- Incomplete transliterations return nil.
-	text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or
+	local charset = sc.characters
-		self._data.translit, "translit", "tr")
+	if charset and umatch(text, "[" .. charset .. "]") then
+		-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are false positives), as well as any PUA substitutions. Anything remaining should only be script code "None" (e.g. numerals).
-	if not text then
+		local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "􀀀-􏿽]+", "")
-		return nil
+		-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be returned.
-	end
+		if find_best_script_without_lang(check_text, true):getCode() ~= "None" then
+			return nil, true, cats
-	-- Incomplete transliterations return nil.
+		end
-	local charset = sc.characters
+	end
-	if charset and umatch(text, "[" .. charset .. "]") then
-		-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are
+	if processed then
-		-- false positives), as well as any PUA substitutions. Anything remaining should only be script code "None"
+		text = escape_risky_characters(text)
-		-- (e.g. numerals).
+		text = undoTempSubstitutions(text, subbedChars)
-		local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "􀀀-􏿽]+", "")
+	end
-		-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be
-		-- returned.
+	-- If the script does not use capitalization, then capitalize any letters of the transliteration which are immediately preceded by a caret (and remove the caret).
-		if find_best_script_without_lang(check_text, true):getCode() ~= "None" then
+	if text and not sc:hasCapitalization() and text:find("^", 1, true) then
-			return nil
+		text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)
-		end
+			return m1 .. uupper(m2)
-	end
+		end)
-	if processed then
-		text = escape_risky_characters(text)
-		text = undoTempSubstitutions(text, subbedChars)
 	end
-	-- If the script does not use capitalization, then capitalize any letters of the transliteration which are
+	fail = text == nil and (not not fail) or false
-	-- immediately preceded by a caret (and remove the caret).
-	if text and not sc:hasCapitalization() and text:find("^", 1, true) then
-		text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)
-			return m1 .. uupper(m2)
-		end)
-	end
-	return text
+	return text, fail, cats
 end
@@ Line 1,961: / Line 1,762: @@
 function Language:toJSON(opts)
-	local strip_diacritics, strip_diacritics_patterns, strip_diacritics_remove_diacritics = self._data.strip_diacritics
+	local entry_name, entry_name_patterns, entry_name_remove_diacritics = self._data.entry_name
-	if strip_diacritics then
+	if entry_name then
-		if strip_diacritics.from then
+		if entry_name.from then
-			strip_diacritics_patterns = {}
+			entry_name_patterns = {}
-			for i, from in ipairs(strip_diacritics.from) do
+			for i, from in ipairs(entry_name.from) do
-				insert(strip_diacritics_patterns, {from = from, to = strip_diacritics.to[i] or ""})
+				insert(entry_name_patterns, {from = from, to = entry_name.to[i] or ""})
 			end
 		end
-		strip_diacritics_remove_diacritics = strip_diacritics.remove_diacritics
+		entry_name_remove_diacritics = entry_name.remove_diacritics
 	end
 	-- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object().
-	-- props should either contain zero-argument functions to compute the value, or the value itself.
+	local ret = {
-	local props = {
+		ancestors = self:getAncestorCodes(),
-		ancestors = function() return self:getAncestorCodes() end,
+		canonicalName = self:getCanonicalName(),
-		canonicalName = function() return self:getCanonicalName() end,
+		categoryName = self:getCategoryName("nocap"),
-		categoryName = function() return self:getCategoryName("nocap") end,
 		code = self._code,
 		mainCode = self._mainCode,
-		parent = function() return self:getParentCode() end,
+		parent = self:getParentCode(),
-		full = function() return self:getFullCode() end,
+		full = self:getFullCode(),
-		stripDiacriticsPatterns = strip_diacritics_patterns,
+		entryNamePatterns = entry_name_patterns,
-		stripDiacriticsRemoveDiacritics = strip_diacritics_remove_diacritics,
+		entryNameRemoveDiacritics = entry_name_remove_diacritics,
-		family = function() return self:getFamilyCode() end,
+		family = self:getFamilyCode(),
-		aliases = function() return self:getAliases() end,
+		aliases = self:getAliases(),
-		varieties = function() return self:getVarieties() end,
+		varieties = self:getVarieties(),
-		otherNames = function() return self:getOtherNames() end,
+		otherNames = self:getOtherNames(),
-		scripts = function() return self:getScriptCodes() end,
+		scripts = self:getScriptCodes(),
-		type = function() return keys_to_list(self:getTypes()) end,
+		type = keys_to_list(self:getTypes()),
-		wikimediaLanguages = function() return self:getWikimediaLanguageCodes() end,
+		wikimediaLanguages = self:getWikimediaLanguageCodes(),
-		wikidataItem = function() return self:getWikidataItem() end,
+		wikidataItem = self:getWikidataItem(),
-		wikipediaArticle = function() return self:getWikipediaArticle(true) end,
+		wikipediaArticle = self:getWikipediaArticle(true),
 	}
-	local ret = {}
-	for prop, val in pairs(props) do
-		if not opts.skip_fields or not opts.skip_fields[prop] then
-			if type(val) == "function" then
-				ret[prop] = val()
-			else
-				ret[prop] = val
-			end
-		end
-	end
 	-- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`.
 	return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts)
@@ Line 2,134: / Line 1,923: @@
 	--[==[
-	<span style="color: var(--wikt-palette-red,#BA0000)">This function is not for use in entries or other content pages.</span>
+	<span style="color: #BA0000">This function is not for use in entries or other content pages.</span>
 	Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects.
 	-- Do NOT use these methods!