Module:languages: Difference between revisions
| Line 1: | Line 1: | ||
--[ | --[=[ | ||
This module implements fetching of language-specific information and processing text in a given language. | This module implements fetching of language-specific information and processing text in a given language. | ||
There are two types of languages: full languages and etymology-only languages. The essential difference is that only | There are two types of languages: full languages and etymology-only languages. The essential difference is that only | ||
| Line 9: | Line 7: | ||
their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only | their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only | ||
language as their parent, a full language can always be derived by following the parent links upwards. For example, | language as their parent, a full language can always be derived by following the parent links upwards. For example, | ||
"Canadian French", code | "Canadian French", code 'fr-CA', is an etymology-only language whose parent is the full language "French", code 'fr'. | ||
An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code | An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code | ||
'ang-nor', which has "Anglian Old English", code 'ang-ang' as its parent; this is an etymology-only language whose | |||
parent is "Old English", code | parent is "Old English", code "ang", which is a full language. (This is because Northumbrian Old English is considered | ||
a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code | a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code 'und'; this is the case, | ||
for example, for "substrate" languages such as "Pre-Greek", code | for example, for "substrate" languages such as "Pre-Greek", code 'qsb-grc', and "the BMAC substrate", code 'qsb-bma'. | ||
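As a rough illustration of following the parent links upwards, the sketch below uses a hypothetical {parents} table and {is_full} set, filled only with codes mentioned above. Neither name is part of this module's data format, and treating 'und' as a full language is an assumption made for the example. It is a minimal standalone sketch, not the module's actual lookup.

```lua
-- Hypothetical data for illustration only; the real data modules are laid out differently.
local parents = {
	["fr-CA"] = "fr",
	["ang-nor"] = "ang-ang",
	["ang-ang"] = "ang",
	["qsb-grc"] = "und",
}
-- Assumption: these codes are treated as full languages in this example.
local is_full = { fr = true, ang = true, und = true }

-- Follow parent links upwards until a full language is reached.
-- Returns nil if the chain runs out before reaching one.
local function get_full_code(code)
	local seen = {}
	while not is_full[code] do
		if seen[code] then
			error("cycle in parent links at '" .. code .. "'")
		end
		seen[code] = true
		code = parents[code]
		if code == nil then
			return nil
		end
	end
	return code
end

print(get_full_code("ang-nor")) --> ang
```

A full language code is returned unchanged, since the loop body never runs for it.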
It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one | It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one | ||
of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant | of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant | ||
relationship is one of descent in time. For example, "Classical Latin", code | relationship is one of descent in time. For example, "Classical Latin", code 'la-cla', and "Late Latin", code 'la-lat', | ||
are both etymology-only languages with "Latin", code | are both etymology-only languages with "Latin", code 'la', as their parents, because both of the former are varieties | ||
of Latin. However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of | of Latin. However, Late Latin does *NOT* have Classical Latin as its parent because Late Latin is *not* a variety of | ||
Classical Latin; rather, it is a descendant. There is in fact a separate | Classical Latin; rather, it is a descendant. There is in fact a separate 'ancestors' field that is used to express the | ||
ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note | ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note | ||
that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens, | that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens, | ||
for example, with "Old Italian" (code | for example, with "Old Italian" (code 'roa-oit'), which is an etymology-only variant of full language "Italian" (code | ||
'it'), and with "Old Latin" (code 'itc-ola'), which is an etymology-only variant of Latin. In both cases, the full | |||
language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin | language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin | ||
using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance | using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance | ||
| Line 50: | Line 48: | ||
functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a | functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a | ||
{Language} object (depending on whether the canonical name refers to a full or etymology-only language). | {Language} object (depending on whether the canonical name refers to a full or etymology-only language). | ||
Textual strings belonging to a given language come in several different ''text variants'': | Textual strings belonging to a given language come in several different ''text variants'': | ||
# The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}}, | # The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}}, | ||
{{tl|t}}, {{tl|lang}} and the like. | |||
# The ''display text'' is the text in the form as it will be displayed to the user. This can include accent marks that | |||
are stripped to form the entry text (see below), as well as embedded bracketed links that are variously processed | |||
further. The display text is generated from the input text by applying language-specific transformations; for most | |||
languages, there will be no such transformations. Examples of transformations are bad-character replacements for | |||
certain languages (e.g. replacing 'l' or '1' with [[palochka]] in certain languages in Cyrillic); and for Thai and | |||
# The ''display text'' is the text in the form as it will be displayed to the user | Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as [กรีน/กฺรีน], | ||
which indicate how to transliterate given words. | |||
# The ''entry text'' is the text in the form used to generate a link to a Wiktionary entry. This is usually generated | |||
from the display text by stripping certain sorts of diacritics on a per-language basis, and sometimes doing other | |||
transformations. The concept of ''entry text'' only really makes sense for text that does not contain embedded links, | |||
meaning that display text containing embedded links will need to have the links individually processed to get | |||
per-link entry text in order to generate the resolved display text (see below). | |||
# The ''resolved display text'' is the result of resolving embedded links in the display text (e.g. converting them to | |||
two-part links where the first part has entry-text transformations applied, and adding appropriate language-specific | |||
fragments) and adding appropriate language and script tagging. This text can be passed directly to MediaWiki for | |||
display. | |||
# The ''source translit text'' is the text as supplied to the language-specific {transliterate()} method. The form of | |||
the source translit text may need to be language-specific, e.g. Thai and Khmer will need the full unprocessed input | |||
text, whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded | |||
bracketed links are handled in the existing code.] In general, embedded links need to be removed (i.e. converted to | |||
their "bare display" form by taking the right part of two-part links and removing double brackets), but when this | |||
happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the | |||
text through the transliterate mechanism, and for others (those listed with "cont" in {substitution} in | |||
[[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is | |||
still unclear to me.) | |||
# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text. | |||
Unlike for all the other text variants except the transcribed text, it is always in the Latin script. | |||
# The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where | # The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where | ||
"transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian, | |||
Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form. | |||
Unlike for all the other text variants other than the transliterated text, it is always in the Latin script. | |||
Currently, the transcribed text is always supplied manually by the user; there is no such thing as a | |||
{lua|transcribe()} method on language objects. | |||
# The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The | # The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The | ||
sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific | |||
transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it | |||
needs to be converted to display text, have embedded links removed (i.e. resolving them to their right side if they | |||
are two-part links) and have entry text transformations applied. | |||
# There are other text variants that occur in usexes (specifically, there are normalized variants of several of the | # There are other text variants that occur in usexes (specifically, there are normalized variants of several of the | ||
above text variants), but we can skip them for now. | |||
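Several of the variants above depend on reducing embedded links to their "bare display" form, i.e. taking the right part of two-part links and removing double brackets. A minimal sketch of that one step follows. The function name {remove_links} is hypothetical and this is not the code used in [[Module:links]]; it ignores fragments, nested brackets and other edge cases.

```lua
-- Convert embedded wikilinks to their bare display form:
-- [[target|shown]] becomes "shown", and [[target]] becomes "target".
local function remove_links(text)
	-- Two-part links: keep only the right-hand part.
	text = text:gsub("%[%[[^%[%]|]*|([^%[%]]*)%]%]", "%1")
	-- One-part links: drop the double brackets.
	text = text:gsub("%[%[([^%[%]]*)%]%]", "%1")
	return text
end

print(remove_links("a [[dog|hound]] and a [[cat]]")) --> a hound and a cat
```

Entry-text transformations would then be applied to the result, or to each link target separately when building resolved display text.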
The following methods exist on {Language} objects to convert between different text variants: | The following methods exist on {Language} objects to convert between different text variants: | ||
# | # {makeDisplayText}: This converts input text to display text. | ||
# { | # {lua|makeEntryName}: This converts input or display text to entry text. [FIXME: This needs some rethinking. In | ||
particular, {lua|makeEntryName} is sometimes called on display text (in some paths inside of [[Module:links]]) and | |||
sometimes called on input text (in other paths inside of [[Module:links]], and usually from other modules). We need | |||
to make sure we don't try to convert input text to display text twice, but at the same time we need to support | |||
calling it directly on input text since so many modules do this. This means we need to add a parameter indicating | |||
whether the passed-in text is input or display text; if the former, we call {lua|makeDisplayText} ourselves.] | |||
# { | # {lua|transliterate}: This appears to convert input text with embedded brackets removed into a transliteration. | ||
[FIXME: This needs some rethinking. In particular, it calls {lua|processDisplayText} on its input, which won't work | |||
for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the | |||
language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code; | |||
a lot of callers remove the links themselves before calling {lua|transliterate()}, which I assume is wrong.] | |||
# {lua|makeSortKey}: This converts entry text (?) to a sort key. [FIXME: Clarify this.] | |||
# {makeSortKey}: This converts | ]=] | ||
] | |||
local export = {} | local export = {} | ||
local etymology_languages_data_module = "Module:etymology languages/data" | local etymology_languages_data_module = "Module:etymology languages/data" | ||
local families_module = "Module:families" | local families_module = "Module:families" | ||
local json_module = "Module:JSON" | local json_module = "Module:JSON" | ||
local language_like_module = "Module:language-like" | local language_like_module = "Module:language-like" | ||
| Line 145: | Line 118: | ||
local links_data_module = "Module:links/data" | local links_data_module = "Module:links/data" | ||
local load_module = "Module:load" | local load_module = "Module:load" | ||
local patterns_module = "Module:patterns" | |||
local scripts_module = "Module:scripts" | local scripts_module = "Module:scripts" | ||
local scripts_data_module = "Module:scripts/data" | local scripts_data_module = "Module:scripts/data" | ||
local string_encode_entities_module = "Module:string/encode entities" | local string_encode_entities_module = "Module:string/encode entities" | ||
local string_utilities_module = "Module:string utilities" | local string_utilities_module = "Module:string utilities" | ||
local table_module = "Module:table" | local table_module = "Module:table" | ||
| Line 188: | Line 160: | ||
local Hant_chars | local Hant_chars | ||
local function check_object(...) | --[==[ | ||
Loaders for functions in other modules, which overwrite themselves with the target function when called. This ensures modules are only loaded when needed, retains the speed/convenience of locally-declared pre-loaded functions, and has no overhead after the first call, since the target functions are called directly in any subsequent calls.]==] | |||
local function check_object(...) | |||
end | check_object = require(utilities_module).check_object | ||
return check_object(...) | |||
end | |||
local function decode_entities(...) | local function decode_entities(...) | ||
decode_entities = require(string_utilities_module).decode_entities | |||
return decode_entities(...) | |||
end | end | ||
local function decode_uri(...) | local function decode_uri(...) | ||
decode_uri = require(string_utilities_module).decode_uri | |||
return decode_uri(...) | |||
end | end | ||
local function deep_copy(...) | local function deep_copy(...) | ||
deep_copy = require(table_module).deepCopy | |||
return deep_copy(...) | |||
end | end | ||
local function encode_entities(...) | local function encode_entities(...) | ||
encode_entities = require(string_encode_entities_module) | |||
return encode_entities(...) | |||
end | end | ||
local function | local function get_script(...) | ||
get_script = require(scripts_module).getByCode | |||
return get_script(...) | |||
end | end | ||
local function | local function find_best_script_without_lang(...) | ||
find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang | |||
return find_best_script_without_lang(...) | |||
end | end | ||
local function | local function get_family(...) | ||
get_family = require(families_module).getByCode | |||
return get_family(...) | |||
end | end | ||
local function | local function get_plaintext(...) | ||
get_plaintext = require(utilities_module).get_plaintext | |||
return get_plaintext(...) | |||
end | end | ||
local function | local function get_wikimedia_lang(...) | ||
get_wikimedia_lang = require(wikimedia_languages_module).getByCode | |||
return get_wikimedia_lang(...) | |||
end | end | ||
local function | local function keys_to_list(...) | ||
keys_to_list = require(table_module).keysToList | |||
return keys_to_list(...) | |||
end | end | ||
local function | local function list_to_set(...) | ||
list_to_set = require(table_module).listToSet | |||
return list_to_set(...) | |||
end | end | ||
local function | local function load_data(...) | ||
load_data = require(load_module).load_data | |||
return load_data(...) | |||
end | end | ||
local function | local function make_family_object(...) | ||
make_family_object = require(families_module).makeObject | |||
return make_family_object(...) | |||
end | end | ||
local function | local function pattern_escape(...) | ||
pattern_escape = require(patterns_module).pattern_escape | |||
return pattern_escape(...) | |||
end | end | ||
local function | local function remove_duplicates(...) | ||
remove_duplicates = require(table_module).removeDuplicates | |||
return remove_duplicates(...) | |||
end | end | ||
local function replacement_escape(...) | local function replacement_escape(...) | ||
replacement_escape = require(patterns_module).replacement_escape | |||
return replacement_escape(...) | |||
end | end | ||
local function safe_require(...) | local function safe_require(...) | ||
safe_require = require(load_module).safe_require | |||
return safe_require(...) | |||
end | end | ||
local function shallow_copy(...) | local function shallow_copy(...) | ||
shallow_copy = require(table_module).shallowCopy | |||
return shallow_copy(...) | |||
end | end | ||
local function split(...) | local function split(...) | ||
split = require(string_utilities_module).split | |||
return split(...) | |||
end | end | ||
local function to_json(...) | local function to_json(...) | ||
to_json = require(json_module).toJSON | |||
return to_json(...) | |||
end | end | ||
local function u(...) | local function u(...) | ||
u = require(string_utilities_module).char | |||
return u(...) | |||
end | end | ||
local function ugsub(...) | local function ugsub(...) | ||
ugsub = require(string_utilities_module).gsub | |||
return ugsub(...) | |||
end | end | ||
local function ulen(...) | local function ulen(...) | ||
ulen = require(string_utilities_module).len | |||
return ulen(...) | |||
end | end | ||
local function ulower(...) | local function ulower(...) | ||
ulower = require(string_utilities_module).lower | |||
return ulower(...) | |||
end | end | ||
local function umatch(...) | local function umatch(...) | ||
umatch = require(string_utilities_module).match | |||
return umatch(...) | |||
end | end | ||
local function uupper(...) | local function uupper(...) | ||
uupper = require(string_utilities_module).upper | |||
return uupper(...) | |||
end | end | ||
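The self-overwriting loader pattern used above can be tried outside MediaWiki with a stand-in for `require`. In this sketch, `fake_require`, `load_count` and `to_upper` are hypothetical names introduced only for the demonstration; the counter shows that the "module" is loaded once, however many times the function is called.

```lua
-- Stand-in for an expensive require(); counts how often it runs.
local load_count = 0
local function fake_require()
	load_count = load_count + 1
	return { upper = string.upper }
end

-- The loader replaces itself with the target function on first use,
-- so later calls go straight to string.upper.
local function to_upper(...)
	to_upper = fake_require().upper
	return to_upper(...)
end

local a = to_upper("abc")
local b = to_upper("def")
print(a .. " " .. b .. " " .. load_count) --> ABC DEF 1
```

This works because `local function f` declares `f` before the body is compiled, so the assignment inside the body rebinds the same local that callers use.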
local function normalize_code(code) | local function normalize_code(code) | ||
| Line 381: | Line 355: | ||
end | end | ||
-- Pre-substitution, of "[[" and "]]", which makes pattern matching more accurate. | -- Pre-substitution, of "[[" and "]]", which makes pattern matching more accurate. | ||
text = gsub(text, "%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2") | text = gsub(text, "%f[%[]%[%[", "\1") | ||
:gsub("%f[%]]%]%]", "\2") | |||
local i = #subbedChars | local i = #subbedChars | ||
for _, pattern in ipairs(patterns) do | for _, pattern in ipairs(patterns) do | ||
| Line 405: | Line 380: | ||
end) | end) | ||
end | end | ||
text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]") | text = gsub(text, "\1", "%[%[") | ||
:gsub("\2", "%]%]") | |||
return text, subbedChars | return text, subbedChars | ||
end | end | ||
| Line 415: | Line 391: | ||
local byte3 = floor(i / 64) % 64 + 128 | local byte3 = floor(i / 64) % 64 + 128 | ||
local byte4 = i % 64 + 128 | local byte4 = i % 64 + 128 | ||
text = gsub(text, "\244[" .. char(byte2) .. char(byte2+8) .. "]" .. char(byte3) .. char(byte4), | text = gsub(text, "\244[" .. char(byte2) .. char(byte2+8) .. "]" .. char(byte3) .. char(byte4), replacement_escape(subbedChars[i])) | ||
end | end | ||
text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]") | text = gsub(text, "\1", "%[%[") | ||
:gsub("\2", "%]%]") | |||
return text | return text | ||
end | end | ||
| Line 445: | Line 421: | ||
end | end | ||
local function doSubstitutions(self, text, sc, substitution_data, function_name, recursed) | |||
local fail, cats = nil, {} | |||
local function doSubstitutions(self, text, sc, substitution_data | |||
-- If there are language-specific substitutes given in the data module, use those. | -- If there are language-specific substitutes given in the data module, use those. | ||
if type(substitution_data) == "table" then | if type(substitution_data) == "table" then | ||
-- If a script is specified, run this function with the script-specific data before continuing. | -- If a script is specified, run this function with the script-specific data before continuing. | ||
local sc_code = sc:getCode() | local sc_code = sc:getCode() | ||
if substitution_data[sc_code] then | |||
if substitution_data[sc_code] | text, fail, cats = doSubstitutions(self, text, sc, substitution_data[sc_code], function_name, true) | ||
-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one separately. | |||
elseif sc_code:match("^Han") and substitution_data.Hani then | |||
text, fail, cats = doSubstitutions(self, text, sc, substitution_data.Hani, function_name, true) | |||
-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one | |||
elseif sc_code:match("^Han") and substitution_data.Hani | |||
-- Substitution data with key 1 in the outer table may be given as a fallback. | -- Substitution data with key 1 in the outer table may be given as a fallback. | ||
elseif substitution_data[1] | elseif substitution_data[1] then | ||
text, fail, cats = doSubstitutions(self, text, sc, substitution_data[1], function_name, true) | |||
end | end | ||
-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with | -- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with the NFD decomposed forms, as this simplifies many substitutions. | ||
if substitution_data.from then | if substitution_data.from then | ||
for i, from in ipairs(substitution_data.from) do | for i, from in ipairs(substitution_data.from) do | ||
-- Normalize each loop, to ensure multi-stage substitutions work correctly. | -- Normalize each loop, to ensure multi-stage substitutions work correctly. | ||
| Line 493: | Line 446: | ||
if substitution_data.remove_diacritics then | if substitution_data.remove_diacritics then | ||
text = sc:toFixedNFD(text) | text = sc:toFixedNFD(text) | ||
-- Convert exceptions to PUA. | -- Convert exceptions to PUA. | ||
| Line 516: | Line 468: | ||
text = text:gsub("\242[\128-\191]*", substitutes) | text = text:gsub("\242[\128-\191]*", substitutes) | ||
end | end | ||
end | end | ||
elseif type(substitution_data) == "string" then | elseif type(substitution_data) == "string" then | ||
| Line 529: | Line 476: | ||
-- TODO: translit functions should be called with form NFD. | -- TODO: translit functions should be called with form NFD. | ||
if function_name == "tr" then | if function_name == "tr" then | ||
text, fail, cats = module[function_name](text, self._code, sc:getCode()) | |||
text = module[function_name](text, self._code, sc:getCode()) | |||
else | else | ||
text, fail, cats = module[function_name](sc:toFixedNFD(text), self, sc) | |||
end | end | ||
else | else | ||
error("Substitution data '" .. substitution_data .. "' does not match an existing module.") | error("Substitution data '" .. substitution_data .. "' does not match an existing module.") | ||
end | end | ||
end | end | ||
-- Don't normalize to NFC if this is the inner loop or if a module returned nil. | -- Don't normalize to NFC if this is the inner loop or if a module returned nil. | ||
if recursed or not text then | if recursed or not text then | ||
return text, | return text, fail, cats | ||
end | end | ||
-- Fix any discouraged sequences created during the substitution process, and normalize into the final form. | -- Fix any discouraged sequences created during the substitution process, and normalize into the final form. | ||
return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), | return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), fail, cats | ||
end | end | ||
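The from/to loop described in the comments above can be sketched in plain Lua as follows. The data table and the name `apply_substitutions` are hypothetical, and treating a missing `to` entry as an empty replacement is an assumption made for the sketch. The NFD normalization, script-specific lookup and PUA handling of the real function are left out.

```lua
-- Hypothetical substitution data: each pattern in `from` is replaced by the
-- string at the same index in `to`.
local substitution_data = {
	from = { "ae", "[%-']" },
	to = { "e" },
}

local function apply_substitutions(text, data)
	for i, from in ipairs(data.from) do
		-- Assumption: a missing `to` entry means the match is deleted.
		text = text:gsub(from, data.to[i] or "")
	end
	return text
end

print(apply_substitutions("prae-fix", substitution_data)) --> prefix
```

Because the substitutions run in order, a later pattern sees the output of the earlier ones, which is what makes multi-stage substitutions possible.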
-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate | -- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate over each one to apply substitutions. This avoids putting PUA characters through language-specific modules, which may be unequipped for them. | ||
local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, function_name) | |||
local fail, cats, sections = nil, {} | |||
local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, | |||
local sections | |||
-- See [[Module:languages/data]]. | -- See [[Module:languages/data]]. | ||
if not find(text, "\244") or load_data(languages_data_module).substitution[self._code] == "cont" then | if not find(text, "\244") or (load_data(languages_data_module).substitution[self._code] == "cont") then | ||
sections = {text} | sections = {text} | ||
else | else | ||
sections = split(text, "\244[\128-\143][\128-\191]*", true) | sections = split(text, "\244[\128-\143][\128-\191]*", true) | ||
end | end | ||
for _, section in ipairs(sections) do | for _, section in ipairs(sections) do | ||
-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated | -- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated modules). | ||
if gsub(section, "%s+", "") ~= "" then | if gsub(section, "%s+", "") ~= "" then | ||
local sub, | local sub, sub_fail, sub_cats = doSubstitutions(self, section, sc, substitution_data, function_name) | ||
-- Second round of temporary substitutions, in case any formatting was added by the main substitution process. However, don't do this if the section contains formatting already (as it would have had to have been escaped to reach this stage, and therefore should be given as raw text). | |||
-- Second round of temporary substitutions, in case any formatting was added by the main substitution | |||
if sub and subbedChars then | if sub and subbedChars then | ||
local noSub | local noSub | ||
| Line 626: | Line 518: | ||
end | end | ||
end | end | ||
if not sub then | if (not sub) or sub_fail then | ||
text = sub | text = sub | ||
fail = sub_fail | |||
cats = sub_cats or {} | |||
break | break | ||
end | end | ||
text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text | text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text | ||
if type(sub_cats) == "table" then | |||
for _, cat in ipairs(sub_cats) do | |||
insert(cats, cat) | |||
end | |||
end | |||
end | end | ||
end | end | ||
-- Trim, unless there are only spacing characters, while ignoring any final formatting characters. | |||
text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2") | |||
:gsub("(%S)%s+([\128-\191\244]*)$", "%1%2") | |||
-- Remove duplicate categories. | |||
if #cats > 1 then | |||
cats = remove_duplicates(cats) | |||
end | end | ||
return text, subbedChars | return text, fail, cats, subbedChars | ||
end | end | ||
| Line 650: | Line 551: | ||
text, rep = gsub(text, "\\\\(\\*^)", "\3%1") | text, rep = gsub(text, "\\\\(\\*^)", "\3%1") | ||
until rep == 0 | until rep == 0 | ||
return | return text:gsub("\\^", "\4") | ||
:gsub(pattern or "%^", repl or "") | :gsub(pattern or "%^", repl or "") | ||
:gsub("\3", "\\") | :gsub("\3", "\\") | ||
:gsub("\4", "^" | :gsub("\4", "^") | ||
end | end | ||
| Line 807: | Line 708: | ||
Language.hasType = require(language_like_module).hasType | Language.hasType = require(language_like_module).hasType | ||
return self:hasType(...) | return self:hasType(...) | ||
end | |||
function Language:getMainCategoryName() | |||
return self._data.main_category or "lemma" | |||
end | end | ||
| Line 865: | Line 770: | ||
function Language:makeWikipediaLink() | function Language:makeWikipediaLink() | ||
return make_link(self, (self:hasType("conlang") and self:getCanonicalName() or "w:" .. self:getWikipediaArticle()), self:getCanonicalName()) | return make_link(self, (self:hasType("conlang") and self:getCanonicalName() or "w:" .. self:getWikipediaArticle()), self:getCanonicalName()) | ||
end | end | ||
| Line 980: | Line 881: | ||
local t, s, found = 0, 0 | local t, s, found = 0, 0 | ||
-- This is faster than using mw.ustring.gmatch directly. | -- This is faster than using mw.ustring.gmatch directly. | ||
for ch in gmatch | for ch in gmatch(ugsub(text, "[" .. Hani.characters .. "]", "\255%0"), "\255(.[\128-\191]*)") do | ||
found = true | found = true | ||
if Hant_chars[ch] then | if Hant_chars[ch] then | ||
| Line 1,009: | Line 910: | ||
-- Count characters by removing everything in the script's charset and comparing to the original length. | -- Count characters by removing everything in the script's charset and comparing to the original length. | ||
local charset = sc.characters | local charset = sc.characters | ||
local count = charset and length - ulen | local count = charset and length - ulen(ugsub(text, "[" .. charset .. "]+", "")) or 0 | ||
if count >= length then | if count >= length then | ||
| Line 1,279: | Line 1,180: | ||
local ancestorsParents = {} | local ancestorsParents = {} | ||
for _, ancestor in ipairs(ancestors) do | for _, ancestor in ipairs(ancestors) do | ||
local ret = func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check) | |||
local ret = | |||
if ret then | if ret then | ||
return ret | return ret | ||
| Line 1,550: | Line 1,449: | ||
function Language:getStandardCharacters(sc) | function Language:getStandardCharacters(sc) | ||
local standard_chars = self._data. | local standard_chars = self._data.standardChars | ||
if type(standard_chars) ~= "table" then | if type(standard_chars) ~= "table" then | ||
return standard_chars | return standard_chars | ||
| Line 1,569: | Line 1,468: | ||
end | end | ||
--[==[ | --[==[Make the entry name (i.e. the correct page name).]==] | ||
function Language:makeEntryName(text, sc) | |||
]==] | |||
function Language: | |||
if (not text) or text == "" then | if (not text) or text == "" then | ||
return text | return text, nil, {} | ||
end | end | ||
-- Set `unsupported` as true if certain conditions are met. | |||
local unsupported | |||
-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for relative paths in links, as are 3 or more consecutive tildes. | |||
-- Note: match is faster with magic characters/charsets; find is faster with plaintext. | |||
-- Set `unsupported` as true if certain conditions are met. | |||
local unsupported | |||
-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed | |||
if ( | if ( | ||
match( | match(text, "[#<>%[%]_{|}]") or | ||
find( | find(text, "\239\191\189") or | ||
match( | match(text, "%f[^%z/]%.%.?%f[%z/]") or | ||
find( | find(text, "~~~") | ||
) then | ) then | ||
unsupported = true | unsupported = true | ||
-- If it looks like an interwiki link. | -- If it looks like an interwiki link. | ||
elseif find( | elseif find(text, ":") then | ||
local prefix = gsub( | local prefix = gsub(text, "^:*(.-):.*", ulower) | ||
if ( | if ( | ||
load_data("Module:data/namespaces")[prefix] or | load_data("Module:data/namespaces")[prefix] or | ||
| Line 1,656: | Line 1,496: | ||
end | end | ||
-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of | -- Check if the text is a listed unsupported title. | ||
local unsupportedTitles = load_data(links_data_module).unsupported_titles | |||
if unsupportedTitles[text] then | |||
return "Unsupported titles/" .. unsupportedTitles[text], nil, {} | |||
end | |||
sc = checkScript(text, self, sc) | |||
local fail, cats | |||
text = normalize(text, sc) | |||
text, fail, cats = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.entry_name, "makeEntryName") | |||
text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟?!︖︕।॥။၊་།]?$") or text | |||
-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded from the earlier test). | |||
if unsupported then | if unsupported then | ||
local unsupported_characters = load_data(links_data_module).unsupported_characters | local unsupported_characters = load_data(links_data_module).unsupported_characters | ||
text = text:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters) | |||
:gsub("%f[^%z/]%.%.?%f[%z/]", function(m) | :gsub("%f[^%z/]%.%.?%f[%z/]", function(m) | ||
return | return gsub(m, "%.", "`period`") | ||
end) | end) | ||
:gsub("~~~+", function(m) | :gsub("~~~+", function(m) | ||
return | return gsub(m, "~", "`tilde`") | ||
end) | end) | ||
text = "Unsupported titles/" .. text | |||
end | |||
end | |||
return text, fail, cats | |||
return | |||
end | end | ||
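The escaping step at the end of `makeEntryName` can be illustrated in isolation. This is a minimal sketch, not the module's implementation: the `unsupported_characters` table below is a hypothetical subset (the real table is loaded from the links data module), and `escape_title` is a name invented for this example.

```lua
-- Hypothetical subset of the escape table; the real one is loaded from the links data module.
local unsupported_characters = {
	["#"] = "`num`",
	["<"] = "`lt`",
	[">"] = "`gt`",
	["|"] = "`vert`",
}

-- Hypothetical helper showing the shape of the escaping step.
local function escape_title(text)
	-- With a table as the replacement, gsub looks up each match as a key.
	local escaped = text:gsub("[#<>|]", unsupported_characters)
	-- Runs of three or more tildes are escaped one tilde at a time.
	escaped = escaped:gsub("~~~+", function(m)
		return (m:gsub("~", "`tilde`"))
	end)
	return "Unsupported titles/" .. escaped
end

print(escape_title("C#"))
```

Two tildes in a row are left alone, since only runs of three or more are treated as unsupported.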
--[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==] | --[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==] | ||
| Line 1,725: | Line 1,537: | ||
end | end | ||
--[==[Creates a sort key for the given | --[==[Creates a sort key for the given entry name, following the rules appropriate for the language. This removes diacritical marks from the entry name if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and any parentheses are removed as well. | ||
diacritical marks from the | The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the entry name and returns a sortkey.]==] | ||
changes. Any initial hyphen is also removed, and anything | |||
The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the | |||
function Language:makeSortKey(text, sc) | function Language:makeSortKey(text, sc) | ||
if (not text) or text == "" then | if (not text) or text == "" then | ||
return text | return text, nil, {} | ||
end | end | ||
-- Remove directional characters | -- Remove directional characters, soft hyphens, strip markers and HTML tags. | ||
text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "") | text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "") | ||
text = gsub(unstrip(text), "<[^<>]+>", "") | text = gsub(unstrip(text), "<[^<>]+>", "") | ||
| Line 1,756: | Line 1,564: | ||
text = sc:toFixedNFD(text) | text = sc:toFixedNFD(text) | ||
end | end | ||
-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is | -- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is necessary so as to prevent "i" and "ı" both being sorted as "I". | ||
-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive to changes in capitalization (as it changes the target page). | |||
local fail, cats | |||
-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive | |||
if not sc:sortByScraping() then | if not sc:sortByScraping() then | ||
text = ulower(text) | text = ulower(text) | ||
end | end | ||
local | local sort_key = self._data.sort_key | ||
text, fail, cats = iterateSectionSubstitutions(self, text, sc, nil, nil, sort_key, "makeSortKey") | |||
text, | |||
if not sc:sortByScraping() then | if not sc:sortByScraping() then | ||
if self:hasDottedDotlessI() and not | if self:hasDottedDotlessI() and not sort_key then | ||
text = | text = gsub(gsub(text, "ı", "I"), "i", "İ") | ||
text = sc:toFixedNFC(text) | text = sc:toFixedNFC(text) | ||
end | end | ||
| Line 1,782: | Line 1,583: | ||
-- Remove parentheses, as long as they are either preceded or followed by something. | -- Remove parentheses, as long as they are either preceded or followed by something. | ||
text = gsub(text, "(.)[()]+", "%1"):gsub("[()]+(.)", "%1") | text = gsub(text, "(.)[()]+", "%1") | ||
:gsub("[()]+(.)", "%1") | |||
text = escape_risky_characters(text) | text = escape_risky_characters(text) | ||
return text | return text, fail, cats | ||
end | end | ||
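The parenthesis handling near the end of `makeSortKey` removes the parenthesis characters themselves, and only when something precedes or follows them. A minimal standalone sketch of those two substitutions, assuming plain `string.gsub` and a hypothetical helper name `strip_parens`:

```lua
-- Hypothetical helper reproducing the two parenthesis substitutions.
local function strip_parens(text)
	-- Drop parentheses that follow some character, keeping that character.
	text = text:gsub("(.)[()]+", "%1")
	-- Drop parentheses that precede some character, keeping that character.
	text = text:gsub("[()]+(.)", "%1")
	return text
end

print(strip_parens("colo(u)r"))
```

A string consisting of a single parenthesis is returned unchanged, because neither pattern can match without an adjacent character.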
--[==[Create the form used as a basis for display text and transliteration | --[==[Create the form used as a basis for display text and transliteration.]==] | ||
local function processDisplayText(text, self, sc, keepCarets, keepPrefixes) | local function processDisplayText(text, self, sc, keepCarets, keepPrefixes) | ||
local subbedChars = {} | local subbedChars = {} | ||
| Line 1,797: | Line 1,599: | ||
sc = checkScript(text, self, sc) | sc = checkScript(text, self, sc) | ||
local fail, cats | |||
text = normalize(text, sc) | text = normalize(text, sc) | ||
text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text | text, fail, cats, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text, "makeDisplayText") | ||
text = removeCarets(text, sc) | text = removeCarets(text, sc) | ||
| Line 1,812: | Line 1,614: | ||
while true do | while true do | ||
local prefix = gsub(text, "^(.-):.+", function(m1) | local prefix = gsub(text, "^(.-):.+", function(m1) | ||
return | return gsub(m1, "\244[\128-\191]*", "") | ||
end) | end) | ||
-- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace. | -- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace. | ||
| Line 1,827: | Line 1,629: | ||
end) | end) | ||
end | end | ||
text = gsub(text, "\3", "\\"):gsub("\4", ":") | text = gsub(text, "\3", "\\") | ||
:gsub("\4", ":") | |||
end | |||
--[[if not self:hasType("conlang") then | |||
text = gsub(text,"^%*", "") | |||
end | end | ||
text = gsub(text,"^%*%*", "*")]] | |||
return text, subbedChars | return text, fail, cats, subbedChars | ||
end | end | ||
--[==[Make the display text (i.e. what is displayed on the page).]==] | --[==[Make the display text (i.e. what is displayed on the page).]==] | ||
function Language:makeDisplayText(text, sc, keepPrefixes) | function Language:makeDisplayText(text, sc, keepPrefixes) | ||
if not text or text == "" then | if (not text) or text == "" then | ||
return text | return text, nil, {} | ||
end | end | ||
local subbedChars | local fail, cats, subbedChars | ||
text, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes) | text, fail, cats, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes) | ||
text = escape_risky_characters(text) | text = escape_risky_characters(text) | ||
return undoTempSubstitutions(text, subbedChars) | return undoTempSubstitutions(text, subbedChars), fail, cats | ||
end | end | ||
--[==[Transliterates the text from the given script into the Latin script (see | --[==[Transliterates the text from the given script into the Latin script (see [[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to work; if it is not present, {{code|lua|nil}} is returned. | ||
[[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to | Returns three values: | ||
work; if it is not present, {{code|lua|nil}} is returned. | # The transliteration. | ||
# A boolean which indicates whether the transliteration failed for an unexpected reason. If {{code|lua|false}}, then the transliteration either succeeded, or the module is returning nothing in a controlled way (e.g. the input was {{code|lua|"-"}}). Generally, this means that no maintenance action is required. If {{code|lua|true}}, then the transliteration is {{code|lua|nil}} because either the input or output was defective in some way (e.g. [[Module:ar-translit]] will not transliterate non-vocalised inputs, and this module will fail partially-completed transliterations in all languages). Note that this value can be manually set by the transliteration module, so make sure to cross-check to ensure it is accurate. | |||
The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that | # A table of categories selected by the transliteration module, which should be in the format expected by {{code|lua|format_categories}} in [[Module:utilities]]. | ||
module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the | The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the possible scripts that the module can transliterate, and will show an error if it's not one of them. For this reason, the <code>sc</code> parameter should always be provided when writing non-language-specific code. | ||
possible scripts that the module can transliterate, and will | The <code>module_override</code> parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by [[Wiktionary:Tracking/languages/module_override]]. | ||
the <code>sc</code> parameter should always be provided when writing non-language-specific code. | |||
The <code>module_override</code> parameter is used to override the default module that is used to provide the | |||
transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no | |||
default module yet, or you want to demonstrate an alternative version of a transliteration module before making it | |||
official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked | |||
by [[ | |||
'''Known bugs''': | '''Known bugs''': | ||
* This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assumption does not hold, wikitext markup such as <nowiki>'''</nowiki> can cause incorrect transliterations. | * This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assumption does not hold, wikitext markup such as <nowiki>'''</nowiki> can cause incorrect transliterations. | ||
* HTML entities such as <code>&apos;</code>, often used to escape wikitext markup, do not work. | * HTML entities such as <code>&apos;</code>, often used to escape wikitext markup, do not work.]==] | ||
]==] | |||
function Language:transliterate(text, sc, module_override) | function Language:transliterate(text, sc, module_override) | ||
-- If there is no text, or the language doesn't have transliteration data and there's no override, return nil. | -- If there is no text, or the language doesn't have transliteration data and there's no override, return nil. | ||
if not text or text == "" or text == "-" then | if not (self._data.translit or module_override) then | ||
return text | return nil, false, {} | ||
elseif (not text) or text == "" or text == "-" then | |||
return text, false, {} | |||
end | end | ||
-- If the script is not transliteratable (and no override is given), return nil. | -- If the script is not transliteratable (and no override is given), return nil. | ||
sc = checkScript(text, self, sc) | sc = checkScript(text, self, sc) | ||
if not (sc:isTransliterated() or module_override) then | if not (sc:isTransliterated() or module_override) then | ||
return nil | return nil, true, {} | ||
end | end | ||
| Line 1,882: | Line 1,686: | ||
-- Get the display text with the keepCarets flag set. | -- Get the display text with the keepCarets flag set. | ||
local subbedChars | local fail, cats, subbedChars | ||
if processed then | if processed then | ||
text, subbedChars = processDisplayText(text, self, sc, true) | text, fail, cats, subbedChars = processDisplayText(text, self, sc, true) | ||
end | |||
-- Transliterate (using the module override if applicable). | |||
text, fail, cats, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or self._data.translit, "tr") | |||
if not text then | |||
return nil, true, cats | |||
end | end | ||
-- Incomplete transliterations return nil. | |||
local charset = sc.characters | |||
if charset and umatch(text, "[" .. charset .. "]") then | |||
-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are false positives), as well as any PUA substitutions. Anything remaining should only be script code "None" (e.g. numerals). | |||
local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "-]+", "") | |||
-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be returned. | |||
if find_best_script_without_lang(check_text, true):getCode() ~= "None" then | |||
return nil, true, cats | |||
-- Incomplete transliterations return nil. | end | ||
local charset = sc.characters | end | ||
if charset and umatch(text, "[" .. charset .. "]") then | |||
-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are | if processed then | ||
text = escape_risky_characters(text) | |||
text = undoTempSubstitutions(text, subbedChars) | |||
local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "-]+", "") | end | ||
-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be | |||
-- If the script does not use capitalization, then capitalize any letters of the transliteration which are immediately preceded by a caret (and remove the caret). | |||
if find_best_script_without_lang(check_text, true):getCode() ~= "None" then | if text and not sc:hasCapitalization() and text:find("^", 1, true) then | ||
return nil | text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2) | ||
end | return m1 .. uupper(m2) | ||
end | end) | ||
if processed then | |||
text = escape_risky_characters(text) | |||
text = undoTempSubstitutions(text, subbedChars) | |||
end | end | ||
fail = text == nil and (not not fail) or false | |||
return text | return text, fail, cats | ||
end | end | ||
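In the new revision, `transliterate` returns three values (the text, a failure flag, and a category table), so callers have to tell a controlled "nothing to return" apart from a genuine failure. The sketch below shows one way a caller might do that. `stub_lang` is a stand-in written for this example, not a real language object, and `describe` and the category name are likewise hypothetical.

```lua
-- Stand-in for a language object; a real one would come from the languages module.
local stub_lang = {has_translit = true}

function stub_lang:transliterate(text, sc)
	if not self.has_translit then
		-- No transliteration data: nil, but not a failure.
		return nil, false, {}
	elseif text == "-" then
		-- Controlled pass-through of the "-" marker.
		return text, false, {}
	end
	-- Anything else is treated as a defective input in this stub.
	return nil, true, {"Hypothetical maintenance category"}
end

-- Hypothetical caller that inspects all three return values.
local function describe(lang, text, sc)
	local tr, fail, cats = lang:transliterate(text, sc)
	if fail then
		return "failed (" .. #cats .. " categories)"
	elseif tr == nil then
		return "no transliteration available"
	end
	return tr
end

print(describe(stub_lang, "-", nil))
```

Checking `fail` before checking for `nil` matters here, because both the failure case and the "no transliteration data" case return `nil` as the first value.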
| Line 1,961: | Line 1,762: | ||
function Language:toJSON(opts) | function Language:toJSON(opts) | ||
local | local entry_name, entry_name_patterns, entry_name_remove_diacritics = self._data.entry_name | ||
if | if entry_name then | ||
if | if entry_name.from then | ||
entry_name_patterns = {} | |||
for i, from in ipairs( | for i, from in ipairs(entry_name.from) do | ||
insert( | insert(entry_name_patterns, {from = from, to = entry_name.to[i] or ""}) | ||
end | end | ||
end | end | ||
entry_name_remove_diacritics = entry_name.remove_diacritics | |||
end | end | ||
-- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object(). | -- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object(). | ||
local ret = { | |||
local | ancestors = self:getAncestorCodes(), | ||
ancestors = | canonicalName = self:getCanonicalName(), | ||
canonicalName = | categoryName = self:getCategoryName("nocap"), | ||
categoryName = | |||
code = self._code, | code = self._code, | ||
mainCode = self._mainCode, | mainCode = self._mainCode, | ||
parent = | parent = self:getParentCode(), | ||
full = | full = self:getFullCode(), | ||
entryNamePatterns = entry_name_patterns, | |||
entryNameRemoveDiacritics = entry_name_remove_diacritics, | |||
family = | family = self:getFamilyCode(), | ||
aliases = | aliases = self:getAliases(), | ||
varieties = | varieties = self:getVarieties(), | ||
otherNames = | otherNames = self:getOtherNames(), | ||
scripts = | scripts = self:getScriptCodes(), | ||
type = | type = keys_to_list(self:getTypes()), | ||
wikimediaLanguages = | wikimediaLanguages = self:getWikimediaLanguageCodes(), | ||
wikidataItem = | wikidataItem = self:getWikidataItem(), | ||
wikipediaArticle = | wikipediaArticle = self:getWikipediaArticle(true), | ||
} | } | ||
-- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`. | -- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`. | ||
return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts) | return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts) | ||
| Line 2,134: | Line 1,923: | ||
--[==[ | --[==[ | ||
<span style="color: | <span style="color: #BA0000">This function is not for use in entries or other content pages.</span> | ||
Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects. | Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects. | ||
-- Do NOT use these methods! | -- Do NOT use these methods! | ||