Module:languages: Difference between revisions

(16 intermediate revisions by 2 users not shown)
--[==[ intro:
This module implements fetching of language-specific information and processing text in a given language.

===Types of languages===

There are two types of languages: full languages and etymology-only languages. The essential difference is that only
their parent (in the parent-child inheritance sense), and for etymology-only languages with another etymology-only
language as their parent, a full language can always be derived by following the parent links upwards. For example,
"Canadian French", code `fr-CA`, is an etymology-only language whose parent is the full language "French", code `fr`.
An example of an etymology-only language with another etymology-only parent is "Northumbrian Old English", code
`ang-nor`, which has "Anglian Old English", code `ang-ang`, as its parent; this is an etymology-only language whose
parent is "Old English", code `ang`, which is a full language. (This is because Northumbrian Old English is considered
a variety of Anglian Old English.) Sometimes the parent is the "Undetermined" language, code `und`; this is the case,
for example, for "substrate" languages such as "Pre-Greek", code `qsb-grc`, and "the BMAC substrate", code `qsb-bma`.
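
The parent-chain lookup described above can be sketched as follows. The parent table here is a toy stand-in: the real data lives in [[Module:languages/data]] and [[Module:etymology languages/data]] and has a richer structure.

```lua
-- Illustrative sketch only; not the actual data-module format.
local etym_parents = {
	["fr-CA"] = "fr",        -- Canadian French -> French (full)
	["ang-nor"] = "ang-ang", -- Northumbrian Old English -> Anglian Old English
	["ang-ang"] = "ang",     -- Anglian Old English -> Old English (full)
}

-- Follow parent links upward until we reach a code with no etymology-only
-- parent, i.e. a full language.
local function get_full_code(code)
	while etym_parents[code] do
		code = etym_parents[code]
	end
	return code
end
```

For example, `get_full_code("ang-nor")` follows two parent links and ends at the full language `ang`.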

It is important to distinguish language ''parents'' from language ''ancestors''. The parent-child relationship is one
of containment, i.e. if X is a child of Y, X is considered a variety of Y. On the other hand, the ancestor-descendant
relationship is one of descent in time. For example, "Classical Latin", code `la-cla`, and "Late Latin", code `la-lat`,
are both etymology-only languages with "Latin", code `la`, as their parent, because both of the former are varieties
of Latin. However, Late Latin does *NOT* have Classical Latin as its parent, because Late Latin is *not* a variety of
Classical Latin; rather, it is a descendant. There is in fact a separate `ancestors` field that is used to express the
ancestor-descendant relationship, and Late Latin's ancestor is given as Classical Latin. It is also important to note
that sometimes an etymology-only language is actually the conceptual ancestor of its parent language. This happens,
for example, with "Old Italian" (code `roa-oit`), which is an etymology-only variant of the full language "Italian" (code
`it`), and with "Old Latin" (code `itc-ola`), which is an etymology-only variant of Latin. In both cases, the full
language has the etymology-only variant listed as an ancestor. This allows a Latin term to inherit from Old Latin
using the {{tl|inh}} template (where in this template, "inheritance" refers to ancestral inheritance, i.e. inheritance
functions in [[Module:languages]] and [[Module:etymology languages]] to convert a language's canonical name to a
{Language} object (depending on whether the canonical name refers to a full or etymology-only language).

===Textual representations===

Textual strings belonging to a given language come in several different ''text variants'':
# The ''input text'' is what the user supplies in wikitext, in the parameters to {{tl|m}}, {{tl|l}}, {{tl|ux}},
  {{tl|t}}, {{tl|lang}} and the like.
# The ''corrected input text'' is the input text with some corrections and/or normalizations applied, such as
  bad-character replacements for certain languages, like replacing `l` or `1` with [[palochka]] in some languages written
  in Cyrillic. (FIXME: This currently goes under the name ''display text'' but that will be repurposed below. Also,
  [[User:Surjection]] suggests renaming this to ''normalized input text'', but "normalized" is used in a different sense
  in [[Module:usex]].)
# The ''display text'' is the text in the form as it will be displayed to the user. This is what appears in headwords,
  in usexes, in displayed internal links, etc. This can include accent marks that are removed to form the stripped
  display text (see below), as well as embedded bracketed links that are variously processed further. The display text
  is generated from the corrected input text by applying language-specific transformations; for most languages, there
  will be no such transformations. The general reason for having a difference between input and display text is to allow
  for extra information in the input text that is not displayed to the user but is sent to the transliteration module.
  Note that having different display and input text is currently only supported through special-casing but will be
  generalized. Examples of transformations are: (1) Removing the {{cd|^}} that is used in certain East Asian (and
  possibly other unicameral) languages to indicate capitalization of the transliteration (which is currently
  special-cased); (2) for Korean, removing or otherwise processing hyphens (which is currently special-cased); (3) for
  Arabic, removing a ''sukūn'' diacritic placed over a ''tāʔ marbūṭa'' (like this: ةْ) to indicate that the
  ''tāʔ marbūṭa'' is pronounced and transliterated as /t/ instead of being silent [NOTE, NOT IMPLEMENTED YET]; (4) for
  Thai and Khmer, converting space-separated words to bracketed words and resolving respelling substitutions such as
  `[กรีน/กฺรีน]`, which indicate how to transliterate given words [NOTE, NOT IMPLEMENTED YET except in language-specific
  templates like {{tl|th-usex}}].
## The ''right-resolved display text'' is the result of removing brackets around one-part embedded links and resolving
  two-part embedded links into their right-hand components (i.e. converting two-part links into the displayed form).
  The process of right-resolution is what happens when you call {{cd|remove_links()}} in [[Module:links]] on some text.
  When applied to the display text, it produces exactly what the user sees, without any link markup.
# The ''stripped display text'' is the result of applying diacritic-stripping to the display text.
## The ''left-resolved stripped display text'' [NEED BETTER NAME] is the result of applying left-resolution to the
  stripped display text, i.e. similar to right-resolution but resolving two-part embedded links into their left-hand
  components (i.e. the linked-to page). If the display text refers to a single page, the result of applying
  diacritic stripping and left-resolution produces the ''logical pagename''.
# The ''physical pagename text'' is the result of converting the stripped display text into physical page links. If the
  stripped display text contains embedded links, the left side of those links is converted into physical page links;
  otherwise, the entire text is considered a pagename and converted in the same fashion. The conversion does three
  things: (1) converts characters not allowed in pagenames into their "unsupported title" representation, e.g.
  {{cd|Unsupported titles/`gt`}} in place of the logical name {{cd|>}}; (2) handles certain special-cased
  unsupported-title logical pagenames, such as {{cd|Unsupported titles/Space}} in place of {{cd|[space]}} and
  {{cd|Unsupported titles/Ancient Greek dish}} in place of a very long Greek name for a gourmet dish as found in
  Aristophanes; (3) converts "mammoth" pagenames such as [[a]] into their appropriate split component, e.g.
  [[a/languages A to L]].
# The ''source translit text'' is the text as supplied to the language-specific {{cd|transliterate()}} method. The form
  of the source translit text may need to be language-specific, e.g. Thai and Khmer will need the corrected input text,
  whereas other languages may need to work off the display text. [FIXME: It's still unclear to me how embedded bracketed
  links are handled in the existing code.] In general, embedded links need to be right-resolved (see above), but when
  this happens is unclear to me [FIXME]. Some languages have a chop-up-and-paste-together scheme that sends parts of the
  text through the transliterate mechanism, and for others (those listed with "cont" in {{cd|substitution}} in
  [[Module:languages/data]]) they receive the full input text, but preprocessed in certain ways. (The wisdom of this is
  still unclear to me.)
# The ''transliterated text'' (or ''transliteration'') is the result of transliterating the source translit text. Unlike
  for all the other text variants except the transcribed text, it is always in the Latin script.
# The ''transcribed text'' (or ''transcription'') is the result of transcribing the source translit text, where
  "transcription" here means a close approximation to the phonetic form of the language in languages (e.g. Akkadian,
  Sumerian, Ancient Egyptian, maybe Tibetan) that have a wide difference between the written letters and spoken form.
  Unlike for all the other text variants other than the transliterated text, it is always in the Latin script.
  Currently, the transcribed text is always supplied manually by the user; there is no such thing as a
  {{cd|transcribe()}} method on language objects.
# The ''sort key'' is the text used in sort keys for determining the placing of pages in categories they belong to. The
  sort key is generated from the pagename or a specified ''sort base'' by lowercasing, doing language-specific
  transformations and then uppercasing the result. If the sort base is supplied and is generated from input text, it
  needs to be converted to display text, have embedded links removed through right-resolution and have
  diacritic-stripping applied.
# There are other text variants that occur in usexes (specifically, there are normalized variants of several of the
  above text variants), but we can skip them for now.
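
The chain from input text through display text to stripped display text can be illustrated with a toy pipeline. The transformations below are stand-ins for the real per-language rules, which live in the data modules:

```lua
-- Toy stand-in for a language-specific input correction (illustrative only).
local function correct_input(text)
	return (text:gsub("1", "l"))
end

-- For most languages the display text equals the corrected input text.
local function make_display(text)
	return text
end

-- Toy diacritic stripping: remove U+0301 COMBINING ACUTE ACCENT (bytes CC 81).
local function strip_display(text)
	return (text:gsub("\204\129", ""))
end

local input = "vi\204\129no" -- "víno" with a combining acute accent
local stripped = strip_display(make_display(correct_input(input)))
```

Here the stripped display text `vino` is what would be used to link to the entry, while the display text keeps the accent.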

The following methods exist on {Language} objects to convert between different text variants:
# {correctInputText} (currently called {makeDisplayText}): This converts input text to corrected input text.
# {stripDiacritics}: This converts to stripped display text. [FIXME: This needs some rethinking. In particular,
  {stripDiacritics} is sometimes called on input text, corrected input text or display text (in various paths inside of
  [[Module:links]], and, in the case of input text, usually from other modules). We need to make sure we don't try to
  convert input text to display text twice, but at the same time we need to support calling it directly on input text
  since so many modules do this. This means we need to add a parameter indicating whether the passed-in text is input,
  corrected input, or display text; if the former two, we call {correctInputText} ourselves.]
# {logicalToPhysical}: This converts logical pagenames to physical pagenames.
# {transliterate}: This appears to convert input text with embedded brackets removed into a transliteration.
  [FIXME: This needs some rethinking. In particular, it calls {processDisplayText} on its input, which won't work
  for Thai and Khmer, so we may need language-specific flags indicating whether to pass the input text directly to the
  language transliterate method. In addition, I'm not sure how embedded links are handled in the existing translit code;
  a lot of callers remove the links themselves before calling {transliterate()}, which I assume is wrong.]
# {makeSortKey}: This converts display text (?) to a sort key. [FIXME: Clarify this.]
]==]
local export = {}

local debug_track_module = "Module:debug/track"
local etymology_languages_data_module = "Module:etymology languages/data"
local families_module = "Module:families"
local headword_page_module = "Module:headword/page"
local json_module = "Module:JSON"
local language_like_module = "Module:language-like"
local languages_data_module = "Module:languages/data"
local languages_data_patterns_module = "Module:languages/data/patterns"
local links_data_module = "Module:links/data"
local load_module = "Module:load"
local scripts_module = "Module:scripts"
local scripts_data_module = "Module:scripts/data"
local string_encode_entities_module = "Module:string/encode entities"
local string_pattern_escape_module = "Module:string/patternEscape"
local string_replacement_escape_module = "Module:string/replacementEscape"
local string_utilities_module = "Module:string utilities"
local table_module = "Module:table"
local insert = table.insert
local ipairs = ipairs
local is_known_language_tag = mw.language.isKnownLanguageTag
local make_object -- Defined below.
local match = string.match
local select = select
local setmetatable = setmetatable
local sub = string.sub
local type = type
local unstrip = mw.text.unstrip
local Hant_chars

--[==[
Loaders for functions in other modules, which overwrite themselves with the target function when called. This ensures
modules are only loaded when needed, retains the speed/convenience of locally-declared pre-loaded functions, and has no
overhead after the first call, since the target functions are called directly in any subsequent calls.
]==]
local function check_object(...)
	check_object = require(utilities_module).check_object
	return check_object(...)
end

local function debug_track(...)
	debug_track = require(debug_track_module)
	return debug_track(...)
end

local function decode_entities(...)
	decode_entities = require(string_utilities_module).decode_entities
	return decode_entities(...)
end

local function decode_uri(...)
	decode_uri = require(string_utilities_module).decode_uri
	return decode_uri(...)
end

local function deep_copy(...)
	deep_copy = require(table_module).deepCopy
	return deep_copy(...)
end

local function encode_entities(...)
	encode_entities = require(string_encode_entities_module)
	return encode_entities(...)
end

local function get_L2_sort_key(...)
	get_L2_sort_key = require(headword_page_module).get_L2_sort_key
	return get_L2_sort_key(...)
end

local function get_script(...)
	get_script = require(scripts_module).getByCode
	return get_script(...)
end

local function find_best_script_without_lang(...)
	find_best_script_without_lang = require(scripts_module).findBestScriptWithoutLang
	return find_best_script_without_lang(...)
end

local function get_family(...)
	get_family = require(families_module).getByCode
	return get_family(...)
end

local function get_plaintext(...)
	get_plaintext = require(utilities_module).get_plaintext
	return get_plaintext(...)
end

local function get_wikimedia_lang(...)
	get_wikimedia_lang = require(wikimedia_languages_module).getByCode
	return get_wikimedia_lang(...)
end

local function keys_to_list(...)
	keys_to_list = require(table_module).keysToList
	return keys_to_list(...)
end

local function list_to_set(...)
	list_to_set = require(table_module).listToSet
	return list_to_set(...)
end

local function load_data(...)
	load_data = require(load_module).load_data
	return load_data(...)
end

local function make_family_object(...)
	make_family_object = require(families_module).makeObject
	return make_family_object(...)
end

local function pattern_escape(...)
	pattern_escape = require(string_pattern_escape_module)
	return pattern_escape(...)
end

local function replacement_escape(...)
	replacement_escape = require(string_replacement_escape_module)
	return replacement_escape(...)
end

local function safe_require(...)
	safe_require = require(load_module).safe_require
	return safe_require(...)
end

local function shallow_copy(...)
	shallow_copy = require(table_module).shallowCopy
	return shallow_copy(...)
end

local function split(...)
	split = require(string_utilities_module).split
	return split(...)
end

local function to_json(...)
	to_json = require(json_module).toJSON
	return to_json(...)
end

local function u(...)
	u = require(string_utilities_module).char
	return u(...)
end

local function ugsub(...)
	ugsub = require(string_utilities_module).gsub
	return ugsub(...)
end

local function ulen(...)
	ulen = require(string_utilities_module).len
	return ulen(...)
end

local function ulower(...)
	ulower = require(string_utilities_module).lower
	return ulower(...)
end

local function umatch(...)
	umatch = require(string_utilities_module).match
	return umatch(...)
end

local function uupper(...)
	uupper = require(string_utilities_module).upper
	return uupper(...)
end
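
The loaders above all follow the same self-overwriting pattern. Here it is in isolation, with a hypothetical stand-in for the `require` target so the one-time cost is observable:

```lua
-- Stand-in for an expensive module; in the real code this would be a
-- require() call such as require(string_utilities_module).
local load_count = 0
local function load_heavy_module()
	load_count = load_count + 1
	return { greet = function(name) return "hello " .. name end }
end

-- The loader overwrites itself with the target function on first call, so
-- every subsequent call goes straight to the target with no indirection.
local function greet(...)
	greet = load_heavy_module().greet
	return greet(...)
end

local a = greet("world")
local b = greet("again")
```

After the two calls above, `load_heavy_module` has run exactly once; the second call never touches the loader.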
 
local function track(page)
	debug_track("languages/" .. page)
	return true
end
 
local function normalize_code(code)
	return load_data(languages_data_module).aliases[code] or code
end
 
local function check_inputs(self, check, default, ...)
	local n = select("#", ...)
	if n == 0 then
		return false
	end
	local ret = check(self, (...))
	if ret ~= nil then
		return ret
	elseif n > 1 then
		local inputs = {...}
		for i = 2, n do
			ret = check(self, inputs[i])
			if ret ~= nil then
				return ret
			end
		end
	end
	return default
end
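
A standalone sketch of the `check_inputs()` contract (the first non-nil check result wins, then the default, and `false` when no inputs are given at all); the vowel checker here is hypothetical:

```lua
-- Standalone reimplementation of the check_inputs() logic for illustration.
local function check_inputs(self, check, default, ...)
	local n = select("#", ...)
	if n == 0 then
		return false
	end
	local inputs = {...}
	for i = 1, n do
		local ret = check(self, inputs[i])
		if ret ~= nil then
			return ret
		end
	end
	return default
end

-- Hypothetical checker: returns the first vowel, or nil if there is none.
local function first_vowel(self, s)
	return s:match("[aeiou]")
end

local r1 = check_inputs(nil, first_vowel, "?", "xyz", "crypt", "hello")
local r2 = check_inputs(nil, first_vowel, "?", "xyz", "pfft")
local r3 = check_inputs(nil, first_vowel, "?")
```

Here `r1` comes from the first input that produces a non-nil result ("hello"), `r2` falls back to the default, and `r3` is `false` because no inputs were supplied.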

local function make_link(self, target, display)
	local prefix, main
	if self:getFamilyCode() == "qfa-sub" then
		prefix, main = display:match("^(the )(.*)")
		if not prefix then
			prefix, main = display:match("^(a )(.*)")
		end
	end
	return (prefix or "") .. "[[" .. target .. "|" .. (main or display) .. "]]"
end
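
The prefix handling in `make_link()` keeps a leading "the " or "a " outside the link for substrate-family languages (family code `qfa-sub`). A standalone sketch, where the mocked `self` object and the target pagename are hypothetical:

```lua
-- Same logic as make_link() above, isolated for illustration.
local function make_link(self, target, display)
	local prefix, main
	if self:getFamilyCode() == "qfa-sub" then
		prefix, main = display:match("^(the )(.*)")
		if not prefix then
			prefix, main = display:match("^(a )(.*)")
		end
	end
	return (prefix or "") .. "[[" .. target .. "|" .. (main or display) .. "]]"
end

-- Minimal mock of a language object belonging to the substrate family.
local substrate = { getFamilyCode = function() return "qfa-sub" end }
local link = make_link(substrate, "Wiktionary:BMAC", "the BMAC substrate")
```

The article stays outside the brackets, so the rendered text reads naturally while only the language name itself is linked.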
local function doTempSubstitutions(text, subbedChars, keepCarets, noTrim)
	-- Clone so that we don't insert any extra patterns into the table in package.loaded. For some reason, using
	-- require seems to keep memory use down, probably because the table is always cloned.
	local patterns = shallow_copy(require(languages_data_patterns_module))
	if keepCarets then
		insert(patterns, "((\\+)%^)")
	end
	-- Pre-substitution of "[[" and "]]", which makes pattern matching more accurate.
	text = gsub(text, "%f[%[]%[%[", "\1"):gsub("%f[%]]%]%]", "\2")
	local i = #subbedChars
	for _, pattern in ipairs(patterns) do
		end)
	end
	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
	return text, subbedChars
end
		local byte3 = floor(i / 64) % 64 + 128
		local byte4 = i % 64 + 128
		text = gsub(text, "\244[" .. char(byte2) .. char(byte2 + 8) .. "]" .. char(byte3) .. char(byte4),
			replacement_escape(subbedChars[i]))
	end
	text = gsub(text, "\1", "%[%["):gsub("\2", "%]%]")
	return text
end
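
The byte arithmetic above packs a substitution index into a four-byte UTF-8 sequence beginning with `\244` (0xF4), i.e. a code point near U+100000, so placeholders cannot collide with ordinary text. A simplified round-trip sketch; the module's real scheme also varies the second byte, which this illustration fixes at 0x80:

```lua
local floor, char = math.floor, string.char

-- Encode an index as a four-byte UTF-8 sequence in the U+100000 range
-- (illustrative layout: second byte fixed at 0x80).
local function encode_index(i)
	local byte3 = floor(i / 64) % 64 + 128
	local byte4 = i % 64 + 128
	return char(244, 128, byte3, byte4)
end

-- Recover the index from the two low continuation bytes.
local function decode_index(s)
	local _, _, b3, b4 = s:byte(1, 4)
	return (b3 - 128) * 64 + (b4 - 128)
end

local roundtrip = decode_index(encode_index(1234))
```

Each continuation byte carries six bits of the index, which is why the encoder divides by 64 and the decoder multiplies by 64.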
local function checkNoEntities(self, text)
	local textNoEnc = decode_entities(text)
	if textNoEnc ~= text and load_data(links_data_module).unsupported_titles[text] then
		return text
	else
	if not check_object("script", true, sc) or sc:getCode() == "None" then
		return self:findBestScript(text)
	end
	return sc
end


end


-- Subfunction of iterateSectionSubstitutions(). Process an individual chunk of text according to the specifications in
-- `substitution_data`. The input parameters are all as in the documentation of iterateSectionSubstitutions() except for
-- `recursed`, which is set to true if we called ourselves recursively to process a script-specific setting or
-- script-wide fallback. Returns two values: the processed text and the actual substitution data used to do the
-- substitutions (same as the `actual_substitution_data` return value of iterateSectionSubstitutions()).
local function doSubstitutions(self, text, sc, substitution_data, data_field, function_name, recursed)
-- BE CAREFUL in this function, because the value at any level can be `false`, which causes no processing to be done
-- and blocks any further fallback processing.
local actual_substitution_data = substitution_data
-- If there are language-specific substitutes given in the data module, use those.
if type(substitution_data) == "table" then
-- If a script is specified, run this function with the script-specific data before continuing.
local sc_code = sc:getCode()
local has_substitution_data = false
if substitution_data[sc_code] ~= nil then
has_substitution_data = true
if substitution_data[sc_code] then
text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[sc_code], data_field,
function_name, true)
end
-- Hant, Hans and Hani are usually treated the same, so add a special case to avoid having to specify each one
-- separately.
elseif sc_code:match("^Han") and substitution_data.Hani ~= nil then
has_substitution_data = true
if substitution_data.Hani then
text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data.Hani, data_field,
function_name, true)
end
-- Substitution data with key 1 in the outer table may be given as a fallback.
elseif substitution_data[1] ~= nil then
has_substitution_data = true
if substitution_data[1] then
text, actual_substitution_data = doSubstitutions(self, text, sc, substitution_data[1], data_field,
function_name, true)
end
end
-- Iterate over all strings in the "from" subtable, and gsub with the corresponding string in "to". We work with
-- the NFD decomposed forms, as this simplifies many substitutions.
if substitution_data.from then
has_substitution_data = true
for i, from in ipairs(substitution_data.from) do
-- Normalize each loop, to ensure multi-stage substitutions work correctly.

if substitution_data.remove_diacritics then
has_substitution_data = true
text = sc:toFixedNFD(text)
-- Convert exceptions to PUA.
text = text:gsub("\242[\128-\191]*", substitutes)
end
if not has_substitution_data and sc._data[data_field] then
-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
function_name, true)
end
end
elseif type(substitution_data) == "string" then
local module = safe_require("Module:" .. substitution_data)
if module then
-- TODO: translit functions should be called with form NFD.
if function_name == "tr" then
if not module[function_name] then
error(("Internal error: Module [[%s]] has no function named 'tr'"):format(substitution_data))
end
text = module[function_name](text, self._code, sc:getCode())
elseif function_name == "stripDiacritics" then
-- FIXME, get rid of this arm after renaming makeEntryName -> stripDiacritics.
if module[function_name] then
text = module[function_name](sc:toFixedNFD(text), self, sc)
elseif module.makeEntryName then
text = module.makeEntryName(sc:toFixedNFD(text), self, sc)
else
error(("Internal error: Module [[%s]] has no function named 'stripDiacritics' or 'makeEntryName'"
):format(substitution_data))
end
else
if not module[function_name] then
error(("Internal error: Module [[%s]] has no function named '%s'"):format(
substitution_data, function_name))
end
text = module[function_name](sc:toFixedNFD(text), self, sc)
end
else
error("Substitution data '" .. substitution_data .. "' does not match an existing module.")
end
elseif substitution_data == nil and sc._data[data_field] then
-- If language-specific sort key (etc.) is nil, fall back to script-wide sort key (etc.).
text, actual_substitution_data = doSubstitutions(self, text, sc, sc._data[data_field], data_field,
function_name, true)
end


-- Don't normalize to NFC if this is the inner loop or if a module returned nil.
if recursed or not text then
return text, actual_substitution_data
end
-- Fix any discouraged sequences created during the substitution process, and normalize into the final form.
return sc:toFixedNFC(sc:fixDiscouragedSequences(text)), actual_substitution_data
end
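The script-specific resolution order implemented above (explicit script key first, then a shared `Hani` entry for any Han script, then the key-1 fallback) can be sketched standalone in plain Lua. The function and data names below are invented for illustration; the real doSubstitutions() additionally threads `has_substitution_data`, `false`-blocking and recursion through the same branches.

```lua
-- Minimal sketch of the lookup order only (names invented):
local function resolve(substitution_data, sc_code)
	if substitution_data[sc_code] ~= nil then
		-- Script-specific setting wins.
		return substitution_data[sc_code]
	elseif sc_code:match("^Han") and substitution_data.Hani ~= nil then
		-- Hant, Hans and Hani are usually treated the same.
		return substitution_data.Hani
	elseif substitution_data[1] ~= nil then
		-- Key 1 in the outer table is a generic fallback.
		return substitution_data[1]
	end
end

local data = {Cyrl = "cyrl-module", Hani = "han-module", [1] = "default-module"}
assert(resolve(data, "Cyrl") == "cyrl-module")
assert(resolve(data, "Hant") == "han-module")   -- any Han script falls back to Hani
assert(resolve(data, "Latn") == "default-module")
```
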


-- Split the text into sections, based on the presence of temporarily substituted formatting characters, then iterate
-- over each section to apply substitutions (e.g. transliteration or diacritic stripping). This avoids putting PUA
-- characters through language-specific modules, which may be unequipped for them. This function is passed the following
-- values:
-- * `self` (the Language object);
-- * `text` (the text to process);
-- * `sc` (the script of the text, which must be specified; callers should call checkScript() as needed to autodetect
--   the script of the text if not given explicitly by the user);
-- * `subbedChars` (an array of the same length as the text, indicating which characters have been substituted and by
--   what, or {nil} if no substitutions are to happen);
-- * `keepCarets` (DOCUMENT ME);
-- * `substitution_data` (the data indicating which substitutions to apply, taken directly from `data_field` in the
--   language's data structure in a submodule of [[Module:languages/data]]);
-- * `data_field` (the data field from which `substitution_data` was fetched, such as "sort_key" or "strip_diacritics");
-- * `function_name` (the name of the function to call to do the substitution, in case `substitution_data` specifies a
--   module to do the substitution);
-- * `notrim` (don't trim whitespace at the edges of `text`; set when computing the sort key, because whitespace at the
--   beginning of a sort key is significant and causes the resulting page to be sorted at the beginning of the category
--   it's in).
-- Returns three values:
-- (1) the processed text;
-- (2) the value of `subbedChars` that was passed in, possibly modified with additional character substitutions; will be
--     {nil} if {nil} was passed in;
-- (3) the actual substitution data that was used to apply substitutions to `text`; this may be different from the value
--     of `substitution_data` passed in if that value recursively specified script-specific substitutions or if no
--     substitution data could be found in the language-specific data (e.g. {nil} was passed in or a structure was passed
--     in that had no setting for the script given in `sc`), but a script-wide fallback value was set; currently it is
--     only used by makeSortKey().
local function iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, substitution_data, data_field,
function_name, notrim)
local sections
-- See [[Module:languages/data]].
if not find(text, "\244") or load_data(languages_data_module).substitution[self._code] == "cont" then
sections = {text}
else
sections = split(text, "\244[\128-\143][\128-\191]*", true)
end
local actual_substitution_data
for _, section in ipairs(sections) do
-- Don't bother processing empty strings or whitespace (which may also not be handled well by dedicated
-- modules).
if gsub(section, "%s+", "") ~= "" then
local sub, this_actual_substitution_data = doSubstitutions(self, section, sc, substitution_data, data_field,
function_name)
actual_substitution_data = this_actual_substitution_data
-- Second round of temporary substitutions, in case any formatting was added by the main substitution
-- process. However, don't do this if the section contains formatting already (as it would have had to have
-- been escaped to reach this stage, and therefore should be given as raw text).
if sub and subbedChars then
local noSub
for _, pattern in ipairs(require(languages_data_patterns_module)) do
if match(section, pattern .. "%z?") then
noSub = true
end
end
if not sub then
text = sub
break
end
text = sub and gsub(text, pattern_escape(section), replacement_escape(sub), 1) or text
end
end


if not notrim then
-- Trim, unless there are only spacing characters, while ignoring any final formatting characters.
-- Do not trim sort keys because spaces at the beginning are significant.
text = text and text:gsub("^([\128-\191\244]*)%s+(%S)", "%1%2"):gsub("(%S)%s+([\128-\191\244]*)$", "%1%2") or
nil
end

return text, subbedChars, actual_substitution_data
end


text, rep = gsub(text, "\\\\(\\*^)", "\3%1")
until rep == 0
return (text:gsub("\\^", "\4")
:gsub(pattern or "%^", repl or "")
:gsub("\3", "\\")
:gsub("\4", "^"))
end


-- Add article and " substrate" to substrates that lack them.
if self:getFamilyCode() == "qfa-sub" then
if not (sub(form, 1, 4) == "the " or sub(form, 1, 2) == "a ") then
form = "a " .. form
end
if not match(form, " [Ss]ubstrate") then
form = form .. " substrate"
end
Language.hasType = require(language_like_module).hasType
return self:hasType(...)
end


if wm_langs == nil then
wm_langs = self._data.wikimedia_codes
if wm_langs then
wm_langs = split(wm_langs, ",", true, true)
else
local code = self._code
if is_known_language_tag(code) then
wm_langs = {code}
else
-- Inherit, but only if no codes are specified in the data *and*
-- the language code isn't a valid Wikimedia language code.
local parent = self:getParent()
wm_langs = parent and parent:getWikimediaLanguageCodes() or {}
end
end
self._wikimediaLanguageCodes = wm_langs
end
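The fallback chain above (explicit `wikimedia_codes` data, then the language's own code when it is a recognized Wikimedia tag, then the parent's codes) can be sketched as a flat standalone function. The name and parameter shape below are invented for illustration; the real method reads from `self._data` and caches its result.

```lua
-- Standalone sketch of the decision order only (names invented):
local function wikimedia_codes_for(data_codes, code, is_known_tag, parent_codes)
	if data_codes then
		-- Comma-separated list specified in the data module.
		local result = {}
		for c in data_codes:gmatch("[^,]+") do
			result[#result + 1] = c
		end
		return result
	elseif is_known_tag then
		-- The language's own code is itself a valid Wikimedia code.
		return {code}
	else
		-- Otherwise inherit from the parent, if any.
		return parent_codes or {}
	end
end

assert(wikimedia_codes_for("fr,frc", "fr", true)[2] == "frc")
assert(wikimedia_codes_for(nil, "fr-CA", false, {"fr"})[1] == "fr")
```
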


function Language:makeWikipediaLink()
return make_link(self, (self:hasType("conlang") and self:getCanonicalName() or "w:" .. self:getWikipediaArticle()), self:getCanonicalName())
end


Language.getCommonsCategory = require(language_like_module).getCommonsCategory
return self:getCommonsCategory()
end

function Language:getMainCategoryName()
return self._data.main_category or "lemma"
end


local codes = self:getScriptCodes()
if codes[1] == "All" then
scripts = load_data(scripts_data_module)
else
scripts = {}
first_sc = get_script(first_sc)
local charset = first_sc.characters
return charset and umatch(text, "[" .. ugsub(charset, "%]", "%%]") .. "]") and first_sc or get_script("None")
end
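The `ugsub(charset, "%]", "%%]")` call above guards against a raw `]` in the charset terminating the bracketed character set early. The same idea can be shown in plain Lua with byte patterns (helper name invented for illustration):

```lua
-- Escape "]" before embedding a charset string into a [...] set (name invented):
local function build_set_pattern(charset)
	return "[" .. charset:gsub("%]", "%%]") .. "]"
end

-- Without the escape, "[" .. "]" .. "]" would be the empty set "[]" followed by a
-- literal "]"; with it, the set correctly contains the "]" character itself.
assert(build_set_pattern("]") == "[%]]")
assert(("x]y"):match(build_set_pattern("]")) == "]")
assert(("abc"):match(build_set_pattern("a-c]")) == "a")
```
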


local t, s, found = 0, 0
-- This is faster than using mw.ustring.gmatch directly.
for ch in gmatch((ugsub(text, "[" .. Hani.characters .. "]", "\255%0")), "\255(.[\128-\191]*)") do
found = true
if Hant_chars[ch] then
-- Count characters by removing everything in the script's charset and comparing to the original length.
local charset = sc.characters
local count = charset and length - ulen((ugsub(text, "[" .. charset:gsub("%]", "%%]") .. "]+", ""))) or 0

if count >= length then
end


do
local function check_family(self, family)
if type(family) == "table" then
family = family:getCode()
end
local self_family_code = self:getFamilyCode()
if self_family_code == nil then
return false
elseif self_family_code == family then
return true
end
return true
-- If the family isn't a real family (e.g. creoles) check any ancestors.
elseif self_family:inFamily("qfa-not") then
local ancestors = self:getAncestors()
for _, ancestor in ipairs(ancestors) do
end
end
return false
end

--[==[Check whether the language belongs to `family` (which can be a family code or object). A list of objects can be given in place of `family`; in that case, return true if the language belongs to any of the specified families. Note that some languages (in particular, certain creoles) can have multiple immediate ancestors potentially belonging to different families; in that case, return true if the language belongs to any of the specified families.]==]
function Language:inFamily(...)
if self:getFamilyCode() == nil then
return false
end
return check_inputs(self, check_family, false, ...)
end
end


end


do
local function check_lang(self, lang)
for _, parent in ipairs(self:getParentChain()) do
if (type(lang) == "string" and lang or lang:getCode()) == parent:getCode() then
return true
end
end
end

function Language:hasParent(...)
return check_inputs(self, check_lang, false, ...)
end
end


--[==[Given a list of language objects or codes, returns true if at least one of them is an ancestor. This includes any etymology-only children of that ancestor. If the language's ancestor(s) are etymology-only languages, it will also return true for those language parent(s) (e.g. if Vulgar Latin is the ancestor, it will also return true for its parent, Latin). However, a parent is excluded from this if the ancestor is also ancestral to that parent (e.g. if Classical Persian is the ancestor, Persian would return false, because Classical Persian is also ancestral to Persian).]==]
function Language:hasAncestor(...)
local function iterateOverAncestorTree(node, func, parent_check)
local ancestors = node:getAncestors()
local ancestorsParents = {}
for _, ancestor in ipairs(ancestors) do
-- When checking the parents of the other language, and the ancestor is also a parent, skip to the next ancestor, so that we exclude any etymology-only children of that parent that are not directly related (see below).
local ret = (parent_check or not node:hasParent(ancestor)) and
func(ancestor) or iterateOverAncestorTree(ancestor, func, parent_check)
if ret then
return ret
end


do
local function construct_node(lang, memo)
local branch, ancestors = {lang = lang:getCode()}
memo[lang:getCode()] = branch
for _, ancestor in ipairs(lang:getAncestors()) do
if ancestors == nil then
ancestors = {}
end
insert(ancestors, memo[ancestor:getCode()] or construct_node(ancestor, memo))
end
branch.ancestors = ancestors
return branch
end

function Language:getAncestorChain()
local chain = self._ancestorChain
if chain == nil then
chain = construct_node(self, {})
self._ancestorChain = chain
end
return chain
end
end

function Language:getAncestorChainOld()
local chain = self._ancestorChain
if chain == nil then
local ancestors = step:getAncestors()
step = #ancestors == 1 and ancestors[1] or nil
if not step then
break
end
insert(chain, step)
end
self._ancestorChain = chain
end
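The memoized tree construction used by getAncestorChain() can be exercised outside Scribunto with stub language objects. The codes below are invented for illustration; each node carries the language code plus an `ancestors` list of already-memoized branches, so shared ancestors are built only once.

```lua
-- Standalone copy of the construct_node() logic above, with stub objects:
local function construct_node(lang, memo)
	local branch, ancestors = {lang = lang:getCode()}
	memo[lang:getCode()] = branch
	for _, ancestor in ipairs(lang:getAncestors()) do
		ancestors = ancestors or {}
		-- Reuse the memoized branch if this ancestor was already visited.
		ancestors[#ancestors + 1] = memo[ancestor:getCode()] or construct_node(ancestor, memo)
	end
	branch.ancestors = ancestors
	return branch
end

-- Stub objects (hypothetical codes): "xx" descends from "xx-old".
local old = {getCode = function() return "xx-old" end, getAncestors = function() return {} end}
local new = {getCode = function() return "xx" end, getAncestors = function() return {old} end}
local tree = construct_node(new, {})
assert(tree.lang == "xx" and tree.ancestors[1].lang == "xx-old")
assert(tree.ancestors[1].ancestors == nil) -- leaf nodes get no ancestors table
```
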


do
local function check_lang(self, lang)
if type(lang) == "string" then
lang = get_by_code(lang, nil, true)
end
end
end

function Language:hasDescendant(...)
return check_inputs(self, check_lang, false, ...)
end
end


local function fetch_children(self, fmt)
local m_etym_data = require(etymology_languages_data_module)
local self_code, children = self._code, {}
for code, lang in pairs(m_etym_data) do
if name == nil then
name = self:getCanonicalName()
-- If a substrate, omit any leading article.
if self:getFamilyCode() == "qfa-sub" then
name = name:gsub("^the ", ""):gsub("^a ", "")
end
-- Only add " language" if a full language.
if self:hasType("full") then
--[==[Creates a link to the category; the link text is the canonical name.]==]
function Language:makeCategoryLink()
return make_link(self, ":Category:" .. self:getCategoryName(), self:getDisplayForm())
end


function Language:getStandardCharacters(sc)
local standard_chars = self._data.standard_chars
if type(standard_chars) ~= "table" then
return standard_chars
end


--[==[
Strip diacritics from display text `text` (in a language-specific fashion), which is in the script `sc`. If `sc` is
omitted or {nil}, the script is autodetected. This also strips certain punctuation characters from the end and (in the
case of Spanish upside-down question marks and exclamation points) from the beginning; strips any whitespace at the
end of the text or between the text and final stripped punctuation characters; and applies some language-specific
Unicode normalizations to replace discouraged characters with their prescribed alternatives. Returns the stripped text.
]==]
function Language:stripDiacritics(text, sc)
if (not text) or text == "" then
return text
end


sc = checkScript(text, self, sc)
text = normalize(text, sc)
text, _, _ = iterateSectionSubstitutions(self, text, sc, nil, nil,
self._data.strip_diacritics or self._data.entry_name, "strip_diacritics", "stripDiacritics")

text = umatch(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟?!︖︕।॥။၊་།]?$") or text
return text
end
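The final `umatch` above trims one optional piece of edge punctuation (plus any whitespace before it) while leaving internal punctuation intact. An ASCII-only plain-Lua sketch of the same pattern shape follows; the real code runs under mw.ustring with a much larger Unicode punctuation set, and earlier steps have already trimmed leading whitespace.

```lua
-- ASCII-only sketch of the edge-punctuation trim (helper name invented):
local function strip_edge_punct(s)
	-- Capture must contain at least one non-space, non-punctuation character;
	-- an optional trailing "?" or "!" (and whitespace before it) is dropped.
	return s:match("^(.-[^%s%p].-)%s*[?!]?$") or s
end

assert(strip_edge_punct("word?") == "word")
assert(strip_edge_punct("word ?") == "word")
assert(strip_edge_punct("a?b") == "a?b") -- internal punctuation is kept
```
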
 
--[==[
Convert a ''logical'' pagename (the pagename as it appears to the user, after diacritics and punctuation have been
stripped) to a ''physical'' pagename (the pagename as it appears in the MediaWiki database). Reasons for a difference
between the two are (a) unsupported titles such as `[ ]` (with square brackets in them), `#` (pound/hash sign) and
`¯\_(ツ)_/¯` (with underscores), as well as overly long titles of various sorts; (b) "mammoth" pages that are split into
parts (e.g. `a`, which is split into physical pagenames `a/languages A to L` and `a/languages M to Z`). For almost all
purposes, you should work with logical and not physical pagenames. But there are certain use cases that require physical
pagenames, such as checking the existence of a page or retrieving a page's contents.
 
`pagename` is the logical pagename to be converted. `is_reconstructed_or_appendix` indicates whether the page is in the
`Reconstruction` or `Appendix` namespaces. If it is omitted or has the value {nil}, the pagename is checked for an
initial asterisk, and if found, the page is assumed to be a `Reconstruction` page. Setting a value of `false` or `true`
to `is_reconstructed_or_appendix` disables this check and allows for mainspace pagenames that begin with an asterisk.
]==]
function Language:logicalToPhysical(pagename, is_reconstructed_or_appendix)
-- FIXME: This probably shouldn't happen but it happens when makeEntryName() receives nil.
if pagename == nil then
track("nil-passed-to-logicalToPhysical")
return nil
end
local initial_asterisk
if is_reconstructed_or_appendix == nil then
local pagename_minus_initial_asterisk
initial_asterisk, pagename_minus_initial_asterisk = pagename:match("^(%*)(.*)$")
if pagename_minus_initial_asterisk then
is_reconstructed_or_appendix = true
pagename = pagename_minus_initial_asterisk
elseif self:hasType("appendix-constructed") then
is_reconstructed_or_appendix = true
end
end
 
if not is_reconstructed_or_appendix then
-- Check if the pagename is a listed unsupported title.
local unsupportedTitles = load_data(links_data_module).unsupported_titles
if unsupportedTitles[pagename] then
return "Unsupported titles/" .. unsupportedTitles[pagename]
end
end
 
-- Set `unsupported` as true if certain conditions are met.
local unsupported
-- Check if there's an unsupported character. \239\191\189 is the replacement character U+FFFD, which can't be typed
-- directly here due to an abuse filter. Unix-style dot-slash notation is also unsupported, as it is used for
-- relative paths in links, as are 3 or more consecutive tildes. Note: match is faster with magic
-- characters/charsets; find is faster with plaintext.
if (
match(pagename, "[#<>%[%]_{|}]") or
find(pagename, "\239\191\189") or
match(pagename, "%f[^%z/]%.%.?%f[%z/]") or
find(pagename, "~~~")
) then
unsupported = true
-- If it looks like an interwiki link.
elseif find(pagename, ":") then
local prefix = gsub(pagename, "^:*(.-):.*", ulower)
if (
load_data("Module:data/namespaces")[prefix] or
end


-- Escape unsupported characters so they can be used in titles. ` is used as a delimiter for this, so a raw use of
-- it in an unsupported title is also escaped here to prevent interference; this is only done with unsupported
-- titles, though, so inclusion won't in itself mean a title is treated as unsupported (which is why it's excluded
-- from the earlier test).
if unsupported then
-- FIXME: This conversion needs to be different for reconstructed pages with unsupported characters. There
-- aren't any currently, but if there ever are, we need to fix this e.g. to put them in something like
-- Reconstruction:Proto-Indo-European/Unsupported titles/`lowbar``num`.
local unsupported_characters = load_data(links_data_module).unsupported_characters
pagename = pagename:gsub("[#<>%[%]_`{|}\239]\191?\189?", unsupported_characters)
:gsub("%f[^%z/]%.%.?%f[%z/]", function(m)
return (gsub(m, "%.", "`period`"))
end)
:gsub("~~~+", function(m)
return (gsub(m, "~", "`tilde`"))
end)
pagename = "Unsupported titles/" .. pagename
elseif not is_reconstructed_or_appendix then
-- Check if this is a mammoth page. If so, which subpage should we link to?
local m_links_data = load_data(links_data_module)
local mammoth_page_type = m_links_data.mammoth_pages[pagename]
if mammoth_page_type then
local canonical_name = self:getFullName()
if canonical_name ~= "Translingual" and canonical_name ~= "English" then
local this_subpage
local L2_sort_key = get_L2_sort_key(canonical_name)
for _, subpage_spec in ipairs(m_links_data.mammoth_page_subpage_types[mammoth_page_type]) do
-- unpack() fails utterly on data loaded using mw.loadData() even if offsets are given
local subpage, pattern = subpage_spec[1], subpage_spec[2]
if pattern == true or L2_sort_key:match(pattern) then
this_subpage = subpage
break
end
end
if not this_subpage then
error(("Internal error: Bad data in mammoth_page_subpage_types in [[Module:links/data]] for mammoth page %s, type %s; last entry didn't have 'true' in it"):format(
pagename, mammoth_page_type))
end
pagename = pagename .. "/" .. this_subpage
end
end
end

return (initial_asterisk or "") .. pagename
end
 
--[==[
Strip the diacritics from a display pagename and convert the resulting logical pagename into a physical pagename.
This allows you, for example, to retrieve the contents of the page or check its existence. WARNING: This is deprecated
and will be going away. It is a simple composition of `self:stripDiacritics` and `self:logicalToPhysical`; most callers
only want the former, and if you need both, call them both yourself.
 
`text` and `sc` are as in `self:stripDiacritics`, and `is_reconstructed_or_appendix` is as in `self:logicalToPhysical`.
]==]
function Language:makeEntryName(text, sc, is_reconstructed_or_appendix)
return self:logicalToPhysical(self:stripDiacritics(text, sc), is_reconstructed_or_appendix)
end
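-- A minimal usage sketch (hypothetical; assumes the Latin entry-name data strips
-- macrons, as on the English Wiktionary):
--   local la = require("Module:languages").getByCode("la")
--   local entry = la:makeEntryName("amō") -- strips the macron, giving "amo"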


--[==[Generates alternative forms using a specified method, and returns them as a table. If no method is specified, returns a table containing only the input term.]==]

end
sc = checkScript(text, self, sc)
return require("Module:" .. self._data.generate_forms).generateForms(text, self, sc)
end


--[==[Creates a sort key for the given stripped text, following the rules appropriate for the language. This removes
diacritical marks from the stripped text if they are not considered significant for sorting, and may perform some other
changes. Any initial hyphen is also removed, and anything in parentheses is removed as well.
The <code>sort_key</code> setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the stripped text and returns a sortkey.]==]
function Language:makeSortKey(text, sc)
if (not text) or text == "" then
return text
end
if match(text, "<[^<>]+>") then
track("track HTML tag")
end
-- Remove directional characters, bold, italics, soft hyphens, strip markers and HTML tags.
-- FIXME: Partly duplicated with remove_formatting() in [[Module:links]].
text = ugsub(text, "[\194\173\226\128\170-\226\128\174\226\129\166-\226\129\169]", "")
text = text:gsub("('*)'''(.-'*)'''", "%1%2"):gsub("('*)''(.-'*)''", "%1%2")
text = gsub(unstrip(text), "<[^<>]+>", "")

text = sc:toFixedNFD(text)
end
-- Convert to lowercase, make the sortkey, then convert to uppercase. Where the language has dotted dotless i, it is
-- usually not necessary to convert "i" to "İ" and "ı" to "I" first, because "I" will always be interpreted as
-- conventional "I" (not dotless "İ") by any sorting algorithms, which will have been taken into account by the
-- sortkey substitutions themselves. However, if no sortkey substitutions have been specified, then conversion is
-- necessary so as to prevent "i" and "ı" both being sorted as "I".
--
-- An exception is made for scripts that (sometimes) sort by scraping page content, as that means they are sensitive
-- to changes in capitalization (as it changes the target page).
if not sc:sortByScraping() then
text = ulower(text)
end

local actual_substitution_data
-- Don't trim whitespace here because it's significant at the beginning of a sort key or sort base.
text, _, actual_substitution_data = iterateSectionSubstitutions(self, text, sc, nil, nil, self._data.sort_key,
"sort_key", "makeSortKey", "notrim")

if not sc:sortByScraping() then
if self:hasDottedDotlessI() and not actual_substitution_data then
text = text:gsub("ı", "I"):gsub("i", "İ")
text = sc:toFixedNFC(text)
end

-- Remove parentheses, as long as they are either preceded or followed by something.
text = gsub(text, "(.)[()]+", "%1"):gsub("[()]+(.)", "%1")

text = escape_risky_characters(text)
return text
end
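-- A minimal usage sketch (hypothetical; assumes `lang` is a language object whose
-- data specifies no special sort_key substitutions):
--   local key = lang:makeSortKey("-foo (bar)")
--   -- The initial hyphen and the parentheses are removed and the result is
--   -- uppercased, yielding something like "FOO BAR".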


--[==[Create the form used as a basis for display text and transliteration. FIXME: Rename to correctInputText().]==]
local function processDisplayText(text, self, sc, keepCarets, keepPrefixes)
local subbedChars = {}

sc = checkScript(text, self, sc)
text = normalize(text, sc)
text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, keepCarets, self._data.display_text,
"display_text", "makeDisplayText")

text = removeCarets(text, sc)

while true do
local prefix = gsub(text, "^(.-):.+", function(m1)
return (gsub(m1, "\244[\128-\191]*", ""))
end)
-- Check if the prefix is an interwiki, though ignore capitalised Wiktionary:, which is a namespace.
if not prefix or prefix == text or prefix == "Wiktionary"
or not (load_data("Module:data/interwikis")[ulower(prefix)] or prefix == "") then
break
end

end)
end
text = gsub(text, "\3", "\\"):gsub("\4", ":")
end

return text, subbedChars
end

--[==[Make the display text (i.e. what is displayed on the page).]==]
function Language:makeDisplayText(text, sc, keepPrefixes)
if not text or text == "" then
return text
end

local subbedChars
text, subbedChars = processDisplayText(text, self, sc, nil, keepPrefixes)

text = escape_risky_characters(text)
return undoTempSubstitutions(text, subbedChars)
end

--[==[Transliterates the text from the given script into the Latin script (see
[[Wiktionary:Transliteration and romanization]]). The language must have the <code>translit</code> property for this to
work; if it is not present, {{code|lua|nil}} is returned.

The <code>sc</code> parameter is handled by the transliteration module, and how it is handled is specific to that
module. Some transliteration modules may tolerate {{code|lua|nil}} as the script, others require it to be one of the
possible scripts that the module can transliterate, and will throw an error if it's not one of them. For this reason,
the <code>sc</code> parameter should always be provided when writing non-language-specific code.

The <code>module_override</code> parameter is used to override the default module that is used to provide the
transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no
default module yet, or you want to demonstrate an alternative version of a transliteration module before making it
official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked
by [[Wiktionary:Tracking/languages/module_override]].
'''Known bugs''':
* This function assumes {tr(s1) .. tr(s2) == tr(s1 .. s2)}. When this assertion fails, wikitext markups like <nowiki>'''</nowiki> can cause wrong transliterations.
* HTML entities like <code>&amp;apos;</code>, often used to escape wikitext markups, do not work.
]==]
function Language:transliterate(text, sc, module_override)
-- If there is no text, or the language doesn't have transliteration data and there's no override, return nil.
if not text or text == "" or text == "-" then
return text
end
-- If the script is not transliteratable (and no override is given), return nil.

if not (sc:isTransliterated() or module_override) then
-- temporary tracking to see if/when this gets triggered
track("non-transliterable")
track("non-transliterable/" .. self._code)
track("non-transliterable/" .. sc:getCode())
track("non-transliterable/" .. sc:getCode() .. "/" .. self._code)
return nil
end


-- Remove any strip markers.
text = unstrip(text)
-- Do not process the formatting into PUA characters for certain languages.
local processed = load_data(languages_data_module).substitution[self._code] ~= "none"

-- Get the display text with the keepCarets flag set.
local subbedChars
if processed then
text, subbedChars = processDisplayText(text, self, sc, true)
end

-- Transliterate (using the module override if applicable).
text, subbedChars = iterateSectionSubstitutions(self, text, sc, subbedChars, true, module_override or
self._data.translit, "translit", "tr")

if not text then
return nil
end


-- Incomplete transliterations return nil.
local charset = sc.characters
if charset and umatch(text, "[" .. charset:gsub("%]", "%%]") .. "]") then
-- Remove any characters in Latin, which includes Latin characters also included in other scripts (as these are
-- false positives), as well as any PUA substitutions. Anything remaining should only be script code "None"
-- (e.g. numerals).
local check_text = ugsub(text, "[" .. get_script("Latn").characters .. "􀀀-􏿽]+", "")
-- Set none_is_last_resort_only flag, so that any non-None chars will cause a script other than "None" to be
-- returned.
if find_best_script_without_lang(check_text, true):getCode() ~= "None" then
return nil
end
end


if processed then
text = escape_risky_characters(text)
text = undoTempSubstitutions(text, subbedChars)
end

-- If the script does not use capitalization, then capitalize any letters of the transliteration which are
-- immediately preceded by a caret (and remove the caret).
if text and not sc:hasCapitalization() and text:find("^", 1, true) then
text = processCarets(text, "%^([\128-\191\244]*%*?)([^\128-\191\244][\128-\191]*)", function(m1, m2)

end

-- Track module overrides.
if module_override ~= nil then
track("module_override")
end

return text
end
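-- A minimal usage sketch (hypothetical; assumes the "ru" data declares a translit module):
--   local ru = require("Module:languages").getByCode("ru")
--   local Cyrl = require("Module:scripts").getByCode("Cyrl")
--   local tr = ru:transliterate("пример", Cyrl)
--   -- `tr` is a Latin-script romanization, or nil if the transliteration fails
--   -- or is incomplete.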



function Language:toJSON(opts)
local strip_diacritics, strip_diacritics_patterns, strip_diacritics_remove_diacritics = self._data.strip_diacritics
if strip_diacritics then
if strip_diacritics.from then
strip_diacritics_patterns = {}
for i, from in ipairs(strip_diacritics.from) do
insert(strip_diacritics_patterns, {from = from, to = strip_diacritics.to[i] or ""})
end
end
strip_diacritics_remove_diacritics = strip_diacritics.remove_diacritics
end
-- mainCode should only end up non-nil if dontCanonicalizeAliases is passed to make_object().
-- props should either contain zero-argument functions to compute the value, or the value itself.
local props = {
ancestors = function() return self:getAncestorCodes() end,
canonicalName = function() return self:getCanonicalName() end,
categoryName = function() return self:getCategoryName("nocap") end,
code = self._code,
mainCode = self._mainCode,
parent = function() return self:getParentCode() end,
full = function() return self:getFullCode() end,
stripDiacriticsPatterns = strip_diacritics_patterns,
stripDiacriticsRemoveDiacritics = strip_diacritics_remove_diacritics,
family = function() return self:getFamilyCode() end,
aliases = function() return self:getAliases() end,
varieties = function() return self:getVarieties() end,
otherNames = function() return self:getOtherNames() end,
scripts = function() return self:getScriptCodes() end,
type = function() return keys_to_list(self:getTypes()) end,
wikimediaLanguages = function() return self:getWikimediaLanguageCodes() end,
wikidataItem = function() return self:getWikidataItem() end,
wikipediaArticle = function() return self:getWikipediaArticle(true) end,
}
local ret = {}
for prop, val in pairs(props) do
if not opts.skip_fields or not opts.skip_fields[prop] then
if type(val) == "function" then
ret[prop] = val()
else
ret[prop] = val
end
end
end
-- Use `deep_copy` when returning a table, so that there are no editing restrictions imposed by `mw.loadData`.
return opts and opts.lua_table and deep_copy(ret) or to_json(ret, opts)

function export.getDataModuleName(code)
local letter = match(code, "^(%l)%l%l?$")
return "Module:" .. (
letter == nil and "languages/data/exceptional" or
#code == 2 and "languages/data/2" or
"languages/data/3/" .. letter
)
end
get_data_module_name = export.getDataModuleName
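-- For illustration, the mapping works out as follows (example codes):
--   export.getDataModuleName("fr")      --> "Module:languages/data/2"
--   export.getDataModuleName("ang")     --> "Module:languages/data/3/a"
--   export.getDataModuleName("gmw-cfr") --> "Module:languages/data/exceptional"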

varieties = "unique",
wikipedia_article = "unique",
wikimedia_codes = "unique"
}

local function __index(self, k)
local stack, key_type = getmetatable(self), key_types[k]

end
end

local function __newindex()
error("table is read-only")
end

local function __pairs(self)
-- Iterate down the stack, caching keys to avoid duplicate returns.

end
end

local __ipairs = require(table_module).indexIpairs

function make_stack(data)
local stack = {

return setmetatable({}, stack), stack
end

return make_stack(data)
end

local function get_stack(data)
local stack = getmetatable(data)
return stack and type(stack) == "table" and stack[make_stack] and stack or nil
end

--[==[
<span style="color: var(--wikt-palette-red,#BA0000)">This function is not for use in entries or other content pages.</span>
Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes. If `extra` is set, any extra data in the relevant `/extra` module will be included. (Note that it will be included anyway if it has already been loaded into the language object.) If `raw` is set, then the returned data will not contain any data inherited from parent objects.
-- Do NOT use these methods!

return data
end

function Language:loadInExtraData()
-- Only full languages have extra data.

local modulename = get_extra_data_module_name(code)
-- No data cached as false.
stack[0] = modulename and load_data(modulename)[code] or false
end

--[==[Returns the name of the module containing the language's data (e.g. [[Module:languages/data/2]] or, for etymology-only languages, [[Module:etymology languages/data]]).]==]
function Language:getDataModuleName()
local name = self._dataModuleName
if name == nil then
name = self:hasType("etymology-only") and etymology_languages_data_module or
get_data_module_name(self._mainCode or self._code)
self._dataModuleName = name
end
return name
end
 
--[==[Returns the name of the module containing the language's extra data, or {{code|lua|nil}} if there is none (e.g. for etymology-only languages).]==]
function Language:getExtraDataModuleName()
local name = self._extraDataModuleName
if name == nil then
name = not self:hasType("etymology-only") and get_extra_data_module_name(self._mainCode or self._code) or false
self._extraDataModuleName = name
end
return name or nil
end
 
function export.makeObject(code, data, dontCanonicalizeAliases)
local data_type = type(data)

lang._mainCode = code
end

local parent_data = parent._data
if parent_data == nil then

end
lang._data = data

return setmetatable(lang, parent)
end

function export.getByCode(code, paramForError, allowEtymLang, allowFamily)
-- Track uses of paramForError, ultimately so it can be removed, as error-handling should be done by [[Module:parameters]], not here.

if paramForError ~= nil then
track("paramForError")
end
if type(code) ~= "string" then
local typ

end

local m_data = load_data(languages_data_module)
if m_data.aliases[code] or m_data.track[code] then
track(code)
end

local norm_code = normalize_code(code)
-- Get the data, checking for etymology-only languages if allowEtymLang is set.
local data = load_data(get_data_module_name(norm_code))[norm_code] or
allowEtymLang and load_data(etymology_languages_data_module)[norm_code]

-- If no data was found and allowFamily is set, check the family data. If the main family data was found, make the object with [[Module:families]] instead, as family objects have different methods. However, if it's an etymology-only family, use make_object in this module (which handles object inheritance), and the family-specific methods will be inherited from the parent object.
if data == nil and allowFamily then
data = load_data("Module:families/data")[norm_code]
if data ~= nil then
if data.parent == nil then
return make_family_object(norm_code, data)
elseif not allowEtymLang then
data = nil
end
end
end

end

--[==[Used by [[Module:languages/data/2]] (et al.) and [[Module:etymology languages/data]], [[Module:families/data]], [[Module:scripts/data]] and [[Module:writing systems/data]] to finalize the data into the format that is actually returned.]==]
function export.finalizeData(data, main_type, variety)
local fields = {"type"}

entity.parent, entity[3], entity.family = entity[3], entity.family
-- Give the type "regular" iff not a variety and no other types are assigned.
elseif not (entity.type or entity.parent) then
entity.type = "regular"
end