Module:Languages: Difference between revisions
Chrysophylax (talk | contribs) (Created page with "local export = {} --[=[ This function checks for things that could plausibly be a language code: two or three lowercase letters, two or three groups of three lowercase le...") |
No edit summary |
||
Line 319: | Line 319: | ||
if not ((module_override or self._rawData.translit_module) and text) then | if not ((module_override or self._rawData.translit_module) and text) then | ||
return nil | return nil | ||
end | end | ||
Line 402: | Line 398: | ||
function export.makeObject(code, data) | function export.makeObject(code, data) | ||
return data and setmetatable({ _rawData = data, _code = code }, Language) or nil | return data and setmetatable({ _rawData = data, _code = code }, Language) or nil |
Latest revision as of 20:19, 13 July 2021
This module is used to retrieve and manage the languages that can have Wiktionary entries, and the information associated with them. See Wiktionary:Languages for more information.
For the languages and language varieties that may be used in etymologies, see Module:etymology languages. For language families, which sometimes also appear in etymologies, see Module:families.
This module provides access to other modules. To access the information from within a template, see Module:languages/templates.
The information itself is stored in the various data modules that are subpages of this module. These modules should not be used directly by any other module, the data should only be accessed through the functions provided by this module.
Data submodules:
- Two-letter codes
- Three-letter codes by their first letter: a b c d e f g h i j k l m n o p q r s t u v w x y z
- Codes containing hyphens (
-
)
Finding and retrieving languages
The module exports a number of functions that are used to find languages.
getByCode
getByCode(code, paramForError, allowEtymLang, allowFamily)
Finds the language whose code matches the one provided. If it exists, it returns a Language
object representing the language. Otherwise, it returns
, unless paramForError is given, in which case an error is generated. If lua
paramForError
is
, a generic error message mentioning the bad code is generated; otherwise lua
paramForError
should be a string or number specifying the parameter that the code came from, and this parameter will be mentioned in the error message along with the bad code. If allowEtymLang
is specified, etymology language codes are allowed and looked up along with normal language codes. If allowFamily
is specified, language family codes are allowed and looked up along with normal language codes.
getByCanonicalName
getByCanonicalName(code, errorIfInvalid, allowEtymLang, allowFamily)
Finds the language whose canonical name (the name used to represent that language on Wiktionary) or other name matches the one provided. If it exists, it returns a Language
object representing the language. Otherwise, it returns
, unless paramForError is given, in which case an error is generated. If lua
allowEtymLang
is specified, etymology language codes are allowed and looked up along with normal language codes. If allowFamily
is specified, language family codes are allowed and looked up along with normal language codes.
The canonical name of languages should always be unique (it is an error for two languages on Wiktionary to share the same canonical name), so this is guaranteed to give at most one result.
This function is powered by Module:languages/canonical names, which contains a pre-generated mapping of non-etymology-language canonical names to codes. It is generated by going through the Category:Language data modules for non-etymology languages. When allowEtymLang
is specified for the above function, Module:etymology languages/by name may also be used, and when allowFamily
is specified for the above function, Module:families/by name may also be used.
getByName
getByName(name)
Like
, except it also looks at the lua
otherNames
listed in the non-etymology language data modules, and does not (currently) have options to look up etymology languages and families.
iterateAll
iterateAll()
- This function is expensive
Returns a table containing Language
objects for all languages, sorted by code.
This function searches through the whole database of languages, and is therefore relatively resource-intensive. It should be used sparingly.
Language objects
A Language
object is returned from one of the functions above. It is a Lua representation of a language and the data associated with it. It has a number of methods that can be called on it, using the :
syntax. For example:
local m_languages = require("Module:languages")
local lang = m_languages.getByCode("fr")
local name = lang:getCanonicalName()
-- "name" will now be "French"
Language:getCode
:getCode()
Returns the language code of the language. Example:
for French.
lua
Language:getCanonicalName
:getCanonicalName()
Returns the canonical name of the language. This is the name used to represent that language on Wiktionary, and is guaranteed to be unique to that language alone. Example:
for French.
lua
Language:getAllNames
:getAllNames()
Returns a table of all names that the language is known by, including the canonical name. The names are not guaranteed to be unique, sometimes more than one language is known by the same name. Example:
for French.
lua
Language:getType
:getType()
Returns the type of language, which can be
, lua
or lua
.
lua
Language:getWikimediaLanguages
:getWikimediaLanguages()
Returns a table containing WikimediaLanguage
objects (see Module:wikimedia languages), which represent languages and their codes as they are used in Wikimedia projects for interwiki linking and such. More than one object may be returned, as a single Wiktionary language may correspond to multiple Wikimedia languages. For example, Wiktionary's single code sh
(Serbo-Croatian) maps to four Wikimedia codes: sh
(Serbo-Croatian), bs
(Bosnian), hr
(Croatian) and sr
(Serbian).
The code for the Wikimedia language is retrieved from the wikimedia_codes
property in the data modules. If that property is not present, the code of the current language is used. If none of the available codes is actually a valid Wikimedia code, an empty table is returned.
Language:getWikipediaArticle
:getWikipediaArticle()
Returns the name of the Wikipedia article for the language. If the property wikipedia_article
is present in the data module it will be used first, otherwise a sitelink will be generated from :getWikidataItem
(if set). Otherwise :getCategoryName
is used as fallback.
Language:getWikidataItem
:getWikidataItem()
Returns the Wikidata item id for the language or nil
. This corresponds to the the second field in the data modules.
Language:getScripts
:getScripts()
Returns a table of Script
objects for all scripts that the language is written in. See Module:scripts.
Language:getScriptCodes
:getScriptCodes()
Returns the table of script codes in the language's data file.
Language:getFamily
:getFamily()
Returns a Family
object for the language family that the language belongs to. See Module:families.
Language:getAncestors
:getAncestors()
Returns a table of Language
objects for all languages that this language is directly descended from. Generally this is only a single language, but creoles, pidgins and mixed languages can have multiple ancestors.
Language:getCategoryName
:getCategoryName()
Returns the name of the main category of that language. Example:
for French, whose category is at Category:French language.
lua
Language:makeCategoryLink
:makeCategoryLink()
Creates a link to the category; the link text is the canonical name.
Language:makeEntryName
:makeEntryName(term)
Converts the given term into the form used in the names of entries. This removes diacritical marks from the term if they are not considered part of the normal written form of the language, and which therefore are not permitted in page names. It also removes certain punctuation characters like final question marks or periods which are never present in page names. Example for Latin:
→ lua
(macron is removed).
lua
The replacements made by this function are defined by the entry_name
setting for each language in the data modules.
Language:makeSortKey
:makeSortKey(entryName)
Creates a sort key for the given entry name, following the rules appropriate for the language. This removes diacritical marks from the entry name if they are not considered significant for sorting, and may perform some other changes. Any initial hyphen is also removed, and anything parentheses is removed as well.
The sort_key
setting for each language in the data modules defines the replacements made by this function, or it gives the name of the module that takes the entry name and returns a sortkey.
Language:transliterate
:transliterate(text, sc, module_override)
Transliterates the text from the given script into the Latin script (see Wiktionary:Transliteration and romanization). The language must have the translit_module
property for this to work; if it is not present,
is returned.
lua
The sc
parameter is handled by the transliteration module, and how it is handled is specific to that module. Some transliteration modules may tolerate
as the script, others require it to be one of the possible scripts that the module can transliterate, and will show an error if it's not one of them. For this reason, the lua
sc
parameter should always be provided when writing non-language-specific code.
The module_override
parameter is used to override the default module that is used to provide the transliteration. This is useful in cases where you need to demonstrate a particular module in use, but there is no default module yet, or you want to demonstrate an alternative version of a transliteration module before making it official. It should not be used in real modules or templates, only for testing. All uses of this parameter are tracked by Template:tracking/module_override.
Language:hasTranslit
:hasTranslit()
Returns
if the language has a transliteration module, lua
if it doesn't.
lua
Language:getRawData
:getRawData()
- This function is not for use in entries or other content pages.
Returns a blob of data about the language. The format of this blob is undocumented, and perhaps unstable; it's intended for things like the module's own unit-tests, which are "close friends" with the module and will be kept up-to-date as the format changes.
Error function
err(lang, param, text)
Looks at a supposed language code passed through a template parameter and returns a helpful error message depending on whether the language code has a valid form (two or three lowercase basic Latin letters, two or three groups of three lowercase basic Latin letters separated by hyphens).
Add the parameter value in argument #1 and the parameter name in argument #2. For instance, if parameter
of the template is supposed to be a language code, this function can be called the following way:
lua
local m_languages = require("Module:languages")
local lang = m_languages.getByCode(frame.args[1]) or m_languages.err(frame.args[1], 1)
If you would like the error message to say something other than "language code", place the phrase in argument #3.
See also
{{Module:etymology languages}}
{{Module:families}}
{{Module:languages/templates}}
{{Module:languages/JSON}}
local export = {}
--[=[ This function checks for things that could plausibly be a language code:
two or three lowercase letters, two or three groups of three lowercase
letters with hyphens between them. If such a pattern is not found,
it is likely the editor simply forgot to enter a language code. ]=]
function export.err(langCode, param, text, template_tag)
local ordinals = {
"first", "second", "third", "fourth", "fifth", "sixth",
"seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
"thirteenth", "fourteenth", "fifteenth", "sixteenth", "seventeenth",
"eighteenth", "nineteenth", "twentieth"
}
text = text or "language code"
if not template_tag then
template_tag = ""
else
if type(template_tag) ~= "string" then
template_tag = template_tag()
end
template_tag = " (Original template: " .. template_tag .. ")"
end
local paramType = type(param)
if paramType == "number" then
ordinal = ordinals[param]
param = ordinal .. ' parameter'
elseif paramType == "string" then
param = 'parameter "' .. param .. '"'
else
error("The parameter name is "
.. (paramType == "table" and "a table" or tostring(param))
.. ", but it should be a number or a string." .. template_tag, 2)
end
-- Can use string.find because language codes only contain ASCII.
if not langCode or langCode == "" then
error("The " .. param .. " (" .. text .. ") is missing." .. template_tag, 2)
elseif langCode:find("^%l%l%l?$")
or langCode:find("^%l%l%l%-%l%l%l$")
or langCode:find("^%l%l%l%-%l%l%l%-%l%l%l$") then
error("The " .. text .. " \"" .. langCode .. "\" is not valid." .. template_tag, 2)
else
error("Please enter a " .. text .. " in the " .. param .. "." .. template_tag, 2)
end
end
local function do_entry_name_or_sort_key_replacements(text, replacements)
if replacements.from then
for i, from in ipairs(replacements.from) do
local to = replacements.to[i] or ""
text = mw.ustring.gsub(text, from, to)
end
end
if replacements.remove_diacritics then
text = mw.ustring.toNFD(text)
text = mw.ustring.gsub(text,
'[' .. replacements.remove_diacritics .. ']',
'')
text = mw.ustring.toNFC(text)
end
return text
end
local Language = {}
function Language:getCode()
return self._code
end
function Language:getCanonicalName()
return self._rawData[1] or self._rawData.canonicalName
end
function Language:getOtherNames()
return self._rawData.otherNames or {}
end
function Language:getType()
return self._rawData.type or "regular"
end
function Language:getWikimediaLanguages()
if not self._wikimediaLanguageObjects then
local m_wikimedia_languages = require("Module:wikimedia languages")
self._wikimediaLanguageObjects = {}
local wikimedia_codes = self._rawData.wikimedia_codes or { self._code }
for _, wlangcode in ipairs(wikimedia_codes) do
table.insert(self._wikimediaLanguageObjects, m_wikimedia_languages.getByCode(wlangcode))
end
end
return self._wikimediaLanguageObjects
end
function Language:getWikipediaArticle()
if self._rawData.wikipedia_article then
return self._rawData.wikipedia_article
elseif self._wikipedia_article then
return self._wikipedia_article
elseif self:getWikidataItem() and mw.wikibase then
self._wikipedia_article = mw.wikibase.sitelink(self:getWikidataItem(), 'enwiki')
end
if not self._wikipedia_article then
self._wikipedia_article = mw.ustring.gsub(self:getCategoryName(), "Creole language", "Creole")
end
return self._wikipedia_article
end
function Language:makeWikipediaLink()
return "[[w:" .. self:getWikipediaArticle() .. "|" .. self:getCanonicalName() .. "]]"
end
function Language:getWikidataItem()
return self._rawData[2] or self._rawData.wikidata_item
end
function Language:getScripts()
if not self._scriptObjects then
local m_scripts = require("Module:scripts")
self._scriptObjects = {}
for _, sc in ipairs(self._rawData.scripts or { "None" }) do
table.insert(self._scriptObjects, m_scripts.getByCode(sc))
end
end
return self._scriptObjects
end
function Language:getScriptCodes()
return self._rawData.scripts or { "None" }
end
function Language:getFamily()
if self._familyObject then
return self._familyObject
end
local family = self._rawData[3] or self._rawData.family
if family then
self._familyObject = require("Module:families").getByCode(family)
end
return self._familyObject
end
function Language:getAncestors()
if not self._ancestorObjects then
self._ancestorObjects = {}
if self._rawData.ancestors then
for _, ancestor in ipairs(self._rawData.ancestors) do
table.insert(self._ancestorObjects, export.getByCode(ancestor) or require("Module:etymology languages").getByCode(ancestor))
end
else
local fam = self:getFamily()
local protoLang = fam and fam:getProtoLanguage() or nil
-- For the case where the current language is the proto-language
-- of its family, we need to step up a level higher right from the start.
if protoLang and protoLang:getCode() == self:getCode() then
fam = fam:getFamily()
protoLang = fam and fam:getProtoLanguage() or nil
end
while not protoLang and not (not fam or fam:getCode() == "qfa-not") do
fam = fam:getFamily()
protoLang = fam and fam:getProtoLanguage() or nil
end
table.insert(self._ancestorObjects, protoLang)
end
end
return self._ancestorObjects
end
local function iterateOverAncestorTree(node, func)
for _, ancestor in ipairs(node:getAncestors()) do
if ancestor then
local ret = func(ancestor) or iterateOverAncestorTree(ancestor, func)
if ret then
return ret
end
end
end
end
function Language:getAncestorChain()
if not self._ancestorChain then
self._ancestorChain = {}
local step = #self:getAncestors() == 1 and self:getAncestors()[1] or nil
while step do
table.insert(self._ancestorChain, 1, step)
step = #step:getAncestors() == 1 and step:getAncestors()[1] or nil
end
end
return self._ancestorChain
end
function Language:hasAncestor(otherlang)
local function compare(ancestor)
return ancestor:getCode() == otherlang:getCode()
end
return iterateOverAncestorTree(self, compare) or false
end
function Language:getCategoryName()
local name = self:getCanonicalName()
-- If the name already has "language" in it, don't add it.
if name:find("[Ll]anguage$") then
return name
else
return name .. " language"
end
end
function Language:makeCategoryLink()
return "[[:Category:" .. self:getCategoryName() .. "|" .. self:getCanonicalName() .. "]]"
end
function Language:getStandardCharacters()
return self._rawData.standardChars
end
function Language:makeEntryName(text)
text = mw.ustring.match(text, "^[¿¡]?(.-[^%s%p].-)%s*[؟?!;՛՜ ՞ ՟?!︖︕।॥။၊་།]?$") or text
if self:getCode() == "ar" then
local U = mw.ustring.char
local taTwiil = U(0x640)
local waSla = U(0x671)
-- diacritics ordinarily removed by entry_name replacements
local Arabic_diacritics = U(0x64B, 0x64C, 0x64D, 0x64E, 0x64F, 0x650, 0x651, 0x652, 0x670)
if text == waSla or mw.ustring.find(text, "^" .. taTwiil .. "?[" .. Arabic_diacritics .. "]" .. "$") then
return text
end
end
if type(self._rawData.entry_name) == "table" then
text = do_entry_name_or_sort_key_replacements(text, self._rawData.entry_name)
end
return text
end
-- Add to data tables?
local has_dotted_undotted_i = {
["az"] = true,
["crh"] = true,
["gag"] = true,
["kaa"] = true,
["tt"] = true,
["tr"] = true,
["zza"] = true,
}
function Language:makeSortKey(name, sc)
if has_dotted_undotted_i[self:getCode()] then
name = name:gsub("I", "ı")
end
name = mw.ustring.lower(name)
-- Remove initial hyphens and *
local hyphens_regex = "^[-־ـ*]+(.)"
name = mw.ustring.gsub(name, hyphens_regex, "%1")
-- If there are language-specific rules to generate the key, use those
if type(self._rawData.sort_key) == "table" then
name = do_entry_name_or_sort_key_replacements(name, self._rawData.sort_key)
elseif type(self._rawData.sort_key) == "string" then
name = require("Module:" .. self._rawData.sort_key).makeSortKey(name, self:getCode(), sc and sc:getCode())
end
-- Remove parentheses, as long as they are either preceded or followed by something
name = mw.ustring.gsub(name, "(.)[()]+", "%1")
name = mw.ustring.gsub(name, "[()]+(.)", "%1")
if has_dotted_undotted_i[self:getCode()] then
name = name:gsub("i", "İ")
end
return mw.ustring.upper(name)
end
function Language:overrideManualTranslit()
if self._rawData.override_translit then
return true
else
return false
end
end
function Language:transliterate(text, sc, module_override)
if not ((module_override or self._rawData.translit_module) and text) then
return nil
end
return require("Module:" .. (module_override or self._rawData.translit_module)).tr(text, self:getCode(), sc and sc:getCode() or nil)
end
function Language:hasTranslit()
return self._rawData.translit_module and true or false
end
function Language:link_tr()
return self._rawData.link_tr and true or false
end
function Language:toJSON()
local entryNamePatterns = nil
local entryNameRemoveDiacritics = nil
if self._rawData.entry_name then
entryNameRemoveDiacritics = self._rawData.entry_name.remove_diacritics
if self._rawData.entry_name.from then
entryNamePatterns = {}
for i, from in ipairs(self._rawData.entry_name.from) do
local to = self._rawData.entry_name.to[i] or ""
table.insert(entryNamePatterns, { from = from, to = to })
end
end
end
local ret = {
ancestors = self._rawData.ancestors,
canonicalName = self:getCanonicalName(),
categoryName = self:getCategoryName(),
code = self._code,
entryNamePatterns = entryNamePatterns,
entryNameRemoveDiacritics = entryNameRemoveDiacritics,
family = self._rawData[3] or self._rawData.family,
otherNames = self:getOtherNames(),
scripts = self._rawData.scripts,
type = self:getType(),
wikimediaLanguages = self._rawData.wikimedia_codes,
wikidataItem = self:getWikidataItem(),
}
return require("Module:JSON").toJSON(ret)
end
-- Do NOT use this method!
-- All uses should be pre-approved on the talk page!
function Language:getRawData()
return self._rawData
end
Language.__index = Language
function export.getDataModuleName(code)
if code:find("^%l%l$") then
return "languages/data"
elseif code:find("^%l%l%l$") then
local prefix = code:sub(1, 1)
return "languages/data"
elseif code:find("^[%l-]+$") then
return "languages/data"
else
return nil
end
end
local function getRawLanguageData(code)
local modulename = export.getDataModuleName(code)
return modulename and mw.loadData("Module:" .. modulename)[code] or nil
end
function export.makeObject(code, data)
return data and setmetatable({ _rawData = data, _code = code }, Language) or nil
end
function export.getByCode(code, paramForError, allowEtymLang, allowFamily)
if type(code) ~= "string" then
error("The function getByCode expects a string as its first argument, but received " .. (code == nil and "nil" or "a " .. type(code)) .. ".")
end
local retval = export.makeObject(code, getRawLanguageData(code))
if not retval and allowEtymLang then
retval = require("Module:etymology languages").getByCode(code)
end
if not retval and allowFamily then
retval = require("Module:families").getByCode(code)
end
if not retval and paramForError then
local codetext = nil
if allowEtymLang and allowFamily then
codetext = "language, etymology language or family code"
elseif allowEtymLang then
codetext = "language or etymology language code"
elseif allowFamily then
codetext = "language or family code"
else
codetext = "language code"
end
if paramForError == true then
error("The " .. codetext .. " \"" .. code .. "\" is not valid.")
else
export.err(code, paramForError, codetext)
end
end
return retval
end
function export.getByName(name, errorIfInvalid)
local byName = mw.loadData("Module:languages/by name")
local code = byName.all and byName.all[name] or byName[name]
if not code then
if errorIfInvalid then
error("The language name \"" .. name .. "\" is not valid.")
else
return nil
end
end
return export.makeObject(code, getRawLanguageData(code))
end
function export.getByCanonicalName(name, errorIfInvalid, allowEtymLang, allowFamily)
local byName = mw.loadData("Module:languages/canonical names")
local code = byName and byName[name]
local retval = code and export.makeObject(code, getRawLanguageData(code)) or nil
if not retval and allowEtymLang then
retval = require("Module:etymology languages").getByCanonicalName(code)
end
if not retval and allowFamily then
retval = require("Module:families").getByCanonicalName(code)
end
if not retval and errorIfInvalid then
local text
if allowEtymLang and allowFamily then
text = "language, etymology language or family name"
elseif allowEtymLang then
text = "language or etymology language name"
elseif allowFamily then
text = "language or family name"
else
text = "language name"
end
error("The " .. text .. " \"" .. name .. "\" is not valid.")
end
return retval
end
function export.iterateAll()
mw.incrementExpensiveFunctionCount()
local m_data = mw.loadData("Module:languages/alldata")
local func, t, var = pairs(m_data)
return function()
local code, data = func(t, var)
return export.makeObject(code, data)
end
end
--[[ If language is an etymology language, iterates through parent languages
until it finds a non-etymology language. ]]
function export.getNonEtymological(lang)
while lang:getType() == "etymology language" do
local parentCode = lang:getParentCode()
lang = export.getByCode(parentCode)
or require("Module:etymology languages").getByCode(parentCode)
or require("Module:families").getByCode(parentCode)
end
return lang
end
return export