Using Wikipedia’s API to find inconsistently hyphenated French names
The other day I noticed that a lot of English Wikipedia articles about French people for some reason use a space between parts of the given name where primary sources and/or the French Wikipedia use a hyphen; for example, the English Wikipedia (as of May 2024) has “Marie Thérèse Geoffrin” where both the Bibliothèque nationale and Encyclopaedia Britannica have “Marie-Thérèse Geoffrin.”
I wrote a Python script that uses the MediaWiki API to enumerate the articles with this inconsistency and generate requested moves for each page where the French article’s title begins with e.g. “Marie-Thérèse” and the English article’s title begins with “Marie Thérèse.” Wikipedia’s API documentation is a little scattered, but really great, as documentation goes. It includes lots of runnable examples. My script ended up using:
- Wikipedia’s prefixsearch API
- Wikidata’s wbgetentities API, thanks to this StackOverflow answer
- the common pagination API
The first of these lets us query for all French Wikipedia articles whose titles begin with e.g. "Jean-":
    import requests

    def frtitles_for_name(name):
        S = requests.Session()
        PARAMS = {
            "action": "query",
            "format": "json",
            "list": "allpages",
            "aplimit": "500",
            "apprefix": (name + '-'),
            "apfilterredir": "nonredirects",
        }
        while True:
            R = S.get(url="https://fr.wikipedia.org/w/api.php", params=PARAMS)
            DATA = R.json()
            for r in DATA["query"]["allpages"]:
                yield r["title"]
            # No "continue" key means this was the last page of results.
            if "continue" not in DATA:
                break
            # Otherwise, pass the continuation tokens along with the next request.
            PARAMS.update(DATA["continue"])
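The continuation loop in that function can be pulled out and exercised without touching the network. The following is only a sketch: `paginate` and `fake_fetch` are hypothetical names that are not part of the script, and the two-page response is made-up data shaped like the `continue` block the API returns.

```python
def paginate(fetch, params):
    """Yield each response page, following MediaWiki-style "continue" tokens.

    `fetch` is a hypothetical callable that takes a params dict and
    returns the decoded JSON response as a dict.
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    while True:
        data = fetch(params)
        yield data
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request's parameters.
        params.update(data["continue"])


def fake_fetch(params):
    """Stand-in for the real endpoint: serves two pages of made-up data."""
    if "apcontinue" not in params:
        return {
            "query": {"allpages": [{"title": "Jean-Baptiste Lully"}]},
            "continue": {"apcontinue": "Jean-Luc", "continue": "-||"},
        }
    return {"query": {"allpages": [{"title": "Jean-Luc Godard"}]}}


titles = [
    page["title"]
    for data in paginate(fake_fetch, {"list": "allpages", "apprefix": "Jean-"})
    for page in data["query"]["allpages"]
]
print(titles)
```

Swapping `fake_fetch` for a function that wraps `requests.Session().get(...).json()` would give roughly the same behavior as the loop above.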
The second lets us find the corresponding English Wikipedia article and check whether its title begins with "Jean-" or with "Jean ":
    def entitles_for_frtitle(frtitle):
        S = requests.Session()
        R = S.get(url="https://www.wikidata.org/w/api.php", params={
            "action": "wbgetentities",
            "sites": "frwiki",
            "titles": frtitle,
            "props": "sitelinks",
            "format": "json",
        })
        DATA = R.json()
        for r in DATA["entities"].values():
            entitle = r.get("sitelinks", {}).get("enwiki", {}).get("title", None)
            if entitle is not None:
                yield entitle
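To see why the chained `.get` calls matter, here is a small offline sketch of the same extraction. The helper name `enwiki_titles` is hypothetical, and the three sample responses are hand-written to approximate the shape of a `wbgetentities` reply (the entity ID is a placeholder), so treat them as illustrative rather than as real API output.

```python
def enwiki_titles(data):
    """Yield the enwiki sitelink title of each entity in a decoded response."""
    for entity in data.get("entities", {}).values():
        # Each level may be absent, so fall back to an empty dict at every step.
        title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
        if title is not None:
            yield title


# Illustrative sample: an item linked from both frwiki and enwiki.
found = {"entities": {"Q_EXAMPLE": {"sitelinks": {
    "frwiki": {"site": "frwiki", "title": "Marie-Thérèse Geoffrin"},
    "enwiki": {"site": "enwiki", "title": "Marie Thérèse Geoffrin"},
}}}}

# Illustrative sample: an item that has no English article.
no_enwiki = {"entities": {"Q_EXAMPLE": {"sitelinks": {
    "frwiki": {"site": "frwiki", "title": "Jean-Exemple Dupont"},
}}}}

# Illustrative sample: a title with no Wikidata item (no "sitelinks" key at all).
missing = {"entities": {"-1": {
    "site": "frwiki", "title": "Jean-Exemple Introuvable", "missing": "",
}}}

for sample in (found, no_enwiki, missing):
    print(list(enwiki_titles(sample)))
```

Only the first sample yields a title; the other two fall through every `.get` and yield nothing, which is the "zero results" case.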
I used yield to make both functions into generators, even though we expect entitles_for_frtitle to return only a single result (or zero results), simply because that lets me write a nice clean main function like this:
    for name in ['Anne', 'Claude', 'Guy', 'Jean']:
        for frtitle in frtitles_for_name(name):
            for entitle in entitles_for_frtitle(frtitle):
                # First word of the French title, e.g. "Jean-Baptiste"
                hyphenated_name = frtitle.split()[0].strip(',')
                spaced_name = hyphenated_name.replace('-', ' ')
                if entitle.startswith(spaced_name):
                    # Same length as spaced_name, so the tail of the title lines up.
                    new_entitle = hyphenated_name + entitle[len(hyphenated_name):]
                    print('* [[:%s]] → [[:%s]]' % (entitle, new_entitle))
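The string handling in the innermost loop can be tried on its own. This sketch wraps the same steps in a hypothetical helper, `propose_move`, which is not part of the script; the titles passed to it are just examples.

```python
def propose_move(frtitle, entitle):
    """Return the hyphenated English title to propose, or None if no move applies."""
    # First word of the French title, minus any trailing comma.
    hyphenated_name = frtitle.split()[0].strip(',')
    spaced_name = hyphenated_name.replace('-', ' ')
    if not entitle.startswith(spaced_name):
        return None
    # The two spellings have equal length, so slicing keeps the rest of the title.
    return hyphenated_name + entitle[len(hyphenated_name):]


print(propose_move("Marie-Thérèse Geoffrin", "Marie Thérèse Geoffrin"))
print(propose_move("Jean-Luc Godard", "Jean-Luc Godard"))
```

The first call returns the hyphenated title; the second returns None, because the English title already uses the hyphen and so does not start with "Jean Luc".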
This does miss a few cases, e.g. when the English title is missing not only the hyphen but also an accent (e.g. “Henri Evrard” for “Henri-Évrard”). And it has false positives for cases that aren’t personal names, for example the name of the 1956 film Marie Antoinette Queen of France (French: Marie-Antoinette reine de France). But it’s pretty darn good as a first pass.
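One possible way to catch the missing-accent case, which the script does not do, is to compare the two names with accents folded away. The sketch below uses the standard library's `unicodedata`; `strip_accents` and `startswith_ignoring_accents` are hypothetical helpers, and the approach assumes both titles are in composed (NFC) form so that character counts still line up when the rest of the title is sliced off.

```python
import unicodedata


def strip_accents(s):
    """Drop combining marks after canonical decomposition (É becomes E)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def startswith_ignoring_accents(entitle, spaced_name):
    """Like entitle.startswith(spaced_name), but insensitive to accents."""
    return strip_accents(entitle).startswith(strip_accents(spaced_name))


print(startswith_ignoring_accents("Henri Evrard, Marquis", "Henri Évrard"))
print(startswith_ignoring_accents("Henri-Evrard, Marquis", "Henri Évrard"))
```

This only folds characters that decompose into a base letter plus combining marks; ligatures such as "œ" are left as they are, so it would widen the net rather than close every gap.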
See the complete Python script here, and look for those Wikipedia pages to get moved at some point in the near future.