Documentation
Classes
Article class
class johnny5.article(I, Itype=None, slow_connection=False)
This is the main class for this module; all other article classes derive from it. The class can be initialized with an en_curid, a title, or a wikidata_id.
Parameters: I : str or int
Either the English curid, the Wikidata id, or the title of the English Wikipedia page.
Itype : str (optional)
Type of I: either 'title', 'curid', or 'wdid'.
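For illustration, a hypothetical helper (not part of johnny5, which takes an explicit Itype argument) showing how the three identifier types described above can be told apart:

```python
def guess_itype(I):
    """Guess whether I is a curid, a Wikidata id, or a title.

    Hypothetical helper for illustration only; johnny5 itself takes an
    explicit Itype argument ('title', 'curid', or 'wdid').
    """
    s = str(I)
    if s.isdigit():                       # English curids are integers, e.g. 736
        return 'curid'
    if s[:1] == 'Q' and s[1:].isdigit():  # Wikidata ids look like 'Q937'
        return 'wdid'
    return 'title'                        # anything else is treated as a title

print(guess_itype(736))                # curid
print(guess_itype('Q937'))             # wdid
print(guess_itype('Albert Einstein'))  # title
```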
L()
Returns the number of language editions of the article.
Returns: L : int
Number of Wikipedia language editions this article exists in.
content(lang='en')
Returns the content of the Wikipedia page in the selected language. The output is in Wikipedia markup.
Parameters: lang : str (default='en')
Language edition to get the content for.
Returns: content : str
Content of the page in the given language, in WikiMarkup.
creation_date(lang=None)
Gets the creation date of the different Wikipedia language editions. The Wikipedia API requires this data to be requested one page at a time, so there is no speed-up from collecting pages into a list.
Parameters: lang : str (optional)
Language to get the creation date for.
Returns: timestamp : str or dict
Timestamp in the format '2002-07-26T04:32:17Z'. If lang is not provided, it returns a dictionary with languages as keys and timestamps as values.
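The returned timestamps follow the ISO 8601 'Z' convention, so they can be parsed with the standard library:

```python
from datetime import datetime

# Parse a creation-date timestamp in the format documented above
ts = '2002-07-26T04:32:17Z'
created = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ')
print(created.year)  # 2002
```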
curid_nonen()
Gets the curid in a non-English language. The curid is a string of the form 'lang.curid'.
extract(lang='en')
Returns the page extract (brief description).
Parameters: lang : str (default='en')
Language edition to get the extract from.
Returns: extract : str
Wikipedia page extract.
find_article()
Find the article by trying different combinations of the title's capitalization.
image_url()
Gets the URL of the image that appears in the infobox. It iterates over a list of languages, ordered by Wikipedia size, until it finds one.
Returns: img_url : str
Full URL for the image.
infobox(lang='en', force=False)
Returns the infobox of the article.
Parameters: lang : str (default='en')
Language edition to get the infobox from.
force : boolean (False)
If True, it will 'force' the search for the infobox by returning the template most similar to an infobox. Recommended only for non-English editions.
langlinks(lang=None)
Returns the langlinks of the article.
Parameters: lang : str (optional)
Language to get the link for.
Returns: out : str or dict
If a language is provided, it returns the title of the page in that language. If no language is provided, it returns a dictionary with languages as keys and titles as values.
pageviews(start_date, end_date=None, lang='en', cdate_override=False, daily=False, get_previous=True)
Gets the pageviews between the provided dates for the given language edition. Unless specified otherwise, this function checks whether the English page had any other titles, and gets the pageviews for those as well.
Parameters: start_date : str
Start date in format 'yyyy-mm'. If start_date=None is passed, it gets all the pageviews for that edition.
end_date : str (optional)
End date in format 'yyyy-mm'. If not provided, it gets pageviews until today.
lang : str ('en')
Language edition to get the pageviews for. If lang=None is passed, it gets the pageviews for all language editions.
cdate_override : boolean (False)
If True, it gets pageviews from before the creation date.
daily : boolean (False)
If True, it returns daily pageviews.
get_previous : boolean (True)
If True, it searches for all previous titles of the page and gets the pageviews for them as well. Only works for English.
Returns: views : pandas.DataFrame
Table with columns year,month,(day),views.
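A sketch of how the returned table can be summarized with pandas. The DataFrame below is made-up data in the documented year/month/views layout, not real pageview counts:

```python
import pandas as pd

# Made-up data in the documented column layout (monthly granularity)
views = pd.DataFrame({
    'year':  [2015, 2015, 2016, 2016],
    'month': [11, 12, 1, 2],
    'views': [1200, 900, 1500, 1100],
})

# Total pageviews per year
yearly = views.groupby('year')['views'].sum()
print(yearly[2015])  # 2100
print(yearly[2016])  # 2600
```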
previous_titles()
Gets all the previous titles the page had. Currently only works for English.
Returns: titles : set
Collection of previous titles
revisions(user=True)
Gets the timestamps of the edit history of the Wikipedia article.
Parameters: user : boolean (True)
If True, it returns the user who made each edit as well as the edit timestamp.
section(section_title)
Returns the content inside the given section of the English Wikipedia page.
Parameters: section_title : str
Title of the section.
Returns: content : str
Content of the section in WikiMarkup
tables(i=None)
Gets the tables in the page.
Parameters: i : int (optional)
Position of the table to get. If not provided, it returns a list of tables.
Returns: tables : list or pandas.DataFrame
The parsed tables found in the page.
Article sub-classes
johnny5.biography(I[, Itype])
Class for biographies of real people.
johnny5.place(I[, Itype])
Places (includes methods to get coordinates).
johnny5.band(I[, Itype])
Class for music bands.
johnny5.song(I[, Itype])
Class for songs.
Other Classes
class johnny5.CTY(city_data='geonames')
City classifier, used to classify coordinates into cities. (This class still needs to be updated.)
class johnny5.Occ
Occupation classifier based on Wikipedia and Wikidata information.
Examples
>>> C = johnny5.Occ()
>>> b = johnny5.biography('Q937')
>>> C.classify(b)
classify(article, return_all=False, override_train=False)
Classifier function.
Parameters: article : johnny5.biography
Biography to classify.
return_all : boolean (False)
If True, it returns the probabilities for all occupations as a list of 2-tuples.
override_train : boolean (False)
If True, it runs the classifier even if the given biography belongs to the training set.
Returns: label : str
Most likely occupation.
prob_ratio : float
Ratio between the probabilities of the most likely and second most likely occupations. If the biography belongs to the training set, it returns prob_ratio=0.
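For illustration, a sketch of how a probability ratio can be derived from the list of (label, probability) 2-tuples that return_all=True is documented to produce. The occupation labels and probabilities below are made up:

```python
def prob_ratio(probs):
    """Ratio between the two highest occupation probabilities.

    probs: list of (label, probability) 2-tuples, as described for
    return_all=True. Hypothetical helper for illustration only.
    """
    ranked = sorted(probs, key=lambda lp: lp[1], reverse=True)
    (top_label, p1), (_, p2) = ranked[0], ranked[1]
    return top_label, p1 / p2

label, ratio = prob_ratio([('physicist', 0.5), ('politician', 0.25), ('actor', 0.25)])
print(label, ratio)  # physicist 2.0
```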
Functions
Bulk download functions
johnny5.dumps_path(new_path=None)
Handles the path to the Wikipedia and Wikidata dumps. If new_path is provided, it sets the new path.
Parameters: new_path : str (optional)
Path where the Wikipedia and Wikidata dumps are stored (must be a full path). If provided, the dumps path is set to this path.
johnny5.check_wddump()
Checks whether the Wikidata dump found on file is up to date.
Returns: status : boolean
True if an update is necessary.
johnny5.download_latest()
Downloads the latest Wikidata RDF dump. If the dump is updated, it deletes all the instances files.
johnny5.wd_instances(cl)
Gets all the instances of the given class.
Returns: instances : set
wd_id for every instance of the given class.
Examples
To get all universities:
>>> wd_instances('Q3918')
To get all humans:
>>> wd_instances('Q5')
johnny5.wd_subclasses(cl)
Gets all the subclasses of the given class.
Returns: subclasses : set
wd_id for every subclass of the given class.
Examples
To get all subclasses of musical ensemble:
>>> wd_subclasses('Q2088357')
johnny5.check_wpdump()
Checks the current status of the Wikipedia dump. It returns None, but prints the information.
Query functions
johnny5.wp_q(d, lang='en', continue_override=False, show=False)
Queries the Wikipedia API given a dictionary of query parameters. It handles the page and result limits by performing multiple queries and merging the resulting JSON objects.
Parameters: d : dict
Dictionary of API query parameters.
lang : str (default='en')
Language edition to query.
continue_override : boolean (False)
If True, it will not follow any of the continuation queries.
show : boolean (False)
If True, it prints all the URLs used.
Returns: r : dict
Dictionary with the result of the query.
Examples
>>> wp_q({'pageids':[306,207]})
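For illustration, a self-contained sketch of how such a parameter dictionary typically maps onto a MediaWiki API URL. This is a hypothetical helper; the exact URL johnny5 builds may differ:

```python
from urllib.parse import urlencode

def build_wp_url(d, lang='en'):
    """Sketch of mapping a parameter dict to a MediaWiki API query URL.

    Hypothetical helper for illustration; list values are joined with '|'
    as the MediaWiki API expects.
    """
    params = {'action': 'query', 'format': 'json'}
    for key, value in d.items():
        if isinstance(value, (list, tuple)):
            value = '|'.join(str(v) for v in value)
        params[key] = value
    return 'https://%s.wikipedia.org/w/api.php?%s' % (lang, urlencode(params))

url = build_wp_url({'pageids': [306, 207]})
print(url)
```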
johnny5.wd_q(d, show=False)
Queries the Wikidata API given a dictionary of query parameters. It handles the page and result limits by performing multiple queries and merging the resulting JSON objects.
Parameters: d : dict
Dictionary of API query parameters.
show : boolean (False)
If True, it prints all the URLs used.
Returns: r : dict
Dictionary with the result of the query.
Examples
>>> r = j5.wp_q({'prop':'extracts','exintro':'','explaintext':'','pageids':736})
>>> print(list(r['query']['pages'].values())[0]['extract'])
Other functions
johnny5.country(coords, path='', save=True, GAPI_KEY=None)
Uses the Google geocoding API to get the country of a given geographical point.
Parameters: coords : (lat, lon) tuple
Coordinates to query.
path : string
Path where the JSON file containing the query result is saved.
save : boolean (True)
If True, it saves the result of the query as 'lat,lon.json'.
GAPI_KEY : string (None)
Name of the environment variable holding the Google API key.
Returns: country : (name, code) tuple
Country name and 2-letter country code.
johnny5.chunker(seq, size)
Iterates over a list in chunks.
Parameters: seq : list (or iterable)
List or iterable to iterate over.
size : int
Size of each chunk.
Returns: chunks : list
List of lists (chunks)
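A minimal sketch of the chunking behavior described above (an illustration, not johnny5's actual implementation):

```python
def chunker(seq, size):
    """Split seq into consecutive chunks of at most `size` items.

    Illustration of the documented behavior, not johnny5's own code.
    """
    return [seq[i:i + size] for i in range(0, len(seq), size)]

print(chunker([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```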
johnny5.correct_titles(title)
Checks the capitalization of the given title and returns a set of possible variants of the title.
johnny5.get_links(text)
Gets the links from the provided WikiMarkup text. Links are between double square brackets: [[...]].
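A self-contained sketch of extracting such links with a regular expression. This is an illustration only; johnny5's parser may behave differently, for instance with respect to nested templates:

```python
import re

def get_links(text):
    """Extract link targets from WikiMarkup.

    Illustration only. Targets appear as [[target]] or [[target|label]];
    we keep just the part before any '|'.
    """
    return [m.split('|')[0] for m in re.findall(r'\[\[([^\]]+)\]\]', text)]

markup = "Born in [[Ulm]], he later moved to [[Bern|the Swiss capital]]."
print(get_links(markup))  # ['Ulm', 'Bern']
```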