Documentation
Classes
Article class
class johnny5.article(I, Itype=None, slow_connection=False)
This is the main class for this module; all other article classes derive from it. The class can be initialized with an en_curid, a title, or a wikidata_id.
Parameters: I : str or int
Either the English curid, the Wikidata id, or the title of the English Wikipedia page.
Itype : str (optional)
Type of I: either 'title', 'curid', or 'wdid'.
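For illustration, a hypothetical helper (not part of johnny5, which takes an explicit Itype argument) showing how the three identifier types described above can be told apart:

```python
def guess_itype(I):
    """Guess whether I is a curid, a Wikidata id, or a title.

    Hypothetical helper for illustration only; johnny5 itself takes an
    explicit Itype argument ('title', 'curid', or 'wdid').
    """
    s = str(I)
    if s.isdigit():                       # English curids are integers, e.g. 736
        return 'curid'
    if s[:1] == 'Q' and s[1:].isdigit():  # Wikidata ids look like 'Q937'
        return 'wdid'
    return 'title'                        # anything else is treated as a title

print(guess_itype(736))                # curid
print(guess_itype('Q937'))             # wdid
print(guess_itype('Albert Einstein'))  # title
```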
L()
Returns the number of language editions of the article.
Returns: L : int
Number of Wikipedia language editions this article exists in.
content(lang='en')
Returns the content of the Wikipedia page in the selected language. The output is in Wikipedia markup.
Parameters: lang : str (default='en')
Language edition to get the content for.
Returns: content : str
Content of the page in the given language, in WikiMarkup.
creation_date(lang=None)
Gets the creation date of the different Wikipedia language editions. The Wikipedia API requires this data to be requested one page at a time, so there is no speed-up from collecting pages into a list.
Parameters: lang : str (optional)
Language to get the creation date for.
Returns: timestamp : str or dict
Timestamp in the format '2002-07-26T04:32:17Z'. If lang is not provided, it returns a dictionary with languages as keys and timestamps as values.
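The returned timestamps follow the ISO 8601 'Z' convention, so they can be parsed with the standard library:

```python
from datetime import datetime

# Parse a creation-date timestamp in the format documented above
ts = '2002-07-26T04:32:17Z'
created = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ')
print(created.year)  # 2002
```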
curid_nonen()
Gets the curid in a non-English language. The curid is a string of the form 'lang.curid'.
extract(lang='en')
Returns the page extract (brief description).
Parameters: lang : str (default='en')
Language edition to get the extract from.
Returns: extract : str
Wikipedia page extract.
find_article()
Find the article by trying different combinations of the title's capitalization.
image_url()
Gets the URL of the image that appears in the infobox. It iterates over a list of languages, ordered by Wikipedia size, until it finds one.
Returns: img_url : str
Full URL for the image.
infobox(lang='en', force=False)
Returns the infobox of the article.
Parameters: lang : str (default='en')
Language edition to get the infobox from.
force : boolean (False)
If True, it will 'force' the search for the infobox by returning the template most similar to an infobox. Recommended only for non-English editions.
langlinks(lang=None)
Returns the langlinks of the article.
Parameters: lang : str (optional)
Language to get the link for.
Returns: out : str or dict
If a language is provided, it returns the title of the page in that language. If no language is provided, it returns a dictionary with languages as keys and titles as values.
pageviews(start_date, end_date=None, lang='en', cdate_override=False, daily=False, get_previous=True)
Gets the pageviews between the provided dates for the given language edition. Unless specified otherwise, this function checks whether the English page had any other titles, and gets the pageviews for those as well.
Parameters: start_date : str
Start date in format 'yyyy-mm'. If start_date=None is passed, it gets all the pageviews for that edition.
end_date : str (optional)
End date in format 'yyyy-mm'. If not provided, it gets pageviews until today.
lang : str ('en')
Language edition to get the pageviews for. If lang=None is passed, it gets the pageviews for all language editions.
cdate_override : boolean (False)
If True, it gets pageviews from before the creation date.
daily : boolean (False)
If True, it returns daily pageviews.
get_previous : boolean (True)
If True, it searches for all previous titles of the page and gets the pageviews for them as well. Only works for English.
Returns: views : pandas.DataFrame
Table with columns year,month,(day),views.
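A sketch of how the returned table can be summarized with pandas. The DataFrame below is made-up data in the documented year/month/views layout, not real pageview counts:

```python
import pandas as pd

# Made-up data in the documented column layout (monthly granularity)
views = pd.DataFrame({
    'year':  [2015, 2015, 2016, 2016],
    'month': [11, 12, 1, 2],
    'views': [1200, 900, 1500, 1100],
})

# Total pageviews per year
yearly = views.groupby('year')['views'].sum()
print(yearly[2015])  # 2100
print(yearly[2016])  # 2600
```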
previous_titles()
Gets all the previous titles the page had. Currently only works for English.
Returns: titles : set
Collection of previous titles
revisions(user=True)
Gets the timestamps of the edit history of the Wikipedia article.
Parameters: user : boolean (True)
If True, it returns the user who made each edit as well as the edit timestamp.
section(section_title)
Returns the content inside the given section of the English Wikipedia page.
Parameters: section_title : str
Title of the section.
Returns: content : str
Content of the section in WikiMarkup
tables(i=None)
Gets the tables in the page.
Parameters: i : int (optional)
Position of the table to get. If not provided, it returns a list of tables.
Returns: tables : list or pandas.DataFrame
The parsed tables found in the page.
Article sub-classes
johnny5.biography(I[, Itype])
Class for biographies of real people.
johnny5.place(I[, Itype])
Places (includes methods to get coordinates).
johnny5.band(I[, Itype])
Class for music bands.
johnny5.song(I[, Itype])
Class for songs.
Other Classes
class johnny5.CTY(city_data='geonames')
City classifier, used to classify coordinates into cities. (This class still needs to be updated.)
class johnny5.Occ
Occupation classifier based on Wikipedia and Wikidata information.
Examples
>>> C = johnny5.Occ()
>>> b = johnny5.biography('Q937')
>>> C.classify(b)
classify(article, return_all=False, override_train=False)
Classifier function.
Parameters: article : johnny5.biography
Biography to classify.
return_all : boolean (False)
If True, it returns the probabilities for all occupations as a list of 2-tuples.
override_train : boolean (False)
If True, it runs the classifier even if the given biography belongs to the training set.
Returns: label : str
Most likely occupation.
prob_ratio : float
Ratio between the probabilities of the most likely and second most likely occupations. If the biography belongs to the training set, it returns prob_ratio=0.
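For illustration, a sketch of how a probability ratio can be derived from the list of (label, probability) 2-tuples that return_all=True is documented to produce. The occupation labels and probabilities below are made up:

```python
def prob_ratio(probs):
    """Ratio between the two highest occupation probabilities.

    probs: list of (label, probability) 2-tuples, as described for
    return_all=True. Hypothetical helper for illustration only.
    """
    ranked = sorted(probs, key=lambda lp: lp[1], reverse=True)
    (top_label, p1), (_, p2) = ranked[0], ranked[1]
    return top_label, p1 / p2

label, ratio = prob_ratio([('physicist', 0.5), ('politician', 0.25), ('actor', 0.25)])
print(label, ratio)  # physicist 2.0
```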
Functions
Bulk download functions
johnny5.dumps_path(new_path=None)
Handles the path to the Wikipedia and Wikidata dumps. If new_path is provided, it sets the new path.
Parameters: new_path : str (optional)
Path where the Wikipedia and Wikidata dumps are stored (must be a full path). If provided, the dumps path is set to this path.
johnny5.check_wddump()
Checks whether the Wikidata dump found on file is up to date.
Returns: status : boolean
True if an update is necessary.
johnny5.download_latest()
Downloads the latest Wikidata RDF dump. If the dump is updated, it deletes all the instances files.
johnny5.wd_instances(cl)
Gets all the instances of the given class.
Returns: instances : set
wd_id for every instance of the given class.
Examples
To get all universities:
>>> wd_instances('Q3918')
To get all humans:
>>> wd_instances('Q5')
johnny5.wd_subclasses(cl)
Gets all the subclasses of the given class.
Returns: subclasses : set
wd_id for every subclass of the given class.
Examples
To get all subclasses of musical ensemble:
>>> wd_subclasses('Q2088357')
johnny5.check_wpdump()
Checks the current status of the Wikipedia dump. It returns None, but prints the information.
Query functions
johnny5.wp_q(d, lang='en', continue_override=False, show=False)
Queries the Wikipedia API given a dictionary of query parameters. It handles the page and result limits by performing multiple queries and merging the resulting JSON objects.
Parameters: d : dict
Dictionary of API query parameters.
lang : str (default='en')
Language edition to query.
continue_override : boolean (False)
If True, it will not follow any of the continuation queries.
show : boolean (False)
If True, it prints all the URLs used.
Returns: r : dict
Dictionary with the result of the query.
Examples
>>> wp_q({'pageids':[306,207]})
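For illustration, a self-contained sketch of how such a parameter dictionary typically maps onto a MediaWiki API URL. This is a hypothetical helper; the exact URL johnny5 builds may differ:

```python
from urllib.parse import urlencode

def build_wp_url(d, lang='en'):
    """Sketch of mapping a parameter dict to a MediaWiki API query URL.

    Hypothetical helper for illustration; list values are joined with '|'
    as the MediaWiki API expects.
    """
    params = {'action': 'query', 'format': 'json'}
    for key, value in d.items():
        if isinstance(value, (list, tuple)):
            value = '|'.join(str(v) for v in value)
        params[key] = value
    return 'https://%s.wikipedia.org/w/api.php?%s' % (lang, urlencode(params))

url = build_wp_url({'pageids': [306, 207]})
print(url)
```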
johnny5.wd_q(d, show=False)
Queries the Wikidata API given a dictionary of query parameters. It handles the page and result limits by performing multiple queries and merging the resulting JSON objects.
Parameters: d : dict
Dictionary of API query parameters.
show : boolean (False)
If True, it prints all the URLs used.
Returns: r : dict
Dictionary with the result of the query.
Examples
>>> r = j5.wp_q({'prop':'extracts','exintro':'','explaintext':'','pageids':736})
>>> print(list(r['query']['pages'].values())[0]['extract'])
Other functions
johnny5.country(coords, path='', save=True, GAPI_KEY=None)
Uses the Google geocoding API to get the country of a given geographical point.
Parameters: coords : (lat, lon) tuple
Coordinates to query.
path : string
Path where the JSON file containing the query result is saved.
save : boolean (True)
If True, it saves the result of the query as 'lat,lon.json'.
GAPI_KEY : string (None)
Name of the environment variable holding the Google API key.
Returns: country : (name, code) tuple
Country name and 2-letter country code.
johnny5.chunker(seq, size)
Iterates over a list in chunks.
Parameters: seq : list (or iterable)
List or iterable to iterate over.
size : int
Size of each chunk.
Returns: chunks : list
List of lists (chunks)
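A minimal sketch of the chunking behavior described above (an illustration, not johnny5's actual implementation):

```python
def chunker(seq, size):
    """Split seq into consecutive chunks of at most `size` items.

    Illustration of the documented behavior, not johnny5's own code.
    """
    return [seq[i:i + size] for i in range(0, len(seq), size)]

print(chunker([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```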
johnny5.correct_titles(title)
Checks the capitalization of the given title and returns a set of possible variants of the title.
johnny5.get_links(text)
Gets the links from the provided WikiMarkup text. Links are between double square brackets: [[...]].
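A self-contained sketch of extracting such links with a regular expression. This is an illustration only; johnny5's parser may behave differently, for instance with respect to nested templates:

```python
import re

def get_links(text):
    """Extract link targets from WikiMarkup.

    Illustration only. Targets appear as [[target]] or [[target|label]];
    we keep just the part before any '|'.
    """
    return [m.split('|')[0] for m in re.findall(r'\[\[([^\]]+)\]\]', text)]

markup = "Born in [[Ulm]], he later moved to [[Bern|the Swiss capital]]."
print(get_links(markup))  # ['Ulm', 'Bern']
```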