Usage
To use Data Portal Explorer in a project:
import data_portal_explorer
To use Data Portal Explorer from the command line:
$ data_portal_explorer
Usage: data_portal_explorer [OPTIONS] CONFIG DEST COMMAND1 [ARGS]... [COMMAND2
                            [ARGS]...]...

  Console script for data_portal_explorer.

Options:
  --format [csv|json]  [default: json]
  --help               Show this message and exit.

Commands:
  extensions  Gets the available extensions.
  packages    Gets packages.
  resources   Extracts metadata from resources from previously downloaded packages.
  tags        Gets the tags used by the datasets.
  themes      Gets the themes used by the datasets.
Options
By default the command line tool saves the repository data as JSON. To save it as CSV instead, use the --format csv option:
$ data_portal_explorer --format csv
Configuration
Data Portal Explorer uses two configuration files: one that defines which data repositories to harvest data from, and one that configures logging.
Before using the tool, create a config.ini following the format:
[DEFAULT]
logging = logging.ini
# prefix for new properties being added to the metadata from the repositories
namespace = dpe
# defaults to the number of processors on the machine, multiplied by 5
workers =
[data_formats]
text =
    csv
    tsv
excel =
    ods
    spreadsheet
    xls
    xlsx
[portals]
# one portal per line
# each of the active portals should have its own section/settings below
active =
    # data.gov
    data.gov.ie
    data.gov.uk
    data.london.gov.uk
    # open.canada.ca
[example.portal.section]
# the CKAN API endpoint URL
url =
# the CKAN theme field for the portal
themes =
[data.gov]
url = https://catalog.data.gov/
themes = groups
[data.gov.ie]
url = https://data.gov.ie/
themes = theme
[data.gov.uk]
url = https://ckan.publishing.service.gov.uk/
themes = theme-primary
[data.london.gov.uk]
url = https://data.london.gov.uk/
themes = tags
[open.canada.ca]
url = https://open.canada.ca/data/
themes = subject
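To check a config.ini before starting a long harvest, the file can be read with Python's standard configparser module. The sketch below is not the tool's own loader. It assumes the indented multi-line layout that configparser expects, and the helper names active_portals and worker_count are hypothetical. It lists the active portals, checks that each one has its own section, and applies the default worker count described in the comment above (number of processors multiplied by 5).

```python
import configparser
import os

# A trimmed copy of the config.ini format, inlined so the sketch is self-contained.
SAMPLE = """
[DEFAULT]
logging = logging.ini
namespace = dpe
workers =

[portals]
active =
    # data.gov
    data.gov.ie
    data.gov.uk

[data.gov.ie]
url = https://data.gov.ie/
themes = theme

[data.gov.uk]
url = https://ckan.publishing.service.gov.uk/
themes = theme-primary
"""


def active_portals(config):
    """Return the portal names listed under [portals] active, one per line."""
    raw = config.get("portals", "active", fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def worker_count(config):
    """Return the configured worker count, or processors * 5 when it is empty."""
    # Assumption: this mirrors the default described in the config comment.
    raw = config.get("DEFAULT", "workers", fallback="").strip()
    return int(raw) if raw else (os.cpu_count() or 1) * 5


config = configparser.ConfigParser()
config.read_string(SAMPLE)

portals = active_portals(config)
# Active portals that lack their own [section] with url/themes settings.
missing = [name for name in portals if not config.has_section(name)]
settings = {
    name: (config[name]["url"], config[name]["themes"])
    for name in portals
    if name not in missing
}
```

Commented-out entries such as data.gov are skipped by configparser, so only the portals that are actually active appear in portals. For a real file, config.read("config.ini") would replace read_string.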
Then create the logging configuration file, logging.ini, following the format:
[loggers]
keys=root
[handlers]
keys=fileHandler
[formatters]
keys=simpleFormatter
[logger_root]
level=INFO
handlers=fileHandler
[handler_fileHandler]
args=('dpe.log', 'w',)
class=FileHandler
formatter=simpleFormatter
level=INFO
[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
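The file above uses the format read by logging.config.fileConfig in Python's standard library. Whether the tool loads it through that function is an assumption; the sketch below only shows one way to confirm that the file parses and writes to dpe.log. It works in a temporary directory so that nothing is written next to a real project, and the function name check_logging_config and the logger name dpe.check are hypothetical.

```python
import logging
import logging.config
import os
import tempfile

LOGGING_INI = """\
[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
class=FileHandler
level=INFO
formatter=simpleFormatter
args=('dpe.log', 'w',)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
"""


def check_logging_config(ini_text, message):
    """Load ini_text with fileConfig in a scratch directory and return the log text."""
    previous_dir = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        os.chdir(workdir)  # 'dpe.log' is a relative path, so it lands in workdir
        try:
            with open("logging.ini", "w") as handle:
                handle.write(ini_text)
            logging.config.fileConfig("logging.ini", disable_existing_loggers=False)
            logging.getLogger("dpe.check").info(message)
            # Close and detach the file handler so the log can be read and removed.
            root = logging.getLogger()
            for handler in list(root.handlers):
                handler.close()
                root.removeHandler(handler)
            with open("dpe.log") as handle:
                return handle.read()
        finally:
            os.chdir(previous_dir)


output = check_logging_config(LOGGING_INI, "logging configuration loaded")
```

Because the handler is opened with mode 'w', each run overwrites dpe.log rather than appending to it.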
Tell the tool which configuration file to use by passing its path as the CONFIG argument, for example:
$ data_portal_explorer --format csv config.ini destination_path extensions
Commands
Data Portal Explorer has several commands to harvest different types of data from the repositories.
Note
Most of the commands are independent of one another. The exception is the resources command, which must be run after the packages command has harvested the package metadata.
It is possible to run multiple commands in a single invocation, for example to get both the tags and the themes:
$ data_portal_explorer --format csv config.ini destination_path tags themes
Warning
Depending on the size of the data repositories being harvested, the packages and resources commands can take a long time to complete.
Extensions
The extensions command gets a list of the extensions/plugins installed in a data repository:
$ data_portal_explorer --format csv config.ini destination_path extensions
Packages
The packages command gets metadata about the datasets/packages stored in a data repository:
$ data_portal_explorer --format csv config.ini destination_path packages
Tags
The tags command gets a list of the tags used to classify the datasets/packages:
$ data_portal_explorer --format csv config.ini destination_path tags
Themes
The themes command gets a list of the themes used to group the datasets/packages:
$ data_portal_explorer --format csv config.ini destination_path themes
Resources
The resources command gets metadata about the resources in a repository's datasets/packages and, if a resource file is in a readable format, also extracts its field names and dates.
This command needs to be run after the packages command:
$ data_portal_explorer --format csv config.ini destination_path resources destination_path/packages.json
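Because resources depends on the output of packages, scripting the two runs in order can help avoid starting resources against a missing file. The sketch below only assembles the command lines shown in this section and runs them with subprocess. The helper names build_command and harvest_resources are hypothetical, and it assumes, based on the example above, that the packages command writes packages.json into the destination directory.

```python
import subprocess
from pathlib import Path


def build_command(config, dest, *commands, fmt="json"):
    """Assemble: data_portal_explorer --format FMT CONFIG DEST COMMAND [ARGS]..."""
    return ["data_portal_explorer", "--format", fmt, str(config), str(dest), *commands]


def harvest_resources(config, dest, fmt="json"):
    """Run packages first, then resources against the packages file it produced."""
    # Assumption: the packages file is DEST/packages.json, as in the example above.
    packages_file = Path(dest) / "packages.json"
    subprocess.run(build_command(config, dest, "packages", fmt=fmt), check=True)
    if not packages_file.exists():
        raise FileNotFoundError(f"expected {packages_file} after the packages command")
    subprocess.run(
        build_command(config, dest, "resources", str(packages_file), fmt=fmt),
        check=True,
    )


# Building a command does not run anything; this one matches the tags/themes example.
command = build_command("config.ini", "destination_path", "tags", "themes", fmt="csv")
```

Calling harvest_resources("config.ini", "destination_path") would run both steps in order; check=True stops the script at the first command that exits with an error, so resources is never started after a failed packages run.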