Usage

To use Data Portal Explorer in a project:

import data_portal_explorer

To use Data Portal Explorer from the command line:

$ data_portal_explorer
Usage: data_portal_explorer [OPTIONS] CONFIG DEST COMMAND1 [ARGS]... [COMMAND2
                            [ARGS]...]...

Console script for data_portal_explorer.

Options:
--format [csv|json]  [default: json]
--help               Show this message and exit.

Commands:
extensions  Gets the available extensions.
packages    Gets packages.
resources   Extracts metadata from resources from previously downloaded packages.
tags        Gets the tags used by the datasets.
themes      Gets the themes used by the datasets.

Options

By default, the command line tool saves the repository data as JSON. To save it as CSV, use the --format csv option:

$ data_portal_explorer --format csv

Configuration

Data Portal Explorer uses two configuration files: a main configuration file that defines which data repositories to harvest data from, and a logging configuration file.

Before using the tool, create a config.ini following the format:

[DEFAULT]
logging = logging.ini
# prefix for new properties being added to the metadata from the repositories
namespace = dpe
# defaults to the number of processors on the machine, multiplied by 5
workers =

[data_formats]
text =
    csv
    tsv
excel =
    ods
    spreadsheet
    xls
    xlsx

[portals]
# one portal per line
# each of the active portals should have its own section/settings below
active =
    # data.gov
    data.gov.ie
    data.gov.uk
    data.london.gov.uk
    # open.canada.ca

[example.portal.section]
# the CKAN API endpoint URL
url =
# the CKAN theme field for the portal
themes =

[data.gov]
url = https://catalog.data.gov/
themes = groups

[data.gov.ie]
url = https://data.gov.ie/
themes = theme

[data.gov.uk]
url = https://ckan.publishing.service.gov.uk/
themes = theme-primary

[data.london.gov.uk]
url = https://data.london.gov.uk/
themes = tags

[open.canada.ca]
url = https://open.canada.ca/data/
themes = subject
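
The layout above is the INI dialect read by Python's standard configparser module. The following is a minimal sketch of how such a file could be interpreted, assuming it is loaded with configparser; the tool's own loading code is not shown in this document. The helpers multiline and resolve_workers are hypothetical names introduced for illustration, and the embedded text is a trimmed copy of the example file.

```python
import configparser
import os

# Trimmed copy of the config.ini shown above (illustration only).
CONFIG_TEXT = """
[DEFAULT]
logging = logging.ini
namespace = dpe
workers =

[data_formats]
text =
    csv
    tsv

[portals]
active =
    # data.gov
    data.gov.ie
    data.gov.uk

[data.gov]
url = https://catalog.data.gov/
themes = groups

[data.gov.ie]
url = https://data.gov.ie/
themes = theme

[data.gov.uk]
url = https://ckan.publishing.service.gov.uk/
themes = theme-primary
"""


def multiline(value):
    """Hypothetical helper: split a multi-line option into its entries."""
    return value.split()


def resolve_workers(raw):
    """Hypothetical helper: an empty value means processors * 5,
    following the comment in the example file."""
    return int(raw) if raw.strip() else (os.cpu_count() or 1) * 5


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Commented-out entries such as "# data.gov" are skipped by configparser,
# so only the uncommented portals are listed as active.
active = multiline(config["portals"]["active"])

# Each active portal name doubles as the name of its own section.
portals = {name: (config[name]["url"], config[name]["themes"]) for name in active}

text_formats = multiline(config["data_formats"]["text"])
workers = resolve_workers(config["DEFAULT"]["workers"])

for name, (url, themes) in portals.items():
    print(f"{name}: url={url} themes={themes}")
```

With configparser, options in [DEFAULT] are inherited by every other section, so namespace and workers are also visible when reading a portal section. A portal that is listed under active but has no section of its own would raise a KeyError in this sketch.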

Also create a logging configuration file, logging.ini, following the format:

[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
args=('dpe.log', 'w',)
class=FileHandler
formatter=simpleFormatter
level=INFO

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
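
The sections above follow the file format accepted by logging.config.fileConfig in the Python standard library. As a hedged illustration of what this configuration does, the sketch below loads the same settings and writes one record. It assumes the file is consumed through fileConfig, which this document does not state; the temporary working directory and the logger name dpe.example are choices made only for this example.

```python
import logging
import logging.config
import os
import tempfile

# Same content as the logging.ini shown above.
LOGGING_INI = """\
[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
args=('dpe.log', 'w',)
class=FileHandler
formatter=simpleFormatter
level=INFO

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
"""

# 'dpe.log' is a relative path, so it is created in the current working
# directory; a temporary directory keeps this example self-contained.
workdir = tempfile.mkdtemp()
os.chdir(workdir)
with open("logging.ini", "w") as handle:
    handle.write(LOGGING_INI)

logging.config.fileConfig("logging.ini", disable_existing_loggers=False)

logger = logging.getLogger("dpe.example")  # hypothetical logger name
logger.info("harvest started")
logger.debug("dropped: below the INFO level")

logging.shutdown()  # flush and close the file handler

with open("dpe.log") as handle:
    log_lines = handle.read().splitlines()
print(log_lines)
```

Because the handler opens dpe.log in 'w' mode, the log file is truncated each time the configuration is loaded; changing the mode to 'a' would append instead.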

Then tell the tool which configuration file to use by passing its path as the CONFIG argument, for example:

$ data_portal_explorer --format csv config.ini destination_path extensions

Commands

Data Portal Explorer has several commands to harvest different types of data from the repositories.

Note

Most of the commands are independent of one another. The exception is the resources command, which must be run after the packages command has harvested the package metadata.

It is possible to run multiple commands in a single invocation, for example to get the tags and themes:

$ data_portal_explorer --format csv config.ini destination_path tags themes
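
When the tool is driven from a script, the chained invocation above can be assembled as an argument list. This is a small sketch based only on the usage line shown earlier (options first, then CONFIG, DEST and one or more commands); build_command is a hypothetical helper, not part of the package.

```python
import shlex


def build_command(config_path, dest, commands, output_format="json"):
    """Hypothetical helper: build the argument list for one invocation."""
    if output_format not in ("csv", "json"):
        raise ValueError(f"unsupported format: {output_format}")
    if not commands:
        raise ValueError("at least one command is required")
    return [
        "data_portal_explorer",
        "--format", output_format,
        config_path,
        dest,
        *commands,
    ]


argv = build_command("config.ini", "destination_path", ["tags", "themes"], "csv")
print(shlex.join(argv))
# prints: data_portal_explorer --format csv config.ini destination_path tags themes
```

The resulting list can be passed to subprocess.run(argv, check=True) once the tool is installed.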

Warning

Depending on the size of the data repositories being harvested, the packages and resources commands can take a long time to complete.

Extensions

The extensions command gets a list of the extensions/plugins installed in a data repository:

$ data_portal_explorer --format csv config.ini destination_path extensions

Packages

The packages command gets metadata about the datasets/packages stored in a data repository:

$ data_portal_explorer --format csv config.ini destination_path packages

Tags

The tags command gets a list of the tags used to classify the datasets/packages:

$ data_portal_explorer --format csv config.ini destination_path tags

Themes

The themes command gets a list of the themes used to group the datasets/packages:

$ data_portal_explorer --format csv config.ini destination_path themes

Resources

The resources command gets metadata about the resources in a repository's datasets/packages and, if a resource file is in a readable format, its field names and dates. This command needs to be run after the packages command:

$ data_portal_explorer --format csv config.ini destination_path resources destination_path/packages.json
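
Because resources depends on the output of packages, a script can enforce the order by running the two commands as separate steps and stopping on the first failure. The sketch below is one way to do that under stated assumptions: it uses the default JSON format, it assumes the packages output is written to packages.json inside the destination directory (as the example path above suggests), and resource_pipeline and run_pipeline are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def resource_pipeline(config_path, dest):
    """Hypothetical helper: return the two invocations in dependency order."""
    base = ["data_portal_explorer", config_path, dest]
    # Assumption: the packages command writes packages.json into dest.
    packages_file = str(Path(dest) / "packages.json")
    return [
        base + ["packages"],
        base + ["resources", packages_file],
    ]


def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; check=True makes a failing step raise
    CalledProcessError, so resources never runs after a failed packages."""
    for argv in steps:
        runner(argv, check=True)


steps = resource_pipeline("config.ini", "destination_path")
for argv in steps:
    print(" ".join(argv))
# run_pipeline(steps)  # uncomment once data_portal_explorer is installed
```

The runner parameter exists only so the order of calls can be exercised without the tool being installed.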