Keeping a Tidy House

ECHILD: Structuring your project and code

Tony Stone

21 Sep 2023

The ONS Duck Book - search for “ONS Duck”

"That's not my duck" book example page
  • Based on the principles in the UK Gov’s Analytical Quality Assurance (“Aqua”) book

    • Arose from the Review of quality assurance of government analytical models after the InterCity West Coast franchise competition in 2012

    • Seeks to ensure appropriate quality assurance of models used by government: their inputs, methodology and outputs

  • Long title: QUality Assurance of Code for Analysis and Research (QUACAR /kwakə/)

    • UK Civil Service Analysis Function guidance
    • Produced by the Analytical Standards and Pipelines (ASaP) hub within the Office for National Statistics (ONS).
  • And you thought initialisms and acronyms in academia were contrived…

Principles (based on Aqua Book)

  • Reproducible
  • Auditable
  • Assured
  • Reproducible analytical pipelines

Reproducible

If you can’t prove that you can run the same analysis, with the same data, and obtain the same results, then you are not adding a valuable analysis.

With a repeatable, transparent production process we can:

  • Focus on verifying that the analysis is sensible (rather than worry about whether the code runs)
  • Reuse and build on the methods

Requires good documentation, not simply code in a repo.

Auditable

  • Transparency can help to increase trustworthiness
  • More 👀 = tips and increased identification of flaws = improved analyses
  • Document design decisions:
    • Who?
    • When?
    • What evidence?

Requires good documentation.

Assured

Quality assurance is time-consuming and resource-intensive

  • Quality assure code and outputs proportionately

  • Some assurance processes can be automated, e.g. does your code run?

  • …But not all, e.g. does your code do what you think it does? (though automation can help here)

Document any quality assurance processes or their absence.

Reproducible analytical pipelines (RAP)

This is most helpful when conducting the same analysis routinely (e.g. an official statistic).

It is less helpful in research, which generally involves developing new, one-time analyses.

However, both should adopt a modular design:

break complex logic down into small, understandable chunks that can be documented and tested more easily.
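
A minimal sketch of what a modular design might look like, assuming a toy analysis that drops missing values and then averages them. The function names (clean, mean, summarise) are hypothetical and are not taken from the Duck Book.

```python
def clean(values):
    """Drop missing (None) entries from a list of values."""
    return [v for v in values if v is not None]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarise(raw):
    """Run the two steps in order: clean, then average."""
    return mean(clean(raw))


# Each small chunk can be checked on its own ...
assert clean([1, None, 3]) == [1, 3]
assert mean([1, 3]) == 2
# ... and then in combination.
assert summarise([1, None, 3]) == 2
```

Because each function does one thing, a failing check points at one small piece of logic rather than at the whole analysis.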

Structuring your project (from Duck Book)

  • Good directory structure
  • Consistent file and directory name conventions
  • Organise analysis as a Directed Acyclic Graph
  • Preserve input data
  • Check outputs are disposable
  • Modular design

Directory structure

Good directory structure and file hygiene go a long way…

Logical segregation of projects and analytical tasks

  • HOPE-study
    • WP1.1-cohort-development
    • WP1.2-phenotypes
  • ECHILD-linkage-evaluation
    • 01-linkage-by-birth-cohort
    • 02-linkage-by-npd-module

Good file and directory name conventions

  • Consistency, above all else

  • Short but descriptive and human readable names

  • No spaces, for machine readability - underscores (_) or dashes (-) are preferred

  • Use of consistent ISO date formatting (i.e. YYYY-MM-DD)

  • Padding the left side of numbers with zeros to maintain order

    e.g. 01 instead of 1. So 10 appears after 09 rather than before 2.
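
As a rough illustration of these conventions, the sketch below builds file names with a zero-padded step number, no spaces and an ISO date. The helper output_name and the label "linkage rates" are made up for the example.

```python
from datetime import date


def output_name(step: int, label: str, run_date: date, ext: str = "csv") -> str:
    """Build a file name of the form NN_label_YYYY-MM-DD.ext."""
    # Replace spaces with dashes so the name stays machine readable.
    safe_label = label.strip().lower().replace(" ", "-")
    # :02d pads the step number with a leading zero; isoformat() gives YYYY-MM-DD.
    return f"{step:02d}_{safe_label}_{run_date.isoformat()}.{ext}"


names = [output_name(n, "linkage rates", date(2023, 9, 21)) for n in (10, 2, 9)]

# A plain alphabetical sort now matches the numeric order of the steps.
print(sorted(names))
```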

Organise analysis as a Directed Acyclic Graph

A DAG is for life, not just for causal inference.

(Euler, 1736)

Visualise relationships between tasks:

  • dependencies
  • dependants

[Figure: dataflow DAG. Nodes: Ingest data (input), a, b, c, d. Edges: input → a, input → b, input → d, a → c, b → c, b → d, c → d]
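
One way to work with such a graph in code, sketched here with Python's standard-library graphlib module (Python 3.9+). The task names mirror the nodes in the dataflow figure above. Writing the graph as a dictionary of dependencies is an assumption of this sketch, not something the slides prescribe.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
dependencies = {
    "a": {"input"},
    "b": {"input"},
    "c": {"a", "b"},
    "d": {"input", "b", "c"},
}

# static_order() yields an order in which every task comes after its
# dependencies; it raises graphlib.CycleError if the graph has a cycle.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# Invert the mapping to list the dependants of each task.
dependants = {}
for task, needs in dependencies.items():
    for dependency in needs:
        dependants.setdefault(dependency, set()).add(task)
print(dependants["b"])
```

The dependants view answers the question "if I change this step, what has to be re-run?".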

Preserve input data

  • Input data - the starting point of an analysis - should be read-only
  • Data cleaning and re-modelling often incorporates many design decisions and forms part of your analysis
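
A minimal sketch of one way to make an input file read-only from Python, assuming file permissions are the mechanism you want (a read-only network share or storage policy would be alternatives). The file name raw_extract.csv is hypothetical, and a temporary directory stands in for a real data folder.

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(path: Path) -> None:
    """Remove write permission for user, group and others."""
    current = stat.S_IMODE(path.stat().st_mode)
    path.chmod(current & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_extract.csv"
    raw.write_text("id,value\n1,10\n")

    make_read_only(raw)
    # Check the permission bits rather than attempting a write.
    is_writable = bool(raw.stat().st_mode & stat.S_IWUSR)
    print(is_writable)

    # Restore write permission so the temporary directory can be removed cleanly.
    raw.chmod(stat.S_IRUSR | stat.S_IWUSR)
```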

Check outputs are disposable

  • You should be able to dispose of your outputs without worrying
  • If not, it is unlikely you have confidence in being able to reproduce your results
  • It is good practice to delete and regenerate your outputs frequently when developing an analysis
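
The idea can be sketched as a run script that always deletes the output directory before rebuilding it from the inputs. The function regenerate_outputs, the file names and the toy "analysis" (a count and a sum) are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def regenerate_outputs(input_file: Path, output_dir: Path) -> Path:
    """Delete output_dir, then rebuild its contents from input_file."""
    # Throw away everything from the previous run.
    shutil.rmtree(output_dir, ignore_errors=True)
    output_dir.mkdir(parents=True)

    values = [int(line) for line in input_file.read_text().split()]
    summary = output_dir / "summary.txt"
    summary.write_text(f"n={len(values)} total={sum(values)}\n")
    return summary


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input.txt"
    source.write_text("1\n2\n3\n")

    first = regenerate_outputs(source, Path(tmp) / "outputs").read_text()
    second = regenerate_outputs(source, Path(tmp) / "outputs").read_text()
    # Same input, same code, same result after a full wipe.
    print(first == second)
```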

Code documentation

Documentation is a love letter that you write to your future self.

Damian Conway

  • Focus on readable, modular code

  • Use docstrings for modules, functions, classes, and methods

  • Use code comments with purpose:

    • For non-obvious details
    • How and why code has been written in a particular way
  • But sparingly:

    • Avoid redundancy
    • Burden of maintaining code-comment agreement
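
A short illustration of the points above, using a hypothetical function linkage_rate: a docstring describes what the function does and how to call it, and the single comment explains a "why" that the code alone does not show. The Args/Returns/Raises layout is one common docstring style, chosen here as an assumption.

```python
"""Helpers for summarising linkage quality (hypothetical module)."""


def linkage_rate(linked: int, total: int) -> float:
    """Return the proportion of records that were linked.

    Args:
        linked: Number of records with a successful link.
        total: Number of records eligible for linkage.

    Returns:
        A value between 0 and 1.

    Raises:
        ValueError: If total is not positive or linked is out of range.
    """
    # Guard first: a zero total would otherwise surface as a
    # ZeroDivisionError, which says nothing about the bad input.
    if total <= 0 or not 0 <= linked <= total:
        raise ValueError("expected total > 0 and 0 <= linked <= total")
    return linked / total


print(linkage_rate(3, 4))
```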

Project documentation

Documenting your project makes it much easier for others to understand your goal and ways of working.

  • README
  • HOW TO (get the code to run)
  • Dependencies
  • Citation
  • Vignettes
  • Versioning
  • Change log
  • Copyright and License

TL;DR

WRITE A README

You’ll thank yourself later