How to Improve Your Relationship with Your Future Self

reading summary
reproducibility
workflows
data science

Reproducible workflows are not just good for science.

Published: August 15, 2023

Reading Summary

Bowers and Voors (2016)

Title: How to Improve Your Relationship with Your Future Self. Revista de Ciencia Política, 2016 (19 pages).

Authors: Jake Bowers and Maarten Voors (University of Illinois, Wageningen University).

Key words: Reproducibility, Workflows, Data Science.

Title page of Bowers and Voors (2016).

Bowers and Voors introduce seven principles to guide readers toward reproducible research. Drawing on their own workflows, they briefly justify each principle and give examples of how to put it into practice. Although many of these principles are widely known, it can take an individual many iterations to implement them; the authors aim to shorten the reader’s path to current best practice. The article focuses on implementing best practices more than on justifying them, so it is aimed at data scientists and data analysts who are aware of the need for reproducibility but do not know where to start.

Notes

The seven principles introduced are:

  1. Data analysis is computer programming.
  2. No data analyst is an island for long.
  3. The territory of data analysis requires maps.
  4. Version control prevents clobbering, reconciles history and helps organise work.
  5. Testing minimises error.
  6. Work can be reproducible.
  7. Research ought to be credible communication.

These principles overlap substantially with the practices recommended by Wilson et al. (2017).

  • “In another study, 29 research teams recently collaborated on a project focusing on applied statistics to see if the same answers would emerge from re-analyses of the same dataset (Silberzahn and Uhlmann 2015). They don’t.”

  • Coding scales better and reproduces better than using a GUI.

  • Scripting reduces the opportunity for mistakes and makes them faster to correct.

  • Use human-friendly file names and directory structures.

  • Split projects into modular tasks, each with its own script. This lets collaborators work in parallel without conflicts.

  • Data science is a series of decisions. Document the options you had and why you chose the one you did.

  • Write your code for other people, not yourself. This holds you accountable for its quality and pushes you to write clear code.

Let us change our traditional attitude to the construction of programs: instead of imagining that our main task is to instruct a computer what to do, let us concentrate on explaining to human beings what we want a computer to do.

Knuth (1984)
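As a minimal sketch of what "documenting a decision" and "writing for other people" might look like in code, the hypothetical Python function below records the option chosen, the alternative considered, and the reason, right where the decision is made. The function name, the `drop_missing` parameter and the example data are assumptions made for illustration; they do not come from Bowers and Voors (2016).

```python
from statistics import mean


def mean_outcome(values, drop_missing=True):
    """Return the mean of `values`, a list of numbers that may contain None.

    Decision: missing values (None) are dropped rather than imputed.
    Alternative considered: replace missing values with zero, rejected
    as the default because it would pull the mean toward zero.
    """
    if drop_missing:
        # Chosen option: analyse only the observed values.
        observed = [v for v in values if v is not None]
    else:
        # Hypothetical alternative, kept so the two choices can be compared.
        observed = [0 if v is None else v for v in values]
    return mean(observed)
```

A future reader, including your future self, can see from the docstring alone which choice was made and what the other option was, without reconstructing it from the code.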

  • Literate programming allows code and description to be mixed, so that code and report are seen as one thing. The paper does not address the difficulties of scaling and automating notebooks.

  • Write portable file paths relative to the root directory of your project.

  • Section 4 opens with an excellent and concise description of why we want version control.

  • Learning version control takes time, energy and lots of mistakes. Make these mistakes early and on low-stakes projects - you can burn it all to the ground if you need to!

  • The paper was written reproducibly: source code is available at https://github.com/jbowers/workflow. It could be a useful example for a data science course.

  • A formal example of the fork and pull-request workflow is given in this gist by Chase Pettit.

  • Section 6 ends with a nice quote to motivate practice:

“We all learn by doing. When we create a reproducible workflow and share reproducible materials we improve both cumulation of knowledge and our methods for doing social science.”

(Freese 2007; King 1995)
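The note above on portable file paths can be sketched in Python with the standard-library `pathlib` module. This is one possible approach, not the authors' own: the helper names `find_project_root` and `project_path`, and the use of a `.git` directory as the marker for the project root, are assumptions made for illustration.

```python
from pathlib import Path


def find_project_root(start, marker=".git"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")


def project_path(root, *parts):
    """Build a path below the project root without hard-coding separators."""
    return Path(root).joinpath(*parts)
```

A script anywhere in the project could then call, for example, `project_path(find_project_root(Path.cwd()), "data", "survey.csv")` (a hypothetical file name) and get the same file on any collaborator's machine, regardless of operating system or where the project folder lives.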

References

Bowers, Jake, and Maarten Voors. 2016. “How to Improve Your Relationship with Your Future Self.” Revista de Ciencia Política 36 (3): 829–48. https://doi.org/10.4067/S0718-090X2016000300011.
Freese, Jeremy. 2007. “Replication Standards for Quantitative Social Science: Why Not Sociology?” Sociological Methods & Research 36 (2): 153–72. https://doi.org/10.1177/0049124107306659.
King, Gary. 1995. “Replication, Replication.” PS: Political Science & Politics 28 (3): 444–52.
Knuth, D. E. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111. https://doi.org/10.1093/comjnl/27.2.97.
Silberzahn, Raphael, and Eric L. Uhlmann. 2015. “Crowdsourced Research: Many Hands Make Tight Work.” Nature 526 (7572): 189–91. https://doi.org/10.1038/526189a.
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” PLoS Computational Biology 13 (June): 1–20. https://doi.org/10.1371/journal.pcbi.1005510.
