git and GitHub: laying a foundation for reproducible data analysis

Kyle A. Gervers 2024-10-08

The content and organization of this demonstration is heavily inspired by walkthroughs designed by Pat Schloss. Please explore all of his work, perhaps starting here and/or here!

Sections

Introduction
Git and GitHub
Getting started
Git
GitHub CLI
The virtuous circle
GitHub organizations
Reproducibility

Introduction

Return

There are a lot of moving parts involved in most microbiome data analysis projects. Demultiplexing, trimming, and quality filtering can each yield dozens to hundreds of files, parameters need to be tuned to the quirks of each data set, and entirely novel scripts are often needed to examine properties of particular interest. Sometimes older versions of the code are needed, and sometimes code needs to be shared with advisors and collaborators.

It’s certainly possible to manually keep track of all these moving parts. Every time I change a file, I could type those changes into a log, indicating which lines or which elements I changed. But it can be hard for me to remember and log all of the changes I made to a single file, let alone every file in my project. Further, in the case of research projects, getting confused by my code can have disastrous consequences. Add a collaborator to a project, and the room for confusion doubles. This is where version control software like Git and platforms like GitHub lend a helping hand. These two solutions allow me to (1) easily download code, (2) add and track the edits I make, (3) undo those edits to a previous state, (4) store the edited code on a remote web platform, (5) share this code with a co-author, who can then (6) independently edit my code. I can then (7) merge these edits with the edits I had been making in the meantime, and then, once the project matures (8) publish a release of our code with the manuscripts I submit for publication in a journal. Other researchers can then use my code which (9) builds trust in my research and (10) helps other researchers improve on my approaches, improving the field overall.

In short, using Git and GitHub improves makes practically every step of the analysis, collaboration, and publication workflow easier. As with all tools, using them consistently and comfortably requires patience and practice. Thankfully, over 100 million people use Git and GitHub, so a lot can be learned from this huge user base by googling.

Git and GitHub

Return

Before we get too far along: Git is not the same as GitHub. These are two separate tools that have evolved to play very nicely with each other. Linus Torvalds authored Git in 2005 as way to coordinate the development of the Linux kernel among several developers, while GitHub.com was launched in 2008 by GitHub, Inc. (owned by Microsoft) to host software development projects that used Git to organize workflows. While it used to be the case that the two could be separated by the fact that Git was a program and GitHub was a platform, GitHub has since developed a suite of handy protocols and programs to improve developer workflows, notably GitHub CLI, which we will use here.

Getting started

Return

Make a GitHub account

Click here to make a GitHub account, and set up two-factor authentication while you’re at it.

I recommend using a non-OSU email to set up this account, as you might not always have access to your OSU email address. In addition to your email address, record the username you provided for this GitHub account.

Install and configure Git

As indicated in the Git installation guide, installation depends on operating system (Linux, macOS, or Windows), so follow the directions specific to your OS.

Once installed, Git needs to be configured to make it a little easier to use. Replace my information with your information on the command line.

git config --global user.name "Kyle Austin Gervers"
git config --global user.email "kyle@gervers.com"
git config --global github.user "gerverska"
git config --global color.ui "auto"
git config --global core.editor "nano -w"
git config --global push.default simple

For Linux + macOS:

git config --global core.autocrlf input

For Windows:

git config --global core.autocrlf true

Feel free to change any of these Git configurations, but I have found these settings to work fine for most of my projects.

I can always see what I’ve configured by running the following:

git config --list

Install and configure GitHub CLI

This program is not entirely necessary for using Git with GitHub, but I think it simplifies account security operations and project repository creation. It also seems like this program simplifies higher-level operations that I won’t cover here (and don’t full understand at the time of this writing), and I’ll probably “grow into” this package over time. Again, follow the installation directions relevant to your operating system.

There’s minimal configuration to do, but consider changing the editor to nano (my preference).

gh config set editor nano

To check the configuration state, run:

gh config list

A note on reproducibility

If you are a Windows user, it might be easier to use Git and GitHub CLI on the CQLS HPC (i.e., “the cluster”) or Posy’s lab computer. The system-wide installs can be used, but both of these programs can also be installed and managed to your liking using conda.

A guide to setting up your own conda install on the CQLS HPC can be found here. While not covered in this demo, using conda stands to help all users keep their workflows reproducible and customizable. Refer to videos on this topic by Pat Schloss on his Riffomonas Project YouTube channel.

Git

Return

`clone`

The place where project code is kept is called a repository. In the language of Git and GitHub, this is often shortened to repo. If I follow this link, I will be taken to my repo named git-gh-demo. From here, anyone can download the code by running the following git command:

git clone https://github.com/gerverska/git-gh-demo.git

Where did I get the https://github.com/gerverska/git-gh-demo.git link? After following the link to the git-gh-demo repo, I clicked the green <> Code button and copied and pasted the HTTPS link onto the command line.

After running git clone, (1) use mv to change the name of git-gh-demo as shown below (but don’t use my name!), (2) use cd to access the renamed directory, and then (3) use rm to remove the .git folder that came with the repo.

mv git-gh-demo/ git-gh-demo_gerverska/
cd git-gh-demo_gerverska/
rm -fr .git

ls -a

The output of ls -a shows all the files and folders associated with the repo.

.               02-trim     05-rarefy   config.yml         .gitignore    README.rmd
..              03-denoise  06-analyze  data               README_files
01-demultiplex  04-compile  code        git-gh-demo.Rproj  README.md

`init`

Run git init . to initialize a repo inside the renamed git-gh-demo folder. It’s as easy as that.

git init .

Initialized empty Git repository in /path/to/git-gh-demo_gerverska/.git/

I now have a bouncing baby repo. Specifically, after running git init ., I now have a local repo (i.e., stored locally on my computer). The link that I cloned git-gh-demo from is a remote repo (i.e., stored on something that’s not my local repo). The goal now for folks following along is to create a remote repo from this local repo.

However, there’s nothing in my repo. Whenever I run git status while I’m in the renamed git-gh-demo folder, I get this:

On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    .gitignore
    01-demultiplex/
    02-trim/
    03-denoise/
    04-compile/
    05-rarefy/
    06-analyze/
    README.md
    README.rmd
    README_files/
    code/
    config.yml
    data/
    git-gh-demo.Rproj

nothing added to commit but untracked files present (use "git add" to track)

This shows me that all the files in git-gh-demo are untracked. To help me only track the files I care about, git doesn’t track any files or directories after I run git init .. Instead I have to add the files I want tracked. The output I got from git status above suggests:

use "git add <file>..." to include in what will be committed

I’ll do just that!

`add`

git add .

Instead of specifying a file, I specified .. This is the same as adding all of the files and folders to this repo. Now when I run git status I get:

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   .gitignore
    new file:   01-demultiplex/README.md
    new file:   01-demultiplex/logs/R1.html
    new file:   01-demultiplex/logs/R1_data/multiqc.log
    new file:   01-demultiplex/logs/R1_data/multiqc_citations.txt
    new file:   01-demultiplex/logs/R1_data/multiqc_data.json
    new file:   01-demultiplex/logs/R1_data/multiqc_fastqc.txt
    new file:   01-demultiplex/logs/R1_data/multiqc_general_stats.txt
    new file:   01-demultiplex/logs/R1_data/multiqc_sources.txt

Several other files are listed too. If I wanted to undo this add, git status says I could run git rm . to unstage, which is good to know. I try not to add files that I think might present security vulnerabilities, or contain speculation or important information that I don’t think other GitHub users should see. If I’m happy with adding all these files, the next step is to commit those changes.

Ignoring files

If I attempt to commit a file I’ve added that is larger than 50 MiB, I’ll receive a warning from Git. The commit will still push to the repository (covered below this section), but I should probably consider removing the commit to minimize performance impact. In the case of plant microbiome research, we often have large FASTQ or image files in our project folders that we don’t want on GitHub. This is where the .gitignore file helps us out. The . before gitignore indicates that it’s an “invisible” file only visible with commands like ls -a.

How do I use .gitignore?

Edit the file using nano .gitignore

env
data/raw
scratch
.git
01-demultiplex/*.fq.gz
02-trim/*.fq.gz
.adapters
03-denoise/filter
*-log
.Rproj.user

The above shows my current .gitignore file. Each file or folder listed here will not be tracked whenever I run git add .. I can even put tell git to ignore the .gitignore file itself so that creeps on the internet can’t see the names of files that I’d like to ignore.

Importantly, I can also use the wildcard symbol * to ignore multiple files or folders at once. I can see that 01-demultiplex/*.fq.gz and 02-trim/*.fq.gz are listed here. By including this, none of my many dozens of FASTQ files are ever added to the repo as far as git is concerned. However, when using the wildcard *, make sure that the correct files are tracked or ignored before committing!

`commit`

git commit -a

This command commits all of the changes we just added and opens up an instance of nano:

GNU nano 6.2

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# On branch main
#
# Initial commit
#
# Changes to be committed:
#       new file:   .gitignore
#       new file:   01-demultiplex/README.md
#       new file:   01-demultiplex/logs/R1.html
#       new file:   01-demultiplex/logs/R1_data/multiqc.log
#       new file:   01-demultiplex/logs/R1_data/multiqc_citations.txt
#       new file:   01-demultiplex/logs/R1_data/multiqc_data.json
#       new file:   01-demultiplex/logs/R1_data/multiqc_fastqc.txt
#       new file:   01-demultiplex/logs/R1_data/multiqc_general_stats.txt
#       new file:   01-demultiplex/logs/R1_data/multiqc_sources.txt
#       new file:   01-demultiplex/logs/R2.html

There’s a note in the nano window that asks me to enter the commit message for your changes. I try to make these commit messages short and sweet (10 words or so), while also accurately communicating what was changed. Since this is my initial commit, let’s just write Initial commit and write nano output.

[main (root-commit) #######] Initial commit
 192 files changed, 3217058 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 01-demultiplex/README.md
 create mode 100644 01-demultiplex/logs/R1.html
 create mode 100644 01-demultiplex/logs/R1_data/multiqc.log
 create mode 100644 01-demultiplex/logs/R1_data/multiqc_citations.txt
 create mode 100644 01-demultiplex/logs/R1_data/multiqc_data.json
 create mode 100644 01-demultiplex/logs/R1_data/multiqc_fastqc.txt
 create mode 100644 01-demultiplex/logs/R1_data/multiqc_general_stats.txt
 create mode 100644 01-demultiplex/logs/R1_data/multiqc_sources.txt
 create mode 100644 01-demultiplex/logs/R2.html

When I run git status again, I now have this update:

 On branch main
 nothing to commit, working tree clean

GitHub CLI

Return

`push`

I’m very close to pushing this local repo to GitHub (the remote)! Because this is my first push to GitHub (pretend for this example!), I’ll actually use gh from GitHub CLI to do this. You can also create an empty repo through the GitHub website that you use git to push to, but I think it’s just easier to use gh.

gh repo create

This starts an interactive session with gh. Answer all the prompts as shown below (follow the > arrow). If you’re in your renamed git-gh-demo folder (which you should be if you’re following along), just press Enter for each step.

? What would you like to do?  [Use arrows to move, type to filter]
  Create a new repository on GitHub from scratch
> Push an existing local repository to GitHub

Press Enter

? Path to local repository (.)

Press Enter

? Repository name (git-gh-demo_gerverska)

Press Enter

? Repository owner  [Use arrows to move, type to filter]
> BusbyLab
  gerverska

I’ll select BusbyLab for this example so my PI can associate this repo with their organization. If you don’t have the option to select BusbyLab, choose yourself.

? Description (git-gh-demo_gerverska) Learning git + gh

Provide an informative description of the repo.

? Visibility  [Use arrows to move, type to filter]
  Public
> Private
  Internal

I’ll select Private for this exercise, but I could select Public if my PI gave me the go-ahead. Always ask your PI before making a repo public.

✓ Created repository BusbyLab/git-gh-demo_gerverska on GitHub

? What should the new remote be called? (origin)

Press Enter

? Would you like to push commits from the current branch to "origin"? Yes

Enumerating objects: 207, done.
Counting objects: 100% (207/207), done.
Delta compression using up to 12 threads
Compressing objects: 100% (202/202), done.
Writing objects: 100% (207/207), 46.31 MiB | 12.47 MiB/s, done.
Total 207 (delta 73), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (73/73), done.
To https://github.com/BusbyLab/git-gh-demo_gerverska.git
 * [new branch]      HEAD -> main
branch 'main' set up to track 'origin/main'.
✓ Pushed commits to https://github.com/BusbyLab/git-gh-demo_gerverska.git

Excellent! I now have a remote repo I can view at https://github.com/BusbyLab/git-gh-demo_gerverska that matches my local repo!

Subsequent pushes

Perhaps I make more changes to my code after the intial repo push. After running git add ., git commit -a, and git status, I can see that my local repo (on my computer) is ahead of my remote repo (on GitHub).

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

Before I can run git push (which is what I’ll use for subsequent pushes), I need to refresh my login.

gh auth login

GitHub CLI will start an interactive session, asking me a series of questions.

? What account do you want to log into?  [Use arrows to move, type to filter]
> GitHub.com
  GitHub Enterprise Server

Select GitHub.com

? What is your preferred protocol for Git operations?  [Use arrows to move, type to filter]
> HTTPS
  SSH

For now, I’ll select HTTPS. SSH is technically safer, but requires some setup.

? Authenticate Git with your GitHub credentials? (Y/n)

Type Y.

? How would you like to authenticate GitHub CLI?  [Use arrows to move, type to filter]
> Login with a web browser
  Paste an authentication token

For now, select Login with a web browser. You can choose to paste an authentication token (fine-grained or classic) if you set that up, but just make sure that you keep the token somewhere memorable and safe. However, in my opinion, if you want to go the token route, you might as well set up an SSH sign in.

! First copy your one-time code: 19C2-CD83
Press Enter to open github.com in your browser...

After copying the code, I’ll press Enter. A GitHub tab in my default browser window should pop up with a place for you to paste the code.

✓ Authentication complete.
- gh config set -h github.com git_protocol https
✓ Configured git protocol
✓ Logged in as gerverska

And I’m signed in! Now that I’m signed in, I’ll go ahead and run git push. If I don’t push soon, I’ll need to run gh auth login again.

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 12 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 5.20 KiB | 1.04 MiB/s, done.
Total 4 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 2 local objects.
To https://github.com/BusbyLab/git-gh-demo_gerverska.git
   #######..#######  main -> main

Success! The remote repo on GitHub now reflects the changes made to the local repo on my computer.

The virtuous circle

Return

Most of my experience with Git and GitHub looks like this.

git status
git add .
git status
git commit -a
git status
gh auth login
git push
git status

As long as I’m running git status (and reading the output), I shouldn’t get too surprised! When in doubt, take it slowly.

Usually, it’s best practice to make several commits in series while you’re working on different sections of your code. By choosing to work on specific sections of your code, you can write more specific commit messages. We can treat these commit messages as notes to ourselves or our collaborators describing what we did. If I’m happy with my changes after making a series of commits, then I’ll go ahead and push.

I’ll be updating this repo with more information on how to use other git and gh commands!

GitHub organizations

GitHub organizations give members access to each other’s code. Importantly, private repos included in the organization can only be seen by other members, allowing us to easily share insight and build off each other without having to make the repo prematurely public.

At the moment, I think it’s easier to make everyone an owner in the organization. This allows members to start the repo anywhere and still associate the repo with the BusbyLab organization. Because owners have access to all the repos in the organization, we need to be careful to not accidentally disturb each other’s repos. Thankfully, this is very hard to do. Further, since git + GitHub tracks all changes, no change should be a surprise and changes can be easily rolled back.

I think this is currently the best way to use organizations:

Create a Private repo using your personal GitHub account.
Fork that repo to the BusbyLab organization.
If the Private repo changes later, the BusbyLab fork can just sync those changes. I see this as the more common use case.
If a change is made on the BusbyLab fork, this change can be suggested to the user’s personal Private repo.

We can consider removing ownership privileges after a certain point, but it might help to set everyone as an owner for a trial period. If individuals have their own Private repos that are just forked into BusbyLab, those personal Private repos can always be used as recovery points.

Reproducibility

Return

All packages were installed and managed with conda.

conda 24.4.0
name: /home/gerverska/projects/git-gh-demo/env
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - atropos=1.1.31
  - bioconductor-dada2=1.26.0
  - cutadapt=4.2
  - fastqc=0.11.9
  - itsx=1.1.3
  - multiqc=1.13
  - pheniqs=2.1.0
  - r-base=4.2.2
  - r-caret=6.0_93
  - r-dplyr=1.0.10
  - r-markdown=1.4
  - r-remotes=2.4.2
  - r-rmarkdown=2.20
  - r-vegan=2.6_4
  - vsearch=2.22.1
prefix: /home/gerverska/projects/git-gh-demo/env

Install the above bioinformatic environment from config.yml using the script 00-build.sh

# Clone the repo (using the GitHub CLI tool) ####
gh repo clone gerverska/git-gh-demo

# Run the build script ####
bash code/00-build.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

git and GitHub: laying a foundation for reproducible data analysis

Sections

Introduction

Git and GitHub

Getting started

Make a GitHub account

Install and configure Git

Install and configure GitHub CLI

A note on reproducibility

Git

`clone`

`init`

`add`

Ignoring files

`commit`

GitHub CLI

`push`

Subsequent pushes

The virtuous circle

GitHub organizations

Reproducibility

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
01-demultiplex		01-demultiplex
02-trim		02-trim
03-denoise		03-denoise
04-compile		04-compile
05-rarefy		05-rarefy
06-analyze		06-analyze
README_files/figure-gfm		README_files/figure-gfm
code		code
data		data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README.rmd		README.rmd
config.yml		config.yml
git-gh-demo.Rproj		git-gh-demo.Rproj

Folders and files

Latest commit

History

Repository files navigation

git and GitHub: laying a foundation for reproducible data analysis

Sections

Introduction

Git and GitHub

Getting started

Make a GitHub account

Install and configure Git

Install and configure GitHub CLI

A note on reproducibility

Git

clone

init

add

Ignoring files

commit

GitHub CLI

push

Subsequent pushes

The virtuous circle

GitHub organizations

Reproducibility

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`clone`

`init`

`add`

`commit`

`push`

Packages