LTARE Data Storage & Version Control

WaSHI Data

Non-Digital Data:

Non-digital data, such as paper forms, must be transcribed or converted to digital file formats and then stored in the respective shared LTARE OneDrive.

GitHub organizations for code-based projects:

Microsoft Teams for data sharing between LTARE sites:

  • WaSHI Team

WaSHI Google Drive for external file sharing:

  • WaSHI has a gmail.com account. The WaSHI Director and Data Management can help with access to this space.

Individual devices (laptop, tablet, phone):

  • Must NOT be the only place data are stored!

Backup

Data must be stored in multiple locations. Use the 3-2-1 Rule (3 copies of data, 2 different media types, 1 offsite copy). At minimum, data on an individual computer must also be saved on the LTARE external hard drive. Backing up data using version control (GitHub) or a cloud service (IDrive) is strongly recommended.
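As a minimal sketch of the "multiple locations" step, the Python below copies one file into several backup folders and checks each copy against a SHA-256 checksum of the original. The function names (sha256_of, back_up) and the idea of passing the external drive and cloud-synced folder as plain directory paths are assumptions made for illustration, not part of the LTARE workflow.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy source into every backup directory and verify each copy.

    backup_dirs is a hypothetical list of destinations, for example an
    external hard drive folder and a cloud-synced folder.
    """
    expected = sha256_of(source)
    copies = []
    for backup_dir in backup_dirs:
        backup_dir.mkdir(parents=True, exist_ok=True)
        target = backup_dir / source.name
        shutil.copy2(source, target)  # copies contents and timestamps
        if sha256_of(target) != expected:
            raise OSError(f"Backup at {target} does not match {source}")
        copies.append(target)
    return copies
```

Calling back_up with two destinations on different media gives three copies in total (the original plus two backups). Whether one of them counts as offsite depends on where the destinations physically live.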

Read-only raw data

Always set raw data files, such as lab results, as Read-Only to avoid accidental corruption or overwriting. For example, in the lab-data folder, all original data files are set to Read-Only and saved in the raw folder.

Copy the raw data file to the working folder for processing and analyses. Then save the final dataset in the separate clean folder with a descriptive title. Keeping a readme.txt to document processing steps is good practice, as discussed in the Documentation section.

Y:/NRAS/soil-health-initiative/state-of-the-soils/2023_sampling/lab-data
├── 2023_data-template-soiltest.xlsx
├── clean
├── qc
├── raw
└── working
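
The copy step for the folder layout above could be scripted roughly as follows. This is a sketch under stated assumptions: the function name copy_raw_to_working is hypothetical, and the choice to refuse overwriting an existing working copy is one possible safeguard rather than an LTARE requirement.

```python
import shutil
from pathlib import Path


def copy_raw_to_working(raw_file: Path, working_dir: Path) -> Path:
    """Copy a raw data file into the working folder and return the new path.

    shutil.copyfile copies only the file contents, so the working copy does
    not inherit the permission bits of the raw file.
    """
    working_dir.mkdir(parents=True, exist_ok=True)
    working_copy = working_dir / raw_file.name
    if working_copy.exists():
        # Refuse to overwrite processing work that is already in progress.
        raise FileExistsError(f"{working_copy} already exists")
    shutil.copyfile(raw_file, working_copy)
    return working_copy
```

For example, copy_raw_to_working(lab_data / "raw" / "results.csv", lab_data / "working") would leave the raw file untouched and return the path of the editable copy, where lab_data and results.csv are placeholder names.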

To set a file as Read-Only: right-click the file > Properties > check the Read-only attribute box > OK.

 

Screenshot of the above directions to set files to Read-only on a Windows computer.
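
The same setting can be applied from a script when many raw files need protecting. The sketch below uses Python's standard library; the function names are hypothetical. On Windows, Python's chmod can only toggle the Read-only attribute, which is the same flag the Properties dialog sets; on macOS and Linux it clears the write permission bits instead.

```python
import stat
from pathlib import Path


def set_read_only(path: Path) -> None:
    """Remove write permission for owner, group, and others."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    path.chmod(path.stat().st_mode & ~write_bits)


def is_read_only(path: Path) -> bool:
    """Return True if the owner write bit is cleared."""
    return not (path.stat().st_mode & stat.S_IWUSR)


def protect_folder(raw_dir: Path) -> int:
    """Set every file directly inside raw_dir to read-only; return the count."""
    count = 0
    for item in raw_dir.iterdir():
        if item.is_file():
            set_read_only(item)
            count += 1
    return count
```

A read-only flag guards against accidental edits, not deliberate ones: anyone with access to the folder can clear it again, so it complements backups rather than replacing them.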

Version control with Git and GitHub

A version control system records changes to files over time. Git is a free and open-source distributed version control system. GitHub is the hosting site we use to interface with Git. Git and GitHub are fundamental to reproducible statistical and data science workflows.

Version control ensures changes are documented and previous versions are accessible if changes must be recalled. Additionally, version control enables robust collaboration across projects. It is useful not only for code projects, but also for documents, presentations, and books (like this DMP!).

Git and GitHub save the revision history of each file, so there is only a single name for each file (e.g., report.docx) instead of report_v01.docx and report_v02.docx. For a reminder on version naming, see the Naming Conventions section.

The screenshot below shows who made commits (i.e., named version histories) and when they were made. From this screen, a user can click on the commit message to view all files that were changed.

Screenshot of GitHub commits for the WaSHI DMP.

After clicking the first commit message, a diff (i.e., a visual of what changed) displays the additions to documentation.qmd highlighted in green and deletions highlighted in red.

Screenshot of GitHub commit message titled 'First complete draft of documentation chapter' which shows the changed file with additions highlighted in green and deletions highlighted in red.

Privacy considerations

Review the Data Sharing section to categorize the data included in the repository to protect grower privacy. If the data are not anonymized and aggregated, either 1) the repository must be set to private or 2) data files and any scripts containing Category 3 data must be added to the .gitignore file.
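
For option 2, entries can be added to .gitignore by hand or with a small helper. The Python below is a sketch only: the function name add_to_gitignore is hypothetical, and the patterns passed to it are whatever paths hold Category 3 data in a given repository.

```python
from __future__ import annotations

from pathlib import Path


def add_to_gitignore(repo_dir: Path, patterns: list[str]) -> list[str]:
    """Append patterns to repo_dir/.gitignore, skipping ones already listed.

    Returns the patterns that were actually added.
    """
    gitignore = repo_dir / ".gitignore"
    text = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    added = []
    for pattern in patterns:
        if pattern not in present:
            added.append(pattern)
            present.add(pattern)
    if added:
        # Start on a fresh line if the existing file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(text + prefix + "\n".join(added) + "\n", encoding="utf-8")
    return added
```

Note that .gitignore only affects files Git is not already tracking. A file that was committed earlier stays in the repository history even after it is listed, so sensitive files should be ignored before the first commit that would include them.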

Git and GitHub resources

GitHub's Official Documentation

Integrating Git with Python Development Environments (IDEs)

Most popular Python IDEs have built-in Git integrations, making version control seamless.

  • VS Code (Visual Studio Code): VS Code has excellent, intuitive Git integration.
  • PyCharm: PyCharm (by JetBrains) also has robust Git and GitHub integration.

Cheat Sheets and Best Practices