LibGuides: Research Data Management: 1.2 Data Organization


Data Organization

Data Types:

Different kinds of research projects generate and collect different kinds of data. Data can generally be grouped into four categories:

Observational
- Usually captured in real time and not in the laboratory
- Often irreplaceable (e.g. a one-time event) and not likely reproducible
- E.g. astronomical observations, sensor readings, sensory observations etc.
Experimental
- Captured in the laboratory under controlled conditions
- Likely reproducible but can be expensive both in time and costs
- E.g. gene sequences, microscopy, chromatograms etc.
Computational/Simulation
- Computer generated from test models
- Likely reproducible if computer inputs are preserved but can be expensive both in time and costs
- E.g. economic models, climate models etc.
Derived
- Produced from existing datasets
- Likely reproducible but can be expensive both in time and costs
- E.g. text and data mining, compiled databases etc.


File Organization

Organizing your data into files and directories is an important part of data management; it saves you time when you need to locate a particular file at a later date. Keep your file organization clear, descriptive, and unique, with a documented naming convention. Unique but clear descriptions of file contents aid precise file identification and discovery. We will consider the following aspects of data/file organization:

Directory structure and folder naming conventions
File naming conventions
File versioning
File formats


Directory Structure & Folder Naming Conventions

Directory Structure/Folder Naming Conventions:

The name of the top-level folder (directory) should include the following descriptors, and folder names should be kept under 32 characters:

Project title
Unique identifier
Date (yyyy or yyyymmdd)

Folder Hierarchy Example: [Project]/[Experiment]/[Instrument Used]

FOLDER SUBSTRUCTURE - The folders/directories within the substructure should be split according to a particular theme; e.g. each folder may contain a run of an experiment or a different version of a particular dataset.
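
The hierarchy above can also be created by script so that every project starts from the same layout. The following is a minimal sketch, not a prescribed tool: the function names, the underscore separator, and the sample values ("SoilStudy", "Run01") are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

MAX_FOLDER_NAME = 32  # folder names should stay under this many characters


def top_level_name(title: str, identifier: str, date: str) -> str:
    """Join project title, unique identifier, and date (yyyy or yyyymmdd)."""
    name = f"{title}_{identifier}_{date}"  # underscore separator is an assumption
    if len(name) >= MAX_FOLDER_NAME:
        raise ValueError(f"{name!r} is not under {MAX_FOLDER_NAME} characters")
    return name


def make_hierarchy(root: Path, project: str, experiment: str, instrument: str) -> Path:
    """Create [Project]/[Experiment]/[Instrument Used] under root."""
    path = root / project / experiment / instrument
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    # A temporary directory keeps the demonstration from touching real data.
    with tempfile.TemporaryDirectory() as tmp:
        project = top_level_name("SoilStudy", "GN7799", "2018")
        leaf = make_hierarchy(Path(tmp), project, "Run01", "G1000")
        print(leaf.relative_to(tmp))
```

Checking the length before creating the folder means an over-long name is caught once, at setup, rather than after data has already been saved under it.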


File Naming Conventions

File Naming:

File names should give people meaningful context for the files and make it possible to identify and distinguish similar files from one another. Here are some key descriptors to consider when deciding on your file names:

Experiment or research project name
Data type
Experimental conditions (e.g. temperature, lab instrument used etc.)
Location of research
Researcher name/initials
Experiment date (or date range)
File version number
Application-specific 3-letter file extension, e.g. .mov, .tif etc.
Filename Example: [Project]_[Instrument]_[Date]_[Version].[ext]
GN7799_G1000_180308_v03.tif
- GN7799 – Experiment/project name
- G1000 – Instrument used
- 180308 – Experiment date
- v03 – File version number
- .tif – 3-letter file extension
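
Building names from their parts in code keeps the pattern consistent across a project. The sketch below assembles the [Project]_[Instrument]_[Date]_[Version].[ext] pattern; the function name `build_filename` is hypothetical, and the date is written as YYMMDD to match the example.

```python
from datetime import date


def build_filename(project: str, instrument: str, when: date,
                   version: int, ext: str) -> str:
    """Assemble [Project]_[Instrument]_[Date]_[Version].[ext]."""
    stamp = when.strftime("%y%m%d")  # YYMMDD, as in the example
    # Two-digit version number with a leading zero: v01, v02, ...
    return f"{project}_{instrument}_{stamp}_v{version:02d}.{ext.lstrip('.')}"


if __name__ == "__main__":
    print(build_filename("GN7799", "G1000", date(2018, 3, 8), 3, "tif"))
    # prints GN7799_G1000_180308_v03.tif
```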

File Naming Tips:

Here are some popular tips on file naming in general:

Dates should follow ISO 8601, i.e. YYYYMMDD. The shorter YYMMDD form is also widely used, but it is not part of the current ISO 8601 standard and leaves the century ambiguous
Keep file names short, because long names are not compatible with all software; aim for a maximum of 32 characters
Avoid special characters in file names, such as: ! @ $ % * ( ) ' ; < > , [ ] { } "
When sequentially numbering files, use leading zeros to guarantee that files sort properly; e.g. 0001, 0002, … 1001 rather than 1, 2, … 1001
Avoid using spaces in file names; instead, use underscores (e.g. file_name), no separation (e.g. filename), dashes (e.g. file-name), or camel case (e.g. FileName)
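
These tips are easy to check automatically before files are shared. The following sketch assumes a stricter rule than the tips themselves: it allows only letters, digits, underscores, and dashes, followed by a single extension. The names `check_filename` and `sequence_label` are hypothetical.

```python
import re

MAX_LENGTH = 32  # maximum length suggested in the tips above
# Assumption: letters, digits, '_' and '-', then one dot and an extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def check_filename(name: str) -> list[str]:
    """Return a list of problems found; an empty list means the name passes."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 32 characters")
    if " " in name:
        problems.append("contains spaces")
    if not SAFE_NAME.fullmatch(name):
        problems.append("not of the form letters/digits/_/- plus one extension")
    return problems


def sequence_label(n: int, width: int = 4) -> str:
    """Pad a sequence number with leading zeros, e.g. 2 -> '0002'."""
    return f"{n:0{width}d}"


if __name__ == "__main__":
    print(check_filename("GN7799_G1000_180308_v03.tif"))  # prints []
    print(check_filename("my data (final)!.tif"))
    # Unpadded numbers sort as text, so "10" lands before "2".
    print(sorted(str(n) for n in (10, 2, 1)))             # prints ['1', '10', '2']
    print(sorted(sequence_label(n) for n in (10, 2, 1)))  # prints ['0001', '0002', '0010']
```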

README Text:

Consider writing a “README.txt” file to accompany your data that explains your naming convention, abbreviations, and any codes used. For more information, see the page on metadata/README.txt files.

Bulk File Renaming:

Renaming large numbers of files (too many to do by hand) is easier with these tools:

Bulk Rename Utility (Windows)
Renamer (Mac)
PSRenamer (Linux, Mac, or Windows, free)
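
For simple cases, a short script can do the same job as these tools. The sketch below is one possible approach and is not related to any of the tools listed: it replaces spaces with underscores, and it defaults to a dry run so the planned changes can be reviewed first. The function names are hypothetical.

```python
from pathlib import Path
import tempfile


def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """Pair each file whose name contains spaces with a space-free target."""
    plan = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and " " in path.name:
            plan.append((path, path.with_name(path.name.replace(" ", "_"))))
    return plan


def apply_renames(plan: list[tuple[Path, Path]], dry_run: bool = True) -> None:
    """Print the planned renames, or carry them out when dry_run is False."""
    for old, new in plan:
        if new.exists():
            # Never overwrite an existing file.
            print(f"skip {old.name}: {new.name} already exists")
        elif dry_run:
            print(f"would rename {old.name} -> {new.name}")
        else:
            old.rename(new)


if __name__ == "__main__":
    # Demonstrate on a temporary folder rather than on real data.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "run 1 results.csv").write_text("a,b\n")
        plan = plan_renames(folder)
        apply_renames(plan)                 # dry run: only prints
        apply_renames(plan, dry_run=False)  # performs the rename
        print(sorted(p.name for p in folder.iterdir()))
```

Work on a copy of the data the first time a rename script is run, since a rename cannot be undone automatically.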


File Versioning

Versioning:

When your research is collaborative, keeping track of changes and versions is an important part of managing your data well. Versioning lets you go back and retrieve a particular version of a file at a later date instead of retracing your steps to recreate it. You can track versions manually with a sequential numbering system, e.g. v01, v02, … v99. You can also use version control software such as SVN. Avoid confusing labels like “revision”, “final”, or “final2”, and remove obsolete versions.

File Versioning Example: DataMgmtNotesv03.txt instead of DataMgmtNotesFinalReally2.txt
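
Manual version numbers are easier to keep consistent when the next name is computed rather than typed. The following is a minimal sketch, assuming names end in "v" plus digits directly before the extension, as in the example; `next_version` and `latest_version` are hypothetical helper names.

```python
import re

# Anything, then "v" and digits, then a single extension such as ".txt".
VERSION = re.compile(r"(.*?)v(\d+)(\.[^.]+)")


def next_version(filename: str) -> str:
    """Return the file name with its version number increased by one."""
    match = VERSION.fullmatch(filename)
    if match is None:
        raise ValueError(f"no version label found in {filename!r}")
    stem, num, ext = match.groups()
    width = max(len(num), 2)  # keep leading zeros: v03 -> v04
    return f"{stem}v{int(num) + 1:0{width}d}{ext}"


def latest_version(filenames: list[str]) -> str:
    """Return the versioned file name with the highest version number."""
    versioned = [f for f in filenames if VERSION.fullmatch(f)]
    return max(versioned, key=lambda f: int(VERSION.fullmatch(f).group(2)))


if __name__ == "__main__":
    print(next_version("DataMgmtNotesv03.txt"))
    # prints DataMgmtNotesv04.txt
    print(latest_version(["DataMgmtNotesv01.txt", "DataMgmtNotesv03.txt",
                          "DataMgmtNotesv02.txt"]))
    # prints DataMgmtNotesv03.txt
```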


File Formats

Ideal File Format Types:

The file format you choose for your research data has implications for long-term use and access. For example, if the format is proprietary, its long-term accessibility is unpredictable because it depends on the success and longevity of the business behind it. Technology changes, so researchers should plan for both hardware and software obsolescence and choose file formats that will remain usable and accessible in the long term. The following guidelines can help you choose an appropriate file format for your research:

Non-proprietary
Uncompressed
Unencrypted
Commonly used by the general research community
Open, documented standards
Using standard character encodings (ASCII, UTF-8)

Preferred File Formats:

Text: XML, PDF/A, HTML, ASCII, UTF-8 (not Word)
Tabular Data: CSV (not Excel)
Still Images: TIFF, JPEG 2000, PDF, PNG, BMP (not GIF or JPG)
Moving Images: MOV, MPEG, AVI, MXF (not Quicktime)
Sounds: WAVE, AIFF, MP3, MXF
Databases: XML, CSV
Statistics: ASCII, DTA, POR, SAS, SAV
Containers: TAR, GZIP, ZIP
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Web Archive: WARC
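
A quick scan of a project folder can flag files that are stored in formats the list advises against. The sketch below is illustrative only: the mapping from format names to file extensions is an assumption, and it covers only a hand-picked subset of the list.

```python
from pathlib import Path

# Assumption: common extensions for a subset of the preferred formats above.
PREFERRED = {".xml", ".csv", ".txt", ".html", ".tif", ".tiff", ".png",
             ".wav", ".warc"}
# Assumption: common extensions for formats the list marks with "not".
DISCOURAGED = {".doc", ".docx", ".xls", ".xlsx", ".gif", ".jpg"}


def classify(filename: str) -> str:
    """Return 'preferred', 'discouraged', or 'unlisted' for a file name."""
    ext = Path(filename).suffix.lower()
    if ext in DISCOURAGED:
        return "discouraged"
    if ext in PREFERRED:
        return "preferred"
    return "unlisted"


if __name__ == "__main__":
    for name in ("results.xlsx", "results.csv", "notes.md"):
        print(name, classify(name))
```

A result of "unlisted" only means the extension is missing from this sketch's tables, not that the format is unsuitable.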

Oregon State University has a table of other acceptable formats in addition to the preferred file formats.
