A beginner-friendly guide for researchers to write clean, reproducible code.
Overview
This guide is for anyone writing code - even if you have zero formal programming training. No prior experience required. Simple, practical tips that apply to any language (Python, R, MATLAB, etc.).
Why This Guide Exists 💭
Good code is like a well-organized lab notebook:
- Easier for you to understand later.
- Easier for others to reproduce your work.
- Ready for journals and funding agencies requiring open and reproducible research.
How to Use This Guide
- Start with Introduction if you’re new to coding.
- Jump to any principle for quick tips and examples.
- If you need a quick reference, this cheat sheet summarises all the key concepts!
Further Reading (Extra resources)
Introduction
Welcome! A General Coding Guide that’s made for YOU!
If you’ve ever opened a code file and thought:
> “This looks like a different language…”
>
> “I’m not sure I can do this.”
>
> “How do I use this?”
You’re not alone. Many researchers (biologists, psychologists, economists) learn coding on the go without any formal training. That means you’re learning the hard way, often under pressure. This guide is here to make that easier.
❌ You do NOT need to be an expert ❌
❌ You do NOT need prior experience ❌
✅ You just need to start small ✅
What You’ll Learn
This resource will show you simple principles that apply to any programming language (Python, R, MATLAB, JavaScript, etc.) to make your code:
- Readable → So future you knows what you meant (even when you come back after a few weeks, months, or even years!)
- Organised → So your projects don’t feel like chaos (Just like your bedroom — let’s know where everything is!)
- Reusable → So you never have to rewrite the same thing twice (You can share it with others so they can try your code too without getting lost!)
Think of it as a lab notebook for your code.
The goal is to help you write good code, not turn you into a software engineer.
What Makes Code Good?
Good code isn’t about being clever or using the shortest syntax. It’s about writing code that is:
- Clear → Anyone (including future you) can understand it
- Readable → Others can follow the logic without guessing
- Maintainable → Easy to modify without breaking everything
- Structured → Organised into logical parts
Why Good Code Matters
Good code is all about making your work easier to understand and reuse. It’s not about how fancy it looks.
This is becoming more important because:
- You’ll probably need to revisit your code later. Ever looked at an old script and thought, “What was I even doing here?” Clean code saves you from decoding your own logic.
- Collaboration is easier when code is readable. If a colleague needs to reproduce your results, they shouldn’t need to keep asking you where everything is or how to follow your work.
- Reproducibility is expected. In research projects, more journals and funders are asking for code alongside publications. If you follow good practices, sharing your code later will be much easier.
- Less stress. If something breaks (and it usually will), well-structured code makes debugging quicker.
Think of your code like a lab notebook.
Would you write your experimental steps in a messy way that no one can follow? Probably not. Your code deserves the same clarity so others can also use it.
Examples
Heads up: Both do the same thing, but the second is a lot more self-explanatory…
Bad (short, but cryptic): ❌
x = [3,4,5,6]; y=sum(x)/len(x); print(y)
Good (structured and readable): ✅
# Calculate mean height of samples
sample_heights = [3, 4, 5, 6]
mean_height = sum(sample_heights) / len(sample_heights)
print(mean_height)
Why is the second better? 💭
- Variables describe what they store (sample_heights, mean_height)
- There’s a short comment explaining the purpose
- Steps are on separate lines for readability
❌ What to Avoid ❌
- ❌ Hard-to-follow logic
- ❌ Unnecessary complexity
- ❌ Giant scripts without sections
💡 Tip:
When in doubt:
> Code for humans first, computers second.
Computers don’t care about messy code but future you will!
Every time you write code, ask yourself:
> “Would future me understand this in six months?”
If the answer is no, clean it up.
🏷 Naming Things Well
Why Naming Matters
Imagine opening a script and seeing this:
a = 10
b = 20
c = a + b
What does this even do??? Who knows….
Now compare this:
apples = 10
oranges = 20
total_fruit = apples + oranges
Immediately clear, right?
Names are like labels that make your code self-explanatory.
Here are some Golden Rules for Naming
✔ Be descriptive, not cryptic
- ❌ Bad: a, b, c
- ✅ Good: mean_height, patient_age
✔ Be consistent
Choose one naming convention and stick to it throughout:
- snake_case → Common in Python, R (total_sales)
- camelCase → Common in JavaScript (totalSales)
- PascalCase → Common in object-oriented programming, e.g. Python classes, C# (DataFrame)
✔ Avoid misleading names
- Don’t name a variable temp if it stores humidity values!
- Or ‘data’… WHAT DATA??
✔ Keep it short but clear
avg_score is better than average_score_of_all_exam_results.
Let’s Compare some Bad vs Good Examples
(So you get what we mean!)
Example 1 (in R):
# Bad
x <- 0.05

# Good
significance_threshold <- 0.05
Example 2 in Python:
# Bad
pd1 = pd.read_csv("data.csv")

# Good
patient_data = pd.read_csv("data.csv")
📌 Tips
- Use nouns for variables (patient_age)
- Use verbs for functions (calculate_mean)
- Avoid single-letter variables except for loop counters (i, j in short loops), where they’re common
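To make the tips above concrete, here is a minimal Python sketch. The names patient_ages and calculate_mean_age, and the ages themselves, are made up for illustration; they are not from a real dataset.

```python
# Hypothetical example data: the name is a noun that says what it stores.
patient_ages = [34, 51, 27, 45]


def calculate_mean_age(ages):
    """Return the mean of a list of ages (the function name starts with a verb)."""
    return sum(ages) / len(ages)


# A single-letter name is fine for a short loop counter.
for i in range(len(patient_ages)):
    print(i, patient_ages[i])

mean_age = calculate_mean_age(patient_ages)
print(mean_age)
```

Reading the last two lines aloud ("mean age is calculate mean age of patient ages") is a quick way to check that the names make sense.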
✍️ Writing Readable Code
Why Readability Matters
Code should run the same whether it’s perfectly formatted or a confusing mess, but you’re not writing for the computer, you’re writing for humans (including future you).
Readable code:
- Reduces errors
- Makes debugging easier
- Helps collaborators understand your logic without asking 10 questions
What Makes Code Readable?
✔ Consistent indentation
✔ Logical spacing between sections
✔ Short lines (80–100 characters)
✔ Group related code into chunks
Let’s see an example:
Bad (hard to read):
import pandas as pd;df=pd.read_csv("data.csv");df.dropna(inplace=True);print(df.describe())
Good (structured and spaced):
import pandas as pd

# Load dataset
data = pd.read_csv("data.csv")

# Remove missing values
data.dropna(inplace=True)

# Display summary statistics
print(data.describe())
See the difference? The second example is easier to scan and understand.
Time-Saving Tools
Instead of formatting everything manually:
- Python: black or autopep8
- R: RStudio’s built-in formatting (shortcut: Ctrl+Shift+A)
- JavaScript: Prettier
These tools auto-format your code so you don’t waste time fixing spacing or line breaks.
📝 Commenting and Documenting
Why Comments Matter
Comments are your way of talking to your future self and collaborators.
Code explains what is happening, but comments explain why.
Without them, you might look at your script in 6 months and wonder:
> “What was I thinking here?”
Golden Rules for Commenting
✔ Explain “why”, not “what”
- ❌ Bad: # Add 1 to x
- ✅ Good: # Correct for baseline offset
✔ Keep it short and relevant
Avoid long essays. A single line often does the job.
✔ Use section headers
Break your script into steps using comments:
# Step 1: Load data
# Step 2: Clean data
# Step 3: Analyse and visualise
✔ Update comments when code changes
Outdated comments can mislead people more than no comments.
✅ Example: Bad vs Good
Bad commenting (R):
# Calculate p-value
p = 0.04
Good commenting (R):
# Significance threshold for t-test
p_value <- 0.04
Document Your Project
Beyond comments in code:
- README file: Explains what the code does and how to run it.
- Docstrings: In Python, use triple quotes inside functions:
def calculate_mean(values):
    """
    Calculate the mean of a list of numeric values.
    """
    return sum(values) / len(values)
💡 Tip:
If someone new opened your project today, could they:
1. Understand what it does?
2. Know how to run it?
If not, add comments or a README.
🧩 Breaking Code Into Chunks
Why Break Code Into Chunks?
Writing your entire research paper as one giant paragraph would make it boring and strenuous to read, and that is what your code looks like if you don’t break it into sections. If your script is 500 lines long with no breaks, debugging becomes a nightmare.
Breaking code into chunks:
- Makes it easier to read and maintain
- Allows you to reuse parts without rewriting everything
- Helps with testing: smaller pieces mean easier troubleshooting
How to Do It
✔ Use functions for repeated tasks
✔ Group related steps together
✔ Separate logic into modules or scripts (e.g., data_cleaning.py, analysis.py)
Here’s an example:
Bad (repeated code):
# Calculate average
a = sum([1, 2, 3]) / 3
print(a)

# Calculate another average
b = sum([4, 5, 6]) / 3
print(b)
Good (use a function):
def calculate_mean(values):
    """Calculate the mean of a list of numbers."""
    return sum(values) / len(values)

print(calculate_mean([1, 2, 3]))
print(calculate_mean([4, 5, 6]))
Organising Your Project
Instead of one huge script, structure your project like this:
/project/
  data/
  scripts/
    data_cleaning.py
    analysis.py
  output/
  README.md
This keeps things modular and tidy — like labelled drawers instead of a messy box.
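If you would rather not create these folders by hand each time, a small script can do it. This is only a sketch: the helper name create_project_skeleton and the README text are assumptions, and the demo runs in a temporary folder so it does not touch your real files.

```python
import tempfile
from pathlib import Path

# Folder names follow the layout shown above.
PROJECT_FOLDERS = ["data", "scripts", "output"]


def create_project_skeleton(project_root):
    """Create the standard folders and a starter README (hypothetical helper)."""
    root = Path(project_root)
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# Project title\n\nWhat this project does and how to run it.\n")
    return root


# Demo: build the layout inside a temporary folder and list what was created.
with tempfile.TemporaryDirectory() as temporary_folder:
    project = create_project_skeleton(Path(temporary_folder) / "project")
    created = sorted(item.name for item in project.iterdir())
    print(created)
```

To use it for a real project, you would call create_project_skeleton with the folder you want, for example create_project_skeleton("my_project").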
🧠 Core Coding Principles (DRY, KISS, YAGNI)
Why These Principles Matter
These aren’t just buzzwords from software engineering — they make life easier for researchers too.
Following these rules means:
- Less time debugging
- Less repeated effort
- Cleaner, more reproducible scripts
The Big Three Principles
1. DRY — Don’t Repeat Yourself
Instead of repeating the same logic multiple times, turn it into a function or reuse a script.
Bad: ❌
# Calculate two averages
a = sum([1, 2, 3]) / 3
b = sum([4, 5, 6]) / 3
Good: ✅
def calculate_mean(values):
    return sum(values) / len(values)

print(calculate_mean([1, 2, 3]))
print(calculate_mean([4, 5, 6]))
2. KISS — Keep It Simple, Stupid
Don’t over-engineer. Your goal isn’t to impress other programmers - it’s to make your code understandable.
Bad (too fancy): ❌
# Nested one-liner that's hard to read
print(sum([x for x in [1, 2, 3] if x % 2 == 1]) / len([x for x in [1, 2, 3] if x % 2 == 1]))
Good (clear and simple):✅
# Calculate mean of odd numbers
odd_numbers = [x for x in [1, 2, 3] if x % 2 == 1]
mean_odd = sum(odd_numbers) / len(odd_numbers)
print(mean_odd)
3. YAGNI — You Aren’t Gonna Need It
Don’t write code “just in case.” Extra complexity = extra bugs.
Bad (R):❌
# Writing a function for a feature you might never use
fancy_plot_generator <- function(data, use_3d=FALSE, color_gradient="blue-red", add_labels=TRUE, ...) {
  # Overkill for a simple plot
}
Good (R):✅
# Simple and works for now
plot(data)
Researcher-Friendly Analogy
Think of these rules like lab work:
- DRY: If you’re writing the same protocol twice, make a template.
- KISS: Don’t build a rocket when you just need a test tube.
- YAGNI: Don’t buy a centrifuge for a project that only needs a pipette.
⚠️ Avoiding Common Mistakes
Why This Matters
Most coding errors aren’t caused by “advanced” or complex problems! They usually come from simple mistakes like syntax errors. These are easy to fix once you know what to look for.
❌ Mistake 1: Hardcoding File Paths
Researchers often write absolute paths like:
data = pd.read_csv("C:/Users/Parvathy/Desktop/PhD/data.csv")
This works only on your computer. If someone else runs your code (or you switch machines), it fails.
✅ Fix: Use relative paths
data = pd.read_csv("./data/data.csv")
or:
import os
data_path = os.path.join("data", "data.csv")
data = pd.read_csv(data_path)
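Another option from Python’s standard library is pathlib, which builds paths with the / operator. The sketch below assumes your data sits in a data folder inside the project and that you run the script from the project folder; DATA_DIR is a name chosen for illustration.

```python
from pathlib import Path

# Assumption: the script is started from the project folder,
# and the data file lives in a "data" folder inside it.
DATA_DIR = Path("data")
data_path = DATA_DIR / "data.csv"

# The path is relative, so it does not depend on any one person's computer.
print(data_path)
print(data_path.is_absolute())  # False
```

pandas accepts a Path object directly, so pd.read_csv(data_path) works the same way as with a string.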
❌ Mistake 2: Over-commenting
Bad:
x = x + 1  # Add 1 to x
This adds no value. Your code should explain itself through clear names, not cluttered comments.
✅ Better:
# Adjust for baseline offset
adjusted_value = raw_value - baseline
❌ Mistake 3: Using Cryptic Variable Names
Bad (R):
a <- 0.05
What does a mean? Future-you won’t remember.
✅ Better (R):
significance_threshold <- 0.05
❌ Mistake 4: Giant Scripts With No Structure
A 500-line script without sections is like a thesis with no chapters.
✅ Fix: Use headers and functions:
# Step 1: Load data
# Step 2: Clean data
# Step 3: Analyse
❌ Mistake 5: Nested Loops Without Explanation
Hard to follow:
for i in range(10):
    for j in range(10):
        do_something(i, j)
✅ Fix: Add context or break into helper functions:
for sample in samples:
    process_sample(sample)
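Here is one way the helper-function idea could look in full. The data and the name summarise_sample are invented for this sketch; the point is that the inner loop moves into a named function, so the outer loop reads like a sentence.

```python
# Hypothetical data: each sample has several repeated measurements.
samples = {
    "sample_a": [1.0, 2.0, 3.0],
    "sample_b": [4.0, 5.0, 6.0],
}


def summarise_sample(measurements):
    """Return the mean of one sample's measurements."""
    total = 0.0
    for value in measurements:  # the inner loop now lives inside the helper
        total += value
    return total / len(measurements)


sample_means = {}
for sample_name, measurements in samples.items():  # outer loop: one sample at a time
    sample_means[sample_name] = summarise_sample(measurements)

print(sample_means)
```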
❌ Mistake 6: Forgetting to Save Random Seeds
When working with random processes (bootstrapping, ML), forgetting to set a seed means results will change every time.
✅ Fix:
import numpy as np
np.random.seed(42)
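To see what a seed actually does, the sketch below resamples a small list twice with the same seed and checks that both runs match. It uses Python’s built-in random module rather than NumPy so it runs without extra packages; the helper name draw_bootstrap_sample and the values are assumptions for illustration.

```python
import random


def draw_bootstrap_sample(values, seed):
    """Resample values with replacement using a fixed seed (hypothetical helper)."""
    rng = random.Random(seed)  # a separate generator, so the seed is explicit
    return [rng.choice(values) for _ in values]


heights = [3, 4, 5, 6]
first_run = draw_bootstrap_sample(heights, seed=42)
second_run = draw_bootstrap_sample(heights, seed=42)

# Same seed, same data, same result.
print(first_run == second_run)  # True
```

Writing the seed down (in the script or the README) is what lets someone else get the same numbers you did.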
❌ Mistake 7: Ignoring Error Messages
Many researchers panic when they see red text and don’t read the actual error.
Error messages usually tell you:
- What went wrong (e.g., FileNotFoundError)
- Where it happened (file and line number)
💡 Tip: Read the line that names the error (in Python this is the last line of the traceback), copy it, and Google it.
📓 Write Code Like a Lab Notebook
Why?
Your research workflow is already structured when you keep lab notes, log experimental steps, and label results clearly.
Your code should follow the same logic.
How to Structure Your Code
Think of your script as a step-by-step protocol.
Use clear sections with comments or headers.
Suggested Structure
# Step 1: Import libraries
# Step 2: Load data
# Step 3: Clean and prepare data
# Step 4: Analyse
# Step 5: Visualise
# Step 6: Save results
Example Template (Python)
# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Load data
data = pd.read_csv("./data/experiment.csv")

# Step 3: Clean data
data.dropna(inplace=True)

# Step 4: Analyse
mean_value = data['height'].mean()

# Step 5: Visualise
plt.hist(data['height'])
plt.show()

# Step 6: Save results
data.to_csv("./output/cleaned_data.csv", index=False)
Organising Your Project Directory
Instead of keeping everything in one folder, use a clear structure:
/project/
  data/       # Raw data files
  scripts/    # Your code
  output/     # Results and plots
  README.md   # What this project does and how to run it
🔄 Reproducibility in Practice
Why Reproducibility Matters
Science depends on the ability to repeat results.
If your code can’t reproduce the same outputs tomorrow (or on a collaborator’s machine), your findings are at risk.
Core Principles of Reproducible Code
✔ Same input → same output
Running the same script with the same data should always give the same result.
✔ Minimal manual steps
Scripts should run from start to finish without manual clicks or edits.
✔ Self-contained projects
Include all files (or instructions to get them), a README, and dependency info so others can run your code easily.
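One common way to get a script that runs from start to finish is to put each step in a function and call them in order from a single main function. The sketch below is an illustration only: the function names are made up, and load_data returns invented numbers instead of reading a real file.

```python
def load_data():
    """Stand-in for reading a file; returns made-up heights (None marks a missing value)."""
    return [3.0, 4.0, None, 5.0, 6.0]


def clean_data(values):
    """Drop missing values."""
    return [value for value in values if value is not None]


def analyse(values):
    """Return the mean of the cleaned values."""
    return sum(values) / len(values)


def main():
    """Run every step in order, with no manual clicks or edits in between."""
    raw_heights = load_data()
    cleaned_heights = clean_data(raw_heights)
    mean_height = analyse(cleaned_heights)
    print(f"Mean height: {mean_height}")
    return mean_height


if __name__ == "__main__":
    main()
```

Running the file once then repeats the whole analysis, which is exactly what a collaborator needs.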
📝 Checklist for Reproducibility
- ✓ Use relative paths, not hardcoded ones
# ❌ Bad
data = pd.read_csv("C:/Users/Parvathy/Desktop/PhD/data.csv")

# ✅ Good
data = pd.read_csv("./data/experiment.csv")
- ✓ Save random seeds for consistent results
import numpy as np
np.random.seed(42)
- ✓ Record software versions
  - Python: pip freeze > requirements.txt
  - R: sessionInfo()
- ✓ Avoid interactive manual steps. Don’t require someone to “click” or “select” something for the script to run.
- ✓ Include all required files. If your code depends on data.csv, make sure it’s in the repo (or provide a download link).
- ✓ Use standard formats. Prefer .csv for data over Excel .xlsx (to avoid version issues).
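For the “record software versions” item, you can also print versions from inside Python, which is handy for pasting into a README or log. This is a sketch using only the standard library (importlib.metadata needs Python 3.8 or newer); the helper name describe_environment and the package list are assumptions.

```python
import platform
from importlib import metadata


def describe_environment(package_names):
    """Return the Python version and the installed version of each named package."""
    report = {"python": platform.python_version()}
    for name in package_names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# The package names below are examples; list the ones your project uses.
environment = describe_environment(["pandas", "numpy"])
for name, version in environment.items():
    print(f"{name}=={version}")
```

This complements requirements.txt rather than replacing it: the file is what others install from, the printout is a quick record of what you actually ran.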
Example Project Setup
/project/
  data/              # Raw data files
  scripts/           # Code scripts
  output/            # Processed data and results
  requirements.txt   # List of software dependencies
  README.md          # Instructions to run everything
💡 Tip:
Before sharing code, test it on a clean machine or environment.
If it works there, it is much more likely to work for others.
🔢 Magic Numbers & Constants (and Writing Checks)
What Are “Magic Numbers”?
A magic number is a number that appears in your code without context.
Example (Python):
if p < 0.05:
    print("Significant")
Why 0.05? If someone else reads your code (or you revisit it later), they won’t know what that number means.
❌ Why This Is a Problem
- Hard to understand
- Easy to break if you need to change it in multiple places
- Makes your code less reusable
✅ Fix: Use Named Constants
Define important numbers or thresholds once at the top of your script.
Python:
SIGNIFICANCE_THRESHOLD = 0.05

if p_value < SIGNIFICANCE_THRESHOLD:
    print("Significant")
R:
SIGNIFICANCE_THRESHOLD <- 0.05

if (p_value < SIGNIFICANCE_THRESHOLD) {
  print("Significant")
}
This makes your code:
✔ Easier to read
✔ Easier to maintain (change it in one place)
Add Simple Checks (Validation)
Small checks prevent big headaches.
Python:
assert len(data.columns) > 1, "Data should have more than one column"
R:
stopifnot(ncol(data) > 1)
Why Checks Are Important
- Catch errors early
- Avoid wasting time running broken code
- Make your code more robust
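As a slightly fuller illustration, a check can live in its own small function and raise an error with a message that says what to fix. The function name check_heights, the rules, and the values are assumptions for this sketch, not a recommended standard.

```python
def check_heights(heights):
    """Raise a clear error if the data is not usable (hypothetical checks)."""
    if len(heights) == 0:
        raise ValueError("No height values found: is the data file empty?")
    for value in heights:
        if value <= 0:
            raise ValueError(f"Height must be positive, got {value}")
    return heights


# Good data passes straight through.
valid_heights = check_heights([3, 4, 5, 6])
print(len(valid_heights))

# Bad data stops early with a readable message instead of a confusing result later.
try:
    check_heights([3, -4, 5])
except ValueError as error:
    print(f"Check failed: {error}")
```

One design note: Python skips assert statements when run with the -O option, so raising an error explicitly, as here, is the safer choice for checks you always want to run.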
💡 Tip: Think of constants and checks like labels and safety checks in your lab that prevent accidents.
🛠 Reading Error Messages & Using Documentation
When you see an error in your code, do you:
- ❌ Panic?
- ❌ Close your laptop?
- ❌ Think, “I’m bad at coding”?
- ❌ HOWWWWW! It was working a second ago!
Stop right there….
Errors are normal — every coder sees them every day.
The key is learning how to read them and use them to fix your code.
Golden Rule
Error messages are your friend, not your enemy.
They usually tell you:
1. What went wrong (the error type)
2. Where it happened (line number)
3. Sometimes, why it happened (a hint)
Example in Python
FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
What does this mean?
- The script tried to open data.csv, but it wasn’t there.
Fix:
- Check if the file exists in the right folder.
- Check your file path spelling.
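Both checks can be done from Python itself. The sketch below is one possible approach: the helper name explain_missing_file is made up, and it simply reports whether the file exists and which folder Python is currently looking in.

```python
from pathlib import Path


def explain_missing_file(file_path):
    """Return a short hint about whether a file can be found (hypothetical helper)."""
    path = Path(file_path)
    if path.exists():
        return f"Found {path}"
    # Path.cwd() is the folder Python treats as the starting point for relative paths.
    return f"Cannot find {path}. Python is looking in: {Path.cwd()}"


print(explain_missing_file("data.csv"))
```

If the folder in the message is not your project folder, that usually explains the error: the path is fine, but the script was started from somewhere else.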
Example in R
Error in read.csv("data.csv") : cannot open file 'data.csv'
Same idea — file path issue.
How to Handle Errors Like a Pro
✔ Step 1: Read the line that names the error (the first line in R, the last line of a Python traceback).
✔ Step 2: Look for the file name or variable name.
✔ Step 3: Copy the error message (not your whole screen) and Google it.
✔ Step 4: Check official docs or Stack Overflow.
Documentation is Your Best Friend
Every language has official documentation:
- Python: https://docs.python.org
- R: ?function_name in R console
- Pandas, NumPy: https://pandas.pydata.org/docs
💡 When you Google, add the language name:
"TypeError unsupported operand type" Python
💡 Tip:
Learning to debug is like learning to troubleshoot in the lab: it’s all part of the process, not a failure.
Good Coding Habits in a Nutshell
Use this as a quick reminder when writing code:
General Principles
- Code for humans first, machines second.
- One clear purpose per function or script.
- Use descriptive names (not a or temp).
Structure
Organise scripts like a lab protocol:
# Step 1: Import libraries
# Step 2: Load data
# Step 3: Clean data
# Step 4: Analyse
# Step 5: Save results
Project layout:
/project/
  data/
  scripts/
  output/
  README.md
Core Rules
- DRY: Don’t Repeat Yourself (reuse code, write functions).
- KISS: Keep It Simple, Stupid (avoid complexity).
- YAGNI: You Aren’t Gonna Need It (don’t add features you don’t need).
Formatting
- Consistent indentation and spacing.
- Blank lines between sections.
- Auto-format with:
  - Python → black
  - R → RStudio (Ctrl+Shift+A)
  - JavaScript → Prettier
Reproducibility
- Use relative paths (./data/file.csv), not C:/Users/....
- Save random seeds (np.random.seed(42) or set.seed(42)).
- Keep dependencies documented (requirements.txt or sessionInfo() in R).
Constants & Checks
Replace magic numbers with named constants:
SIGNIFICANCE_THRESHOLD = 0.05
Add sanity checks:
assert len(data.columns) > 1, "Data should have more than one column"
🛠 Emergency Fixes: Common Problems & Quick Solutions
Problem: Code won’t run
Check for:
- Missing brackets () or colons :
- Incorrect indentation
- Typos in variable names
Problem: “File Not Found”
Check:
- Is the file in the correct folder?
- Are you using the correct relative path? (./data/file.csv)
Problem: “Object Not Defined”
Fix:
- Did you spell the variable name correctly?
- Did you run the cell/script where it was defined?
Problem: Weird output
Check:
- Data types (e.g., numbers stored as text)
- Missing values (NaN or NA)
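Both problems are easy to spot with a few lines of Python. In this sketch the values are made up and arrive as text, with "NA" marking a missing value; to_number is a hypothetical helper, and real data may use a different marker.

```python
import math

# Made-up values that were read in as text, with one missing entry.
raw_values = ["3.5", "4.0", "NA", "5.5"]


def to_number(text):
    """Convert text to a float; return NaN for the missing-value marker 'NA'."""
    if text == "NA":
        return math.nan
    return float(text)


numbers = [to_number(value) for value in raw_values]
present = [number for number in numbers if not math.isnan(number)]

print(type(raw_values[0]).__name__)  # str, so maths on it would misbehave
print(len(numbers) - len(present))   # 1 missing value
print(sum(present) / len(present))   # mean of the values that are present
```

If you use pandas, printing data.dtypes and data.isna().sum() gives the same two pieces of information for a whole table.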
Problem: Random results every time
Fix:
- Set a random seed.
💡 Pro Tip:
If all else fails:
1. Read the error message carefully.
2. Copy the line that names the error into Google (with your language name).
3. Check official docs or Stack Overflow.
Comments
- Explain why, not what.
- Use headers for steps: # Load data, # Clean data.
Further Reading (Extra resources)
View Cheat Sheet || Download Cheat Sheet (JPG)
Back to Homepage || Introduction to Git & GitHub || Guide to Sample Size Calculations
About This Guide
- Author: Parvathy Sureshkumarnair
- Part of the Research Skills Toolkit
- Funded by Cardiff University Research Culture Fund
- View on GitHub || Report Issues
Last updated: [01 Aug 2025] | Licensed under MIT