Biostatistics Part III: Statistical Analyses

We did biostats once upon a time in two previous episodes around the beginning of our podcast series. If you’d like to take a listen, here are the links: Part I and Part II.

Let’s do a quick review and discuss the different types of studies that we could do:

  • Exposure (or intervention): risk factor whose effect is being studied 

    • Also can be referred to as the “independent” or “predictor” variable 

  • Outcome: something that develops as a consequence of exposure (or intervention)

    • Also referred to as the predicted or dependent variable (or variables) 

  • First study category: temporality

    • Retrospective studies 

      • Means that the outcomes or the dependent variables (and likely independent variables!) have already occurred, or you have that data already  

    • Prospective studies 

      • Means that the outcomes or dependent variables (or even independent variables) have not yet been measured 

  • Second category: descriptive vs analytical

    • Descriptive studies - where you’re merely trying to describe data on one or more characteristics of a group of individuals; these types of studies don’t usually try to answer a question or establish a relationship between variables:

      • Case report

      • Case series 

      • Cross-sectional studies 

    • Analytical - attempt to test a hypothesis and establish a relationship between variables:

      • Observational - studies where a researcher is documenting a naturally occurring relationship between exposure and outcome 

        • Case-control studies - first determine if the outcome is present (e.g., cases of lung cancer vs. cases where there is no lung cancer) and then trace the presence of prior exposure to a risk factor (e.g., tobacco use)

        • Cohort studies - first determine the exposure to a risk factor and then assess whether the outcome occurs at a future time point 

      • Experimental - the researcher actively performs an intervention in some or all members of a group 

        • Remember: only experimental studies can establish a causal relationship; observational studies can show correlation, but not reliably show causation! 

  • It is important to know that you can have both retrospective and prospective observational studies, but experimental studies are all prospective.

(c) CREOGS OVER COFFEE, 2022

The next step is, let’s say you have your study and you’ve collected your data… now… HOW DO I ANALYZE IT ALL? Dr. Rebecca Hamm at UPenn has shared with us this crazy but excellent flow chart to figure it out. While we won’t hit everything in the podcast, we’ll hit some of the more common tests and the first few questions of the flowchart.

Courtesy of Rebecca Hamm, MD

First question: What type of data do you have? 

  • Continuous (example: age, BMI, weight, etc.)

    • See second question  

  • Categorical (Gender; Yes/No; Category 1, 2, or 3)

    • You can use a Chi-Square test!

      • In simple terms, a Chi-Square (or Pearson’s Chi Square test) is going to determine if there is a statistically significant difference between expected frequencies and the observed frequencies in one or more categories of a contingency table

      • In your contingency table, if any cell has an expected count <5, then you should use a Fisher’s exact test instead

  • Second question: If you have a continuous variable, do you have parametric or nonparametric data? 

    • Parametric basically means: 

      • You have independent, unbiased samples 

        • Independence (in statistics terms) basically means the occurrence of one thing does not affect the probability of the occurrence of another thing 

      • The data is normally distributed 

        • How do you figure that out? Easiest way - create a histogram to check 

        • Harder way: there are many statistical tests of skewness that we won’t describe here! 

      • Equal variances 

        • Basically, variance is a statistical measurement of the spread between numbers in a data set, or how far each number in the set is from the mean (average) 

        • The square root of variance is standard deviation (you’ve probably heard of that!) 

        • Therefore, equal variance means that in order for us to consider data parametric, we have to assume that the variance is the same for both populations we are comparing 

      • Third question: If you have a continuous variable, what type of question are you asking? 

        • I want to know about relationships:

          • If you have a true independent variable, you can use a regression analysis

            • Example: a linear regression, where you actually have an equation and an R^2 value

            • Doesn’t have to be linear relationship - other forms of regression exist.

          • If you don’t have a true independent variable, then we have to do a correlation analysis  

          1. If parametric: Pearson’s r  

          2. Nonparametric: Spearman’s Rank Correlation 

        • I want to know about differences between the means of my groups: 

          • How many treatment groups do you have? 

            • If two

              1. If parametric, can use Student’s t-test (paired or unpaired) 

              2. If nonparametric, can use the Mann-Whitney U test (also called the Wilcoxon rank-sum test) 

            • If more than two: 

              1. If parametric, can use an ANOVA 

              2. If nonparametric, can use Kruskal-Wallis test 
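If it helps to see the questions above as logic, here is a minimal sketch in Python. It only covers the branches listed in these notes, not the full flowchart, and the function name `choose_test` and all of its parameter names are made up for illustration.

```python
def choose_test(data_type, parametric=None, question=None,
                n_groups=None, true_independent_variable=False,
                small_cell_counts=False):
    """Return a test name for the branches covered in these notes (a sketch)."""
    # First question: what type of data do you have?
    if data_type == "categorical":
        # A contingency table with small cells points to Fisher's exact test.
        return "Fisher's exact test" if small_cell_counts else "Chi-Square test"
    if data_type != "continuous":
        raise ValueError("data_type must be 'categorical' or 'continuous'")

    # Third question: relationships, or differences between groups?
    if question == "relationship" and true_independent_variable:
        return "Regression analysis"

    # Every remaining branch depends on the second question.
    if parametric is None:
        raise ValueError("say whether the data are parametric")

    if question == "relationship":
        return "Pearson's r" if parametric else "Spearman's rank correlation"
    if question == "difference":
        if n_groups == 2:
            return ("Student's t-test" if parametric
                    else "Mann-Whitney U (Wilcoxon rank-sum) test")
        if n_groups is not None and n_groups > 2:
            return "ANOVA" if parametric else "Kruskal-Wallis test"
        raise ValueError("n_groups must be 2 or more")
    raise ValueError("question must be 'relationship' or 'difference'")


print(choose_test("categorical"))
print(choose_test("continuous", parametric=False,
                  question="difference", n_groups=2))
print(choose_test("continuous", parametric=True,
                  question="difference", n_groups=3))
```

The order of the checks mirrors the order of the questions, so a categorical variable never reaches the parametric question at all.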

Examples

  •  Let’s try and figure out what the best statistical test is for the following situations! 

    • What is the frequency of repeat hypertensive disease of pregnancy in patients who took low dose aspirin vs. those that did not take low dose aspirin?

      • Questions you’ll want to ask: is this categorical or continuous? 

        • Categorical! Hypertensive disease is a “yes” or “no” in this case 

        • Therefore, we will want to use a Chi-Square test.

    • What is the gestational age at which patients with short cervix delivered if they got a cerclage or not? 

      • Question you want to ask: is gestational age at delivery categorical or continuous? 

        • Continuous! 

      • Now… is gestational age at delivery going to give us a parametric data set? Let’s see!

        • Is it independent: yes – the gestational age at which one person delivers should not affect the gestational age at which another person delivers in this data set.

        • Is it normal? Nope! – just going to give this one to you, but gestational age at delivery is not normally distributed (lots of people will deliver right around 39-40 weeks, and then there is a long, skewed tail of those who deliver very early, like 24 weeks, etc.) 

        • So we have a continuous, non-parametric set of data 

        • Next question: do we want to know relationship or difference of means? Difference of means!

          • So we can use a Wilcoxon Rank Sum test  

    • Is there a difference in admission hemoglobin between patients who received iron supplementation during pregnancy or not?

      • Question: Is Hgb a continuous or categorical variable? Continuous! 

      • Question: Is Hgb a parametric data set? For our purposes, let’s say yes! 

      • Question: Do we want to know a relationship or a difference of means? Difference of means 

        • So we can use Student’s t-test  
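To make the Chi-Square idea of "expected vs. observed frequencies" concrete, here is a rough Python sketch for a 2x2 table like the low dose aspirin example. The counts are completely made up for illustration and are not from any study. The function name `chi_square_2x2` is hypothetical, and the sketch assumes no continuity correction, which statistical software may apply by default.

```python
import math


def chi_square_2x2(table):
    """Pearson's Chi-Square for a 2x2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count in each cell if exposure and outcome were unrelated.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # A 2x2 table has 1 degree of freedom, where the upper-tail
    # probability of the Chi-Square distribution is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected


# HYPOTHETICAL counts, invented for illustration only.
# Rows: took low dose aspirin, did not take it.
# Columns: repeat hypertensive disease yes, no.
counts = [[10, 90], [20, 80]]
statistic, p_value, expected = chi_square_2x2(counts)

print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
print("smallest expected count:", min(min(row) for row in expected))
```

Printing the smallest expected count is a quick way to check whether the table is too sparse for a Chi-Square test and a Fisher’s exact test would be the better choice.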

Biostatistics Part II

Welcome back to biostatistics! Today we spend some time on study design and study-specific statistical calculations.

If you have more time, check out the Khan Academy series of videos and infographics on statistics and study design. Their resources are phenomenal and can really help with both understanding CREOG questions as well as helping you out in your own research design!

And for a concise review, check out our own quick notes on the subject.

Biostatistics Part I

On today’s episode, we try to tackle the highly testable, last-minute-cram topic of biostatistics! This will be the first in a two-part series. Sorry about the sound issues; we had some problems with Nick’s microphone, but they should be fixed after this series!

Below is the official cheat sheet of equations from us for this episode. Hopefully this is helpful in guiding your studying! And stay tuned for next week when we talk more about study design and study-specific statistics.

We also talk about a few other statistical points today:

Prevalence represents the proportion of people in a population who have a disease. From the above table, this could be calculated as (A+C) / (A+B+C+D).

Likelihood ratio is a value that can represent the significance or utility of a diagnostic test, and is calculated as Sensitivity / (1 - Specificity). In other words, the true positive rate divided by the false positive rate.

An LR > 1 signifies the test is associated with the disease.
An LR < 1 signifies the test is associated with absence of a disease.
An LR that is close to 1 demonstrates the test doesn’t have a strong association with either presence or absence of disease.

Why use LR? If you know the prevalence of disease in a population, you know the pre-test probability of the patient in front of you having the disease. An LR far from 1 means the test result moves your post-test probability well away from the pre-test probability, making you more certain of the diagnosis either way. An LR close to 1 leaves your post-test probability about the same as your pre-test probability.
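Here is a small Python sketch of how prevalence, the likelihood ratio, and post-test probability fit together. It assumes the usual 2x2 layout (A = true positives, B = false positives, C = false negatives, D = true negatives), which matches the prevalence formula above. The counts are hypothetical, and the function names are made up for illustration. The post-test step uses the standard odds form: post-test odds = pre-test odds x LR.

```python
def diagnostic_summary(a, b, c, d):
    """ASSUMED layout: a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    prevalence = (a + c) / (a + b + c + d)
    sensitivity = a / (a + c)  # true positive rate
    specificity = d / (b + d)  # true negative rate
    # Likelihood ratio for a positive result:
    # true positive rate divided by false positive rate.
    likelihood_ratio = sensitivity / (1 - specificity)
    return prevalence, sensitivity, specificity, likelihood_ratio


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the LR, convert back."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# HYPOTHETICAL counts, invented for illustration only.
prevalence, sens, spec, lr = diagnostic_summary(a=90, b=180, c=10, d=720)
post = post_test_probability(prevalence, lr)

print(f"prevalence = {prevalence:.2f}, LR = {lr:.1f}")
print(f"post-test probability after a positive test = {post:.2f}")
# An LR of exactly 1 leaves the pre-test probability unchanged.
print(f"with LR = 1: {post_test_probability(prevalence, 1.0):.2f}")
```

With these made-up numbers, the post-test probability works out to the same value as the fraction of positive tests that are true positives, A / (A+B), which is a handy sanity check on the odds arithmetic.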