Chapter 1: Design

What is a study?

A study seeks to establish whether there is an association between a dependent and independent variable.

Statisticians use the method of comparison to find the effect of treatment/exposure on a disease/response.

exposure response
vaccine polio
heart disease polio
smoking heart disease
obesity depression

Compare the responses of a treatment group to a control group.


Randomized-controlled experiments

Objective: ensure similarity between treatment and control group

  1. Put subjects into treatment and control at random
  2. If possible, give control placebo:
    • neutral, but resembles treatment.
    • Response should be treatment itself and not idea of it
  3. Double-blind:
    • Subjects and evaluators do not know whether subject is in treatment or control group.
    • Prevents bias in analysis

Observational Studies

In controlled experiments, investigators decide who will be in the treatment group and who will be in the control group.

In observational studies, subjects assign themselves to the different groups. To see if confounding is a problem, look at how the exposed and non-exposed groups are selected.

One way to control for confounders is to make comparisons for smaller and more homogeneous grups (eg. by age, sex). This is called “slicing” (not an official term).

Observational studies can establish association. But association does not imply causation.

Variable Types


Smoking is an example of a discrete variable (a.k.a. categorical variable).

Eg. Smoking has two categories (binary categorical): you smoke or you don’t.


a.k.a. numerical, measurement

2x2 contingency table

Y not Y
A a c
not A b d


A and Y are associated if

(1) rate(A|Y) != rate(A|!Y) OR
(2) rate(Y|A) != rate(Y|!A)

Consistency rule states than (1) iff (2), and vice-versa.

a/(a+c) != b/(b+d)
a/(a+b) != c/(c+d)


Simpsons Paradox

Relationships between percentages in subgroups can be reversed if subgroups are combined.


Randomized and controlled studies minimize confounding.


Suppose units are randomly assigned to be exposed or not. If the sample size is very large, then the likelihood that a given variable C is not associated to exposure x tends to almost certainty.

Risk Ratio

A not A row Total
B x y x + y
not B a b a + b
risk (A | B) = x / (x+y)
risk (A | !B) = a / (a+b)
RR = risk(A|B) / risk(A|!B)
RR = 1 means no association

Odds Ratio

A not A
B x y
not B a b
odds(A|B) = x/y
odds(A|!B)= a/b

OR = bx/ay
odds = risk/(1-risk)

Population vs Estimated RR

population sample size too large, calculation done based on samples.

Study Samples from Advantage
Cohort Exposure Risk and RR can be estimated
Case-control Response Good for rare diseases

Chapter 2: Association

  1. Deterministic Relationship
    • Value of variable can be determined if we know the value of the other variable
  2. Statistical Relationship
    • Natural variability exists in measurements
    • Average pattern of one variable can be described given the value of the other variable

Categorical Variables

Data that consists of group or category names. Measurements can be grouped too.

Measurements of Association: RR and OR

Measurement Variables

Bivariate data and Scatter Diagram

Exploring relationship

Average: eg. son’s average height is taller than dad association: positive gradient? linear or exponential relationship? Standard deviation: spread or variability of data

Correlation Coefficient

Summarizes direction and strength of linear association: -1 <= r <= 1

weak   moderate    strong
0    0.3        0.7       1

Not affected by:

  1. Interchanging two variables
  2. Adding a number to all values of a variable
  3. Multiplying a number to all values of a variable

Standard Unit

SU = (X - X_bar) / sd_x

To obtain r, obtain the product of standard unit of father-son pairs, then take the average of the products


Ecological Correlation

Correlation based on aggregated data, such as gorup averages or rates.

In general, when the associations for both individuals and aggregates are in the same direction, the ecological correlation, based on the aggregates, will typically overstate the strength of the association in individuals.

Variability among individuals are eliminated during aggregation

Ecological Fallacy

Deduction of inferences about individuals based on aggregate data

Atomistic Fallacy

Generalize the correlation based on individuals toward the aggregate level correlation


Attentuation Effect

Due to range restriction in one variable, the correlation coefficient obtained tends to understate the strength of association between two variables.

Range restriction: bivariate data set formed based on criteria on one variable data for the other variable is only available for a limited range.

Range restriction tends to have diminishing influence on the strength of the association, called the attenuation effect.

Regression fallacy

In virtual test-retest situations, the bottom group on the first test will on average show some improvement on the second test, and the top group will, on average, fall back.

Prediction with linear regression

Y = a + bX

slope and intercept determined using least-square-method. Predicting “average”, not exact. Also dangerous to predict beyond observed range.

Chapter 3: Sampling


  1. Unit: Object/Individual
  2. Population: Collection of units
  3. Sample: Subset of a population
  4. Sampling frame: list of sampling units intended to identify all units in the population
    1. Good Coverage
    2. Up-to-date and complete

Sampling methods

  1. Probability Sampling
    • Every unit must have a known probability of being sampled
    • Simple random sampling: all units have equal probability
  2. Systematic sampling
    • Selecting units from a list through the application of a selection interval K, so every Kth unit following a random start is included in the sample
    • treated as simple random when sampling units are arranged randomly
    • might obtain undesirable sample if sampling units and K have cyclical effect
    • can use when # sampling units unknown
  3. Stratified
    • first divide population of units into strata, take a probability sample from each group
  4. Multi-stage

Difficulties in Sampling

  1. Imperfect sample frame
    • Perfect sampling frame consists of all units in population
    • otherwise, might include unwanted units (increased cost of study), or exclude desired units (need to redefine target population).
  2. Non-response
    • not all units are contactable, willing to take part. Non-respondents typically differ from respondents, and this effect needs to be studied.
  3. Volunteer sample (biased)
  4. Convenience sample (biased)
  5. Judgement sample (uses own discretion, biased)
  6. Quota sample (Having proportions of categories dose not make extension of results to population better)

Chapter 4: Probability


Relative Frequency Personal Probability
Will you win the lottery Will you be working overseas once you graduate?
Can be quantified exactly Cannot be quantified exactly
Based on repeated observation of outcomes Based on personal belief

Odds of having disease = P(disease) / P(no disease)

Average value = expected value


Testing rare events (Medical screening)

  1. Base rate: P(disease)
  2. Sensitivity: P(positive | disease)
  3. Specificity: P(negative | no disease)
To test Not to test
no alternative test Alternative more reliable test
Test is inexpensive & more expensive 2nd test Test is expensive
Goo chance of successful treatment Unreliable treatment

Chapter 5: Networks


  1. Degree: number of other vertices in the network a node is adjacent to
  2. Order: number of vertices
  3. Size: number of edges
  4. Distance d(X,Y) = distance between X and Y


n vertices

Centrality Formula
Closeness Ccen(u) = sum[d(u,vi)/ n-1 ]
Degree Dcen(u) = deg(u) / n-1

Betweeness: For a vertex Z in any graph, how many shortest paths are there, between any pair of 2 vertices, passing through Z?

If 2 shortest paths between a,b, only 1 pass through z, add 12.

Appendix: Answering Questions