## Chapter 1: Design

### What is a study?

A study seeks to establish whether there is an *association* between a
*dependent* and *independent* variable.

Statisticians use the method of comparison to find the effect of treatment/exposure on a disease/response.

exposure | response |
---|---|

vaccine | polio |

heart disease | polio |

smoking | heart disease |

obesity | depression |

Compare the responses of a treatment group to a control group.

### Confounders

- If the control group is similar to the treatment group, apart from the treatment, the differences in response are likely to be due to the effect of the exposure
- If not, then other effects could be “confounded” with the results of the treatment. These are called confounders.
- Confounders must be associated with both the exposure and the response
- Minimized through randomized-control.

### Randomized-controlled experiments

Objective: ensure similarity between treatment and control group

- Put subjects into treatment and control at random
- If possible, give control placebo:
- neutral, but resembles treatment.
- Response should be treatment itself and not idea of it

- Double-blind:
- Subjects and evaluators do not know whether subject is in treatment or control group.
- Prevents bias in analysis

### Observational Studies

In controlled experiments, investigators decide who will be in the treatment group and who will be in the control group.

In observational studies, subjects assign themselves to the different groups. To see if confounding is a problem, look at how the exposed and non-exposed groups are selected.

One way to control for confounders is to make comparisons for smaller and more homogeneous grups (eg. by age, sex). This is called “slicing” (not an official term).

Observational studies can establish *association*. But association
does not imply causation.

### Variable Types

#### Discrete

Smoking is an example of a *discrete* variable (a.k.a. categorical
variable).

Eg. Smoking has two categories (binary categorical): you smoke or you don’t.

#### Continuous

a.k.a. numerical, measurement

### 2x2 contingency table

Y | not Y | |
---|---|---|

A | a | c |

not A | b | d |

### Association

A and Y are associated if

```
(1) rate(A|Y) != rate(A|!Y) OR
(2) rate(Y|A) != rate(Y|!A)
```

Consistency rule states than (1) iff (2), and vice-versa.

```
a/(a+c) != b/(b+d)
a/(a+b) != c/(c+d)
```

### Adherence

- Assignment may be random, but adherence is not
- Clues in to success of blinding (eg. drug has negative side effects)

### Simpsons Paradox

Relationships between percentages in subgroups can be reversed if subgroups are combined.

### Design

- Experimental
- Controlled
- Randomized

- Not controlled

- Controlled
- Observational

Randomized and controlled studies minimize confounding.

#### Theorem

Suppose units are randomly assigned to be exposed or not. If the sample size is very large, then the likelihood that a given variable C is not associated to exposure x tends to almost certainty.

### Risk Ratio

A | not A | row Total | |
---|---|---|---|

B | x | y | x + y |

not B | a | b | a + b |

```
risk (A | B) = x / (x+y)
risk (A | !B) = a / (a+b)
```

```
RR = risk(A|B) / risk(A|!B)
RR = 1 means no association
```

- RR > 1 => first group has higher risk
- Population risk cannot be estimated in case-control studies, even with random samples.

### Odds Ratio

A | not A | |
---|---|---|

B | x | y |

not B | a | b |

```
odds(A|B) = x/y
odds(A|!B)= a/b
OR = bx/ay
odds = risk/(1-risk)
```

#### Population vs Estimated RR

population sample size too large, calculation done based on samples.

Study | Samples from | Advantage |
---|---|---|

Cohort | Exposure | Risk and RR can be estimated |

Case-control | Response | Good for rare diseases |

## Chapter 2: Association

- Deterministic Relationship
- Value of variable can be determined if we know the value of the other variable

- Statistical Relationship
- Natural variability exists in measurements
- Average pattern of one variable can be described given the value of the other variable

### Categorical Variables

Data that consists of group or category names. Measurements can be grouped too.

#### Measurements of Association: RR and OR

- RR and OR can be accurately estimated to a cohort study
- RR is intuitively clearer and can only be estimated from cohort studies
- OR applies to both cohort and case-control studies

### Measurement Variables

#### Bivariate data and Scatter Diagram

#### Exploring relationship

Average: eg. son’s average height is taller than dad association: positive gradient? linear or exponential relationship? Standard deviation: spread or variability of data

#### Correlation Coefficient

Summarizes direction and strength of linear association: -1 <= r <= 1

- r > 0 positive association
- r < 0 negative association
- r = 0 no association
- r = 1 perfectly positive association
- r value close to 0 weak association

```
weak moderate strong
0 0.3 0.7 1
```

Not affected by:

- Interchanging two variables
- Adding a number to all values of a variable
- Multiplying a number to all values of a variable

#### Standard Unit

```
SU = (X - X_bar) / sd_x
```

To obtain r, obtain the product of standard unit of father-son pairs, then take the average of the products

#### Limitations

Causation

A change in one variable produces a change in the other variable.

Outliers in data set

Data points that are unusually far away from the bulk of the data. Dangerous to exclude outliers without understanding the cause of the occurrence.

non-linear association

- zero correlation only says no “linear association”
- high correlation doesn’t mean linear association

### Ecological Correlation

Correlation based on aggregated data, such as gorup averages or rates.

In general, when the associations for both individuals and aggregates are in the same direction, the ecological correlation, based on the aggregates, will typically overstate the strength of the association in individuals.

Variability among individuals are eliminated during aggregation

#### Ecological Fallacy

Deduction of inferences about individuals based on aggregate data

#### Atomistic Fallacy

Generalize the correlation based on individuals toward the aggregate level correlation

### Association

#### Attentuation Effect

Due to range restriction in one variable, the correlation coefficient obtained tends to understate the strength of association between two variables.

Range restriction: bivariate data set formed based on criteria on one variable data for the other variable is only available for a limited range.

Range restriction tends to have diminishing influence on the strength of the association, called the attenuation effect.

#### Regression fallacy

In virtual test-retest situations, the bottom group on the first test will on average show some improvement on the second test, and the top group will, on average, fall back.

### Prediction with linear regression

Y = a + bX

slope and intercept determined using least-square-method. Predicting “average”, not exact. Also dangerous to predict beyond observed range.

## Chapter 3: Sampling

### Definitions

- Unit: Object/Individual
- Population: Collection of units
- Sample: Subset of a population
- Sampling frame: list of sampling units intended to identify all
units in the population
- Good Coverage
- Up-to-date and complete

### Sampling methods

- Probability Sampling
- Every unit must have a known probability of being sampled
- Simple random sampling: all units have equal probability

- Systematic sampling
- Selecting units from a list through the application of a selection interval K, so every Kth unit following a random start is included in the sample
- treated as simple random when sampling units are arranged randomly
- might obtain undesirable sample if sampling units and K have cyclical effect
- can use when # sampling units unknown

- Stratified
- first divide population of units into strata, take a probability sample from each group

- Multi-stage

### Difficulties in Sampling

- Imperfect sample frame
- Perfect sampling frame consists of all units in population
- otherwise, might include unwanted units (increased cost of study), or exclude desired units (need to redefine target population).

- Non-response
- not all units are contactable, willing to take part. Non-respondents typically differ from respondents, and this effect needs to be studied.

- Volunteer sample (biased)
- Convenience sample (biased)
- Judgement sample (uses own discretion, biased)
- Quota sample (Having proportions of categories dose not make extension of results to population better)

## Chapter 4: Probability

### Interpretations

Relative Frequency | Personal Probability |
---|---|

Will you win the lottery | Will you be working overseas once you graduate? |

Can be quantified exactly | Cannot be quantified exactly |

Based on repeated observation of outcomes | Based on personal belief |

Odds of having disease = P(disease) / P(no disease)

Average value = expected value

### P-values

- p-value = the probability of obtaining an outcome
*equivalent to or more extreme*than the observed - null hypothesis: assumption used to calculate p-value (eg. coin is fair)
- if p-value is small, unlikely for observed to occur by chance, and unlikely for null hypothesis to be true. Converse for large.
- p-value > 0.05 : do not reject NH at 5% significance level. Cannot conclude that it is not fair. Observed effect in sample is likely to reflect effect in population.

### Testing rare events (Medical screening)

- Base rate: P(disease)
- Sensitivity: P(positive | disease)
- Specificity: P(negative | no disease)

To test | Not to test |
---|---|

no alternative test | Alternative more reliable test |

Test is inexpensive & more expensive 2nd test | Test is expensive |

Goo chance of successful treatment | Unreliable treatment |

## Chapter 5: Networks

- Collection of objects and well-defined relations between objects

### Definitions

- Degree: number of other vertices in the network a node is adjacent to
- Order: number of vertices
- Size: number of edges
- Distance d(X,Y) = distance between X and Y

### Centrality

n vertices

Centrality | Formula |
---|---|

Closeness | Ccen(u) = sum[d(u,vi)/ n-1 ] |

Degree | Dcen(u) = deg(u) / n-1 |

Betweeness: For a vertex Z in any graph, how many shortest paths are there, between any pair of 2 vertices, passing through Z?

If 2 shortest paths between a,b, only 1 pass through z, add ^{1}⁄_{2}.

## Appendix: Answering Questions

- exposure (potential cause)
- response (potential effect)
- design
- sampling
- unit