Chapter 1: Design
What is a study?
A study seeks to establish whether there is an association between a dependent and independent variable.
Statisticians use the method of comparison to find the effect of treatment/exposure on a disease/response.
exposure  response 

vaccine  polio 
heart disease  polio 
smoking  heart disease 
obesity  depression 
Compare the responses of a treatment group to a control group.
Confounders
 If the control group is similar to the treatment group, apart from the treatment, the differences in response are likely to be due to the effect of the exposure
 If not, then other effects could be “confounded” with the results of the treatment. These are called confounders.
 Confounders must be associated with both the exposure and the response
 Minimized through randomizedcontrol.
Randomizedcontrolled experiments
Objective: ensure similarity between treatment and control group
 Put subjects into treatment and control at random
 If possible, give control placebo:
 neutral, but resembles treatment.
 Response should be treatment itself and not idea of it
 Doubleblind:
 Subjects and evaluators do not know whether subject is in treatment or control group.
 Prevents bias in analysis
Observational Studies
In controlled experiments, investigators decide who will be in the treatment group and who will be in the control group.
In observational studies, subjects assign themselves to the different groups. To see if confounding is a problem, look at how the exposed and nonexposed groups are selected.
One way to control for confounders is to make comparisons for smaller and more homogeneous grups (eg. by age, sex). This is called “slicing” (not an official term).
Observational studies can establish association. But association does not imply causation.
Variable Types
 Discrete
Smoking is an example of a _discrete_ variable (a.k.a. categorical
variable).
Eg. Smoking has two categories (binary categorical): you smoke or you
don't.
 Continuous
a.k.a. numerical, measurement
2x2 contingency table
Y  not Y  

A  a  c 
not A  b  d 
Association
A and Y are associated if
(1) rate(AY) != rate(A!Y) OR
(2) rate(YA) != rate(Y!A)
Consistency rule states than (1) iff (2), and viceversa.
a/(a+c) != b/(b+d)
a/(a+b) != c/(c+d)
Adherence
 Assignment may be random, but adherence is not
 Clues in to success of blinding (eg. drug has negative side effects)
Simpsons Paradox
Relationships between percentages in subgroups can be reversed if subgroups are combined.
Design
 Experimental
 Controlled
 Randomized
 Not controlled
 Controlled
 Observational
Randomized and controlled studies minimize confounding.
 Theorem
Suppose units are randomly assigned to be exposed or not. If the
sample size is very large, then the likelihood that a given variable C
is not associated to exposure x tends to almost certainty.
Risk Ratio
A  not A  row Total  

B  x  y  x + y 
not B  a  b  a + b 
risk (A  B) = x / (x+y)
risk (A  !B) = a / (a+b)
RR = risk(AB) / risk(A!B)
RR = 1 means no association
 RR > 1 => first group has higher risk
 Population risk cannot be estimated in casecontrol studies, even with random samples.
Odds Ratio
A  not A  

B  x  y 
not B  a  b 
odds(AB) = x/y
odds(A!B)= a/b
OR = bx/ay
odds = risk/(1risk)
 Population vs Estimated RR
population sample size too large, calculation done based on samples.
 Study  Samples from  Advantage 

 Cohort  Exposure  Risk and RR can be estimated 
 Casecontrol  Response  Good for rare diseases 
Chapter 2: Association
 Deterministic Relationship
 Value of variable can be determined if we know the value of the other variable
 Statistical Relationship
 Natural variability exists in measurements
 Average pattern of one variable can be described given the value of the other variable
Categorical Variables
Data that consists of group or category names. Measurements can be grouped too.
 Measurements of Association: RR and OR
 RR and OR can be accurately estimated to a cohort study
 RR is intuitively clearer and can only be estimated from cohort
studies
 OR applies to both cohort and casecontrol studies
Measurement Variables

Bivariate data and Scatter Diagram

Exploring relationship
Average: eg. son's average height is taller than dad
association: positive gradient?
linear or exponential relationship?
Standard deviation: spread or variability of data
 Correlation Coefficient
Summarizes direction and strength of linear association: 1 <= r <= 1
 r > 0 positive association
 r < 0 negative association
 r = 0 no association
 r = 1 perfectly positive association
 r value close to 0 weak association
<!listend>
```text
weak moderate strong
0 0.3 0.7 1
```
Not affected by:
1. Interchanging two variables
2. Adding a number to all values of a variable
3. Multiplying a number to all values of a variable
 Standard Unit
```text
SU = (X  X_bar) / sd_x
```
To obtain r, obtain the product of standard unit of fatherson pairs,
then take the average of the products
 Limitations
 Causation
A change in one variable produces a change in the other variable.
 Outliers in data set
Data points that are unusually far away from the bulk of the data.
Dangerous to exclude outliers without understanding the cause of the
occurrence.
 nonlinear association
 zero correlation only says no "linear association"
 high correlation doesn't mean linear association
Ecological Correlation
Correlation based on aggregated data, such as gorup averages or rates.
In general, when the associations for both individuals and aggregates are in the same direction, the ecological correlation, based on the aggregates, will typically overstate the strength of the association in individuals.
Variability among individuals are eliminated during aggregation
 Ecological Fallacy
Deduction of inferences about individuals based on aggregate data
 Atomistic Fallacy
Generalize the correlation based on individuals toward the aggregate
level correlation
Association
 Attentuation Effect
> Due to range restriction in one variable, the correlation coefficient
> obtained tends to understate the strength of association between two
> variables.
Range restriction: bivariate data set formed based on criteria on one
variable data for the other variable is only available for a limited
range.
Range restriction tends to have diminishing influence on the strength
of the association, called the attenuation effect.
 Regression fallacy
> In virtual testretest situations, the bottom group on the first test
> will on average show some improvement on the second test, and the top
> group will, on average, fall back.
Prediction with linear regression
Y = a + bX
slope and intercept determined using leastsquaremethod. Predicting “average”, not exact. Also dangerous to predict beyond observed range.
Chapter 3: Sampling
Definitions
 Unit: Object/Individual
 Population: Collection of units
 Sample: Subset of a population
 Sampling frame: list of sampling units intended to identify all
units in the population
 Good Coverage
 Uptodate and complete
Sampling methods
 Probability Sampling
 Every unit must have a known probability of being sampled
 Simple random sampling: all units have equal probability
 Systematic sampling
 Selecting units from a list through the application of a selection interval K, so every Kth unit following a random start is included in the sample
 treated as simple random when sampling units are arranged randomly
 might obtain undesirable sample if sampling units and K have cyclical effect
 can use when # sampling units unknown
 Stratified
 first divide population of units into strata, take a probability sample from each group
 Multistage
Difficulties in Sampling
 Imperfect sample frame
 Perfect sampling frame consists of all units in population
 otherwise, might include unwanted units (increased cost of study), or exclude desired units (need to redefine target population).
 Nonresponse
 not all units are contactable, willing to take part. Nonrespondents typically differ from respondents, and this effect needs to be studied.
 Volunteer sample (biased)
 Convenience sample (biased)
 Judgement sample (uses own discretion, biased)
 Quota sample (Having proportions of categories dose not make extension of results to population better)
Chapter 4: Probability
Interpretations
Relative Frequency  Personal Probability 

Will you win the lottery  Will you be working overseas once you graduate? 
Can be quantified exactly  Cannot be quantified exactly 
Based on repeated observation of outcomes  Based on personal belief 
Odds of having disease = P(disease) / P(no disease)
Average value = expected value
Pvalues
 pvalue = the probability of obtaining an outcome equivalent to or more extreme than the observed
 null hypothesis: assumption used to calculate pvalue (eg. coin is fair)
 if pvalue is small, unlikely for observed to occur by chance, and unlikely for null hypothesis to be true. Converse for large.
 pvalue > 0.05 : do not reject NH at 5% significance level. Cannot conclude that it is not fair. Observed effect in sample is likely to reflect effect in population.
Testing rare events (Medical screening)
 Base rate: P(disease)
 Sensitivity: P(positive  disease)
 Specificity: P(negative  no disease)
To test  Not to test 

no alternative test  Alternative more reliable test 
Test is inexpensive & more expensive 2nd test  Test is expensive 
Goo chance of successful treatment  Unreliable treatment 
Chapter 5: Networks
 Collection of objects and welldefined relations between objects
Definitions
 Degree: number of other vertices in the network a node is adjacent to
 Order: number of vertices
 Size: number of edges
 Distance d(X,Y) = distance between X and Y
Centrality
n vertices
Centrality  Formula 

Closeness  Ccen(u) = sum[d(u,vi)/ n1 ] 
Degree  Dcen(u) = deg(u) / n1 
Betweeness: For a vertex Z in any graph, how many shortest paths are there, between any pair of 2 vertices, passing through Z?
If 2 shortest paths between a,b, only 1 pass through z, add 1/2.
Appendix: Answering Questions
 exposure (potential cause)
 response (potential effect)
 design
 sampling
 unit