Quantitative Reasoning

Chapter 1: Design

What is a study?

A study seeks to establish whether there is an association between a dependent and independent variable.

Statisticians use the method of comparison to find the effect of treatment/exposure on a disease/response.

exposure	response
vaccine	polio
heart disease	polio
smoking	heart disease
obesity	depression

Compare the responses of a treatment group to a control group.

Confounders

If the control group is similar to the treatment group, apart from the treatment, the differences in response are likely to be due to the effect of the exposure.
If not, then other effects could be “confounded” with the results of the treatment. These are called confounders.
A confounder must be associated with both the exposure and the response.
Confounding is minimized through randomized controlled experiments.

Randomized-controlled experiments

Objective: ensure similarity between treatment and control group

Put subjects into treatment and control at random
If possible, give control placebo:
- neutral, but resembles treatment.
- Response should be to the treatment itself, not to the idea of being treated
Double-blind:
- Subjects and evaluators do not know whether subject is in treatment or control group.
- Prevents bias in analysis

Observational Studies

In controlled experiments, investigators decide who will be in the treatment group and who will be in the control group.

In observational studies, subjects assign themselves to the different groups. To see if confounding is a problem, look at how the exposed and non-exposed groups are selected.

One way to control for confounders is to make comparisons within smaller and more homogeneous groups (e.g. by age, sex). This is called “slicing” (not an official term).

Observational studies can establish association. But association does not imply causation.

Variable Types

Discrete

Smoking is an example of a discrete variable (a.k.a. categorical variable).

Eg. Smoking has two categories (binary categorical): you smoke or you don’t.

Continuous

a.k.a. numerical, measurement

2x2 contingency table

	Y	not Y
A	a	c
not A	b	d

Association

A and Y are associated if

(1) rate(A|Y) != rate(A|!Y) OR
(2) rate(Y|A) != rate(Y|!A)

The consistency rule states that (1) holds if and only if (2) holds. In terms of the table:

(1) a/(a+b) != c/(c+d)
(2) a/(a+c) != b/(b+d)
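
The consistency rule can be checked numerically. The sketch below uses the cell labels a, b, c, d from the contingency table above; the function names and the counts are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def rates(a, b, c, d):
    """Conditional rates for the table
               Y    not Y
       A       a    c
       not A   b    d
    """
    return {
        "A|Y": Fraction(a, a + b),
        "A|not Y": Fraction(c, c + d),
        "Y|A": Fraction(a, a + c),
        "Y|not A": Fraction(b, b + d),
    }

def associated_rule_1(a, b, c, d):
    r = rates(a, b, c, d)
    return r["A|Y"] != r["A|not Y"]

def associated_rule_2(a, b, c, d):
    r = rates(a, b, c, d)
    return r["Y|A"] != r["Y|not A"]

# Hypothetical counts: one table with an association, one without.
for table in [(30, 10, 20, 40), (10, 20, 30, 60)]:
    print(table, associated_rule_1(*table), associated_rule_2(*table))
```

For both tables the two rules give the same answer, as the consistency rule says they must.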

Adherence

Assignment may be random, but adherence is not
Adherence gives clues about how well blinding succeeded (e.g. a drug with negative side effects can reveal who is in the treatment group)

Simpson's Paradox

Relationships between percentages in subgroups can be reversed if subgroups are combined.
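
A small numerical sketch of the reversal. All counts are invented for illustration (they are not from any study), and the names `data`, `rate` and `combined` are hypothetical.

```python
from fractions import Fraction

# Hypothetical (successes, total) counts for treatments X and Y.
data = {
    "subgroup 1": {"X": (9, 10), "Y": (80, 100)},
    "subgroup 2": {"X": (30, 100), "Y": (2, 10)},
}

def rate(successes, total):
    return Fraction(successes, total)

def combined(treatment):
    """Success rate after pooling both subgroups."""
    successes = sum(data[g][treatment][0] for g in data)
    total = sum(data[g][treatment][1] for g in data)
    return rate(successes, total)

for group, counts in data.items():
    print(group, {t: float(rate(*c)) for t, c in counts.items()})
print("combined", {t: float(combined(t)) for t in ("X", "Y")})
```

X has the higher success rate in each subgroup (90% vs 80%, 30% vs 20%), yet Y has the higher rate once the subgroups are combined (39/110 vs 82/110), because X was used mostly in the subgroup with low success rates.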

Design

Experimental
- Controlled
  - Randomized
- Not controlled
Observational

Randomized and controlled studies minimize confounding.

Theorem

Suppose units are randomly assigned to be exposed or not. If the sample size is very large, then it is almost certain that any given variable C is not associated with the exposure x.

Risk Ratio

	A	not A	row Total
B	x	y	x + y
not B	a	b	a + b

risk (A | B) = x / (x+y)
risk (A | !B) = a / (a+b)

RR = risk(A|B) / risk(A|!B)
RR = 1 means no association

RR > 1 => first group has higher risk
Population risk cannot be estimated in case-control studies, even with random samples.
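
A minimal sketch of the RR calculation, using the cell labels x, y, a, b from the table above. The function name and the counts are hypothetical.

```python
from fractions import Fraction

def risk_ratio(x, y, a, b):
    """RR for a table with rows B / not B and columns A / not A."""
    risk_given_b = Fraction(x, x + y)      # risk(A | B)
    risk_given_not_b = Fraction(a, a + b)  # risk(A | !B)
    return risk_given_b / risk_given_not_b

# Hypothetical counts: 20 of 100 in group B have A, 10 of 100 in group not B.
print(risk_ratio(x=20, y=80, a=10, b=90))  # 2
```

Here RR = 2, so the first group (B) has twice the risk of the second.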

Odds Ratio

	A	not A
B	x	y
not B	a	b

odds(A|B) = x/y
odds(A|!B) = a/b

OR = odds(A|B) / odds(A|!B) = bx/(ay)
odds = risk/(1-risk)
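
The two identities above can be checked with exact fractions. This is only a sketch; the function names and counts are hypothetical.

```python
from fractions import Fraction

def odds_from_risk(risk):
    """odds = risk / (1 - risk)"""
    return risk / (1 - risk)

def odds_ratio(x, y, a, b):
    """OR for a table with rows B / not B and columns A / not A."""
    return Fraction(x, y) / Fraction(a, b)

x, y, a, b = 20, 80, 10, 90              # hypothetical counts
risk_given_b = Fraction(x, x + y)        # 1/5
print(odds_from_risk(risk_given_b))      # 1/4, the same as x/y
print(odds_ratio(x, y, a, b))            # 9/4, the same as bx/(ay)
```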

Population vs Estimated RR

The population is usually too large to measure in full, so RR is calculated from samples and is only an estimate of the population RR.

Study	Samples from	Advantage
Cohort	Exposure	Risk and RR can be estimated
Case-control	Response	Good for rare diseases

Chapter 2: Association

Deterministic Relationship
- Value of variable can be determined if we know the value of the other variable
Statistical Relationship
- Natural variability exists in measurements
- Average pattern of one variable can be described given the value of the other variable

Categorical Variables

Data that consists of group or category names. Measurements can be grouped too.

Measurements of Association: RR and OR

RR and OR can both be accurately estimated from a cohort study
RR is intuitively clearer and can only be estimated from cohort studies
OR applies to both cohort and case-control studies
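
A sketch of why OR survives case-control sampling while RR does not. The population counts are invented, and the case-control sample is assumed to keep every case (A) but only 10% of the non-cases (not A). Rows are B / not B and columns are A / not A.

```python
from fractions import Fraction

def risk_ratio(x, y, a, b):
    return Fraction(x, x + y) / Fraction(a, a + b)

def odds_ratio(x, y, a, b):
    return Fraction(x * b, a * y)

# Hypothetical population with a rare response A.
population = dict(x=20, y=980, a=10, b=990)
# Case-control sample: all cases kept, 10% of non-cases kept.
case_control = dict(x=20, y=98, a=10, b=99)

print("OR:", odds_ratio(**population), odds_ratio(**case_control))
print("RR:", risk_ratio(**population), risk_ratio(**case_control))
```

Sampling by response scales the "not A" column by a constant, which cancels in bx/(ay) but not in the risks, so the OR is unchanged (99/49 in both) while the RR moves from 2 to 109/59.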

Measurement Variables

Bivariate data and Scatter Diagram

Exploring relationship

Average: e.g. sons are, on average, taller than their fathers
Association: is the gradient positive? Is the relationship linear or exponential?
Standard deviation: spread or variability of the data

Correlation Coefficient

Summarizes direction and strength of linear association: -1 <= r <= 1

r > 0 positive association
r < 0 negative association
r = 0 no linear association
r = 1 perfect positive linear association
r = -1 perfect negative linear association
r value close to 0 weak association

Magnitude of r	Strength
0 to 0.3	weak
0.3 to 0.7	moderate
0.7 to 1	strong

Not affected by:

Interchanging two variables
Adding a number to all values of a variable
Multiplying all values of a variable by a positive number (a negative multiplier flips the sign of r)

Standard Unit

SU = (X - X_bar) / sd_x

To obtain r, convert both variables to standard units, multiply the two standard-unit values within each pair (e.g. each father-son pair), then take the average of the products
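
A sketch of this recipe in code. It assumes the SD is computed by dividing by n (the population SD), which is what makes the plain average of the products equal r. The function names and the heights are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """SD dividing by n (assumption: population SD)."""
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))

def standard_units(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Average of the products of paired standard units."""
    products = [u * v for u, v in zip(standard_units(xs), standard_units(ys))]
    return mean(products)

# Hypothetical father and son heights in cm.
fathers = [165, 170, 175, 180, 185]
sons = [168, 172, 174, 181, 183]
print(correlation(fathers, sons))
print(correlation(sons, fathers))                    # interchanging: same r
print(correlation([f + 10 for f in fathers], sons))  # adding a number: same r
print(correlation([-f for f in fathers], sons))      # negative multiplier: sign flips
```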

Limitations

Causation

A change in one variable produces a change in the other variable.

Outliers in data set

Data points that are unusually far away from the bulk of the data. Dangerous to exclude outliers without understanding the cause of the occurrence.

non-linear association
- zero correlation only says no “linear association”
- high correlation doesn’t mean linear association

Ecological Correlation

Correlation based on aggregated data, such as group averages or rates.

In general, when the associations for both individuals and aggregates are in the same direction, the ecological correlation, based on the aggregates, will typically overstate the strength of the association in individuals.

Variability among individuals is eliminated during aggregation.

Ecological Fallacy

Drawing conclusions about individuals from aggregate data

Atomistic Fallacy

Generalizing a correlation observed among individuals to the aggregate level

Association

Attenuation Effect

Due to range restriction in one variable, the correlation coefficient obtained tends to understate the strength of association between two variables.

Range restriction: the bivariate data set is formed using a criterion on one variable, so data for the other variable are available only over a limited range.

Range restriction tends to diminish the observed strength of the association. This is called the attenuation effect.

Regression fallacy

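In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the second test, and the top group will, on average, fall back. This is the regression effect; the regression fallacy is attributing it to something other than chance variation.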

Prediction with linear regression

Y = a + bX

Slope and intercept are determined using the least-squares method. The line predicts the “average” Y for a given X, not the exact value. It is also dangerous to predict beyond the observed range of X.
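
A minimal sketch of fitting Y = a + bX by least squares, using the standard closed-form slope and intercept. The function names are hypothetical, and the range check in `predict` is one possible way to guard against extrapolation.

```python
def least_squares_line(xs, ys):
    """Return (a, b) minimizing the sum of squared errors of Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def predict(xs, ys, x_new):
    """Predict the average Y at x_new, refusing to go beyond the observed X range."""
    if x_new < min(xs) or x_new > max(xs):
        raise ValueError("x_new is outside the observed range of X")
    a, b = least_squares_line(xs, ys)
    return a + b * x_new

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
print(least_squares_line(xs, ys))  # (1.0, 2.0)
print(predict(xs, ys, 2.5))        # 6.0
```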

Chapter 3: Sampling

Definitions

Unit: Object/Individual
Population: Collection of units
Sample: Subset of a population
Sampling frame: list of sampling units intended to identify all units in the population
1. Good Coverage
2. Up-to-date and complete

Sampling methods

Probability Sampling
- Every unit must have a known probability of being sampled
- Simple random sampling: all units have equal probability
Systematic sampling
- Select units from a list by applying a selection interval K, so that every Kth unit following a random start is included in the sample
- can be treated as simple random sampling when the sampling units are arranged randomly
- may give an undesirable sample if the list has a cyclical pattern that lines up with K
- can be used when the number of sampling units is unknown
Stratified
- first divide population of units into strata, take a probability sample from each group
Multi-stage
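
Systematic sampling as described above can be sketched in a few lines. The function name, the `rng` parameter and the frame of 100 numbered units are hypothetical; the random start is assumed to be drawn uniformly from the first K positions.

```python
import random

def systematic_sample(frame, k, rng=random):
    """Pick a random start among the first k units, then every kth unit after it."""
    start = rng.randrange(k)
    return frame[start::k]

# Hypothetical sampling frame of 100 numbered units.
frame = list(range(100))
sample = systematic_sample(frame, k=10)
print(sample)
```

Each unit has probability 1/K of being sampled, so this is a probability sample, but only K distinct samples are possible, which is why a cyclical pattern in the list can be a problem.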

Difficulties in Sampling

Imperfect sample frame
- Perfect sampling frame consists of all units in population
- otherwise, might include unwanted units (increased cost of study), or exclude desired units (need to redefine target population).
Non-response
- not all units are contactable, willing to take part. Non-respondents typically differ from respondents, and this effect needs to be studied.
Volunteer sample (biased)
Convenience sample (biased)
Judgement sample (uses own discretion, biased)
Quota sample (matching the proportions of categories in the population does not make extension of results to the population better)

Chapter 4: Probability

Interpretations

Relative Frequency	Personal Probability
Will you win the lottery	Will you be working overseas once you graduate?
Can be quantified exactly	Cannot be quantified exactly
Based on repeated observation of outcomes	Based on personal belief

Odds of having disease = P(disease) / P(no disease)

Average value = expected value

P-values

p-value = the probability, assuming the null hypothesis is true, of obtaining an outcome equal to or more extreme than the one observed
null hypothesis: assumption used to calculate p-value (e.g. coin is fair)
if p-value is small (below 0.05 at the 5% significance level), the observed outcome is unlikely to occur by chance under the null hypothesis, so reject NH. The observed effect in the sample is likely to reflect an effect in the population.
p-value > 0.05 : do not reject NH at 5% significance level. Cannot conclude that the coin is not fair. The observed effect in the sample could plausibly be due to chance.
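
A sketch of the fair-coin example. It assumes "more extreme" is taken two-sided, meaning at least as far from n/2 heads as the observed count; a one-sided version would sum only one tail. The function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(heads, n):
    """P(head count at least as far from n/2 as observed), assuming a fair coin."""
    observed_gap = abs(heads - n / 2)
    return sum(
        binomial_pmf(k, n)
        for k in range(n + 1)
        if abs(k - n / 2) >= observed_gap
    )

print(two_sided_p_value(9, 10))  # 22/1024, about 0.021: reject NH at the 5% level
print(two_sided_p_value(6, 10))  # 772/1024, about 0.754: do not reject NH
```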

Testing rare events (Medical screening)

Base rate: P(disease)
Sensitivity: P(positive | disease)
Specificity: P(negative | no disease)
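
These three quantities combine, via Bayes' rule, into the probability of disease given a positive test. A sketch with hypothetical numbers (the function name and the 1% / 90% / 90% values are assumptions for illustration):

```python
def prob_disease_given_positive(base_rate, sensitivity, specificity):
    """P(disease | positive) by Bayes' rule."""
    true_positive = base_rate * sensitivity
    false_positive = (1 - base_rate) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

# Hypothetical rare disease: base rate 1%, sensitivity 90%, specificity 90%.
print(prob_disease_given_positive(0.01, 0.90, 0.90))
```

With these numbers only about 1 in 12 positives actually has the disease, because the many healthy people generate far more false positives than the few diseased people generate true positives.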

To test	Not to test
no alternative test	Alternative more reliable test
Test is inexpensive and can be followed by a more expensive second test	Test is expensive
Good chance of successful treatment	Unreliable treatment

Chapter 5: Networks

Collection of objects and well-defined relations between objects

Definitions

Degree: number of other vertices in the network a node is adjacent to
Order: number of vertices
Size: number of edges
Distance d(X,Y) = number of edges on a shortest path between X and Y

Centrality

n vertices

Centrality	Formula
Closeness	Ccen(u) = sum[d(u,vi)] / (n-1), summed over all other vertices vi
Degree	Dcen(u) = deg(u) / (n-1)
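
A sketch of both formulas on a small hypothetical network, a path A - B - C - D stored as an adjacency list. It assumes an undirected, connected graph; distances come from a breadth-first search. As written in the table, closeness is an average distance, so a smaller value means a more central vertex (some references use the reciprocal instead).

```python
from collections import deque

# Hypothetical network: the path A - B - C - D.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

def distances_from(graph, source):
    """Shortest-path distance (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(graph, u):
    return len(graph[u]) / (len(graph) - 1)

def closeness_centrality(graph, u):
    dist = distances_from(graph, u)
    return sum(d for v, d in dist.items() if v != u) / (len(graph) - 1)

for u in graph:
    print(u, degree_centrality(graph, u), closeness_centrality(graph, u))
```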

Betweenness: For a vertex Z in a graph, how many shortest paths between pairs of other vertices pass through Z?

If there are 2 shortest paths between a and b and only 1 passes through Z, add 1/2 for that pair.
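
A sketch of this counting rule, assuming an undirected graph stored as an adjacency list. For each pair (a, b) it adds the fraction of shortest a-b paths that pass through Z. The function names and the square-shaped example network are hypothetical, and the code favors clarity over speed.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

def bfs_counts(graph, source):
    """Return (dist, count): distance and number of shortest paths from source."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                count[v] += count[u]
    return dist, count

def betweenness(graph, z):
    dist_z, count_z = bfs_counts(graph, z)
    total = Fraction(0)
    others = [v for v in graph if v != z]
    for a, b in combinations(others, 2):
        dist_a, count_a = bfs_counts(graph, a)
        if b not in dist_a or z not in dist_a:
            continue  # no path between a and b, or z is unreachable
        if dist_a[z] + dist_z[b] == dist_a[b]:
            through_z = count_a[z] * count_z[b]
            total += Fraction(through_z, count_a[b])
    return total

# Hypothetical square a - z - b - w - a: two shortest paths from a to b.
square = {"a": ["z", "w"], "z": ["a", "b"], "b": ["z", "w"], "w": ["a", "b"]}
print(betweenness(square, "z"))  # 1/2
```

In the square, the only pair with a shortest path through z is (a, b), and one of its two shortest paths uses z, giving 1/2.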

Appendix: Answering Questions

exposure (potential cause)
response (potential effect)
design
sampling
unit