Statistical Methods in Epidemiology (Στατιστικές μέθοδοι στην Επιδημιολογία)
Academic year 2018-2019
ΜΠΑΜΙΑ ΧΡΙΣΤΙΝΑ* (Course coordinator), ΒΟΥΡΛΗ ΓΕΩΡΓΙΑ* gvourli@med.uoa.gr, ΚΑΛΠΟΥΡΤΖΗ ΝΑΤΑΣΑ* (Exercises)
*Medical School, University of Athens, Laboratory of Hygiene, Epidemiology & Medical Statistics
Poisson regression
Models for counts
The Poisson probability mass function is given by $P(Y = y) = \dfrac{e^{-\mu}\mu^{y}}{y!}$, $y = 0, 1, 2, \ldots$
The mean and variance of the Poisson distribution are equal: $E(Y) = \mathrm{Var}(Y) = \mu$.
Poisson regression models allow researchers to examine the relationship between predictors and count outcome variables.
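As a quick numerical illustration of the statement that the mean and variance coincide, the following minimal sketch evaluates the Poisson probabilities for a hypothetical mean of 3 (the value and the truncation point are assumptions chosen only for illustration).

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 3.0                 # hypothetical mean, for illustration only
support = range(0, 100)  # truncated support; the tail beyond 100 is negligible for mu = 3

probs = [poisson_pmf(y, mu) for y in support]
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(round(sum(probs), 6), round(mean, 6), round(variance, 6))
```

The three printed numbers are the total probability, the mean and the variance; the last two should agree with the chosen mean.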
Examples
The Poisson distribution arises in many biological and medical contexts where counts are involved:
- The number of bacterial colonies in a dish
- The number of trees in an area of land
- The number of children an individual has
- The number of nucleotide base substitutions in a gene over a period of time
- The number of deaths in a group of patients over a study period
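The model for counts
$\ln(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$
We need the logarithm because the linear predictor $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$ can take any real value, whereas we are modelling expected counts ($\mu_i$), which must be > 0.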
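Example
Respiratory deaths were counted in Athens between 1988 and 1991.
The dataset has 4 variables, of which three are listed here:
- Cases: the number of deaths
- Pop: the population of each age group
- Age: the categorical age group; one of 40-54, 55-59, 60-64, 65-74 or > 74
Question of interest: how does the expected number of deaths vary by age?
$\ln(\mu_i) = \beta_0 + \beta_1 I_{40-54} + \beta_2 I_{55-59} + \beta_3 I_{60-64} + \beta_4 I_{65-74}$
Here each $I$ is an indicator for the corresponding age group, so the > 74 group is the reference category.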
What about modelling rates?
If all patients are assumed to be followed up for the same time interval, modelling a rate is unnecessary. But if the follow-up time varies between individuals, modelling the count of deaths alone would be misleading.
Models for rates
Example: the British doctors study. The classical cohort study by Doll et al., which was used (among other objectives) to investigate the effect of smoking on coronary heart disease (CHD) among male British doctors.

agegrp | smoke | deaths | pyrs
1 | 1 | 32 | 52407
2 | 1 | 104 | 43248
3 | 1 | 206 | 28612
4 | 1 | 186 | 12663
5 | 1 | 102 | 5317
1 | 0 | 2 | 18790
2 | 0 | 12 | 10673
3 | 0 | 28 | 5710
4 | 0 | 28 | 2585
5 | 0 | 31 | 1462

Age groups: 1: 35-44, 2: 45-54, 3: 55-64, 4: 65-74, 5: 75+. smoke: 1 = smokers, 0 = non-smokers.
The crude CHD death rate is 101/39220 = 0.0026 per person-year for non-smokers and 630/142247 = 0.0044 per person-year for smokers. The rate ratio of non-smokers compared to smokers is therefore about 0.6, i.e. non-smokers have a lower rate of CHD deaths.
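The crude rates quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the two non-smoker death counts that are not legible in the table (age groups 1 and 4) are assumed here to be 2 and 28, values that reproduce the stated total of 101 deaths.

```python
# (age group, smoker flag, CHD deaths, person-years)
rows = [
    (1, 1, 32, 52407), (2, 1, 104, 43248), (3, 1, 206, 28612),
    (4, 1, 186, 12663), (5, 1, 102, 5317),
    (1, 0, 2, 18790),   # deaths = 2 is an assumption (see lead-in)
    (2, 0, 12, 10673), (3, 0, 28, 5710),
    (4, 0, 28, 2585),   # deaths = 28 is an assumption (see lead-in)
    (5, 0, 31, 1462),
]

def crude_rate(data, smoke):
    """Return (total deaths, total person-years, crude rate) for one smoking group."""
    deaths = sum(d for _age, s, d, _py in data if s == smoke)
    pyrs = sum(py for _age, s, _d, py in data if s == smoke)
    return deaths, pyrs, deaths / pyrs

d0, t0, rate0 = crude_rate(rows, 0)   # non-smokers
d1, t1, rate1 = crude_rate(rows, 1)   # smokers

print(d0, t0, round(rate0, 4))
print(d1, t1, round(rate1, 4))
print(round(rate0 / rate1, 2))        # crude rate ratio, non-smokers vs smokers
```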
Models for rates
Imagine events that occur independently in time intervals $t_i$ with rates $\lambda_i$. The random variables $Y_i$ denote the numbers of events in the corresponding $t_i$ and have Poisson distributions with means $\mu_i = \lambda_i t_i$.
Poisson models handle exposure variables by using simple algebra to change the dependent variable from a rate into a count.
Models for rates
If the rate is count/exposure, multiplying both sides of the equation by the exposure moves it to the right-hand side of the equation.
The Poisson model is a regression model of the mean $\mu_i$ on $p$ explanatory variables $x_{i1}, x_{i2}, \ldots, x_{ip}$, where the link function is the log function.
The model we are interested in is a model for the rates $\lambda_i$, i.e.
$\ln(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$,
and because $\mu_i = \lambda_i t_i$,
$\ln(\mu_i) - \ln(t_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$,
thus
$\ln(\mu_i) = \ln(t_i) + \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$.
The Poisson model includes the offset term $\ln(t_i)$, representing the log person-years, whose coefficient is fixed at one. Using the offset is just a way of accounting for population sizes, which could vary by time.
The coefficients $\beta_1, \ldots, \beta_p$ represent the effect of each of $x_{i1}, x_{i2}, \ldots, x_{ip}$ on the log of the rates $\lambda_i$.
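To make the interpretation of the coefficients concrete, a short worked step (same symbols as above; writing $\lambda(x_j)$ for the rate at a given value of the $j$-th covariate, all other covariates held fixed, is notation introduced only for this sketch) shows why $e^{\beta_j}$ is read as a rate ratio:

```latex
\begin{align*}
\ln \lambda(x_j + 1) - \ln \lambda(x_j)
  &= \beta_j (x_j + 1) - \beta_j x_j = \beta_j, \\
\frac{\lambda(x_j + 1)}{\lambda(x_j)}
  &= e^{\beta_j}.
\end{align*}
```

The offset $\ln(t_i)$ does not appear in this ratio, because it belongs to the count $\mu_i$ and not to the rate $\lambda_i$.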
Assumptions for Poisson regression
(Source: W.D. Dupont, MPH Program, Biostatistics II, April 30, 2010, 7: Introduction to Poisson regression)
The distribution of deaths in each time interval will be well approximated by a Poisson distribution if the following are true:
- Only one event can occur in each interval.
- Low event rates: the proportion of patients who have the event/disease of interest in each risk group should be small.
- The rate parameter λ is the same across all intervals.
- The time intervals are independent, i.e. the probability of observing an event in an interval does not depend on whether we observed event(s) in any other interval.
Of note, the denominators of the rates used in Poisson regression are often patient-years rather than patients. In fact, it depends on which rate we want to estimate (see the examples below).
Poisson versus survival analysis models
Poisson regression is a very useful tool when we need to estimate rates, e.g. when analyzing cohort studies.
If we have detailed individual-level data (accurate data on the follow-up time of each cohort participant), we can apply the more sophisticated approaches that have been developed in the field of survival analysis.
4. Examples
Define $S_i = \beta_1 x_{i1}$, where $x_{i1}$ is an indicator with 1 denoting that group $i$ consists of smokers and 0 otherwise.
Define $A_i = \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}$, where $x_{ij}$, $j = 2, \ldots, 5$, are indicators equal to 1 if group $i$ is in age class $j$ and 0 otherwise. The $\beta_j$, $j = 2, \ldots, 5$, then represent the effect of the age groups on the log rate of CHD deaths, relative to age class 1.
What we want to investigate:
- the effect of smoking on the CHD rate, and
- the effect of smoking on the CHD rate, having adjusted for age.
Model 1: smoking
$\ln(\mu_i) = \ln(t_i) + \beta_0 + S_i$
Fitting this model in Stata gives the following:

. xi: poisson deaths i.smokes, e(pyrs)
i.smokes   Ismoke_0-1   (naturally coded; Ismoke_0 omitted)
Iteration 0: log likelihood = -480.77391
Iteration 1: log likelihood = -480.52234
Iteration 2: log likelihood = -480.52206
Iteration 3: log likelihood = -480.52206
Poisson regression   Number of obs = 10
LR chi2(1) = 29.09
Prob > chi2 = 0.0000
Log likelihood = -480.52206   Pseudo R2 = 0.0294
------------------------------------------------------------------------------
deaths | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Ismoke_1 | .5422211 .1071834 5.059 0.000 .3321454 .7522968
_cons | -5.961822 .0995037 -59.916 0.000 -6.156845 -5.766798
pyrs | (exposure)
------------------------------------------------------------------------------

The interpretation of the parameter estimates is as follows.
- $\hat\beta_0$ (_cons): the estimated log rate for non-smokers.
- $\hat\beta_1$ (Ismoke_1): the estimated difference in log rates between smokers and non-smokers.
- $e^{\hat\beta_1}$: the estimated crude rate ratio of smokers versus non-smokers.
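Because Model 1 contains a single binary covariate, its estimates can be checked by hand from the crude totals quoted earlier (101 deaths in 39220 person-years for non-smokers, 630 in 142247 for smokers). The sketch below is such a cross-check, not the Stata fitting procedure itself; the standard-error formulas are the usual large-sample ones for a log rate and a log rate ratio.

```python
import math

d0, t0 = 101, 39220      # non-smokers: deaths, person-years
d1, t1 = 630, 142247     # smokers: deaths, person-years

b0 = math.log(d0 / t0)                 # log rate in non-smokers (_cons)
b1 = math.log((d1 / t1) / (d0 / t0))   # log rate ratio, smokers vs non-smokers (Ismoke_1)
se_b0 = math.sqrt(1 / d0)              # large-sample SE of a log rate
se_b1 = math.sqrt(1 / d1 + 1 / d0)     # large-sample SE of a log rate ratio

rate_ratio = math.exp(b1)
ci = (math.exp(b1 - 1.96 * se_b1), math.exp(b1 + 1.96 * se_b1))

print(round(b0, 4), round(se_b0, 4))
print(round(b1, 4), round(se_b1, 4))
print(round(rate_ratio, 2), tuple(round(x, 2) for x in ci))
```

The printed coefficients and standard errors should agree with the Coef. and Std. Err. columns of the output above to the displayed precision.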
Model 2: adjusting for age
$\ln(\mu_i) = \ln(t_i) + \beta_0 + S_i + A_i$
Fitting this model in Stata gives the following:

. xi: poisson deaths i.smokes i.agegrp, e(pyrs)
------------------------------------------------------------------------------
deaths | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Ismoke_1 | .3545356 .1073741 3.302 0.001 .1440862 .564985
Iagegr_2 | 1.484007 .1951034 7.606 0.000 1.101611 1.866403
Iagegr_3 | 2.627505 .1837273 14.301 0.000 2.267406 2.987604
Iagegr_4 | 3.350493 .1847992 18.130 0.000 2.988293 3.712693
Iagegr_5 | 3.700096 .1922195 19.249 0.000 3.323353 4.07684
_cons | -7.919326 .1917618 -41.298 0.000 -8.295172 -7.543479
pyrs | (exposure)
------------------------------------------------------------------------------

The age-adjusted rate ratio, comparing smokers with non-smokers, is $e^{0.3545} = 1.43$ (similar to M-H = 1.39), with 95% CI $(e^{0.3545 - 1.96 \times 0.1074},\ e^{0.3545 + 1.96 \times 0.1074}) = (1.15, 1.76)$.
The model assumes that the effects of smoking and age combine multiplicatively (i.e. there is no interaction between them).
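The rate ratio and its confidence interval follow directly from the smoking coefficient and its standard error. A minimal sketch of that arithmetic, using the two numbers from the Ismoke_1 row of the output:

```python
import math

coef = 0.3545356      # Ismoke_1 coefficient (log rate ratio)
se = 0.1073741        # its standard error
z = 1.96              # normal quantile for a 95% interval

rate_ratio = math.exp(coef)
lower = math.exp(coef - z * se)
upper = math.exp(coef + z * se)

print(f"RR = {rate_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around the rate ratio.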
5. Testing hypotheses in Poisson regression
1. The Wald test (given directly in the Stata output)
2. The likelihood ratio (LR) test
Testing for effect modification
- The Poisson model in its simple form assumes no interaction between the explanatory variables.
- But we can check whether the effect of the exposure (e.g. smoking) differs according to the levels of the potential confounder (e.g. age),
- using the LR test on nested models.
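For reference, the two test statistics can be written as follows. This is a sketch in standard notation that the slides themselves do not spell out: $\ell_1$ and $\ell_0$ (assumed symbols) are the maximised log likelihoods of the larger and the nested smaller model, and $q$ is the number of extra parameters in the larger model.

```latex
\begin{align*}
\text{Wald:}\quad z &= \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)} \;\sim\; N(0,1)
  \quad \text{under } H_0: \beta_j = 0, \\
\text{LR:}\quad \chi^2 &= 2\,(\ell_1 - \ell_0) \;\sim\; \chi^2_q
  \quad \text{under } H_0: \text{the } q \text{ extra parameters are all } 0 .
\end{align*}
```

For example, the smoking row of Model 2 gives $z = 0.3545356 / 0.1073741 \approx 3.30$, matching the z column of the Stata output.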
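The model in full
$\log(\lambda) = \alpha + \beta x + \gamma_1 z_1 + \gamma_2 z_2$

Log rates of outcome:
Stratum (Z) | Exposure X = 0 | Exposure X = 1
0 | $\alpha$ | $\alpha + \beta$
1 | $\alpha + \gamma_1$ | $\alpha + \gamma_1 + \beta$
2 | $\alpha + \gamma_2$ | $\alpha + \gamma_2 + \beta$

In multiplicative form (rates of outcome), with $\lambda_c = e^{\alpha}$, $\theta = e^{\beta}$, $\varphi_1 = e^{\gamma_1}$ and $\varphi_2 = e^{\gamma_2}$:
Stratum (Z) | Exposure X = 0 | Exposure X = 1
0 | $\lambda_c$ | $\lambda_c \theta$
1 | $\lambda_c \varphi_1$ | $\lambda_c \theta \varphi_1$
2 | $\lambda_c \varphi_2$ | $\lambda_c \theta \varphi_2$

What is the difference in the log rates between exposed and unexposed in stratum 0? In stratum 1? What do we assume here?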
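Interactions (effect modification)
$\log(\lambda) = \alpha + \beta x + \gamma_1 z_1 + \gamma_2 z_2 + \delta_1 (x z_1) + \delta_2 (x z_2)$

Log rates of outcome:
Stratum (Z) | Exposure X = 0 | Exposure X = 1
0 | $\alpha$ | $\alpha + \beta$
1 | $\alpha + \gamma_1$ | $\alpha + \gamma_1 + \beta + \delta_1$
2 | $\alpha + \gamma_2$ | $\alpha + \gamma_2 + \beta + \delta_2$

In multiplicative form, with $\rho_1 = e^{\delta_1}$ and $\rho_2 = e^{\delta_2}$:
Stratum (Z) | Exposure X = 0 | Exposure X = 1
0 | $\lambda_c$ | $\lambda_c \theta$
1 | $\lambda_c \varphi_1$ | $\lambda_c \theta \varphi_1 \rho_1$
2 | $\lambda_c \varphi_2$ | $\lambda_c \theta \varphi_2 \rho_2$

What is the difference in the log rates between exposed and unexposed in stratum 0? In stratum 1? What do we assume here?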
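Reading the exposed and unexposed log rates off the interaction model stratum by stratum gives the stratum-specific rate ratios. The short derivation below uses only the symbols already defined; $RR_{Z=k}$ is shorthand introduced here for the exposed-versus-unexposed rate ratio in stratum $k$.

```latex
\begin{align*}
RR_{Z=0} &= e^{(\alpha + \beta) - \alpha} = e^{\beta} = \theta, \\
RR_{Z=1} &= e^{(\alpha + \gamma_1 + \beta + \delta_1) - (\alpha + \gamma_1)}
          = e^{\beta + \delta_1} = \theta \rho_1, \\
RR_{Z=2} &= e^{(\alpha + \gamma_2 + \beta + \delta_2) - (\alpha + \gamma_2)}
          = e^{\beta + \delta_2} = \theta \rho_2 .
\end{align*}
```

Setting $\delta_1 = \delta_2 = 0$ (that is, $\rho_1 = \rho_2 = 1$) returns the model without interaction, in which the rate ratio equals $\theta$ in every stratum.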
Poisson regression - Stratification
Remember the Whitehall Study, with grade of work as the exposure and age as the confounder.

Model with interaction (A):
xi: poisson deaths i.grade*agegrp, e(pyrs) irr
est store A
(output columns: Deaths, RR, SE, z, p-value, 95% CI)

Test for interaction, using the model without interaction (B):
xi: poisson deaths i.grade i.agegrp, e(pyrs)
est store B
lrtest A B

likelihood-ratio test   chi2(5) = 10.43
(Assumption: B nested in A)   Prob > chi2 = 0.0640
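The reported p-value can be cross-checked from the test statistic alone. The sketch below uses a closed-form expression for the upper tail of a chi-square distribution that holds for odd degrees of freedom; the helper name `chi2_sf_odd_df` is hypothetical and this is not how Stata computes the value internally.

```python
import math

def chi2_sf_odd_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    root = math.sqrt(x)
    normal_tail = math.erfc(root / math.sqrt(2))       # equals 2 * P(Z > sqrt(x))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal density at sqrt(x)
    total, term = 0.0, root
    for r in range(1, (df - 1) // 2 + 1):
        total += term              # adds x**(r - 1/2) / (1 * 3 * ... * (2r - 1))
        term *= x / (2 * r + 1)
    return normal_tail + 2 * density * total

p_value = chi2_sf_odd_df(10.43, 5)
print(round(p_value, 4))
```

Small differences in the last digit relative to the Stata output are expected, because the statistic 10.43 is itself rounded.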
Quantitative exposure in Poisson regression
- If the explanatory variable is quantitative (or ordered), we can enter it as a single term in the Poisson model and test for linearity.
- This is a stronger assumption than treating the variable as categorical, since it uses fewer parameters: it assumes that the log rate changes linearly across the explanatory variable.
- Caution: this is a strong assumption, since very few relationships are exactly log-linear.
- There is only one parameter in the model for this variable: the change in log rate per unit of the variable, e.g. $\log \lambda = \beta_0 + \beta_1\,\text{Age}$.
Quantitative exposure in Poisson regression (cont.)
The LRT compares:
- Model 1: age categorical, no assumption about how the log rate changes
- Model 2: age quantitative, assumes the log rate changes linearly
H0: the association is log-linear vs. H1: the association is not log-linear
Quantitative exposure in Poisson regression (cont.)
If the exposure is quantitative, the interpretation is "change in log rate per unit change".
Imagine age as the exposure: how is it coded?
- If the age groups are 5-year groups coded as 1, 2, 3, 4, …, then a jump of one unit represents a 5-year jump.
- If the values of the categories are 40, 45, 50, 55, …, then a jump of one unit represents a 1-year jump.
Quantitative exposure in Poisson regression (cont.)
In Stata, e.g. with age coded 40, 50, 55, 60, 65, 70, 75, 80 (agegrp):
xi: poisson deaths i.grade agegrp, e(pyrs)
------------------------------------------------------------------------------
Deaths | Rate ratio SE z p 95% CI
-------------+----------------------------------------------------------------
Igrade_2 | 1.394941 .2350773 1.975 0.048 1.00256 1.940892
Agegrp | 1.089857 .0112403 8.343 0.000 1.068048 1.112112
------------------------------------------------------------------------------
1 unit jump = 1 year in age.
The rate is multiplied by 1.089 (about 9% higher) for each 1-year increase in age.
Approximately $1.089^5 = 1.53$ over 5 years.
Quantitative exposure in Poisson regression (cont.)
If the same age groups were coded as 1, 2, 3, 4, … (agegrp):
xi: poisson deaths i.grade agegrp, e(pyrs)
------------------------------------------------------------------------------
Deaths | Rate ratio SE z p 95% CI
-------------+----------------------------------------------------------------
_Igrade_2 | 1.396122 .2355799 1.98 0.048 1.002981 1.943364
Agegrp | 1.552698 .0809653 8.44 0.000 1.401848 1.719779
------------------------------------------------------------------------------
1 unit jump = 5 years in age.
The rate is multiplied by 1.553 for each 5-year increase in age.
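The two codings can be put on a common scale by raising the rate ratio to the appropriate power, since effects multiply on the rate scale. A minimal sketch with the two Agegrp estimates from the outputs above:

```python
rr_per_year = 1.089857       # Agegrp estimate when one unit = 1 year
rr_per_5_years = 1.552698    # Agegrp estimate when one unit = 5 years

implied_5_year = rr_per_year ** 5           # five 1-year steps multiplied together
implied_1_year = rr_per_5_years ** (1 / 5)  # one fifth of a 5-year step, on the log scale

print(round(implied_5_year, 3), round(implied_1_year, 3))
```

The two fits are not expected to agree exactly, because the first coding does not space all age groups equally, but both describe roughly a 9% increase in the rate per year of age.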
Exercise
Consider an epidemiological cohort study that aims to estimate the effect of coffee consumption on the development of liver cancer. Let:
- $Y_i$ be the follow-up time for each individual $i$ of the cohort
- $\lambda_i$ be the rate of development of liver cancer for individual $i$.
Also consider alcohol consumption as a potential confounder.
Exercise
Suppose the analysis is carried out with the following Poisson model:
$\log(\lambda) = \alpha + \beta X + \gamma_1 Z_1 + \gamma_2 Z_2 + \delta_1 (X Z_1) + \delta_2 (X Z_2)$, where:
- X = coffee consumption at study entry: 0 = < 1 cup per day; 1 = > 1 cup per day,
- Z = daily alcohol consumption at study entry: 0 = low; 1 = moderate; 2 = high, and
- $Z_1$, $Z_2$ are indicator (dummy) variables for the factor "alcohol consumption", with $Z_1$ (1 = moderate daily alcohol consumption, 0 = otherwise) and $Z_2$ (1 = high daily alcohol consumption, 0 = otherwise).
Questions:
- Give the estimate of the rate of development of liver cancer in each of the categories of the factor alcohol consumption.
- What does the above model assume about the role of alcohol consumption, as regards its effect on the estimated relative risk of developing liver cancer in relation to coffee consumption?
More on the offset
As we mentioned, Poisson regression may also be appropriate for rate data, where the rate is a count of events divided by some measure of exposure for the particular unit of observation.
Examples:
- Biologists may count the number of tree species in a forest: events would be tree observations, exposure would be unit area, and the rate would be the number of species per unit area.
- Demographers may model death rates in geographic areas as the count of deaths divided by person-years.
- Event rates can be calculated as events per unit time, which allows the observation window to vary for each unit.
More on the offset
In these examples, the exposure is respectively unit area, person-years and unit time.
This is handled as an offset: the exposure variable enters on the right-hand side of the equation, but with the parameter estimate for log(exposure) constrained to 1.
Using the offset is a way of accounting for population sizes, which could vary not only by time but also by age, region, area etc.
More on the offset
Because the offset variable is required to have a coefficient of 1, it can be treated as part of the rate: log(exposure) can in theory be moved from one side of the equation to the other, turning the model for the rate into a model for the count and back.
By defining an offset variable, we are only adjusting for the amount of opportunity an event has to occur.
More on the offset
Let's consider individuals in a rehab center.
Including time in the offset means that we assume that every day in rehab makes a patient equally likely to have an aggressive incident. Each day is simply an opportunity for an incident: a patient in for 20 days is expected to have twice as many incidents as a patient in for 10 days.
We assume that the likelihood of events does not change over time (λ is constant for all time intervals).
If, for example, it takes patients a few weeks to learn the consequences of aggressive behavior, after which their rates stop or lessen, then time is not just a matter of exposure. Likewise, if patients start becoming more agitated after a few months in a program, so that longer residence time actually creates more aggression, then time is not just a matter of exposure.
In either of these cases, the number of days in the program would serve better as a predictor than as an exposure variable. As a predictor, its coefficient will be estimated from the data, not set to 1.
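The difference between treating time as an exposure and treating it as a predictor can be sketched numerically. All numbers below are hypothetical, and entering time as log(days) in the predictor version is an assumption made here so that the offset model appears as the special case with coefficient 1.

```python
import math

def expected_count_offset(days: float, rate_per_day: float) -> float:
    """Offset model: ln(mu) = ln(days) + ln(rate); the coefficient of ln(days) is fixed at 1."""
    return math.exp(math.log(days) + math.log(rate_per_day))

def expected_count_predictor(days: float, b0: float, b1: float) -> float:
    """Predictor model: ln(mu) = b0 + b1 * ln(days); b1 would be estimated from the data."""
    return math.exp(b0 + b1 * math.log(days))

rate = 0.05                    # hypothetical incidents per patient-day
b0, b1 = math.log(rate), 0.5   # hypothetical estimates; b1 < 1 means incidents grow slower than time

ratio_offset = expected_count_offset(20, rate) / expected_count_offset(10, rate)
ratio_predictor = expected_count_predictor(20, b0, b1) / expected_count_predictor(10, b0, b1)

print(round(ratio_offset, 3), round(ratio_predictor, 3))
```

With the offset, doubling the stay doubles the expected count; with an estimated coefficient, the ratio is 2 raised to that coefficient, so it can be smaller or larger than 2.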
More on the offset
Let's consider children in the first grade.
Including time in the offset means that we assume that every day in school makes a child equally likely to learn one new word. Each day is simply an opportunity for a new word to be learned: a child is expected to learn twice as many new words in the first 20 days as in the first 10 days.
We assume that the likelihood of learning new words does not change over time (λ is constant for all time intervals).
If, for example, it takes children a few weeks to get used to the new environment, after which the number of words learned increases, then time is not just a matter of exposure. Similarly, if children start learning other things (e.g. phrases), so that the number of words learned decreases, then time is not just a matter of exposure.
In either of these cases, time in school would serve better as a predictor than as an exposure variable. As a predictor, its coefficient will be estimated from the data, not set to 1.
Study of forest density
Variables: calendar period, region, altitude, number of trees (N) and area
Question: taking region and altitude into account, has forest density in Greece decreased?

Each cell shows N / area (km2).
Calendar period | Peloponnese, low altitude | Peloponnese, moderate | Peloponnese, high | Sterea Ellada, low altitude | Sterea Ellada, moderate | Sterea Ellada, high
1961-70 | 3950 / 5 | 4200 / 5 | 4450 / 7 | 4700 / 7 | 4950 / 8 | 5200 / 3
1971-80 | 2450 / 3 | 2700 / 5 | 2950 / 7 | 3200 / 5 | 3450 / 2 | 3700 / 9
1981-90 | 1000 / 4 | 1100 / 5 | 1500 / 7 | 1700 / 8 | 1950 / 9 | 2200 / 2
Example
+-----------------------------------------------------+
| cyear region altitude N Surface |
|-----------------------------------------------------|
1. | 1960-1970 Peloponisos low 3950 5 |
2. | 1960-1970 Peloponisos moderate 4200 5 |
3. | 1960-1970 Peloponisos high 4450 7 |
4. | 1960-1970 Sterea low 4700 7 |
5. | 1960-1970 Sterea moderate 4950 8 |
6. | 1960-1970 Sterea high 5200 3 |
7. | 1971-1980 Peloponisos low 2450 3 |
8. | 1971-1980 Peloponisos moderate 2700 5 |
9. | 1971-1980 Peloponisos high 2950 7 |
10. | 1971-1980 Sterea low 3200 5 |
11. | 1971-1980 Sterea moderate 3450 2 |
12. | 1971-1980 Sterea high 3700 9 |
13. | 1981-1990 Peloponisos low 1000 4 |
14. | 1981-1990 Peloponisos moderate 1100 5 |
15. | 1981-1990 Peloponisos high 1500 7 |
16. | 1981-1990 Sterea low 1700 8 |
17. | 1981-1990 Sterea moderate 1950 9 |
18. | 1981-1990 Sterea high 2200 2 |
+-----------------------------------------------------+
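Before fitting a model, the crude density (trees per km2) in each calendar period gives a first impression of the trend. The sketch below simply sums N and Surface over the six region-by-altitude cells of each period in the listing; it ignores region and altitude, so it is a descriptive check and not the adjusted analysis.

```python
# (period, N, Surface in km2) for the 18 rows of the listing, in order
data = [
    ("1960-1970", 3950, 5), ("1960-1970", 4200, 5), ("1960-1970", 4450, 7),
    ("1960-1970", 4700, 7), ("1960-1970", 4950, 8), ("1960-1970", 5200, 3),
    ("1971-1980", 2450, 3), ("1971-1980", 2700, 5), ("1971-1980", 2950, 7),
    ("1971-1980", 3200, 5), ("1971-1980", 3450, 2), ("1971-1980", 3700, 9),
    ("1981-1990", 1000, 4), ("1981-1990", 1100, 5), ("1981-1990", 1500, 7),
    ("1981-1990", 1700, 8), ("1981-1990", 1950, 9), ("1981-1990", 2200, 2),
]

totals = {}
for period, n_trees, area in data:
    trees, surface = totals.get(period, (0, 0))
    totals[period] = (trees + n_trees, surface + area)

crude_density = {p: trees / surface for p, (trees, surface) in totals.items()}
for period, density in crude_density.items():
    print(period, round(density, 1))
```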
Example - The model
. poisson N i.cyear i.region i.altitude, e(Surface)
Poisson regression   Number of obs = 18
LR chi2(5) = 9766.87
Prob > chi2 = 0.0000
Log likelihood = -5113.6953   Pseudo R2 = 0.4885
------------------------------------------------------------------------------
N | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
cyear |
1971-1980 | -.2859798 .0098027 -29.17 0.000 -.3051928 -.2667668
1981-1990 | -1.070379 .011933 -89.70 0.000 -1.093767 -1.04699
|
region |
Sterea | .1747286 .0086865 20.11 0.000 .1577034 .1917539
|
altitude |
moderate | .0460915 .0106708 4.32 0.000 .0251771 .067006
high | .0698382 .0107716 6.48 0.000 .0487262 .0909501
|
_cons | 6.53408 .0102482 637.58 0.000 6.513994 6.554166
ln(Surface) | 1 (exposure)
------------------------------------------------------------------------------
Example - Interpretation
- $(e^{0.175} - 1) \times 100 = 19.1\%$ more trees per square km in Sterea compared to Peloponisos
- $(1 - e^{-0.286}) \times 100 = 24.9\%$ fewer trees per square km in 1971-80 compared to 1960-70
- $(1 - e^{-1.07}) \times 100 = 65.7\%$ fewer trees per square km in 1981-90 compared to 1960-70
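The percentages above come from exponentiating the coefficients. A small sketch of that conversion, with the coefficient values copied from the model output (the helper name `percent_change` is hypothetical):

```python
import math

def percent_change(coef: float) -> float:
    """Percent change in the rate implied by a log-rate coefficient."""
    return (math.exp(coef) - 1) * 100

coefficients = {
    "Sterea vs Peloponisos": 0.1747286,
    "1971-1980 vs 1960-1970": -0.2859798,
    "1981-1990 vs 1960-1970": -1.070379,
}

changes = {name: percent_change(b) for name, b in coefficients.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}% trees per square km")
```

A negative sign corresponds to "fewer trees per square km"; the same function applies to the altitude coefficients.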
Example 2
Variables: years since diagnosis of osteoporosis, gender, physical activity (PA), number of fractures (n) and number of individuals

Each cell shows n / individuals.
Years since diagnosis | Men, low PA | Men, moderate PA | Men, high PA | Women, low PA | Women, moderate PA | Women, high PA
1-3.0 | 25 / 1700 | 19 / 1950 | 11 / 2200 | 30 / 1000 | 22 / 1100 | 15 / 1500
3.1-5.0 | 208 / 3200 | 207 / 3450 | 203 / 3700 | 196 / 2450 | 189 / 2700 | 177 / 2950
5.1-7.0 | 541 / 4700 | 544 / 4950 | 546 / 5200 | 513 / 3950 | 504 / 4200 | 489 / 4450
The dataset
+------------------------------------------------------------+
| group time gender pactiv~y rate N fractu~s |
|------------------------------------------------------------|
1. | 1 1-3.0 Woman low .03 1000 30 |
2. | 2 1-3.0 Woman moderate .02 1100 22 |
3. | 3 1-3.0 Woman high .01 1500 15 |
4. | 4 1-3.0 Man low .015 1700 25 |
5. | 5 1-3.0 Man moderate .01 1950 19 |
6. | 6 1-3.0 Man high .005 2200 11 |
7. | 7 3.1-5 Woman low .08 2450 196 |
8. | 8 3.1-5 Woman moderate .07 2700 189 |
9. | 9 3.1-5 Woman high .06 2950 177 |
10. | 10 3.1-5 Man low .065 3200 208 |
11. | 11 3.1-5 Man moderate .06 3450 207 |
12. | 12 3.1-5 Man high .055 3700 203 |
13. | 13 5.1-7 Woman low .13 3950 513 |
14. | 14 5.1-7 Woman moderate .12 4200 504 |
15. | 15 5.1-7 Woman high .11 4450 489 |
16. | 16 5.1-7 Man low .115 4700 541 |
17. | 17 5.1-7 Man moderate .11 4950 544 |
18. | 18 5.1-7 Man high .105 5200 546 |
+------------------------------------------------------------+
The model
. poisson fract i.time i.gender i.pact, e(N)
Poisson regression   Number of obs = 18
LR chi2(5) = 1281.32
Prob > chi2 = 0.0000
Log likelihood = -75.164546   Pseudo R2 = 0.8950
------------------------------------------------------------------------------
fractures | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time |
3.1-5 | 1.588463 .0951235 16.70 0.000 1.402025 1.774902
5.1-7 | 2.16491 .0923208 23.45 0.000 1.983964 2.345855
|
gender |
Man | -.1191379 .0300545 -3.96 0.000 -.1780435 -.0602322
|
pactivity |
moderate | -.0844677 .0365294 -2.31 0.021 -.1560641 -.0128713
high | -.1789299 .0368145 -4.86 0.000 -.2510849 -.1067748
|
_cons | -4.182919 .0948143 -44.12 0.000 -4.368752 -3.997086
ln(N) | 1 (exposure)
------------------------------------------------------------------------------
Example 2 - Interpretation
- $e^{1.59} = 4.9$: 4.9 times as many bone fractures per individual for people diagnosed 3.1-5 years ago, compared to those diagnosed 1-3 years ago
- $e^{2.16} = 8.7$: 8.7 times as many bone fractures per individual for people diagnosed 5.1-7 years ago, compared to those diagnosed 1-3 years ago
- $(e^{-0.18} - 1) \times 100$: 16.4% fewer bone fractures per individual for those with high physical activity, compared to those with low physical activity
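The same interpretation can be given with confidence intervals by exponentiating the interval bounds reported in the model output. The sketch below does this for the three contrasts discussed; the labels are descriptive names chosen here, and the numbers are copied from the Stata table.

```python
import math

# (coefficient, lower 95% bound, upper 95% bound) on the log-rate scale
log_estimates = {
    "time 3.1-5 vs 1-3.0": (1.588463, 1.402025, 1.774902),
    "time 5.1-7 vs 1-3.0": (2.16491, 1.983964, 2.345855),
    "high vs low activity": (-0.1789299, -0.2510849, -0.1067748),
}

rate_ratios = {
    name: tuple(math.exp(value) for value in values)
    for name, values in log_estimates.items()
}

for name, (rr, lower, upper) in rate_ratios.items():
    print(f"{name}: RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A rate ratio below 1, as for high versus low physical activity, corresponds to a percentage reduction of (1 - RR) x 100.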