3.2 Swine Disease Breakout Data

The swine disease data includes 120 simulated survey questions from 800 farms. There are three choices for each question. The outbreak status for the $i$th farm is generated from a $Bernoulli(1, p_i)$ distribution with $p_i$ being a function of the question answers:

$$\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{g=1}^{G} \boldsymbol{x}_{i,g}^T \boldsymbol{\beta}_g$$

where $\beta_0$ is the intercept, $\boldsymbol{x}_{i,g}$ is a three-dimensional indicator vector for the answer to question $g$, and $\boldsymbol{\beta}_g$ is the parameter vector corresponding to the $g$th predictor. Three types of questions are considered regarding their effects on the outcome. The first forty survey questions are important questions such that the coefficients of the three answers to these questions are all different:
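As a small illustration of the indicator-vector notation, the sketch below computes one question's contribution to the log-odds and converts the result to a probability. The intercept and coefficient values here are made up for illustration and are not the simulation settings.

```r
# Hypothetical single question with three possible answers
answers <- c("A", "B", "C")

# Three-dimensional indicator vector for a farm that chose answer B
x_ig <- as.numeric(answers == "B")

# Illustrative values only (not the settings used in the simulation)
beta0  <- 0.3
beta_g <- c(0.5, -0.2, 0.1)

# x_ig' beta_g picks out the coefficient of the chosen answer
contribution <- sum(x_ig * beta_g)

# Inverse logit turns the log-odds into an outbreak probability
p_i <- plogis(beta0 + contribution)
```

Because exactly one entry of the indicator vector is 1, the inner product reduces to a lookup of the chosen answer's coefficient.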

$$\boldsymbol{\beta}_g = (1, 0, -1) \times \gamma, \quad g = 1, \ldots, 40$$

The second forty survey questions are also important questions but only one answer has a coefficient that is different from the other two answers:

$$\boldsymbol{\beta}_g = (1, 0, 0) \times \gamma, \quad g = 41, \ldots, 80$$

The last forty survey questions are unimportant questions such that all three answers have the same coefficients:

$$\boldsymbol{\beta}_g = (0, 0, 0) \times \gamma, \quad g = 81, \ldots, 120$$

The baseline coefficient $\beta_0$ is set to $-\frac{40}{3}\gamma$ so that on average a farm has a 50% chance of having an outbreak. The parameter $\gamma$ controls the strength of the questions' effect on the outcome. In this simulation study, we consider the situations where $\gamma = 0.1, 0.25, 0.5, 1, 2$. So the parameter settings are:
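The choice of intercept can be checked with a short calculation. This is a sketch that assumes the three answers to every question are equally likely, which this section does not state explicitly; it uses the coefficient vectors $(1,0,-1)\gamma$, $(1,0,0)\gamma$ and $(0,0,0)\gamma$ for the three groups of questions.

```latex
\begin{aligned}
E\left[\boldsymbol{x}_{i,g}^T \boldsymbol{\beta}_g\right]
  &= \tfrac{1}{3}(1 + 0 - 1)\,\gamma = 0, & g &= 1, \ldots, 40 \\
E\left[\boldsymbol{x}_{i,g}^T \boldsymbol{\beta}_g\right]
  &= \tfrac{1}{3}(1 + 0 + 0)\,\gamma = \tfrac{\gamma}{3}, & g &= 41, \ldots, 80 \\
E\left[\boldsymbol{x}_{i,g}^T \boldsymbol{\beta}_g\right]
  &= 0, & g &= 81, \ldots, 120 \\
E\left[\ln\left(\frac{p_i}{1 - p_i}\right)\right]
  &= \beta_0 + 40 \cdot 0 + 40 \cdot \tfrac{\gamma}{3} + 40 \cdot 0
   = \beta_0 + \tfrac{40}{3}\gamma
\end{aligned}
```

Setting the expected log-odds to zero gives $\beta_0 = -\frac{40}{3}\gamma$, and a log-odds of zero corresponds to $p_i = 0.5$. Since the logit is nonlinear, the average outbreak probability is only approximately 50% under this argument.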

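$$\boldsymbol{\beta}^T = \left(-\frac{40}{3}, \underbrace{1, 0, -1}_{\text{question 1}}, \ldots, \underbrace{1, 0, 0}_{\text{question 41}}, \ldots, \underbrace{0, 0, 0}_{\text{question 81}}, \ldots, \underbrace{0, 0, 0}_{\text{question 120}}\right) \times \gamma$$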
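The design above can be sketched in a few lines of R. This is a minimal illustration, not the code from the Appendix: the equal answer probabilities, the seed, and all variable names are assumptions made here.

```r
set.seed(2024)   # arbitrary seed, for reproducibility only

n_farm    <- 800 # number of farms, as stated in the text
n_q       <- 120 # number of survey questions
gamma_val <- 2   # effect strength (one of the values considered)

# Assumption: each farm picks answer 1, 2 or 3 with equal probability
answers <- matrix(sample(1:3, n_farm * n_q, replace = TRUE),
                  nrow = n_farm, ncol = n_q)

# Coefficient of each answer (rows) for each question (columns)
beta <- cbind(
  matrix(c(1, 0, -1), nrow = 3, ncol = 40),  # questions 1-40
  matrix(c(1, 0,  0), nrow = 3, ncol = 40),  # questions 41-80
  matrix(c(0, 0,  0), nrow = 3, ncol = 40)   # questions 81-120
) * gamma_val
beta0 <- -40 / 3 * gamma_val

# Look up the coefficient of every chosen answer, then sum per farm
chosen <- beta[cbind(as.vector(answers), as.vector(col(answers)))]
eta    <- beta0 + rowSums(matrix(chosen, nrow = n_farm))

# One Bernoulli draw per farm with probability given by the inverse logit
p <- plogis(eta)
y <- rbinom(n_farm, size = 1, prob = p)
```

Indexing `beta` with a two-column matrix of (answer, question) pairs plays the role of the inner product with the indicator vector, so no dummy variables need to be built for the simulation itself.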

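For each value of $\gamma$, 20 data sets are simulated. The larger $\gamma$ is, the larger the nonzero coefficients are in absolute value, and the stronger the effect of the important questions on the outcome. We provide the data simulated with $\gamma = 2$. Let's check the data: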

disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
# only show the last 7 columns here
head(subset(disease_dat,select=c("Q118.A","Q118.B","Q119.A",
                                 "Q119.B","Q120.A","Q120.B","y"))) 
##   Q118.A Q118.B Q119.A Q119.B Q120.A Q120.B y
## 1      1      0      0      0      0      1 1
## 2      0      1      0      1      0      0 1
## 3      1      0      0      0      1      0 1
## 4      1      0      0      0      0      1 1
## 5      1      0      0      0      1      0 0
## 6      1      0      0      1      1      0 1

Here y indicates the outbreak status of each farm: y = 1 means there is an outbreak within 5 years after the survey. The remaining columns indicate survey responses. For example, Q120.A = 1 means the respondent chose A in Q120. We consider C as the baseline, so a respondent who chose C has 0 in both the A and B columns of that question.
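To make the baseline coding concrete, the sketch below rebuilds the six rows printed above by hand and recovers the implied indicator for answer C. The object names `dat`, `Q120.C` and `q120_answer` are introduced here for illustration only.

```r
# First six rows of the last columns, copied from the printed output
dat <- data.frame(
  Q118.A = c(1, 0, 1, 1, 1, 1),
  Q118.B = c(0, 1, 0, 0, 0, 0),
  Q119.A = c(0, 0, 0, 0, 0, 0),
  Q119.B = c(0, 1, 0, 0, 0, 1),
  Q120.A = c(0, 0, 1, 0, 1, 1),
  Q120.B = c(1, 0, 0, 1, 0, 0),
  y      = c(1, 1, 1, 1, 0, 1)
)

# With C as the baseline, C was chosen exactly when A and B are both 0
Q120.C <- 1 - dat$Q120.A - dat$Q120.B

# Recover the answer letter for Q120 from the two dummy columns
q120_answer <- ifelse(dat$Q120.A == 1, "A",
                      ifelse(dat$Q120.B == 1, "B", "C"))
```

The same subtraction applies to any other question, for example `1 - dat$Q119.A - dat$Q119.B` for Q119.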

Refer to the Appendix for the simulation code.