3.2 Swine Disease Breakout Data
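The swine disease data includes 120 simulated survey questions from 800 farms. There are three choices for each question. The outbreak status for the $i$-th farm is generated from a $\mathrm{Bernoulli}(p_i)$ distribution with $p_i$ being a function of the question answers:
$$\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{g=1}^{G} \mathbf{x}_{i,g}^{T} \boldsymbol{\beta}_g$$
where $\beta_0$ is the intercept, $\mathbf{x}_{i,g}$ is a three-dimensional indicator vector for the answer to question $g$, and $\boldsymbol{\beta}_g$ is the parameter vector corresponding to the $g$-th predictor. Three types of questions are considered regarding their effects on the outcome. The first forty survey questions are important questions such that the coefficients of the three answers to these questions are all different:
$$\boldsymbol{\beta}_g = (1, 0, -1) \times \gamma, \quad g = 1, \ldots, 40$$
The second forty survey questions are also important questions, but only one answer has a coefficient that is different from the other two answers:
$$\boldsymbol{\beta}_g = (1, 0, 0) \times \gamma, \quad g = 41, \ldots, 80$$
The last forty survey questions are unimportant questions such that all three answers have the same coefficients:
$$\boldsymbol{\beta}_g = (0, 0, 0) \times \gamma, \quad g = 81, \ldots, 120$$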
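The baseline coefficient $\beta_0$ is set to $-\frac{40}{3}\gamma$ so that on average a farm has a 50% chance of having an outbreak. The parameter $\gamma$ in the above simulation controls the strength of the questions' effect on the outcome. In this simulation study, we consider the situations where $\gamma = 0.1, 0.25, 0.5, 1, 1.25, 1.5, 1.75, 2$. So the parameter settings are:
$$\boldsymbol{\beta}^{T} = \left(-\frac{40}{3}, \underbrace{1, 0, -1}_{\text{question 1}}, \ldots, \underbrace{1, 0, -1}_{\text{question 40}}, \underbrace{1, 0, 0}_{\text{question 41}}, \ldots, \underbrace{1, 0, 0}_{\text{question 80}}, \underbrace{0, 0, 0}_{\text{question 81}}, \ldots, \underbrace{0, 0, 0}_{\text{question 120}}\right) \times \gamma$$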
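The data-generating process described above can be sketched in a few lines of R. This is a minimal illustration, not the author's simulation code (which is in the Appendix). It assumes each answer is chosen with equal probability, that the answer coefficients follow the patterns (1, 0, -1), (1, 0, 0) and (0, 0, 0) scaled by the strength parameter, and that the intercept is -40/3 times that parameter. All object names (`answers`, `beta`, `eta`, and so on) are made up for the example.

```r
# Sketch: simulate one data set under the logistic model described above.
# The seed, gamma value, and all object names are illustrative assumptions.
set.seed(2024)
n_farm <- 800
n_q <- 120
gamma <- 2

# Assumption: each farm picks A, B, or C with equal probability per question
answers <- matrix(sample(c("A", "B", "C"), n_farm * n_q, replace = TRUE),
                  nrow = n_farm, ncol = n_q)

# Coefficients for answers (A, B, C), one row per question
beta <- rbind(
  matrix(rep(c(1, 0, -1), 40), ncol = 3, byrow = TRUE),  # Q1-Q40: all differ
  matrix(rep(c(1, 0, 0), 40), ncol = 3, byrow = TRUE),   # Q41-Q80: one differs
  matrix(0, nrow = 40, ncol = 3)                         # Q81-Q120: no effect
) * gamma
colnames(beta) <- c("A", "B", "C")
beta0 <- -40 / 3 * gamma

# Linear predictor: intercept plus the coefficient of each chosen answer
eta <- beta0 + sapply(seq_len(n_farm), function(i) {
  sum(beta[cbind(seq_len(n_q), match(answers[i, ], colnames(beta)))])
})

p <- plogis(eta)                          # inverse logit gives outbreak probability
y <- rbinom(n_farm, size = 1, prob = p)   # Bernoulli outbreak indicator

mean(p)  # should be near 0.5 if the intercept centers the linear predictor
```

Under these assumptions the first forty questions contribute zero on average, and the second forty contribute 40/3 times the strength parameter on average, which the intercept cancels. That is why the average outbreak probability sits near 50%.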
For each value of $\gamma$, 20 data sets are simulated. The bigger $\gamma$ is, the larger the corresponding parameters. We provide the data sets with $\gamma = 2$. Let's check the data:
disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
# only show the last 7 columns here
head(subset(disease_dat, select = c("Q118.A", "Q118.B", "Q119.A",
                                    "Q119.B", "Q120.A", "Q120.B", "y")))
##   Q118.A Q118.B Q119.A Q119.B Q120.A Q120.B y
## 1      1      0      0      0      0      1 1
## 2      0      1      0      1      0      0 1
## 3      1      0      0      0      1      0 1
## 4      1      0      0      0      0      1 1
## 5      1      0      0      0      1      0 0
## 6      1      0      0      1      1      0 1
Here y indicates the outbreak status of the farms. y = 1 means there is an outbreak within 5 years after the survey. The remaining columns indicate survey responses. For example, Q120.A = 1 means the respondent chose A in Q120. We consider C as the baseline.
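To make the baseline coding concrete, the sketch below recovers the chosen answer for Q120 from its two dummy columns. The toy data frame copies the Q120.A and Q120.B values from the six rows printed above. The helper `decode_answer` is a hypothetical name introduced only for this illustration.

```r
# Toy data: Q120 dummy columns copied from the six rows shown above
toy <- data.frame(
  Q120.A = c(0, 0, 1, 0, 1, 1),
  Q120.B = c(1, 0, 0, 1, 0, 0)
)

# Hypothetical helper: A or B if the matching dummy is 1, otherwise baseline C
decode_answer <- function(a, b) {
  ifelse(a == 1, "A", ifelse(b == 1, "B", "C"))
}

toy$Q120 <- decode_answer(toy$Q120.A, toy$Q120.B)
toy
```

A row with both dummies equal to 0, such as the second farm, corresponds to answer C. This is why no Q120.C column is needed.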
Refer to the Appendix for the simulation code.