3.2 Swine Disease Breakout Data

The swine disease data contain simulated responses to 120 survey questions from 800 farms. Each question has three possible answers. The outbreak status of the $i^{th}$ farm is generated from a $Bernoulli(1, p_{i})$ distribution, where $p_{i}$ is a function of the question answers:

$\ln \left( \frac{p_{i}}{1 - p_{i}} \right) = β_{0} + \sum_{g = 1}^{G} \symbf x_{i, g}^{T} \symbf β_{g}$

where $β_{0}$ is the intercept, $\symbf x_{i, g}$ is a three-dimensional indicator vector for the answer to question $g$, and $\symbf β_{g}$ is the parameter vector corresponding to the $g^{th}$ predictor. Three types of questions are considered regarding their effects on the outcome. The first forty survey questions are important questions such that the coefficients of the three answers to these questions are all different:

$\symbf β_{g} = (1, 0, - 1) \times γ, g = 1, \dots, 40$

The second forty survey questions are also important questions but only one answer has a coefficient that is different from the other two answers:

$\symbf β_{g} = (1, 0, 0) \times γ, g = 41, \dots, 80$

The last forty survey questions are unimportant questions such that all three answers have the same coefficient:

$\symbf β_{g} = (0, 0, 0) \times γ, g = 81, \dots, 120$
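To make the encoding concrete, here is a minimal sketch that evaluates the model for one toy farm answering a single question of each type. The names (`to_indicator`, `answers`), the value `gamma <- 0.5`, the placeholder intercept of 0, and the assumption that the three coefficients correspond to answers A, B, C in that order are all illustrative choices, not taken from the simulation code.

```r
# Toy illustration only: one question of each type, not the full 120-question design.
gamma <- 0.5   # illustrative effect strength
beta0 <- 0     # placeholder intercept for this toy case

# Coefficient vectors for the three question types (assumed order: A, B, C)
beta <- list(
  type1 = c(1, 0, -1) * gamma,  # all three answers differ
  type2 = c(1, 0, 0) * gamma,   # only one answer differs
  type3 = c(0, 0, 0) * gamma    # no effect
)

# Turn an answer label into a three-dimensional indicator vector
to_indicator <- function(answer) as.numeric(c("A", "B", "C") == answer)

answers <- c("A", "A", "B")       # this farm's answers to the three questions
x <- lapply(answers, to_indicator)

# Linear predictor: intercept plus the sum of x_g' beta_g over questions
eta <- beta0 + sum(mapply(function(x_g, beta_g) sum(x_g * beta_g), x, beta))

# Inverse logit gives the outbreak probability
p <- plogis(eta)
p
```

Each indicator vector simply picks out the coefficient of the chosen answer, so the answer to the third question contributes nothing to `eta` whichever option is selected.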

The baseline coefficient $β_{0}$ is set to $- \frac{40}{3} γ$ so that, on average, a farm has a 50% chance of having an outbreak. The parameter $γ$ controls the strength of the questions’ effect on the outcome. In this simulation study, we consider $γ = 0.1, 0.25, 0.5, 1, 2$. The parameter settings are therefore:

$\symbf β^{T} = \left( \underset{\text{intercept}}{\underbrace{- \frac{40}{3}}}, \underset{\text{question 1}}{\underbrace{1, 0, - 1}}, \dots, \underset{\text{question 41}}{\underbrace{1, 0, 0}}, \dots, \underset{\text{question 81}}{\underbrace{0, 0, 0}}, \dots, \underset{\text{question 120}}{\underbrace{0, 0, 0}} \right) \times γ$
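The choice of intercept can be checked numerically. If the three answers to each question are equally likely (an assumption made here; the text does not state the answer distribution), a question of the first type contributes $0$ to the linear predictor on average and a question of the second type contributes $γ / 3$, so the forty questions of the second type add $\frac{40}{3} γ$ in total, which the intercept cancels. The sketch below simulates one data set under that assumption. It is not the Appendix code, and all variable names are illustrative.

```r
# A sketch under stated assumptions, not the author's simulation code.
set.seed(123)
n_farm <- 800
n_q <- 120
gamma <- 2

# 120 x 3 coefficient matrix; columns assumed to correspond to answers A, B, C
beta <- rbind(
  matrix(c(1, 0, -1), nrow = 40, ncol = 3, byrow = TRUE),
  matrix(c(1, 0, 0), nrow = 40, ncol = 3, byrow = TRUE),
  matrix(0, nrow = 40, ncol = 3)
) * gamma
beta0 <- -40 / 3 * gamma

# The intercept offsets the average contribution of the questions
all.equal(beta0, -sum(rowMeans(beta)))

# Answers coded 1, 2, 3 and drawn uniformly and independently (assumption)
answers <- matrix(sample(1:3, n_farm * n_q, replace = TRUE), nrow = n_farm)

# Linear predictor: pick the coefficient of the chosen answer for each question
eta <- beta0 + sapply(seq_len(n_farm), function(i) {
  sum(beta[cbind(seq_len(n_q), answers[i, ])])
})

# Outbreak probability and simulated outbreak status
p <- plogis(eta)
y <- rbinom(n_farm, size = 1, prob = p)
mean(y)  # proportion of farms with an outbreak
```

Under these assumptions the proportion of outbreaks should be close to one half, because the linear predictor is centered at zero.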

For each value of $γ$, 20 data sets are simulated. The larger $γ$ is, the larger the magnitude of the corresponding coefficients and the stronger the questions’ effect on the outcome. The data provided here were simulated with $γ = 2$. Let’s check the data:

disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
# only show the last 7 columns here
head(subset(disease_dat,select=c("Q118.A","Q118.B","Q119.A",
                                 "Q119.B","Q120.A","Q120.B","y")))

##   Q118.A Q118.B Q119.A Q119.B Q120.A Q120.B y
## 1      1      0      0      0      0      1 1
## 2      0      1      0      1      0      0 1
## 3      1      0      0      0      1      0 1
## 4      1      0      0      0      0      1 1
## 5      1      0      0      0      1      0 0
## 6      1      0      0      1      1      0 1

Here y indicates the outbreak status of each farm: y = 1 means there is an outbreak within 5 years after the survey. The remaining columns indicate survey responses. For example, Q120.A = 1 means the respondent chose A in Q120. We consider C as the baseline, so a respondent who chose C has both Q120.A = 0 and Q120.B = 0.
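Because C is the baseline, its indicator is implied by the other two columns. As a small sketch, the following recovers the C indicator and the answer label for Q120 from the six rows printed above. The data frame `q120` and the new columns `Q120.C` and `Q120` are illustrative names, not part of the provided data.

```r
# The Q120 columns of the six rows shown in the output above
q120 <- data.frame(
  Q120.A = c(0, 0, 1, 0, 1, 1),
  Q120.B = c(1, 0, 0, 1, 0, 0)
)

# C is the baseline: it was chosen exactly when neither A nor B was
q120$Q120.C <- 1 - q120$Q120.A - q120$Q120.B

# Collapse the indicators back into a single answer label
q120$Q120 <- ifelse(q120$Q120.A == 1, "A",
                    ifelse(q120$Q120.B == 1, "B", "C"))
q120
```

Only the second farm in this excerpt chose C for Q120, which is why both of its Q120 indicator columns are zero.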

Refer to the Appendix for the simulation code.