The mcglm
package is designed for fitting multivariance covariance generalized linear models (McGLMs). McGLMs is a further extension of the generalized linear models (GLM) designed with the following goals:
The mcglm
package allows a flexible specification of the mean and covariance structures, and explicitly deals with multivariate response variables, through a user friendly formula interface similar to the ordinary glm
function. For details about the package, see Bonat, 2018.
Consider the soya
data set available in the mcglm
package. The soya
data set gather dat from an experiment carried out in a vegetation house with soybeans. The experiment has two plants by plot with three levels of the factor corresponding to amount of water in the soil (water)
and five levels of potassium fertilization (pot)
. The plots were arranged in five blocks (block)
. Three response variables are of the interest, namely, grain yield (continuous), number of seeds (count) and number of viable peas (binomial) per plant. The experiment’s goal is to measure the covariates effects on the response variable.
Let’s start by fitting a standard linear regression model for the continuous response, i.e. grain yield. In the code below, we load the mcglm
package and the soya
data set.
## pot water block grain seeds viablepeas totalpeas
## 1 0 37.5 I 14.55 136 22 56
## 2 30 37.5 I 21.51 159 2 62
## 3 60 37.5 I 24.62 156 0 66
## 4 120 37.5 I 21.88 171 2 68
## 5 180 37.5 I 28.11 190 0 82
## 6 0 50 I 17.16 140 20 63
The first step for fitting a McGLM is to specify the linear predictor for each response variable. For instance, we have one response variable and the three covariates. Thus,
form.grain <- grain ~ block + water * pot
The second step is to specify the matrix linear predictor. For this example, we assume the observations are independent, consequently, the matrix linear predictor is composed of an identity matrix and can be easily obtained using the mc_id()
function.
Z0 <- mc_id(soya)
We specifed the linear and matrix linear predictors, thus the next step is to fitting the model to data.
## Automatic initial values selected.
Note that, we are using the default choice for the link, variance and covariance link functions, which correspond to identity
, constant
and identity
. Consequently, such specification corresponds to the standard linear regression model. The output of the mcglm
function is an object of mcglm
class for which a set of methods is available.
methods(class = 'mcglm')
## [1] anova coef confint fitted plot print
## [7] residuals summary vcov
## see '?methods' for accessing help and source code
Let’s summarize the fitting results.
summary(fit_grain)
## Call: grain ~ block + water * pot
##
## Link function: identity
## Variance function: constant
## Covariance function: identity
## Regression:
## Estimates Std.error Z value Pr(>|z|)
## (Intercept) 14.2661333 1.409429 10.12192413 4.416467e-24
## blockII 1.0660000 1.022507 1.04253538 2.971635e-01
## blockIII -0.8606667 1.022507 -0.84172181 3.999437e-01
## blockIV -1.4433333 1.022507 -1.41156292 1.580787e-01
## blockV -2.4926667 1.022507 -2.43779850 1.477701e-02
## water50 2.1920000 1.771035 1.23769467 2.158293e-01
## water62.5 2.5700000 1.771035 1.45112925 1.467439e-01
## pot30 6.8140000 1.771035 3.84746875 1.193445e-04
## pot60 10.4060000 1.771035 5.87566185 4.211573e-09
## pot120 11.7880000 1.771035 6.65599672 2.813865e-11
## pot180 11.8700000 1.771035 6.70229734 2.051680e-11
## water50:pot30 0.0440000 2.504621 0.01756753 9.859839e-01
## water62.5:pot30 -1.9160000 2.504621 -0.76498599 4.442799e-01
## water50:pot60 2.5740000 2.504621 1.02770038 3.040908e-01
## water62.5:pot60 3.3320000 2.504621 1.33034097 1.834060e-01
## water50:pot120 2.2860000 2.504621 0.91271292 3.613935e-01
## water62.5:pot120 5.5600000 2.504621 2.21989670 2.642578e-02
## water50:pot180 1.1780000 2.504621 0.47033063 6.381188e-01
## water62.5:pot180 9.1860000 2.504621 3.66762070 2.448180e-04
##
## Dispersion:
## Estimates Std.error Z value Pr(>|z|)
## 1 7.841408 1.541329 5.087433 3.629417e-07
##
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 2
The first block of results presents the regression coefficients, while the second presents the dispersion parameter estimate. In this case only the variance of the Gaussian distribution. The last block presents the fitting algorithm, if the bias correction term was used or not and the number of iterations required to reach convergence.
The model can easily be specified to deal with count data by choosing a suitable link and variance function. For example, a quasi-Poisson regression model is obtained by specifying the link = "log"
and variance = "tweedie"
.
## Linear predictor form.seed <- seeds ~ block + water * pot ## Model fitting fit_seed <- mcglm(linear_pred = c(form.seed), matrix_pred = list(Z0), link = "log", variance = "tweedie", data = soya)
## Automatic initial values selected.
Similarly, we can easily fit a quasi-binomial regression model for the response number of viable peas In this case, we have to create a new response variable with the proportions of viable peas and the total number of peas have to be included through the Ntrial
argument. Finally, the standard logistic regression model is obtained by specifying link = "logit"
and variance = "binomialP"
.
## Proportion of viable peas soya$viablepeasP <- soya$viablepeas / soya$totalpeas ## Linear predictor form.peas <- viablepeasP ~ block + water * pot ## Fitting fit_peas <- mcglm(linear_pred = c(form.peas), matrix_pred = list(Z0), link = "logit", variance = "binomialP", Ntrial = list(soya$totalpeas), data = soya)
## Automatic initial values selected.
The joint regression model for the three response variables can easily be fitted as follows
fit_joint <- mcglm(linear_pred = c(form.grain, form.seed, form.peas), matrix_pred = list(Z0, Z0, Z0), link = c("identity","log", "logit"), variance = c("constant", "tweedie", "binomialP"), Ntrial = list(NULL, NULL, soya$totalpeas), data = soya)
## Automatic initial values selected.
Beside the output for each response variable as shown in the fit_grain
model. The joint fit provides the correlation between response variables.
summary(fit_joint, verbose = TRUE, print = "Correlation")
## Correlation matrix:
## Parameters Estimates Std.error Z value Pr(>|z|)
## 1 rho12 0.63666557 0.07035952 9.0487480 1.446169e-19
## 2 rho13 0.06831739 0.11493205 0.5944155 5.522343e-01
## 3 rho23 0.08609243 0.11442675 0.7523803 4.518224e-01
##
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 9
For continuous response variables the mcglm
package provides the Tweedie family of distribution that can deal with non-negative highly right-skewed data and continuous data with probability mass at zero. It is specified using link = "log"
, variance = "tweedie"
and power_fixed = FALSE
. Note that, for power_fixed = TRUE
it corresponds to the quasi-Poisson regression suitable for count data. For details, see Bonat and Kokonendji, 2017.
For count response variables the mcglm
package fits the extended Poisson-Tweedie Bonat, et. al, 2018 class. It is specified using link = "log"
and variance = "poisson_tweedie"
. The power parameter can be fixed or fitted based on data. This class deals with under, equi, overdispersed, zero-inflated and heavy-tailed count data.
Similarly, for discrete/continuous bounded data, the mcglm
package fits the Flexible quasi-beta class Bonat, et. al 2018(2). The models are specified using link = "logit"
and variance = "binomialP"
. However, the user can specified other link function, see ?mc_link_function
for other options. For the variance function an even more flexible variance function is available as "binomialPQ"
. For more article on the McGLMs class, see my personal webpage.