Introduction to mcglm

The mcglm package is designed for fitting multivariance covariance generalized linear models (McGLMs). McGLMs is a further extension of the generalized linear models (GLM) designed with the following goals:

  • Goal 1: Make GLM more flexible to deal with
    • Non-negative highly right-skewed data and continuous data with probability mass at zero;
    • Under, equi, overdispersed, zero-inflated and heavy-tailed count data;
    • Discrete and continuous bounded data.
  • Goal 2: Extend GLM to cGLM to deal with:
    • Repeated measures and longitudinal data;
    • Time series;
    • Spatial and space-time data;
    • Genetic and Twin data;
    • Other types of dependent data.
  • Goal 3: Extend cGLM to McGLM to deal with:
    • Multiple responses;
    • Response variables of mixed types.

The mcglm package allows a flexible specification of the mean and covariance structures, and explicitly deals with multivariate response variables, through a user friendly formula interface similar to the ordinary glm function. For details about the package, see Bonat, 2018.

Worked example

Consider the soya data set available in the mcglmpackage. The soyadata set gather dat from an experiment carried out in a vegetation house with soybeans. The experiment has two plants by plot with three levels of the factor corresponding to amount of water in the soil (water) and five levels of potassium fertilization (pot). The plots were arranged in five blocks (block). Three response variables are of the interest, namely, grain yield (continuous), number of seeds (count) and number of viable peas (binomial) per plant. The experiment’s goal is to measure the covariates effects on the response variable.

Continuous response variable

Let’s start by fitting a standard linear regression model for the continuous response, i.e. grain yield. In the code below, we load the mcglm package and the soyadata set.

require(mcglm)
data(soya)
head(soya)
##   pot water block grain seeds viablepeas totalpeas
## 1   0  37.5     I 14.55   136         22        56
## 2  30  37.5     I 21.51   159          2        62
## 3  60  37.5     I 24.62   156          0        66
## 4 120  37.5     I 21.88   171          2        68
## 5 180  37.5     I 28.11   190          0        82
## 6   0    50     I 17.16   140         20        63

The first step for fitting a McGLM is to specify the linear predictor for each response variable. For instance, we have one response variable and the three covariates. Thus,

form.grain <- grain ~ block + water * pot

The second step is to specify the matrix linear predictor. For this example, we assume the observations are independent, consequently, the matrix linear predictor is composed of an identity matrix and can be easily obtained using the mc_id() function.

Z0 <- mc_id(soya)

We specifed the linear and matrix linear predictors, thus the next step is to fitting the model to data.

fit_grain <- mcglm(linear_pred = c(form.grain), matrix_pred = list(Z0), data = soya)
## Automatic initial values selected.

Note that, we are using the default choice for the link, variance and covariance link functions, which correspond to identity, constantand identity. Consequently, such specification corresponds to the standard linear regression model. The output of the mcglm function is an object of mcglmclass for which a set of methods is available.

methods(class = 'mcglm')
## [1] anova     coef      confint   fitted    plot      print    
## [7] residuals summary   vcov     
## see '?methods' for accessing help and source code

Let’s summarize the fitting results.

summary(fit_grain)
## Call: grain ~ block + water * pot
## 
## Link function: identity
## Variance function: constant
## Covariance function: identity
## Regression:
##                   Estimates Std.error     Z value     Pr(>|z|)
## (Intercept)      14.2661333  1.409429 10.12192413 4.416467e-24
## blockII           1.0660000  1.022507  1.04253538 2.971635e-01
## blockIII         -0.8606667  1.022507 -0.84172181 3.999437e-01
## blockIV          -1.4433333  1.022507 -1.41156292 1.580787e-01
## blockV           -2.4926667  1.022507 -2.43779850 1.477701e-02
## water50           2.1920000  1.771035  1.23769467 2.158293e-01
## water62.5         2.5700000  1.771035  1.45112925 1.467439e-01
## pot30             6.8140000  1.771035  3.84746875 1.193445e-04
## pot60            10.4060000  1.771035  5.87566185 4.211573e-09
## pot120           11.7880000  1.771035  6.65599672 2.813865e-11
## pot180           11.8700000  1.771035  6.70229734 2.051680e-11
## water50:pot30     0.0440000  2.504621  0.01756753 9.859839e-01
## water62.5:pot30  -1.9160000  2.504621 -0.76498599 4.442799e-01
## water50:pot60     2.5740000  2.504621  1.02770038 3.040908e-01
## water62.5:pot60   3.3320000  2.504621  1.33034097 1.834060e-01
## water50:pot120    2.2860000  2.504621  0.91271292 3.613935e-01
## water62.5:pot120  5.5600000  2.504621  2.21989670 2.642578e-02
## water50:pot180    1.1780000  2.504621  0.47033063 6.381188e-01
## water62.5:pot180  9.1860000  2.504621  3.66762070 2.448180e-04
## 
## Dispersion:
##   Estimates Std.error  Z value     Pr(>|z|)
## 1  7.841408  1.541329 5.087433 3.629417e-07
## 
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 2

The first block of results presents the regression coefficients, while the second presents the dispersion parameter estimate. In this case only the variance of the Gaussian distribution. The last block presents the fitting algorithm, if the bias correction term was used or not and the number of iterations required to reach convergence.

Counting response variable

The model can easily be specified to deal with count data by choosing a suitable link and variance function. For example, a quasi-Poisson regression model is obtained by specifying the link = "log" and variance = "tweedie".

## Linear predictor
form.seed <- seeds ~ block + water * pot

## Model fitting
fit_seed <- mcglm(linear_pred = c(form.seed), matrix_pred = list(Z0),
                  link = "log", variance = "tweedie", data = soya)
## Automatic initial values selected.

Binomial response variable

Similarly, we can easily fit a quasi-binomial regression model for the response number of viable peas In this case, we have to create a new response variable with the proportions of viable peas and the total number of peas have to be included through the Ntrial argument. Finally, the standard logistic regression model is obtained by specifying link = "logit" and variance = "binomialP".

## Proportion of viable peas
soya$viablepeasP <- soya$viablepeas / soya$totalpeas

## Linear predictor
form.peas <- viablepeasP ~ block + water * pot

## Fitting
fit_peas <- mcglm(linear_pred = c(form.peas), matrix_pred = list(Z0),
                  link = "logit", variance = "binomialP",
                  Ntrial = list(soya$totalpeas),
                  data = soya)
## Automatic initial values selected.

Multiple response variables

The joint regression model for the three response variables can easily be fitted as follows

fit_joint <- mcglm(linear_pred = c(form.grain, form.seed, form.peas),
                   matrix_pred = list(Z0, Z0, Z0),
                   link = c("identity","log", "logit"),
                   variance = c("constant", "tweedie", "binomialP"),
                   Ntrial = list(NULL, NULL, soya$totalpeas),
                   data = soya)
## Automatic initial values selected.

Beside the output for each response variable as shown in the fit_grain model. The joint fit provides the correlation between response variables.

summary(fit_joint, verbose = TRUE, print = "Correlation")
## Correlation matrix:
##   Parameters  Estimates  Std.error   Z value     Pr(>|z|)
## 1      rho12 0.63666557 0.07035952 9.0487480 1.446169e-19
## 2      rho13 0.06831739 0.11493205 0.5944155 5.522343e-01
## 3      rho23 0.08609243 0.11442675 0.7523803 4.518224e-01
## 
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 9

Further options

For continuous response variables the mcglm package provides the Tweedie family of distribution that can deal with non-negative highly right-skewed data and continuous data with probability mass at zero. It is specified using link = "log", variance = "tweedie" and power_fixed = FALSE. Note that, for power_fixed = TRUE it corresponds to the quasi-Poisson regression suitable for count data. For details, see Bonat and Kokonendji, 2017.

For count response variables the mcglm package fits the extended Poisson-Tweedie Bonat, et. al, 2018 class. It is specified using link = "log" and variance = "poisson_tweedie". The power parameter can be fixed or fitted based on data. This class deals with under, equi, overdispersed, zero-inflated and heavy-tailed count data.

Similarly, for discrete/continuous bounded data, the mcglmpackage fits the Flexible quasi-beta class Bonat, et. al 2018(2). The models are specified using link = "logit" and variance = "binomialP". However, the user can specified other link function, see ?mc_link_function for other options. For the variance function an even more flexible variance function is available as "binomialPQ". For more article on the McGLMs class, see my personal webpage.