ALM stands for “Advanced Linear Model”. It’s not so much advanced as it sounds, but it has some advantages over the basic LM, retaining some basic features. In some sense `alm()`

resembles the `glm()`

function from stats package, but with a higher focus on forecasting rather than on hypothesis testing. You will not get p-values anywhere from the `alm()`

function and won’t see \(R^2\) in the outputs. The maximum what you can count on is having confidence intervals for the parameters or for the regression line. The other important difference from `glm()`

is the availability of distributions that are not supported by `glm()`

(for example, Folded Normal or Chi Squared distributions).

The core of the function is the likelihood approach. The estimation of parameters in the model is done via the maximisation of likelihood function of a selected distribution. The calculation of the standard errors is done based on the calculation of hessian of the distribution. And in the centre of all of that are information criteria that can be used for the models comparison.

All the supported distributions have specific functions which form the following four groups for the `distribution`

parameter in `alm()`

:

- Density functions of continuous distributions,
- Density functions for continuous positive data,
- Continuous distributions on a specific interval,
- Density functions of discrete distributions,
- Cumulative functions for binary variables.

All of them rely on respective d- and p- functions in R. For example, Log Normal distribution uses `dlnorm()`

function from `stats`

package.

The `alm()`

function also supports `occurrence`

parameter, which allows modelling non-zero values and the occurrence of non-zeroes as two different models. The combination of any distribution from (1) - (3) for the non-zero values and a distribution from (4) for the occurrence will result in a mixture distribution model, e.g. a mixture of Log-Normal and Cumulative Logistic or a Hurdle Poisson (with Cumulative Normal for the occurrence part).

Every model produced using `alm()`

can be represented as: \[\begin{equation} \label{eq:basicALM}
y_t = f(\mu_t, \epsilon_t) = f(x_t' B, \epsilon_t) ,
\end{equation}\] where \(y_t\) is the value of the response variable, \(x_t\) is the vector of exogenous variables, \(B\) is the vector of the parameters, \(\mu_t\) is the conditional mean (produced based on the exogenous variables and the parameters of the model), \(\epsilon_t\) is the error term on the observation \(t\) and \(f(\cdot)\) is the distribution function that does a transformation of the inputs into the output. In case of a mixture distribution the model becomes slightly more complicated: \[\begin{equation} \label{eq:basicALMMixture}
\begin{matrix}
y_t = o_t f(x_t' B, \epsilon_t) \\
o_t \sim \text{Bernoulli}(p_t) \\
p_t = g(z_t' A, \eta_t)
\end{matrix},
\end{equation}\] where \(o_t\) is the binary variable, \(p_t\) is the probability of occurrence, \(z_t\) is the vector of exogenous variables, \(A\) is the vector of parameters and \(\eta\) is the error term for the \(p_t\).

The `alm()`

function returns, along with the set of common for `lm()`

variables (such as `coefficient`

and `fitted.values`

), the variable `mu`

, which corresponds to the conditional mean used inside the distribution, and `scale`

– the second parameter, which usually corresponds to standard error or dispersion parameter. The values of these two variables vary from distribution to distribution. Note, however, that the `model`

variable returned by `lm()`

function was renamed into `data`

in `alm()`

, and that `alm()`

does not return `terms`

and QR decomposition.

Given that the parameters of any model in `alm()`

are estimated via likelihood, it can be assumed that they have asymptotically normal distribution, thus the confidence intervals for any model rely on the normality and are constructed based on the unbiased estimate of variance, extracted using `sigma()`

function.

The covariance matrix of parameters almost in all the cases is calculated as an inverse of the hessian of respective distribution function. The exclusions are Normal, Log-Normal, Cumulative Logistic and Cumulative Normal distributions, that use analytical solutions.

`alm()`

function also supports factors in the explanatory variables, creating the set of dummies from them. In case of ordered variables (ordinal scale, `is.ordered()`

), the ordering is removed and the set of dummies is produced. This is done in order to avoid the built in behaviour of R, which creates linear, squared, cube etc levels for ordered variables, which makes the interpretation of the parameters difficult.

Although the basic principles of estimation of models and predictions from them are the same for all the distributions, each of the distribution has its own features. So it makes sense to discuss them individually. We discuss the distributions in the four groups mentioned above.

This group of functions includes:

- Normal distribution,
- Laplace distribution,
- Asymmetric Laplace distribution,
- Logistic distribution,
- S distribution,
- Student t distribution,

For all the functions in this category `resid()`

method returns \(e_t = y_t - \mu_t\).

The density of normal distribution is: \[\begin{equation} \label{eq:Normal} f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) , \end{equation}\] where \(\sigma^2\) is the variance of the error term.

`alm()`

with Normal distribution (`distribution="dnorm"`

) is equivalent to `lm()`

function from `stats`

package and returns roughly the same estimates of parameters, so if you are concerned with the time of calculation, I would recommend reverting to `lm()`

.

Maximising the likelihood of the model is equivalent to the estimation of the basic linear regression using Least Squares method: \[\begin{equation} \label{eq:linearModel} y_t = \mu_t + \epsilon_t = x_t' B + \epsilon_t, \end{equation}\] where \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\).

The variance \(\sigma^2\) is estimated in `alm()`

based on likelihood: \[\begin{equation} \label{eq:sigmaNormal}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)^2 ,
\end{equation}\] where \(T\) is the sample size. Its square root (standard deviation) is used in the calculations of `dnorm()`

function, and the value is then return via `scale`

variable. This value does not have bias correction. However the `sigma()`

method applied to the resulting model, returns the bias corrected version of standard deviation. And `vcov()`

, `confint()`

, `summary()`

and `predict()`

rely on the value extracted by `sigma()`

.

\(\mu_t\) is returned as is in `mu`

variable, and the fitted values are set equivalent to `mu`

.

In order to produce confidence intervals for the mean (`predict(model, newdata, interval="c")`

) the conditional variance of the model is calculated using: \[\begin{equation} \label{eq:varianceNormalForCI}
V({\mu_t}) = x_t V(B) x_t',
\end{equation}\] where \(V(B)\) is the covariance matrix of the parameters returned by the function `vcov`

. This variance is then used for the construction of the confidence intervals of a necessary level \(\alpha\) using the distribution of Student: \[\begin{equation} \label{eq:intervalsNormal}
y_t \in \left(\mu_t \pm \tau_{df,\frac{1+\alpha}{2}} \sqrt{V(\mu_t)} \right),
\end{equation}\] where \(\tau_{df,\frac{1+\alpha}{2}}\) is the upper \({\frac{1+\alpha}{2}}\)-th quantile of the Student’s distribution with \(df\) degrees of freedom (e.g. with \(\alpha=0.95\) it will be 0.975-th quantile, which, for example, for 100 degrees of freedom will be \(\approx 1.984\)).

Similarly for the prediction intervals (`predict(model, newdata, interval="p")`

) the conditional variance of the \(y_t\) is calculated: \[\begin{equation} \label{eq:varianceNormalForPI}
V(y_t) = V(\mu_t) + s^2 ,
\end{equation}\] where \(s^2\) is the bias-corrected variance of the error term, calculated using: \[\begin{equation} \label{eq:varianceNormalUnbiased}
s^2 = \frac{1}{T-k} \sum_{t=1}^T \left(y_t - \mu_t \right)^2 ,
\end{equation}\] where \(k\) is the number of estimated parameters (including the variance itself). This value is then used for the construction of the prediction intervals of a specify level, also using the distribution of Student, in a similar manner as with the confidence intervals.

Laplace distribution has some similarities with the Normal one: \[\begin{equation} \label{eq:Laplace} f(y_t) = \frac{1}{2 s} \exp \left( -\frac{\left| y_t - \mu_t \right|}{s} \right) , \end{equation}\] where \(s\) is the scale parameter, which, when estimated using likelihood, is equal to the mean absolute error: \[\begin{equation} \label{eq:bLaplace} s = \frac{1}{T} \sum_{t=1}^T \left| y_t - \mu_t \right| . \end{equation}\] So maximising the likelihood is equivalent to estimating the linear regression via the minimisation of \(s\) . So when estimating a model via minimising \(s\), the assumption imposed on the error term is \(\epsilon_t \sim \text{Laplace}(0, s)\). The main difference of Laplace from Normal distribution is its fatter tails.

`alm()`

function with `distribution="dlaplace"`

returns `mu`

equal to \(\mu_t\) and the fitted values equal to `mu`

. \(s\) is returned in the `scale`

variable. The prediction intervals are derived from the quantiles of Laplace distribution after transforming the conditional variance into the conditional scale parameter \(s\) using the connection between the two in Laplace distribution: \[\begin{equation} \label{eq:bLaplaceAndSigma}
s = \sqrt{\frac{\sigma^2}{2}},
\end{equation}\] where \(\sigma^2\) is substituted either by the conditional variance of \(\mu_t\) or \(y_t\).

The kurtosis of Laplace distribution is 6, making it suitable for modelling rarely occurring events.

Asymmetric Laplace distribution can be considered as a two Laplace distributions with different parameters \(s\) for left and right side. There are several ways to summarise the probability density function, the one used in `alm()`

relies on the asymmetry parameter \(\alpha\) (Yu and Zhang 2005): \[\begin{equation} \label{eq:ALaplace}
f(y_t) = \frac{\alpha (1- \alpha)}{s} \exp \left( -\frac{y_t - \mu_t}{s} (\alpha - I(y_t \leq \mu_t)) \right) ,
\end{equation}\] where \(s\) is the scale parameter, \(\alpha\) is skewness parameter and \(I(y_t \leq \mu_t)\) is the indicator function, which is equal to one, when the condition is satisfied and to zero otherwise. The scale parameter \(s\) estimated using likelihood is equal to the quantile loss: \[\begin{equation} \label{eq:bALaplace}
s = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)(\alpha - I(y_t \leq \mu_t)) .
\end{equation}\] Thus maximising the likelihood is equivalent to estimating the linear regression via the minimisation of \(\alpha\) quantile, making this equivalent to quantile regression. So quantile regression models assume indirectly that the error term is \(\epsilon_t \sim \text{ALaplace}(0, s, \alpha)\) (Geraci and Bottai 2007). The advantage of using `alm()`

in this case is in having the full distribution, which allows to do all the fancy things you can do when you have likelihood.

In case of \(\alpha=0.5\) the function reverts to the symmetric Laplace where \(s=\frac{1}{2}\text{MAE}\).

`alm()`

function with `distribution="dalaplace"`

accepts an additional parameter `alpha`

in ellipsis, which defines the quantile \(\alpha\). If it is not provided, then the function will estimated it maximising the likelihood and return it as the first coefficient. `alm()`

returns `mu`

equal to \(\mu_t\) and the fitted values equal to `mu`

. \(s\) is returned in the `scale`

variable. The parameter \(\alpha\) is returned in the variable `other`

of the final model. The prediction intervals are produced using `qalaplace()`

function. In order to find the values of \(s\) for the holdout the following connection between the variance of the variable and the scale in Asymmetric Laplace distribution is used: \[\begin{equation} \label{eq:bALaplaceAndSigma}
s = \sqrt{\sigma^2 \frac{\alpha^2 (1-\alpha)^2}{(1-\alpha)^2 + \alpha^2}},
\end{equation}\] where \(\sigma^2\) is substituted either by the conditional variance of \(\mu_t\) or \(y_t\).

The density function of Logistic distribution is: \[\begin{equation} \label{eq:Logistic}
f(y_t) = \frac{\exp \left(- \frac{y_t - \mu_t}{s} \right)} {s \left( 1 + \exp \left(- \frac{y_t - \mu_t}{s} \right) \right)^{2}},
\end{equation}\] where \(s\) is the scale parameter, which is estimated in `alm()`

based on the connection between the parameter and the variance in the logistic distribution: \[\begin{equation} \label{eq:sLogisticAndSigma}
s = \sigma \sqrt{\frac{3}{\pi^2}}.
\end{equation}\] Once again the maximisation of implies the estimation of the linear model , where \(\epsilon_t \sim \text{Logistic}(0, s)\).

Logistic is considered a fat tailed distribution, but its tails are not as fat as in Laplace. Kurtosis of standard Logistic is 4.2.

`alm()`

function with `distribution="dlogis"`

returns \(\mu_t\) in `mu`

and in `fitted.values`

variables, and \(s\) in the `scale`

variable. Similar to Laplace distribution, the prediction intervals use the connection between the variance and scale, and rely on the `qlogis`

function.

The S distribution has the following density function: \[\begin{equation} \label{eq:S} f(y_t) = \frac{1}{4b^2} \exp \left( -\frac{\sqrt{|y_t - \mu_t|}}{s} \right) , \end{equation}\] where \(s\) is the scale parameter. If estimated via maximum likelihood, the scale parameter is equal to: \[\begin{equation} \label{eq:bS} s = \frac{1}{T} \sum_{t=1}^T \sqrt{\left| y_t - \mu_t \right|} , \end{equation}\] which corresponds to the minimisation of “Half Absolute Error” or “Half Absolute Moment”, which is equal to \(2b\).

S distribution has a kurtosis of 25.2, which makes it an “extreme excess” distribution. It might be useful in cases of randomly occurring incidents and extreme values (Black Swans?).

`alm()`

function with `distribution="ds"`

returns \(\mu_t\) in the same variables `mu`

and `fitted.values`

, and \(s\) in the `scale`

variable. Similarly to the previous functions, the prediction intervals are based on the `qs()`

function from `greybox`

package and use the connection between the scale and the variance: \[\begin{equation} \label{eq:bSAndSigma}
s = \left( \frac{\sigma^2}{120} \right) ^{\frac{1}{4}},
\end{equation}\] where once again \(\sigma^2\) is substituted either by the conditional variance of \(\mu_t\) or \(y_t\).

The Student t distribution has a difficult density function: \[\begin{equation} \label{eq:T} f(y_t) = \frac{\Gamma\left(\frac{d+1}{2}\right)}{\sqrt{d \pi} \Gamma\left(\frac{d}{2}\right)} \left( 1 + \frac{x^2}{d} \right)^{-\frac{d+1}{2}} , \end{equation}\] where \(d\) is the number of degrees of freedom, which can also be considered as the scale parameter of the distribution. It has the following connection with the in-sample variance of the error (but only for the case, when \(d>2\)): \[\begin{equation} \label{eq:scaleOfT} d = \frac{2}{1-\sigma^{-2}}. \end{equation}\] Given that the formula holds only for cases of \(d>2\) (and respectively for \(\sigma^2>1\)), the degrees of freedom in this case are restricted by 2 from below.

Kurtosis of Student t distribution depends on the value of \(d\), and for the cases of \(d>4\) is equal to \(\frac{6}{d-4}\).

`alm()`

function with `distribution="dt"`

returns \(\mu_t\) in the same variables `mu`

and `fitted.values`

, and \(d\) in the `scale`

variable. Both prediction and confidence intervals use `qt()`

function from `stats`

package and rely on the estimated in-sample value of \(d\). The intervals are constructed similarly to how it is done in Normal distribution (based on `qt()`

function).

In order to see how this works, we will create the following data:

```
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5))
xreg <- cbind(500+0.5*xreg[,1]-0.75*xreg[,2]+rs(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
inSample <- xreg[1:80,]
outSample <- xreg[-c(1:80),]
```

ALM can be run either with data frame or with matrix. Here’s an example with normal distribution:

```
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnorm")
summary(ourModel)
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 481.96103 148.74997 185.69979 778.22228
#> x1 -3.53686 4.72991 -12.95730 5.88358
#> x2 0.51401 2.67771 -4.81911 5.84712
#> ICs:
#> AIC AICc BIC BICc
#> 1003.409 1003.942 1012.937 1014.105
#>
#> Error standard deviation: 121.7983
#> Sample size: 80
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 76
plot(predict(ourModel,outSample,interval="p"))
```

And here’s an example with Asymmetric Laplace and predefined \(\alpha=0.95\):

```
ourModel <- alm(y~x1+x2, data=inSample, distribution="dalaplace",alpha=0.95)
summary(ourModel)
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inacurate.
#> Response variable: y
#> Distribution used in the estimation: Asymmetric Laplace with alpha=0.95
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 722.94155 11610.27621 -22400.92757 23846.81067
#> x1 -30.58407 0.00000 -30.58407 -30.58407
#> x2 5.96328 1.58971 2.79710 9.12947
#> ICs:
#> AIC AICc BIC BICc
#> 1108.102 1108.635 1117.630 1118.798
#>
#> Error standard deviation: 297.4635
#> Sample size: 80
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 76
plot(predict(ourModel,outSample))
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inacurate.
```

This group includes:

- Log Normal distribution,
- Folded Normal distribution,
- Noncentral Chi Squared distribution.

Although (2) and (3) in theory allow having zeroes in data, given that the density function is equal to zero in any specific point, it will be zero in these cases as well. So the `alm()`

will return some solutions for these distributions, but don’t expect anything good. As for (1), it supports strictly positive data.

Log Normal distribution appears when a normally distributed variable is exponentiated. This means that if \(x \sim \mathcal{N}(\mu, \sigma^2)\), then \(\exp x \sim \text{log}\mathcal{N}(\mu, \sigma^2)\). The density function of Log Normal distribution is: \[\begin{equation} \label{eq:LogNormal} f(y_t) = \frac{1}{y_t \sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(\log y_t - \mu_t \right)^2}{2 \sigma^2} \right) , \end{equation}\] where the variance estimated using likelihood is: \[\begin{equation} \label{eq:sigmaLogNormal} \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\log y_t - \mu_t \right)^2 . \end{equation}\] Estimating the model with Log Normal distribution is equivalent to estimating the parameters of log-linear model: \[\begin{equation} \label{eq:logLinearModel} \log y_t = \mu_t + \epsilon_t, \end{equation}\] where \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) or: \[\begin{equation} \label{eq:logLinearModelExp} y_t = \exp(\mu_t + \epsilon_t). \end{equation}\]

`alm()`

with `distribution="dlnorm"`

does not transform the provided data and estimates the density directly using `dlnorm()`

function with the estimated mean \(\mu_t\) and the variance . If you need a log-log model, then you would need to take logarithms of the external variables. The \(\mu_t\) is returned in the variable `mu`

, the \(\sigma^2\) is in the variable `scale`

, while the `fitted.values`

contains the exponent of \(\mu_t\), which, given the connection between the Normal and Log Normal distributions, corresponds to median of distribution rather than mean. Finally, `resid()`

method returns \(e_t = \log y_t - \mu_t\).

Folded Normal distribution is obtained when the absolute value of normally distributed variable is taken: if \(x \sim \mathcal{N}(\mu, \sigma^2)\), then \(|x| \sim \text{folded }\mathcal{N}(\mu, \sigma^2)\). The density function is: \[\begin{equation} \label{eq:foldedNormal}
f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \left( \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) + \exp \left( -\frac{\left(y_t + \mu_t \right)^2}{2 \sigma^2} \right) \right),
\end{equation}\] Conditional mean and variance of Folded Normal are estimated in `alm()`

(with `distribution="dfnorm"`

) similarly to how this is done for Normal distribution. They are returned in the variables `mu`

and `scale`

respectively. In order to produce the fitted value (which is returned in `fitted.values`

), the following correction is done: \[\begin{equation} \label{eq:foldedNormalFitted}
\hat{y_t} = \sqrt{\frac{2}{\pi}} \sigma \exp \left( -\frac{\mu_t^2}{2 \sigma^2} \right) + \mu_t \left(1 - 2 \Phi \left(-\frac{\mu_t}{\sigma} \right) \right),
\end{equation}\] where \(\Phi(\cdot)\) is the CDF of Normal distribution.

The model that is assumed in the case of Folded Normal distribution can be summarised as: \[\begin{equation} \label{eq:foldedNormalModel} y_t = \left| \mu_t + \epsilon_t \right|. \end{equation}\]

The conditional variance of the forecasts is calculated based on the elements of `vcov()`

(as in all the other functions), the predicted values are corrected in the same way as the fitted values , and the prediction intervals are generated from the `qfnorm()`

function of `greybox`

package. As for the residuals, `resid()`

method returns \(e_t = y_t - \mu_t\).

Noncentral Chi Squared distribution arises, when a normally distributed variable with a unity variance is squared and summed up: if \(x_i \sim \mathcal{N}(\mu_i, 1)\), then \(\sum_{i=1}^k x_i^2 \sim \chi^2(k, \lambda)\), where \(k\) is the number of degrees of freedom and \(\lambda = \sum_{i=1}^k \mu_i^2\). In the case of non-unity variance, with \(z_i \sim \mathcal{N}(\mu_i, \sigma^2)\), the variable can also be represented as \(z_i = \sigma x_i\), and then it can be assumed that \(\sum_{i=1}^k z_i^2 \sim \chi^2(k \sigma, \lambda)\). In the perfect world, \(\lambda_t\) would correspond to the location of the original distribution of \(z_i\), while \(k\) would need to be time varying and would need to include both number of elements \(k\) and the individual variances \(\sigma^2_i\) for each of the element, depending on the external variables values. However, given that the squares of the normal data are used, it is not possible to disaggregate the values into the original two parts. Thus we assume that the variance is constant for all the cases, and estimate it using likelihood. As a result the non-centrality parameter covers two parts that would be split in the ideal world.

The density function of Noncentral Chi Squared distribution is quite difficult. `alm()`

uses `dchisq()`

function from `stats`

package, assuming constant number of degrees of freedom \(k\) and time varying noncentrality parameter \(\lambda_t\): \[\begin{equation} \label{eq:NCChiSquared}
f(y_t) = \frac{1}{2} \exp \left( -\frac{y_t + \lambda_t}{2} \right) \left(\frac{y_t}{\lambda_t} \right)^{\frac{k}{4}-0.5} I_{\frac{k}{2}-1}(\sqrt{\lambda_t y_t}),
\end{equation}\] where \(I_k(x)\) is the Bessel function of the first kind. The \(\lambda_t\) parameter is estimated from a regression with exogenous variables: \[\begin{equation} \label{eq:lambdaValue}
\lambda_t = ( x_t' B )^2 ,
\end{equation}\] where \(\exp\) is taken in order to make \(\lambda_t\) strictly positive, while \(k\) is estimated directly by maximising the likelihood. In order to avoid the negative values of \(k\), it’s absolute value is used.

The model that is assumed in the case of Noncentral Chi Squared distribution is: \[\begin{equation} \label{eq:chiSquaredModel}
y_t = \left( \mu_t + \epsilon_t \right)^2.
\end{equation}\] Given that square function is not monotonic, there are always two sets of parameters that give exactly the same \(\lambda_t\) with positive and negative \(\mu_t\). The `alm()`

function returns the positive one (due to the restrictions imposed on the solver).

\(\lambda_t\) is returned in the variable `mu`

, while \(k\) is returned in `scale`

. Finally, `fitted.values`

returns \(\lambda_t + k\). Similar correction is done in `predict()`

function. As for the prediction intervals, they are generated using `qchisq()`

function from `stats`

package. Last but not least, `resid()`

method returns \(e_t = \sqrt{y_t} - \sqrt{\mu_t}\).

Square the response variable for the next example:

An example with exotic Chi-Squared distribution with predefined number of degrees of freedom:

```
ourModel <- alm(y~x1+x2, data=inSample, distribution="dchisq",df=1)
summary(ourModel)
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inacurate.
#> Response variable: y
#> Distribution used in the estimation: Chi-Squared with df=1
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 484.43174 0.03716 484.35775 484.50573
#> x1 -1.31175 0.00002 -1.31179 -1.31172
#> x2 -0.32591 0.00003 -0.32597 -0.32584
#> ICs:
#> AIC AICc BIC BICc
#> 1306081 1306081 1306088 1306089
#>
#> Error standard deviation: 123.5073
#> Sample size: 80
#> Number of estimated parameters: 3
#> Number of degrees of freedom: 77
plot(predict(ourModel,outSample))
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inacurate.
```

There is currently only one distribution in this group:

- Beta distribution.

Beta distribution is a distribution for a continuous variable that is defined on the interval of \((0, 1)\). Note that the bounds are not included here, because the probability density function is not well defined on them. If the provided data contains either zeroes or ones, the function will modify the values using: \[\begin{equation} \label{eq:BetaWarning} y^\prime_t = y_t (1 - 2 \cdot 10^{-10}), \end{equation}\] and it will warn the user about this modification. This correction makes sure that there are no boundary values in the data, and it is quite artificial and needed for estimation purposes only.

The density function of Beta distribution has the form: \[\begin{equation} \label{eq:Beta}
f(y_t) = \frac{y_t^{\alpha_t-1}(1-y_t)^{\beta_t-1}}{B(\alpha_t, \beta_t)} ,
\end{equation}\] where \(\alpha_t\) is the first shape parameter and \(\beta_t\) is the second one. Note indices for the both shape parameters. This is what makes the `alm()`

implementation of Beta distribution different from any other. We assume that both of them have underlying deterministic models, so that: \[\begin{equation} \label{eq:BetaAt}
\alpha_t = \exp(x_t' A) ,
\end{equation}\] and \[\begin{equation} \label{eq:BetaBt}
\beta_t = \exp(x_t' B),
\end{equation}\] where \(A\) and \(B\) are the vectors of parameters for the respective shape variables. This allows the function to model any shapes depending on the values of exogenous variables. The conditional expectation of the model is calculated using: \[\begin{equation} \label{eq:BetaExpectation}
\hat{y}_t = \frac{\alpha_t}{\alpha_t + \beta_t} ,
\end{equation}\] while the conditional variance is: \[\begin{equation} \label{eq:BetaVariance}
\text{V}({y}_t) = \frac{\alpha_t \beta_t}{((\alpha_t + \beta_t)^2 (\alpha_t + \beta_t + 1))} .
\end{equation}\]

`alm()`

function with `distribution="dbeta"`

returns \(\hat{y}_t\) in the variables `mu`

and `fitted.values`

, and \(\text{V}({y}_t)\) in the `scale`

variable. The shape parameters are returned in the respective variables `other$shape1`

and `other$shape2`

. You will notice that the output of the model contains twice more parameters than the number of variables in the model. This is because of the estimation of two models: \(\alpha_t\) and \(\beta_t\) - instead of one.

Respectively, when `predict()`

function is used for the `alm`

model with Beta distribution, the two models are used in order to produce predicted values for \(\alpha_t\) and \(\beta_t\). After that the conditional mean `mu`

and conditional variance `variances`

are produced using the formulae above. The prediction intervals are generated using `qbeta`

function with the provided shape parameters for the holdout. As for the confidence intervals, they are produced assuming normality for the parameters of the model and using the estimate of the variance of the mean based on the `variances`

(which is weird and probably wrong).

This group includes:

- Poisson distribution,
- Negative Binomial distribution,

These distributions should be used in cases of count data.

Poisson distribution used in ALM has the following standard probability mass function: \[\begin{equation} \label{eq:Poisson} P(X=y_t) = \frac{\lambda_t^{y_t} \exp(-\lambda_t)}{y_t!}, \end{equation}\] where \(\lambda_t = \mu_t = \sigma^2_t = \exp(x_t' B)\). As it can be noticed, here we assume that the variance of the model varies in time and depends on the values of the exogenous variables, which is a specific case of heteroscedasticity. The exponent of \(x_t' B\) is needed in order to avoid the negative values in \(\lambda_t\).

`alm()`

with `distribution="dpois"`

returns `mu`

, `fitted.values`

and `scale`

equal to \(\lambda_t\). The quantiles of distribution in `predict()`

method are generated using `qpois()`

function from `stats`

package. Finally, the returned residuals correspond to \(y_t - \mu_t\), which is not really helpful or meaningful…

Negative Binomial distribution implemented in `alm()`

is parameterised in terms of mean and variance: \[\begin{equation} \label{eq:NegBin}
P(X=y_t) = \binom{y_t+\frac{\mu_t^2}{\sigma^2-\mu_t}}{y_t} \left( \frac{\sigma^2 - \mu_t}{\sigma^2} \right)^{y_t} \left( \frac{\mu_t}{\sigma^2} \right)^\frac{\mu_t^2}{\sigma^2 - \mu_t},
\end{equation}\] where \(\mu_t = \exp(x_t' B)\) and \(\sigma^2\) is estimated separately in the optimisation process. These values are then used in the `dnbinom()`

function in order to calculate the log-likelihood based on the distribution function.

`alm()`

with `distribution="dnbinom"`

returns \(\mu_t\) in `mu`

and `fitted.values`

and \(\sigma^2\) in `scale`

. The prediction intervals are produces using `qnbinom()`

function. Similarly to Poisson distribution, `resid()`

method returns \(y_t - \mu_t\).

Round up the response variable for the next example:

Negative Binomial distribution:

```
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnbinom")
summary(ourModel)
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inacurate.
#> Response variable: y
#> Distribution used in the estimation: Negative Binomial with size=13.04
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 6.28198 0.02790 6.22641 6.33755
#> x1 -0.00851 0.00095 -0.01041 -0.00661
#> x2 -0.00075 0.00024 -0.00123 -0.00027
#> ICs:
#> AIC AICc BIC BICc
#> 1014.159 1014.692 1023.687 1024.856
#>
#> Error standard deviation: 121.8655
#> Sample size: 80
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 76
```

And an example with predefined size:

```
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnbinom", size=30)
summary(ourModel)
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inacurate.
#> Response variable: y
#> Distribution used in the estimation: Negative Binomial with size=30
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 6.22760 0.02525 6.17731 6.27788
#> x1 -0.00789 0.00134 -0.01056 -0.00521
#> x2 0.00019 0.00001 0.00017 0.00021
#> ICs:
#> AIC AICc BIC BICc
#> 1045.286 1045.601 1052.432 1053.123
#>
#> Error standard deviation: 121.8075
#> Sample size: 80
#> Number of estimated parameters: 3
#> Number of degrees of freedom: 77
```

The final class of models includes two cases:

- Logistic distribution (logit model),
- Normal distribution (probit model).

In both of them it is assumed that the response variable is binary and can be either zero or one. The main idea for this class of models is to use a transformation of the original data and link a continuous latent variable with a binary one. As a reminder, all the models eventually assume that: \[\begin{equation} \label{eq:basicALMCumulative}
\begin{matrix}
o_t \sim \text{Bernoulli}(p_t) \\
p_t = g(x_t' A, \eta_t)
\end{matrix},
\end{equation}\] where \(o_t\) is the binary response variable and \(g(\cdot)\) is the cumulative distribution function. Given that we work with the probability of occurrence, the `predict()`

method produces forecasts for the probability of occurrence rather than the binary variable itself. Finally, although many other cumulative distribution functions can be used for this transformation (e.g. `plaplace()`

or `plnorm()`

), the most popular ones are logistic and normal CDFs.

Given that the binary variable has Bernoulli distribution, its log-likelihood is: \[\begin{equation} \label{eq:BernoulliLikelihood} \ell(p_t | o_t) = \sum_{o_t=1} \log p_t + \sum_{o_t=0} \log(1 - p_t), \end{equation}\] So the estimation of parameters for all the CDFs can be done maximising this likelihood.

In all the functions it is assumed that there is an actual level \(q_t\) that underlies the probability \(p_t\). This level can be modelled as: \[\begin{equation} \label{eq:CDFLevelALM} q_t = \nu_t + \eta_t , \end{equation}\] and it can be transformed to the probability with \(p_t = g(q_t)\). So the aim of all the functions is to estimate the expectation \(\nu_t\) and transform it to the estimate of the probability \(\hat{p}_t\).

In order to estimate the error \(\eta_t\), we assume that \(o_t=1\) happens mainly when the respective estimated probability \(\hat{p}_t\) is very close to one as well. Based on that the error can be calculated as: \[\begin{equation} \label{eq:BinaryError} u_t' = o_t - \hat{p}_t . \end{equation}\] However this error is not useful and should be somehow transformed into the scale of the underlying unobserved variable \(q_t\). Given that both \(o_t \in (0, 1)\) and \(\hat{p}_t \in (0, 1)\), the error will lie in \((-1, 1)\). We therefore standardise it so that it lies in the region of \((0, 1)\): \[\begin{equation} \label{eq:BinaryErrorBounded} u_t = \frac{u_t' + 1}{2} = \frac{o_t - \hat{p}_t + 1}{2}. \end{equation}\]

This transformation means that, when \(o_t=\hat{p}_t\), then the error \(u_t=0.5\), when \(o_t=1\) and \(\hat{p}_t=0\) then \(u_t=1\) and finally, in the opposite case of \(o_t=0\) and \(\hat{p}_t=1\), it is \(u_t=0\). After that this error is transformed using either Logistic or Normal quantile generation function into the scale of \(q_t\), making sure that the case of \(u_t=0.5\) corresponds to zero, the \(u_t>0.5\) corresponds to the positive and \(u_t<0.5\) corresponds to the negative errors.

We have previously discussed the density function of logistic distribution. The standardised cumulative distribution function used in `alm()`

is: \[\begin{equation} \label{eq:LogisticCDFALM}
\hat{p}_t = \frac{1}{1+\exp(-\nu_t)},
\end{equation}\] where \(\nu_t = x_t' A\) is the conditional mean of the level, underlying the probability. This value is then used in the likelihood in order to estimate the parameters of the model. The error term of the model is calculated using the formula: \[\begin{equation} \label{eq:LogisticError}
e_t = \log \left( \frac{u_t}{1 - u_t} \right) = \log \left( \frac{1 + o_t (1 + \exp(\nu_t))}{1 + \exp(\nu_t) (2 - o_t) - o_t} \right).
\end{equation}\] This way the error varies from \(-\infty\) to \(\infty\) and is equal to zero, when \(u_t=0.5\). The error is assumed to be normally distributed (because… why not?).

The `alm()`

function with `distribution="plogis"`

returns \(\nu_t\) in `mu`

, standard deviation, calculated using the respective errors in `scale`

and the probability \(\hat{p}_t\) based on in `fitted.values`

. `resid()`

method returns the errors discussed above. `predict()`

method produces point forecasts and the intervals for the probability of occurrence. The intervals use the assumption of normality of the error term, generating respective quantiles (based on the estimated \(\nu_t\) and variance of the error) and then transforming them into the scale of probability using Logistic CDF.

The case of cumulative Normal distribution is quite similar to the cumulative Logistic one. The transformation is done using the standard normal CDF: \[\begin{equation} \label{eq:NormalCDFALM} \hat{p}_t = \Phi(\nu_t) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\nu_t} \exp \left(-\frac{1}{2}x^2 \right) dx , \end{equation}\] where \(\nu_t = x_t' A\). Similarly to the Logistic CDF, the estimated probability is used in the likelihood in order to estimate the parameters of the model. The error term is calculated using the standardised quantile function of Normal distribution: \[\begin{equation} \label{eq:NormalError} e_t = \Phi \left(\frac{o_t - \hat{p}_t + 1}{2}\right)^{-1} . \end{equation}\] It acts similar to the error from Logistic distribution, but is based on the different functions. Once again we assume that the error has Normal distribution.

Similar to the Logistic CDF, the `alm()`

function with `distribution="pnorm"`

returns \(\nu_t\) in `mu`

, standard deviation, calculated based on the errors in `scale`

and the probability \(\hat{p}_t\) based on in `fitted.values`

. `resid()`

method returns the errors discussed above. `predict()`

method produces point forecasts and the intervals for the probability of occurrence. The intervals use the assumption of normality of the error term and are based on the same idea as in Logistic CDF: quantiles of normal distribution (using the estimated mean and standard deviation) and then the transformation using the standard Normal CDF.

Finally, mixture distribution models can be used in `alm()`

by defining `distribution`

and `occurrence`

parameters. Currently only `plogis()`

and `pnorm()`

are supported for the occurrence variable, but all the other distributions discussed above can be used for the modelling of the non-zero values. If `occurrence="plogis"`

or `occurrence="pnorm"`

, then `alm()`

is fit two times: first on the non-zero data only (defining the subset) and second - using the same data, substituting the response variable by the binary occurrence variable and specifying `distribution=occurrence`

. As an alternative option, occurrence `alm()`

model can be estimated separately and then provided as a variable in `occurrence`

.

As an example of mixture model, let’s generate some data:

```
xreg[,1] <- round(exp(xreg[,1]-400) / (1 + exp(xreg[,1]-400)),0) * xreg[,1]
# Sometimes the generated data contains huge values
xreg[is.nan(xreg[,1]),1] <- 0;
inSample <- xreg[1:80,]
outSample <- xreg[-c(1:80),]
```

First, we estimate the occurrence model (it will complain that the response variable is not binary, but it will work):

And then use it for the mixture model:

The occurrence model will be return in the respective variable:

```
summary(modelMixture)
#> Response variable: y
#> Distribution used in the estimation: Mixture of Log Normal and Cumulative logistic
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 5.91194 0.50294 4.91004 6.91384
#> x1 -0.00849 0.00572 -0.01988 0.00291
#> x2 -0.00083 0.00326 -0.00733 0.00567
#> Noise 0.00138 0.00162 -0.00184 0.00461
#> ICs:
#> AIC AICc BIC BICc
#> 879.6797 870.4905 903.4999 883.3663
#>
#> Error standard deviation: 0.14724
#> Sample size: 80
#> Number of estimated parameters: 5
#> Number of degrees of freedom: 75
summary(modelMixture$occurrence)
#> Response variable: y
#> Distribution used in the estimation: Cumulative logistic
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 35.96730 4.34649 27.30865 44.62595
#> x1 0.00399 0.04944 -0.09450 0.10248
#> x2 0.14026 0.02819 0.08410 0.19641
#> Noise -0.13469 0.01399 -0.16255 -0.10683
#> ICs:
#> AIC AICc BIC BICc
#> 81.92462 82.73543 93.83475 95.61125
#>
#> Error standard deviation: 1.27247
#> Sample size: 80
#> Number of estimated parameters: 5
#> Number of degrees of freedom: 75
```

After that we can produce forecasts using the data from the holdout sample:

```
predict(modelMixture,outSample,interval="p")
#> Mean Lower 2.7% Upper 97.3%
#> 1 456.4769 349.8352 639.2850
#> 2 475.4785 374.1629 667.0182
#> 3 469.0029 347.0686 636.8518
#> 4 453.9423 378.8795 671.4977
#> 5 520.0606 385.0298 716.3098
#> 6 387.5092 382.2922 663.5157
#> 7 478.3313 386.8736 692.2477
#> 8 385.9622 369.2908 649.3134
#> 9 482.5621 387.7402 696.3131
#> 10 466.0458 376.5861 667.6786
#> 11 412.8769 367.7416 646.6805
#> 12 465.0656 355.3933 639.2714
#> 13 440.3319 383.9213 677.0725
#> 14 465.9037 356.1993 638.8538
#> 15 437.8511 347.1068 628.6281
#> 16 464.6484 375.0597 672.1414
#> 17 478.0589 355.3003 647.0887
#> 18 484.2742 368.6226 660.4383
#> 19 461.0321 373.5431 661.3935
#> 20 441.3618 366.3855 649.2353
```

If you expect autoregressive elements in the data, then you can specify the order of AR via the respective parameter

```
modelMixtureAR <- alm(y~x1+x2+Noise, inSample, distribution="dlnorm", occurrence=modelOccurrence, ar=1)
summary(modelMixtureAR)
#> Response variable: y
#> Distribution used in the estimation: Mixture of Log Normal and Cumulative logistic
#> ARIMA(1,0,0) components were included in the model
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 4.88386 0.81755 3.25485 6.51288
#> x1 -0.00760 0.00565 -0.01886 0.00367
#> x2 -0.00087 0.00322 -0.00729 0.00554
#> Noise 0.00120 0.00160 -0.00198 0.00438
#> yLag1 0.17342 0.10815 -0.04207 0.38891
#> ICs:
#> AIC AICc BIC BICc
#> 879.3469 870.4976 905.5492 886.1603
#>
#> Error standard deviation: 0.14511
#> Sample size: 80
#> Number of estimated parameters: 6
#> Number of degrees of freedom: 74
plot(predict(modelMixtureAR,outSample,interval="p",side="u"))
```

If the explanatory variables are not available for the holdout sample, the `forecast()`

function can be used:

```
plot(forecast(modelMixtureAR, h=10, interval="p",side="u"))
#> Warning: No newdata provided, the values will be forecasted
```

It will produce forecasts for each of the explanatory variables based on the available data using `es()`

function from `smooth`

package (if it is available; otherwise, it will use Naive) and use those values as the new data.

Geraci, Marco, and Matteo Bottai. 2007. “Quantile regression for longitudinal data using the asymmetric Laplace distribution.” *Biostatistics* 8 (1): 140–54. https://doi.org/10.1093/biostatistics/kxj039.

Yu, Keming, and Jin Zhang. 2005. “A three-parameter asymmetric laplace distribution and its extension.” *Communications in Statistics - Theory and Methods* 34 (9-10): 1867–79. https://doi.org/10.1080/03610920500199018.