In this vignette we demonstrate how to define new methods, with varying degrees of completeness.

We will generate a dataset comprising two clusters with a different intercept and slope. Firstly, we configure the default id, time, and response variable, so that these do not need to be provided to the other function calls.

```
library(latrend)
options(latrend.id = "Traj",
latrend.time = "Time",
latrend.verbose = TRUE)
```

Next, we generate the dataset, with 40 trajectories for cluster A and 60 trajectories for cluster B. Cluster A involves trajectories with a downward slope, whereas cluster B has an upward slope. We define a fixed mean of 1, such that all the cluster trajectories are shifted, placing cluster A at intercept 2, and cluster B at intercept 1.

```
set.seed(1)
<- generateLongData(sizes = c(40, 60),
casedata data = data.frame(Time = 0:10),
fixed = Y ~ 1,
fixedCoefs = 1,
cluster = ~ Time,
clusterCoefs = cbind(c(2, -.1), c(0, .05)),
random = ~ Time,
randomScales = cbind(c(.2, .02), c(.2, .02)),
noiseScales = .05
)
```

We plot the data to get a sense of different trajectories that have been generated. Visually, there is little group separation.

`plotTrajectories(casedata, response = "Y")`

Since we generated the data, we have the luxury of looking at reference cluster trajectories, as stored in the `Mu.cluster`

column. Note that `Mu.fixed`

must be added to obtain the correct values.

`plotClusterTrajectories(casedata, response = "Y", cluster = "Class")`

Rather than starting with clustering, in some case studies there may be prior knowledge on how to sensibly stratify the trajectories. Either by an observed independent variable (e.g., age or gender), or a certain aspect of the observed trajectory (e.g., the intercept, slope, variability).

Letâ€™s presume in this example that domain knowledge suggests that stratifying by the intercept may provide a sensible grouping of the trajectories. This approach is further supported by the density plot of the trajectory intercepts, which shows a bimodal distribution.

```
ggplot(casedata[Time == 0], aes(x = Mu)) +
geom_density(fill = "gray", adjust = .3)
```

Based on the density plot, we will assign trajectories with an intercept above 1.6 to cluster A, and the remainder to cluster B.

```
<- lcMethodStratify(response = "Y", Y[1] > 1.6)
method <- latrend(method, casedata) model
```

```
clusterProportions(model)
#> A B
#> 0.6 0.4
```

The approach we specified requires the first observation to be available, and is sensitive to noise. A more robust alternative would be to fit a linear model per trajectory, and to use the estimated intercept to stratify the trajectories on.

We can specify this stratification model by defining a function which takes the data of separate trajectories as input. This function should return a single cluster assignment for the respective trajectory. By returning a factor, we can pre-specify the cluster names.

```
<- function(data) {
stratfun <- coef(lm(Y ~ Time, data))[1]
int factor(int > 1.7,
levels = c(FALSE, TRUE),
labels = c("Low", "High"))
}<- lcMethodStratify(response = "Y", stratify = stratfun, center = mean)
m2 <- latrend(m2, casedata)
model2
clusterProportions(model2)
#> Low High
#> 0.6 0.4
```

In case the linear regression step is time-intensive, a more efficient approach is to save the pre-computed trajectory intercepts as a column in the original data. This column can then be referred to in the expression of the stratification model.

```
:= coef(lm(Y ~ Time, .SD))[1], by = Traj]
casedata[, Intercept
<- lcMethodStratify(response = "Y", stratify = Intercept[1] > 1.7,
m3 clusterNames = c("Low", "High"))
<- latrend(m3, casedata) model3
```

We can take the approach involving the estimation of a linear model per trajectory one step further. Instead of using a pre-defined threshold on the intercept, we use a cluster algorithm on both the intercept and slope to automatically find clusters.

We first define the representation step, which estimates the model coefficients per trajectory, and outputs a `matrix`

with the coefficients per trajectory per row.

```
<- function(method, data, verbose) {
repStep <- data[, lm(Y ~ Time, .SD) %>% coef() %>% as.list(), keyby = Traj]
coefdata <- subset(coefdata, select = -1) %>% as.matrix()
coefmat rownames(coefmat) <- coefdata$Traj
return(coefmat)
}
```

The cluster step takes the coefficient matrix as input. A cross-sectional cluster algorithm can then be applied to the matrix. In this example, we apply \(k\)-means. The cluster step should output a `lcModel`

object.

The `lcModelCustom`

function creates a `lcModel`

object for a given vector of cluster assignments.

```
<- function(method, data, repMat, envir, verbose) {
clusStep <- kmeans(repMat, centers = 3)
km
lcModelCustom(response = method$response,
method = method,
data = data,
trajectoryAssignments = km$cluster,
clusterTrajectories = method$center,
model = km)
}
```

We are now ready to create the `lcMethodFeature`

method.

`<- lcMethodFeature(response = "Y", representationStep = repStep, clusterStep = clusStep) m.twostep `

```
<- latrend(m.twostep, data = casedata)
model.twostep summary(model.twostep)
#> Longitudinal cluster model using two-step clustering
#> id: "Traj"
#> time: "Time"
#> center: function (x) { mean(x, na.rm = TRUE)}
#> standardize: `scale`
#> clusterStep: `clusStep`
#> representationStep:`repStep`
#> response: "Y"
#>
#> Cluster sizes (K=3):
#> A B C
#> 35 (35%) 25 (25%) 40 (40%)
#>
#> Number of obs: 1100, strata (Traj): 100
#>
#> Scaled residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> NA NA NA NaN NA NA 1
```

The two-step model defined above is hard-coded for a given formula and a fixed number of clusters. In an exploratory setting, it is convenient to define a parameterized method. Here, we change the two functions to take arguments through the `lcMethod`

object in the `method`

variable.

Note that we can introduce new arguments which are not originally part of `lcMethodFeature`

(e.g., `nClusters`

) to enable the specification of the number of clusters in our method.

```
<- function(method, data, verbose) {
repStep.gen <- data[, lm(method$formula, .SD) %>% coef() %>% as.list(), keyby = c(method$id)]
coefdata # exclude the id column
<- subset(coefdata, select = -1) %>% as.matrix()
coefmat rownames(coefmat) <- coefdata[[method$id]]
return(coefmat)
}
<- function(method, data, repMat, envir, verbose) {
clusStep.gen <- kmeans(repMat, centers = method$nClusters)
km
lcModelCustom(response = "Y",
method = method,
data = data,
trajectoryAssignments = km$cluster,
clusterTrajectories = method$center,
model = km)
}
```

We create a new `lcMethodFeature`

instance with the more generic functions. Defining values for `formula`

and `nClusters`

here makes these arguments values act as default values in a call of `latrend`

.

```
<- lcMethodFeature(response = "Y",
m.twostepgen representationStep = repStep.gen,
clusterStep = clusStep.gen)
```

However, because we omitted the specification of `formula`

and `nClusters`

, these need to be provided in the `latrend`

call.

```
<- latrend(m.twostepgen, formula = Y ~ Time, nClusters = 2, casedata)
model.twostepgen summary(model.twostepgen)
#> Longitudinal cluster model using two-step clustering
#> nClusters: 2
#> formula: Y ~ Time
#> id: "Traj"
#> time: "Time"
#> center: function (x) { mean(x, na.rm = TRUE)}
#> standardize: `scale`
#> clusterStep: `clusStep.gen`
#> representationStep:`repStep.gen`
#> response: "Y"
#>
#> Cluster sizes (K=2):
#> A B
#> 40 (40%) 60 (60%)
#>
#> Number of obs: 1100, strata (Traj): 100
#>
#> Scaled residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> NA NA NA NaN NA NA 1
```

The use of `lcMethodStratify`

and `lcMethodFeature`

enables rapid prototyping. Once the desired model has been identified, it may be worthwhile to implement it as a standalone method in the framework. This way the model can be extended to output more representative cluster trajectories, or to extend the model with predictive capabilities such that it can be validated on external data.

To illustrate this, we will implement a basic group-based trajectory model (GBTM), also known as latent-class growth analysis. We make use of the implementation available in the `lcmm`

package through the `lcmm()`

function, which allows for a relatively concise illustration.

`lcMethod`

classFirstly, we create a new method class named `lcMethodSimpleGBTM`

, which extends the `lcMethod`

class.

`setClass("lcMethodSimpleGBTM", contains = "lcMethod")`

Note that a `lcMethod`

class has a single slot `call`

, which stores all the arguments to the method. The other relevant aspects of a `lcMethod`

implementation are the `prepareData()`

and `fit()`

functions, which are called when passing a `lcMethod`

object to the `latrend()`

function or any other model estimation function.

```
<- function(formula = Value ~ Time,
lcMethodSimpleGBTM time = getOption("latrend.time"),
id = getOption("latrend.id"),
nClusters = 2,
nwg = FALSE) {
lcMethod.call("lcMethodSimpleGBTM", call = stackoverflow::match.call.defaults())
}
```

Next, we override the `getName()`

and `getShortName()`

functions in our method class to ensure that the model is easily distinguishable from other methods in the summary output. While we return a simple constant character sequence, the naming functions could be improved by generating a more detailed description of the model based on the arguments of the method object.

```
setMethod("getName",
signature("lcMethodSimpleGBTM"), function(object, ...) "simple group-based trajectory model")
setMethod("getShortName", signature("lcMethodSimpleGBTM"), function(object, ...) "sgbtm")
```

Implementing the `prepareData()`

function enables the `lcMethod`

to do data processing or transforming the data or other arguments in a structure which is suitable for the model fitting procedure. The prepare function takes a `lcMethod`

, the data as a `data.frame`

, and the verbosity level as inputs. The processing is passed onto the `fit`

function by returning an `environment`

with the relevant variables.

In this example, we need to ensure that the `Id`

column is an integer.

```
setMethod("prepareData", signature("lcMethodSimpleGBTM"), function(method, data, verbose, ...) {
<- new.env()
envir $data <- as.data.frame(data)
envir$data[[method$id]] <- factor(data[[method$id]]) %>% as.integer()
envirreturn(envir)
})
```

The `fit()`

function estimates the model and should return an object that extends the `lcModel`

class. In our implementation, we call the `lcmm()`

with the appropriate arguments, and we construct a `lcModelSimpleGBTM`

instance. The class is defined in the subsection below. The `envir`

argument contains the return value of the `prepare()`

function, which would be `NULL`

here.

```
setMethod("fit", signature("lcMethodSimpleGBTM"), function(method, data, envir, verbose, ...) {
<- as.list(method, args = lcmm::lcmm)
args $data <- envir$data
args$fixed <- method$formula
argsif (method$nClusters > 1) {
$mixture <- update(method$formula, NULL ~ .)
argselse {
} $mixture <- NULL
args
}$subject <- method$id
args$ng <- method$nClusters
args$returndata <- TRUE
args
<- do.call(lcmm::lcmm, args)
model
new("lcModelSimpleGBTM",
method = method,
data = data,
model = model,
clusterNames = LETTERS[seq_len(method$nClusters)])
})
```

`lcModel`

classWe start off by defining the `lcModelSimpleGBTM`

as an extension of the `lcModel`

class. Extending the base class also enables the addition of new fields to the class, although there is no need for this in the present model.

`setClass("lcModelSimpleGBTM", contains = "lcModel")`

Our model inherits a couple of slots from the `lcModel`

class, namely:

```
slotNames("lcModelSimpleGBTM")
#> [1] "model" "method" "call" "data"
#> [5] "id" "time" "response" "label"
#> [9] "ids" "clusterNames" "date" "estimationTime"
#> [13] "tag"
```

The `"model"`

slot is free to be used to assign an arbitrary data structure that represents the internal model representation(s). In our implementation, this slot contains the `lcmm`

model. The other slots are assigned in the constructor of the `lcModel`

class, or by the `latrend()`

estimation function. It is best practice to use the relevant getter functions for obtaining these values (via e.g., `idVariable()`

, `getLcMethod()`

, `clusterNames()`

).

Typically, most effort of implementing the model interface goes to the `predictForCluster()`

function. While implementing the function is optional, it is used for obtaining the fitted values, fitted trajectories, residuals, and cluster trajectories. Without this function, there is little functionality from a model except for the partitioning of the fitted trajectories.

The `fitted()`

function returns the fitted values per trajectory per cluster. If the clusters are not provided, the function is expected to return a matrix with the fitted values for each cluster. This logic is handled in the `transformFitted()`

function, which is available in the package.

```
<- function(object, clusters = trajectoryAssignments(object)) {
fitted.lcModelSimpleGBTM <- paste0("pred_m", 1:nClusters(object))
predNames <- as.matrix(object@model$pred[predNames])
predMat colnames(predMat) <- clusterNames(object)
transformFitted(pred = predMat, model = object, clusters = clusters)
}
```

The easiest way to add predictive capability to a model is by implementing the function `predictForCluster()`

. This function returns the predicted values of a specific cluster, for the respective observations. This function is used by `predict.lcModel(model, newdata)`

function, which also reduces the output of `fitted(model)`

in case of `newdata = NULL`

.

```
setMethod('predictForCluster', signature('lcModelSimpleGBTM'), function(
what = 'mu', ...)
object, newdata, cluster,
{= lcmm::predictY(object@model, newdata = newdata)$pred %>%
predMat set_colnames(clusterNames(object))
= match(cluster, clusterNames(object))
clusIdx
predMat[, clusIdx] })
```

In order for the model implementation to return cluster assignments through the `trajectoryAssignments()`

function, we need to implement the `postprob()`

function to return the posterior probability matrix for the fitted data.

```
setMethod("postprob", signature("lcModelSimpleGBTM"), function(object) {
as.matrix(object@model$pprob)[, c(-1, -2), drop = FALSE]
})
```

Lastly, we override the `converged()`

function to return the convergence code of the internal model.

```
setMethod("converged", signature("lcModelSimpleGBTM"), function(object, ...) {
@model$conv
object })
```

```
<- lcMethodSimpleGBTM(formula = Y ~ Time)
m show(m)
#> lcMethodSimpleGBTM as "simple group-based trajectory model"
#> formula: Y ~ Time
#> time: getOption("latrend.time")
#> id: getOption("latrend.id")
#> nClusters: 2
#> nwg: FALSE
```

```
<- latrend(m, casedata)
sgbtm #> Be patient, lcmm is running ...
#> The program took 0.15 seconds
summary(sgbtm)
#> Longitudinal cluster model using simple group-based trajectory model
#> nwg: FALSE
#> nClusters: 2
#> id: "Traj"
#> time: "Time"
#> formula: Y ~ Time
#>
#> Cluster sizes (K=2):
#> A B
#> 40 (40%) 60 (60%)
#>
#> Number of obs: 1100, strata (Traj): 100
#>
#> Scaled residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -2.1126 -0.9295 0.4772 0.0000 0.8352 1.3507
```

`plot(sgbtm)`