% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/simulate_gmm.R
\name{simulate_gmm}
\alias{simulate_gmm}
\title{Simulate data from a Gaussian mixture model with outliers}
\usage{
simulate_gmm(
  n,
  mu,
  sigma,
  outlier_num,
  seed = NULL,
  crit_val = 0.9999,
  range_multiplier = 1.5,
  verbose = TRUE,
  max_rejection = 1e+06
)
}
\arguments{
\item{n}{Vector of component sizes.}

\item{mu}{List of component mean vectors.}

\item{sigma}{List of component covariance matrices.}

\item{outlier_num}{Desired number of outliers.}

\item{seed}{Seed for the random number generator.}

\item{crit_val}{Probability level (e.g. \code{0.9999}) that determines the
Chi-squared critical value used to reject proposed Uniform samples.}

\item{range_multiplier}{Ratio of the range of the Uniform samples to the
range of the Gaussian samples in each dimension.}

\item{verbose}{Whether to print a message when a large number of candidate
outliers is being simulated. Such a message suggests that many
simulated outliers are being rejected and that the other arguments
may need to be adjusted.}

\item{max_rejection}{Maximum number of simulated outliers to be rejected.}
}
\value{
\code{simulate_gmm} returns a \code{data.frame} with continuous variables
\code{X1}, \code{X2}, ..., followed by a mixture component label vector \code{G} with
outliers denoted by \code{0}.
}
\description{
Simulates data from a Gaussian mixture model, then simulates outliers from a
hyper-rectangle, with a rejection step to ensure that the outliers are
sufficiently unlikely under the model.
}
\details{
The simulated outliers are sampled from a Uniform distribution over a
hyper-rectangle. For each dimension, the hyper-rectangle is centred at the
midpoint between the maximum and minimum values for that variable from all of
the Gaussian observations. Its width in that dimension is the distance
between the minimum and maximum values for that variable multiplied by the
value of \code{range_multiplier}. If \code{range_multiplier = 1}, then this
hyper-rectangle is the axis-aligned minimum bounding box for all of the
Gaussian data points in this data set.

The \code{crit_val} argument ensures that it would have been sufficiently
unlikely for a simulated outlier to have been sampled from any of the Gaussian
components. The Mahalanobis distance of a proposed outlier from each
component's mean vector, with respect to that component's covariance matrix,
is computed. If any of these Mahalanobis distances is smaller than the
critical value of the appropriate Chi-squared distribution, then the proposed
outlier is rejected.
In summary, for a Uniform sample to be accepted, it must be sufficiently far
from each component in terms of Mahalanobis distance.
}
\examples{
gmm_k3n1000o10 <- simulate_gmm(
  n = c(500, 250, 250),
  mu = list(c(-1, 0), c(+1, -1), c(+1, +1)),
  sigma = list(diag(c(0.2, 4 * 0.2)), diag(c(0.2, 0.2)), diag(c(0.2, 0.2))),
  outlier_num = 10,
  seed = 123,
  crit_val = 0.9999,
  range_multiplier = 1.5
)
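
# A rough illustration of the hyper-rectangle described in Details. It is
# recomputed here from the simulated Gaussian observations and is not taken
# from the function's internals. The object names below are illustrative only
# and are not part of the package.
gauss_obs <- gmm_k3n1000o10[gmm_k3n1000o10$G != 0, c("X1", "X2")]
obs_range <- apply(gauss_obs, 2, range)  # row 1: minima, row 2: maxima
centre <- colMeans(obs_range)            # midpoint of each variable
# Half of the box width, using the same range_multiplier (1.5) as above.
half_width <- 1.5 * (obs_range[2, ] - obs_range[1, ]) / 2
rbind(lower = centre - half_width, upper = centre + half_width)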

plot(
  gmm_k3n1000o10[, c("X1", "X2")],
  col = gmm_k3n1000o10$G + 1, pch = gmm_k3n1000o10$G + 1
)
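
# A sketch of the rejection rule from Details, applied to the accepted
# outliers. Assumptions (not confirmed by this page): the squared Mahalanobis
# distance is compared with the crit_val quantile of a Chi-squared
# distribution whose degrees of freedom equal the number of variables.
outliers <- as.matrix(gmm_k3n1000o10[gmm_k3n1000o10$G == 0, c("X1", "X2")])
mu <- list(c(-1, 0), c(+1, -1), c(+1, +1))
sigma <- list(diag(c(0.2, 4 * 0.2)), diag(c(0.2, 0.2)), diag(c(0.2, 0.2)))
cutoff <- stats::qchisq(0.9999, df = ncol(outliers))
# stats::mahalanobis() returns squared distances; one column per component.
sq_dist <- mapply(
  function(m, s) stats::mahalanobis(outliers, center = m, cov = s),
  mu, sigma
)
# Under the assumptions above, every entry would be expected to exceed cutoff.
all(sq_dist > cutoff)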
}
