\name{group.knowns}
\alias{group.knowns}
\alias{group.knowns.TRAMP}
\alias{group.knowns.TRAMPknowns}

\title{Knowns Clustering}

\description{Group a \code{\link{TRAMPknowns}} object so that knowns
  with similar TRFLP patterns and knowns that share the same species
  name \dQuote{group} together. In general, this function will be called
  automatically whenever appropriate (e.g. when loading a data set or
  adding new knowns).  Please see Details to understand why this
  function is necessary, and how it works.

  The main reason for manually calling \code{group.knowns} is to change
  the default values of the arguments; if you call \code{group.knowns}
  on a \code{TRAMPknowns} object, then any subsequent automatic call to
  \code{group.knowns} will use any arguments you passed in the
  manual \code{group.knowns} call (e.g. after doing
  \code{group.knowns(x, cut.height=20)}, all future groupings will use
  \code{cut.height=20}).
}

\usage{
group.knowns(x, ...)
\method{group.knowns}{TRAMPknowns}(x, dist.method, hclust.method, cut.height, ...)
\method{group.knowns}{TRAMP}(x, ...)
}

\arguments{
  \item{x}{A \code{\link{TRAMPknowns}} or \code{\link{TRAMP}} object,
    containing identified TRFLP patterns.}
  \item{dist.method}{Distance method used in calculating similarity
    between different knowns (see \code{\link{dist}}).  Valid options
    include \code{"maximum"}, \code{"euclidean"} and
    \code{"manhattan"}.}
  \item{hclust.method}{Clustering method used in generating clusters
    from the similarity matrix (see \code{\link{hclust}}).}
  \item{cut.height}{Passed to \code{\link{cutree}}; controls how similar
    members of each group should be (the larger \code{cut.height}, the
    more inclusive the groups of knowns will be).}
  \item{...}{Arguments passed to further methods.}
}

\details{
  \code{group.knowns} groups together knowns in a
  \code{\link{TRAMPknowns}} object based on two criteria: (1) TRFLP
  profiles that are very similar across shared enzyme/primer
  combinations (based on clustering) and (2) TRFLP profiles that belong
  to the same species (i.e. share a common \code{species} column in the
  \code{info} data.frame of \code{x}; see \code{\link{TRAMPknowns}} for
  more information).  This is to solve three issues in TRFLP analysis:
  \enumerate{
    \item The TRFLP profile of a single species can have variation in
    peak sizes due to DNA sequence variation.  By including multiple
    collections of each species, variation in TRFLP profiles can be
    accounted for.  If a \code{\link{TRAMPknowns}} object contains
    multiple collections of a species, these will be aggregated by
    \code{group.knowns}.  This aggregation is essential for community
    analysis, as leaving individual collections ungrouped would
    artificially inflate the number of \dQuote{present species} when
    running \code{\link{TRAMP}}.

    Some authors have taken an alternative approach by using a larger
    tolerance in matching peaks between samples and knowns (effectively
    increasing \code{accept.error} in \code{\link{TRAMP}}) to account
    for within-species variation.  This is not recommended, as it
    dramatically increases the risk of incorrect matches.

    \item Distinctly different TRFLP profiles may occur within a species
    (or in some cases within an individual); see Avis et al. (2006).
    \code{group.knowns} looks at the \code{species} column of the
    \code{info} data.frame of \code{x} and joins any knowns with
    identical \code{species} values as a group.
    %% TODO: this originally said "in a species", but that's what we've
    %% already covered.  If my interpretation is correct, how do we get
    %% >1 profile/individual out of the data?
    This can also be used where multiple profiles are present in an
    individual.

    \item Different species may share a similar TRFLP profile and
    therefore be indistinguishable using TRFLP. If these patterns are
    not grouped, two species will be recorded as present wherever either
    is present. \code{group.knowns} prevents this by joining knowns with
    \dQuote{very similar} TRFLP patterns as a group.  Ideally, these
    problematic groups can be resolved by increasing the number of
    enzyme/primer pairs in the data.
  }

  Group names are generated by concatenating all unique (sorted)
  species names together, separated by commas.
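
  As a minimal sketch of this naming rule (the \code{species} and
  \code{group} vectors are made-up illustrations, and the exact
  separator used internally is an assumption):
  \preformatted{
species <- c("Amanita", "Russula", "Amanita", "Laccaria")
group   <- c(1, 1, 1, 2)
## One name per group: unique species names, sorted, comma-separated
tapply(species, group, function(s) paste(sort(unique(s)), collapse=","))
  }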

  To determine if knowns are \dQuote{similar enough} to form a group, we
  use \R's clustering tools: \code{\link{dist}}, \code{\link{hclust}}
  and \code{\link{cutree}}.  First, we generate a distance matrix of the
  knowns profiles using \code{\link{dist}} with method
  \code{dist.method} (see Examples; this is very similar to what
  \code{\link{TRAMP}} does, so \code{dist.method} should match the
  \code{method} used there).  We then generate clusters using
  \code{\link{hclust}} with method \code{hclust.method}, and
  \dQuote{cut} the tree at \code{cut.height} using
  \code{\link{cutree}}.
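
  The three steps above can be sketched on a toy matrix using base \R
  only.  The matrix \code{m}, the \code{"complete"} linkage and the cut
  height of 2.5 are illustrative assumptions, not necessarily the
  values \code{group.knowns} uses:
  \preformatted{
## Rows are knowns, columns are enzyme/primer combinations
m  <- rbind(k1=c(100, 250), k2=c(101, 251), k3=c(180, 300))
d  <- dist(m, method="maximum")      # distance matrix
h  <- hclust(d, method="complete")   # hierarchical clustering
cl <- cutree(h, h=2.5)               # cut the tree at cut.height
cl                                   # k1 and k2 share a cluster; k3 is alone
  }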

  Knowns are grouped together iteratively, so that all knowns sharing a
  common cluster are grouped together, and all knowns that share a
  common species name are grouped together.  In certain cases this may
  chain together seemingly unrelated groups.
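
  The chaining effect can be illustrated with a small sketch.  This is
  not the package's internal code; the \code{cluster} and \code{species}
  vectors are hypothetical, and the loop simply merges groups until
  neither criterion changes anything:
  \preformatted{
cluster <- c(1, 1, 2, 3, 3)            # cluster ids, e.g. from cutree()
species <- c("A", "B", "B", "C", "D")  # species name of each known
group <- cluster
repeat {
  old <- group
  group <- ave(group, species, FUN=min)  # join knowns sharing a species
  group <- ave(group, cluster, FUN=min)  # join knowns sharing a cluster
  if (identical(group, old)) break
}
group  # 1 1 1 3 3
  }
  Here the first known (species A) and the third known (cluster 2) end
  up in one group only because each is linked to the second known.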

  Because \code{group.knowns} is generic, it can be run on either a
  \code{\link{TRAMPknowns}} or a \code{\link{TRAMP}} object.  When run
  on a \code{TRAMP} object, it updates the \code{TRAMPknowns} object
  (stored as \code{x$knowns}), so that subsequent calls to
  \code{\link{plot.TRAMPknowns}} or \code{\link{summary.TRAMPknowns}}
  (for example) will use the new grouping parameters.

  Parameters set by \code{group.knowns} are retained as part of the
  object, so that when adding additional knowns (\code{\link{add.known}}
  and \code{combine}), or when subsetting a knowns database (see
  \code{\link{[.TRAMPknowns}}, %] Thanks emacs!
  aka \code{\link{TRAMPindexing}}), the same grouping parameters will be
  used.
}

\value{
  For \code{group.knowns.TRAMPknowns}, a new \code{TRAMPknowns} object.
  The \code{cluster.pars} element will have been updated with new
  parameters, if any were specified.

  For \code{group.knowns.TRAMP}, a new \code{TRAMP} object, with an
  updated \code{knowns} element.  Note that the \emph{original}
  \code{TRAMPknowns} object (i.e. the one from which the \code{TRAMP}
  object was constructed) will \emph{not} be modified.
}

\section{Warning}{
  Missing data: where knowns have \code{NA} values for certain
  enzyme/primer combinations, \code{NA}s may be present in the final
  distance matrix, in which case \code{hclust} cannot be used to
  generate the clusters.  In general, \code{NA} values are fine,
  provided they are not so widespread that some pairs of knowns cannot
  be compared at all.
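
  A minimal base \R sketch of how this arises (the matrix \code{m} is a
  made-up illustration): \code{\link{dist}} returns \code{NA} for any
  pair of rows that have no non-missing columns in common.
  \preformatted{
m <- rbind(a=c(100, NA), b=c(NA, 250), c=c(100, 250))
d <- dist(m, method="maximum")
anyNA(d)  # TRUE: a and b share no combination, so hclust(d) would fail
  }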
}

\seealso{
  \code{\link{TRAMPknowns}}, which describes the \code{TRAMPknowns}
  object.
  
  \code{\link{build.knowns}}, which attempts to generate a knowns
  database from a \code{\link{TRAMPsamples}} data set.

  \code{\link{plot.TRAMPknowns}}, which graphically displays the
  relationships between knowns.
}

\references{
  Avis PG, Dickie IA, Mueller GM 2006. A \sQuote{dirty} business:
  testing the limitations of terminal restriction fragment length
  polymorphism (TRFLP) analysis of soil fungi. Molecular Ecology
  15: 873-882.
}

\examples{
data(demo.knowns)
data(demo.samples)

demo.knowns <- group.knowns(demo.knowns, cut.height=2.5)
plot(demo.knowns)

## Increasing cut.height makes groups more inclusive:
plot(group.knowns(demo.knowns, cut.height=100))
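
## A sketch of how parameters persist: arguments passed to a manual
## call are stored with the returned object (in the cluster.pars
## element; see Value), and are reused by later automatic regroupings.
## The name k20 is just an illustrative choice.
k20 <- group.knowns(demo.knowns, cut.height=20)
k20$cluster.pars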

res <- TRAMP(demo.samples, demo.knowns)
m1.ungrouped <- summary(res)
m1.grouped <- summary(res, group=TRUE)
ncol(m1.grouped) # 94 groups

res2 <- group.knowns(res, cut.height=100)
m2.ungrouped <- summary(res2)
m2.grouped <- summary(res2, group=TRUE)
ncol(m2.grouped) # Now only 38 groups
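
## The knowns stored inside the TRAMP object (res2$knowns) carry the
## new grouping parameters, while the original demo.knowns object is
## left untouched (see Value):
res2$knowns$cluster.pars
demo.knowns$cluster.pars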

## group.knowns computes the same distance matrix as TRAMP does, so
## using the same method (e.g. method="maximum") in both is important.
## The example below shows that the matrix produced by
## dist(summary(x)) (as calculated by group.knowns) is the same as that
## produced by TRAMP:
f <- function(x, method="maximum") {
  ## Create a pseudo-samples object from our knowns
  y <- x
  y$data$height <- 1
  names(y$info)[names(y$info) == "knowns.pk"] <- "sample.pk"
  names(y$data)[names(y$data) == "knowns.fk"] <- "sample.fk"
  class(y) <- "TRAMPsamples"

  ## Run TRAMP, clean up and return
  ## (If method != "maximum", rescale the error to match that
  ## generated by dist()).
  z <- TRAMP(y, x, method=method)
  if ( method != "maximum" ) z$error <- z$error * z$n
  names(dimnames(z$error)) <- NULL
  z
}

g <- function(x, method="maximum")
  as.matrix(dist(summary(x), method=method))

all.equal(f(demo.knowns, "maximum")$error,   g(demo.knowns, "maximum"))
all.equal(f(demo.knowns, "euclidean")$error, g(demo.knowns, "euclidean"))
all.equal(f(demo.knowns, "manhattan")$error, g(demo.knowns, "manhattan"))

## However, TRAMP is over 100 times slower in this special case.
system.time(f(demo.knowns))
system.time(g(demo.knowns))
}

\keyword{multivariate}
\keyword{cluster}
