% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vs_cluster_size.R
\name{vs_cluster_size}
\alias{vs_cluster_size}
\alias{vs_cluster}
\alias{cluster_size}
\alias{cluster}
\title{Cluster FASTA sequences}
\usage{
vs_cluster_size(
  fasta_input,
  centroids = NULL,
  otutabout = NULL,
  size_column = FALSE,
  id = 0.97,
  strand = "plus",
  sizein = TRUE,
  sizeout = TRUE,
  relabel = NULL,
  relabel_sha1 = FALSE,
  fasta_width = 0,
  sample = NULL,
  log_file = NULL,
  threads = 1,
  vsearch_options = NULL,
  tmpdir = NULL
)
}
\arguments{
\item{fasta_input}{(Required). A FASTA file path or a FASTA object containing
reads to cluster. See \emph{Details}.}

\item{centroids}{(Optional). A character string specifying the name of the
FASTA output file for the cluster centroid sequences. If \code{NULL}
(default), no output is written to a file and the centroid sequences are
returned as a FASTA object. See \emph{Details}.}

\item{otutabout}{(Optional). Either a character string specifying the name of
the output file for the OTU table, or \code{TRUE} to return the OTU table as
a tibble. If \code{NULL} (default), no OTU table is produced. See
\emph{Details}.}

\item{size_column}{(Optional). If \code{TRUE}, a column with the size of each
centroid is added to the centroid output tibble. Defaults to \code{FALSE}.}

\item{id}{(Optional). Pairwise identity threshold for a sequence to be added
to a cluster. Defaults to \code{0.97}. See \emph{Details}.}

\item{strand}{(Optional). Specifies which strand to consider when comparing
sequences. Can be either \code{"plus"} (default) or \code{"both"}.}

\item{sizein}{(Optional). If \code{TRUE} (default), abundance annotations
present in sequence headers are taken into account.}

\item{sizeout}{(Optional). If \code{TRUE} (default), abundance annotations
are added to FASTA headers.}

\item{relabel}{(Optional). Relabel sequences using the given prefix and a
ticker to construct new headers. Defaults to \code{NULL}.}

\item{relabel_sha1}{(Optional). If \code{TRUE}, relabel sequences using the
SHA1 message digest algorithm. Defaults to \code{FALSE}.}

\item{fasta_width}{(Optional). Number of characters per line in the output
FASTA file. Defaults to \code{0}, which eliminates wrapping.}

\item{sample}{(Optional). Add the given sample identifier string to sequence
headers. For instance, if the given string is "ABC", the text ";sample=ABC"
will be added to the header. This option is only applicable when the output
format is FASTA (\code{centroids}). If \code{NULL} (default), no identifier
is added.}

\item{log_file}{(Optional). Name of the log file to capture messages from
\code{VSEARCH}. If \code{NULL} (default), no log file is created.}

\item{threads}{(Optional). Number of computational threads to be used by
\code{VSEARCH}. Defaults to \code{1}.}

\item{vsearch_options}{(Optional). Additional arguments to pass to
\code{VSEARCH}. Defaults to \code{NULL}. See \emph{Details}.}

\item{tmpdir}{(Optional). Path to the directory where temporary files should
be written when tables are used as input or output. Defaults to
\code{NULL}, which resolves to the session-specific temporary directory
(\code{tempdir()}).}
}
\value{
A tibble or \code{NULL}.

If \code{centroids} is specified, the centroid sequences are written to the
specified file, and no tibble is returned.

If \code{otutabout} is \code{TRUE}, an OTU table is returned as a tibble.
If \code{otutabout} is a character string, the output is written to the file,
and no tibble is returned.

If neither \code{centroids} nor \code{otutabout} is specified, a FASTA object
with the centroid sequences and additional column \code{otu_id} is returned.
The clustering statistics are included as an attribute named
\code{"statistics"}.

The \code{"statistics"} attribute of the returned tibble (when
\code{centroids} is \code{NULL}) is a tibble with the following columns:
\itemize{
  \item \code{num_nucleotides}: Total number of nucleotides used as input for
  clustering.
  \item \code{min_length_input_seq}: Length of the shortest sequence used as
  input for clustering.
  \item \code{max_length_input_seq}: Length of the longest sequence used as
  input for clustering.
  \item \code{avg_length_input_seq}: Average length of the sequences used as
  input for clustering.
  \item \code{num_clusters}: Number of clusters generated.
  \item \code{min_size_cluster}: Size of the smallest cluster.
  \item \code{max_size_cluster}: Size of the largest cluster.
  \item \code{avg_size_cluster}: Average size of the clusters.
  \item \code{num_singletons}: Number of singletons after clustering.
  \item \code{input}: Name of the input file/object for the clustering.
}
}
\description{
\code{vs_cluster_size} clusters FASTA sequences from a given
file or object using \code{VSEARCH}'s \code{cluster_size} method. The
function automatically sorts sequences by decreasing abundance before
clustering.
}
\details{
Sequences are clustered based on the pairwise identity threshold specified by
\code{id}. Sequences are sorted by decreasing abundance before clustering.
The centroid of each cluster is the first sequence added to the cluster.

\code{fasta_input} can either be a file path to a FASTA file or a FASTA
object. FASTA objects are tibbles that contain the columns \code{Header} and
\code{Sequence}, see \code{\link[microseq]{readFasta}}.

If neither \code{centroids} nor \code{otutabout} is specified (default), the
function returns the centroid sequences as a FASTA object with an additional
column \code{otu_id}. This column contains the identifier extracted from each
sequence header.

If \code{centroids} is specified, centroid sequences are written to the
specified file in FASTA format.

\code{otutabout} gives the option to output the results in an OTU
table format with tab-separated columns. When writing to a file, the first
line starts with the string "#OTU ID", followed by a tab-separated list of
all sample identifiers (formatted as "sample=X"). Each subsequent line,
corresponding to an OTU, begins with the OTU identifier and is followed by
tab-separated abundances for that OTU in each sample. If \code{otutabout} is
a character string, the output is written to the specified file. If
\code{otutabout} is \code{TRUE}, the function returns the OTU table as a
tibble, where the first column is named \code{otu_id} instead of "#OTU ID".

\code{id} is a value between 0 and 1 that defines the minimum pairwise
identity required for a sequence to be added to a cluster. A sequence is not
added to a cluster if its pairwise identity with the centroid is below the
\code{id} threshold.
Pairwise identity is calculated as the number of matching columns divided by
the alignment length excluding terminal gaps, that is, matching columns /
(alignment length - terminal gaps).
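
As a worked illustration of the pairwise identity definition above, using
hypothetical numbers chosen only for this example:
\deqn{\mathrm{identity} = \frac{\mathrm{matching\ columns}}{\mathrm{alignment\ length} - \mathrm{terminal\ gaps}}}{identity = matching columns / (alignment length - terminal gaps)}
An alignment spanning 100 columns, of which 4 are terminal gaps and 94 are
matches, gives 94 / (100 - 4), approximately 0.979, so the sequence is added
to the cluster at the default \code{id} of 0.97. With 93 matches the identity
is approximately 0.969, which is below the threshold, so the sequence is not
added to that cluster.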

If \code{log_file} is \code{NULL} and \code{centroids} is specified,
clustering statistics from \code{VSEARCH} will not be captured.

\code{vsearch_options} allows users to pass additional command-line arguments
to \code{VSEARCH} that are not directly supported by this function. Refer to
the \code{VSEARCH} manual for more details.
}
\examples{
\dontrun{
# Define arguments
fasta_input <- system.file("extdata", "small.fasta", package = "Rsearch")
centroids <- NULL

# Cluster sequences and return a FASTA tibble
cluster_seqs <- vs_cluster_size(fasta_input = fasta_input,
                                centroids = centroids)

# Extract clustering statistics
statistics <- attr(cluster_seqs, "statistics")

# Cluster sequences and write centroids to a file
vs_cluster_size(fasta_input = fasta_input,
                centroids = "centroids_sequences.fa")
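
# The calls below are a further sketch, not part of the original examples.
# They use only the arguments documented above and have not been run here.

# Return the OTU table as a tibble; its first column is named otu_id
otu_table <- vs_cluster_size(fasta_input = fasta_input,
                             otutabout = TRUE)

# Cluster at a stricter identity threshold and compare both strands
strict_seqs <- vs_cluster_size(fasta_input = fasta_input,
                               id = 0.99,
                               strand = "both")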
}

}
\references{
\url{https://github.com/torognes/vsearch}
}
