% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/n_gram_merge.R
\name{n_gram_merge}
\alias{n_gram_merge}
\title{Value merging based on ngram fingerprints}
\usage{
n_gram_merge(vect, numgram = 2, ignore_strings = NULL, bus_suffix = TRUE,
  edit_threshold = 1, weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5), ...)
}
\arguments{
\item{vect}{Character vector, items to be potentially clustered and merged.}

\item{numgram}{Numeric value, indicating the number of characters that
will occupy each ngram token. Default value is 2.}

\item{ignore_strings}{Character vector, these strings will be ignored during
the merging of values within \code{vect}. Default value is NULL.}

\item{bus_suffix}{Logical, indicating whether the merging of records should
be insensitive to common business suffixes or not. Default value is TRUE.}

\item{edit_threshold}{Numeric value, indicating the threshold at which a
merge is performed, based on the sum of the edit values derived from
param \code{weight}. Default value is 1. If this parameter is
set to 0 or NA, then no approximate string matching will be done, and all
merging will be based on strings that have identical ngram fingerprints.}

\item{weight}{Numeric vector, indicating the weights to assign to
the four edit operations (see details below), for the purpose of
approximate string matching. Default values are
c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along
to the \code{stringdist} function. Must be either
a numeric vector of length four, or NA.}

\item{...}{Additional arguments to be passed along to the \code{stringdist}
function. The acceptable arguments are identical to those of
\code{\link[stringdist]{stringdistmatrix}}.}
}
\value{
Character vector with similar values merged.
}
\description{
This function takes a character vector, then edits and merges values
that are approximately equivalent yet not identical. It uses a two-step
process. The first step clusters values based on their ngram fingerprint
(described at
\url{https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth}).
The second step merges values based on approximate string matching of
the ngram fingerprints, using the \code{sd_lower_tri()} C function from the
package \code{stringdist}.
}
\details{
The values of argument \code{weight} are edit penalties that
 get passed to the \code{stringdist} edit distance function. The
 argument takes four values, one for each type of edit operation, each
 with a default penalty:
 \itemize{
 \item d: deletion, default value is 0.33
 \item i: insertion, default value is 0.33
 \item s: substitution, default value is 1
 \item t: transposition, default value is 0.5
 }
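
 As a rough illustration of how these penalties behave, the sketch below
 calls \code{stringdist::stringdist} directly. It assumes the \code{"osa"}
 method, which may not match the exact internal call made by this function.
 \preformatted{
 library(stringdist)
 w <- c(d = 0.33, i = 0.33, s = 1, t = 0.5)
 # One substituted character, charged at penalty s
 stringdist("acme", "acne", method = "osa", weight = w)
 # Two adjacent characters swapped, charged at penalty t
 stringdist("acme", "amce", method = "osa", weight = w)
 }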
}
\examples{
x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")

n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'
n_gram_merge(vect = x,
             weight = c(d = 0.4, i = 1, s = 1, t = 1))
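
# Setting 'edit_threshold' to NA skips approximate string matching, so
# values are merged only when their ngram fingerprints are identical.
n_gram_merge(vect = x, edit_threshold = NA)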

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))
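
# A simplified sketch of the ngram fingerprint idea from the OpenRefine wiki
# linked in the description. 'ngram_fp' is a hypothetical helper written for
# illustration only. It is not part of this package, and the package's own
# fingerprinting (business suffixes, 'ignore_strings', etc.) may differ.
ngram_fp <- function(s, n = 2) {
  # Lowercase, then drop everything except letters and digits
  s <- gsub("[^a-z0-9]", "", tolower(s))
  if (nchar(s) < n) return(s)
  # Collect every run of n consecutive characters
  starts <- seq_len(nchar(s) - n + 1)
  grams <- substring(s, starts, starts + n - 1)
  # Sort, de-duplicate, and join the ngrams into a single key
  paste(sort(unique(grams)), collapse = "")
}
# Inputs that differ only in case and punctuation share the same key
ngram_fp("Acme Pizza") == ngram_fp("ACME-PIZZA")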

}
