Normalizers
译者:片刻小哥哥
项目地址:https://huggingface.apachecn.org/docs/tokenizers/api/normalizers
BertNormalizer
class
tokenizers.normalizers.
BertNormalizer
(
clean_text
= True
handle_chinese_chars
= True
strip_accents
= None
lowercase
= True
)
Parameters
- clean_text
(
bool
, optional , defaults toTrue
) — Whether to clean the text, by removing any control characters and replacing all whitespaces by the classic one. - handle_chinese_chars
(
bool
, optional , defaults toTrue
) — Whether to handle chinese chars by putting spaces around them. - strip_accents
(
bool
, optional ) — Whether to strip all accents. If this option is not specified (ie == None), then it will be determined by the value for lowercase (as in the original Bert). - lowercase
(
bool
, optional , defaults toTrue
) — Whether to lowercase.
BertNormalizer
Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents, chinese chars and lowercasing
Lowercase
class
tokenizers.normalizers.
Lowercase
(
)
Lowercase Normalizer
NFC
class
tokenizers.normalizers.
NFC
(
)
NFC Unicode Normalizer
NFD
class
tokenizers.normalizers.
NFD
(
)
NFD Unicode Normalizer
NFKC
class
tokenizers.normalizers.
NFKC
(
)
NFKC Unicode Normalizer
NFKD
class
tokenizers.normalizers.
NFKD
(
)
NFKD Unicode Normalizer
Nmt
class
tokenizers.normalizers.
Nmt
(
)
Nmt normalizer
Normalizer
class
tokenizers.normalizers.
Normalizer
(
)
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
(
normalized
)
Parameters
- normalized
(
NormalizedString
) — The normalized string on which to apply this Normalizer
Normalize a
NormalizedString
in-place
This method allows to modify a
NormalizedString
to
keep track of the alignment information. If you just want to see the result
of the normalization on a raw string, you can use
normalize_str()
normalize_str
(
sequence
)
→
str
Parameters
- sequence
(
str
) — A string to normalize
Returns
str
A string after normalization
Normalize the given string
This method provides a way to visualize the effect of a
Normalizer
but it does not keep track of the alignment
information. If you need to get/convert offsets, you can use
normalize()
Precompiled
class
tokenizers.normalizers.
Precompiled
(
precompiled_charsmap
)
Precompiled normalizer Don’t use manually it is used for compatiblity for SentencePiece.
Replace
class
tokenizers.normalizers.
Replace
(
pattern
content
)
Replace normalizer
Sequence
class
tokenizers.normalizers.
Sequence
(
)
Parameters
- normalizers
(
List[Normalizer]
) — A list of Normalizer to be run as a sequence
Allows concatenating multiple other Normalizer as a Sequence. All the normalizers run in sequence in the given order
Strip
class
tokenizers.normalizers.
Strip
(
left
= True
right
= True
)
Strip normalizer
StripAccents
class
tokenizers.normalizers.
StripAccents
(
)
StripAccents normalizer