Decoders
Translator: 片刻小哥哥
Project URL: https://huggingface.apachecn.org/docs/tokenizers/api/decoders
BPEDecoder
class tokenizers.decoders.BPEDecoder(suffix = '</w>')
Parameters
- suffix (str, optional, defaults to </w>): The suffix that was used to characterize an end-of-word. This suffix is replaced by whitespace during decoding.
BPEDecoder Decoder
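The suffix handling described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation: the helper name `bpe_decode` is hypothetical, and dropping the suffix on the final token (rather than turning it into a trailing space) is an assumption. With the library installed, the corresponding call should be `BPEDecoder().decode(tokens)`.

```python
def bpe_decode(tokens, suffix="</w>"):
    """Hypothetical helper: join BPE tokens, turning end-of-word suffixes into spaces."""
    last = len(tokens) - 1
    pieces = []
    for i, token in enumerate(tokens):
        # Assumption: the suffix on the last token is dropped so that the
        # decoded text does not end with a stray space.
        pieces.append(token.replace(suffix, "" if i == last else " "))
    return "".join(pieces)


decoded = bpe_decode(["hel", "lo</w>", "wor", "ld</w>"])
print(decoded)
```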
ByteLevel
class tokenizers.decoders.ByteLevel()
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
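To show why the decoder has to be paired with the matching pre-tokenizer, here is a rough sketch of byte-level decoding. It assumes the GPT-2 style byte-to-character table, in which every byte is represented by a printable character (a space becomes `Ġ`); the function names `bytes_to_unicode` and `byte_level_decode` are illustrative and are not part of the `tokenizers` API.

```python
def bytes_to_unicode():
    """Build the byte -> printable character table (GPT-2 style, assumed here)."""
    # Bytes that already render as printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (controls, space, ...) are shifted above U+00FF.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def byte_level_decode(tokens):
    """Hypothetical helper: map token characters back to bytes, then decode UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


print(byte_level_decode(["Hello", "Ġworld"]))
```

Tokens produced by any other pre-tokenizer would not use this character table, so the reverse mapping would yield garbage or fail on unknown characters.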
CTC
class tokenizers.decoders.CTC(pad_token = '<pad>', word_delimiter_token = '|', cleanup = True)
Parameters
- pad_token (str, optional, defaults to <pad>): The pad token used by CTC to delimit a new token.
- word_delimiter_token (str, optional, defaults to |): The word delimiter token. It will be replaced by a space.
- cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
CTC Decoder
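The roles of `pad_token` and `word_delimiter_token` are easier to see on a concrete frame sequence. The following is a minimal sketch of standard CTC collapsing, assuming consecutive duplicates are merged before pad tokens are removed; `ctc_decode` is a hypothetical helper and the `cleanup` step is intentionally omitted.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Hypothetical helper illustrating CTC-style decoding of per-frame tokens."""
    # 1. Collapse runs of identical consecutive tokens into a single token.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Drop pad tokens; a pad between two identical tokens keeps both of them.
    kept = [token for token in collapsed if token != pad_token]
    # 3. Turn the word delimiter into a space.
    return "".join(kept).replace(word_delimiter_token, " ")


frames = ["h", "h", "e", "l", "<pad>", "l", "o", "o", "|", "w", "o", "r", "l", "d"]
print(ctc_decode(frames))
```

Note how the `<pad>` between the two `l` frames is what allows the double letter in the output to survive the collapsing step.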
Metaspace
class tokenizers.decoders.Metaspace()
Parameters
- replacement (str, optional, defaults to ▁): The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
- add_prefix_space (bool, optional, defaults to True): Whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello".
Metaspace Decoder
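On the decoding side, the two parameters above amount to turning the meta symbol back into spaces and undoing the prefix space. The sketch below illustrates that idea under the assumption that a single leading space is removed when `add_prefix_space` is true; `metaspace_decode` is a hypothetical name, not a library function.

```python
def metaspace_decode(tokens, replacement="▁", add_prefix_space=True):
    """Hypothetical helper: turn the meta symbol back into ordinary spaces."""
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: the space that was added in front of the first word during
    # pre-tokenization is removed again when decoding.
    if add_prefix_space and text.startswith(" "):
        text = text[1:]
    return text


print(metaspace_decode(["▁Hello", "▁wor", "ld"]))
```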
WordPiece
class tokenizers.decoders.WordPiece(prefix = '##', cleanup = True)
Parameters
- prefix (str, optional, defaults to ##): The prefix used for subwords that do not begin a word.
- cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
WordPiece Decoder
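A short sketch may help show how `prefix` and `cleanup` interact. It assumes that continuation tokens are glued to the previous token and all other tokens are separated by a space. The cleanup rules listed here are a small illustrative subset chosen for this example, not the library's exact rule set, and `wordpiece_decode` is a hypothetical helper.

```python
def wordpiece_decode(tokens, prefix="##", cleanup=True):
    """Hypothetical helper: merge WordPiece tokens back into text."""
    pieces = []
    for i, token in enumerate(tokens):
        if i > 0 and token.startswith(prefix):
            # Continuation of the previous word: strip the prefix, no space.
            pieces.append(token[len(prefix):])
        elif i > 0:
            # A new word starts here.
            pieces.append(" " + token)
        else:
            pieces.append(token)
    text = "".join(pieces)
    if cleanup:
        # Illustrative subset of cleanup rules (an assumption for this sketch):
        # remove the space that tokenization leaves before punctuation.
        for mark in (".", ",", "!", "?"):
            text = text.replace(" " + mark, mark)
    return text


print(wordpiece_decode(["token", "##izer", "##s", "are", "fun", "!"]))
```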