Tokenizer
译者:片刻小哥哥
项目地址:https://huggingface.apachecn.org/docs/tokenizers/api/tokenizer
Tokenizer
class
tokenizers.
Tokenizer
(
model
)
Parameters
- model
(
Model
) —
The core algorithm that this
Tokenizer
should be using.
A
Tokenizer
works as a pipeline. It processes some raw text as input
and outputs an
Encoding
.
property
decoder
The
optional
Decoder
in use by the Tokenizer
property
model
The Model in use by the Tokenizer
property
normalizer
The optional Normalizer in use by the Tokenizer
property
padding
Returns
(
dict
,
optional
)
A dict with the current padding parameters if padding is enabled
Get the current padding parameters
Cannot be set, use
enable_padding()
instead
property
post_processor
The
optional
PostProcessor
in use by the Tokenizer
property
pre_tokenizer
The optional PreTokenizer in use by the Tokenizer
property
truncation
Returns
(
dict
,
optional
)
A dict with the current truncation parameters if truncation is enabled
Get the currently set truncation parameters
Cannot set, use
enable_truncation()
instead
add_special_tokens
(
tokens
)
→
int
Parameters
- tokens
(A
List
of AddedToken orstr
) — The list of special tokens we want to add to the vocabulary. Each token can either be a string or an instance of AddedToken for more customization.
Returns
int
The number of tokens that were created in the vocabulary
Add the given special tokens to the Tokenizer.
If these tokens are already part of the vocabulary, it just let the Tokenizer know about them. If they don’t exist, the Tokenizer creates them, giving them a new id.
These special tokens will never be processed by the model (ie won’t be split into multiple tokens), and they can be removed from the output when decoding.
add_tokens
(
tokens
)
→
int
Parameters
- tokens
(A
List
of AddedToken orstr
) — The list of tokens we want to add to the vocabulary. Each token can be either a string or an instance of AddedToken for more customization.
Returns
int
The number of tokens that were created in the vocabulary
Add the given tokens to the vocabulary
The given tokens are added only if they don’t already exist in the vocabulary. Each token then gets a new attributed id.
decode
(
ids
skip_special_tokens
= True
)
→
str
Parameters
- ids
(A
List/Tuple
ofint
) — The list of ids that we want to decode - skip_special_tokens
(
bool
, defaults toTrue
) — Whether the special tokens should be removed from the decoded string
Returns
str
The decoded string
Decode the given list of ids back to a string
This is used to decode anything coming back from a Language Model
decode_batch
(
sequences
skip_special_tokens
= True
)
→
List[str]
Parameters
- sequences
(
List
ofList[int]
) — The batch of sequences we want to decode - skip_special_tokens
(
bool
, defaults toTrue
) — Whether the special tokens should be removed from the decoded strings
Returns
List[str]
A list of decoded strings
Decode a batch of ids back to their corresponding string
enable_padding
(
direction
= 'right'
pad_id
= 0
pad_type_id
= 0
pad_token
= '[PAD]'
length
= None
pad_to_multiple_of
= None
)
Parameters
- direction
(
str
, optional , defaults toright
) — The direction in which to pad. Can be eitherright
orleft
- pad_to_multiple_of
(
int
, optional ) — If specified, the padding length should always snap to the next multiple of the given value. For example if we were going to pad witha length of 250 butpad_to_multiple_of=8
then we will pad to 256. - pad_id
(
int
, defaults to 0) — The id to be used when padding - pad_type_id
(
int
, defaults to 0) — The type id to be used when padding - pad_token
(
str
, defaults to[PAD]
) — The pad token to be used when padding - length
(
int
, optional ) — If specified, the length at which to pad. If not specified we pad using the size of the longest sequence in a batch.
Enable the padding
enable_truncation
(
max_length
stride
= 0
strategy
= 'longest_first'
direction
= 'right'
)
Parameters
- max_length
(
int
) — The max length at which to truncate - stride
(
int
, optional ) — The length of the previous first sequence to be included in the overflowing sequence - strategy
(
str
, optional , defaults tolongest_first
) — The strategy used to truncation. Can be one oflongest_first
,only_first
oronly_second
. - direction
(
str
, defaults toright
) — Truncate direction
Enable truncation
encode
(
sequence
pair
= None
is_pretokenized
= False
add_special_tokens
= True
)
→
Parameters
-
sequence (
~tokenizers.InputSequence
) — The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to theis_pretokenized
argument:- If
is_pretokenized=False
:TextInputSequence
- If
is_pretokenized=True
:PreTokenizedInputSequence()
- pair
(
~tokenizers.InputSequence
, optional ) — An optional input sequence. The expected format is the same that forsequence
. - is_pretokenized
(
bool
, defaults toFalse
) — Whether the input is already pre-tokenized - add_special_tokens
(
bool
, defaults toTrue
) — Whether to add the special tokens
- If
Returns
The encoded result
Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
Example:
Here are some examples of the inputs that are accepted:
encode("A single sequence")*
encode("A sequence", "And its pair")*
encode([ "A", "pre", "tokenized", "sequence" ], is_pretokenized=True)`
encode(
[ "A", "pre", "tokenized", "sequence" ], [ "And", "its", "pair" ],
is_pretokenized=True
)
encode_batch
(
input
is_pretokenized
= False
add_special_tokens
= True
)
→
A
List
of [`~tokenizers.Encoding“]
Parameters
-
input (A
List
/Tuple
of
~tokenizers.EncodeInput) — A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the
is_pretokenized` argument:- If
is_pretokenized=False
:TextEncodeInput()
- If
is_pretokenized=True
:PreTokenizedEncodeInput()
- is_pretokenized
(
bool
, defaults toFalse
) — Whether the input is already pre-tokenized - add_special_tokens
(
bool
, defaults toTrue
) — Whether to add the special tokens
- If
Returns
A
List
of [`~tokenizers.Encoding“]
The encoded batch
Encode the given batch of inputs. This method accept both raw text sequences as well as already pre-tokenized sequences.
Example:
Here are some examples of the inputs that are accepted:
encode_batch([
"A single sequence",
("A tuple with a sequence", "And its pair"),
[ "A", "pre", "tokenized", "sequence" ],
([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
from_buffer
(
buffer
)
→
Parameters
- buffer
(
bytes
) — A buffer containing a previously serialized Tokenizer
Returns
The new tokenizer
Instantiate a new Tokenizer from the given buffer.
from_file
(
path
)
→
Parameters
- path
(
str
) — A path to a local JSON file representing a previously serialized Tokenizer
Returns
The new tokenizer
Instantiate a new Tokenizer from the file at the given path.
from_pretrained
(
identifier
revision
= 'main'
auth_token
= None
)
→
Parameters
- identifier
(
str
) — The identifier of a Model on the Hugging Face Hub, that contains a tokenizer.json file - revision
(
str
, defaults to main ) — A branch or commit id - auth_token
(
str
, optional , defaults to None ) — An optional auth token used to access private repositories on the Hugging Face Hub
Returns
The new tokenizer
Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.
from_str
(
json
)
→
Parameters
- json
(
str
) — A valid JSON string representing a previously serialized Tokenizer
Returns
The new tokenizer
Instantiate a new Tokenizer from the given JSON string.
get_vocab
(
with_added_tokens
= True
)
→
Dict[str, int]
Parameters
- with_added_tokens
(
bool
, defaults toTrue
) — Whether to include the added tokens
Returns
Dict[str, int]
The vocabulary
Get the underlying vocabulary
get_vocab_size
(
with_added_tokens
= True
)
→
int
Parameters
- with_added_tokens
(
bool
, defaults toTrue
) — Whether to include the added tokens
Returns
int
The size of the vocabulary
Get the size of the underlying vocabulary
id_to_token
(
id
)
→
Optional[str]
Parameters
- id
(
int
) — The id to convert
Returns
Optional[str]
An optional token,
None
if out of vocabulary
Convert the given id to its corresponding token if it exists
no_padding
(
)
Disable padding
no_truncation
(
)
Disable truncation
num_special_tokens_to_add
(
is_pair
)
Return the number of special tokens that would be added for single/pair sentences. :param is_pair: Boolean indicating if the input would be a single sentence or a pair :return:
post_process
(
encoding
pair
= None
add_special_tokens
= True
)
→
Parameters
- encoding ( Encoding ) — The Encoding corresponding to the main sequence.
- pair ( Encoding , optional ) — An optional Encoding corresponding to the pair sequence.
- add_special_tokens
(
bool
) — Whether to add the special tokens
Returns
The final post-processed encoding
Apply all the post-processing steps to the given encodings.
The various steps are:
- Truncate according to the set truncation params (provided with
enable_truncation()
) - Apply the
PostProcessor
- Pad according to the set padding params (provided with
enable_padding()
)
save
(
path
pretty
= True
)
Parameters
- path
(
str
) — A path to a file in which to save the serialized tokenizer. - pretty
(
bool
, defaults toTrue
) — Whether the JSON file should be pretty formatted.
Save the Tokenizer to the file at the given path.
to_str
(
pretty
= False
)
→
str
Parameters
- pretty
(
bool
, defaults toFalse
) — Whether the JSON string should be pretty formatted.
Returns
str
A string representing the serialized Tokenizer
Gets a serialized string representing this Tokenizer .
token_to_id
(
token
)
→
Optional[int]
Parameters
- token
(
str
) — The token to convert
Returns
Optional[int]
An optional id,
None
if out of vocabulary
Convert the given token to its corresponding id if it exists
train
(
files
trainer
= None
)
Parameters
- files
(
List[str]
) — A list of path to the files that we should use for training - trainer
(
~tokenizers.trainers.Trainer
, optional ) — An optional trainer that should be used to train our Model
Train the Tokenizer using the given files.
Reads the files line by line, while keeping all the whitespace, even new lines.
If you want to train from data store in-memory, you can check
train_from_iterator()
train_from_iterator
(
iterator
trainer
= None
length
= None
)
Parameters
- iterator
(
Iterator
) — Any iterator over strings or list of strings - trainer
(
~tokenizers.trainers.Trainer
, optional ) — An optional trainer that should be used to train our Model - length
(
int
, optional ) — The total number of sequences in the iterator. This is used to provide meaningful progress tracking
Train the Tokenizer using the provided iterator.
You can provide anything that is a Python Iterator
- A list of sequences
List[str]
- A generator that yields
str
orList[str]
- A Numpy array of strings
- …