
Trainers

Translator: 片刻小哥哥

Project URL: https://huggingface.apachecn.org/docs/tokenizers/api/trainers

Original URL: https://huggingface.co/docs/tokenizers/api/trainers

BpeTrainer

class tokenizers.trainers.BpeTrainer ( )

Parameters

  • vocab_size ( int , optional ) — The size of the final vocabulary, including all tokens and alphabet.
  • min_frequency ( int , optional ) — The minimum frequency a pair should have in order to be merged.
  • show_progress ( bool , optional ) — Whether to show progress bars while training.
  • special_tokens ( List[Union[str, AddedToken]] , optional ) — A list of special tokens the model should know of.
  • limit_alphabet ( int , optional ) — The maximum number of different characters to keep in the alphabet.
  • initial_alphabet ( List[str] , optional ) — A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
  • continuing_subword_prefix ( str , optional ) — A prefix to be used for every subword that is not a beginning-of-word.
  • end_of_word_suffix ( str , optional ) — A suffix to be used for every subword that is an end-of-word.
  • max_token_length ( int , optional ) — Prevents creating tokens longer than the specified size. This can help keep highly repetitive tokens, such as ====== in Wikipedia, from polluting the vocabulary.

Trainer capable of training a BPE model
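
As a rough sketch of how these parameters fit together, the example below trains a BPE model on a tiny in-memory corpus. The corpus, the `[UNK]` token, and the specific `vocab_size` and `min_frequency` values are illustrative assumptions, not recommendations from the reference above.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus; real training would use a much larger dataset.
corpus = [
    "low lower lowest",
    "new newer newest",
    "wide wider widest",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split the text into words first, so merges never cross word boundaries.
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=60,             # upper bound on special tokens + alphabet + merged tokens
    min_frequency=2,           # a pair must occur at least twice to be merged
    show_progress=False,
    special_tokens=["[UNK]"],  # assumed special token, added to the vocabulary up front
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
encoding = tokenizer.encode("lowest newest")
print(len(vocab), encoding.tokens)
```

The exact merges learned depend on the corpus, so the printed tokens will vary with the training data.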

UnigramTrainer

class tokenizers.trainers.UnigramTrainer ( vocab_size = 8000, show_progress = True, special_tokens = [], shrinking_factor = 0.75, unk_token = None, max_piece_length = 16, n_sub_iterations = 2 )

Parameters

  • vocab_size ( int ) — The size of the final vocabulary, including all tokens and alphabet.
  • show_progress ( bool ) — Whether to show progress bars while training.
  • special_tokens ( List[Union[str, AddedToken]] ) — A list of special tokens the model should know of.
  • initial_alphabet ( List[str] ) — A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
  • shrinking_factor ( float ) — The shrinking factor used at each step of the training to prune the vocabulary.
  • unk_token ( str ) — The token used for out-of-vocabulary tokens.
  • max_piece_length ( int ) — The maximum length of a given token.
  • n_sub_iterations ( int ) — The number of iterations of the EM algorithm to perform before pruning the vocabulary.

Trainer capable of training a Unigram model
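
A minimal sketch of training a Unigram model with this trainer might look like the following. The corpus, the `<unk>` token name, and the small `vocab_size` are assumptions chosen for illustration; on such a small corpus the final vocabulary may end up smaller than requested.

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import UnigramTrainer

# Hypothetical toy corpus; real training would use a much larger dataset.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown cat sleeps under the lazy dog",
    "a lazy fox sleeps while the quick dog jumps",
    "the brown dog jumps over the quick cat",
]

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()

trainer = UnigramTrainer(
    vocab_size=50,             # target size; pruning continues until roughly this size
    show_progress=False,
    special_tokens=["<unk>"],  # assumed special token
    unk_token="<unk>",         # token used for out-of-vocabulary pieces
    shrinking_factor=0.75,     # fraction of pieces kept at each pruning step
    max_piece_length=16,
    n_sub_iterations=2,        # EM iterations before each pruning step
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
print(len(vocab), tokenizer.encode("the quick dog").tokens)
```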

WordLevelTrainer

class tokenizers.trainers.WordLevelTrainer ( )

Parameters

  • vocab_size ( int , optional ) — The size of the final vocabulary, including all tokens and alphabet.
  • min_frequency ( int , optional ) — The minimum frequency a word should have in order to be included in the vocabulary.
  • show_progress ( bool , optional ) — Whether to show progress bars while training.
  • special_tokens ( List[Union[str, AddedToken]] ) — A list of special tokens the model should know of.

Trainer capable of training a WordLevel model
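
The sketch below shows one way this trainer could be used, assuming that `min_frequency` acts as a threshold on whole-word counts (a word-level model has no pairs to merge). The corpus and the `[UNK]` token are illustrative assumptions.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Hypothetical toy corpus: "the" and "cat" appear twice, every other word once.
corpus = [
    "the cat sat",
    "the cat ran",
    "a dog barked",
]

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(
    min_frequency=2,           # keep only words seen at least twice
    show_progress=False,
    special_tokens=["[UNK]"],  # assumed special token for unseen words
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
encoding = tokenizer.encode("the dog")
print(sorted(vocab), encoding.tokens)
```

Words below the frequency threshold, such as "dog" here, are left out of the vocabulary and map to the unknown token at encoding time.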

WordPieceTrainer

class tokenizers.trainers.WordPieceTrainer ( vocab_size = 30000, min_frequency = 0, show_progress = True, special_tokens = [], limit_alphabet = None, initial_alphabet = [], continuing_subword_prefix = '##', end_of_word_suffix = None )

Parameters

  • vocab_size ( int , optional ) — The size of the final vocabulary, including all tokens and alphabet.
  • min_frequency ( int , optional ) — The minimum frequency a pair should have in order to be merged.
  • show_progress ( bool , optional ) — Whether to show progress bars while training.
  • special_tokens ( List[Union[str, AddedToken]] , optional ) — A list of special tokens the model should know of.
  • limit_alphabet ( int , optional ) — The maximum number of different characters to keep in the alphabet.
  • initial_alphabet ( List[str] , optional ) — A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
  • continuing_subword_prefix ( str , optional ) — A prefix to be used for every subword that is not a beginning-of-word.
  • end_of_word_suffix ( str , optional ) — A suffix to be used for every subword that is an end-of-word.

Trainer capable of training a WordPiece model
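
As a hedged illustration, the following sketch trains a WordPiece model on a toy corpus and checks that non-initial subwords carry the `##` prefix. The corpus, the BERT-style special tokens, and the `vocab_size` value are assumptions made for the example.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Hypothetical toy corpus; real training would use a much larger dataset.
corpus = [
    "playing played player plays",
    "walking walked walker walks",
    "talking talked talker talks",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=80,
    min_frequency=0,
    show_progress=False,
    special_tokens=["[UNK]", "[CLS]", "[SEP]"],  # assumed BERT-style tokens
    continuing_subword_prefix="##",              # marks subwords that do not start a word
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
continuing = [token for token in vocab if token.startswith("##")]
print(len(vocab), tokenizer.encode("walking").tokens)
```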


