BertTokenizer
BertTokenizer usage
Note: the pad_to_max_length argument is deprecated and will be removed in a future version. Use padding=True or padding='longest' to pad to the longest sequence in the batch, or padding='max_length' to pad to a fixed length. In that case you can set the length with max_length (e.g. max_length=45), or leave max_length as None to pad to the model's maximal input size (e.g. 512 for BERT).
from transformers import BertTokenizer
token = BertTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext')
token([('你好', '氨基酸'), ('中南大学', '清华大学')], padding=True)
# {'input_ids': [[101, 872, 1962, 102, 3710, 1825, 7000, 102, 0, 0, 0], [101, 704, 1298, 1920, 2110, 102, 3926, 1290, 1920, 2110, 102]], 'token_type_ids': [[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
token([('你好', '氨基酸'), ('中南大学', '清华大学')], padding='max_length', max_length=15)
# {'input_ids': [[101, 872, 1962, 102, 3710, 1825, 7000, 102, 0, 0, 0, 0, 0, 0, 0], [101, 704, 1298, 1920, 2110, 102, 3926, 1290, 1920, 2110, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}
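If a pair can exceed max_length, padding is usually combined with truncation. A minimal sketch, assuming the default longest-first truncation strategy; the exact truncated ids are not shown because they depend on that strategy:
token([('你好', '氨基酸'), ('中南大学', '清华大学')], padding='max_length', truncation=True, max_length=8)
# Pairs longer than 8 tokens (counting [CLS]/[SEP]) are truncated longest-first;
# shorter ones are padded with [PAD] up to 8.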
# decode takes a single sequence of ids and decodes it into a string
token.decode([101, 872, 1962, 102, 3710, 1825, 7000, 102, 0, 0, 0])
# '[CLS] 你 好 [SEP] 氨 基 酸 [SEP] [PAD] [PAD] [PAD]'
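decode can also drop the special tokens via skip_special_tokens; with the same ids as above, the output should be roughly as shown:
token.decode([101, 872, 1962, 102, 3710, 1825, 7000, 102, 0, 0, 0], skip_special_tokens=True)
# '你 好 氨 基 酸'  (special tokens [CLS]/[SEP]/[PAD] removed)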
# tokenize and encode take a string or a sequence of tokens
token.tokenize("你好")
# ['你', '好']
token.encode("你好")
# [101, 872, 1962, 102]
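encode prepends [CLS] and appends [SEP] by default; add_special_tokens=False turns that off, and convert_tokens_to_ids maps tokens to ids directly (ids match the output above):
token.encode("你好", add_special_tokens=False)
# [872, 1962]
token.convert_tokens_to_ids(['你', '好'])
# [872, 1962]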

