Sentence Embedding by BERT and Sentence Similarity

CW Lin
19 min read · Jan 24, 2023

I won't go over how BERT works here; there are plenty of tutorials online, and I gave a rough introduction in 來玩點NLP — LSTM vs. BERT on IMDb dataset, so have a look if you're interested. I did notice that in that post I used a rather clumsy way to fine-tune BERT for text classification; maybe BERT tooling has become more convenient lately, or I simply missed the easier approach back then 🤢

In any case, this post records how to get sentence embeddings from BERT for fine-tuning a downstream task, plus an extra sentence similarity task.

https://www.codemotion.com/magazine/ai-ml/bert-how-google-changed-nlp-and-how-to-benefit-from-this/

BERT sentence embedding for downstream tasks

Basically, the idea is to turn a sentence (i.e., a sequence of text) into a vector and then attach a linear layer for the downstream task.

BERT comes with usage examples for four kinds of downstream tasks.

To use BERT, you first need to prepare the inputs it expects:

  1. token id: the index of each token in the vocabulary
  2. attention mask: every sentence is padded to the same length; the attention mask tells the self-attention layers which positions are padding, so the special [PAD] tokens are masked out and never attended to
  3. segment id: for tasks that feed in two sentences at once, this index distinguishes the first sentence from the second; if the input is a single sentence, just use one consistent index throughout.

Walking through it step by step makes everything clear:

  1. Import the packages & download the tokenizer and model:
import torch  
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

# models: https://huggingface.co/models?sort=downloads

2. Tokenize the sequence

sentence = 'I really enjoyed this movie a lot.'

tokens = tokenizer.tokenize(sentence)
print(tokens)
# ['i', 'really', 'enjoyed', 'this', 'movie', 'a', 'lot', '.']

3. Add [CLS] and [SEP] tokens

tokens = ['[CLS]'] + tokens + ['[SEP]']
tokens
# ['[CLS]', 'i', 'really', 'enjoyed', 'this', 'movie', 'a', 'lot', '.', '[SEP]']

4. Pad the input

T=15
padded_tokens = tokens + ['[PAD]' for _ in range(T-len(tokens))]
print("Padded tokens are \n {} ".format(padded_tokens))
attn_mask = [ 1 if token != '[PAD]' else 0 for token in padded_tokens ]
print("Attention Mask are \n {} ".format(attn_mask))

5. Create the list of segment ids

seg_ids = [0 for _ in range(len(padded_tokens))]

6. Create input tensors from all of the above

sent_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
token_ids = torch.tensor(sent_ids).unsqueeze(0)
attn_mask = torch.tensor(attn_mask).unsqueeze(0)
seg_ids = torch.tensor(seg_ids).unsqueeze(0)

print(token_ids)
print(attn_mask)
print(seg_ids)

# tensor([[ 101, 1045, 2428, 5632, 2023, 3185, 1037, 2843, 1012, 102, 0, 0, 0, 0, 0]])
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
# tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

7. Model inference

At this point we have everything BERT needs. Feed this pile of tensors into the model and we get the embeddings~

output = model(token_ids, attention_mask=attn_mask,token_type_ids=seg_ids)
last_hidden_state, pooler_output = output[0], output[1]

print(last_hidden_state.shape)  # hidden state of every token: torch.Size([1, 15, 768])
print(pooler_output.shape)      # hidden state of [CLS] passed through one linear layer + Tanh: torch.Size([1, 768])

(This article introduces last_hidden_state and pooler_output in more detail.)

Basically, pooler_output is the sentence embedding we want. In practice, you attach a few more layers on top of it for the task you want to solve and fine-tune directly (a minimal sketch follows after the list below). If you'd rather not wire up the architecture yourself, you can also use one of the ready-made architectures provided for BERT, e.g. BertForSequenceClassification, BertForQuestionAnswering:

Basic:
— BertModel
— BertTokenizer
Pre-training:
— BertForMaskedLM
— BertForNextSentencePrediction
— BertForPreTraining
Fine-tuning:
— BertForSequenceClassification
— BertForTokenClassification
— BertForQuestionAnswering
— BertForMultipleChoice
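
As a rough sketch of the "attach a few layers yourself" route, something like the following works, reusing the BertModel and input tensors built above (the class name SentimentClassifier and the two-class head are just illustrative, not from the original notebook):

import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Hypothetical downstream head: BERT pooler_output -> linear classifier."""
    def __init__(self, bert, n_classes=2):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Linear(bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output = self.bert(input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        pooler_output = output[1]            # sentence embedding, shape (batch, hidden)
        return self.classifier(pooler_output)

clf = SentimentClassifier(model)             # `model` is the BertModel loaded earlier
logits = clf(token_ids, attn_mask, seg_ids)  # reuses the tensors built in steps 2-6
print(logits.shape)  # torch.Size([1, 2])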

By this point you can see that most of the work in producing a sentence embedding goes into assembling BERT's input tensors. Having walked through it once, you should have a clearer picture of what these tensors are for and how they are built.
In practice, though, the tokenizer already wraps all of this up (whether AutoTokenizer or BertTokenizer),
so steps 2, 3, 4, 5, 6 above can be collapsed into the following call:

wrapped_input = tokenizer(sentence, max_length=15, add_special_tokens=True, truncation=True, 
padding='max_length', return_tensors="pt")

wrapped_input
# {'input_ids': tensor([[ 101, 1045, 2428, 5632, 2023, 3185, 1037, 2843, 1012, 102, 0, 0, 0, 0, 0]]),
#  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
#  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}

Just like that we get token_ids, attn_mask, and seg_ids, with the padding already handled behind the scenes~ (so what was all that hard work for earlier XDD)

Note: padding=True pads to the length of the longest sentence in the batch;
if padding='max_length', everything is padded to the length given by max_length.
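
A quick way to see the difference (the two-sentence toy batch here is just for illustration):

batch_sentences = ['I really enjoyed this movie a lot.', 'Great film!']

# padding=True: pad only up to the longest sentence in this batch
dynamic = tokenizer(batch_sentences, padding=True, return_tensors='pt')
print(dynamic['input_ids'].shape)    # e.g. torch.Size([2, 10])

# padding='max_length': pad everything to max_length
fixed = tokenizer(batch_sentences, padding='max_length', max_length=15, return_tensors='pt')
print(fixed['input_ids'].shape)      # torch.Size([2, 15])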

Throw this dict straight into the model and you get the embeddings:

output = model(**wrapped_input)
last_hidden_state, pooler_output = output[0], output[1]

A quick hands-on

As before, I use the IMDb movie reviews for text classification (BertForSequenceClassification). It's basically a standard PyTorch training pipeline, and I put the notebook in this repo, so go take a look if you're interested~
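
For orientation, here is a minimal sketch of a single training step with BertForSequenceClassification; the toy batch, learning rate, and max_length are placeholders rather than the notebook's actual settings:

import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

# a toy batch standing in for one step over the IMDb DataLoader
texts = ['I really enjoyed this movie a lot.', 'What a waste of two hours.']
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors='pt')

model.train()
output = model(**batch, labels=labels)   # passing labels makes the model compute the loss
output.loss.backward()
optimizer.step()
optimizer.zero_grad()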

During training, val acc ≈ 92.5%; final test set acc = 23084 / 25000 (92.336%).

[epoch 1]train on 24000 data......
100%|██████████| 1500/1500 [09:24<00:00, 2.66it/s]
training set: average loss: 0.0168, acc: 21350/24000(88.958%)
validation on 1000 data......
Val set:Average loss:0.0120, acc:928/1000(92.800%)
elapse: 575.06s

[epoch 2]train on 24000 data......
100%|██████████| 1500/1500 [09:15<00:00, 2.70it/s]
training set: average loss: 0.0094, acc: 22685/24000(94.521%)
validation on 1000 data......
Val set:Average loss:0.0126, acc:936/1000(93.600%)
elapse: 566.25s

[epoch 3]train on 24000 data......
100%|██████████| 1500/1500 [09:19<00:00, 2.68it/s]
training set: average loss: 0.0054, acc: 23321/24000(97.171%)
validation on 1000 data......
Val set:Average loss:0.0166, acc:925/1000(92.500%)
elapse: 569.87s

[epoch 4]train on 24000 data......
100%|██████████| 1500/1500 [09:18<00:00, 2.69it/s]
training set: average loss: 0.0032, acc: 23621/24000(98.421%)
validation on 1000 data......
Val set:Average loss:0.0196, acc:925/1000(92.500%)
elapse: 568.86s

[epoch 5]train on 24000 data......
100%|██████████| 1500/1500 [09:21<00:00, 2.67it/s]
training set: average loss: 0.0021, acc: 23743/24000(98.929%)
validation on 1000 data......
Val set:Average loss:0.0180, acc:925/1000(92.500%)
elapse: 572.23s

Sentence Similarity

When you want to cluster sentences or compare how similar two utterances are (e.g., in a voice bot, matching different phrasings of the same request, i.e. intent matching), you may need sentence similarity.

You might be tempted to take the BERT embedding above and directly compute Euclidean distance or cosine similarity, but that doesn't work well. Remember that BERT is pre-trained on the MaskedLM and NextSentencePrediction tasks, so out of the box it was never meant to produce a meaningful stand-alone sentence embedding.

Jacob Devlin’s comment: I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn’t mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally). (https://github.com/google-research/bert/issues/164#issuecomment-441324222)

If you want to use BERT for sentence similarity, the closest task is the sentence-pair classification setup of BertForSequenceClassification:

feed in two sentences, with a label indicating whether they mean the same thing. The catch is that if you have 100 sentences and want all pairwise similarities, you have to run the network forward C(100, 2) = 4950 times.
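
Roughly, the pair-input usage looks like this; the model below only illustrates the input format and would still need to be fine-tuned on labelled sentence pairs before its logits mean anything:

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
pair_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# passing two sentences builds [CLS] sent1 [SEP] sent2 [SEP]
# with token_type_ids = 0 for the first sentence and 1 for the second
pair = tokenizer('what is the weather tomorrow',
                 'will it rain tomorrow',
                 return_tensors='pt')
logits = pair_model(**pair).logits   # one forward pass per pair -> 4950 passes for 100 sentences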

A more direct approach is to train a meaningful embedding once; then whenever you need to compare sentences, you can simply compute similarity between embeddings.

You might then think of a Siamese Network (which I introduced in Few Shot Learning — Siamese Network): run two weight-sharing BERTs to get embeddings, then train the embeddings with a contrastive loss or triplet loss.

Exactly. This idea has already been published as a paper and packaged into a very handy library.

Take a look at Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks; it is essentially a Siamese network built on BERT.

For the pooling strategy, the paper considers three options: using the output of the CLS token, computing the mean of all output vectors (MEAN strategy), and computing a max-over-time of the output vectors (MAX strategy). In the end they adopt the MEAN strategy.
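
For intuition, here is a minimal sketch of the MEAN strategy applied to plain BERT outputs (a mask-aware average so [PAD] positions are ignored), reusing last_hidden_state and attn_mask from the earlier example:

import torch

def mean_pooling(last_hidden_state, attention_mask):
    # average the token embeddings, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

sentence_embedding = mean_pooling(last_hidden_state, attn_mask)
print(sentence_embedding.shape)  # torch.Size([1, 768])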

Best of all, their sentence-transformers library ships with pretrained models, which spares you the biggest headache: collecting and labelling data!

A quick hands-on

Let's run a simple experiment, pretending we are doing intent matching for a voice assistant.

First, import the package and load the model.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2') # multilingual model

sentences = [
    'what is the weather tomorrow',
    'will it rain tomorrow',
    'Will the weather be hot in the future',
    'what time is it',
    'could you help me translate this setence',
    'play some jazz music'
]

Use the model to get an embedding for each sentence:

embedding = model.encode(sentences, convert_to_tensor=False)
embedding.shape
#(6, 384)

The 6 sentences become 6 embedding vectors of dimension 384.

Now compute the pairwise cosine similarities and see whether they match intuition:

cosine_scores = util.cos_sim(embedding, embedding)

d = {}
for i, v1 in enumerate(sentences):
    for j, v2 in enumerate(sentences):
        if i >= j:
            continue
        d[v1 + ' vs. ' + v2] = cosine_scores[i][j].item()

# sort by score
d_sorted = dict(sorted(d.items(), key=lambda x: x[1], reverse=True))
d_sorted

{'what is the weather tomorrow vs. will it rain tomorrow': 0.8252906203269958,
'what is the weather tomorrow vs. Will the weather be hot in the future': 0.6635355949401855,
'will it rain tomorrow vs. Will the weather be hot in the future': 0.5936063528060913,
'what is the weather tomorrow vs. what time is it': 0.47494661808013916,
'will it rain tomorrow vs. what time is it': 0.4440332055091858,
'Will the weather be hot in the future vs. what time is it': 0.33612486720085144,
'could you help me translate this setence vs. play some jazz music': 0.1588955670595169,
'what is the weather tomorrow vs. play some jazz music': 0.11192889511585236,
'will it rain tomorrow vs. play some jazz music': 0.09996305406093597,
'will it rain tomorrow vs. could you help me translate this setence': 0.09915214776992798,
'what time is it vs. could you help me translate this setence': 0.09021759033203125,
'what is the weather tomorrow vs. could you help me translate this setence': 0.08801298588514328,
'Will the weather be hot in the future vs. could you help me translate this setence': 0.07638849318027496,
'what time is it vs. play some jazz music': 0.054117172956466675,
'Will the weather be hot in the future vs. play some jazz music': 0.027871515601873398}

The similarities do seem to match intuition~

I also tried some Traditional Chinese cases and the results looked reasonable as well. Since Simplified Chinese content online vastly outnumbers Traditional Chinese, I looked around and found that Academia Sinica's CKIP Lab has released Traditional Chinese pre-trained models that are also available on Hugging Face (https://huggingface.co/ckiplab), which makes them very convenient to use!
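
For example, loading one of the CKIP models looks roughly like this (as far as I recall, their model card recommends BertTokenizerFast over AutoTokenizer; double-check the card for the exact usage):

from transformers import BertTokenizerFast, AutoModel

# Traditional Chinese BERT released by CKIP Lab (Academia Sinica)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/bert-base-chinese')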
