When calculating the similarity loss between two sentences, it looks like we are using the averaged word embeddings per sentence. Within models.SDR.similarity_modeling.SimilarityModeling we have the following:
...
non_masked_outputs = self.roberta(
    non_masked_input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
non_masked_seq_out = non_masked_outputs[0]
meaned_sentences = non_masked_seq_out.mean(1)
miner_output = list(self.miner_func(meaned_sentences, sample_labels))
sim_loss = self.similarity_loss_func(meaned_sentences, sample_labels, miner_output)
...
It appears this also averages over the embeddings of the padded tokens, since the mean doesn't take the attention mask (i.e. the actual sentence lengths) into account. Was this done by design perhaps?
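For reference, this is roughly what I had in mind instead: masking the padded positions out of the mean using the attention mask. Just a minimal sketch with illustrative names (masked_mean is not a function from the repository):

import torch

def masked_mean(seq_out: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings per sentence, excluding padded positions.

    seq_out:        (batch, seq_len, hidden) token outputs from RoBERTa
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).type_as(seq_out)  # (batch, seq_len, 1)
    summed = (seq_out * mask).sum(dim=1)                  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)              # real-token count per sentence
    return summed / counts

# e.g. instead of non_masked_seq_out.mean(1):
# meaned_sentences = masked_mean(non_masked_seq_out, attention_mask)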