When calculating the similarity loss between two sentences, it looks like we are using the averaged word embeddings per sentence. Within models.SDR.similarity_modeling.SimilarityModeling we have the following:
...
non_masked_outputs = self.roberta(
    non_masked_input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
non_masked_seq_out = non_masked_outputs[0]
meaned_sentences = non_masked_seq_out.mean(1)
miner_output = list(self.miner_func(meaned_sentences, sample_labels))
sim_loss = self.similarity_loss_func(meaned_sentences, sample_labels, miner_output)
...
It appears this also averages over the embeddings of the padded tokens, since the mean doesn't take the attention mask (i.e. the actual sentence lengths) into account. Was this done by design perhaps?
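For reference, this is roughly what I had in mind instead: masking the padded positions out of the mean using the attention mask. Just a minimal sketch with illustrative names (masked_mean is not a function from the repository):

import torch

def masked_mean(seq_out: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings per sentence, excluding padded positions.

    seq_out:        (batch, seq_len, hidden) token outputs from RoBERTa
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).type_as(seq_out)  # (batch, seq_len, 1)
    summed = (seq_out * mask).sum(dim=1)                  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)              # real-token count per sentence
    return summed / counts

# e.g. instead of non_masked_seq_out.mean(1):
# meaned_sentences = masked_mean(non_masked_seq_out, attention_mask)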