I want to compute the similarity between my prompt and the model's response. But something strange happens: whether or not the prompt is related to the image, the similarity score always falls in the 0.5-0.6 range.
The following is my own CLIP similarity function using your model:
import torch
from model import longclip  # Long-CLIP repo module

def long_clip_similarity(image, text, model, preprocess, device):
    """Compute the Long-CLIP similarity between an image and a text."""
    try:
        # Preprocess the image
        image_processed = preprocess(image).unsqueeze(0).to(device)
        # Tokenize the text
        text_input = longclip.tokenize([text], truncate=True).to(device)
        # Compute features
        with torch.no_grad():
            image_features = model.encode_image(image_processed)
            text_features = model.encode_text(text_input)
        # Normalize features to unit length
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Cosine similarity of the two unit vectors
        similarity = torch.matmul(image_features, text_features.T).item()
        return similarity
    except Exception as e:
        print(f"Similarity computation failed: {e}")
        return None
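For context, raw CLIP-style cosine similarities are not calibrated absolute scores; they typically cluster in a narrow band even for unrelated pairs, because the model is trained to rank texts relative to each other under a learned logit scale, not to produce interpretable pairwise values. The usual workaround is to score one image against several candidate texts and apply the logit scale plus a softmax, as in zero-shot classification. A minimal NumPy sketch of that idea (the function name and the `logit_scale=100.0` value are my own illustrative assumptions, not part of the Long-CLIP API):

```python
import numpy as np

def similarity_with_softmax(image_feat, text_feats, logit_scale=100.0):
    """Cosine similarities of one image against several candidate texts,
    plus the softmax over the scaled logits (how CLIP ranks texts).
    logit_scale=100.0 roughly matches exp(logit_scale) in trained CLIP
    models; treat it as an illustrative assumption here."""
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = text_feats @ image_feat            # raw cosine similarities
    logits = logit_scale * sims               # temperature scaling
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return sims, exp / exp.sum()

# Toy features: the "related" text points almost the same way as the
# image; the "unrelated" one only partially overlaps with it.
img = np.array([1.0, 0.0, 0.0, 0.0])
texts = np.array([[0.9, 0.1, 0.0, 0.0],   # related caption
                  [0.5, 0.5, 0.5, 0.5]])  # unrelated caption
sims, probs = similarity_with_softmax(img, texts)
# The raw sims differ only moderately, but the softmax over the scaled
# logits separates the two candidates sharply.
```

This is why a single pairwise score of 0.5-0.6 is not necessarily a bug: the discriminative signal lives in the *relative* ordering after temperature scaling, not in the raw value.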
Related text and image example:

Unrelated text and image example:

Looking forward to your reply.