I want to compute the similarity between my prompt and the model's response. But something strange happens: whether or not the prompt is related to the image, the similarity score always falls in the 0.5-0.6 range.
The following is my own CLIP similarity function using your model:
import torch
from model import longclip  # Long-CLIP repo module

def long_clip_similarity(image, text, model, preprocess, device):
    """Compute the Long-CLIP similarity between an image and a text."""
    try:
        # Preprocess the image
        image_processed = preprocess(image).unsqueeze(0).to(device)
        # Tokenize the text
        text_input = longclip.tokenize([text], truncate=True).to(device)
        # Compute features
        with torch.no_grad():
            image_features = model.encode_image(image_processed)
            text_features = model.encode_text(text_input)
        # Normalize features to unit length
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Cosine similarity of the two unit vectors
        similarity = torch.matmul(image_features, text_features.T).item()
        return similarity
    except Exception as e:
        print(f"Similarity computation failed: {e}")
        return None
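For context, raw CLIP-style cosine similarities are not calibrated absolute scores; they typically cluster in a narrow band even for unrelated pairs, because the model is trained to rank texts relative to each other under a learned logit scale, not to produce interpretable pairwise values. The usual workaround is to score one image against several candidate texts and apply the logit scale plus a softmax, as in zero-shot classification. A minimal NumPy sketch of that idea (the function name and the `logit_scale=100.0` value are my own illustrative assumptions, not part of the Long-CLIP API):

```python
import numpy as np

def similarity_with_softmax(image_feat, text_feats, logit_scale=100.0):
    """Cosine similarities of one image against several candidate texts,
    plus the softmax over the scaled logits (how CLIP ranks texts).
    logit_scale=100.0 roughly matches exp(logit_scale) in trained CLIP
    models; treat it as an illustrative assumption here."""
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = text_feats @ image_feat            # raw cosine similarities
    logits = logit_scale * sims               # temperature scaling
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return sims, exp / exp.sum()

# Toy features: the "related" text points almost the same way as the
# image; the "unrelated" one only partially overlaps with it.
img = np.array([1.0, 0.0, 0.0, 0.0])
texts = np.array([[0.9, 0.1, 0.0, 0.0],   # related caption
                  [0.5, 0.5, 0.5, 0.5]])  # unrelated caption
sims, probs = similarity_with_softmax(img, texts)
# The raw sims differ only moderately, but the softmax over the scaled
# logits separates the two candidates sharply.
```

This is why a single pairwise score of 0.5-0.6 is not necessarily a bug: the discriminative signal lives in the *relative* ordering after temperature scaling, not in the raw value.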
Related text and image example:

Unrelated text and image example:

Looking forward to your reply.