MultiModal_Eval_Project_PoC-

This is a proof of concept (PoC) for my MultiModal Eval framework project idea doc.

The project description is as follows:

1. The project shows how to evaluate a system end to end.
2. The focus here is on evaluating one system out of the many possible ones, a RAG system (the same approach can be applied to other systems built on multimodal concepts).
3. The project is done in a Google Colab notebook environment.

The project is divided into two main parts:

Part 1: Benchmarking the model

- First we benchmark the model to check whether it is a good fit for the desired system; here the focus is on a RAG system.
- We benchmark the model with the LightEval framework on the F1 metric, using custom tasks and a RAG-focused dataset from Hugging Face (here, "rag-datasets/rag-mini-bioasq").
- Everything runs locally, except calling the model through an API.
- The LightEval pipeline is built to handle custom tasks with dual subsets (here, "question-answer-passages" and "text-corpus") and their corresponding splits.
- From this we decide whether the model is suitable for the desired system, which saves money and time at the early stages.
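The F1 score used for QA-style benchmarks like this is typically token-overlap F1. As a minimal sketch of that scoring logic (the `token_f1` name and the example strings are my own illustration, not code from the PoC):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens appearing in both strings (multiset intersection).
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Scoring a hypothetical model answer against a BioASQ-style gold answer:
score = token_f1("insulin regulates blood glucose", "insulin regulates glucose levels")
# 3 of 4 tokens overlap on each side, so precision = recall = F1 = 0.75
```

In the PoC the same kind of score is produced by LightEval's pipeline over the custom tasks rather than computed by hand.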

Part 2: The RAG system pipeline

- This part contains the whole RAG system pipeline.
- First, the "indexing" logic of the RAG system is written.
- Second, the "retrieving" part of the RAG system.
- We then run the RAG system once to check whether it works or is broken.
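The indexing and retrieving steps above can be sketched with a toy in-memory index. A real RAG pipeline would use embeddings and a vector store, so treat `MiniRAGIndex` and its bag-of-words cosine similarity as an illustrative stand-in, not the PoC's implementation:

```python
import math
from collections import Counter

class MiniRAGIndex:
    """Toy in-memory index: bag-of-words vectors ranked by cosine similarity."""

    def __init__(self):
        self.docs = []      # raw passage text
        self.vectors = []   # one token Counter per passage

    def index(self, passages):
        # "Indexing" step: store each passage with its term-count vector.
        for text in passages:
            self.docs.append(text)
            self.vectors.append(Counter(text.lower().split()))

    def retrieve(self, query, k=2):
        # "Retrieving" step: rank passages by cosine similarity to the query.
        q = Counter(query.lower().split())

        def cosine(a, b):
            dot = sum(a[t] * b[t] for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(range(len(self.docs)),
                        key=lambda i: cosine(q, self.vectors[i]),
                        reverse=True)
        return [self.docs[i] for i in ranked[:k]]

idx = MiniRAGIndex()
idx.index(["insulin regulates blood glucose",
           "the mitochondria is the powerhouse of the cell"])
top = idx.retrieve("what regulates glucose", k=1)
```

Running the retriever once on a known query, as in the smoke test above, is the quickest way to confirm the pipeline is working rather than broken.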

- After this, we use the RAGAS metric library to build our evaluation pipeline.
- We provide sample queries and expected responses related to our document, and evaluate the RAG system against them.
- For traces and observability we use LangSmith.
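A real RAGAS run needs an LLM behind its metrics (e.g. faithfulness, answer relevancy), so as a self-contained illustration of the evaluation loop, sample queries plus expected responses, scored per answer, here is a lexical stand-in. The `evaluate_rag` function and its token-recall proxy are my simplification, not the RAGAS API:

```python
def evaluate_rag(samples, rag_answer_fn):
    """Score each RAG answer by token recall against the expected response.

    samples: list of (query, expected_response) pairs.
    rag_answer_fn: callable that returns the RAG system's answer for a query.
    """
    results = []
    for query, expected in samples:
        answer = rag_answer_fn(query)
        exp_tokens = set(expected.lower().split())
        ans_tokens = set(answer.lower().split())
        recall = len(exp_tokens & ans_tokens) / len(exp_tokens) if exp_tokens else 0.0
        results.append({"query": query, "answer": answer, "recall": recall})
    return results

# Hypothetical sample set and a stub in place of the real RAG system:
samples = [("what regulates glucose", "insulin regulates glucose")]
report = evaluate_rag(samples, lambda q: "insulin regulates blood glucose")
```

In the PoC this loop is replaced by RAGAS metrics, with LangSmith capturing the traces for each run.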

Lastly, the pipeline is exposed through FastAPI.

What other or different approaches could be taken?

1. The evaluation pipeline could include LLM-as-judge or human-in-the-loop review.
2. There are several reasons for this addition: we would not just evaluate the system, we could also improve its scores.
3. The pipeline would be: RAG output --> LLM-as-judge --> human-in-the-loop (for correctness and future answer improvement) --> store in a vector database (for future use). This might raise scores only indirectly, but it ensures correctness and trustworthiness.
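The proposed extension (RAG output, then LLM-as-judge, then human-in-the-loop, then storage) could be scaffolded roughly as below. All three callables are hypothetical placeholders; in practice the judge would be an LLM call, the review step a human workflow, and the store a vector database write:

```python
def judge_pipeline(rag_output, llm_judge, human_review, store):
    """Sketch of the proposed extension pipeline.

    llm_judge: returns a verdict dict, e.g. {"label": "pass"|"fail", ...}.
    human_review: returns a corrected answer when the judge flags a failure.
    store: persists the final answer (e.g. into a vector database).
    """
    verdict = llm_judge(rag_output)
    if verdict["label"] != "pass":
        # Human-in-the-loop: correct the answer for future betterment.
        rag_output = human_review(rag_output, verdict)
    store(rag_output)
    return rag_output

# Wiring with stubs: a judge that always fails, a reviewer that fixes,
# and a plain list standing in for the vector database.
stored = []
final = judge_pipeline(
    "draft answer",
    llm_judge=lambda out: {"label": "fail"},
    human_review=lambda out, verdict: "corrected answer",
    store=stored.append,
)
```

Storing the corrected answers means later queries can be answered against human-verified material, which is how this loop could improve scores indirectly.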
