LogicIF is a comprehensive framework for generating natural language instructions from code functions and evaluating LLM performance. The framework consists of four main components:
- Instruction Generation Framework (LogicIFGen, core.py): the framework that generates natural language instructions from code functions.
- Benchmarks: LogicIFEval, a collection of 426 instructions with complex logical structures, and LogicIFEval-mini, a compute-friendly version (102 instructions) of LogicIFEval.
- Inference (inference.py): easy inference on the generated benchmarks.
- Evaluation (evaluation.py): easy evaluation of model outputs on the generated benchmarks.
```bash
git clone https://github.com/your-org/logicif.git
cd logicif
```

Basic usage (instruction generation, evaluation without vLLM-based inference):

```bash
pip install .
```

If you want to run inference.py (vLLM-based inference):

```bash
pip install .[full]
```

Set up OpenAI API key in config.json (or use environment variables):

```json
{
  "OPENAI_API_KEY": "your-openai-key"
}
```

# LogicIF: Complex Logical Instruction Following
## Evaluate Models on LogicIFEval or LogicIFEval-mini
Your model must be supported by vLLM in order to use our inference script inference.py. Otherwise, you need to directly provide a file in the same format as sample_result.jsonl and use evaluation.py to compute the metrics.
For each instruction, we apply two steps to get the final result (a JSON object: {"output": ..., "stats": {...}}):
- Step 1: get the raw responses of the models (the llm_response field of sample_inference_output.jsonl).
- Step 2: use a small OpenAI model to extract the JSON object (the llm_output field of sample_result.jsonl).
Note: You may skip Step 1 if you use another inference framework to get the responses, and skip Step 2 if you use another method to extract the result JSON object.
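For example, if you prefer to skip Step 2, a rough rule-based extractor along the following lines could replace the OpenAI-based one. This is only a sketch: it assumes the model prints the JSON object verbatim somewhere in its response, and it simply copies every other field through to the result file.

```python
import json

def extract_json_object(response: str):
    """Best-effort extraction of the last {"output": ..., "stats": {...}} object in a raw response."""
    decoder = json.JSONDecoder()
    result = None
    for idx, ch in enumerate(response):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(response[idx:])
        except json.JSONDecodeError:
            continue
        # Keep the last well-formed object that has the expected keys.
        if isinstance(obj, dict) and "output" in obj and "stats" in obj:
            result = obj
    return result

# Convert raw inference output (with an llm_response field) into a result file
# in the same format as sample_result.jsonl (with an llm_output field).
with open("sample_inference_output.jsonl") as fin, open("sample_result.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        record["llm_output"] = extract_json_object(record.get("llm_response", ""))
        fout.write(json.dumps(record) + "\n")
```

To run Step 1 with our inference script: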
```bash
python -m logicif.inference \
  --input "benchmark/logic-if-eval-mini.jsonl" \
  --output "/path/to/inference_output_file" \
  --model "/path/to/your/model" \
  --max-tokens 16384 \
  --gpu-memory-utilization 0.8 \
  --overwrite
```
Extract structured outputs from raw model responses:
```bash
# gpt-4.1-mini is the extractor used in our paper and works well
python -m logicif.evaluation \
  --todo extract_llm_results \
  --input_file "/path/to/inference_output_file" \
  --output_file "/path/to/result_file" \
  --extract_model "gpt-4.1-mini" \
  --overwrite
```

Compare extracted results against code outputs, reporting the output accuracy, state tracker accuracy, and overall accuracy:
```bash
python -m logicif.evaluation \
  --todo compare_outputs \
  --result_file "/path/to/result_file"
```

## LogicIFGen

LogicIFGen transforms code functions into detailed natural language instructions that can be used for benchmarking language models and model training. It requires two input files:
- Function File: This should be a JSONL file like sample_functions.jsonl, where each line is a JSON object representing a Python function. The required fields are:
  - "task_id": The ID of the function.
  - "function": The function code as a string.
- Test Case File: This should be a JSONL file like sample_test_cases.jsonl, where each line contains a JSON object with:
  - "task_id": The function identifier (matching the task_id from the function file).
  - "test_cases": A list of test cases, where each test case is a list of arguments to pass to the function.
```python
from logicif import LogicIFGen
from logicif.utils import load_api_keys

# Load API keys before using the framework
load_api_keys()

def main():
    model = "o4-mini"  # we recommend using frontier closed-source models to guarantee the quality of the descriptions
    framework = LogicIFGen(output_dir="functions")

    # Load functions from JSONL file
    framework.load_functions_from_jsonl("sample_functions.jsonl")

    # Add state trackers to functions
    framework.add_stat(model=model, overwrite=True)

    # Evolve functions for more complex logic
    framework.evolve_functions(model=model, max_turns=1, overwrite=True)  # comment this out if you do not want to change the logic of the functions

    # Generate natural language descriptions
    framework.gen_desc(model=model, overwrite=True)

    # Verify description completeness
    framework.verify_desc(model=model, max_turns=3, overwrite=True)

    # Load test cases
    framework.load_test_cases("sample_test_cases.jsonl")

    # Execute functions to get expected outputs
    framework.get_code_output(timeout=5)

    # Analyze complexity
    framework.analyze_complexity()

    # Generate final benchmark file
    framework.finalize_benchmark(output_file="sample_instructions.jsonl")

if __name__ == "__main__":
    main()
```

### Generated Instruction
The generated file looks like sample_instructions.jsonl. Each line contains a complete instruction example with the following fields:
"task_id": The function identifier"test_case_id": Index of the specific test case (0, 1, 2, ...)"input": The input values for this test case"code_output": The expected output and the value of state trackers from running the function"function": The function code with state trackers"description": Natural language description of the code function"instruction": The complete instruction prompt"complexity_score": Complexity score for the function
## Inference

The inference module uses vLLM to run language model inference on benchmark files. If you want to evaluate models on custom functions, please follow the previous section to generate instructions. You can also evaluate models on LogicIFEval or LogicIFEval-mini located in benchmark/, which share the same format as sample_instructions.jsonl.
```bash
python -m logicif.inference \
  --input "sample_instructions.jsonl" \
  --output "sample_inference_output.jsonl" \
  --model "/path/to/your/model" \
  --max-tokens 16384 \
  --gpu-memory-utilization 0.8 \
  --overwrite
```

### Parameters

- --input: Input benchmark JSONL file with instructions
- --output: Output file to save model responses
- --model: HuggingFace model ID or local path
- --max-tokens: Maximum tokens to generate (default: 16384)
- --run-name: Optional run name for experiment tracking
- --overwrite: Overwrite existing output file
- --gpu-memory-utilization: GPU memory utilization ratio (default: 0.8)
## Evaluation

The evaluation module provides tools to extract model outputs and compare them against expected results. In our paper, we use gpt-4.1-mini as the JSON extractor.
Extract structured outputs from raw model responses:
```bash
python -m logicif.evaluation \
  --todo extract_llm_results \
  --input_file "sample_inference_output.jsonl" \
  --output_file "sample_result.jsonl" \
  --extract_model "gpt-4.1-mini" \
  --overwrite
```

### Parameters

- --input_file: JSONL file with raw model responses
- --output_file: Output file for extracted results
- --extract_model: Model to use for output extraction
- --overwrite: Overwrite existing files
Compare extracted results against expected outputs:
```bash
python -m logicif.evaluation \
  --todo compare_outputs \
  --result_file "sample_result.jsonl"
```

### Parameters

- --result_file: Path to file with extracted results
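For reference, a rough sketch of what the comparison computes is shown below. It assumes each result line carries the extracted llm_output and the reference code_output, both shaped like {"output": ..., "stats": {...}}; the actual evaluation.py may apply additional normalization, so treat this only as an approximation of the three metrics.

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

results = load_jsonl("sample_result.jsonl")

output_hits = stats_hits = overall_hits = 0
for r in results:
    pred = r.get("llm_output") or {}  # extracted object; may be missing if extraction failed
    gold = r["code_output"]           # ground-truth output and state tracker values
    output_ok = pred.get("output") == gold["output"]
    stats_ok = pred.get("stats") == gold["stats"]
    output_hits += output_ok
    stats_hits += stats_ok
    overall_hits += output_ok and stats_ok

total = len(results)
print(f"output accuracy:        {output_hits / total:.3f}")
print(f"state tracker accuracy: {stats_hits / total:.3f}")
print(f"overall accuracy:       {overall_hits / total:.3f}")
```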