CodeMorph

CodeMorph is a tool that uses semantic-preserving transformations to perturb code, effectively mitigating data leakage in LLM evaluation datasets. It optimizes combinations of perturbation methods through a genetic algorithm to enhance the effectiveness of the transformations.

Workflow

[Workflow diagram of CodeMorph]

Download and Installation

JPlag

Ensure that JPlag is installed in the CodeMorph directory. Also install the other necessary dependencies, such as the g++ compiler.

Dependencies

pip install -r requirements.txt

API Setup

Configure the LLM_LIST in perturbation.py with the corresponding model, API key, and base URL required to create an OpenAI client:

LLM_LIST = {
    'perturbation_llm': ['model', 'api_key', 'base_url'],
    'voter_llm_1': ['model', 'api_key', 'base_url'],
    'voter_llm_2': ['model', 'api_key', 'base_url'],
    'voter_llm_3': ['model', 'api_key', 'base_url'],
}

Each entry should include the specific model name, API key, and base URL for each LLM instance you are configuring.
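
For orientation, each three-element entry maps directly onto the arguments of an OpenAI-compatible client. The snippet below is an illustrative sketch, not part of CodeMorph's code; the model name, key, and URL are placeholders.

from openai import OpenAI

# Placeholder entry; substitute your own model name, API key, and endpoint.
LLM_LIST = {
    'perturbation_llm': ['model-name', 'your-api-key', 'https://your-endpoint/v1'],
}

# Each three-element list unpacks into the arguments of an OpenAI-compatible client.
model, api_key, base_url = LLM_LIST['perturbation_llm']
client = OpenAI(api_key=api_key, base_url=base_url)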

Preparing the Code File

Prepare a JSONL file containing the code to be perturbed, with each line formatted as follows:

{"task_id": "task_1", "code": "original code", "entry_point": "your entry point"}

Running the Tool

python -m CodeMorph.codemorph \
    --lang python \
    --file ./example.jsonl \
    --mi 15 \
    --st 0.2 \
    --temper 2

This command runs CodeMorph on the specified code file with the following options:

--lang (required): source language; choose from python, c, cpp, rust, java, go
--file (required): path to the JSONL file prepared above
--mi: maximum number of iterations (default: 15)
--st: similarity threshold (default: 0.2)
--temper: temperature for Boltzmann selection (default: 2)
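
If you want to process several input files with the same settings, the command can be driven from a short script. This is a minimal sketch using subprocess; the inputs directory and file names are illustrative.

import subprocess
from pathlib import Path

# Run CodeMorph on every JSONL file in ./inputs with the settings shown above.
for path in sorted(Path("inputs").glob("*.jsonl")):
    subprocess.run(
        [
            "python", "-m", "CodeMorph.codemorph",
            "--lang", "python",
            "--file", str(path),
            "--mi", "15",
            "--st", "0.2",
            "--temper", "2",
        ],
        check=True,  # stop on the first failure
    )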

To adjust the weights of $s_1$ and $s_2$, modify the compose_similarity_score function in similarity_computation.py.
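
For orientation, a weighted combination of the two scores could look like the sketch below. This is illustrative only; check similarity_computation.py for the real signature and default weights.

# Illustrative sketch only: the real compose_similarity_score is defined in
# similarity_computation.py and its signature and defaults may differ.
def compose_similarity_score(s1, s2, w1=0.5, w2=0.5):
    """Combine the two similarity scores as a weighted sum; change w1/w2 to re-balance them."""
    return w1 * s1 + w2 * s2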

Citation

If you reference our work or use our tool, please cite it as follows:

@inproceedings{rao2025codemorph,
  author    = {Rao, Hongzhou and Zhao, Yanjie and Zhu, Wenjie and Xiao, Ling and Wang, Meizhen and Wang, Haoyu},
  title     = {CodeMorph: Mitigating Data Leakage in Large Language Model Assessment},
  booktitle = {Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings},
  year      = {2025},
  note      = {Accepted}
}
