Correction of benchmark results

Hi everyone,

I found several bugs while checking the code of the ipynb notebooks with benchmark results for 3 environments: [TinyToy](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/notebooks/notebook_benchmark-tiny.ipynb), [ToyCTF](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/notebooks/notebook_benchmark-toyctf.ipynb), and [Chain](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/notebooks/notebook_benchmark-chain.ipynb).

I think my findings might be useful for the community that uses this nice implementation of cyberattack simulation.


>  **MOVED TO SEPARATE ISSUE** #115 
> 1. __Issue 1__: [learner.epsilon_greedy_search(...)](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/cyberbattle/agents/baseline/learner.py#L126) is used to train agents with different algorithms, including DQL in `dql_run`. However, `dql_exploit_run`, which takes the network from `dql_run` as its policy agent and an `eval_episode_count` parameter for the number of episodes, gives the impression that its runs evaluate the trained DQN. The only difference between the 2 runs is that epsilon is equal to 0. This puts the agent into exploitation mode but does not stop training: during a run with [learner.epsilon_greedy_search](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/cyberbattle/agents/baseline/learner.py#L126), `optimizer.step()` is still executed on every step through the [learner.on_step(...)](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/cyberbattle/agents/baseline/agent_dql.py#L348) call in `agent_dql.py`.
> - **Solution**: I will include in the pull request the code I used for a more accurate evaluation (based on [learner.epsilon_greedy_search(...)](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/cyberbattle/agents/baseline/learner.py#L126)) and for generating the figures below.
> - **Screenshots**: Figures 1 to 4 show the results of evaluating the chain network with the corresponding new cell in [notebook_benchmark-chain.ipynb](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/notebooks/notebook_benchmark-chain.ipynb). As [figure 1](https://user-images.githubusercontent.com/8929593/194348810-c5731ab6-80bd-4fd3-af8f-e2070e0aa943.png) shows, training on the initial 50 episodes is not enough to own 100% of the network (AttackerGoal), whereas the original `dql_exploit_run`, which internally calls `learner.on_step(...)` ([figure 2](https://user-images.githubusercontent.com/8929593/194348854-9569a9cc-f553-48ec-b352-4ffc0890bf40.png)), gives much better results because the optimizer keeps processing the agent's ongoing experience. We can avoid this inaccurate evaluation and still reach the goal 100% of the time ([figure 3](https://user-images.githubusercontent.com/8929593/194349289-349268d9-3e2c-47d3-a6b8-6e0c602bfba0.png)) by training for 200 episodes and commenting out `learner.on_step()` during evaluation. This freezes the trained network and stops optimization during evaluation, and with the larger number of training episodes the agent still takes ownership of the whole network. In other words, 200 episodes are enough to learn the optimal attack path in the chain network configuration.
> Lastly, [figure 4](https://user-images.githubusercontent.com/8929593/194349308-ad6e719c-a5e8-4174-8b7d-e3ca0b71b358.png) compares the correctly evaluated runs: over 20 evaluation episodes they reach 6000+ and 120+ cumulative reward for 200 and 50 training episodes, respectively.
> [Figure 1: (after PR) no optimizer during evaluation, 20 trained episodes, 20 evaluation episodes](https://user-images.githubusercontent.com/8929593/194348810-c5731ab6-80bd-4fd3-af8f-e2070e0aa943.png)
> [Figure 2: (before & after PR) dql_exploit_run with optimizer during evaluation, 20 trained episodes, 5 evaluation episodes](https://user-images.githubusercontent.com/8929593/194348854-9569a9cc-f553-48ec-b352-4ffc0890bf40.png)
> [Figure 3: (after PR) no optimizer during evaluation, **200** trained episodes, 20 evaluation episodes](https://user-images.githubusercontent.com/8929593/194349289-349268d9-3e2c-47d3-a6b8-6e0c602bfba0.png)
> [Figure 4: (after PR) comparison of evaluation for network trained on 200 and 20 episodes, chain network configuration](https://user-images.githubusercontent.com/8929593/194349308-ad6e719c-a5e8-4174-8b7d-e3ca0b71b358.png)
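
To illustrate the difference between "epsilon = 0" and a frozen evaluation, here is a minimal, self-contained sketch. It does not use the CyberBattleSim API: `ChainEnv`, `TabularAgent`, and `run` are hypothetical stand-ins (a toy chain and a tabular Q-learner instead of a DQN), and only the `on_step` name mirrors the learner hook mentioned above. The point is that a greedy run only counts as evaluation when the update hook is skipped.

```python
import random


class ChainEnv:
    """Hypothetical toy chain: action 1 moves forward, action 0 stays in place."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        done = self.pos >= self.length
        reward = 1.0 if action == 1 else 0.0
        return self.pos, reward, done


class TabularAgent:
    """Hypothetical tabular Q-learner standing in for the DQN policy."""

    def __init__(self, n_states, n_actions=2, lr=0.5, gamma=0.9):
        self.q = [[0.0] * n_actions for _ in range(n_states + 1)]
        self.lr = lr
        self.gamma = gamma
        self.updates = 0  # counts how many learning updates were applied

    def greedy_action(self, state):
        row = self.q[state]
        return max(range(len(row)), key=row.__getitem__)

    def on_step(self, s, a, r, s2):
        # Learning update; this is the part that must NOT run during evaluation.
        target = r + self.gamma * max(self.q[s2])
        self.q[s][a] += self.lr * (target - self.q[s][a])
        self.updates += 1


def run(env, agent, episodes, max_steps, epsilon, learn):
    """Epsilon-greedy loop; `learn=False` freezes the policy."""
    totals = []
    for _ in range(episodes):
        s = env.reset()
        total = 0.0
        for _ in range(max_steps):
            if random.random() < epsilon:
                a = random.randrange(2)
            else:
                a = agent.greedy_action(s)
            s2, r, done = env.step(a)
            if learn:
                agent.on_step(s, a, r, s2)
            total += r
            s = s2
            if done:
                break
        totals.append(total)
    return totals
```

Calling `run(..., epsilon=0.0, learn=True)` corresponds to the exploitation run described above (greedy actions, but the weights keep changing), while `run(..., epsilon=0.0, learn=False)` corresponds to an evaluation of the fixed policy.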

2. **Issue 2:** During training, each episode ends only when the maximum number of iterations is reached, because of a mistake in the [AttackerGoal](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/cyberbattle/_env/cyberbattle_env.py#L231) class. The default value of the parameter `own_atleast_percent: float = 1.0` is combined with the other goal conditions by AND when deciding whether to raise the flag `done = True`. For TinyToy and ToyCTF (not Chain), this leads to long training times, a wrong RL signal for estimating the Q function, and low sample efficiency.
- **Solution:** To stay consistent with the originally defined environments, I changed the gym [registry](https://github.com/microsoft/CyberBattleSim/blob/main/cyberbattle/__init__.py) so that the previous environment versions keep their behavior and new environments use the standard `done` behavior. This means adding `own_atleast_percent: 1.0` to the initialization of the `"v0"` versions of the `toyctf` and `tinytoy` environments, and creating the new envs 'CyberBattleTiny-v1' and 'CyberBattleToyCTF-v1', which default to `own_atleast_percent=0` and `own_atleast=6`. This is reasonable because the CTF solution requires only 6 nodes to be owned, and with correct reward engineering training stops at the attack that owns 6 nodes with the highest reward.
- **Screenshots:** [Figure 5: Length of training episodes, obvious increase during learning of optimal path](https://user-images.githubusercontent.com/8929593/194363830-1f85f982-bfec-4d00-bb0c-83fb0dba87c0.png)
[Figure 6: 1500 max iterations during training of 20 episodes, before PR](https://user-images.githubusercontent.com/8929593/194364012-920e01b4-5a7a-4e59-a555-6f0f24974d7d.png)
[Figure 7: training on both 20 and 200 episodes, either use more RL techniques or learn for more episodes](https://user-images.githubusercontent.com/8929593/194364407-8fe8aa87-cbb0-4844-b6b5-315e74163341.png)
- PR: I left some cells marked "Before PR" in [ToyCTF](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/notebooks/notebook_benchmark-toyctf.ipynb) for comparison; they can be safely deleted.
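
The effect of the AND condition in Issue 2 can be sketched in a few lines. This is a simplified model, not the code from `cyberbattle_env.py`: `Goal` and `goal_reached` are hypothetical names that keep only the two fields discussed above, and the total of 10 nodes is an assumed number chosen for illustration.

```python
from typing import NamedTuple


class Goal(NamedTuple):
    """Hypothetical, reduced stand-in for AttackerGoal (two fields only)."""

    own_atleast: int = 0
    own_atleast_percent: float = 1.0  # default discussed in Issue 2


def goal_reached(owned: int, total: int, goal: Goal) -> bool:
    # Both conditions must hold (logical AND), so the default
    # own_atleast_percent=1.0 requires owning every node.
    return owned >= goal.own_atleast and owned / total >= goal.own_atleast_percent


TOTAL_NODES = 10  # assumed network size, for illustration only
OWNED = 6         # nodes owned by the CTF solution

v0_goal = Goal(own_atleast=6)                           # percent left at 1.0
v1_goal = Goal(own_atleast=6, own_atleast_percent=0.0)  # "v1"-style setting

print(goal_reached(OWNED, TOTAL_NODES, v0_goal))  # False: episode runs to max iterations
print(goal_reached(OWNED, TOTAL_NODES, v1_goal))  # True: episode can end with done=True
```

With the "v0"-style goal, owning the 6 required nodes never satisfies the goal, so an episode can only stop at the iteration limit; with the "v1"-style goal, the same state ends the episode.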

>  **MOVED TO SEPARATE ISSUE** #115 
> 3. **Issue 3:** The [ToyCTF](https://github.com/microsoft/CyberBattleSim/blob/4fd228bccfc2b088d911e27072a923251203cac8/notebooks/notebook_benchmark-toyctf.ipynb) benchmark is inaccurate: with the correct evaluation procedure, as used for the chain network configuration, the agent does not reach the goal of 6 owned nodes after 200 training episodes.


Correction of benchmark results #87
