Implements Q-learning, a form of reinforcement learning (RL), from scratch on simulations from OpenAI Gymnasium!
Q-learning is a model-free, off-policy reinforcement learning (RL) algorithm that learns the optimal action-value function.

TL;DR: Q-learning learns a table (or function) of action values and acts greedily with respect to those values, while still exploring via an $\epsilon$-greedy policy.
Use Q-learning when:
- The state and action spaces are small/discrete (so a table fits in memory), or you can discretize reasonably.
- You don’t have (or don’t want) a dynamics model (transition probabilities).
- You want a simple, proven method to get a baseline or solve gridworld-like tasks, tabular control, or MDP toys.
If your problem has large/continuous spaces, prefer function approximation (e.g., DQN) or policy-gradient families.
- MDP: Defined by states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $P(s'\mid s,a)$, reward $R(s,a)$, and discount $\gamma \in [0,1)$.
- Policy ($\pi$): A mapping from states to actions (deterministic or stochastic).
- Optimal Q ($Q^*$): The action-value function whose greedy policy maximizes returns.
- Exploration vs. Exploitation: Trade off trying new actions (explore) versus taking the best-known action (exploit). A common strategy is $\epsilon$-greedy (sketched below).
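To make the exploration strategy concrete, here is a minimal $\epsilon$-greedy action selector (a sketch; `q_table` is assumed to be a NumPy array indexed by state, and the names are illustrative, not the repository's exact code):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_table, state, epsilon):
    """With probability epsilon explore; otherwise take the best-known action."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[-1]))   # random action (explore)
    return int(np.argmax(q_table[state]))             # greedy action (exploit)
```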
After observing a transition $(s, a, r, s')$, Q-learning applies the update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

Where:

- $\alpha$ is the learning rate (e.g., 0.1).
- $\gamma$ is the discount factor (e.g., 0.95).
- The target uses $\max$ over next-state actions → off-policy (the target is greedy regardless of the behavior policy).

NOTE: The code uses the same formula, just with the current state's Q-value terms grouped together: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$.
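In code, the update in both equivalent forms might look like this, assuming a NumPy `q_table` indexed by state then action (a sketch, not the repository's exact code):

```python
import numpy as np

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update on a NumPy Q-table."""
    td_target = r + gamma * np.max(q_table[s_next])   # greedy bootstrap target
    # TD-error form:
    q_table[s][a] += alpha * (td_target - q_table[s][a])
    # Equivalent "clubbed" form, as used in the code:
    # q_table[s][a] = (1 - alpha) * q_table[s][a] + alpha * td_target
```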
- Initialize $Q(s,a)$ arbitrarily (often zeros) for all states and actions.
- For each episode:
  - Initialize state $s$.
  - Loop (until terminal):
    - With probability $\epsilon$, pick a random action; otherwise $\arg\max_a Q(s,a)$.
    - Execute the action, observe reward $r$ and next state $s'$.
    - Update Q using the formula above.
    - Set $s \leftarrow s'$.
- Optionally anneal $\epsilon$ from high to low for exploration early, exploitation later (a full loop is sketched below).
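Putting the steps together, a minimal tabular Q-learning loop for a small discrete Gymnasium environment could look like this (a sketch using `FrozenLake-v1` as an assumed stand-in; hyperparameter values are illustrative):

```python
import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1')                       # any small discrete env works
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.95, 1.0
rng = np.random.default_rng()

for episode in range(5_000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update; no bootstrapping from a terminal next state
        target = reward if terminated else reward + gamma * np.max(q_table[new_state])
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = new_state
    epsilon = max(0.01, epsilon * 0.999)              # anneal exploration
```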
Notes:

- The reward shaping (small step cost, big goal reward) encourages shorter paths.
- Annealing $\epsilon$ helps the agent explore early, then exploit later.
- SARSA (on-policy): The update uses the action actually taken next: $r + \gamma Q(s',a')$. Safer near risky states (less overestimation), but may be more conservative.
- Expected SARSA: Uses the expectation over next actions: $r + \gamma\,\mathbb{E}_{a'\sim\pi}[Q(s',a')]$.
- Double Q-learning: Reduces overestimation bias via two Q-tables.
- DQN (Deep Q-Network): Replaces the table with a neural network; adds target networks and experience replay.
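The first three variants differ only in the bootstrap target; a side-by-side sketch (a hypothetical helper, with `pi_probs` an assumed array of the policy's action probabilities in $s'$):

```python
import numpy as np

def bootstrap_targets(q_table, r, s_next, a_next, pi_probs, gamma=0.95):
    """Targets used by Q-learning, SARSA, and Expected SARSA for one transition."""
    q_next = q_table[s_next]
    return {
        'q_learning':     r + gamma * np.max(q_next),                   # greedy max (off-policy)
        'sarsa':          r + gamma * q_next[a_next],                   # next action actually taken
        'expected_sarsa': r + gamma * float(np.dot(pi_probs, q_next)),  # expectation under pi
    }
```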
- Insufficient exploration: The agent gets stuck exploiting suboptimal early values.
- Non-stationary rewards/dynamics: Tabular Q-learning assumes a stationary MDP.
- Huge/continuous spaces: The table explodes; use function approximation.
- Learning rate too high/low: Too high → oscillations; too low → slow learning.
- Bootstrapping from terminal states: Don't include $\max_{a'} Q(s',a')$ if $s'$ is terminal (see the snippet below).
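The terminal-state pitfall is a one-line guard in practice (same illustrative names as the sketch above):

```python
# The value of a terminal next state is zero, so don't bootstrap from it.
target = reward if terminated else reward + gamma * np.max(q_table[new_state])
q_table[state, action] += alpha * (target - q_table[state, action])
```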
The program trains an agent using Q-learning to solve the MountainCar-v0 environment from Gymnasium. In this problem, a car is stuck between two hills and must build enough momentum to reach the flag on top of the right hill. Since the environment has a continuous state space (position and velocity), the program discretizes it into a grid and applies tabular Q-learning.

```python
env = gym.make('MountainCar-v0')
```

The agent interacts with the MountainCar environment. The observation space has two values (position, velocity) and the action space has three values (push left, no push, push right).
The environment states are continuous, so they are mapped into discrete bins:
```python
table_size = (20, 20)
window_size = (env.observation_space.high - env.observation_space.low) / table_size
```

- The continuous space is split into a 20x20 grid.
- `get_discrete_state(state, window_size)` converts a continuous state into integer indices for the Q-table (a possible implementation is sketched below).
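A plausible implementation of the discretizer, consistent with the description above (a sketch; the repository's exact code may differ):

```python
import gymnasium as gym
import numpy as np

env = gym.make('MountainCar-v0')
table_size = (20, 20)
window_size = (env.observation_space.high - env.observation_space.low) / table_size

def get_discrete_state(state, window_size):
    """Map a continuous (position, velocity) pair to integer bin indices."""
    idx = (np.asarray(state) - env.observation_space.low) / window_size
    idx = np.clip(idx.astype(int), 0, np.array(table_size) - 1)  # keep boundary values in range
    return tuple(idx)  # usable directly as a Q-table index
```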
```python
q_table = np.random.uniform(low=0, high=1, size=(20, 20, env.action_space.n))
```

- Q-table dimensions = `(position_bins, velocity_bins, actions)`.
- Initially filled with random values in [0, 1].
- Learning rate (α): 0.1 → how much new info overrides old.
- Discount factor (γ): 0.95 → importance of future rewards.
- Exploration rate (ε): starts at 1.0 → decays gradually to 0.01.
- Episodes: 60,000 → each one is an attempt to solve the task.
- Step limit per episode: 120.
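One simple way to realize that decay schedule (a sketch; the actual code may use a different rule):

```python
EPISODES = 60_000
epsilon, epsilon_min = 1.0, 0.01
# Linear anneal over the first half of training, then hold at the floor.
decay = (epsilon - epsilon_min) / (EPISODES // 2)

for episode in range(EPISODES):
    # ... run one episode with the current epsilon ...
    epsilon = max(epsilon_min, epsilon - decay)
```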
For each episode:

- Reset the environment → get the initial state.
- Discretize the state.
- Loop until done (goal reached or max steps):
  - Choose an action:
    - With probability ε → random action (exploration).
    - Otherwise → greedy action (exploitation) based on the Q-table.
  - Perform the action → observe `(new_state, reward, done)`.
  - Discretize the new state.
  - Update the Q-table using the Q-learning rule:
    ```python
    new_q = (1 - α) * current_q + α * (reward + γ * max_future_q)
    ```
  - If the goal is reached, set the Q-value to 0 and mark the episode done.
  - Accumulate reward and continue.
- After each episode, decay ε to reduce exploration over time (the full loop is sketched below).
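Assembled into code, the training loop might look like the following (a sketch under the stated hyperparameters, reusing `env`, `window_size`, `q_table`, and `get_discrete_state` from above; not the repository's exact code):

```python
alpha, gamma, epsilon = 0.1, 0.95, 1.0
reward_list = []

for episode in range(60_000):
    state, _ = env.reset()
    d_state = get_discrete_state(state, window_size)
    total_reward, done, step = 0.0, False, 0
    while not done and step < 120:
        if np.random.random() < epsilon:
            action = env.action_space.sample()             # explore
        else:
            action = int(np.argmax(q_table[d_state]))      # exploit
        new_state, reward, terminated, truncated, _ = env.step(action)
        d_new = get_discrete_state(new_state, window_size)
        if terminated:                                     # goal reached
            q_table[d_state + (action,)] = 0.0
        else:
            current_q = q_table[d_state + (action,)]
            max_future_q = np.max(q_table[d_new])
            q_table[d_state + (action,)] = (1 - alpha) * current_q \
                + alpha * (reward + gamma * max_future_q)
        d_state = d_new
        done = terminated or truncated
        total_reward += reward
        step += 1
    epsilon = max(0.01, epsilon * 0.9999)                  # decay exploration
    reward_list.append(total_reward)
```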
```python
reward_list.append(total_reward)
sn.lineplot(reward_list)
```

- Stores the total reward per episode.
- Plots a reward curve to visualize learning progress.
```python
eval_env = gym.make('MountainCar-v0', render_mode='human')
```

- A fresh environment is created with rendering enabled.
- The agent always takes the greedy action (no exploration), as sketched below.
- The car should reliably reach the goal after training.
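A greedy evaluation pass might look like this (a sketch; assumes the trained `q_table`, `window_size`, and `get_discrete_state` from above):

```python
eval_env = gym.make('MountainCar-v0', render_mode='human')
state, _ = eval_env.reset()
done = False
while not done:
    d_state = get_discrete_state(state, window_size)
    action = int(np.argmax(q_table[d_state]))   # always greedy, no exploration
    state, reward, terminated, truncated, _ = eval_env.step(action)
    done = terminated or truncated
eval_env.close()
```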
- Discretize the continuous state space.
- Initialize the Q-table.
- Train the agent for many episodes:
  - Explore early, exploit later.
  - Update Q-values after each step.
- Plot the reward curve.
- Run a final evaluation with the trained policy.
- Console messages: When the goal is reached during training (episode and step count).
- Reward curve plot: Shows improvement over episodes.
- Rendered environment: After training, the car reaches the goal in the visualization window.
- State space: 48 discrete states (a 4x12 grid).
- Action space: 4 discrete actions (up, down, left, right).
- Start: bottom-left corner.
- Goal: bottom-right corner (state 37 in this implementation).
- Cliff region: cells between start and goal on the bottom row. Stepping here yields large negative reward and episode termination.
```python
q_table = np.random.uniform(low=0, high=1, size=(48, 4))
```

- One row per state (48).
- One column per action (4).
- Initialized with random values.
For 10,000 episodes, the agent learns by trial and error:

- Reset the environment to get the starting state.
- While the episode is not finished (or step < 50):
  - Select an action:
    - With probability ε → random action (exploration).
    - Otherwise → the action with the highest Q-value for that state (exploitation).
  - Perform the action → observe `(new_state, reward, done)`.
  - Update the Q-table using the Q-learning rule.
  - If the goal (state 37) is reached → set the Q-value to 0 and mark the episode as done.
- Accumulate rewards and decay ε to encourage exploitation over time (a compact sketch follows).
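Because the states are already integer indices, the loop mirrors MountainCar's without any discretization (a compact sketch, not the repository's exact code):

```python
import gymnasium as gym
import numpy as np

env = gym.make('CliffWalking-v0')
q_table = np.random.uniform(low=0, high=1, size=(48, 4))
alpha, gamma, epsilon = 0.1, 0.95, 1.0

for episode in range(10_000):
    state, _ = env.reset()
    done, step = False, 0
    while not done and step < 50:
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))   # state indexes the table directly
        new_state, reward, terminated, truncated, _ = env.step(action)
        if terminated:                                # goal reached
            q_table[state, action] = 0.0
        else:
            q_table[state, action] = (1 - alpha) * q_table[state, action] \
                + alpha * (reward + gamma * np.max(q_table[new_state]))
        state, done, step = new_state, terminated or truncated, step + 1
    epsilon = max(0.01, epsilon * 0.999)              # decay exploration
```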
After each episode, the total reward is stored. At the end:

```python
sn.lineplot(reward_list)
plt.title('Reward per episode')
```

A curve of rewards per episode is plotted to visualize learning progress.
A fresh environment with rendering is created:

```python
eval_env = gym.make('CliffWalking-v0', render_mode='human')
```

- The agent acts greedily using the trained Q-table.
- The agent should learn to avoid the cliff and reach the goal.
- The process is displayed in the render window.
- Console logs: Messages when the agent successfully reaches the goal.
- Plot: Reward per episode curve.
- Render window: Visualization of the final policy navigating the gridworld.
- Unlike MountainCar, CliffWalking already has discrete states, so no discretization is needed.
- The main challenge is balancing exploration so the agent doesn’t keep falling into the cliff.
- These programs use tabular Q-learning with discretization — simple but effective for MountainCar and CliffWalking.
- More advanced approaches (e.g., Deep Q-Networks) can handle continuous spaces without manual discretization.