176 changes: 176 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/exploit.md
@@ -0,0 +1,176 @@
## Overview

Let's look at what we need to perform the attack.

### Socket to send the packet through

Different socket families handle routing and fragmentation differently.
We do not want to use upper layer protocols like TCP or UDP, because they perform their own fragmentation and we need to trigger fragmentation at the IP layer.

The second thing to consider is the kmalloc cache used to allocate struct sock. Most socket families have a dedicated cache, but some use a regular kmalloc(), giving us a simple way to reallocate the freed object without performing a cross-cache attack.

Finally, some sockets set the SOCK_RCU_FREE flag, which causes sk_destruct() to wait for an RCU grace period before freeing the sock object. This would also make exploitation much harder.

The socket family that fulfills all those requirements is AF_PACKET (used for sending raw packets at layer 2).

This means we need to set our own layer 2 and layer 3 headers and choose an output device for the packet.
No routing will be done, the packet will go straight to the output queue of a selected device.
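
To make "setting our own layer 2 and layer 3 headers" concrete, here is a minimal sketch that writes an Ethernet header followed by an IPv4 header into a buffer. The function names, the MAC and IP addresses and the TTL are assumptions picked for illustration; they are not taken from the exploit source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14
#define IP_HDR_LEN  20

/* RFC 1071 Internet checksum over `len` bytes (len must be even). */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)(data[i] << 8 | data[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Hypothetical helper: write an Ethernet header and a minimal IPv4 header
 * (no options, DF clear so the packet may be fragmented) into `buf`.
 * Addresses are given in host byte order. Returns the header bytes written.
 */
static size_t build_l2_l3_headers(uint8_t *buf, const uint8_t dst_mac[6],
                                  const uint8_t src_mac[6], uint32_t saddr,
                                  uint32_t daddr, uint8_t proto,
                                  uint16_t payload_len)
{
    uint8_t *ip = buf + ETH_HDR_LEN;
    uint16_t tot_len, csum;

    assert(payload_len <= 65535 - IP_HDR_LEN);
    tot_len = (uint16_t)(IP_HDR_LEN + payload_len);

    /* Layer 2: destination MAC, source MAC, EtherType 0x0800 (IPv4). */
    memcpy(buf, dst_mac, 6);
    memcpy(buf + 6, src_mac, 6);
    buf[12] = 0x08;
    buf[13] = 0x00;

    /* Layer 3: version 4, header length 5 words, everything else zeroed. */
    memset(ip, 0, IP_HDR_LEN);
    ip[0] = 0x45;
    ip[2] = (uint8_t)(tot_len >> 8);
    ip[3] = (uint8_t)(tot_len & 0xff);
    ip[8] = 64;                       /* TTL, arbitrary choice */
    ip[9] = proto;
    for (int i = 0; i < 4; i++) {
        ip[12 + i] = (uint8_t)(saddr >> (24 - 8 * i));
        ip[16 + i] = (uint8_t)(daddr >> (24 - 8 * i));
    }

    /* The checksum is computed with the checksum field still zero. */
    csum = inet_checksum(ip, IP_HDR_LEN);
    ip[10] = (uint8_t)(csum >> 8);
    ip[11] = (uint8_t)(csum & 0xff);

    return ETH_HDR_LEN + IP_HDR_LEN;
}
```

A sender would append the payload after these 34 bytes and pass the whole buffer to an AF_PACKET socket bound to the chosen output device, as described in packet(7).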

### Device driver to call ip_local_out()

Because we send our packets at layer 2, ip_send_skb() won't be called and we need to find another way to trigger ip_local_out().
Fortunately, it is used by the IPvlan driver:
```
static int ipvlan_process_v4_outbound(struct sk_buff *skb)
{
...
	skb_dst_set(skb, &rt->dst);

	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));

	err = ip_local_out(net, skb->sk, skb);
...
```

So our packets will be sent out of the IPvlan interface.
IPvlan needs a master ethernet device and we used the veth interface for that.


### A way to close the socket fd before the ip_defrag() call

When our packet reaches ip_defrag(), the socket won't be freed if it is still referenced by the open file descriptor.
We can call close() only after sendmsg() returns. The syscall returns after the packet is enqueued to the output device, so we could try to win a race and close the fd in time, but there is a simpler way.

The sch_plug queuing discipline can be used to stop packets from being dequeued from a network device until an "unplug" command is received through the netlink API.

So the steps of our exploit are:
1. "plug" the ipvlan interface
2. Send a packet
3. Close the socket
4. "unplug" the ipvlan interface

These are actually all the steps needed to exploit the vulnerability, if we exclude the setup needed beforehand.
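
As a small illustration of the "plug" and "unplug" commands in the steps above, the sketch below fills the option structure that sch_plug reads, using the definitions from the kernel UAPI header linux/pkt_sched.h. The helper name plug_request() is hypothetical, and the netlink message that would carry this structure (the exploit links against libnl for that) is left out.

```c
#include <assert.h>
#include <linux/pkt_sched.h>

/*
 * Hypothetical helper: build the sch_plug option payload.
 * plugged = 1 asks the qdisc to hold packets (TCQ_PLUG_BUFFER),
 * plugged = 0 asks it to release everything and stay open
 * (TCQ_PLUG_RELEASE_INDEFINITE).
 */
static struct tc_plug_qopt plug_request(int plugged)
{
    struct tc_plug_qopt opt = { 0 };

    assert(plugged == 0 || plugged == 1);
    opt.action = plugged ? TCQ_PLUG_BUFFER : TCQ_PLUG_RELEASE_INDEFINITE;
    return opt;
}
```

Step 1 would correspond to plug_request(1) and step 4 to plug_request(0); the send and close in between need no qdisc interaction.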

### Network tools

The exploit needs external iptables and ip (from iproute2 package) binaries to set up rules and network interfaces.
These tools are not available in the current kernelCTF root image, so the tar archive with binaries and supporting libraries is attached to the exploit binary as a custom ELF section and extracted using objcopy during execution.

## Triggering the IPv4 fragmentation

The obvious idea is to set the MTU on the outgoing interface (ipv1) to a low value, but then our send() will just return a "Message too long" error.
Instead, we must reroute our packet to another interface with a low MTU (ipv0). This is done using a DNAT rule.
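
To get a feeling for how a low MTU turns one packet into several fragments, here is a rough arithmetic sketch. The function name and the example MTU values are assumptions; the writeup does not state which MTU the exploit configures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of IPv4 fragments needed to carry
 * `l3_payload_len` bytes (everything after the IP header) over a link
 * with the given MTU. Every fragment except the last must carry a
 * multiple of 8 payload bytes.
 */
static size_t ipv4_fragment_count(size_t l3_payload_len, size_t mtu,
                                  size_t ip_hdr_len)
{
    size_t per_frag;

    assert(mtu >= ip_hdr_len + 8);
    if (l3_payload_len + ip_hdr_len <= mtu)
        return 1;                       /* fits, no fragmentation */

    per_frag = (mtu - ip_hdr_len) & ~(size_t)7;
    return (l3_payload_len + per_frag - 1) / per_frag;
}
```

For example, with a 20-byte header and an MTU of 576, each fragment carries at most 552 payload bytes, so a 1200-byte payload is split into three fragments.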

## Triggering ip_defrag()

Because we already have DNAT rules, the conntrack defrag hooks are installed and ip_defrag() will be called for each of our fragments, triggering the release of the sock object at the last fragment.

## Reallocating the victim object

To replace the victim object all we have to do is allocate from the kmalloc-2k cache on the same CPU.
This must be done before all the hooks finish, so there is no way to make these allocations from user space.
However, we can use whatever netfilter modules we want. There are many of them and some are bound to make new allocations.
This line of thinking leads us to the TEE target:
> The TEE target will clone a packet and redirect this clone to another machine on the local network segment.

Cloning a packet sounds great, as it involves copying the data we passed to the send() function.
There is a problem, though. Our packet's data needs to be larger than 1024 bytes to be allocated from kmalloc-2k, and the skb stores packets that large using a fragment list. When TEE clones the skb, pskb_copy() is called and only space for the head is allocated from the regular kmalloc; the rest is zero-copied by cloning the fraglist.

Fortunately, some netfilter modules need to look at the whole packet data in one piece (e.g. to search for patterns) instead of dealing with skb fragments.

One such example is the conntrack SIP helper. It calls skb_linearize(), which transforms a fragmented skb into a linear one. This involves allocating a buffer for all the data with kmalloc and copying the data there, which finally gives us a way to allocate from kmalloc-2k and overwrite the victim sock object with our data.

To summarize, by combining the TEE and SIP conntrack helper we are able to overwrite the victim sock object that will be used by the netfilter hooks.
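
The size constraint mentioned above (more than 1024 bytes to land in kmalloc-2k) follows from how the general-purpose kmalloc caches are sized. The sketch below is a simplified model assuming the usual x86_64 SLUB size classes; it ignores the extra metadata the kernel appends to skb data buffers, so it only approximates the real threshold. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model: size of the smallest general-purpose kmalloc cache
 * that fits a request of `size` bytes. Powers of two, plus the two
 * odd-sized caches kmalloc-96 and kmalloc-192.
 */
static size_t kmalloc_bucket(size_t size)
{
    size_t bucket = 8;

    assert(size > 0 && size <= 8192);
    if (size > 64 && size <= 96)
        return 96;
    if (size > 128 && size <= 192)
        return 192;
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}
```

Under this model a 1024-byte request still fits kmalloc-1k, while anything from 1025 to 2048 bytes is served from kmalloc-2k, the cache that holds the victim object.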

## Getting RIP control

Controlling the struct sock object may seem like an instant win at first, but we soon discover that netfilter hooks rarely use the socket context and never call function pointers from that object.

The solution is the ip_route_me_harder() function which is called in the mangle table if some IPv4 parameters like src/dst address, TOS or mark change after mangle rules are executed:

```
static unsigned int
ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
...
	/* Save things which could affect route */
	mark = skb->mark;
	iph = ip_hdr(skb);
	saddr = iph->saddr;
	daddr = iph->daddr;
	tos = iph->tos;

	ret = ipt_do_table(priv, skb, state);
	/* Reroute for ANY change. */
	if (ret != NF_DROP && ret != NF_STOLEN) {
		iph = ip_hdr(skb);

		if (iph->saddr != saddr ||
		    iph->daddr != daddr ||
		    skb->mark != mark ||
		    iph->tos != tos) {
			err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);
...
```

state->sk here is the pointer to our sock object.
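
The reroute condition in the kernel snippet above can be restated as a small user-space model: save the fields that influence routing, run the rules, then compare. The struct and function names below are hypothetical and only mirror the comparison; they are not kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the fields ipt_mangle_out() saves. */
struct route_key {
    uint32_t saddr;
    uint32_t daddr;
    uint32_t mark;
    uint8_t  tos;
};

/* True if any routing-relevant field changed while the rules ran. */
static bool needs_reroute(const struct route_key *before,
                          const struct route_key *after)
{
    assert(before && after);
    return before->saddr != after->saddr ||
           before->daddr != after->daddr ||
           before->mark  != after->mark  ||
           before->tos   != after->tos;
}
```

Changing just one of these fields in a mangle rule, for instance the mark, is enough to make the comparison true and reach ip_route_me_harder().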

ip_route_me_harder() calls xfrm_lookup() which examines sk->sk_policy and if the policy matches the current connection it eventually calls dst_alloc().
dst_alloc() calls the gc function pointer of the netns_xfrm.dst_ops struct and the netns_xfrm comes from the xfrm policy which is under our control.

So if we are able to craft a valid struct xfrm_policy that matches our connection, we will be able to get RIP control.

This policy is prepared in prepare_policy().
The fake object for the sock itself is simple - we just need to set the sk_policy pointer and sk_mark value.

The policy object takes a lot of space and has pointers to other objects like netns_xfrm, so we used the [direct mapping storage technique](../../CVE-2024-26923_lts_cos/docs/novel-techniques.md) to place it at a known address in the kernel address space.

## Pivot to ROP

When the gc pointer is called in dst_alloc(), the RDI register contains a pointer to dst_ops, which is part of our fake netns_xfrm object.

The following gadgets were used to pivot to the ROP chain placed at dst_ops + 0x10 (our gc pointer is at dst_ops + 0x08).

```
mov r8,QWORD PTR [rdi+0xc8]
mov eax,0x1
test r8,r8
je ffffffff82185d21
mov rsi,rdi
mov rcx,r14
mov rdi,rbp
mov rdx,r15
call ffffffff82427a60 <__x86_indirect_thunk_r8>
```

This copies RDI to RSI. Then:

```
push rsi
jmp qword ptr [rsi + 0x39]
```

and finally

```
pop rsp
pop rbp
pop rbx
ret
```

## Second pivot

To get more room for our ROP chain we move to a second location in the direct mapping using a simple pop rsp ; ret gadget.

## Privilege escalation

Our ROP chain is executed from the ksoftirqd context, so we can't do a traditional commit_creds() to modify the current process's privileges.

We could try locating our exploit process and changing its privileges, but we decided to go with a different approach - we patch the kernel, creating a backdoor that grants root privileges to any process that executes a given syscall.

We chose the rarely used kexec_file_load() syscall and overwrote its code with our get_root function, which does all the traditional privilege escalation/namespace escape stuff: commit_creds(init_cred), switch_task_namespaces(pid, init_nsproxy) etc.

This function also returns a special value (0x777) that our user space code can use to detect if the system was already compromised.

Patching the kernel function is done in rop_patch_kernel_code() - it calls set_memory_rw() on the destination memory and uses copy_user_generic() to write the new code there.
65 changes: 65 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-26921_lts_cos/docs/vulnerability.md
@@ -0,0 +1,65 @@
## Requirements to trigger the vulnerability

- CAP_NET_ADMIN in a namespace is required
- Kernel configuration: CONFIG_INET
- User namespaces required: Yes

## Commit which introduced the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7026b1ddb6b8d4e6ee33dc2bd06c0ca8746fa7ab

## Commit which fixed the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18685451fc4e546fc0e718580d32df3c0e5c8272

## Affected kernel versions

Introduced in 4.1. Fixed in 6.6.25, 5.10.226 and other stable trees.

## Affected component, subsystem

net/ipv4

## Description

ip_local_out() is the function responsible for sending locally generated IPv4 packets.
It calls the NF_INET_LOCAL_OUT netfilter hooks and eventually dst_output().

The usual call to ip_local_out() looks like this:
```
int ip_send_skb(struct net *net, struct sk_buff *skb)
{
	int err;

	err = ip_local_out(net, skb->sk, skb);
	if (err) {
		if (err > 0)
			err = net_xmit_errno(err);
		if (err)
			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
	}

	return err;
}
```

A pointer to the socket associated with the skb is passed as an argument to ip_local_out() and then to all the netfilter hooks:

```
int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
...
	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
		       net, sk, skb, NULL, skb_dst(skb)->dev,
		       dst_output);
}
```

The skb holds a reference to its socket. Under normal conditions, the skb is released only after its output path has finished, or after it has been received by the upper layers of the input stack (when the outgoing packet is routed back to a local interface).
This ensures the associated socket is valid while the netfilter hooks are executing.

ip_defrag() is most often called in the input path and it calls skb_orphan()/kfree_skb() on the fragment skb, assuming it is no longer needed.
However, ip_defrag() can also be called in the output path by the netfilter conntrack hook ipv4_conntrack_defrag().

If that happens, the skb is released, and if it held the last reference to the socket, the socket is released as well. This causes a use-after-free when the subsequent hooks run and later in ip_finish_output().
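
The lifetime problem described above can be illustrated with a toy model. It is deliberately simplified: all names are hypothetical, the kernel's separate socket accounting is collapsed into a single counter, and "freeing" only sets a flag so the example stays safe to run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct sock and struct sk_buff. */
struct toy_sock {
    int  refcnt;
    bool freed;
};

struct toy_skb {
    struct toy_sock *sk;
};

/* Drop one reference; the last one "frees" the socket. */
static void toy_sock_put(struct toy_sock *sk)
{
    assert(sk->refcnt > 0);
    if (--sk->refcnt == 0)
        sk->freed = true;   /* the real kernel would free the object here */
}

/* Attach the packet to its sending socket, taking a reference. */
static void toy_skb_set_owner(struct toy_skb *skb, struct toy_sock *sk)
{
    sk->refcnt++;
    skb->sk = sk;
}

/* Model of orphaning: the packet gives up its socket reference. */
static void toy_skb_orphan(struct toy_skb *skb)
{
    if (skb->sk) {
        toy_sock_put(skb->sk);
        skb->sk = NULL;
    }
}
```

In this model the socket starts with one reference for the open file descriptor and gains a second one when the packet is queued. If the descriptor is closed first and the packet is then orphaned during defragmentation, the counter reaches zero while the hook chain still holds the sk pointer it was called with. That pointer is now dangling.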
@@ -0,0 +1,10 @@
INCLUDES = -I/usr/include/libnl3
LIBS = -L. -pthread -lnl-cli-3 -lnl-route-3 -lnl-3 -ldl
CFLAGS = -fomit-frame-pointer -static -fcf-protection=none

exploit: exploit.c kernelver_16919.450.26.h
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)
	objcopy --add-section tools=tools.tar.gz $@

prerequisites:
	sudo apt-get install libnl-cli-3-dev libnl-route-3-dev
Binary file not shown.