ftp: add --ftp-recurse-symlink-dirs to descend into symlink directories #36

Open
coleleavitt wants to merge 36 commits into mirror:master from coleleavitt:feature/ftp-recurse-symlink-dirs

Conversation

@coleleavitt

Summary

Adds a new --ftp-recurse-symlink-dirs option that lets wget -r descend through FTP symlinks whose target is a directory on the server. Without it, such entries are either materialized as local symlinks (--retr-symlinks=no) or attempted as files (--retr-symlinks, the default) — but never explored. Whole subtrees reachable only through server-side symlinks are silently skipped.

Motivation

Real-world FTP servers routinely expose a curated tree via symlinks into the underlying storage volume. Example layout from a vendored firmware server:

/nms/PrimeTrex_S/CRBT/Firmware_3Tu8Z9 -> /tmsdata/nms/PrimeTrex_S/CRBT/Firmware_3Tu8Z9
/nms/PrimeTrex_S/CRBT/Log             -> /tmsdata/nms/PrimeTrex_S/CRBT/Log
/nms/PrimeTrex_S/CRBT/Present_conf    -> /tmsdata/nms/PrimeTrex_S/CRBT/Present_conf
...

With stock wget -r --retr-symlinks the user gets the index.html listing and nothing else — no way to actually pull the subtree without falling back to lftp mirror --dereference or scripting a per-symlink wget per subdir.

Design

  • Per-process struct hash_table keyed by host:port:abs-path where abs-path is the symlink target canonicalized against the current remote directory (relative targets resolved against u->dir, trailing slashes stripped).
  • In ftp_retrieve_list(), when opt.ftp_recurse_symlink_dirs && opt.recursive && f->linkto, an FT_SYMLINK entry is reclassified as FT_DIRECTORY after a cycle check, so the existing ftp_retrieve_dirs()/ftp_retrieve_glob() path issues a CWD using the symlink's own name. The server resolves the target; no client-side type probe is needed.
  • A target already descended through is skipped with LOG_VERBOSE instead of recursing again, so circular link graphs terminate.
  • Defaults to off — a hostile FTP server could otherwise use crafted symlinks to amplify crawl cost or redirect descent out of the intended subtree.

Files touched

| File | Lines | Change |
|------|-------|--------|
| src/options.h | +3 | new ftp_recurse_symlink_dirs flag |
| src/init.c | +6 | register command + default init |
| src/main.c | +3 | long option + help text |
| src/ftp.c | +83 | cycle table + helpers + FT_SYMLINK hook |
| doc/wget.texi | +23/-2 | document new option; fix stale "may be added in the future" note |
| NEWS | +4 | release note |

Validation

Built clean with ./bootstrap --skip-po && ./configure --without-ssl --disable-iri && make -j. Three source files (ftp.c, init.c, main.c) rebuild with no new warnings.

Option parser correctly handles =on/=off/--no- prefixing and rejects garbage values:

```
$ ./wget --ftp-recurse-symlink-dirs=bogus
wget: --ftp-recurse-symlink-dirs: Invalid boolean `bogus', use `on' or `off'.
```

End-to-end tested against a live vsftpd 3.0.3 server with the symlink layout shown above:

  • Without the option: 1 file fetched (index.html of the listing), 0 subdirectories descended — reproduces the current behavior.
  • With -r -l 2 --ftp-recurse-symlink-dirs: 18 firmware files fetched from the symlinked Firmware_3Tu8Z9/ subtree. Debug output shows each symlink being reclassified:

```
Reclassifying symlink Firmware_3Tu8Z9 -> /tmsdata/nms/PrimeTrex_S/CRBT/Firmware_3Tu8Z9 as directory for recursion.
Reclassifying symlink Log -> /tmsdata/nms/PrimeTrex_S/CRBT/Log as directory for recursion.
Reclassifying symlink Present_conf -> /tmsdata/nms/PrimeTrex_S/CRBT/Present_conf as directory for recursion.
```

  • With -r -l 5 to exercise the cycle guard: 92 files across 10 unique directories, process terminated normally. No infinite loop; the hash set prevents re-descent into already-visited targets.

Note on upstream

This PR targets the GitHub mirror for visibility. The canonical GNU Wget workflow is `git format-patch` + `git send-email bug-wget@gnu.org`; a patch series suitable for that workflow is attached to my fork and can be mailed separately.

rockdaboot and others added 30 commits October 27, 2024 18:58
* src/utils.h: wget.h includes config.h which should be included before
    any other header files. So swap the order of includes
>= GCC 12 reports an instance of -Warray-bounds in setval_internal_tilde
with the default -DNDEBUG:
```
In function ‘setval_internal_tilde’,
    inlined from ‘run_wgetrc’ at init.c:710:16:
init.c:940:17: error: array subscript [0, 167] is outside array bounds of ‘const struct <anonymous>[168]’ [-Werror=array-bounds=]
  940 |   if (((commands[comind].action == cmd_file) ||
      |         ~~~~~~~~^~~~~~~~
init.c: In function ‘run_wgetrc’:
init.c:135:3: note: while referencing ‘commands’
  135 | } commands[] = {
      |   ^~~~~~~~
```

setval_internal_tilde calls setval_internal and stores the result in ret;
setval_internal *does* check for if comind is out-of-bounds, but we only
check that *after* dereferencing commands[comind]. Swap the order in the
if() to fix that so we only dereference if we know it's safe.

ChangeLog:

	* src/init.c (setval_internal_tilde): Check 'ret' earlier.

Copyright-paperwork-exempt: Yes
Gnulib does not currently implement the `F_GETFL` and `F_SETFL` flags on
mingw. As a result building Wget failed on Windows. We don't currently
have a good solution to this problem and Windows users of Wget will not
get the new functionality of non-blocking file I/O reads. The current
solution is simply to provide a Windows-specific no-op stub.

 * src/mswindows.c (set_fd_nonblocking): Provide an empty stub function
   for Windows systems
 * src/mswindows.h (set_fd_nonblocking): Export the function so it is
   available in util.c where it is needed
 * src/util.c (set_fd_nonblocking): New function to set a fd as
   O_NONBLOCK on non-Windows, non-MSDOS systems
   (wget_read_from_file): Split the O_NONBLOCK code into
   set_fd_nonblocking and make all file I/O non-blocking, not just stdin
* src/retr.c (retrieve_from_file): Remove a dead store to status
  (retrieve_from_url_list): Initialize status to a default of RETROK
Clang's static analyzer believes that it is possible for a realloc
request to return NULL. However, the `xrealloc` implementation used
in Wget will trigger a program crash if enough memory couldn't be
found. I'm not sure why the analyzer isn't able to see that. But let's
pacify it by asserting here that xrealloc always returns a non-NULL pointer.

 * src/wget.h (DO_REALLOC): Assert that xrealloc will always return
   non-null
* doc/wget.texi: Add documentation for removed support for shorthand URLs.
* src/html-url.c (src/html-url.c): Call maybe_prepend_scheme.
* src/main.c (main): Likewise.
* src/retr.c (getproxy): Likewise.
* src/url.c: Rename definition of rewrite_shorthand_url to maybe_prepend_scheme,
  add new function is_valid_port.
* src/url.h: Rename declaration of rewrite_shorthand_url to maybe_prepend_scheme.

Reported-by: Goni Golan <gonig@jfrog.com>
* NEWS: Record release date.
* NEWS: Add header line for next release.
* .prev-version: Record previous version.
* cfg.mk (old_NEWS_hash): Auto-update.
doc/wget.texi: In the documentation for --tries, clarify that Wget tries
20 times, not retries. It actually retries only 19 times.

Reported-by: Gert Robben <zuvlrx5@gert.gr>
Darshit Shah and others added 6 commits February 2, 2026 22:04
The new sc_codespell check introduces a lot of false positives. Until we
can clean it up, it is better to disable it and allow our CI pipelines
to work.
This reverts commit 0c0bfeb.

That was a local, untested change that should never have been pushed.
But I messed up, and pushed it when making another change to unblock the
pipelines. Teaches you to never work on master.
When recursing over FTP, wget treats entries of type FT_SYMLINK by
either creating a local symbolic link (--retr-symlinks=no) or trying to
fetch the target as a plain file (--retr-symlinks, the default).
Neither branch descends when the link's target is a directory on the
server, so whole subtrees reachable only through server-side symlinks
are silently skipped during --recursive retrieval.

Introduce --ftp-recurse-symlink-dirs.  When set together with
--recursive, a symlink entry with a non-empty link target is
reclassified as FT_DIRECTORY before the switch, so the existing
ftp_retrieve_dirs()/ftp_retrieve_glob() path issues a CWD using the
symlink's own name and the server resolves the target.  No client-side
type probe is needed.

Cycle detection is implemented as a per-process string hash keyed by
'host:port:abs-path' where abs-path is the symlink target canonicalised
against the current remote directory (relative targets resolved against
u->dir, trailing slashes stripped).  A target that has already been
descended through is skipped with a LOG_VERBOSE diagnostic instead of
recursing again, so circular link graphs terminate.

The option defaults to off: a hostile FTP server could otherwise use
crafted symlinks to amplify crawl cost or redirect descent out of the
intended subtree.

* src/options.h (struct options): Add ftp_recurse_symlink_dirs.
* src/init.c (commands): Register 'ftprecursesymlinkdirs' as
cmd_boolean.
(defaults): Initialise ftp_recurse_symlink_dirs to false.
* src/main.c (long_options): Add --ftp-recurse-symlink-dirs.
(print_help): Document it in the FTP options block.
* src/ftp.c: Include hash.h.
(ftp_visited_symlinks): New static string hash table.
(ftp_symlink_cycle_key, ftp_symlink_already_visited)
(ftp_symlink_mark_visited): New helpers.
(ftp_retrieve_list) <FT_SYMLINK>: When
opt.ftp_recurse_symlink_dirs && opt.recursive && f->linkto, consult
the cycle cache, mark the target visited, and reclassify the entry
as FT_DIRECTORY so ftp_retrieve_dirs descends it.
* doc/wget.texi (FTP Options): Document the option and update the
stale 'may be added in the future' note under --retr-symlinks.
* NEWS: Mention the new option.