ftp: add --ftp-recurse-symlink-dirs to descend into symlink directories#36
Open
coleleavitt wants to merge 36 commits intomirror:masterfrom
Open
ftp: add --ftp-recurse-symlink-dirs to descend into symlink directories#36coleleavitt wants to merge 36 commits intomirror:masterfrom
coleleavitt wants to merge 36 commits intomirror:masterfrom
Conversation
* src/utils.h: wget.h includes config.h which should be included before
any other header files. So swap the order of includes
>= GCC 12 reports an instance of -Warray-bounds in setval_internal_tilde
with the default -DNDEBUG:
```
In function ‘setval_internal_tilde’,
inlined from ‘run_wgetrc’ at init.c:710:16:
init.c:940:17: error: array subscript [0, 167] is outside array bounds of ‘const struct <anonymous>[168]’ [-Werror=array-bounds=]
940 | if (((commands[comind].action == cmd_file) ||
| ~~~~~~~~^~~~~~~~
init.c: In function ‘run_wgetrc’:
init.c:135:3: note: while referencing ‘commands’
135 | } commands[] = {
| ^~~~~~~~
```
setval_internal_tilde calls setval_internal and stores the result in ret;
setval_internal *does* check for if comind is out-of-bounds, but we only
check that *after* dereferencing commands[comind]. Swap the order in the
if() to fix that so we only dereference if we know it's safe.
ChangeLog:
* src/init.c (setval_internal_tilde): Check 'ret' earlier.
Copyright-paperwork-exempt: Yes
Gnulib does not currently implement the `F_GETFL` and `F_SETFL` flags on mingw. As a result building Wget failed on Windows. We don't currently have a good solution to this problem and Windows users of Wget will not get the new functionality of non-blocking file I/O reads. The current solution is simply to provide a Windows specific no-op stub * src/mswindows.c (set_fd_nonblocking): Provide an empty stub function for Windows systems * src/mswindows.h (set_fd_nonblocking): Export the function so it is available in util.c where it is needed * src/util.c (set_fd_nonblocking): New function to set a fd as O_NONBLOCK on non-Windows, non-MSDOS systems (wget_read_from_file): Split the O_NONBLOCK code into set_fd_nonblocking and make all file I/O non-blocking, not just stdin
* src/retr.c (retrieve_from_file): Remove a dead store to status (retrieve_from_url_list): Initialize status to a default of RETROK
Clang's static analyzer believes that it is possible for a realloc request to return NULL. However, the `xrealloc` implementation used in Wget will trigger a program crash if enough memory couldn't be found. I'm not sure why the analyzer isn't able to see that. But let's pacify it by asserting here that xrealloc will always return a non-null * src/wget.h (DO_REALLOC): Assert that xrealloc will always return non-null
…ild)" This reverts commit 04053a7.
* doc/wget.texi: Add documentation for removed support for shorthand URLs. * src/html-url.c (src/html-url.c): Call maybe_prepend_scheme. * src/main.c (main): Likewise. * src/retr.c (getproxy): Likewise. * src/url.c: Rename definition of rewrite_shorthand_url to maybe_prepend_scheme, add new function is_valid_port. * src/url.h: Rename declaration of rewrite_shorthand_url to maybe_prepend_scheme. Reported-by: Goni Golan <gonig@jfrog.com>
* NEWS: Record release date.
* NEWS: Add header line for next release. * .prev-version: Record previous version. * cfg.mk (old_NEWS_hash): Auto-update.
doc/wget.texi: In the documentation for --tries, clarify that Wget tries 20 times, not retries. It actually retries only 19 times Reported-by: Gert Robben <zuvlrx5@gert.gr>
Copyright-paperwork-exempt: Yes
The new sc_codespell check introduces a lot of false positives. Until we can clean it up, it is better to disable it and allow our CI pipelines to work.
This reverts commit 0c0bfeb. That was a local, untested change that should never have been pushed. But I messed up, and pushed it when making another change to unblock the pipelines. Teaches you to never work on master.
When recursing over FTP, wget treats entries of type FT_SYMLINK by either creating a local symbolic link (--retr-symlinks=no) or trying to fetch the target as a plain file (--retr-symlinks, the default). Neither branch descends when the link's target is a directory on the server, so whole subtrees reachable only through server-side symlinks are silently skipped during --recursive retrieval. Introduce --ftp-recurse-symlink-dirs. When set together with --recursive, a symlink entry with a non-empty link target is reclassified as FT_DIRECTORY before the switch, so the existing ftp_retrieve_dirs()/ftp_retrieve_glob() path issues a CWD using the symlink's own name and the server resolves the target. No client-side type probe is needed. Cycle detection is implemented as a per-process string hash keyed by 'host:port:abs-path' where abs-path is the symlink target canonicalised against the current remote directory (relative targets resolved against u->dir, trailing slashes stripped). A target that has already been descended through is skipped with a LOG_VERBOSE diagnostic instead of recursing again, so circular link graphs terminate. The option defaults to off: a hostile FTP server could otherwise use crafted symlinks to amplify crawl cost or redirect descent out of the intended subtree. * src/options.h (struct options): Add ftp_recurse_symlink_dirs. * src/init.c (commands): Register 'ftprecursesymlinkdirs' as cmd_boolean. (defaults): Initialise ftp_recurse_symlink_dirs to false. * src/main.c (long_options): Add --ftp-recurse-symlink-dirs. (print_help): Document it in the FTP options block. * src/ftp.c: Include hash.h. (ftp_visited_symlinks): New static string hash table. (ftp_symlink_cycle_key, ftp_symlink_already_visited) (ftp_symlink_mark_visited): New helpers. (ftp_retrieve_list) <FT_SYMLINK>: When opt.ftp_recurse_symlink_dirs && opt.recursive && f->linkto, consult the cycle cache, mark the target visited, and reclassify the entry as FT_DIRECTORY so ftp_retrieve_dirs descends it. * doc/wget.texi (FTP Options): Document the option and update the stale 'may be added in the future' note under --retr-symlinks. * NEWS: Mention the new option.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a new
--ftp-recurse-symlink-dirsoption that letswget -rdescend through FTP symlinks whose target is a directory on the server. Without it, such entries are either materialized as local symlinks (--retr-symlinks=no) or attempted as files (--retr-symlinks, the default) — but never explored. Whole subtrees reachable only through server-side symlinks are silently skipped.Motivation
Real-world FTP servers routinely expose a curated tree via symlinks into the underlying storage volume. Example layout from a vendored firmware server:
With stock
wget -r --retr-symlinksthe user gets theindex.htmllisting and nothing else — no way to actually pull the subtree without falling back tolftp mirror --dereferenceor scripting a per-symlinkwgetper subdir.Design
struct hash_tablekeyed byhost:port:abs-pathwhere abs-path is the symlink target canonicalized against the current remote directory (relative targets resolved againstu->dir, trailing slashes stripped).ftp_retrieve_list(), whenopt.ftp_recurse_symlink_dirs && opt.recursive && f->linkto, anFT_SYMLINKentry is reclassified asFT_DIRECTORYafter a cycle check so the existingftp_retrieve_dirs()→ftp_retrieve_glob()path issues a CWD using the symlink's own name. The server resolves the target; no client-side type probe is needed.LOG_VERBOSEinstead of recursing again, so circular link graphs terminate.Files touched
src/options.hftp_recurse_symlink_dirsflagsrc/init.csrc/main.csrc/ftp.cFT_SYMLINKhookdoc/wget.texiNEWSValidation
Built clean with
./bootstrap --skip-po && ./configure --without-ssl --disable-iri && make -j. Three source files (ftp.c, init.c, main.c) rebuild with no new warnings.Option parser correctly handles
=on/=off/--no-prefixing and rejects garbage values:```
$ ./wget --ftp-recurse-symlink-dirs=bogus
wget: --ftp-recurse-symlink-dirs: Invalid boolean `bogus', use `on' or `off'.
```
End-to-end tested against a live vsftpd 3.0.3 server with the symlink layout shown above:
index.htmlof the listing), 0 subdirectories descended — reproduces the current behavior.-r -l 2 --ftp-recurse-symlink-dirs: 18 firmware files fetched from the symlinkedFirmware_3Tu8Z9/subtree. Debug output shows each symlink being reclassified:```
Reclassifying symlink Firmware_3Tu8Z9 -> /tmsdata/nms/PrimeTrex_S/CRBT/Firmware_3Tu8Z9 as directory for recursion.
Reclassifying symlink Log -> /tmsdata/nms/PrimeTrex_S/CRBT/Log as directory for recursion.
Reclassifying symlink Present_conf -> /tmsdata/nms/PrimeTrex_S/CRBT/Present_conf as directory for recursion.
```
-r -l 5to exercise the cycle guard: 92 files across 10 unique directories, process terminated normally. No infinite loop; the hash set prevents re-descent into already-visited targets.Note on upstream
This PR targets the GitHub mirror for visibility. The canonical GNU Wget workflow is `git format-patch` + `git send-email bug-wget@gnu.org`; a patch series suitable for that workflow is attached to my fork and can be mailed separately.