
workspace import-dir: default-exclude .git, .databricks, node_modules #5118

Open
jamesbroadhead wants to merge 3 commits into main from jb/import-dir-default-exclude-git

jamesbroadhead commented Apr 29, 2026

Summary

databricks workspace import-dir walks the source tree and copies every entry into the workspace verbatim — it has no awareness of .gitignore or default exclusions. This change adds a name-based skip for .git, .databricks, and node_modules directories during the walk. .gitignore and other dotfiles at the root remain copied. If a user explicitly passes .git (or any of the others) as the source root, that root is still copied — the skip rule applies to entries encountered during recursion.

Motivation: align import-dir with sync's existing defaults

databricks sync already hard-codes skips for two of these directories, .git and .databricks, which are the ones that cause the most trouble:

  • libs/git/repository.go: under the comment "// Always ignore root .git directory.", .git is added to the default ignore rules unconditionally.
  • libs/git/view.go (SetupDefaults) — // Hard code .databricks ignore pattern so that we never sync it (irrespective of .gitignore patterns).

So sync and import-dir currently produce different workspace contents for the same source tree: sync skips .git/ and .databricks/, import-dir copies them. This PR closes that gap for import-dir so the two commands behave consistently.

node_modules is the one entry that goes beyond what sync does by default. For any project with a typical .gitignore, sync would already skip it via gitignore rules; import-dir ignores .gitignore entirely, so adding it to the name-based skip list keeps the behavior aligned with what users get from sync.

Why this matters in practice

For users who land on import-dir (typically via symmetry with the documented export-dir) and then run databricks apps deploy --source-code-path:

  1. The local repo's .git/config (often containing the template-repo origin URL) ends up at /app/python/source_code/.git/ in the running app container.
  2. Local bundle cache .databricks/ overwrites whatever the bundle pipeline put in the remote workspace.
  3. JS/TS apps drag node_modules/ along — large, slow to upload, and re-installed in the runtime anyway.

The canonical answer for the apps-deploy flow is databricks sync (which the official Apps docs recommend). This PR is not a substitute for that — it just brings import-dir's defaults into line with sync's for users who reach for it anyway.

Test plan

  • Unit tests covering: root .git/ skipped, nested .git/ skipped, .databricks/ skipped, node_modules/ skipped, .gitignore file kept, explicit .git root copied (escape hatch).
  • go test ./cmd/workspace/workspace/ — pass
  • golangci-lint run ./cmd/workspace/workspace/ — clean
  • Existing integration TestImportDir — unchanged, no .git in its testdata so behavior is identical.
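
The cases in the test plan can be written down as a small decision table. The sketch below is illustrative only and is not the PR's test file: skipCase, skipEntry, skipCases, and failedCases are hypothetical names, and the predicate assumes the rule depends only on the entry's base name, whether it is a directory, and whether it is the source root.

```go
package main

// skipCase describes one entry met by the walker and the expected decision.
type skipCase struct {
	desc   string
	name   string // base name of the entry
	isDir  bool
	isRoot bool // true when the entry is the source root itself
	want   bool // true when the entry should be skipped
}

// skipEntry is a hypothetical predicate standing in for the PR's skip rule.
func skipEntry(name string, isDir, isRoot bool) bool {
	if isRoot || !isDir {
		return false
	}
	switch name {
	case ".git", ".databricks", "node_modules":
		return true
	}
	return false
}

// skipCases mirrors the bullet list in the test plan.
var skipCases = []skipCase{
	{"root-level .git/ skipped", ".git", true, false, true},
	{"nested .git/ skipped", ".git", true, false, true},
	{".databricks/ skipped", ".databricks", true, false, true},
	{"node_modules/ skipped", "node_modules", true, false, true},
	{".gitignore file kept", ".gitignore", false, false, false},
	{"explicit .git root copied", ".git", true, true, false},
}

// failedCases returns the description of every case whose decision
// differs from the expected one.
func failedCases() []string {
	var failed []string
	for _, c := range skipCases {
		if skipEntry(c.name, c.isDir, c.isRoot) != c.want {
			failed = append(failed, c.desc)
		}
	}
	return failed
}
```

Because the rule is name-based, the root-level and nested .git cases reduce to the same input here; a test against a real directory tree is still needed to show that recursion reaches the nested one.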

This pull request and its description were written by Isaac.

The previous walker copied every entry under the source tree into the
workspace verbatim. That has two practical consequences for users
deploying Databricks Apps via `databricks workspace import-dir` followed
by `databricks apps deploy`:

1. The local repo's `.git/config` (often containing the template-repo
   origin URL, sometimes cached credentials) ends up at
   `/app/python/source_code/.git/` in the running app container.
2. Local bundle cache `.databricks/` overwrites whatever the bundle
   pipeline put in the remote workspace.

Empirically reproduced on a probe deployment (deploy04-probe-jb on
e2-dogfood.staging) — the running container had a full `.git/` tree
including HEAD, config, objects, refs, hooks. CoDA
(github.com/datasciencemonkey/coding-agents-databricks-apps) ships an
in-app `_reinit_app_git()` to scrub this on every startup, and its
CLAUDE.md warns "never move .git folder to the workspace if you're
running workspace import" — that workaround is the bug surface this
change closes.

Reported as DEPLOY-04 #2 in Tushar's "Apps Gaps That Matter to EMEA
Apps" doc.

Skip is name-based and applied during the walk; if a user explicitly
passes `.git` (or `.databricks`) as the source root, that root is still
copied — the rule only fires on entries encountered during recursion.
`.gitignore` and other dot-files at the root remain copied as before.

Co-authored-by: Isaac
@github-actions

Waiting for approval

Based on git history, these people are best suited to review:

  • @pietern -- recent work in cmd/workspace/workspace/

Eligible reviewers: @andrewnester, @anton-107, @denik, @renaudhartert-db, @shreyas-goenka, @simonfaltum

Suggestions based on git history. See OWNERS for ownership rules.

Same rationale as .git/.databricks: gets uploaded by accident, large,
re-installed in the runtime anyway.

Co-authored-by: Isaac
jamesbroadhead changed the title from "workspace import-dir: default-exclude .git and .databricks directories" to "workspace import-dir: default-exclude .git, .databricks, node_modules" on May 5, 2026