Skip to content

Conversation

@cardmagic
Copy link
Owner

@cardmagic cardmagic commented Dec 30, 2025

Summary

Implements #119 - a full-featured command-line interface for the classifier gem with model registry support.

Core CLI Features

  • Default action is classification - just run classifier 'text to classify'
  • Supports all classifier types: Bayes, LSI, KNN, and Logistic Regression
  • Commands: train, info, fit (LR only), search (LSI), related (LSI)
  • Sensible defaults: ./classifier.json model file, bayes model type
  • Stdin support: pipe text for training or classification
  • JSON output: classifier info outputs JSON for jq filtering

Model Registry (like Homebrew taps)

Pre-trained models can be downloaded from public GitHub repositories:

  • Default registry: github.com/cardmagic/classifier-models
  • Custom registries: Any GitHub repo with the same structure (use @user/repo:model syntax)

New Commands:

  • classifier models - List models in default registry
  • classifier models @user/repo - List models in custom registry
  • classifier pull <model> - Download model from default registry
  • classifier pull @user/repo:model - Download model from custom registry
  • classifier pull @user/repo - Download ALL models from registry
  • classifier push - Shows instructions for contributing a model

New Options:

  • -r, --remote MODEL - Use remote model directly (downloads and caches automatically)
  • -o, --output FILE - Output path for pull command

Cache System:

  • Downloaded models are cached in ~/.classifier/models/
  • Models from custom registries are stored in ~/.classifier/models/@user/repo/

Developer Experience

  • Running classifier with no model shows a getting started guide with copy-paste examples
  • Running classifier with a model but no input shows model info and usage hints
  • All examples in the guide actually work when copy-pasted

Example Usage

# Use pre-trained model directly
classifier -r spam-filter "Is this spam?"
# => ham

# Use model from custom registry
classifier -r @nlp-models/classifiers:sentiment "I love this!"
# => positive

# Download models for offline use
classifier pull spam-filter
classifier pull @company/internal-models:custom-classifier

# List available models
classifier models
classifier models @user/repo

# Train your own
echo 'buy viagra now free pills' | classifier train spam
echo 'meeting scheduled for tomorrow' | classifier train ham

# Classify
classifier 'free money buy now'        # => spam
classifier 'meeting postponed'          # => ham

# Model info (JSON)
classifier info | jq '.categories'

# LSI semantic search
echo 'ruby programming language' | classifier train docs -m lsi
classifier search 'programming'

Environment Variables

  • CLASSIFIER_MODEL - Default model path
  • CLASSIFIER_TYPE - Default classifier type
  • CLASSIFIER_REGISTRY - Default registry (default: cardmagic/classifier-models)
  • CLASSIFIER_CACHE - Cache directory (default: ~/.classifier)

Test plan

  • 658 tests passing
  • All CLI examples in getting started guide verified working
  • JSON info output works with jq
  • Model registry commands tested with mocked HTTP responses
  • Remote classification with -r flag tested
  • Cache system tested

Closes #119

Implements issue #119 with a full-featured command-line interface:

- Default action is classification (no subcommand needed)
- Supports Bayes, LSI, KNN, and Logistic Regression models
- Commands: train, info, fit (LR only), search (LSI), related (LSI)
- Flags: -f (file), -m (model type), -p (probabilities), -k (KNN neighbors)
- Reads from stdin when no text argument provided
- Sensible defaults: ./classifier.json, bayes model type

Also:
- Add Classifier::VERSION as single source of truth
- LR now requires explicit fit() before classification
- Add add_category() method to LR for dynamic category creation

Closes #119
Instead of showing "Error: model not found" when running `classifier`
with no arguments and no model file, show a friendly getting started
guide with examples for training, classifying, and LSI semantic search.
When a model exists but classifier is run without any text to classify,
show the model info and usage hints instead of silently exiting.
The info command now outputs pretty-printed JSON with detailed stats:
- file: model path
- type: classifier type
- categories: list of categories
- category_stats: per-category word counts (Bayes)
- documents: document count (LSI, KNN)
- items: list of items (LSI)
- fitted: fit status (LR)
- LSI examples now use stdin training like Bayes examples
- Use separate lsi.json file to avoid conflicts
- All examples can be copy-pasted and run immediately
- Add defaults for --learning-rate (0.1), --regularization (0.01), --max-iterations (100)
- Remove -f lsi.json from LSI examples for simpler copy-paste
@cardmagic cardmagic self-assigned this Dec 30, 2025
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 30, 2025

Greptile Summary

Implements a full-featured CLI for training and classifying text across all classifier types (Bayes, LSI, KNN, Logistic Regression). The implementation successfully provides sensible defaults, stdin support, JSON output for info command, and proper error handling for unfitted models.

Key Changes:

  • Added exe/classifier executable and lib/classifier/cli.rb (562 lines) with commands: train, classify (default), info, fit, search, related
  • Modified lib/classifier/logistic_regression.rb to add add_category() method and require explicit fit() before classification
  • Added comprehensive test coverage (551 new test lines across 3 test files)
  • All 638 tests passing, JSON info output works correctly

Style Issues Found:

  • Multiple nested conditionals in command_classify (lines 336-344, 348-357) and command_train (lines 171-181) violate the style guide requirement to prefer guard clauses
  • 11 redundant comments that restate what the code already shows (e.g., "For LSI, train with file paths as item keys", "No model and no input - show getting started guide")
  • These comments should be removed per the de-AI code review guidelines

Implementation Quality:
The CLI design is well-thought-out with proper separation of concerns, but needs refactoring to eliminate nested conditionals and remove obvious comments.

Confidence Score: 4/5

  • Safe to merge with minor style improvements recommended
  • The implementation is functionally correct with comprehensive tests (638 passing) and proper error handling. The nested conditionals and redundant comments are style issues that don't affect correctness but violate the repository's coding standards. No logic errors, security issues, or breaking changes detected.
  • Pay attention to lib/classifier/cli.rb - refactor nested conditionals and remove redundant comments per style guide

Important Files Changed

Filename Overview
exe/classifier Clean executable entry point with proper error handling
lib/classifier/cli.rb Main CLI implementation with nested conditionals and redundant comments that need refactoring
lib/classifier/logistic_regression.rb Improved error handling for unfitted models, moved validation to fit(), added add_category method

Sequence Diagram

sequenceDiagram
    participant User
    participant Exe as exe/classifier
    participant CLI as CLI.new
    participant Parser as OptionParser
    participant Command as execute_command
    participant Classifier as Bayes/LSI/KNN/LR
    participant Storage as Storage::File

    User->>Exe: Run classifier [args]
    Exe->>CLI: new(ARGV)
    CLI->>CLI: initialize(args)
    Note over CLI: Set default options from ENV
    
    Exe->>CLI: run()
    CLI->>Parser: parse_options
    Parser->>CLI: Parse flags (-f, -m, -p, etc)
    
    CLI->>Command: execute_command
    
    alt train command
        Command->>Storage: load_or_create_classifier
        Storage-->>Command: Classifier instance
        Command->>Classifier: train(category, text)
        Command->>Storage: save_classifier
        Storage->>Storage: write JSON
    else classify command (default)
        alt No model exists
            Command->>User: show_getting_started
        else Model exists
            Command->>Storage: load_classifier
            Storage-->>Command: Classifier instance
            alt LR not fitted
                Command-->>User: Error: Model not fitted
            else Ready to classify
                Command->>Classifier: classify(text)
                Classifier-->>Command: category
                Command->>User: Output result
            end
        end
    else fit command (LR only)
        Command->>Storage: load_classifier
        Storage-->>Command: LR instance
        Command->>Classifier: fit()
        Note over Classifier: Run SGD optimization
        Command->>Storage: save_classifier
    else info command
        Command->>Storage: load_classifier
        Storage-->>Command: Classifier instance
        Command->>Classifier: Extract metadata
        Command->>User: Output JSON
    else search/related (LSI only)
        Command->>Storage: load_classifier
        Storage-->>Command: LSI instance
        Command->>Classifier: search(query) / find_related(item)
        Classifier-->>Command: results with scores
        Command->>User: Output results
    end
    
    CLI-->>Exe: {output, error, exit_code}
    Exe->>User: Display output/error
    Exe->>User: exit(code)
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 files reviewed, 11 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 336 to 344
unless File.exist?(@options[:model])
if text.empty? && ($stdin.tty? || @stdin.nil?)
show_getting_started
return
end
@error << "Error: model not found at #{@options[:model]}"
@exit_code = 1
return
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals violate style guide. Use guard clauses with early returns.

Suggested change
unless File.exist?(@options[:model])
if text.empty? && ($stdin.tty? || @stdin.nil?)
show_getting_started
return
end
@error << "Error: model not found at #{@options[:model]}"
@exit_code = 1
return
end
# No model and no input - show getting started guide
if text.empty? && ($stdin.tty? || @stdin.nil?)
show_getting_started
return
end
unless File.exist?(@options[:model])
@error << "Error: model not found at #{@options[:model]}"
@exit_code = 1
return
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 336:344

Comment:
**style:** Nested conditionals violate style guide. Use guard clauses with early returns.

```suggestion
      # No model and no input - show getting started guide
      if text.empty? && ($stdin.tty? || @stdin.nil?)
        show_getting_started
        return
      end
      unless File.exist?(@options[:model])
        @error << "Error: model not found at #{@options[:model]}"
        @exit_code = 1
        return
      end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 348 to 357
if text.empty?
lines = read_stdin_lines
if lines.empty?
show_model_usage(classifier)
return
end
lines.each { |line| classify_and_output(classifier, line) }
else
classify_and_output(classifier, text)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals make logic harder to follow. Refactor using guard clauses.

Suggested change
if text.empty?
lines = read_stdin_lines
if lines.empty?
show_model_usage(classifier)
return
end
lines.each { |line| classify_and_output(classifier, line) }
else
classify_and_output(classifier, text)
end
if text.empty?
lines = read_stdin_lines
return show_model_usage(classifier) if lines.empty?
lines.each { |line| classify_and_output(classifier, line) }
else
classify_and_output(classifier, text)
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 348:357

Comment:
**style:** Nested conditionals make logic harder to follow. Refactor using guard clauses.

```suggestion
      if text.empty?
        lines = read_stdin_lines
        return show_model_usage(classifier) if lines.empty?
        
        lines.each { |line| classify_and_output(classifier, line) }
      else
        classify_and_output(classifier, text)
      end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 171 to 181
if classifier.is_a?(LSI) && @args.any?
train_lsi_from_files(classifier, category, @args)
else
text = read_training_input
if text.empty?
@error << 'Error: no training data provided'
@exit_code = 2
return
end
train_classifier(classifier, category, text)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditional inside else branch. Use guard clauses for clearer flow.

Suggested change
if classifier.is_a?(LSI) && @args.any?
train_lsi_from_files(classifier, category, @args)
else
text = read_training_input
if text.empty?
@error << 'Error: no training data provided'
@exit_code = 2
return
end
train_classifier(classifier, category, text)
end
# For LSI, train with file paths as item keys
if classifier.is_a?(LSI) && @args.any?
train_lsi_from_files(classifier, category, @args)
save_classifier(classifier)
return
end
text = read_training_input
if text.empty?
@error << 'Error: no training data provided'
@exit_code = 2
return
end
train_classifier(classifier, category, text)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 171:181

Comment:
**style:** Nested conditional inside else branch. Use guard clauses for clearer flow.

```suggestion
      # For LSI, train with file paths as item keys
      if classifier.is_a?(LSI) && @args.any?
        train_lsi_from_files(classifier, category, @args)
        save_classifier(classifier)
        return
      end
      
      text = read_training_input
      if text.empty?
        @error << 'Error: no training data provided'
        @exit_code = 2
        return
      end
      train_classifier(classifier, category, text)
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 204 to 229
# Categories - different classifiers store them differently
if classifier.respond_to?(:categories)
info[:categories] = classifier.categories.map(&:to_s)
end

info[:training_count] = classifier.training_count if classifier.respond_to?(:training_count)
info[:vocab_size] = classifier.vocab_size if classifier.respond_to?(:vocab_size)
info[:fitted] = classifier.fitted? if classifier.respond_to?(:fitted?)

# Category-specific stats for Bayes
if classifier.is_a?(Bayes)
categories_data = classifier.instance_variable_get(:@categories)
info[:category_stats] = classifier.categories.to_h do |cat|
cat_data = categories_data[cat.to_sym] || {}
[cat.to_s, { unique_words: cat_data.size, total_words: cat_data.values.sum }]
end
end

# LSI-specific info
if classifier.is_a?(LSI)
info[:documents] = classifier.items.size
info[:items] = classifier.items
# Get unique categories from items
categories = classifier.items.map { |item| classifier.categories_for(item) }.flatten.uniq
info[:categories] = categories.map(&:to_s) unless categories.empty?
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove inline comments that restate code. # Categories - different classifiers store them differently and # Get unique categories from items just describe what the code already shows. The conditional checks make the intent obvious.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 204:229

Comment:
**style:** Remove inline comments that restate code. `# Categories - different classifiers store them differently` and `# Get unique categories from items` just describe what the code already shows. The conditional checks make the intent obvious.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 213 to 220
# Category-specific stats for Bayes
if classifier.is_a?(Bayes)
categories_data = classifier.instance_variable_get(:@categories)
info[:category_stats] = classifier.categories.to_h do |cat|
cat_data = categories_data[cat.to_sym] || {}
[cat.to_s, { unique_words: cat_data.size, total_words: cat_data.values.sum }]
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove comment # Category-specific stats for Bayes — the code clearly shows it's extracting Bayes-specific category stats.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 213:220

Comment:
**style:** Remove comment `# Category-specific stats for Bayes` — the code clearly shows it's extracting Bayes-specific category stats.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 231 to 237
# KNN-specific info
if classifier.is_a?(KNN)
data = classifier.instance_variable_get(:@data) || []
info[:documents] = data.size
categories = data.map { |d| d[:category] }.uniq
info[:categories] = categories.map(&:to_s) unless categories.empty?
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove comment # KNN-specific info — redundant given the type check.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 231:237

Comment:
**style:** Remove comment `# KNN-specific info` — redundant given the type check.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 170 to 172
# For LSI, train with file paths as item keys
if classifier.is_a?(LSI) && @args.any?
train_lsi_from_files(classifier, category, @args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove comment # For LSI, train with file paths as item keys — restates what the code shows. The classifier.is_a?(LSI) check and method name train_lsi_from_files make this obvious.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 170:172

Comment:
**style:** Remove comment `# For LSI, train with file paths as item keys` — restates what the code shows. The `classifier.is_a?(LSI)` check and method name `train_lsi_from_files` make this obvious.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 335 to 339
# No model and no input - show getting started guide
unless File.exist?(@options[:model])
if text.empty? && ($stdin.tty? || @stdin.nil?)
show_getting_started
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove comment # No model and no input - show getting started guide — obvious from the condition and method call.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 335:339

Comment:
**style:** Remove comment `# No model and no input - show getting started guide` — obvious from the condition and method call.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 387 to 390
# LR requires explicit fit before classification
if classifier.is_a?(LogisticRegression) && !classifier.fitted?
raise StandardError, "Model not fitted. Run 'classifier fit' after training."
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove comment # LR requires explicit fit before classification — the error message already explains this clearly.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 387:390

Comment:
**style:** Remove comment `# LR requires explicit fit before classification` — the error message already explains this clearly.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 488 to 491
content = File.read(file)
# Use file path as item key so search/related returns file paths
classifier.add_item(file, category.to_sym) { content }
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Remove comment # Use file path as item key so search/related returns file paths — obvious from the code using file as the item identifier.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/cli.rb
Line: 488:491

Comment:
**style:** Remove comment `# Use file path as item key so search/related returns file paths` — obvious from the code using `file` as the item identifier.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

- Refactor build_model_info into smaller methods to reduce complexity
- Fix Minitest assertion style (refute/assert vs assert_equal)
- Add JSON.pretty_generate to vendor RBS
- Fix file permissions on generated RBS files (644 instead of 600)
- Refactor restore_state to reduce ABC complexity
- Regenerate RBS files for logistic_regression
- Refactor nested conditionals to use guard clauses in command_train
- Refactor nested conditionals to use guard clauses in command_classify
- Remove redundant comments that restate what code already shows
@cardmagic cardmagic changed the title Add CLI executable for training and classification Add CLI executable with model registry support Dec 31, 2025
Enable downloading and using pre-trained models from GitHub repositories
similar to Homebrew taps. This makes it easy to share and reuse classifier
models without training from scratch.

New commands:
- `models` - list available models in a registry
- `pull` - download models to local cache
- `push` - show instructions for contributing models

New options:
- `-r/--remote MODEL` - classify using a remote model directly
- `-o/--output FILE` - specify download location for pull

Features:
- Default registry: cardmagic/classifier-models
- Custom registries: @user/repo:model syntax
- Local caching in ~/.classifier/models/
- Automatic fallback from main to master branch
- Environment variables for configuration

Closes #119
Steep was failing because the CLI uses stdlib classes (FileUtils, URI,
Net::HTTP, JSON::ParserError) that weren't declared in the Steepfile.

Also fixes nil safety issue with spec[1..] which returns String? in
Ruby's type system - now uses fallback to empty string.
The 'models' command now supports --local to list models that have
been downloaded to the local cache (~/.classifier/models/) instead
of fetching the registry index from GitHub.

Shows model name, type, and human-readable file size. Custom registry
models display with their full @user/repo:name format.
Use idiomatic Ruby patterns to satisfy linter:
- Replace each+append with map for building arrays
- Use annotated format tokens (%<name>s) over positional ones
- Use %r{} for regex containing slashes
@cardmagic cardmagic merged commit 4972dc3 into master Dec 31, 2025
5 checks passed
@cardmagic cardmagic deleted the add-cli-executable branch December 31, 2025 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CLI executable for training and classification

2 participants