Data-Nova-Project/datanova-backend


DataNova Backend

A modern, asynchronous FastAPI backend application for the DataNova project, built with Python and PostgreSQL.

🚀 Getting Started

Setup Guide: For complete installation and configuration instructions, please read GUIDE.md.

System Architecture & Modeling

Detailed Documentation: For a deep dive into the technology stack, design decisions, and component breakdown, please read docs/ARCHITECTURE_DETAILS.md.

Architecture Overview

DataNova follows a clean, layered architecture designed for scalability and maintainability.

Key Layers

  1. Presentation Layer (app/api): FastAPI Routers handling HTTP requests, validation, and serialization.
  2. Service Layer (app/services): Contains the core business logic (ML training, PDF generation, AI integration).
  3. Data Access Layer (app/models, app/db): SQLAlchemy models and async database sessions.
  4. Storage Layer:
    • PostgreSQL: Stores structured data (Users, Metadata, Relations).
    • File System: Stores heavy assets (CSV Datasets, Joblib Models, PDF Reports, Charts).

System Modeling (UML)

The project includes PlantUML diagrams in the modelisation/ directory to visualize the system:

  • Context Diagram (context_diagram.puml): High-level system interactions.
  • Use Case Diagram (usecase_diagram.puml): Functional requirements and actor interactions.
  • Class Diagram (class_diagram.puml): Key entity relationships (User, Dataset, Experiment).
  • Flow Diagram (flow_diagram.puml): Data processing sequence.
  • State Diagram (state_diagram.puml): Lifecycle of a Machine Learning Experiment.

1. System Context

High-level view of how users interact with DataNova and its external dependencies (Groq AI, Database).

(See modelisation/context_diagram.puml)

2. Key Entities (Class Diagram)

The core data models representing the domain logic:

  • User: Owner of resources.
  • Dataset: Raw uploaded data.
  • Pipeline: Preprocessing steps configuration.
  • Experiment: A run of an Algorithm on a Dataset or Pipeline.
  • Report: PDF output of results.

(See modelisation/class_diagram.puml)

3. Data Flow

How data transforms from upload to report generation. (See modelisation/flow_diagram.puml)

Directory Structure

app/
├── api/            # API Route handlers (Controllers)
├── core/           # Config, Security, Auth
├── db/             # Database connection & Session logic
├── models/         # SQLAlchemy ORM Models
├── schemas/        # Pydantic Schemas (Validation)
├── services/       # Business Logic (ML, AI, Files)
│   ├── ai_service.py       # Groq LLM integration
│   ├── training.py         # ML model training
│   ├── pdf_service.py      # Report generation
│   └── ...
└── main.py         # App entry point

Features

Core Features

  • Asynchronous API: Built with FastAPI for high-performance async operations
  • User Authentication: JWT-based authentication with fastapi-users
  • Database Integration: PostgreSQL with async SQLAlchemy ORM
  • Database Migrations: Alembic support for schema management
  • API Documentation: Automatic OpenAPI/Swagger documentation

Machine Learning Features

  • 17 ML Algorithms: 6 classification, 7 regression, 4 clustering algorithms
  • Data Preprocessing Pipelines: Chain transformations (scaling, encoding, feature selection)
  • Dual-Mode Training: Train on preprocessed pipelines OR raw datasets directly
  • Background Training: Async training jobs with status tracking
  • Model Export: Download trained models as joblib files
  • Visualization: 12 chart types including confusion matrix, ROC curves, scatter plots
  • ML Model Recommendations: AI-powered algorithm recommendations based on dataset characteristics
  • Groq API Integration: Advanced AI capabilities with user-configurable API keys

Reports System

  • Comprehensive Report Generation: Create detailed analysis reports from experiments
  • Professional Visualizations: Automated chart generation with dataset insights
  • Dataset Analysis: In-depth statistical analysis and data profiling
  • Export Capabilities: Download reports in multiple formats
  • Interactive Dashboard: Visual report management interface

Data Analysis & Visualization System

  • 12+ Chart Types: Distribution plots, correlation heatmaps, scatter plots, bar charts
  • Statistical Analysis: Comprehensive dataset profiling and statistics
  • Interactive Visualizations: Real-time chart generation from API endpoints
  • PDF Report Generation: Professional PDF reports with embedded visualizations
  • Chart Export: Download charts as PNG images
  • Data Profiling: Automated data quality assessment and insights

AI Integration & Services

  • Groq AI Chat: Integrated AI chat completion with multiple models
  • User-Configurable AI: Personal Groq API keys for enhanced AI features
  • ML Algorithm Recommendations: AI-powered algorithm suggestions
  • Smart Data Analysis: AI-driven insights and recommendations
  • Multiple AI Models: Support for various Groq AI models (llama-3.3-70b-versatile, etc.)

Streamlit Integration (ML AutoFlow)

  • Alternative UI: Complete Streamlit-based ML workflow interface
  • French Language Support: Fully localized interface
  • Visual ML Pipeline: Drag-and-drop ML workflow builder
  • Real-time Preprocessing: Interactive data transformation tools
  • Model Comparison: Side-by-side algorithm performance analysis

Activity Tracking System

  • Comprehensive Logging: Track all user actions across resources
  • 13 Tracked Operations: Authentication, datasets, pipelines, experiments, models
  • Activity Analytics: Query by user, resource type, action type, date range
  • Request Context: IP address and user agent tracking
  • Non-Blocking: Activity failures don't affect API operations
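
The non-blocking behaviour listed above could be approximated as in the sketch below. It is an illustration under assumptions, not the project's implementation: `track_safely`, `ok_recorder`, and `failing_recorder` are hypothetical names, and the real service in app/services/activity.py may work differently.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("activity")


async def track_safely(
    record: Callable[..., Awaitable[Any]], **details: Any
) -> bool:
    """Run an activity recorder; log and swallow any failure.

    Returns True if the activity was recorded, False otherwise.
    """
    try:
        await record(**details)
        return True
    except Exception:  # deliberately broad: tracking must never break a request
        logger.exception("Activity tracking failed; continuing")
        return False


async def ok_recorder(**details: Any) -> None:
    # Hypothetical recorder that succeeds.
    return None


async def failing_recorder(**details: Any) -> None:
    # Hypothetical recorder that simulates a database outage.
    raise RuntimeError("database unavailable")
```

A route handler would await `track_safely(...)` after its main work, so a failed insert into the activity table is logged but the API response is still returned.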

Logging & Monitoring

  • Request Logging Middleware: Track all incoming requests with timing
  • Custom Logging: Persists to both database and console
  • Health Checks: Built-in health check endpoints
  • Testing: Comprehensive test suite with pytest and async support
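
To illustrate what request logging with timing involves, here is a minimal pure-ASGI middleware written with the standard library only. It is a sketch, not the code in app/middleware/request_logging.py; the class name `TimingMiddleware`, the `log` parameter, and `demo_app` are assumptions made for this example.

```python
import asyncio
import time


class TimingMiddleware:
    """Pure ASGI middleware that records how long each HTTP request takes."""

    def __init__(self, app, log=print):
        self.app = app
        self.log = log  # injectable so the output can be captured in tests

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through non-HTTP traffic (lifespan, websocket) untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.log(f"{scope['method']} {scope['path']} took {elapsed_ms:.1f} ms")


async def demo_app(scope, receive, send):
    # Stand-in ASGI application used only for this illustration.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})
```

Because FastAPI applications are ASGI applications, a class of this shape can be registered with `app.add_middleware(TimingMiddleware)`.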

Advanced Features

Dual Interface Support

DataNova provides two complementary interfaces:

  1. FastAPI REST API: Modern async API for web and mobile applications
  2. Streamlit ML AutoFlow: Interactive web interface for visual ML workflows

Multi-Language Support

  • English: Primary API documentation and interface
  • French: Complete Streamlit interface localization

Professional Reporting

  • PDF Generation: High-quality reports with embedded charts
  • Chart Gallery: Professional visualization library
  • Statistical Insights: Automated data analysis summaries
  • Export Options: Multiple format support (PDF, PNG, CSV)

Enterprise Features

  • Activity Tracking: Comprehensive audit logging
  • User Management: Role-based access control
  • Request Logging: Performance monitoring and debugging
  • Health Monitoring: System health checks and status reporting
  • CORS Configuration: Production-ready security settings

Prerequisites

  • Python 3.10 or higher
  • PostgreSQL database
  • pip (Python package manager)

Installation

  1. Clone the repository:

    git clone https://github.com/Data-Nova-Project/datanova-backend.git
    cd datanova-backend
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables: Create a .env file in the root directory with the following variables:

    DATABASE_URL=postgresql+asyncpg://username:password@localhost:5432/database_name
    SECRET_KEY=your-secret-key-here
    GROQ_API_KEY=your-groq-api-key-here
    
    # Optional Configuration
    APP_VERSION=1.0.0
    ENVIRONMENT=development
    DATABASE_ECHO=false
    PROJECT_NAME="DataNova"
    
    # CORS Settings (JSON-formatted lists)
    CORS_ORIGINS=["http://localhost:3000","http://localhost:5173"]
    CORS_ALLOW_CREDENTIALS=true
    CORS_ALLOW_METHODS=["*"]
    CORS_ALLOW_HEADERS=["*"]

Configuration

The application uses Pydantic settings for configuration. Key settings include:

  • DATABASE_URL: PostgreSQL connection string (async format)
  • SECRET_KEY: Secret key for JWT token signing
  • DATABASE_ECHO: Enable SQL query logging (default: False)
  • PROJECT_NAME: Application name (default: "DataNova")
  • CORS_ORIGINS: List of allowed origins for CORS (default: localhost development servers)
  • CORS_ALLOW_CREDENTIALS: Allow credentials in CORS requests (default: True)
  • CORS_ALLOW_METHODS: Allowed HTTP methods for CORS (default: ["*"])
  • CORS_ALLOW_HEADERS: Allowed headers for CORS (default: ["*"])
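
The application itself relies on Pydantic settings, but the way these values and defaults combine can be shown with a standard-library sketch. The `Settings` dataclass, its field names, and `load_settings` are hypothetical stand-ins for the real class in app/config.py; list values are decoded as JSON to match the .env example.

```python
import json
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass
class Settings:
    database_url: str  # required
    secret_key: str    # required
    database_echo: bool = False
    project_name: str = "DataNova"
    cors_origins: List[str] = field(
        default_factory=lambda: [
            "http://localhost:3000",
            "http://localhost:5173",
            "http://127.0.0.1:3000",
            "http://127.0.0.1:5173",
        ]
    )
    cors_allow_credentials: bool = True


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment-like mapping, applying defaults."""
    overrides = {}
    if "DATABASE_ECHO" in env:
        overrides["database_echo"] = _as_bool(env["DATABASE_ECHO"])
    if "PROJECT_NAME" in env:
        overrides["project_name"] = env["PROJECT_NAME"]
    if "CORS_ORIGINS" in env:
        # List values are written as JSON arrays, e.g. ["http://localhost:3000"]
        overrides["cors_origins"] = json.loads(env["CORS_ORIGINS"])
    if "CORS_ALLOW_CREDENTIALS" in env:
        overrides["cors_allow_credentials"] = _as_bool(env["CORS_ALLOW_CREDENTIALS"])
    # A missing required variable raises KeyError, so misconfiguration fails early.
    return Settings(
        database_url=env["DATABASE_URL"],
        secret_key=env["SECRET_KEY"],
        **overrides,
    )
```

Calling `load_settings(os.environ)` at startup would surface a missing DATABASE_URL or SECRET_KEY immediately rather than at the first database query.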

Database Setup

  1. Ensure PostgreSQL is running and accessible
  2. Create a database for the application
  3. Run database migrations:
    alembic upgrade head

Running the Application

Development Mode

uvicorn app.main:app --reload

The application will be available at http://127.0.0.1:8000

Production Mode

uvicorn app.main:app --host 0.0.0.0 --port 8000

API Documentation

Once the application is running, visit:

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc
  • OpenAPI Schema: http://127.0.0.1:8000/openapi.json

API Endpoints

Authentication

  • POST /api/v1/auth/register - Register new user
  • POST /api/v1/auth/jwt/login - Login and get JWT token

Datasets

  • POST /api/v1/datasets/ - Upload CSV dataset
  • GET /api/v1/datasets/ - List user's datasets
  • GET /api/v1/datasets/{id} - Get dataset details
  • GET /api/v1/datasets/{id}/download - Download dataset
  • DELETE /api/v1/datasets/{id} - Delete dataset

Pipelines

  • POST /api/v1/pipelines/ - Create preprocessing pipeline
  • GET /api/v1/pipelines/ - List user's pipelines
  • GET /api/v1/pipelines/{id} - Get pipeline details
  • POST /api/v1/pipelines/{id}/preview - Preview pipeline transformation
  • PUT /api/v1/pipelines/{id} - Update pipeline steps
  • DELETE /api/v1/pipelines/{id} - Delete pipeline

Experiments (ML Training)

  • POST /api/v1/experiments/ - Start training experiment
  • GET /api/v1/experiments/ - List experiments
  • GET /api/v1/experiments/{id} - Get experiment status/results
  • GET /api/v1/experiments/{id}/download-model - Download trained model
  • DELETE /api/v1/experiments/{id} - Delete experiment
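
Since training runs in the background, a client typically starts an experiment and then polls GET /api/v1/experiments/{id} until it finishes. The helper below sketches that loop. The `status` field and the values `"completed"` and `"failed"` are assumptions, not confirmed by this README; check the experiment schema for the actual names.

```python
import time
from typing import Callable, Mapping

# Assumed terminal status values; the real API may use different ones.
TERMINAL_STATUSES = frozenset({"completed", "failed"})


def wait_for_experiment(
    fetch: Callable[[], Mapping],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping:
    """Call fetch() until the experiment reaches a terminal status.

    fetch is any callable returning the decoded JSON of
    GET /api/v1/experiments/{id}; it is injected so the loop can be
    exercised without a running server.
    """
    for attempt in range(max_attempts):
        experiment = fetch()
        if experiment.get("status") in TERMINAL_STATUSES:
            return experiment
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"experiment still running after {max_attempts} attempts")
```

Passing `fetch` and `sleep` as parameters keeps the polling logic independent of any particular HTTP client.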

ML Models

  • GET /api/v1/ml-models/ - List all available algorithms
  • GET /api/v1/ml-models/{category} - Get algorithms by category
  • GET /api/v1/ml-models/{algorithm}/schema - Get hyperparameter schema

Analysis & Visualizations

  • GET /api/v1/analysis/{dataset_id}/distribution - Generate distribution charts (PNG)
  • GET /api/v1/analysis/{dataset_id}/correlation - Generate correlation heatmaps
  • GET /api/v1/analysis/{dataset_id}/scatter - Create scatter plots
  • GET /api/v1/analysis/{dataset_id}/boxplot - Generate box plots
  • GET /api/v1/analysis/{dataset_id}/histogram - Create histograms
  • GET /api/v1/analysis/{dataset_id}/pairplot - Generate pair plots
  • GET /api/v1/analysis/{dataset_id}/summary - Get dataset statistical summary
  • GET /api/v1/analysis/{dataset_id}/profile - Comprehensive data profiling

AI Services

  • POST /api/v1/ai/chat - Chat completion with AI models
  • GET /api/v1/ai/models - List available AI models
  • POST /api/v1/ai/analyze-data - AI-powered data analysis

Visualizations

  • POST /api/v1/visualize/line - Create line chart
  • POST /api/v1/visualize/bar - Create bar chart
  • POST /api/v1/visualize/scatter - Create scatter plot
  • POST /api/v1/visualize/histogram - Create histogram
  • POST /api/v1/visualize/confusion-matrix - Create confusion matrix
  • And more... (12 visualization types total)

Reports

  • POST /api/v1/reports/generate/{experiment_id} - Generate comprehensive experiment report
  • GET /api/v1/reports/{report_id} - Get report details
  • GET /api/v1/reports/{report_id}/download - Download report file
  • GET /api/v1/reports/experiment/{experiment_id} - Get reports for specific experiment
  • DELETE /api/v1/reports/{report_id} - Delete report

ML Recommendations

  • POST /api/v1/recommendations/algorithms - Get algorithm recommendations for dataset
  • POST /api/v1/recommendations/analyze-dataset - Analyze dataset characteristics

User Settings

  • GET /api/v1/users/me/settings - Get user settings including Groq API configuration
  • PUT /api/v1/users/me/settings - Update user settings and API keys

Activity Tracking

  • GET /api/v1/activities/ - Get user activities
  • GET /api/v1/activities/summary - Activity summary statistics
  • GET /api/v1/activities/recent - Recent activities

Detailed API Documentation

See the docs/ directory for comprehensive documentation.

Authentication

The API uses JWT (JSON Web Tokens) for authentication:

Register a new user:

POST /api/v1/auth/register
Content-Type: application/json

{
  "email": "user@example.com",
  "password": "password123"
}

Login:

POST /api/v1/auth/jwt/login
Content-Type: application/x-www-form-urlencoded

username=user@example.com&password=password123

Use authenticated endpoints:

Include the JWT token in the Authorization header:

Authorization: Bearer <your-jwt-token>
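
The three steps above can be sketched with the standard library as follows. The helper names are hypothetical, and `BASE_URL` assumes the development server address used elsewhere in this README. The functions only build the requests, so they can be inspected without a running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000"  # development address; adjust for your deployment


def register_request(email: str, password: str) -> Request:
    """Build the JSON registration request."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/v1/auth/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def login_request(email: str, password: str) -> Request:
    """Build the form-encoded login request; the email is sent as "username"."""
    body = urlencode({"username": email, "password": password}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/v1/auth/jwt/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def authorized_request(path: str, token: str) -> Request:
    """Build a GET request carrying the JWT in the Authorization header."""
    return Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Each request can be sent with `urllib.request.urlopen`. The token is expected in the JSON body of the login response; the field name (commonly `access_token` with fastapi-users bearer transport) is an assumption to verify against the Swagger UI.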

Testing

Run the test suite:

pytest

Run tests with coverage:

pytest --cov=app --cov-report=html

Run specific test file:

pytest tests/test_database.py

Project Structure

datanova-backend/
├── app/
│   ├── api/
│   │   └── routes/
│   │       ├── activities_route.py      # User activity tracking
│   │       ├── analysis_route.py        # Data analysis & visualization
│   │       ├── datasets_route.py        # Dataset management
│   │       ├── experiments_route.py     # ML experiment management
│   │       ├── logs_route.py           # System logging
│   │       ├── ml_models_route.py      # ML model registry
│   │       ├── pipelines_route.py      # Data preprocessing
│   │       ├── recommendations_route.py # AI recommendations
│   │       ├── reports_route.py        # Report generation
│   │       ├── users_route.py          # User management
│   │       └── __init__.py
│   ├── core/
│   │   └── auth.py                     # Authentication configuration
│   ├── db/
│   │   ├── session.py                  # Database session management
│   │   └── __init__.py
│   ├── ml/
│   │   ├── adapters/                   # ML algorithm adapters
│   │   ├── base.py                     # Base ML classes
│   │   ├── dataset_analyzer.py         # Dataset analysis tools
│   │   ├── problem_classifier.py       # Problem type classification
│   │   ├── recommendation_engine.py    # ML recommendation system
│   │   ├── registry.py                 # Algorithm registry
│   │   └── __init__.py
│   ├── models/
│   │   ├── activities.py               # Activity tracking models
│   │   ├── datasets.py                 # Dataset models
│   │   ├── experiments.py              # Experiment models
│   │   ├── logs.py                     # System log models
│   │   ├── pipelines.py                # Pipeline models
│   │   ├── recommendations.py          # Recommendation models
│   │   ├── users.py                    # User models
│   │   └── __init__.py
│   ├── schemas/
│   │   ├── experiments.py              # Experiment schemas
│   │   ├── logs.py                     # Log schemas
│   │   ├── ml_models.py                # ML model schemas
│   │   ├── pipelines.py                # Pipeline schemas
│   │   ├── recommendations.py          # Recommendation schemas
│   │   ├── reports.py                  # Report schemas
│   │   ├── users.py                    # User schemas
│   │   └── __init__.py
│   ├── services/
│   │   ├── activity.py                 # Activity tracking service
│   │   ├── ai_service.py               # AI/Groq integration
│   │   ├── analysis.py                 # Data analysis service
│   │   ├── chart_service.py            # Chart generation
│   │   ├── dataset_analysis.py         # Dataset profiling
│   │   ├── file_processing.py          # File handling
│   │   ├── logger.py                   # Logging service
│   │   ├── pdf_service.py              # PDF generation
│   │   ├── preprocessing.py            # Data preprocessing
│   │   ├── recommendation_service.py   # ML recommendations
│   │   ├── report_service.py           # Report generation
│   │   ├── storage.py                  # File storage
│   │   └── training.py                 # ML training
│   ├── streamlit_ia/                   # Alternative Streamlit UI
│   │   ├── Archecture/                 # Architecture diagrams
│   │   ├── models/                     # Streamlit ML models
│   │   ├── pages/                      # Streamlit pages
│   │   ├── preprocessing/              # Streamlit preprocessing
│   │   ├── tests/                      # Streamlit tests
│   │   ├── uploaded/                   # Upload directory
│   │   ├── utils/                      # Streamlit utilities
│   │   ├── visualization/              # Visualization components
│   │   ├── data_loader.py              # Data loading utilities
│   │   ├── Home.py                     # Streamlit main page
│   │   └── streamlit_app_OLD.py        # Legacy app
│   ├── middleware/
│   │   └── request_logging.py          # Request logging middleware
│   ├── config.py                       # Application configuration
│   ├── main.py                         # FastAPI application
│   └── __init__.py
├── tests/
│   ├── conftest.py                     # Test configuration
│   ├── test_auth.py                    # Authentication tests
│   ├── test_database.py                # Database tests
│   └── __init__.py
├── scripts/
│   └── seed_algorithms.py              # Database seeding
├── alembic/                            # Database migrations
│   ├── versions/                       # Migration files
│   ├── env.py                          # Alembic environment
│   └── script.py.mako                  # Migration template
├── models/                             # Trained model storage
├── reports/
│   └── charts/                         # Generated chart storage
├── temp/                               # Temporary files
├── uploads/                            # User uploads
├── docs/                               # Documentation
├── requirements.txt                    # Python dependencies
├── pytest.ini                         # Test configuration
├── alembic.ini                         # Alembic configuration
├── gg.py                               # Utility script
└── README.md                           # This file

Health Check

Check application health:

GET /health

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-07T12:00:00.000000",
  "version": "1.0.0",
  "environment": "development",
  "database": {
    "status": "connected",
    "type": "postgresql"
  },
  "service": "DataNova API"
}

The health check endpoint returns comprehensive information about the application's status:

  • status: Overall health status ("healthy" or "unhealthy")
  • timestamp: Current UTC timestamp in ISO 8601 format
  • version: Application version (configurable via APP_VERSION environment variable)
  • environment: Deployment environment (configurable via ENVIRONMENT variable)
  • database: Database connection status and type
  • service: Service name identifier
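
A monitoring script might reduce this response to a single pass/fail result. The sketch below assumes the payload shape shown above; `is_healthy` is a hypothetical helper, and treating only `"connected"` as a good database state is an assumption.

```python
import json


def is_healthy(payload: dict) -> bool:
    """Return True only if both the app and its database report a good state."""
    database = payload.get("database") or {}
    return (
        payload.get("status") == "healthy"
        and database.get("status") == "connected"
    )


# Sample body copied from the response documented above.
sample = json.loads("""
{
  "status": "healthy",
  "timestamp": "2025-12-07T12:00:00.000000",
  "version": "1.0.0",
  "environment": "development",
  "database": {"status": "connected", "type": "postgresql"},
  "service": "DataNova API"
}
""")
```

Using `.get` throughout means a truncated or unexpected payload is reported as unhealthy instead of raising an exception.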

CORS Configuration

The API includes Cross-Origin Resource Sharing (CORS) middleware to allow requests from web browsers. By default, it allows requests from common development servers:

  • http://localhost:3000 (React)
  • http://localhost:5173 (Vite)
  • http://127.0.0.1:3000
  • http://127.0.0.1:5173

Customizing CORS

You can customize CORS settings using environment variables or by modifying the CORS_ORIGINS list in your configuration:

# Allow specific production domain
CORS_ORIGINS=["https://your-frontend-domain.com", "https://www.your-frontend-domain.com"]

# Or add to existing development origins
CORS_ORIGINS=["http://localhost:3000", "http://localhost:5173", "https://your-frontend-domain.com"]

CORS Settings

  • CORS_ALLOW_CREDENTIALS: Set to true to allow cookies and authorization headers
  • CORS_ALLOW_METHODS: HTTP methods allowed (default: all methods)
  • CORS_ALLOW_HEADERS: HTTP headers allowed (default: all headers)

Logging

The application includes a custom logging system that:

  • Logs to console with timestamps and log levels
  • Persists log messages to the database
  • Supports different log levels: INFO, ERROR, WARNING
  • Handles database connection failures gracefully

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

FastAPI backend service for machine learning model training, evaluation, and dataset management. Handles preprocessing pipelines, algorithm execution, and result generation.
