DataNova Backend

A modern, asynchronous FastAPI backend application for the DataNova project, built with Python and PostgreSQL.

🚀 Getting Started

Setup Guide: For complete installation and configuration instructions, please read GUIDE.md.

System Architecture & Modeling

Detailed Documentation: For a deep dive into the technology stack, design decisions, and component breakdown, please read docs/ARCHITECTURE_DETAILS.md.

Architecture Overview

DataNova follows a clean, layered architecture designed for scalability and maintainability.

Key Layers

Presentation Layer (app/api): FastAPI Routers handling HTTP requests, validation, and serialization.
Service Layer (app/services): Contains the core business logic (ML training, PDF generation, AI integration).
Data Access Layer (app/models, app/db): SQLAlchemy models and async database sessions.
Storage Layer:
- PostgreSQL: Stores structured data (Users, Metadata, Relations).
- File System: Stores heavy assets (CSV Datasets, Joblib Models, PDF Reports, Charts).
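
To make the layering concrete, here is a minimal, framework-free sketch of how a request might pass through the three code layers. The names `DatasetRecord`, `DatasetRepository`, `DatasetService`, and `list_datasets_handler` are hypothetical and are not taken from the codebase; in the real application the handler would be a FastAPI route and the repository would use an async SQLAlchemy session.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    id: int
    owner_id: int
    filename: str


class DatasetRepository:
    """Data access layer: stand-in for SQLAlchemy models plus an async session."""

    def __init__(self, rows: list[DatasetRecord]) -> None:
        self._rows = rows

    async def list_for_owner(self, owner_id: int) -> list[DatasetRecord]:
        return [row for row in self._rows if row.owner_id == owner_id]


class DatasetService:
    """Service layer: business rules live here, not in the route handler."""

    def __init__(self, repo: DatasetRepository) -> None:
        self._repo = repo

    async def list_datasets(self, owner_id: int) -> list[dict]:
        rows = await self._repo.list_for_owner(owner_id)
        return [{"id": r.id, "filename": r.filename} for r in rows]


async def list_datasets_handler(service: DatasetService, current_user_id: int) -> dict:
    """Presentation layer: turn the service result into a response body."""
    return {"items": await service.list_datasets(current_user_id)}


repo = DatasetRepository([
    DatasetRecord(1, owner_id=7, filename="iris.csv"),
    DatasetRecord(2, owner_id=8, filename="sales.csv"),
])
response = asyncio.run(list_datasets_handler(DatasetService(repo), current_user_id=7))
```

Keeping the handler free of query logic is what lets the service be tested without HTTP, and the repository be swapped for a fake as done here.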

System Modeling (UML)

The project includes PlantUML diagrams in the modelisation/ directory to visualize the system:

Context Diagram (context_diagram.puml): High-level system interactions.
Use Case Diagram (usecase_diagram.puml): Functional requirements and actor interactions.
Class Diagram (class_diagram.puml): Key entity relationships (User, Dataset, Experiment).
Flow Diagram (flow_diagram.puml): Data processing sequence.
State Diagram (state_diagram.puml): Lifecycle of a Machine Learning Experiment.

1. System Context

High-level view of how users interact with DataNova and its external dependencies (Groq AI, Database).

(See modelisation/context_diagram.puml)

2. Key Entities (Class Diagram)

The core data models representing the domain logic:

User: Owner of resources.
Dataset: Raw uploaded data.
Pipeline: Preprocessing steps configuration.
Experiment: A training run of an Algorithm on a Dataset or Pipeline.
Report: PDF output of results.

(See modelisation/class_diagram.puml)
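
As a rough illustration of how these entities might relate, the sketch below models them as plain dataclasses. All field names are assumptions made for this example; the authoritative definitions are the class diagram and the ORM models under app/models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: int
    email: str


@dataclass
class Dataset:
    id: int
    owner_id: int      # references User.id
    file_path: str     # CSV stored on the file system


@dataclass
class Pipeline:
    id: int
    dataset_id: int    # references Dataset.id
    steps: list[dict]  # ordered preprocessing step configurations


@dataclass
class Experiment:
    id: int
    algorithm: str
    dataset_id: int
    pipeline_id: Optional[int] = None  # None: train on the raw dataset


@dataclass
class Report:
    id: int
    experiment_id: int  # references Experiment.id
    pdf_path: str


raw_run = Experiment(id=1, algorithm="random_forest", dataset_id=10)
piped_run = Experiment(id=2, algorithm="random_forest", dataset_id=10, pipeline_id=3)
```

The optional `pipeline_id` is one possible way to represent training on either a pipeline or a raw dataset; the actual schema may differ.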

3. Data Flow

How data transforms from upload to report generation. (See modelisation/flow_diagram.puml)

Directory Structure

app/
├── api/            # API Route handlers (Controllers)
├── core/           # Config, Security, Auth
├── db/             # Database connection & Session logic
├── models/         # SQLAlchemy ORM Models
├── schemas/        # Pydantic Schemas (Validation)
├── services/       # Business Logic (ML, AI, Files)
│   ├── ai_service.py       # Groq LLM integration
│   ├── training.py         # ML model training
│   ├── pdf_service.py      # Report generation
│   └── ...
└── main.py         # App entry point

Features

Core Features

Asynchronous API: Built with FastAPI for high-performance async operations
User Authentication: JWT-based authentication with fastapi-users
Database Integration: PostgreSQL with async SQLAlchemy ORM
Database Migrations: Alembic support for schema management
API Documentation: Automatic OpenAPI/Swagger documentation

Machine Learning Features

17 ML Algorithms: 6 classification, 7 regression, 4 clustering algorithms
Data Preprocessing Pipelines: Chain transformations (scaling, encoding, feature selection)
Dual-Mode Training: Train on preprocessed pipelines OR raw datasets directly
Background Training: Async training jobs with status tracking
Model Export: Download trained models as joblib files
Visualization: 12 chart types including confusion matrix, ROC curves, scatter plots
ML Model Recommendations: AI-powered algorithm recommendations based on dataset characteristics
Groq API Integration: Advanced AI capabilities with user-configurable API keys
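
The background training pattern can be sketched with plain asyncio. This is an illustration under stated assumptions, not the project's implementation: the status names ("pending", "running", "completed", "failed"), the in-memory `experiment_status` store, and the `train_model` placeholder are all hypothetical. The actual lifecycle is described in modelisation/state_diagram.puml.

```python
import asyncio

# Hypothetical in-memory status store; a real application would persist this.
experiment_status: dict[int, str] = {}


async def train_model(experiment_id: int) -> float:
    """Placeholder for the actual training work."""
    await asyncio.sleep(0.01)
    return 0.9


async def run_experiment(experiment_id: int) -> None:
    """Run training and record each status transition."""
    experiment_status[experiment_id] = "running"
    try:
        await train_model(experiment_id)
    except Exception:
        experiment_status[experiment_id] = "failed"
    else:
        experiment_status[experiment_id] = "completed"


async def main() -> str:
    experiment_status[1] = "pending"
    task = asyncio.create_task(run_experiment(1))  # scheduled, not yet started
    status_right_after_submit = experiment_status[1]
    await task  # a real API would return instead and let clients poll
    return status_right_after_submit


status_at_submit = asyncio.run(main())
```

The caller sees the job as pending as soon as it is submitted, and the status changes only once the event loop runs the task, which is the behavior a polling client of GET /api/v1/experiments/{id} would rely on.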

Reports System

Comprehensive Report Generation: Create detailed analysis reports from experiments
Professional Visualizations: Automated chart generation with dataset insights
Dataset Analysis: In-depth statistical analysis and data profiling
Export Capabilities: Download reports in multiple formats
Interactive Dashboard: Visual report management interface

Data Analysis & Visualization System

12+ Chart Types: Distribution plots, correlation heatmaps, scatter plots, bar charts
Statistical Analysis: Comprehensive dataset profiling and statistics
Interactive Visualizations: Real-time chart generation from API endpoints
PDF Report Generation: Professional PDF reports with embedded visualizations
Chart Export: Download charts as PNG images
Data Profiling: Automated data quality assessment and insights

AI Integration & Services

Groq AI Chat: Integrated AI chat completion with multiple models
User-Configurable AI: Personal Groq API keys for enhanced AI features
ML Algorithm Recommendations: AI-powered algorithm suggestions
Smart Data Analysis: AI-driven insights and recommendations
Multiple AI Models: Support for various Groq AI models (llama-3.3-70b-versatile, etc.)

Streamlit Integration (ML AutoFlow)

Alternative UI: Complete Streamlit-based ML workflow interface
French Language Support: Fully localized interface
Visual ML Pipeline: Drag-and-drop ML workflow builder
Real-time Preprocessing: Interactive data transformation tools
Model Comparison: Side-by-side algorithm performance analysis

Activity Tracking System

Comprehensive Logging: Track all user actions across resources
13 Tracked Operations: Authentication, datasets, pipelines, experiments, models
Activity Analytics: Query by user, resource type, action type, date range
Request Context: IP address and user agent tracking
Non-Blocking: Activity logging failures don't affect API operations
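
One common way to keep activity tracking from breaking API operations is to wrap the write in a guard that logs and swallows errors. The sketch below assumes a hypothetical `write_activity` callable; it is not the code in app/services/activity.py.

```python
import logging

logger = logging.getLogger("activity")


def record_activity_safely(write_activity, **fields) -> bool:
    """Call write_activity(**fields); log and swallow any error.

    Returns True if the activity was recorded, False otherwise.
    """
    try:
        write_activity(**fields)
        return True
    except Exception:
        logger.exception("Activity tracking failed; continuing request")
        return False


def broken_writer(**fields):
    """Simulates an activity store that is unavailable."""
    raise ConnectionError("activity store unavailable")


def delete_dataset(dataset_id: int) -> dict:
    # ... the actual deletion would happen here ...
    record_activity_safely(broken_writer, action="dataset.delete", resource_id=dataset_id)
    return {"deleted": dataset_id}


result = delete_dataset(42)
```

Even though the writer raises, `delete_dataset` still returns its normal result; the failure only shows up in the log.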

Logging & Monitoring

Request Logging Middleware: Track all incoming requests with timing
Custom Logging: Persists to both database and console
Health Checks: Built-in health check endpoints
Testing: Comprehensive test suite with pytest and async support
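
Request timing can be illustrated with a pure ASGI middleware, which needs no framework imports. This is a sketch, not the contents of app/middleware/request_logging.py; the `TimingMiddleware` class, its `sink` list, and `demo_app` are hypothetical names.

```python
import asyncio
import time


class TimingMiddleware:
    """Pure ASGI middleware that records method, path, and elapsed time."""

    def __init__(self, app, sink: list) -> None:
        self.app = app
        self.sink = sink  # hypothetical destination; could be a logger instead

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.sink.append((scope["method"], scope["path"], elapsed_ms))


async def demo_app(scope, receive, send) -> None:
    """Tiny ASGI app used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _run_once() -> tuple[list, list]:
    records, sent = [], []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/health"}
    await TimingMiddleware(demo_app, records)(scope, receive, send)
    return records, sent


records, sent = asyncio.run(_run_once())
```

Because FastAPI applications are ASGI applications, a class shaped like this could in principle be registered with `app.add_middleware(TimingMiddleware, sink=...)`.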

Advanced Features

Dual Interface Support

DataNova provides two complementary interfaces:

FastAPI REST API: Modern async API for web and mobile applications
Streamlit ML AutoFlow: Interactive web interface for visual ML workflows

Multi-Language Support

English: Primary API documentation and interface
French: Complete Streamlit interface localization

Professional Reporting

PDF Generation: High-quality reports with embedded charts
Chart Gallery: Professional visualization library
Statistical Insights: Automated data analysis summaries
Export Options: Multiple format support (PDF, PNG, CSV)

Enterprise Features

Activity Tracking: Comprehensive audit logging
User Management: Role-based access control
Request Logging: Performance monitoring and debugging
Health Monitoring: System health checks and status reporting
CORS Configuration: Production-ready security settings

Prerequisites

Python 3.10 or higher
PostgreSQL database
pip (Python package manager)

Installation

Clone the repository:

git clone https://github.com/Data-Nova-Project/datanova-backend.git
cd datanova-backend

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Set up environment variables: Create a .env file in the root directory with the following variables:

DATABASE_URL=postgresql+asyncpg://username:password@localhost:5432/database_name
SECRET_KEY=your-secret-key-here
GROQ_API_KEY=your-groq-api-key-here

# Optional Configuration
APP_VERSION=1.0.0
ENVIRONMENT=development
DATABASE_ECHO=false
PROJECT_NAME="DataNova"

# CORS Settings (JSON-formatted lists)
CORS_ORIGINS=["http://localhost:3000","http://localhost:5173"]
CORS_ALLOW_CREDENTIALS=true
CORS_ALLOW_METHODS=["*"]
CORS_ALLOW_HEADERS=["*"]

Configuration

The application uses Pydantic settings for configuration. Key settings include:

DATABASE_URL: PostgreSQL connection string (async format)
SECRET_KEY: Secret key for JWT token signing
DATABASE_ECHO: Enable SQL query logging (default: False)
PROJECT_NAME: Application name (default: "DataNova")
CORS_ORIGINS: List of allowed origins for CORS (default: localhost development servers)
CORS_ALLOW_CREDENTIALS: Allow credentials in CORS requests (default: True)
CORS_ALLOW_METHODS: Allowed HTTP methods for CORS (default: ["*"])
CORS_ALLOW_HEADERS: Allowed headers for CORS (default: ["*"])
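
The following standard-library sketch approximates how these values might be read from the environment, in particular how list settings such as CORS_ORIGINS are given as JSON arrays. It is not the project's Pydantic settings class; the `Settings` dataclass, `load_settings`, and `_as_bool` are hypothetical, and only a subset of the settings is shown.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Settings:
    database_url: str
    secret_key: str
    database_echo: bool = False
    project_name: str = "DataNova"
    cors_origins: list[str] = field(default_factory=lambda: [
        "http://localhost:3000",
        "http://localhost:5173",
        "http://127.0.0.1:3000",
        "http://127.0.0.1:5173",
    ])
    cors_allow_credentials: bool = True


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: dict[str, str]) -> Settings:
    """Build Settings from an environment-like mapping (e.g. os.environ)."""
    settings = Settings(
        database_url=env["DATABASE_URL"],  # required: KeyError if missing
        secret_key=env["SECRET_KEY"],
    )
    if "DATABASE_ECHO" in env:
        settings.database_echo = _as_bool(env["DATABASE_ECHO"])
    if "PROJECT_NAME" in env:
        settings.project_name = env["PROJECT_NAME"]
    if "CORS_ORIGINS" in env:
        settings.cors_origins = json.loads(env["CORS_ORIGINS"])  # JSON array
    if "CORS_ALLOW_CREDENTIALS" in env:
        settings.cors_allow_credentials = _as_bool(env["CORS_ALLOW_CREDENTIALS"])
    return settings


settings = load_settings({
    "DATABASE_URL": "postgresql+asyncpg://username:password@localhost:5432/database_name",
    "SECRET_KEY": "your-secret-key-here",
    "CORS_ORIGINS": '["https://your-frontend-domain.com"]',
})
```

Required values fail fast when missing, while optional ones fall back to the defaults listed above.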

Database Setup

Ensure PostgreSQL is running and accessible
Create a database for the application
Run database migrations:
```
alembic upgrade head
```

Running the Application

Development Mode

uvicorn app.main:app --reload

The application will be available at http://127.0.0.1:8000

Production Mode

uvicorn app.main:app --host 0.0.0.0 --port 8000

API Documentation

Once the application is running, visit:

Swagger UI: http://127.0.0.1:8000/docs
ReDoc: http://127.0.0.1:8000/redoc
OpenAPI Schema: http://127.0.0.1:8000/openapi.json

API Endpoints

Authentication

POST /api/v1/auth/register - Register new user
POST /api/v1/auth/jwt/login - Login and get JWT token

Datasets

POST /api/v1/datasets/ - Upload CSV dataset
GET /api/v1/datasets/ - List user's datasets
GET /api/v1/datasets/{id} - Get dataset details
GET /api/v1/datasets/{id}/download - Download dataset
DELETE /api/v1/datasets/{id} - Delete dataset

Pipelines

POST /api/v1/pipelines/ - Create preprocessing pipeline
GET /api/v1/pipelines/ - List user's pipelines
GET /api/v1/pipelines/{id} - Get pipeline details
POST /api/v1/pipelines/{id}/preview - Preview pipeline transformation
PUT /api/v1/pipelines/{id} - Update pipeline steps
DELETE /api/v1/pipelines/{id} - Delete pipeline

Experiments (ML Training)

POST /api/v1/experiments/ - Start training experiment
GET /api/v1/experiments/ - List experiments
GET /api/v1/experiments/{id} - Get experiment status/results
GET /api/v1/experiments/{id}/download-model - Download trained model
DELETE /api/v1/experiments/{id} - Delete experiment

ML Models

GET /api/v1/ml-models/ - List all available algorithms
GET /api/v1/ml-models/{category} - Get algorithms by category
GET /api/v1/ml-models/{algorithm}/schema - Get hyperparameter schema

Analysis & Visualizations

GET /api/v1/analysis/{dataset_id}/distribution - Generate distribution charts (PNG)
GET /api/v1/analysis/{dataset_id}/correlation - Generate correlation heatmaps
GET /api/v1/analysis/{dataset_id}/scatter - Create scatter plots
GET /api/v1/analysis/{dataset_id}/boxplot - Generate box plots
GET /api/v1/analysis/{dataset_id}/histogram - Create histograms
GET /api/v1/analysis/{dataset_id}/pairplot - Generate pair plots
GET /api/v1/analysis/{dataset_id}/summary - Get dataset statistical summary
GET /api/v1/analysis/{dataset_id}/profile - Comprehensive data profiling

AI Services

POST /api/v1/ai/chat - Chat completion with AI models
GET /api/v1/ai/models - List available AI models
POST /api/v1/ai/analyze-data - AI-powered data analysis

Visualizations

POST /api/v1/visualize/line - Create line chart
POST /api/v1/visualize/bar - Create bar chart
POST /api/v1/visualize/scatter - Create scatter plot
POST /api/v1/visualize/histogram - Create histogram
POST /api/v1/visualize/confusion-matrix - Create confusion matrix
And more... (12 visualization types total)

Reports

POST /api/v1/reports/generate/{experiment_id} - Generate comprehensive experiment report
GET /api/v1/reports/{report_id} - Get report details
GET /api/v1/reports/{report_id}/download - Download report file
GET /api/v1/reports/experiment/{experiment_id} - Get reports for specific experiment
DELETE /api/v1/reports/{report_id} - Delete report

ML Recommendations

POST /api/v1/recommendations/algorithms - Get algorithm recommendations for dataset
POST /api/v1/recommendations/analyze-dataset - Analyze dataset characteristics

User Settings

GET /api/v1/users/me/settings - Get user settings including Groq API configuration
PUT /api/v1/users/me/settings - Update user settings and API keys

Activity Tracking

GET /api/v1/activities/ - Get user activities
GET /api/v1/activities/summary - Activity summary statistics
GET /api/v1/activities/recent - Recent activities

Detailed API Documentation

See the /docs directory for comprehensive documentation:

API_EXPERIMENTS.md - Complete experiments API guide
MODEL_EXPORT.md - Model export/download documentation
DIRECT_DATASET_TRAINING.md - Direct dataset training guide
USER_ACTIVITY_TRACKING.md - Activity tracking system
FRONTEND_TRAINING_MODIFICATIONS.md - Frontend integration guide

Authentication

The API uses JWT (JSON Web Tokens) for authentication:

Register a new user:

POST /api/v1/auth/register
Content-Type: application/json

{
  "email": "user@example.com",
  "password": "password123"
}

Login:

POST /api/v1/auth/jwt/login
Content-Type: application/x-www-form-urlencoded

username=user@example.com&password=password123

Use authenticated endpoints:

Include the JWT token in the Authorization header:

Authorization: Bearer <your-jwt-token>
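
The three steps above can be sketched as a small client using only the standard library. The helper names and `BASE_URL` are assumptions for this example, and nothing is sent over the network until a request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # development server address


def build_register_request(email: str, password: str) -> urllib.request.Request:
    body = json.dumps({"email": email, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_login_request(email: str, password: str) -> urllib.request.Request:
    # The login endpoint expects form fields, with the email sent as "username".
    body = urllib.parse.urlencode({"username": email, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/jwt/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_authenticated_request(path: str, token: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{BASE_URL}{path}", headers={"Authorization": f"Bearer {token}"}
    )


login = build_login_request("user@example.com", "password123")
listing = build_authenticated_request("/api/v1/datasets/", "<your-jwt-token>")
```

With the server running, `urllib.request.urlopen(login)` would return the login response. The token is assumed here to arrive in an `access_token` field of the JSON body, which should be checked against the Swagger UI.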

Testing

Run the test suite:

pytest

Run tests with coverage:

pytest --cov=app --cov-report=html

Run specific test file:

pytest tests/test_database.py

Project Structure

datanova-backend/
├── app/
│   ├── api/
│   │   └── routes/
│   │       ├── activities_route.py      # User activity tracking
│   │       ├── analysis_route.py        # Data analysis & visualization
│   │       ├── datasets_route.py        # Dataset management
│   │       ├── experiments_route.py     # ML experiment management
│   │       ├── logs_route.py           # System logging
│   │       ├── ml_models_route.py      # ML model registry
│   │       ├── pipelines_route.py      # Data preprocessing
│   │       ├── recommendations_route.py # AI recommendations
│   │       ├── reports_route.py        # Report generation
│   │       ├── users_route.py          # User management
│   │       └── __init__.py
│   ├── core/
│   │   └── auth.py                     # Authentication configuration
│   ├── db/
│   │   ├── session.py                  # Database session management
│   │   └── __init__.py
│   ├── ml/
│   │   ├── adapters/                   # ML algorithm adapters
│   │   ├── base.py                     # Base ML classes
│   │   ├── dataset_analyzer.py         # Dataset analysis tools
│   │   ├── problem_classifier.py       # Problem type classification
│   │   ├── recommendation_engine.py    # ML recommendation system
│   │   ├── registry.py                 # Algorithm registry
│   │   └── __init__.py
│   ├── models/
│   │   ├── activities.py               # Activity tracking models
│   │   ├── datasets.py                 # Dataset models
│   │   ├── experiments.py              # Experiment models
│   │   ├── logs.py                     # System log models
│   │   ├── pipelines.py                # Pipeline models
│   │   ├── recommendations.py          # Recommendation models
│   │   ├── users.py                    # User models
│   │   └── __init__.py
│   ├── schemas/
│   │   ├── experiments.py              # Experiment schemas
│   │   ├── logs.py                     # Log schemas
│   │   ├── ml_models.py                # ML model schemas
│   │   ├── pipelines.py                # Pipeline schemas
│   │   ├── recommendations.py          # Recommendation schemas
│   │   ├── reports.py                  # Report schemas
│   │   ├── users.py                    # User schemas
│   │   └── __init__.py
│   ├── services/
│   │   ├── activity.py                 # Activity tracking service
│   │   ├── ai_service.py               # AI/Groq integration
│   │   ├── analysis.py                 # Data analysis service
│   │   ├── chart_service.py            # Chart generation
│   │   ├── dataset_analysis.py         # Dataset profiling
│   │   ├── file_processing.py          # File handling
│   │   ├── logger.py                   # Logging service
│   │   ├── pdf_service.py              # PDF generation
│   │   ├── preprocessing.py            # Data preprocessing
│   │   ├── recommendation_service.py   # ML recommendations
│   │   ├── report_service.py           # Report generation
│   │   ├── storage.py                  # File storage
│   │   └── training.py                 # ML training
│   ├── streamlit_ia/                   # Alternative Streamlit UI
│   │   ├── Archecture/                 # Architecture diagrams
│   │   ├── models/                     # Streamlit ML models
│   │   ├── pages/                      # Streamlit pages
│   │   ├── preprocessing/              # Streamlit preprocessing
│   │   ├── tests/                      # Streamlit tests
│   │   ├── uploaded/                   # Upload directory
│   │   ├── utils/                      # Streamlit utilities
│   │   ├── visualization/              # Visualization components
│   │   ├── data_loader.py              # Data loading utilities
│   │   ├── Home.py                     # Streamlit main page
│   │   └── streamlit_app_OLD.py        # Legacy app
│   ├── middleware/
│   │   └── request_logging.py          # Request logging middleware
│   ├── config.py                       # Application configuration
│   ├── main.py                         # FastAPI application
│   └── __init__.py
├── tests/
│   ├── conftest.py                     # Test configuration
│   ├── test_auth.py                    # Authentication tests
│   ├── test_database.py                # Database tests
│   └── __init__.py
├── scripts/
│   └── seed_algorithms.py              # Database seeding
├── alembic/                            # Database migrations
│   ├── versions/                       # Migration files
│   ├── env.py                          # Alembic environment
│   └── script.py.mako                  # Migration template
├── models/                             # Trained model storage
├── reports/
│   └── charts/                         # Generated chart storage
├── temp/                               # Temporary files
├── uploads/                            # User uploads
├── docs/                               # Documentation
├── requirements.txt                    # Python dependencies
├── pytest.ini                         # Test configuration
├── alembic.ini                         # Alembic configuration
├── gg.py                               # Utility script
└── README.md                           # This file

Health Check

Check application health:

GET /health

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-07T12:00:00.000000",
  "version": "1.0.0",
  "environment": "development",
  "database": {
    "status": "connected",
    "type": "postgresql"
  },
  "service": "DataNova API"
}

The health check endpoint returns comprehensive information about the application's status:

status: Overall health status ("healthy" or "unhealthy")
timestamp: Current UTC timestamp in ISO 8601 format
version: Application version (configurable via APP_VERSION environment variable)
environment: Deployment environment (configurable via ENVIRONMENT variable)
database: Database connection status and type
service: Service name identifier
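
As an illustration of how such a payload could be assembled, the sketch below builds a dictionary with the same fields. The function name, the "disconnected" label, and the rule that overall status depends only on the database check are assumptions, not the project's actual logic.

```python
from datetime import datetime, timezone


def build_health_payload(db_connected: bool, version: str = "1.0.0",
                         environment: str = "development") -> dict:
    """Assemble a response with the same fields as the example above."""
    return {
        "status": "healthy" if db_connected else "unhealthy",
        # Timezone-aware UTC time, so the string carries a +00:00 offset.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": version,
        "environment": environment,
        "database": {
            "status": "connected" if db_connected else "disconnected",
            "type": "postgresql",
        },
        "service": "DataNova API",
    }


payload = build_health_payload(db_connected=False)
```

A monitoring client would typically check only the top-level `status` field and treat anything other than "healthy" as a failure.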

CORS Configuration

The API includes Cross-Origin Resource Sharing (CORS) middleware to allow requests from web browsers. By default, it allows requests from common development servers:

http://localhost:3000 (React)
http://localhost:5173 (Vite)
http://127.0.0.1:3000
http://127.0.0.1:5173

Customizing CORS

You can customize CORS settings using environment variables or by modifying the CORS_ORIGINS list in your configuration:

# Allow specific production domain
CORS_ORIGINS=["https://your-frontend-domain.com", "https://www.your-frontend-domain.com"]

# Or add to existing development origins
CORS_ORIGINS=["http://localhost:3000", "http://localhost:5173", "https://your-frontend-domain.com"]

CORS Settings

CORS_ALLOW_CREDENTIALS: Set to true to allow cookies and authorization headers
CORS_ALLOW_METHODS: HTTP methods allowed (default: all methods)
CORS_ALLOW_HEADERS: HTTP headers allowed (default: all headers)
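
To show what these settings control, here is a simplified sketch of the decision a CORS layer makes for each request. It is an illustration only, not the middleware's actual implementation, and `cors_response_headers` is a hypothetical helper.

```python
def cors_response_headers(origin: str, allowed_origins: list[str],
                          allow_credentials: bool = True) -> dict[str, str]:
    """Return the CORS headers to attach for a request from `origin`."""
    if origin not in allowed_origins and "*" not in allowed_origins:
        # Origin not allowed: send no CORS headers, so the browser blocks the response.
        return {}
    # Echo the specific origin; browsers reject a literal "*" on credentialed requests.
    headers = {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    if allow_credentials:
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers


dev_origins = ["http://localhost:3000", "http://localhost:5173"]
allowed = cors_response_headers("http://localhost:5173", dev_origins)
blocked = cors_response_headers("https://example.org", dev_origins)
```

Origins are compared as exact strings, so scheme, host, and port must all match an entry in CORS_ORIGINS.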

Logging

The application includes a custom logging system that:

Logs to console with timestamps and log levels
Persists log messages to the database
Supports different log levels: INFO, ERROR, WARNING
Handles database connection failures gracefully
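
The behavior listed above can be sketched with a custom `logging.Handler`. This is not the code in app/services/logger.py: the `DatabaseLogHandler` class and the `logs` table are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the example is self-contained.

```python
import logging
import sqlite3


class DatabaseLogHandler(logging.Handler):
    """Write log records to a `logs` table; never raise if the write fails."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        super().__init__()
        self.connection = connection

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.connection.execute(
                "INSERT INTO logs (level, message) VALUES (?, ?)",
                (record.levelname, record.getMessage()),
            )
            self.connection.commit()
        except sqlite3.Error:
            pass  # database unavailable: the console handler still gets the record


conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE logs (level TEXT, message TEXT)")

log = logging.getLogger("datanova.sketch")
log.setLevel(logging.INFO)
console = logging.StreamHandler()
console.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
log.addHandler(console)
log.addHandler(DatabaseLogHandler(conn))

log.info("Dataset uploaded")
rows = conn.execute("SELECT level, message FROM logs").fetchall()

conn.close()                  # simulate a lost database connection
log.error("Training failed")  # still printed; the failed insert is ignored
```

Attaching two handlers to one logger gives the console and database destinations, and the guarded `emit` keeps a database outage from raising inside application code.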

License

This project is licensed under the MIT License - see the LICENSE file for details.
