A modern, asynchronous FastAPI backend application for the DataNova project, built with Python and PostgreSQL.
Setup Guide: For complete installation and configuration instructions, please read GUIDE.md.
Detailed Documentation: For a deep dive into the technology stack, design decisions, and component breakdown, please read docs/ARCHITECTURE_DETAILS.md.
DataNova follows a clean, layered architecture designed for scalability and maintainability.
- Presentation Layer (app/api): FastAPI routers handling HTTP requests, validation, and serialization.
- Service Layer (app/services): Contains the core business logic (ML training, PDF generation, AI integration).
- Data Access Layer (app/models, app/db): SQLAlchemy models and async database sessions.
- Storage Layer:
  - PostgreSQL: Stores structured data (Users, Metadata, Relations).
  - File System: Stores heavy assets (CSV Datasets, Joblib Models, PDF Reports, Charts).
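As a rough illustration of how the layers divide responsibilities, the sketch below mimics the three code layers in plain Python. Every name in it (`Dataset`, `DatasetRepository`, `DatasetService`, `list_datasets_endpoint`) is hypothetical and not taken from the DataNova codebase. An in-memory list stands in for the async SQLAlchemy session, and a plain coroutine stands in for a FastAPI route.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Dataset:
    """Stand-in for a SQLAlchemy model (data access layer)."""
    id: int
    owner_id: int
    filename: str


class DatasetRepository:
    """Data access layer: hides how rows are stored and queried."""

    def __init__(self) -> None:
        # An in-memory list replaces the async database session here.
        self._rows = [
            Dataset(1, owner_id=10, filename="iris.csv"),
            Dataset(2, owner_id=11, filename="sales.csv"),
        ]

    async def list_for_owner(self, owner_id: int) -> list[Dataset]:
        return [row for row in self._rows if row.owner_id == owner_id]


class DatasetService:
    """Service layer: business rules only, no HTTP or SQL details."""

    def __init__(self, repo: DatasetRepository) -> None:
        self._repo = repo

    async def list_filenames(self, owner_id: int) -> list[str]:
        datasets = await self._repo.list_for_owner(owner_id)
        return sorted(d.filename for d in datasets)


async def list_datasets_endpoint(user_id: int, service: DatasetService) -> dict:
    """Presentation layer: turns the service result into a response body."""
    return {"datasets": await service.list_filenames(user_id)}


response = asyncio.run(
    list_datasets_endpoint(10, DatasetService(DatasetRepository()))
)
```

Because each layer only talks to the one directly below it, the repository can be swapped for a test double without touching the endpoint or the service.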
The project includes PlantUML diagrams in the modelisation/ directory to visualize the system:
- Context Diagram (context_diagram.puml): High-level system interactions.
- Use Case Diagram (usecase_diagram.puml): Functional requirements and actor interactions.
- Class Diagram (class_diagram.puml): Key entity relationships (User, Dataset, Experiment).
- Flow Diagram (flow_diagram.puml): Data processing sequence.
- State Diagram (state_diagram.puml): Lifecycle of a Machine Learning Experiment.
We use PlantUML for system modeling. You can find the source files in the modelisation/ directory.
High-level view of how users interact with DataNova and its external dependencies (Groq AI, Database).
(See modelisation/context_diagram.puml)
The core data models representing the domain logic:
- User: Owner of resources.
- Dataset: Raw uploaded data.
- Pipeline: Preprocessing steps configuration.
- Experiment: A run of an Algorithm on a Dataset or Pipeline.
- Report: PDF output of results.
(See modelisation/class_diagram.puml)
How data transforms from upload to report generation.
(See modelisation/flow_diagram.puml)
app/
├── api/ # API Route handlers (Controllers)
├── core/ # Config, Security, Auth
├── db/ # Database connection & Session logic
├── models/ # SQLAlchemy ORM Models
├── schemas/ # Pydantic Schemas (Validation)
├── services/ # Business Logic (ML, AI, Files)
│ ├── ai_service.py # Groq LLM integration
│ ├── training.py # ML model training
│ ├── pdf_service.py # Report generation
│ └── ...
└── main.py # App entry point
- Asynchronous API: Built with FastAPI for high-performance async operations
- User Authentication: JWT-based authentication with fastapi-users
- Database Integration: PostgreSQL with async SQLAlchemy ORM
- Database Migrations: Alembic support for schema management
- API Documentation: Automatic OpenAPI/Swagger documentation
- 17 ML Algorithms: 6 classification, 7 regression, 4 clustering algorithms
- Data Preprocessing Pipelines: Chain transformations (scaling, encoding, feature selection)
- Dual-Mode Training: Train on preprocessed pipelines OR raw datasets directly
- Background Training: Async training jobs with status tracking
- Model Export: Download trained models as joblib files
- Visualization: 12 chart types including confusion matrix, ROC curves, scatter plots
- ML Model Recommendations: AI-powered algorithm recommendations based on dataset characteristics
- Groq API Integration: Advanced AI capabilities with user-configurable API keys
- Comprehensive Report Generation: Create detailed analysis reports from experiments
- Professional Visualizations: Automated chart generation with dataset insights
- Dataset Analysis: In-depth statistical analysis and data profiling
- Export Capabilities: Download reports in multiple formats
- Interactive Dashboard: Visual report management interface
- 12+ Chart Types: Distribution plots, correlation heatmaps, scatter plots, bar charts
- Statistical Analysis: Comprehensive dataset profiling and statistics
- Interactive Visualizations: Real-time chart generation from API endpoints
- PDF Report Generation: Professional PDF reports with embedded visualizations
- Chart Export: Download charts as PNG images
- Data Profiling: Automated data quality assessment and insights
- Groq AI Chat: Integrated AI chat completion with multiple models
- User-Configurable AI: Personal Groq API keys for enhanced AI features
- ML Algorithm Recommendations: AI-powered algorithm suggestions
- Smart Data Analysis: AI-driven insights and recommendations
- Multiple AI Models: Support for various Groq AI models (llama-3.3-70b-versatile, etc.)
- Alternative UI: Complete Streamlit-based ML workflow interface
- French Language Support: Fully localized interface
- Visual ML Pipeline: Drag-and-drop ML workflow builder
- Real-time Preprocessing: Interactive data transformation tools
- Model Comparison: Side-by-side algorithm performance analysis
- Comprehensive Logging: Track all user actions across resources
- 13 Tracked Operations: Authentication, datasets, pipelines, experiments, models
- Activity Analytics: Query by user, resource type, action type, date range
- Request Context: IP address and user agent tracking
- Non-Blocking: Activity failures don't affect API operations
- Request Logging Middleware: Track all incoming requests with timing
- Custom Logging: Persists to both database and console
- Health Checks: Built-in health check endpoints
- Testing: Comprehensive test suite with pytest and async support
DataNova provides two complementary interfaces:
- FastAPI REST API: Modern async API for web and mobile applications
- Streamlit ML AutoFlow: Interactive web interface for visual ML workflows
- English: Primary API documentation and interface
- French: Complete Streamlit interface localization
- PDF Generation: High-quality reports with embedded charts
- Chart Gallery: Professional visualization library
- Statistical Insights: Automated data analysis summaries
- Export Options: Multiple format support (PDF, PNG, CSV)
- Activity Tracking: Comprehensive audit logging
- User Management: Role-based access control
- Request Logging: Performance monitoring and debugging
- Health Monitoring: System health checks and status reporting
- CORS Configuration: Production-ready security settings
- Python 3.10 or higher
- PostgreSQL database
- pip (Python package manager)
- Clone the repository:
    git clone https://github.com/Data-Nova-Project/datanova-backend.git
    cd datanova-backend
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables: Create a .env file in the root directory with the following variables:
    DATABASE_URL=postgresql+asyncpg://username:password@localhost:5432/database_name
    SECRET_KEY=your-secret-key-here
    GROQ_API_KEY=your-groq-api-key-here
    # Optional Configuration
    APP_VERSION=1.0.0
    ENVIRONMENT=development
    DATABASE_ECHO=false
    PROJECT_NAME="DataNova"
    # CORS Settings (JSON-formatted lists)
    CORS_ORIGINS=["http://localhost:3000","http://localhost:5173"]
    CORS_ALLOW_CREDENTIALS=true
    CORS_ALLOW_METHODS=["*"]
    CORS_ALLOW_HEADERS=["*"]
The application uses Pydantic settings for configuration. Key settings include:
- DATABASE_URL: PostgreSQL connection string (async format)
- SECRET_KEY: Secret key for JWT token signing
- DATABASE_ECHO: Enable SQL query logging (default: False)
- PROJECT_NAME: Application name (default: "DataNova")
- CORS_ORIGINS: List of allowed origins for CORS (default: localhost development servers)
- CORS_ALLOW_CREDENTIALS: Allow credentials in CORS requests (default: True)
- CORS_ALLOW_METHODS: Allowed HTTP methods for CORS (default: ["*"])
- CORS_ALLOW_HEADERS: Allowed headers for CORS (default: ["*"])
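The list-valued CORS settings are written as JSON arrays in the environment, which is easy to get wrong. The sketch below shows one way such values could be parsed and defaulted using only the standard library. It is not the project's Pydantic settings class: `CorsSettings`, `load_cors_settings`, and the helper functions are hypothetical names, and the accepted spellings for boolean flags are an assumption.

```python
import json
from collections.abc import Mapping
from dataclasses import dataclass

# Defaults mirror the ones listed in this README.
DEV_ORIGINS = [
    "http://localhost:3000",
    "http://localhost:5173",
    "http://127.0.0.1:3000",
    "http://127.0.0.1:5173",
]


@dataclass(frozen=True)
class CorsSettings:
    origins: list[str]
    allow_credentials: bool
    allow_methods: list[str]
    allow_headers: list[str]


def _json_list(env: Mapping[str, str], name: str, default: list[str]) -> list[str]:
    """Read a JSON array of strings, falling back to the default if unset."""
    raw = env.get(name)
    if raw is None:
        return list(default)
    value = json.loads(raw)  # raises ValueError on non-JSON input
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"{name} must be a JSON array of strings")
    return value


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean flag; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_cors_settings(env: Mapping[str, str]) -> CorsSettings:
    return CorsSettings(
        origins=_json_list(env, "CORS_ORIGINS", DEV_ORIGINS),
        allow_credentials=_flag(env, "CORS_ALLOW_CREDENTIALS", True),
        allow_methods=_json_list(env, "CORS_ALLOW_METHODS", ["*"]),
        allow_headers=_json_list(env, "CORS_ALLOW_HEADERS", ["*"]),
    )
```

Calling `load_cors_settings(os.environ)` would read the real process environment. In this sketch, a value such as `http://a.example,http://b.example` is rejected with a `ValueError` because it is not a JSON array.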
- Ensure PostgreSQL is running and accessible
- Create a database for the application
- Run database migrations:
alembic upgrade head
Start the development server with auto-reload:
uvicorn app.main:app --reload
The application will be available at http://127.0.0.1:8000
To listen on all network interfaces on port 8000:
uvicorn app.main:app --host 0.0.0.0 --port 8000
Once the application is running, visit:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
- OpenAPI Schema: http://127.0.0.1:8000/openapi.json
- POST /api/v1/auth/register - Register new user
- POST /api/v1/auth/jwt/login - Login and get JWT token

- POST /api/v1/datasets/ - Upload CSV dataset
- GET /api/v1/datasets/ - List user's datasets
- GET /api/v1/datasets/{id} - Get dataset details
- GET /api/v1/datasets/{id}/download - Download dataset
- DELETE /api/v1/datasets/{id} - Delete dataset

- POST /api/v1/pipelines/ - Create preprocessing pipeline
- GET /api/v1/pipelines/ - List user's pipelines
- GET /api/v1/pipelines/{id} - Get pipeline details
- POST /api/v1/pipelines/{id}/preview - Preview pipeline transformation
- PUT /api/v1/pipelines/{id} - Update pipeline steps
- DELETE /api/v1/pipelines/{id} - Delete pipeline

- POST /api/v1/experiments/ - Start training experiment
- GET /api/v1/experiments/ - List experiments
- GET /api/v1/experiments/{id} - Get experiment status/results
- GET /api/v1/experiments/{id}/download-model - Download trained model
- DELETE /api/v1/experiments/{id} - Delete experiment

- GET /api/v1/ml-models/ - List all available algorithms
- GET /api/v1/ml-models/{category} - Get algorithms by category
- GET /api/v1/ml-models/{algorithm}/schema - Get hyperparameter schema

- GET /api/v1/analysis/{dataset_id}/distribution - Generate distribution charts (PNG)
- GET /api/v1/analysis/{dataset_id}/correlation - Generate correlation heatmaps
- GET /api/v1/analysis/{dataset_id}/scatter - Create scatter plots
- GET /api/v1/analysis/{dataset_id}/boxplot - Generate box plots
- GET /api/v1/analysis/{dataset_id}/histogram - Create histograms
- GET /api/v1/analysis/{dataset_id}/pairplot - Generate pair plots
- GET /api/v1/analysis/{dataset_id}/summary - Get dataset statistical summary
- GET /api/v1/analysis/{dataset_id}/profile - Comprehensive data profiling

- POST /api/v1/ai/chat - Chat completion with AI models
- GET /api/v1/ai/models - List available AI models
- POST /api/v1/ai/analyze-data - AI-powered data analysis

- POST /api/v1/visualize/line - Create line chart
- POST /api/v1/visualize/bar - Create bar chart
- POST /api/v1/visualize/scatter - Create scatter plot
- POST /api/v1/visualize/histogram - Create histogram
- POST /api/v1/visualize/confusion-matrix - Create confusion matrix
- And more... (12 visualization types total)

- POST /api/v1/reports/generate/{experiment_id} - Generate comprehensive experiment report
- GET /api/v1/reports/{report_id} - Get report details
- GET /api/v1/reports/{report_id}/download - Download report file
- GET /api/v1/reports/experiment/{experiment_id} - Get reports for specific experiment
- DELETE /api/v1/reports/{report_id} - Delete report

- POST /api/v1/recommendations/algorithms - Get algorithm recommendations for dataset
- POST /api/v1/recommendations/analyze-dataset - Analyze dataset characteristics

- GET /api/v1/users/me/settings - Get user settings including Groq API configuration
- PUT /api/v1/users/me/settings - Update user settings and API keys

- GET /api/v1/activities/ - Get user activities
- GET /api/v1/activities/summary - Activity summary statistics
- GET /api/v1/activities/recent - Recent activities
See the /docs directory for comprehensive documentation:
- API_EXPERIMENTS.md - Complete experiments API guide
- MODEL_EXPORT.md - Model export/download documentation
- DIRECT_DATASET_TRAINING.md - Direct dataset training guide
- USER_ACTIVITY_TRACKING.md - Activity tracking system
- FRONTEND_TRAINING_MODIFICATIONS.md - Frontend integration guide
The API uses JWT (JSON Web Tokens) for authentication:
POST /api/v1/auth/register
Content-Type: application/json
{
  "email": "user@example.com",
  "password": "password123"
}

POST /api/v1/auth/jwt/login
Content-Type: application/x-www-form-urlencoded

username=user@example.com&password=password123

Include the JWT token in the Authorization header:
Authorization: Bearer <your-jwt-token>
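The register and login calls use different body encodings (JSON versus form-encoded), which is a common source of client errors. The sketch below builds both requests with the standard library without sending them. The helper names are hypothetical, and `BASE_URL` assumes the local development address used elsewhere in this README.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed local development address


def build_register_request(email: str, password: str) -> urllib.request.Request:
    """Registration expects a JSON body."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_login_request(email: str, password: str) -> urllib.request.Request:
    """Login expects form fields; the email goes in the 'username' field."""
    body = urllib.parse.urlencode(
        {"username": email, "password": password}
    ).encode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/auth/jwt/login",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def auth_header(token: str) -> dict[str, str]:
    """Header to attach to every authenticated request."""
    return {"Authorization": f"Bearer {token}"}
```

Either request can be sent with `urllib.request.urlopen(request)`. The name of the token field in the login response is not documented here; with the fastapi-users bearer transport it is expected to be `access_token`, but check the live response at /docs before relying on that.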
Run the test suite:
pytest
Run tests with coverage:
pytest --cov=app --cov-report=html
Run a specific test file:
pytest tests/test_database.py
datanova-backend/
├── app/
│ ├── api/
│ │ └── routes/
│ │     ├── activities_route.py # User activity tracking
│ │     ├── analysis_route.py # Data analysis & visualization
│ │     ├── datasets_route.py # Dataset management
│ │     ├── experiments_route.py # ML experiment management
│ │     ├── logs_route.py # System logging
│ │     ├── ml_models_route.py # ML model registry
│ │     ├── pipelines_route.py # Data preprocessing
│ │     ├── recommendations_route.py # AI recommendations
│ │     ├── reports_route.py # Report generation
│ │     ├── users_route.py # User management
│ │     └── __init__.py
│ ├── core/
│ │ └── auth.py # Authentication configuration
│ ├── db/
│ │ ├── session.py # Database session management
│ │ └── __init__.py
│ ├── ml/
│ │ ├── adapters/ # ML algorithm adapters
│ │ ├── base.py # Base ML classes
│ │ ├── dataset_analyzer.py # Dataset analysis tools
│ │ ├── problem_classifier.py # Problem type classification
│ │ ├── recommendation_engine.py # ML recommendation system
│ │ ├── registry.py # Algorithm registry
│ │ └── __init__.py
│ ├── models/
│ │ ├── activities.py # Activity tracking models
│ │ ├── datasets.py # Dataset models
│ │ ├── experiments.py # Experiment models
│ │ ├── logs.py # System log models
│ │ ├── pipelines.py # Pipeline models
│ │ ├── recommendations.py # Recommendation models
│ │ ├── users.py # User models
│ │ └── __init__.py
│ ├── schemas/
│ │ ├── experiments.py # Experiment schemas
│ │ ├── logs.py # Log schemas
│ │ ├── ml_models.py # ML model schemas
│ │ ├── pipelines.py # Pipeline schemas
│ │ ├── recommendations.py # Recommendation schemas
│ │ ├── reports.py # Report schemas
│ │ ├── users.py # User schemas
│ │ └── __init__.py
│ ├── services/
│ │ ├── activity.py # Activity tracking service
│ │ ├── ai_service.py # AI/Groq integration
│ │ ├── analysis.py # Data analysis service
│ │ ├── chart_service.py # Chart generation
│ │ ├── dataset_analysis.py # Dataset profiling
│ │ ├── file_processing.py # File handling
│ │ ├── logger.py # Logging service
│ │ ├── pdf_service.py # PDF generation
│ │ ├── preprocessing.py # Data preprocessing
│ │ ├── recommendation_service.py # ML recommendations
│ │ ├── report_service.py # Report generation
│ │ ├── storage.py # File storage
│ │ └── training.py # ML training
│ ├── streamlit_ia/ # Alternative Streamlit UI
│ │ ├── Archecture/ # Architecture diagrams
│ │ ├── models/ # Streamlit ML models
│ │ ├── pages/ # Streamlit pages
│ │ ├── preprocessing/ # Streamlit preprocessing
│ │ ├── tests/ # Streamlit tests
│ │ ├── uploaded/ # Upload directory
│ │ ├── utils/ # Streamlit utilities
│ │ ├── visualization/ # Visualization components
│ │ ├── data_loader.py # Data loading utilities
│ │ ├── Home.py # Streamlit main page
│ │ └── streamlit_app_OLD.py # Legacy app
│ ├── middleware/
│ │ └── request_logging.py # Request logging middleware
│ ├── config.py # Application configuration
│ ├── main.py # FastAPI application
│ └── __init__.py
├── tests/
│ ├── conftest.py # Test configuration
│ ├── test_auth.py # Authentication tests
│ ├── test_database.py # Database tests
│ └── __init__.py
├── scripts/
│ └── seed_algorithms.py # Database seeding
├── alembic/ # Database migrations
│ ├── versions/ # Migration files
│ ├── env.py # Alembic environment
│ └── script.py.mako # Migration template
├── models/ # Trained model storage
├── reports/
│ └── charts/ # Generated chart storage
├── temp/ # Temporary files
├── uploads/ # User uploads
├── docs/ # Documentation
├── requirements.txt # Python dependencies
├── pytest.ini # Test configuration
├── alembic.ini # Alembic configuration
├── gg.py # Utility script
└── README.md # This file
Check application health:
GET /health
Response:
{
  "status": "healthy",
  "timestamp": "2025-12-07T12:00:00.000000",
  "version": "1.0.0",
  "environment": "development",
  "database": {
    "status": "connected",
    "type": "postgresql"
  },
  "service": "DataNova API"
}
The health check endpoint returns comprehensive information about the application's status:
- status: Overall health status ("healthy" or "unhealthy")
- timestamp: Current UTC timestamp in ISO format
- version: Application version (configurable via APP_VERSION environment variable)
- environment: Deployment environment (configurable via ENVIRONMENT variable)
- database: Database connection status and type
- service: Service name identifier
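A monitoring script can reduce the health response to a single pass/fail decision. The following is a minimal sketch of that idea using the response fields described above. `check_health` is a hypothetical helper, and treating any database status other than "connected" as a failure is an assumption, since only the "connected" value is documented here.

```python
import json
from datetime import datetime


def check_health(raw: str) -> tuple[bool, str]:
    """Return (ok, summary) for a /health response body."""
    payload = json.loads(raw)
    app_ok = payload.get("status") == "healthy"
    # Assumption: anything other than "connected" counts as a failure.
    db_ok = payload.get("database", {}).get("status") == "connected"
    checked_at = datetime.fromisoformat(payload["timestamp"])
    summary = (
        f"{payload.get('service', 'unknown')} "
        f"{payload.get('version', '?')} "
        f"({payload.get('environment', '?')}): "
        f"app={'ok' if app_ok else 'FAIL'}, "
        f"db={'ok' if db_ok else 'FAIL'} "
        f"at {checked_at:%Y-%m-%d %H:%M:%S}"
    )
    return app_ok and db_ok, summary
```

Checking the nested database status separately matters if the service can report itself as running while its database is unreachable; whether DataNova ever returns that combination is not stated here.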
The API includes Cross-Origin Resource Sharing (CORS) middleware to allow requests from web browsers. By default, it allows requests from common development servers:
- http://localhost:3000 (React)
- http://localhost:5173 (Vite)
- http://127.0.0.1:3000
- http://127.0.0.1:5173
You can customize CORS settings using environment variables or by modifying the CORS_ORIGINS list in your configuration:
# Allow specific production domain
CORS_ORIGINS=["https://your-frontend-domain.com", "https://www.your-frontend-domain.com"]
# Or add to existing development origins
CORS_ORIGINS=["http://localhost:3000", "http://localhost:5173", "https://your-frontend-domain.com"]
- CORS_ALLOW_CREDENTIALS: Set to true to allow cookies and authorization headers
- CORS_ALLOW_METHODS: HTTP methods allowed (default: all methods)
- CORS_ALLOW_HEADERS: HTTP headers allowed (default: all headers)
The application includes a custom logging system that:
- Logs to console with timestamps and log levels
- Persists log messages to the database
- Supports different log levels: INFO, ERROR, WARNING
- Handles database connection failures gracefully
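The behaviour listed above (console output, database persistence, and tolerance of database failures) can be sketched with a custom `logging.Handler`. This is an illustration only, not the project's logger service: `DatabaseLogHandler`, `build_logger`, and the `logs` table layout are hypothetical, and SQLite stands in for PostgreSQL so the example stays self-contained.

```python
import logging
import sqlite3


class DatabaseLogHandler(logging.Handler):
    """Persists log records to a database and swallows database errors."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        super().__init__()
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS logs ("
            "level TEXT NOT NULL, message TEXT NOT NULL, created REAL NOT NULL)"
        )
        self.dropped = 0  # records that could not be persisted

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._conn.execute(
                "INSERT INTO logs (level, message, created) VALUES (?, ?, ?)",
                (record.levelname, record.getMessage(), record.created),
            )
            self._conn.commit()
        except sqlite3.Error:
            # A logging failure must never break the calling request.
            self.dropped += 1


def build_logger(
    connection: sqlite3.Connection,
) -> tuple[logging.Logger, DatabaseLogHandler]:
    logger = logging.getLogger("datanova.sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False

    # Console output with timestamp and level.
    console = logging.StreamHandler()
    console.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    db_handler = DatabaseLogHandler(connection)
    logger.addHandler(console)
    logger.addHandler(db_handler)
    return logger, db_handler
```

With this arrangement the console handler still prints a record when the database insert fails, so the message is not lost entirely.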
This project is licensed under the MIT License - see the LICENSE file for details.