Job Data Cleanup System¶

The EEMT web interface includes an automated job data cleanup system to manage disk space by removing old job data while preserving job metadata for auditing and analysis.

Overview¶

The cleanup system automatically manages job data retention with configurable policies:

Successful Jobs: Output data deleted after 7 days, job configurations preserved indefinitely
Failed Jobs: All job files deleted after 12 hours to free resources quickly
Job Metadata: Always preserved in the database for historical analysis
Configurable Retention: Adjust periods via environment variables
Manual Cleanup: API endpoint and CLI script for on-demand cleanup

Data Retention Philosophy

The cleanup system balances disk space management with data preservation needs:

Job configurations and metadata are always retained for analysis
Output data for successful jobs is kept long enough for users to download
Failed job data is cleaned quickly as it's typically not needed
All retention periods are configurable to match your requirements

Architecture¶

graph TB
    subgraph "Cleanup Components"
        CS[cleanup_jobs.py<br/>Standalone Script]
        API["/api/cleanup<br/>REST Endpoint"]
        CRON[Cron Job<br/>Automated Schedule]
        SD[Systemd Timer<br/>Alternative Scheduler]
    end

    subgraph "Data Storage"
        DB[(SQLite<br/>Job Database)]
        UP[uploads/<br/>Input DEMs]
        RES[results/<br/>Output Data]
        TEMP[temp/<br/>Working Files]
    end

    subgraph "Cleanup Process"
        SCAN[Scan Jobs]
        CHECK[Check Age]
        DEL[Delete Data]
        LOG[Update Logs]
    end

    CS --> SCAN
    API --> SCAN
    CRON --> CS
    SD --> CS

    SCAN --> CHECK
    CHECK --> DEL
    DEL --> LOG

    DEL --> UP
    DEL --> RES
    DEL --> TEMP
    LOG --> DB

Retention Policies¶

Default Policies¶

Job Status	Data Type	Retention Period	Action
Successful	Output data	7 days	Delete files, keep DB record
Successful	Input DEM	7 days	Delete file if exists
Successful	Job metadata	Forever	Preserved in database
Failed	All data	12 hours	Delete all files
Failed	Job metadata	Forever	Preserved with error info
Running	All data	N/A	Never deleted automatically
Pending	All data	N/A	Never deleted automatically
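
The policy table above can be read as a simple eligibility rule. The following is a minimal sketch, not the actual logic of `cleanup_jobs.py`: the function name `is_eligible_for_cleanup`, the `finished_at` argument, and the status strings `completed` and `failed` are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default retention periods from the table above, in hours.
# The status names are assumed; adjust them to match your database.
RETENTION_HOURS = {"completed": 168, "failed": 12}


def is_eligible_for_cleanup(status: str, finished_at: datetime,
                            now: Optional[datetime] = None) -> bool:
    """Return True if a job's files have outlived the retention period for its status."""
    now = now or datetime.now()
    retention = RETENTION_HOURS.get(status)
    if retention is None:
        # Running, pending, or unknown statuses are never cleaned automatically.
        return False
    return now - finished_at > timedelta(hours=retention)
```

Passing `now` explicitly keeps the rule easy to test; in normal use it can be omitted.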

Configurable Retention¶

Adjust retention periods using environment variables:

# Set retention periods (in hours)
export EEMT_SUCCESS_RETENTION_HOURS=168  # 7 days (default)
export EEMT_FAILED_RETENTION_HOURS=12    # 12 hours (default)

# Examples of common configurations
export EEMT_SUCCESS_RETENTION_HOURS=336  # 14 days for longer retention
export EEMT_SUCCESS_RETENTION_HOURS=72   # 3 days for rapid cleanup
export EEMT_FAILED_RETENTION_HOURS=24    # 24 hours for debugging
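
A script that honors these variables has to fall back to the defaults and reject bad values. The sketch below shows one way to do that; the helper name `retention_hours_from_env` is hypothetical and the real script may parse its settings differently.

```python
import os


def retention_hours_from_env(name: str, default: int) -> int:
    """Read a retention period in hours from the environment, with a default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        hours = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be a whole number of hours, got {raw!r}") from None
    if hours <= 0:
        raise ValueError(f"{name} must be positive, got {hours}")
    return hours


# Defaults match the documented values: 168 hours (7 days) and 12 hours.
success_retention = retention_hours_from_env("EEMT_SUCCESS_RETENTION_HOURS", 168)
failed_retention = retention_hours_from_env("EEMT_FAILED_RETENTION_HOURS", 12)
```

Rejecting zero and negative values is a design choice here: a retention of 0 hours would delete data as soon as a job finishes, which is rarely intended.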

Cleanup Script Usage¶

Standalone Script (`cleanup_jobs.py`)¶

The cleanup script can be run manually or scheduled:

Basic Usage¶

# Run cleanup with default settings
python scripts/cleanup_jobs.py

# Specify custom database and directories
python scripts/cleanup_jobs.py \
    --db-path /path/to/jobs.db \
    --uploads-dir /path/to/uploads \
    --results-dir /path/to/results

Command-Line Options¶

Option	Description	Default
`--db-path`	Path to SQLite database	`./jobs.db`
`--uploads-dir`	Directory containing uploaded DEMs	`./uploads`
`--results-dir`	Directory containing job outputs	`./results`
`--success-retention`	Hours to keep successful job data	168 (7 days)
`--failed-retention`	Hours to keep failed job data	12
`--dry-run`	Preview cleanup without deleting	False
`--verbose`	Enable detailed logging	False

Dry Run Mode¶

Preview what would be deleted without making changes:

# See what would be cleaned up
python scripts/cleanup_jobs.py --dry-run --verbose

# Example output
[DRY RUN] Would clean job job-20240115-123456 (status: completed, age: 8.2 days)
[DRY RUN] Would delete: uploads/job-20240115-123456_dem.tif
[DRY RUN] Would delete: results/job-20240115-123456/ (15 GB)
[DRY RUN] Summary: Would clean 3 jobs, freeing ~45 GB
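
Conceptually, dry-run mode measures what would be removed and skips the deletion itself. The sketch below illustrates that pattern with hypothetical helpers (`path_size_bytes`, `remove_path`); it is not taken from `cleanup_jobs.py`.

```python
import shutil
from pathlib import Path


def path_size_bytes(path: Path) -> int:
    """Total size of a file, or of all files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def remove_path(path: Path, dry_run: bool = True) -> int:
    """Delete a file or directory tree and return the bytes freed.

    In dry-run mode nothing is deleted; the size that would be freed is returned.
    """
    if not path.exists():
        return 0
    size = path_size_bytes(path)
    if dry_run:
        print(f"[DRY RUN] Would delete: {path} ({size} bytes)")
        return size
    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()
    return size
```

Defaulting `dry_run` to `True` means a caller has to opt in explicitly before anything is removed.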

Verbose Output¶

Get detailed information during cleanup:

python scripts/cleanup_jobs.py --verbose

# Example output
[INFO] Starting job cleanup process...
[INFO] Checking job job-20240115-123456 (completed, 8 days old)
[INFO] Deleting uploads/job-20240115-123456_dem.tif (250 MB)
[INFO] Deleting results/job-20240115-123456/ (15 GB)
[INFO] Updated database: marked job as cleaned
[INFO] Cleaned 3 successful jobs (freed 45 GB)
[INFO] Cleaned 2 failed jobs (freed 500 MB)
[INFO] Cleanup completed successfully

Automated Scheduling¶

Cron Setup¶

Use the provided setup script for automatic cron configuration:

# Install cron job (runs daily at 2 AM)
./scripts/setup_cleanup_cron.sh install

# Verify installation
./scripts/setup_cleanup_cron.sh status

# Remove cron job
./scripts/setup_cleanup_cron.sh remove

Manual Cron Configuration¶

For custom scheduling, edit crontab directly:

# Edit crontab
crontab -e

# Add cleanup job (adjust schedule as needed)
# Daily at 2:00 AM
0 2 * * * cd /path/to/eemt/web-interface && /usr/bin/python3 scripts/cleanup_jobs.py >> logs/cleanup.log 2>&1

# Every 6 hours
0 */6 * * * cd /path/to/eemt/web-interface && /usr/bin/python3 scripts/cleanup_jobs.py

# Weekly on Sunday at 3 AM
0 3 * * 0 cd /path/to/eemt/web-interface && /usr/bin/python3 scripts/cleanup_jobs.py

Systemd Timer Setup¶

For systems using systemd, use timer units for more control:

# Install systemd timer and service
./scripts/setup_cleanup_cron.sh install-systemd

# Check timer status
systemctl --user status eemt-cleanup.timer

# View cleanup logs
journalctl --user -u eemt-cleanup.service

# Manual trigger
systemctl --user start eemt-cleanup.service

Systemd Configuration Files¶

The setup script creates these systemd files:

~/.config/systemd/user/eemt-cleanup.service:

[Unit]
Description=EEMT Job Cleanup Service
After=network.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/eemt/web-interface
ExecStart=/usr/bin/python3 scripts/cleanup_jobs.py
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=default.target

~/.config/systemd/user/eemt-cleanup.timer:

[Unit]
Description=Daily EEMT Job Cleanup Timer
# No Requires= is needed: a timer activates the service with the same name by default

[Timer]
# Daily at 2:00 AM (multiple OnCalendar= lines would add triggers, not combine)
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Docker Integration¶

Docker Compose Configuration¶

The cleanup system integrates with Docker deployments:

# docker-compose.yml
services:
  eemt-web:
    image: eemt-web:latest
    environment:
      - EEMT_SUCCESS_RETENTION_HOURS=168
      - EEMT_FAILED_RETENTION_HOURS=12
      - EEMT_ENABLE_AUTO_CLEANUP=true
      - EEMT_CLEANUP_SCHEDULE=0 2 * * *  # Cron expression (unquoted: quotes would become part of the value)
    volumes:
      - ./data/uploads:/app/uploads
      - ./data/results:/app/results
      - ./jobs.db:/app/jobs.db

Container Cleanup Script¶

Run cleanup inside the web interface container:

# Execute cleanup in running container
docker exec eemt-web python scripts/cleanup_jobs.py

# Run cleanup in new container
docker run --rm \
  -v "$(pwd)/data/uploads:/app/uploads" \
  -v "$(pwd)/data/results:/app/results" \
  -v "$(pwd)/jobs.db:/app/jobs.db" \
  eemt-web python scripts/cleanup_jobs.py

Kubernetes CronJob¶

For Kubernetes deployments, use a CronJob resource:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: eemt-cleanup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: eemt-web:latest
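            command:
            - python
            - scripts/cleanup_jobs.py
            # Point the script at the mounted volume (assumes this layout under /app/data)
            - --db-path
            - /app/data/jobs.db
            - --uploads-dir
            - /app/data/uploads
            - --results-dir
            - /app/data/results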
            env:
            - name: EEMT_SUCCESS_RETENTION_HOURS
              value: "168"
            - name: EEMT_FAILED_RETENTION_HOURS
              value: "12"
            volumeMounts:
            - name: data
              mountPath: /app/data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: eemt-data-pvc
          restartPolicy: OnFailure

API Endpoint¶

REST API for Manual Cleanup¶

Trigger cleanup via the web interface API:

Endpoint Details¶

POST /api/cleanup
Content-Type: application/json

{
  "dry_run": false,
  "success_retention_hours": 168,
  "failed_retention_hours": 12
}

Response Format¶

{
  "success": true,
  "cleaned_jobs": {
    "successful": 3,
    "failed": 2
  },
  "space_freed": {
    "bytes": 48318382080,
    "human_readable": "45.0 GB"
  },
  "details": [
    {
      "job_id": "job-20240115-123456",
      "status": "completed",
      "age_days": 8.2,
      "space_freed": "15.0 GB"
    }
  ]
}
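
The `human_readable` field can be derived from `bytes` using binary units (1 GB = 1024³ bytes), which is consistent with the sample response above. This is a sketch of one possible conversion; the function name is an assumption, and the server may format the value differently.

```python
def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units, one decimal place."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    index = 0
    # Divide by 1024 until the value fits the current unit or units run out.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.1f} {units[index]}"


print(human_readable(48318382080))  # prints "45.0 GB"
```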

Usage Examples¶

# Trigger cleanup with defaults
curl -X POST http://localhost:5000/api/cleanup

# Dry run to preview
curl -X POST http://localhost:5000/api/cleanup \
  -H "Content-Type: application/json" \
  -d '{"dry_run": true}'

# Custom retention periods
curl -X POST http://localhost:5000/api/cleanup \
  -H "Content-Type: application/json" \
  -d '{
    "success_retention_hours": 72,
    "failed_retention_hours": 6
  }'

Python Client Example¶

import requests

def cleanup_old_jobs(base_url="http://localhost:5000", dry_run=True):
    """Trigger job cleanup via the API and print a summary."""

    response = requests.post(
        f"{base_url}/api/cleanup",
        json={
            "dry_run": dry_run,
            "success_retention_hours": 168,  # 7 days
            "failed_retention_hours": 12     # 12 hours
        },
        timeout=300,  # large deletions can be slow, but never wait forever
    )

    if response.status_code == 200:
        result = response.json()
        print(f"Cleanup {'preview' if dry_run else 'completed'}:")
        print(f"  Successful jobs cleaned: {result['cleaned_jobs']['successful']}")
        print(f"  Failed jobs cleaned: {result['cleaned_jobs']['failed']}")
        print(f"  Space freed: {result['space_freed']['human_readable']}")
    else:
        print(f"Cleanup failed: {response.text}")

# Preview cleanup
cleanup_old_jobs(dry_run=True)

# Execute cleanup
cleanup_old_jobs(dry_run=False)

Configuration Options¶

Environment Variables¶

Configure cleanup behavior through environment variables:

Variable	Description	Default
`EEMT_SUCCESS_RETENTION_HOURS`	Hours to keep successful job data	168 (7 days)
`EEMT_FAILED_RETENTION_HOURS`	Hours to keep failed job data	12
`EEMT_ENABLE_AUTO_CLEANUP`	Enable automatic cleanup on schedule	false
`EEMT_CLEANUP_SCHEDULE`	Cron expression for auto cleanup	"0 2 * * *"
`EEMT_CLEANUP_LOG_LEVEL`	Logging verbosity (DEBUG, INFO, WARNING, ERROR)	INFO
`EEMT_CLEANUP_BATCH_SIZE`	Number of jobs to process per batch	100
`EEMT_CLEANUP_DRY_RUN`	Always run in dry-run mode	false

Configuration File¶

Create a configuration file for complex setups:

cleanup_config.yaml:

# Retention policies
retention:
  successful_jobs:
    hours: 168  # 7 days
    keep_metadata: true
    keep_logs: false
  failed_jobs:
    hours: 12
    keep_metadata: true
    keep_error_logs: true

# Cleanup schedule
schedule:
  enabled: true
  cron: "0 2 * * *"  # Daily at 2 AM

# Directories
paths:
  database: ./jobs.db
  uploads: ./uploads
  results: ./results
  logs: ./logs

# Performance
performance:
  batch_size: 100
  parallel_delete: true
  max_workers: 4

# Notifications (optional)
notifications:
  email:
    enabled: false
    smtp_server: smtp.gmail.com
    recipients:
      - admin@example.com
  slack:
    enabled: false
    webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL

Load the configuration in the cleanup script (requires the PyYAML package):

import yaml

with open('cleanup_config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Use configuration
success_retention = config['retention']['successful_jobs']['hours']
failed_retention = config['retention']['failed_jobs']['hours']

Monitoring and Logging¶

Log Files¶

Cleanup operations are logged to multiple locations:

Application Logs: logs/cleanup.log
System Logs: /var/log/syslog (when using cron)
Journal: journalctl (when using systemd)
Container Logs: docker logs eemt-web

Log Format¶

2024-01-20 02:00:01 INFO: Starting job cleanup process
2024-01-20 02:00:02 INFO: Found 45 jobs to check
2024-01-20 02:00:03 INFO: Cleaning job job-20240113-123456 (completed, 7.5 days old)
2024-01-20 02:00:05 INFO: Deleted uploads/job-20240113-123456_dem.tif (250 MB)
2024-01-20 02:00:08 INFO: Deleted results/job-20240113-123456/ (15 GB)
2024-01-20 02:00:09 INFO: Cleanup summary: 3 successful, 2 failed, 45.5 GB freed
2024-01-20 02:00:09 INFO: Cleanup completed successfully
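
Because the summary line has a fixed shape, it can be extracted for dashboards or alerts. The sketch below parses the "Cleanup summary" line exactly as it appears in the sample above; the names `SUMMARY_RE` and `parse_summary` are illustrative, and the pattern would need adjusting if your log format differs.

```python
import re
from typing import Optional

# Matches e.g. "2024-01-20 02:00:09 INFO: Cleanup summary: 3 successful, 2 failed, 45.5 GB freed"
SUMMARY_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO: Cleanup summary: "
    r"(?P<successful>\d+) successful, (?P<failed>\d+) failed, "
    r"(?P<freed>\d+(?:\.\d+)?) (?P<unit>[KMGT]?B) freed$"
)


def parse_summary(line: str) -> Optional[dict]:
    """Return the fields of a cleanup summary line, or None for any other line."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match["timestamp"],
        "successful": int(match["successful"]),
        "failed": int(match["failed"]),
        "freed": float(match["freed"]),
        "unit": match["unit"],
    }
```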

Monitoring Metrics¶

Track cleanup effectiveness with these metrics:

# Example monitoring script
import sqlite3
from datetime import datetime, timedelta

def get_cleanup_metrics(db_path="jobs.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Jobs cleaned in last 24 hours
    yesterday = datetime.now() - timedelta(hours=24)
    cursor.execute("""
        SELECT COUNT(*), SUM(data_size_mb) 
        FROM jobs 
        WHERE data_cleaned_at > ?
    """, (yesterday.isoformat(),))

    count, size_mb = cursor.fetchone()

    # Disk space usage
    cursor.execute("""
        SELECT 
            status,
            COUNT(*) as count,
            SUM(data_size_mb) as total_mb
        FROM jobs
        WHERE data_cleaned_at IS NULL
        GROUP BY status
    """)

    usage_by_status = cursor.fetchall()

    print("Cleanup Metrics (Last 24h):")
    print(f"  Jobs cleaned: {count or 0}")
    print(f"  Space freed: {(size_mb or 0) / 1024:.1f} GB")
    print("\nCurrent Usage:")
    for status, job_count, total_mb in usage_by_status:
        # SUM() returns NULL (None) when no sizes are recorded for a status
        print(f"  {status}: {job_count} jobs, {(total_mb or 0) / 1024:.1f} GB")

    conn.close()

# Run metrics check
get_cleanup_metrics()

Troubleshooting¶

Common Issues¶

1. Cleanup Not Running Automatically¶

Check cron job:

# List cron jobs
crontab -l

# Check cron service
systemctl status cron

# View cron logs
grep CRON /var/log/syslog

Check systemd timer:

# Timer status
systemctl --user status eemt-cleanup.timer

# List all timers
systemctl --user list-timers

# View service logs
journalctl --user -u eemt-cleanup.service -n 50

2. Permission Denied Errors¶

# Check file permissions
ls -la uploads/ results/

# Fix ownership (use the current user's own primary group)
sudo chown -R "$(id -un):$(id -gn)" uploads/ results/

# Fix permissions (capital X adds execute only to directories)
chmod -R u+rwX,go+rX uploads/ results/

3. Database Locked Error¶

# Check for running processes
fuser jobs.db

# Stop the blocking process if needed (try a normal SIGTERM before resorting to kill -9)
kill <PID>

# Verify database integrity
sqlite3 jobs.db "PRAGMA integrity_check;"

4. Disk Space Not Being Freed¶

# Check if files are actually deleted
du -sh uploads/ results/

# Check for deleted files that a process still holds open
lsof +L1 | grep results

# Force filesystem sync
sync

# Check available space
df -h .

5. Cleanup Taking Too Long¶

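Optimize cleanup performance. The flags below are not listed in the Command-Line Options table above, so confirm that your version of cleanup_jobs.py supports them before relying on them: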

# Use parallel deletion
python scripts/cleanup_jobs.py --parallel --workers 4

# Process in smaller batches
python scripts/cleanup_jobs.py --batch-size 50

# Skip large directories first
python scripts/cleanup_jobs.py --skip-large --size-threshold 10G

Debug Mode¶

Enable detailed debugging output:

# Set debug environment variable
export EEMT_CLEANUP_LOG_LEVEL=DEBUG

# Run with maximum verbosity
python scripts/cleanup_jobs.py --verbose --debug

# Debug output includes:
# - SQL queries executed
# - File operations attempted
# - Time taken for each operation
# - Memory usage statistics

Recovery Procedures¶

Restore Accidentally Deleted Data¶

If important data was deleted by cleanup:

Check backups (if configured):

# Restore from backup
rsync -av /backup/eemt/results/job-id/ ./results/job-id/

Look up the preserved job metadata in the database (the deleted files themselves cannot be recovered this way):

-- Job metadata is preserved
SELECT * FROM jobs WHERE job_id = 'job-20240120-123456';

Re-run job if necessary:

# Use preserved configuration
python scripts/rerun_job.py --job-id job-20240120-123456

Reset Cleanup State¶

Clear cleanup history and start fresh:

-- Reset cleanup timestamps
UPDATE jobs SET data_cleaned_at = NULL WHERE data_cleaned_at IS NOT NULL;

-- Clear cleanup log entries
DELETE FROM cleanup_log;

-- Vacuum database
VACUUM;

Best Practices¶

1. Regular Monitoring¶

Set up monitoring alerts:

#!/bin/bash
# Daily check script
SPACE_USED=$(du -sb results/ | cut -f1)  # -b (bytes) requires GNU du
MAX_SPACE=107374182400  # 100 GB

if [ "$SPACE_USED" -gt "$MAX_SPACE" ]; then
    echo "Warning: Results directory using $((SPACE_USED / 1073741824)) GB" | \
        mail -s "EEMT Cleanup Alert" admin@example.com
fi

2. Backup Important Results¶

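Before cleanup, back up critical data. Job status is stored in the database rather than in directory names, so select the jobs to back up by querying it:

# Back up results of completed jobs before cleanup
sqlite3 jobs.db "SELECT job_id FROM jobs WHERE status = 'completed';" | while read -r job_id; do
    [ -d "results/$job_id" ] && rsync -av "results/$job_id/" "/backup/eemt/results/$job_id/"
done

# Archive job result directories older than 7 days (GNU find and tar)
find results/ -mindepth 1 -maxdepth 1 -type d -mtime +7 -print0 | \
    tar -czf "archived_jobs_$(date +%Y%m).tar.gz" --null -T -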

3. Gradual Cleanup¶

Implement staged cleanup for safety:

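# Two-stage cleanup process (outline)
# mark_jobs_for_cleanup and delete_marked_jobs are placeholders for your own implementation
def staged_cleanup():
    # Stage 1: Mark jobs older than 7 days for deletion
    mark_jobs_for_cleanup(age_days=7)

    # Stage 2: Delete only jobs that were marked at least 24 hours ago
    delete_marked_jobs(marked_before=24)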
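
The two stages can be sketched against SQLite as follows. This is an illustration under stated assumptions, not the project's implementation: the `completed_at` and `marked_at` columns, the `completed` status value, and the function `jobs_ready_for_deletion` are hypothetical. Timestamps are stored as ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta


def mark_jobs_for_cleanup(conn: sqlite3.Connection, now: datetime, age_days: int = 7) -> int:
    """Stage 1: mark completed jobs older than age_days; return how many were marked."""
    cutoff = (now - timedelta(days=age_days)).isoformat(timespec="seconds")
    cursor = conn.execute(
        "UPDATE jobs SET marked_at = ? "
        "WHERE status = 'completed' AND marked_at IS NULL AND completed_at < ?",
        (now.isoformat(timespec="seconds"), cutoff),
    )
    return cursor.rowcount


def jobs_ready_for_deletion(conn: sqlite3.Connection, now: datetime,
                            grace_hours: int = 24) -> list:
    """Stage 2: list jobs that were marked at least grace_hours ago."""
    cutoff = (now - timedelta(hours=grace_hours)).isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT job_id FROM jobs WHERE marked_at IS NOT NULL AND marked_at <= ? "
        "ORDER BY job_id",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]
```

The grace period gives operators a window to notice a marked job and clear its `marked_at` value before any files are removed.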

4. Audit Trail¶

Maintain cleanup audit logs:

-- Create audit table
CREATE TABLE cleanup_audit (
    id INTEGER PRIMARY KEY,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    job_id TEXT,
    action TEXT,
    space_freed_mb INTEGER,
    user TEXT
);

-- Log cleanup actions
INSERT INTO cleanup_audit (job_id, action, space_freed_mb, user)
VALUES ('job-123', 'deleted_results', 15360, 'auto_cleanup');
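
From Python, the same audit row can be written with a parameterized query. The sketch below assumes the `cleanup_audit` table defined above already exists; the helper name `log_cleanup_action` is hypothetical.

```python
import sqlite3


def log_cleanup_action(conn: sqlite3.Connection, job_id: str, action: str,
                       space_freed_mb: int, user: str = "auto_cleanup") -> None:
    """Insert one row into cleanup_audit; the timestamp column uses its default."""
    # "with conn" commits on success and rolls back if the insert fails.
    with conn:
        conn.execute(
            "INSERT INTO cleanup_audit (job_id, action, space_freed_mb, user) "
            "VALUES (?, ?, ?, ?)",
            (job_id, action, space_freed_mb, user),
        )
```

Using placeholders instead of string formatting keeps job IDs and user names from being interpreted as SQL.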

5. Capacity Planning¶

Monitor growth trends:

import matplotlib.pyplot as plt

def plot_storage_trends():
    # Get historical data
    # (get_storage_history is a placeholder: supply your own function that
    # returns a list of dates and a matching list of usage values in GB)
    dates, usage = get_storage_history()

    # Plot trend
    plt.figure(figsize=(10, 6))
    plt.plot(dates, usage)
    plt.xlabel('Date')
    plt.ylabel('Storage (GB)')
    plt.title('EEMT Storage Usage Trend')
    plt.grid(True)
    plt.savefig('storage_trend.png')

Integration Examples¶

Slack Notifications¶

Send cleanup summaries to Slack:

import requests

def send_slack_notification(webhook_url, message):
    payload = {
        "text": "EEMT Cleanup Report",
        "attachments": [{
            "color": "good",
            "fields": [
                {"title": "Jobs Cleaned", "value": str(message['jobs_cleaned']), "short": True},
                {"title": "Space Freed", "value": str(message['space_freed']), "short": True},
                {"title": "Status", "value": "Success", "short": True}
            ]
        }]
    }
    # Fail loudly if Slack rejects the message instead of ignoring the response
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()

Email Reports¶

Send daily cleanup reports:

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

def send_cleanup_report(recipient, stats):
    msg = MIMEMultipart()
    msg['Subject'] = 'EEMT Daily Cleanup Report'
    msg['From'] = 'eemt@example.com'
    msg['To'] = recipient

    body = f"""
    Daily Cleanup Report
    ====================

    Jobs Cleaned: {stats['jobs_cleaned']}
    Space Freed: {stats['space_freed']}

    Current Storage Usage: {stats['current_usage']}
    Available Space: {stats['available_space']}

    Next cleanup scheduled for: {stats['next_run']}
    """

    msg.attach(MIMEText(body, 'plain'))

    with smtplib.SMTP('localhost') as server:
        server.send_message(msg)

Prometheus Metrics¶

Export cleanup metrics for monitoring:

from prometheus_client import Counter, Gauge, Histogram

# Define metrics
cleanup_runs = Counter('eemt_cleanup_runs_total', 'Total cleanup runs')
jobs_cleaned = Counter('eemt_jobs_cleaned_total', 'Total jobs cleaned', ['status'])
space_freed = Counter('eemt_space_freed_bytes_total', 'Total space freed')
cleanup_duration = Histogram('eemt_cleanup_duration_seconds', 'Cleanup duration')
current_usage = Gauge('eemt_storage_usage_bytes', 'Current storage usage')

# Update metrics during cleanup
@cleanup_duration.time()
def run_cleanup():
    cleanup_runs.inc()
    # ... cleanup logic ...
    jobs_cleaned.labels(status='completed').inc(3)
    space_freed.inc(45 * 1024 * 1024 * 1024)  # 45 GB
    current_usage.set(get_current_usage())

Security Considerations¶

Access Control¶

Restrict cleanup operations:

from functools import wraps
from flask import Flask, request, abort

app = Flask(__name__)

def require_admin(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        auth = request.headers.get('Authorization')
        # verify_admin_token is a placeholder for your own token check
        if not verify_admin_token(auth):
            abort(403)
        return f(*args, **kwargs)
    return decorated_function

@app.route('/api/cleanup', methods=['POST'])
@require_admin
def api_cleanup():
    # Only admins can trigger cleanup
    return run_cleanup()

Audit Logging¶

Track all cleanup operations:

import logging
import json
from datetime import datetime

from flask import has_request_context, request

class AuditLogger:
    def __init__(self, log_file='cleanup_audit.log'):
        self.logger = logging.getLogger('cleanup_audit')
        # Attach the file handler only once, even if AuditLogger is created repeatedly
        if not self.logger.handlers:
            handler = logging.FileHandler(log_file)
            formatter = logging.Formatter(
                '%(asctime)s - %(levelname)s - %(message)s'
            )
            handler.setFormatter(formatter)
            self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_cleanup(self, user, action, details):
        audit_entry = {
            'timestamp': datetime.now().isoformat(),
            'user': user,
            'action': action,
            'details': details,
            # There is no client address when cleanup runs from cron or the CLI
            'ip_address': request.remote_addr if has_request_context() else None
        }
        self.logger.info(json.dumps(audit_entry))

Safe Deletion¶

Implement safeguards against accidental deletion:

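import shutil
from pathlib import Path

ALLOWED_DIRS = [Path('/app/uploads'), Path('/app/results')]

def safe_delete(path, job_id):
    """Safely delete files with verification"""

    # Resolve symlinks and ".." before checking, so that paths such as
    # /app/uploads/../etc or /app/uploads_old do not pass the check
    path = Path(path).resolve()
    if not any(root.resolve() in path.parents for root in ALLOWED_DIRS):
        raise ValueError(f"Attempted to delete outside allowed directories: {path}")

    # Verify job ownership (verify_job_ownership is application-specific)
    if not verify_job_ownership(job_id, path):
        raise ValueError(f"Path does not belong to job {job_id}")

    # Create backup before deletion (optional; ENABLE_CLEANUP_BACKUP is a
    # module-level setting you define)
    if ENABLE_CLEANUP_BACKUP:
        backup_path = Path('/backup') / job_id / path.name
        backup_path.parent.mkdir(parents=True, exist_ok=True)
        if path.is_dir():
            shutil.copytree(path, backup_path, dirs_exist_ok=True)
        else:
            shutil.copy2(path, backup_path)

    # Perform deletion
    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()

    return True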

Summary¶

The EEMT job cleanup system automates the removal of old job data to keep disk usage under control while preserving job metadata and configurations. Key features include:

Configurable retention policies for different job states
Multiple scheduling options (cron, systemd, manual)
Docker and Kubernetes integration for containerized deployments
Comprehensive logging and monitoring capabilities
Safety features including dry-run mode and audit trails
REST API for programmatic control

With retention periods tuned to your workload, disk usage stays bounded even at high job volumes, and the preserved metadata and configurations let you audit past jobs or re-run them after their output has been removed.