Job Data Cleanup System¶
The EEMT infrastructure includes an automated job data cleanup system designed to manage disk space efficiently while preserving important job metadata for auditing and analysis.
Overview¶
The cleanup system provides intelligent data lifecycle management with the following key features:
- Automated retention policies - Different policies for successful vs failed jobs
- Selective data deletion - Preserves job configurations while removing output data
- Configurable scheduling - Cron, systemd, or manual execution options
- Container integration - Works seamlessly with Docker and Kubernetes deployments
- API endpoints - RESTful interface for programmatic cleanup control
Quick Start
For most deployments, the default configuration works out of the box:
System Architecture¶
graph TB
subgraph "Cleanup Components"
JCM[JobCleanupManager<br/>Core Python Class]
CS[cleanup_jobs.py<br/>CLI Script]
API[FastAPI Endpoint<br/>/api/cleanup]
CRON[Cron/Systemd<br/>Schedulers]
end
subgraph "Data Storage"
DB[(SQLite Database<br/>jobs.db)]
FS[File System<br/>uploads/ & results/]
end
subgraph "Cleanup Process"
SCAN[Scan Eligible Jobs]
EVAL[Evaluate Retention]
DELETE[Delete Data]
UPDATE[Update Records]
end
CS --> JCM
API --> JCM
CRON --> CS
JCM --> SCAN
SCAN --> DB
SCAN --> EVAL
EVAL --> DELETE
DELETE --> FS
DELETE --> UPDATE
UPDATE --> DB
Retention Policies¶
Default Retention Rules¶
| Job Status | Data Component | Default Retention | Action After Retention |
|---|---|---|---|
| Successful | Output data (results/) |
7 days | Delete files, preserve job record |
| Successful | Input DEM (uploads/) |
7 days | Delete file if exists |
| Successful | Job configuration | Forever | Preserved in database |
| Successful | Execution logs | 7 days | Delete with output data |
| Failed | All data | 12 hours | Complete deletion |
| Failed | Error information | Forever | Preserved in job record |
| Running | All data | N/A | Never auto-deleted |
| Pending | All data | N/A | Never auto-deleted |
Rationale¶
The retention policies are designed based on typical usage patterns:
- Successful jobs (7 days): Provides sufficient time for users to download results while managing disk space
- Failed jobs (12 hours): Quick cleanup since failed job data is rarely needed after debugging
- Configuration preservation: Maintains audit trail and enables job re-execution if needed
- Running/Pending protection: Ensures active jobs are never interrupted
Implementation Details¶
Core Components¶
JobCleanupManager Class¶
The JobCleanupManager class in cleanup_jobs.py provides the core cleanup functionality:
class JobCleanupManager:
"""Manages automated cleanup of EEMT job data"""
def __init__(self, base_dir: Path, dry_run: bool = False):
self.base_dir = base_dir
self.dry_run = dry_run
self.db_path = base_dir / "jobs.db"
self.results_dir = base_dir / "results"
self.uploads_dir = base_dir / "uploads"
# Configurable retention periods
self.success_retention_days = int(
os.getenv('EEMT_SUCCESS_RETENTION_DAYS', '7')
)
self.failed_retention_hours = int(
os.getenv('EEMT_FAILED_RETENTION_HOURS', '12')
)
Key methods:
get_jobs_for_cleanup(): Queries database for eligible jobs based on age and statuscleanup_job_data(): Performs actual deletion of files and directoriesrun_cleanup(): Orchestrates the complete cleanup process_get_directory_size_mb(): Calculates space to be freed
Database Schema¶
The cleanup system adds tracking fields to the jobs table:
-- Additional fields for cleanup tracking
ALTER TABLE jobs ADD COLUMN completed_at TIMESTAMP;
ALTER TABLE jobs ADD COLUMN data_cleaned_at TIMESTAMP;
ALTER TABLE jobs ADD COLUMN cleanup_summary TEXT;
File System Operations¶
The cleanup process handles various file types:
- Results directory (
results/<job_id>/): - Contains all workflow outputs (GeoTIFF files, logs, metadata)
- Deleted recursively for eligible jobs
-
Size calculated before deletion for reporting
-
Upload directory (
uploads/<filename>): - Contains original DEM files
- Deleted if associated with cleaned job
-
Filename stored in job record for reference
-
Temporary files (
temp/<job_id>/): - Working directories for active jobs
- Cleaned immediately after job completion
- Not subject to retention policies
SQL Queries¶
The system uses optimized queries to identify cleanup candidates:
-- Find successful jobs older than retention period
SELECT id, workflow_type, status, completed_at, dem_filename, parameters
FROM jobs
WHERE status = 'completed'
AND completed_at IS NOT NULL
AND datetime(completed_at) < datetime('now', '-7 days')
ORDER BY completed_at ASC;
-- Find failed jobs older than retention period
SELECT id, workflow_type, status, completed_at, dem_filename, parameters
FROM jobs
WHERE status = 'failed'
AND completed_at IS NOT NULL
AND datetime(completed_at) < datetime('now', '-12 hours')
ORDER BY completed_at ASC;
Configuration Options¶
Environment Variables¶
Configure cleanup behavior through environment variables:
# Retention periods
export EEMT_SUCCESS_RETENTION_DAYS=7 # Days to keep successful job data
export EEMT_FAILED_RETENTION_HOURS=12 # Hours to keep failed job data
# Operational settings
export EEMT_DRY_RUN=true # Preview mode without deletion
export EEMT_CLEANUP_LOG_LEVEL=INFO # Logging verbosity
export EEMT_CLEANUP_BATCH_SIZE=100 # Jobs per cleanup batch
# Directory paths (if non-standard)
export EEMT_BASE_DIR=/path/to/web-interface
export EEMT_RESULTS_DIR=/path/to/results
export EEMT_UPLOADS_DIR=/path/to/uploads
Command-Line Arguments¶
Override defaults via command-line:
python cleanup_jobs.py \
--success-retention-days 14 \
--failed-retention-hours 24 \
--base-dir /custom/path \
--dry-run \
--verbose
Configuration File (Advanced)¶
For complex deployments, use a YAML configuration:
# cleanup_config.yaml
retention:
successful_jobs:
days: 7
keep_metadata: true
keep_logs: false
failed_jobs:
hours: 12
keep_error_logs: true
performance:
batch_size: 100
parallel_delete: true
max_workers: 4
notifications:
enabled: true
email_on_error: admin@example.com
summary_webhook: https://hooks.slack.com/services/XXX
Scheduling Options¶
Cron Integration¶
The provided setup_cleanup_cron.sh script automates cron configuration:
# User-level cron (recommended)
./setup_cleanup_cron.sh --user --method cron
# System-wide cron (requires sudo)
sudo ./setup_cleanup_cron.sh --system --method cron
This creates a cron entry:
# Daily at 2:00 AM
0 2 * * * cd /path/to/web-interface && python3 cleanup_jobs.py >> cleanup_jobs.log 2>&1
Systemd Timer¶
For systemd-based systems:
# Install systemd timer
./setup_cleanup_cron.sh --user --method systemd
# Check status
systemctl --user status eemt-cleanup.timer
# View logs
journalctl --user -u eemt-cleanup.service
Manual Execution¶
Run cleanup on-demand:
# Production cleanup
python cleanup_jobs.py
# Preview mode (no deletion)
python cleanup_jobs.py --dry-run
# Verbose output
python cleanup_jobs.py --verbose
API Integration¶
RESTful Endpoint¶
The FastAPI application exposes a cleanup endpoint:
@app.post("/api/cleanup")
async def trigger_cleanup(
dry_run: bool = False,
success_retention_days: int = 7,
failed_retention_hours: int = 12
):
"""Trigger job data cleanup via API"""
cleanup_manager = JobCleanupManager(
base_dir=Path("./"),
dry_run=dry_run
)
# Override retention if specified
if success_retention_days:
cleanup_manager.success_retention_days = success_retention_days
if failed_retention_hours:
cleanup_manager.failed_retention_hours = failed_retention_hours
# Run cleanup
summary = cleanup_manager.run_cleanup()
return {
"success": True,
"summary": summary,
"timestamp": datetime.now().isoformat()
}
API Usage Examples¶
# Trigger cleanup with defaults
curl -X POST http://localhost:5000/api/cleanup
# Dry run to preview
curl -X POST http://localhost:5000/api/cleanup?dry_run=true
# Custom retention periods
curl -X POST http://localhost:5000/api/cleanup \
-H "Content-Type: application/json" \
-d '{
"success_retention_days": 3,
"failed_retention_hours": 6
}'
Logging and Monitoring¶
Log Output¶
The cleanup system provides detailed logging:
2024-01-20 02:00:01 INFO - Cleanup manager initialized:
2024-01-20 02:00:01 INFO - Base directory: /home/user/eemt/web-interface
2024-01-20 02:00:01 INFO - Success retention: 7 days
2024-01-20 02:00:01 INFO - Failed retention: 12 hours
2024-01-20 02:00:01 INFO - Starting EEMT job data cleanup process
2024-01-20 02:00:02 INFO - Found 3 successful jobs for data cleanup
2024-01-20 02:00:02 INFO - Found 2 failed jobs for complete deletion
2024-01-20 02:00:03 INFO - Processing job job-20240113-123456 (completed, 2024-01-13 14:23:45)
2024-01-20 02:00:04 INFO - Deleted results directory: results/job-20240113-123456 (15234.5 MB)
2024-01-20 02:00:04 INFO - Deleted uploaded DEM: uploads/job-20240113-123456_dem.tif (250.3 MB)
2024-01-20 02:00:05 INFO - === CLEANUP SUMMARY ===
2024-01-20 02:00:05 INFO - Successful jobs processed: 3
2024-01-20 02:00:05 INFO - Failed jobs processed: 2
2024-01-20 02:00:05 INFO - Total disk space freed: 45678.9 MB
2024-01-20 02:00:05 INFO - Job configs preserved: 3
2024-01-20 02:00:05 INFO - Job configs deleted: 2
2024-01-20 02:00:05 INFO - Cleanup completed successfully
Summary Reports¶
Each cleanup run generates a JSON summary:
{
"start_time": "2024-01-20T02:00:01.123456",
"end_time": "2024-01-20T02:00:05.789012",
"dry_run": false,
"successful_jobs_processed": 3,
"failed_jobs_processed": 2,
"total_size_freed_mb": 45678.9,
"configs_preserved": 3,
"configs_deleted": 2,
"errors": [],
"job_details": [
{
"job_id": "job-20240113-123456",
"status": "completed",
"completed_at": "2024-01-13T14:23:45",
"data_deleted": true,
"config_preserved": true,
"size_freed_mb": 15484.8,
"files_deleted": ["uploads/job-20240113-123456_dem.tif"],
"directories_deleted": ["results/job-20240113-123456"]
}
]
}
Monitoring Metrics¶
Track cleanup effectiveness:
# Example monitoring query
SELECT
DATE(completed_at) as date,
COUNT(*) as jobs_completed,
COUNT(CASE WHEN data_cleaned_at IS NOT NULL THEN 1 END) as jobs_cleaned,
SUM(CASE WHEN data_cleaned_at IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as cleanup_rate
FROM jobs
WHERE completed_at >= date('now', '-30 days')
GROUP BY DATE(completed_at)
ORDER BY date DESC;
Docker Integration¶
Container Configuration¶
The cleanup system works seamlessly with Docker:
# docker-compose.yml
services:
eemt-web:
image: eemt-web:latest
environment:
- EEMT_SUCCESS_RETENTION_DAYS=7
- EEMT_FAILED_RETENTION_HOURS=12
- EEMT_ENABLE_AUTO_CLEANUP=true
volumes:
- ./uploads:/app/uploads
- ./results:/app/results
- ./jobs.db:/app/jobs.db
command: >
sh -c "
python app.py &
while true; do
sleep 86400
python cleanup_jobs.py
done
"
Container Execution¶
Run cleanup within containers:
# Execute in running container
docker exec eemt-web python cleanup_jobs.py
# Run in dedicated container
docker run --rm \
-v $(pwd)/uploads:/app/uploads \
-v $(pwd)/results:/app/results \
-v $(pwd)/jobs.db:/app/jobs.db \
eemt-web python cleanup_jobs.py --dry-run
Security Considerations¶
Access Control¶
The cleanup system implements several security measures:
- Path validation: Ensures deletions only occur within designated directories
- Job ownership verification: Confirms files belong to the job being cleaned
- Database integrity: Uses transactions to prevent partial updates
- Audit logging: Tracks all cleanup operations for accountability
Safe Deletion¶
The cleanup process includes safeguards:
def cleanup_job_data(self, job: Dict, keep_job_config: bool = True):
"""Clean up job data with safety checks"""
# Validate job ID format
if not re.match(r'^job-\d{8}-\d{6}$', job['id']):
raise ValueError(f"Invalid job ID format: {job['id']}")
# Verify paths are within allowed directories
results_path = self.results_dir / job['id']
if not str(results_path).startswith(str(self.results_dir)):
raise ValueError(f"Results path outside allowed directory")
# Perform deletion with error handling
try:
if results_path.exists():
shutil.rmtree(results_path)
except Exception as e:
logger.error(f"Failed to delete {results_path}: {e}")
raise
Troubleshooting¶
Common Issues¶
- Permission Denied
- Check file ownership:
ls -la uploads/ results/ -
Fix permissions:
chmod -R 755 uploads/ results/ -
Database Locked
- Check for concurrent access:
fuser jobs.db -
Wait for operations to complete or restart services
-
Disk Space Not Freed
- Verify deletion:
du -sh results/job-* - Check for open file handles:
lsof | grep results -
Force filesystem sync:
sync -
Cleanup Not Running
- Check cron:
crontab -landgrep CRON /var/log/syslog - Check systemd:
systemctl --user status eemt-cleanup.timer
Debug Mode¶
Enable detailed debugging:
# Maximum verbosity
export EEMT_CLEANUP_LOG_LEVEL=DEBUG
python cleanup_jobs.py --verbose --dry-run
# Debug output includes:
# - SQL queries executed
# - Files and directories examined
# - Size calculations
# - Decision logic for each job
Best Practices¶
- Test with dry-run first: Always preview cleanup actions before production use
- Monitor disk usage: Set up alerts for high disk usage
- Backup critical data: Archive important results before retention period expires
- Regular monitoring: Check cleanup logs and summaries weekly
- Adjust retention as needed: Modify periods based on usage patterns
- Document changes: Log any modifications to retention policies
Summary¶
The EEMT job data cleanup system provides automated, configurable management of job data lifecycle. It balances disk space efficiency with data preservation needs, ensuring system performance while maintaining important job metadata for analysis and auditing. The flexible architecture supports various deployment scenarios from single-node installations to distributed container orchestrations.