The Landscape
This work is licensed under a Creative Commons Attribution 4.0 International License.
A Glance at the Generative AI Landscape
Image Credit: Yang et al. (While this image depicts the state of LLMs in 2023, it effectively illustrates the foundational models and their evolution)
The field of Generative AI is rapidly evolving.
This section provides a snapshot of some of the most influential models and platforms as of 2025.
HuggingFace Arena LLM Leaderboard
Table: Prices of Services (last checked 10/2025)
LLM Service | Plan | Price (per month) | Details |
---|---|---|---|
Anthropic Claude | Free | $0 | Basic Claude access with limited daily use |
Pro | $20 | More usage, Claude Code terminal access, unlimited projects, Research access | |
Max | $100 | Priority access, substantially higher usage, enhanced features | |
Max Pro | $200 | Highest tier with maximum usage limits and priority access to newest models | |
Team | $30/month or $25/month (annual) | Central billing, administration, collaboration features (minimum 5 members) | |
Enterprise | Contact Sales | Enhanced context window, SSO, role-based access, audit logs, compliance API | |
Claude API | Pay-As-You-Go | Varies | Claude Sonnet 4.5: $3/1M input, $15/1M output (200K context); Claude Opus 4.1: $15/1M input, $75/1M output; Claude Haiku 3.5: $0.80/1M input, $4/1M output; Batch processing: 50% discount; Prompt caching available (75-90% savings) |
Claude Code | Included in Pro+ | $20+ | Terminal-based AI coding assistant included with Pro, Max, Max Pro subscriptions; Web Search: $10/1,000 searches; Code Execution: $0.05/hour per container |
Google AI | Free | $0 | Unlimited Gemini 2.5 Flash, limited Gemini 2.5 Pro, 32K context window |
Google AI Pro | $19.99 | Expanded access to Gemini 2.5 Pro (100 queries/day), 1M context window, 2TB storage, NotebookLM; free for university students for 1 year | |
Google AI Ultra | $249.99 | Highest access to Gemini 2.5 Pro, exclusive access to Gemini 2.5 Deep Think, Veo 3 video generation, YouTube Premium, 30TB storage | |
Gemini API | Pay-As-You-Go | Varies | Gemini 2.5 Flash: $0.30/1M input, $2.50/1M output; Gemini 2.5 Pro: $1.25/1M input (≤200K), $10/1M output (≤200K); Gemini 2.5 Flash-Lite: $0.10/1M input, $0.40/1M output; Batch processing: 50% discount |
OpenAI ChatGPT | Free | $0 | Limited access to GPT-5 (10 messages every 5 hours), then GPT-5-mini |
Plus | $20 | Higher message limits to GPT-5, unlimited GPT-5-mini, access to o3-mini, o1 models | |
Pro | $200 | Unlimited GPT-5 access, GPT-5 Pro with advanced reasoning, extended context windows | |
Team | $25/user (annual) or $30/user (monthly) | All Plus features with higher message caps, team workspace, data excluded from training | |
Enterprise | Contact Sales | Unlimited high-speed models, extended context windows, enterprise security | |
OpenAI API | Pay-As-You-Go | Varies | GPT-5: $1.25/1M input, $10/1M output (272K-400K context); GPT-5-mini: $0.25/1M input, $2/1M output; GPT-5-nano: $0.05/1M input, $0.40/1M output; GPT-4o: $2.50/1M input, $10/1M output; o3-mini: $1.10/1M input, $4.40/1M output |
Perplexity AI | Free | $0 | Unlimited quick searches, 5 Pro searches/day, 5 follow-up questions every 4 hours |
Pro | $20/month or $200/year | 300+ Pro searches/day, access to advanced AI models, file uploads | |
Education Pro | $4.99/month | All Pro features with student/faculty verification; 1 month free trial | |
Max | $200/month or $2,000/year | Unlimited Labs usage, access to top-tier models (OpenAI o3-pro, Claude Opus 4) | |
Enterprise Pro | $40/user/month or $400/user/year | Admin tools, collaboration features, domain verification, SCIM provisioning | |
Microsoft Copilot | Free | $0 | GPT-4o-powered chat, 15 image generation boosts/day |
Microsoft 365 Premium | Premium | $19.99 | Full M365 suite + Copilot in all apps, 1TB storage, extended AI usage limits, 40 image generations |
Microsoft 365 Copilot | Business/Enterprise | $30/user | AI in Word, Excel, PowerPoint, Outlook, Teams. Requires existing M365 license ($12.50-$57/user) |
Consumption-based | $0.01 per message | Pay-per-use alternative to monthly subscription (30 messages for proprietary files, 25 per agent action) | |
GitHub Copilot | Free | $0 | Up to 2,000 code completions/month, 50 premium requests/month Free for students, teachers, open source maintainers |
Pro | $10/month or $100/year | Unlimited code completions, 300 premium requests/month, access to Copilot coding agent | |
Pro+ | $39/month or $390/year | 1,500 premium requests/month, full access to all models, GitHub Spark, compute resources | |
Business | $19/user/month | 300 premium requests/user, user management, usage metrics, team collaboration | |
Enterprise | $39/user/month | 1,000 premium requests/user, all AI models, advanced customization, enterprise features | |
Mistral AI | Le Chat Free | $0 | Basic AI assistant with limited messages |
Le Chat Pro | $14.99 | Up to 6x more messages, 150 flash answers/day, 5x web searches, 1,000 memories, 15GB libraries | |
Le Chat Team | $24.99/user or $299.88/user/year | 200 flash answers/day, 30GB libraries/user, domain verification, SCIM provisioning | |
La Plateforme API | Varies | Mistral Medium 3: $0.40/1M input, $2.00/1M output; Mistral Nemo: $0.30/1M tokens; Mistral Large 2: $3/1M input, $9/1M output; Codestral: $1/1M input, $3/1M output | |
Cohere | Free Trial | $0 | Limited API calls for testing |
Production | Varies | Command R 03-2024: $0.50/1M input, $1.50/1M output; Command R+ 08-2024: $2.50/1M input, $10/1M output; Command-light: $0.30/1M input, $0.60/1M output; Aya Expanse (8B & 32B): $0.50/1M input, $1.50/1M output | |
Education Program | Contact | AI access for students and educators (pricing not publicly disclosed) | |
DeepSeek | DeepSeek Chat API | Pay-As-You-Go | DeepSeek Chat: $0.57/1M input, $1.68/1M output; DeepSeek Reasoner (R1): $0.57/1M input, $1.68/1M output; 128K context window; ~200x cheaper than GPT-4 Turbo; ⚠️ NOT ALLOWED for US-based researchers - See restrictions below |
Open Source | Free | Free to download and deploy locally. Training cost: $294,000 (peer-reviewed in Nature); ⚠️ Self-hosted use requires institutional IT/security approval | |
Qwen (Alibaba) | Qwen Chat | Free | Free web interface powered by Qwen-Max; ⚠️ NOT RECOMMENDED for US-based researchers - Chinese company, data sovereignty concerns |
Qwen API | Pay-As-You-Go | Qwen-Flash: $0.05/1M input, $0.40/1M output; Qwen3-Coder: $0.22/1M input, $0.95/1M output; Qwen-Max: $1.60/1M input, $6.40/1M output; 1M context window, 90-day free trial (1M tokens) | |
Open Source | Free | Apache 2.0 license, 40M+ downloads. Sizes: 0.6B-235B parameters; ⚠️ Self-hosted use requires institutional IT/security approval | |
Midjourney | Basic | $10 | ~200 image generations/month |
Standard | $30 | 15 hrs fast GPU time, unlimited relaxed | |
Pro | $60 | 30 hrs fast GPU time, stealth mode | |
Mega | $120 | 60 hrs fast GPU time, stealth mode | |
DALL-E 3 | Via ChatGPT Plus | Included | Image generation within ChatGPT |
API | Varies | Standard: $0.040/image, HD: $0.080/image | |
Stable Diffusion | DreamStudio | $10 | 1000 credits (~5000 images) |
API | Varies | $0.002 per image (512x512) | |
Grok by xAI | X Premium | $8 | Access via X (Twitter) Premium |
X Premium+ | $16 | Priority access, higher limits | |
Character AI | Free | $0 | Limited features and queue priority |
c.ai+ | $9.99 | Priority access, faster responses, exclusive features | |
Together AI | Serverless Inference | Pay-As-You-Go | Text & Vision Models: $0.02-$3.50/1M tokens; Image Models: $0.0027-$0.08/megapixel; Embedding Models: $0.01-$0.08/1M tokens |
GPU Clusters | Pay-As-You-Go | Instant Clusters: $1.76-$5.50/GPU hour; Reserved Clusters: Starting at $1.30/GPU hour | |
Fine-Tuning | Pay-As-You-Go | LoRA Fine-Tuning (≤16B params): Starting at $0.48; Full Fine-Tuning (70-100B params): Up to $3.20 | |
Groq | Free Tier | $0 | Available for getting started |
Developer Tier | Pay-As-You-Go | Up to 10x more rate limits than free tier. Batch Processing: 50% cost discount (through April 2025) | |
Enterprise | Contact Sales | Custom solutions for large organizations | |
Replicate | Pay-As-You-Go | Varies | CPU: $0.36/hour; Nvidia T4 GPU: $0.81/hour (public), $1.98/hour (private); 8x H100 GPU: $43.92/hour; run open-source models with per-second billing |
Hugging Face | Free | $0 | Community models and datasets |
Pro | $9 | Advanced features, private repos | |
Enterprise | Contact Sales | Dedicated support, SLAs, security features | |
Amazon Bedrock | On-Demand | Varies | Access to Claude, Llama 2, Stable Diffusion, and more |
Google Vertex AI | On-Demand | Varies | 130+ foundation models including Gemini, Claude, Llama |
Azure AI Foundry (formerly Azure AI Studio) | On-Demand | Varies | Access to GPT-4, Claude, Llama, Mistral, and more |
Meta Llama | Open Source | Free | Llama 2, Llama 3, and Llama 4 (Scout and Maverick) models for download |
Ollama | Local Install | Free | Run LLMs locally on your hardware |
LM Studio | Local Install | Free | Desktop app for running LLMs locally |
Jan.ai | Local Install | Free | Open-source ChatGPT alternative, runs locally |
Continue.dev | Open Source | Free | Open-source autopilot for VS Code and JetBrains |
Poe by Quora | Monthly | $19.99 | Access to various chatbots including GPT-4, Claude |
Yearly | $199.99 | Annual subscription with all chatbot access | |
You.com | YouPro | $20 | Latest AI models, personalized AI with memory |
Jasper AI | Creator | $49 | Writing assistant with templates |
Teams | $125 | Advanced features for small teams | |
Business | Contact Sales | Custom pricing for organizations | |
Replit AI | Core | $20 | AI coding assistant integrated in Replit IDE |
Agentic Browsers (AI-Powered Web Browsers)
Browser | Plan | Price (per month) | Details |
---|---|---|---|
Perplexity Comet | Free | $0 | AI-powered browser with sidecar assistant, Perplexity AI search, tab management, content summarization |
Perplexity Max | $200 | Background Assistant for multi-tasking, autonomous task execution (booking flights, sending emails), mission control dashboard | |
Dia Browser | Free Beta | $0 (Invite-only) | AI-first browser, URL bar = AI chat, tab conversations, Skills system, browsing history context (opt-in); macOS 14+ M1+ only |
Dia Pro | $20 | Unlimited AI chat and Skills, multi-step reasoning, task automation; acquired by Atlassian ($610M) | |
Fellou | Free | $0 | 1,000 Sparks (~4 tasks), Deep Search, autonomous web actions, Shadow Workspace for background tasks |
Plus | $19 | 2,000 Sparks (~8 tasks), 3 scheduled tasks, priority support | |
Pro | $39.90 | 5,000 Sparks (~20 tasks), 5 scheduled tasks, Image/Code/Music agents | |
Ultra | $199.90 | Unlimited Sparks, unlimited scheduled/concurrent tasks, exclusive support | |
Opera Neon | Subscription | $19.99 (Waitlist) | Neon Do (autonomous browsing), Neon Make (AI creation), Cards system, Tasks workspaces, local processing |
Genspark AI Browser | Free | $0 | 100 credits daily, Super Agent Everywhere, Autopilot Mode, 700+ MCP tool integrations |
Plus | $24.99 | 10,000 credits monthly, priority AI agent access, top-tier models, AI Slides/Sheets/Docs | |
Pro | $249.99 | 125,000 credits monthly, full Super Agent access, phone calls, video generation | |
Microsoft Edge Copilot Mode | Free (Experimental) | $0 | Cross-tab awareness, task automation, in-page assistance, browser history/credentials access; Windows/Mac, opt-in |
Opera One + Aria | Free | $0 | Free AI assistant, real-time web access, page context mode, image generation, tab commands, local AI models; no account required |
Brave + Leo AI | Free | $0 | Privacy-first AI, Llama 3.1 8B, Mixtral, Claude Haiku, Qwen, content awareness, zero data retention |
Leo Premium | Varies | Claude Sonnet 4, DeepSeek R1 reasoning models, Bring Your Own Model (BYOM) |
Notes on Agentic Browsers:
- True Agentic Capabilities: Comet, Fellou, Opera Neon, Dia, and Genspark can autonomously perform multi-step tasks (booking, purchasing, form filling)
- AI-Enhanced: Microsoft Edge Copilot Mode, Opera One, and Brave Leo provide AI assistance but with less autonomous action
- Platform Availability: Most are Chromium-based; Dia is macOS only (M1+); the others support Windows/Mac/Linux
- Privacy Considerations: Check each browser's data policies - some use cloud AI, others offer local processing
- Coming Soon: OpenAI browser expected late 2025 with ChatGPT integration and Operator agent
Notes:
- Token pricing for API access can be complex. Refer to each provider's pricing page for the most accurate and up-to-date details.
- "Contact Sales" typically indicates that pricing is customized based on usage, features, and the specific needs of the customer.
- Many services offer free trials or limited free tiers, allowing you to test them out before committing to a paid plan.
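As a rough illustration of how the per-token API prices in the table above turn into a bill, the sketch below multiplies token counts by per-million-token rates and applies an optional batch discount. The function name `estimate_cost` and the workload size are hypothetical; the rates are the Claude Sonnet 4.5 figures from the table and may have changed since it was last checked.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m,
                  batch_discount=0.0):
    """Estimate the USD cost of one API workload.

    Prices are in USD per 1M tokens, as quoted in the table.
    batch_discount is a fraction (0.5 means 50% off).
    """
    if not 0.0 <= batch_discount <= 1.0:
        raise ValueError("batch_discount must be between 0 and 1")
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return (input_cost + output_cost) * (1.0 - batch_discount)


# Hypothetical workload: 200K input tokens and 50K output tokens,
# priced at the Claude Sonnet 4.5 rates from the table ($3 / $15).
standard = estimate_cost(200_000, 50_000, 3.00, 15.00)
batched = estimate_cost(200_000, 50_000, 3.00, 15.00, batch_discount=0.5)
print(f"standard: ${standard:.3f}, batch: ${batched:.3f}")
```

Real invoices can differ: tiered context pricing, prompt caching, and per-tool charges are not modeled here.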
⚠️ Important Restrictions for US-Based Researchers
DeepSeek AI - Federal and State Restrictions
PAID CLOUD SERVICE NOT ALLOWED:
DeepSeek's paid API and cloud services are prohibited for US-based researchers at many institutions due to:
Federal Restrictions:
- H.R. 1121 - "No DeepSeek on Government Devices Act" (Introduced Feb 2025)
- House Select Committee Report - "DeepSeek Unmasked: Exposing the CCP's Latest Tool For Spying, Stealing, and Subverting U.S. Export Control Restrictions"
- Federal Agency Bans: NASA, U.S. Navy, Department of Defense (DOD), Department of Commerce have banned DeepSeek
- Owned by High-Flyer (Chinese company with CCP control)
- Data stored in China and accessible to Chinese government
- Content manipulation to align with CCP propaganda
State-Level Bans:
- Texas (Jan 31, 2025), Virginia (Feb 11, 2025), New York (Feb 10, 2025)
- Additional states: Iowa, South Dakota, Kansas, Tennessee, North Carolina, Nebraska, Arkansas, North Dakota, Oklahoma, Alabama, Georgia
University Bans:
- All Virginia public universities (George Mason, UVA, Virginia Tech, William & Mary, JMU)
- North Dakota University System
SELF-HOSTED OPEN-SOURCE MAY BE PERMITTED:
Open-source DeepSeek models can be downloaded and run on-premises, but researchers MUST:
- ✅ Check with institutional IT and security teams first
- ✅ Ensure compliance with federal grant requirements (NSF, DOD, DOE)
- ✅ Never upload sensitive, proprietary, or controlled data
- ✅ Document usage for research security compliance
Qwen (Alibaba) - Data Sovereignty Concerns
NOT SPECIFICALLY BANNED, BUT NOT RECOMMENDED:
Qwen is not subject to specific federal bans like DeepSeek, but has serious concerns for US researchers:
Key Issues:
- Owned by Alibaba (Chinese company subject to CCP control)
- Data stored in China under Chinese data sovereignty laws
- No GDPR compliance or EU data protection representative
- Potential surveillance under Chinese national security laws
- Congressional scrutiny (Senators urged sanctions in 2023, not yet implemented)
Regulatory Framework:
- NSF Research Security - Requires disclosure of foreign support and affiliations
- Treasury Outbound Investment Restrictions - Limits US investments in Chinese AI companies (affects funding, not use)
- No Entity List designation (as of Oct 2025)
SELF-HOSTED OPEN-SOURCE MAY BE PERMITTED:
Qwen's Apache 2.0 licensed models (40M+ downloads on HuggingFace) can be run on-premises, but researchers MUST:
- ✅ Check with institutional IT and security teams first
- ✅ Verify compliance with federal grant terms
- ✅ Avoid uploading to Chinese cloud services
- ✅ Document AI tool usage in research security plans
Recommendations for Researchers
✅ SAFE FOR RESEARCH (US-based alternatives):
- OpenAI (ChatGPT, GPT-5 API) - US company
- Anthropic (Claude) - US company
- Google (Gemini) - US company
- Microsoft (Copilot) - US company
- Mistral AI - French company (EU-based)
- Cohere - Canadian company
⚠️ USE WITH EXTREME CAUTION (Chinese companies):
- DeepSeek - BANNED at many institutions
- Qwen - Not banned, but data sovereignty concerns
- Check institutional policies BEFORE use
✅ SELF-HOSTED OPEN-SOURCE (May be acceptable):
- Meta Llama (US company, Llama Community License)
- DeepSeek open-source (with institutional approval)
- Qwen open-source (with institutional approval)
- Mistral open-weight models (EU company; many released under Apache 2.0)
ALWAYS:
- Check your institution's AI usage policy
- Review federal grant terms (NSF, NIH, DOD, DOE)
- Consult with IT security and research compliance offices
- Never share sensitive, proprietary, or controlled data with foreign AI services
- Document all AI tool usage for research security requirements
Best Options for Students & Educators:
- Free/Low-Cost Options:
  - DeepSeek - Most affordable API at $0.57/$1.68 per 1M tokens (~200x cheaper than GPT-4 Turbo), open-source option available; subject to the restrictions for US-based researchers described above
  - Meta Llama - Completely free open-source models (Llama 4 Scout & Maverick available for download)
  - GitHub Copilot - Free for students, teachers, and open source maintainers
  - Perplexity Education Pro - $4.99/month with student/faculty verification (1 month free trial)
  - Google AI Pro - Free for university students for 1 year ($19.99/month value)
  - HuggingFace - Free community access to models and datasets, $2/month free credits for Pro users
  - Ollama, LM Studio, Jan.ai - Run LLMs locally on your hardware for free
- Best Value Paid Options:
  - Mistral Le Chat Pro - $14.99/month (cheaper than competitors, strong performance)
  - OpenAI GPT-4o-mini API - $0.15/$0.60 per 1M tokens (60%+ cheaper than GPT-3.5 Turbo)
  - Gemini 2.5 Flash-Lite - $0.10/$0.40 per 1M tokens (most economical for high-volume simple tasks)
  - Claude Haiku 3.5 API - $0.80/$4 per 1M tokens (balanced cost and capability)
- Educational Programs Available:
  - Cohere Education Program - Contact for student/educator access
  - Google AI Pro - 1 year free for university students
  - Perplexity Education Pro - $4.99/month with verification
Additional Chatbot and LLM Services:
- Amazon Bedrock, Azure AI Foundry, Google Vertex: Provide access to various foundation models, each running on its respective cloud provider's infrastructure. Ideal for companies and institutions already running their infrastructure on commercial cloud services.
- You.com: Offers a pro plan with access to the latest AI models, personalized AI with memory, and advanced AI writing tools.
- Poe by Quora: A platform that gives you access to various chatbots (like GPT-4, Claude, etc.) through a single subscription.
Image and Video Generation Models
Image Generation Models (2025)
Stable Diffusion 3.5 (October 2024)
Stable Diffusion 3.5 from Stability AI features:
- SD3.5 Large (8.1B): High-quality 1MP generation with advanced prompt adherence
- SD3.5 Medium (2.5B): Balanced performance for consumer hardware (0.25-2MP)
- SD3.5 Large Turbo: Optimized for speed with 4-step generation
- Open Source: Free for non-commercial and commercial use under $1M revenue
- Platforms: HuggingFace, GitHub, Replicate, Fireworks AI
FLUX Models (Black Forest Labs)
FLUX by Black Forest Labs offers cutting-edge diffusion models:
- FLUX.1 Kontext (May 2025): Combines text+image prompts, state-of-the-art in-context generation and editing
- FLUX 1.1 Pro Ultra: Latest professional variant with enhanced quality
- FLUX.1 Krea Dev (July 2025): Better performance, varied aesthetics, improved realism
- FLUX.1 Schnell: Apache-licensed open-source for fast local generation (12B parameters)
- FLUX.1 Tools (November 2024): Fill, Depth, Canny, Redux for advanced control
- Architecture: 12B parameter rectified flow transformer
- Platforms: API access, BFL Playground, Azure AI Foundry
GPT-4o Image Generation (OpenAI)
GPT-4o Image (March 2025):
- Model: gpt-image-1 (replaces DALL-E 3)
- Resolution: Up to 4096×4096 pixels (4K)
- Features: Native integration in GPT-4o, reliable text rendering, multi-turn refinement, image transformation
- Access: ChatGPT (Free/Plus/Pro), OpenAI API
- Safety: C2PA metadata watermarking on all images
Midjourney V7 (April 2025)
Midjourney latest features:
- V7: Current default (since June 2025) with stunning text precision, richer textures, improved bodies/hands
- Draft Mode: 10x speed at half the cost
- Personalization: First model with personalization enabled by default
- V8: In development with "significant differences" and innovative features
- Video: Coming soon (in final sprint stage)
- Platform: Discord-based, Web Interface
Google Imagen 4 (May 2025)
Imagen 4 and Imagen 4 Ultra:
- Resolution: Up to 2K resolution
- Speed: 10x faster mode available
- Features: Enhanced photo-realism, improved text rendering, advanced typography, diverse art styles
- Safety: SynthID watermarking, content filtering
- Access: Gemini API, Google AI Studio, Google Labs
Adobe Firefly Image Model 4 (April 2025)
Firefly 4 and Firefly 4 Ultra:
- Resolution: Up to 2K with lifelike quality
- Features: Exceptional precision, camera control, structure/style references
- Commercial-Safe: Training data with indemnification for enterprise
- Integration: Photoshop, Illustrator, InDesign, API access
- Firefly Video Model: New modality (April 2025)
Breakthrough New Models (2024-2025)
- Reve Image 1.0 (March 2025): #1 on Artificial Analysis Arena, best-in-class prompt adherence and typography
- Recraft V3 (October 2024): #1 on HuggingFace leaderboard at launch, first with vector art generation and extended text
- HiDream-I1 (April 2025): 17B parameters, open-source (MIT), sparse transformer architecture
- Ideogram 3.0 (2025): Enhanced realism, style reference (3 images), superior text rendering
- Leonardo Lucid Origin (2025): Most versatile model, accurate text rendering, full HD renders
Video Generation Models (2025)
OpenAI Sora 2 (September 2025)
Sora 2 latest features:
- Native Audio: Synchronized dialogue, music, and sound effects
- Resolution: Up to 1080p, duration up to 20 seconds
- Physics: Superior simulation (basketball rebounds, water buoyancy, gymnastics)
- Cameo Feature: Insert user likenesses with consent
- Pricing: Plus plan (50 videos/month at 480p), Pro plan (10x more usage, higher resolutions)
- Access: ChatGPT Plus/Pro, iOS app (US/Canada, invite-only)
Google Veo 3 (May 2025)
Veo 3 represents Google's latest advancement:
- Resolution: Up to 4K, 8-second videos
- Native Audio: Dialogue, sound effects, ambient noise
- Features: Best-in-class physics, realism, prompt adherence, advanced character/camera controls
- Access: Flow (Google Labs), Gemini app (AI Pro subscribers), Google AI Studio, Gemini API, Vertex AI
- Limits: 3 videos/day for paying subscribers
- Rollout: 159+ countries (July 2025)
Runway Gen-4 (March 2025)
Runway Gen-4 features:
- World Consistency: Characters, locations, objects consistent across scenes
- Visual References: Image + text prompt (no fine-tuning required)
- Duration: 5 or 10 seconds
- Gen-4 Turbo: Faster generation at lower cost
- Access: app.runwayml.com
Meta Movie Gen (2025 Release Planned)
Movie Gen research features:
- Models: 30B video, 13B audio
- Resolution: 1080p HD, up to 16 seconds at 16 fps
- Audio: Up to 45 seconds with synchronized sound
- Features: Four capabilities (video generation, personalized video, precise editing, audio generation)
- Status: Research phase, Instagram integration planned 2025
- Partnership: Blumhouse Productions
Leading Commercial Video Platforms
- Pika 2.2 (February 2025): Pikaframes keyframe system, 10-second videos, 1080p, Pikatwists dramatic endings
- Kling AI 2.5 Turbo (September 2025): Enhanced prompt adherence, superior high-motion scenes, 1080p, 30% cost reduction
- Luma Ray3 (September 2025): Draft mode, HDR/EXR support, deep reasoning, fast generation
- HeyGen (2025): Avatar IV with hyper-realistic avatars, 140+ languages, Veo 3 integration, 60%+ Fortune 100 adoption
- Synthesia 3.0 (2025): Express-2 avatars, AI dubbing (32 languages), video agents, $2.1B valuation
- Hedra Character-3 (April 2025): Omnimodal model, 4K @ 60fps, 90-second videos, full-body animation with speech
Open-Source Video Models
- Hunyuan Video (Tencent): 13B+ parameters, largest open-source model, video-to-audio module, GitHub/HuggingFace
- Stable Video 4D 2.0 (May 2025): Enhanced 4D generation, 48 frames (12×4 views), 576×576, GitHub available
- Mochi 1 (Genmo): 10B parameters, Apache 2.0 license, 30fps, 5.4 seconds (HD version pending)
Advanced Capabilities
Image and Video Understanding
- Segment Anything Model 2 (SAM 2) (Meta): Real-time segmentation for images and videos
- CLIP (OpenAI): Vision-language understanding
- LLaVA: Open-source visual instruction tuning
3D Generation
- DreamGaussian: Text/image to 3D in minutes
- Meshy: Text to 3D mesh generation
- Luma Genie: Text to 3D model generation
Emerging Trends
- Consistency Models: Faster generation with fewer steps
- ControlNet Integration: Precise control over generation
- Real-time Generation: Sub-second image creation
- Multimodal Models: Unified image, video, and audio generation
- Neural Radiance Fields (NeRFs): 3D scene representation
- Diffusion Transformers (DiT): Next-generation architectures
Glossary
Google's Machine Learning Glossary
NVIDIA's Data Science Glossary
Agentic AI: Uses sophisticated reasoning and iterative planning to autonomously solve complex, multi-step problems. Agentic systems can break down tasks, use tools, and make decisions to achieve goals with minimal human intervention.
Anthropic: A research organization emphasizing AI safety and governance. Known for Claude, a large language model (LLM) with advanced reasoning and robust safety features.
API (Application Programming Interface): A set of protocols and tools that allow different software applications to communicate. In AI, APIs enable developers to integrate LLM capabilities into their applications programmatically.
Attention Mechanism: A neural network technique that allows models to focus on relevant parts of input data when processing information. The foundation of transformer architectures used in modern LLMs.
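To make the idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the form of attention used in transformers. The function names and toy vectors are illustrative assumptions; production implementations operate on batched tensors with learned projection matrices.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Toy data: the query points in the direction of the first key,
# so the output is dominated by the first value vector.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
result = attention([[5.0, 0.0]], keys, values)
print(result)
```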
Chain-of-Thought (CoT): A prompting technique that encourages AI models to break down complex problems into intermediate reasoning steps, improving accuracy on tasks requiring logic and multi-step reasoning.
ChatGPT: OpenAI's general-purpose conversational assistant, built on its GPT family of LLMs and renowned for its conversational strengths, versatility, and ability to adapt to varied tasks through effective prompt engineering.
Claude: Anthropic's LLM, recognized for its interpretability, strong reasoning capabilities, and rigorous safety considerations.
Context Window: The maximum amount of text (measured in tokens) that an LLM can process at once, including both the input prompt and generated output. Modern models range from 8K to over 1M tokens.
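A small sketch of the budgeting this implies, assuming (as in the definition above) that prompt and generated output share a single window. The helper names are hypothetical, and some providers also enforce a separate cap on output tokens that is not modeled here.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def max_output_budget(prompt_tokens, context_window):
    """Tokens left for the model's reply after the prompt is counted."""
    return max(0, context_window - prompt_tokens)


# Example with a 200K-token window and a 190K-token prompt.
window = 200_000
print(fits_in_context(190_000, 16_000, window))
print(max_output_budget(190_000, window))
```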
Copilot (GitHub, Microsoft): An AI-driven developer assistant offering code suggestions, debugging support, and efficiency improvements, leveraging generative AI to boost productivity.
Diffusion Models: A class of generative models that create images by iteratively denoising random noise. Used in systems like Stable Diffusion, DALL-E, and Midjourney for text-to-image generation.
Embeddings: Numerical vector representations of data (e.g., text, images, audio) that capture semantic meaning and relationships. Useful for search, clustering, recommendation, and more.
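Semantic closeness between embeddings is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by any real model).
toy_embeddings = {
    "cat": [0.90, 0.10, 0.00],
    "kitten": [0.85, 0.20, 0.05],
    "invoice": [0.00, 0.10, 0.95],
}

related = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["invoice"])
print(f"cat~kitten: {related:.3f}, cat~invoice: {unrelated:.3f}")
```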
Few-Shot Learning: The ability of an AI model to learn new tasks from just a few examples provided in the prompt, without requiring additional training or fine-tuning.
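In practice, "a few examples provided in the prompt" usually means labeled demonstrations placed before the new input. The sketch below assembles such a prompt as plain text; the function name, the `Review:`/`Sentiment:` layout, and the sample reviews are illustrative assumptions rather than a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and a new query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unlabeled for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("The battery lasts all day.", "positive"),
    ("It stopped working after a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Setup was quick and painless.",
)
print(prompt)
```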
Fine-Tuning: The process of further training a pre-trained model on a specific dataset or task to specialize its capabilities for particular use cases or domains.
Foundation Models: Large-scale deep learning models (e.g., LLMs, vision models, multimodal models) trained on massive datasets. They serve as a base or "foundation" for a wide range of downstream tasks, enabling transfer learning and rapid adaptation.
Gemini: Google's family of multimodal foundation models, capable of understanding and generating text, images, and other data types, reflecting Google's advancements in AI research.
Generative AI (GenAI): AI systems capable of creating new content—text, images, code, audio, video—based on patterns learned from training data. Includes LLMs, image generators, and multimodal models.
GitHub: A leading platform for version control and software collaboration. Now integrated with AI tools like GitHub Copilot for enhanced code development workflows.
Hallucination: When an AI model generates false, nonsensical, or unfaithful information presented as fact. A key challenge in LLM reliability, especially for factual or specialized domains.
HuggingFace: A hub and community for open-source AI models, datasets, and applications. Widely used in the natural language processing (NLP) community for model sharing and development.
Inference: The process of using a trained AI model to make predictions or generate outputs. In LLMs, this refers to generating text responses from prompts.
Large Language Models (LLMs): A subset of foundation models trained on extensive text corpora, enabling them to generate human-like text, summarize information, reason about topics, and perform a variety of NLP tasks. Examples include GPT, Claude, and Gemini.
LoRA (Low-Rank Adaptation): An efficient fine-tuning technique that freezes the pre-trained weights and trains small low-rank matrices added to selected layers, reducing computational costs while maintaining performance for specialized tasks.
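The saving can be checked with simple arithmetic. In the usual LoRA formulation, a frozen weight matrix of shape d_out × d_in is paired with two trainable matrices of shapes d_out × r and r × d_in. The sketch below counts parameters for one such matrix; the 4096 × 4096 shape and rank 8 are illustrative assumptions, not taken from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora


full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```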
MCP (Model Context Protocol): A standardized protocol for connecting AI assistants to external data sources and tools. Enables LLMs to access databases, APIs, and live information while maintaining security and privacy.
Mixture of Experts (MoE): A neural network architecture that uses multiple specialized sub-models (experts) and activates only relevant ones for each input, improving efficiency and scalability in large models.
Multimodal Models: AI systems that can process and generate multiple types of data (text, images, audio, video) in combination. Examples include GPT-4 with vision, Gemini, and Claude with image understanding.
Parameters: The trainable values within a neural network, updated during the training process to minimize loss and define the model's learned behavior. Model size is often described by parameter count (e.g., 7B, 70B parameters).
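Parameter count also gives a quick estimate of the memory needed just to hold a model's weights: parameters × bits per weight ÷ 8 bytes. The sketch below applies that formula; it uses 1 GB = 10^9 bytes and deliberately ignores activations, the KV cache, and runtime overhead, so actual requirements are higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for params, name in [(7_000_000_000, "7B"), (70_000_000_000, "70B")]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```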
Prompt Engineering: The practice of crafting, refining, and optimizing instructions (prompts) given to AI models in order to guide their outputs toward desired results.
Quantization: A technique that reduces the precision of model weights (e.g., from 16-bit to 4-bit) to decrease memory usage and computational requirements, enabling deployment on resource-constrained devices.
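As a minimal sketch of the idea, the code below applies symmetric round-to-nearest quantization to a short list of weights and then reconstructs them. The function names and sample weights are hypothetical, and real quantization schemes (per-channel scales, grouping, calibration) are considerably more elaborate.

```python
def quantize(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.50, 0.33, 0.07, -0.21]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"worst error={worst_error:.4f}")
```

The rounding error per weight is bounded by half the scale, which is the precision given up in exchange for the smaller storage format.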
RAG (Retrieval-Augmented Generation): A technique that enhances LLM responses by retrieving relevant information from external knowledge bases or documents before generating answers, reducing hallucinations and improving factual accuracy.
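The retrieve-then-generate flow can be sketched in a few lines. In this toy version, retrieval is plain word overlap, a stand-in assumption for the embedding similarity search a real system would use, and the final step only builds the prompt rather than calling a model. All names and documents are hypothetical.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.lower().split()))
        scored.append((overlap, doc))
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_rag_prompt(query, documents, top_k=1):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


documents = [
    "LoRA is an efficient fine-tuning technique.",
    "Temperature controls randomness in generated outputs.",
    "A vector database stores embedding vectors.",
]
question = "What does temperature control?"
top_docs = retrieve(question, documents)
rag_prompt = build_rag_prompt(question, documents)
print(rag_prompt)
```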
RLHF (Reinforcement Learning from Human Feedback): A training method that uses human preferences to fine-tune AI models, improving their alignment with human values and desired behaviors. Used extensively in ChatGPT and Claude development.
Stable Diffusion: A family of open-source latent-diffusion-based models used for generating high-quality images from text or other forms of input (e.g., sketches).
System Prompt: Initial instructions given to an AI model that define its role, behavior, constraints, and capabilities for a conversation or task. Often invisible to end users but shapes all responses.
Temperature: A parameter controlling randomness in AI-generated outputs. Lower temperatures (0.0-0.3) produce more deterministic responses; higher temperatures (0.7-1.0) increase creativity and variability.
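One common way temperature is applied is by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below shows the effect on a made-up three-token distribution; the function name and logits are illustrative, and a temperature of exactly 0 is handled by most systems as greedy selection rather than by this formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
print("T=0.2:", [round(p, 3) for p in cold])
print("T=1.0:", [round(p, 3) for p in warm])
```

At the lower temperature nearly all probability mass sits on the top-scoring token, which is why sampling becomes close to deterministic.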
Token: A fundamental unit of text—often a word, subword, or character—that LLMs process when understanding or generating language. Pricing and context limits are typically measured in tokens.
Transformer: The neural network architecture that powers modern LLMs, introduced in the paper "Attention is All You Need" (2017). Uses attention mechanisms to process sequences efficiently.
Vector Database: A specialized database optimized for storing and querying high-dimensional embedding vectors, enabling fast semantic search and similarity matching for RAG applications.
Weights: Numerical parameters within a neural network that determine the strength of connections between neurons or nodes.
Zero-shot Learning: The capability of an AI model to perform tasks it has never been explicitly trained on, often made possible by large-scale pretraining on diverse datasets.