- Inference & APIs
- Knowledge Bases (RAG)
- Agents
- Guardrails
- Model Customization
- Flows & Prompt Management
| Aspect | InvokeModel | Converse API |
|---|---|---|
| Purpose | Direct, low-level model invocation | Unified conversational interface |
| Format | Model-specific request/response | Model-agnostic format |
| Best For | Embeddings, images, model-specific features | Chatbots, multi-turn, tool use |
| Streaming | InvokeModelWithResponseStream | ConverseStream |
Key Insight: Converse API automatically applies conversation templates for Mistral AI and Meta models.
Limitation: Converse API does NOT support embedding or image generation models.
| Use Case | Recommended API |
|---|---|
| Multi-turn conversations | Converse |
| Tool use / function calling | Converse |
| Model-agnostic code | Converse |
| Embeddings generation | InvokeModel |
| Image generation | InvokeModel |
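The model-agnostic Converse format above can be sketched with boto3. A minimal example; the model ID is an assumption and the live call requires AWS credentials plus model access:

```python
def build_converse_request(model_id: str, user_text: str) -> dict:
    """Build a model-agnostic Converse request body (same shape for any chat model)."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }

def converse_once(model_id: str, user_text: str) -> str:
    """Run a single conversational turn via the Converse API."""
    import boto3  # requires configured AWS credentials
    client = boto3.client("bedrock-runtime")
    resp = client.converse(**build_converse_request(model_id, user_text))
    return resp["output"]["message"]["content"][0]["text"]
```

Swapping models only means changing `modelId`; the request and response shapes stay the same, which is the main argument for Converse over InvokeModel.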
- Temperature: Lower = deterministic, Higher = creative
- Top K: Limits to K most probable tokens
- Top P (Nucleus): Limits to tokens comprising top P% probability mass
Note: Default values vary significantly by model - always check model docs.
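In Converse requests, temperature and Top P live in `inferenceConfig`, while Top K is model-specific and (for Anthropic models) is typically passed through `additionalModelRequestFields`. A sketch under that assumption:

```python
def build_inference_config(temperature: float, top_p: float, max_tokens: int) -> dict:
    """Inference parameters in Converse format; valid ranges vary by model."""
    return {
        "inferenceConfig": {
            "temperature": temperature,  # lower = more deterministic
            "topP": top_p,               # nucleus sampling probability mass
            "maxTokens": max_tokens,
        },
        # Top K is not part of the common inferenceConfig; for Anthropic
        # models it goes in additionalModelRequestFields (assumption: key "top_k").
        "additionalModelRequestFields": {"top_k": 50},
    }
```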
Critical: The model ONLY requests tool calls - it NEVER executes them. Your application handles all execution.
User Query → Model → Tool Request → Your Code Executes → Tool Result → Model → Final Response
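The request/execute/return cycle above maps to the Converse `toolConfig` and `toolResult` structures. A sketch with a hypothetical `get_weather` tool (name and schema are illustrative, not from the source):

```python
def build_tool_config() -> dict:
    """Declare one hypothetical tool in Converse toolConfig format.
    The model can request this tool, but never executes it."""
    return {
        "tools": [{
            "toolSpec": {
                "name": "get_weather",
                "description": "Get current weather for a city.",
                "inputSchema": {"json": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                }},
            }
        }]
    }

def build_tool_result_message(tool_use_id: str, result: dict) -> dict:
    """After YOUR code runs the tool, wrap the result as a toolResult
    message and append it to the conversation for the next Converse call."""
    return {
        "role": "user",
        "content": [{"toolResult": {
            "toolUseId": tool_use_id,   # must echo the model's toolUse ID
            "content": [{"json": result}],
        }}],
    }
```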
- TTL: 5 minutes (resets on each hit)
- Minimum tokens: 1,024 - 4,096 (model-dependent)
- Maximum checkpoints: 4 per request
| Model | Min Tokens | Cacheable Fields |
|---|---|---|
| Claude Opus 4.5 | 4,096 | system, messages, tools |
| Claude 3.7 Sonnet | 1,024 | system, messages, tools |
| Amazon Nova | 1,000 | system, messages (text only) |
Best Practice: Place cache checkpoints at END of static content.
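A checkpoint is expressed as a `cachePoint` block placed after the static content, e.g. in the `system` list of a Converse request. A sketch, assuming the `{"cachePoint": {"type": "default"}}` block shape (cacheable fields vary by model, per the table above):

```python
def build_cached_system_prompt(static_instructions: str) -> list:
    """System prompt list with a cache checkpoint AFTER the static content,
    so the long unchanging prefix is cached across requests."""
    return [
        {"text": static_instructions},        # long, static instructions first
        {"cachePoint": {"type": "default"}},  # checkpoint at end of static content
    ]
```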
- Use `CreateModelInvocationJob` for large-scale processing
- Not supported with provisioned throughput
- More cost-efficient than individual API calls
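A batch job is submitted through the bedrock control-plane client, pointing at JSONL records in S3. A sketch; job name, role ARN, and S3 URIs are placeholders:

```python
def build_batch_job_request(model_id: str, role_arn: str,
                            input_s3: str, output_s3: str) -> dict:
    """Arguments for bedrock.create_model_invocation_job (batch inference)."""
    return {
        "jobName": "nightly-batch-summaries",  # hypothetical job name
        "modelId": model_id,
        "roleArn": role_arn,                   # IAM role with S3 read/write access
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3}},
    }

def submit_batch_job(model_id: str, role_arn: str, input_s3: str, output_s3: str) -> str:
    import boto3
    bedrock = boto3.client("bedrock")  # control plane, not bedrock-runtime
    resp = bedrock.create_model_invocation_job(
        **build_batch_job_request(model_id, role_arn, input_s3, output_s3))
    return resp["jobArn"]
```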
| Type | Description |
|---|---|
| System-defined (Cross-Region) | Pre-configured multi-region routing |
| Application (Single-Region) | User-created for tracking |
| Application (Multi-Region) | User-created using system profile |
Pricing: Based on calling region, NOT destination region.
| Mode | Data Boundaries | Use Case |
|---|---|---|
| Geographic | US, EU, APAC only | Compliance/data residency |
| Global | Any region | Max throughput, ~10% savings |
Security: No public internet traversal - stays on AWS internal network.
Important: Can route to regions NOT manually enabled in your account.
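Using a cross-region profile is just passing its ID where a model ID normally goes. The specific profile ID below is an assumption; list the real ones in your account with the control-plane `list_inference_profiles()` call:

```python
# Assumption: system-defined US profiles prefix the model ID with "us."
PROFILE_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"

def converse_via_profile(user_text: str) -> str:
    """Route a Converse call through a cross-region inference profile.
    Billing follows the calling region, not the destination region."""
    import boto3  # live call requires AWS credentials and model access
    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=PROFILE_ID,  # profile ID used in place of a model ID
        messages=[{"role": "user", "content": [{"text": user_text}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```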
| Tier | Performance | Cost | Use Case |
|---|---|---|---|
| Reserved | Guaranteed | Fixed premium | Mission-critical |
| Priority | Fastest on-demand | Higher | Customer-facing |
| Standard | Baseline | Standard | General tasks |
| Flex | May queue | Lowest | Batch processing |
Key: On-demand quota is SHARED across priority, standard, and flex tiers.
Supported Models (limited):
- Amazon Nova Pro (us-east-1, us-east-2)
- Claude 3.5 Haiku (us-east-2, us-west-2)
- Llama 3.1 405B/70B (us-east-2, us-west-2)
Tradeoff: Falls back to standard inference when quota reached.
Phase 1: Pre-processing (Ingestion)
Raw Data → Parsing → Chunking → Embedding → Vector Indexing
Phase 2: Runtime (Query)
User Query → Query Embedding → Semantic Search → Context Augmentation → LLM Response
| Operation | Purpose |
|---|---|
| Retrieve | Returns raw source chunks (custom pipelines) |
| RetrieveAndGenerate | Full RAG with citations |
| GenerateQuery | Natural language to SQL |
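`RetrieveAndGenerate` is the one-call RAG path. A sketch via bedrock-agent-runtime; the KB ID and model ARN are placeholders you supply:

```python
def build_rag_request(kb_id: str, model_arn: str, question: str) -> dict:
    """Arguments for bedrock-agent-runtime.retrieve_and_generate."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }

def ask_kb(kb_id: str, model_arn: str, question: str) -> str:
    import boto3
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(**build_rag_request(kb_id, model_arn, question))
    return resp["output"]["text"]  # source citations arrive in resp["citations"]
```

Use `retrieve` instead when you want the raw chunks for a custom generation pipeline.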
| Store | Key Characteristics |
|---|---|
| OpenSearch Serverless | Default, zero setup, serverless |
| Pinecone | High performance, purpose-built |
| Neptune | Graph-based, relationship-aware |
| Aurora RDS | SQL compatibility |
| Redis Enterprise | Low latency, real-time |
| MongoDB Atlas | Flexible schema |
| Strategy | Best For | Trade-off |
|---|---|---|
| Default (~300 tokens) | Quick setup | May not optimize for content |
| Fixed-Size | Uniform content | Configurable overlap |
| Hierarchical | Large documents | Fewer results, more context |
| Semantic | Complex relationships | Additional FM costs |
| No Chunking | Pre-processed files | No page-level citations |
- Parent chunks (large context) + Child chunks (precise embeddings)
- Child retrieved first, then replaced with parent
- Not recommended with S3 vector bucket
- Divides by meaningful boundaries
- Buffer size of 1 = 3 sentences (previous, current, next)
- Larger buffers capture more context but may introduce noise
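Chunking strategy is set in the data source's `chunkingConfiguration`. A sketch of the fixed-size and semantic shapes; the numeric values are illustrative defaults, not requirements:

```python
def fixed_size_chunking(max_tokens: int = 300, overlap_pct: int = 20) -> dict:
    """chunkingConfiguration for the fixed-size strategy (used in create_data_source)."""
    return {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": max_tokens,
            "overlapPercentage": overlap_pct,
        },
    }

def semantic_chunking(buffer_size: int = 1, breakpoint_pct: int = 95) -> dict:
    """Semantic strategy: buffer_size=1 embeds 3 sentences at a time
    (previous, current, next), splitting at embedding-distance breakpoints."""
    return {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "bufferSize": buffer_size,
            "breakpointPercentileThreshold": breakpoint_pct,
            "maxTokens": 300,
        },
    }
```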
| Connector | Multimodal Support |
|---|---|
| Amazon S3 | Yes |
| Custom Data Source | Yes |
| Confluence | No |
| SharePoint | No |
| Salesforce | No |
| Web Crawler | No |
Critical: Multimodal content ONLY works with S3 and Custom connectors.
| Model | Dimensions | Vector Types |
|---|---|---|
| Titan Embeddings G1 - Text | 1536 | float32 |
| Titan Text Embeddings V2 | 256/512/1024 | float32, binary |
| Cohere Embed | 1024 | float32, binary |
Binary vectors: Lower storage cost, reduced precision.
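Embedding models go through InvokeModel with a model-specific body. A sketch for Titan Text Embeddings V2, assuming its `dimensions`, `normalize`, and `embeddingTypes` body fields:

```python
import json

def build_titan_v2_body(text: str, dimensions: int = 256, binary: bool = False) -> dict:
    """Request body for Titan Text Embeddings V2 (256/512/1024 dims).
    Binary output trades precision for lower storage cost."""
    body = {"inputText": text, "dimensions": dimensions, "normalize": True}
    if binary:
        body["embeddingTypes"] = ["binary"]  # assumption: binary output field name
    return body

def embed(text: str, dimensions: int = 256) -> list:
    import boto3
    client = boto3.client("bedrock-runtime")
    resp = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps(build_titan_v2_body(text, dimensions)),
    )
    return json.loads(resp["body"].read())["embedding"]
```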
| Parser | Cost | Multimodal | Best For |
|---|---|---|---|
| Default | Free | No | Text extraction |
| Bedrock Data Automation | Per-page | Yes | PDF/images |
| Foundation Models | Per-token | Yes | Custom extraction |
Warning: Once you select advanced parsing, it applies to ALL PDFs in the data source.
| Approach | Converts To | Best For |
|---|---|---|
| Nova Multimodal Embeddings | Native vectors | Visual/audio similarity |
| Bedrock Data Automation | Text | Text-based search |
Timestamp Handling: Your application must extract segments based on start/end metadata.
Purpose: Reorder retrieved results by relevance.
Benefits:
- Fewer, more relevant results
- Decreased cost and latency
- More accurate responses
Limitation: Text only - no images, audio, or video.
- Foundation Models: Decision engine
- Action Groups: Executable capabilities
- Knowledge Bases: Contextual augmentation
User Input → Pre-processing (optional) → Orchestration Loop → Post-processing (optional) → Response
Orchestration Loop:
- FM generates rationale
- FM predicts action group OR KB query
- Execute action (Lambda or return control)
- Generate observation
- Re-augment prompt
- Loop until complete
| Method | Best For |
|---|---|
| OpenAPI Schema | Explicit API mapping |
| Function Details | Simplified setup |
- Max 11 API operations per action group
- ONE Lambda function per group
- Payload limited to 6 MB
When you need control over execution:
- Configure `RETURN_CONTROL`
- Agent returns parameters with an `invocationId`
- Your app executes the action
- Send results back with the same `invocationId`
Note: If `returnControlInvocationResults` is included, `inputText` is ignored.
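The result you send back rides in `sessionState` on the next `invoke_agent` call. A sketch of that structure; action group and function names are placeholders:

```python
def build_return_control_state(invocation_id: str, action_group: str,
                               function_name: str, result_text: str) -> dict:
    """sessionState for invoke_agent when returning RETURN_CONTROL results.
    The invocationId must be the one the agent handed back; inputText on
    this call is ignored."""
    return {
        "invocationId": invocation_id,
        "returnControlInvocationResults": [{
            "functionResult": {
                "actionGroup": action_group,
                "function": function_name,
                "responseBody": {"TEXT": {"body": result_text}},
            }
        }],
    }
```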
Supervisor Agent (orchestrator)
├── Collaborator Agent 1
├── Collaborator Agent 2
└── Collaborator Agent N
Key: Hierarchical model with supervisor routing to domain specialists.
| Type | Persistence | Scope |
|---|---|---|
| Session Context | Session duration | Same sessionId |
| Memory Context | 1-365 days | Same memoryId |
Association Trigger: Memory associates when endSession=true OR idleSessionTimeout expires.
Capabilities: Generate, run, troubleshoot code in sandbox.
Limits:
- Max 5 files per request
- Total 10 MB file size
- 25 concurrent sessions per account
Regions: us-east-1, us-west-2, eu-central-1
| Strategy | Use Case |
|---|---|
| Default (ReAct) | Standard workflows |
| Advanced Prompts | Modified templates |
| Custom Orchestration | Lambda-based complex logic |
- Pre-processing (disabled by default)
- Orchestration
- KB response generation
- Post-processing (disabled by default)
Critical Warning: Agent instructions IGNORED if: single KB + default prompts + no action groups + user input disabled.
| Type | Persistence |
|---|---|
| sessionAttributes | Full session |
| promptSessionAttributes | Single turn |
| conversationHistory | Multi-agent flows |
User Input → Parallel Policy Evaluation → BLOCKED (no inference charges) OR PASSES → Model → Evaluation → Response
Key: If input blocked, model inference skipped = no charges.
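Guardrails can also be invoked standalone via the `ApplyGuardrail` API to screen text before ever calling a model. A sketch; guardrail ID and version are placeholders:

```python
def build_apply_guardrail_request(guardrail_id: str, version: str, text: str) -> dict:
    """Arguments for bedrock-runtime.apply_guardrail on user INPUT."""
    return {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
        "source": "INPUT",  # use "OUTPUT" to evaluate a model response
        "content": [{"text": {"text": text}}],
    }

def is_blocked(guardrail_id: str, version: str, text: str) -> bool:
    import boto3
    client = boto3.client("bedrock-runtime")
    resp = client.apply_guardrail(
        **build_apply_guardrail_request(guardrail_id, version, text))
    # "GUARDRAIL_INTERVENED" means a policy fired; "NONE" means it passed
    return resp["action"] == "GUARDRAIL_INTERVENED"
```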
| Type | Purpose | Best For |
|---|---|---|
| Content Filters | Block harmful categories | General safety |
| Denied Topics | Block specific topics | Domain restrictions |
| Contextual Grounding | Detect hallucinations | RAG applications |
| Automated Reasoning | Validate logical rules | Regulated industries |
| Sensitive Info Filters | Block/mask PII | Privacy compliance |
| Word Filters | Block exact words | Profanity, terms |
- Hate
- Insults
- Sexual
- Violence
- Misconduct
- Prompt Attack
- Start with Medium, adjust based on false positive rate
- Different thresholds for inputs vs outputs
- Actions: `BLOCK` or `NONE` (detect-only)
Limits:
- Up to 30 topics per guardrail
- 200 chars (Classic) or 1,000 chars (Standard) per definition
- Up to 5 sample phrases per topic
Best Practices:
- Define topics crisply and precisely
- Don't include examples in definitions
- Don't define negative topics
- Use word filters for individual words
Two Checks:
- Grounding: Response accurate based on source?
- Relevance: Response answers the query?
Limits:
- Source: 100,000 chars max
- Query: 1,000 chars max
- Response: 5,000 chars max
- Threshold: 0 to 0.99 (1.0 invalid)
Not suitable for: Conversational chatbots.
Purpose: Validate against logical rules (HR policies, loan approvals, regulations).
Limitations:
- English (US) only
- No streaming API support
- Limited regional availability
- Cannot detect prompt injection
Variable Types: BOOL, INT, NUMBER, enum
| Feature | Standard | Classic |
|---|---|---|
| Multi-language | Extensive | English, French, Spanish |
| Cross-Region | Supported | Not supported |
| Prompt Leakage Detection | Yes | No |
| Denied Topic Definition | 1,000 chars | 200 chars |
Recommendation: Use Standard tier for new deployments.
Guarded user content is wrapped in tags with a unique per-request suffix:

```
<amazon-bedrock-guardrails-guardContent_xyz>
User input goes here
</amazon-bedrock-guardrails-guardContent_xyz>
```

| Method | Data Type | Purpose |
|---|---|---|
| Distillation | Prompts (unlabeled) | Transfer from teacher to student |
| Supervised Fine-Tuning | Labeled pairs | Task-specific accuracy |
| Continued Pre-Training | Unlabeled text | Domain knowledge expansion |
| Reinforcement Fine-Tuning | Prompts + reward function | Optimize for outcomes |
How It Works:
- Provide input prompts (or use invocation logs)
- Bedrock generates responses from teacher model
- Student model fine-tuned on synthetic data
Two Paths:
- Path A: Upload prompts, Bedrock generates teacher responses
- Path B: Use existing CloudWatch Logs from production
Key: Can expand dataset up to 15,000 prompt-response pairs.
Data Formats:

Non-conversational:

```json
{"prompt": "Question?", "completion": "Answer."}
```

Conversational:

```json
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [{"text": "You are helpful"}],
  "messages": [
    {"role": "user", "content": [{"text": "Q"}]},
    {"role": "assistant", "content": [{"text": "A"}]}
  ]
}
```

Data Format (Continued Pre-Training):

```json
{"input": "Your domain text here..."}
```

Use Cases: Private documents, domain vocabulary, proprietary knowledge.

Supported Models: Titan Text G1 - Express only.
Three Stages:
- Response Generation (4 responses per prompt)
- Reward Computation (Lambda scoring)
- Actor Training (GRPO optimization)
Two Approaches:
| Approach | Best For |
|---|---|
| RLVR (Verifiable Rewards) | Code, math, objective tasks |
| RLAIF (AI Feedback) | Subjective quality |
| Model | Min Records | Max Records | Max Tokens |
|---|---|---|---|
| Claude 3 Haiku | - | 10,000 | 4,096 input + 2,048 output |
| Llama 3.1/3.2 | 5 | 10,000 | 16,000 total |
| Titan Text | 1,000 | 500,000 | 4,096 |
Token Estimation: Use 6 characters per token.
File Size Limits:
- Training: 1 GB (fine-tuning), 10 GB (continued pre-training)
- Validation: 100 MB
| Model | Epoch Default | Batch Size | Special Features |
|---|---|---|---|
| Claude 3 Haiku | 2 | 32 (4-256) | Early stopping |
| Llama 3.1 | 5 | 1 (fixed) | Wide learning rate range |
| Nova | 2 | - | Warmup steps |
Key Details:
- Claude has automatic early stopping (overfitting prevention)
- Llama batch size is LOCKED at 1
- Each epoch multiplies token processing cost
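Fine-tuning is submitted via `create_model_customization_job` on the bedrock control plane. A sketch; job and model names are placeholders, hyperparameters are passed as strings, and valid names/ranges depend on the base model:

```python
def build_fine_tuning_request(base_model_id: str, role_arn: str,
                              train_s3: str, out_s3: str) -> dict:
    """Arguments for bedrock.create_model_customization_job (supervised
    fine-tuning). Use customizationType='CONTINUED_PRE_TRAINING' for CPT."""
    return {
        "jobName": "sft-demo-job",             # hypothetical
        "customModelName": "my-tuned-model",   # hypothetical
        "roleArn": role_arn,                   # IAM role with S3 access
        "baseModelIdentifier": base_model_id,
        "customizationType": "FINE_TUNING",
        "hyperParameters": {                   # string-valued; model-dependent
            "epochCount": "2",
            "batchSize": "1",
            "learningRate": "0.00001",
        },
        "trainingDataConfig": {"s3Uri": train_s3},
        "outputDataConfig": {"s3Uri": out_s3},
    }
```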
Provisioned Throughput: MANDATORY for all custom models.
| Commitment | Discount |
|---|---|
| No commitment | None |
| 1 month | Moderate |
| 6 months | Maximum |
| Model Family | Distillation Region |
|---|---|
| Claude & Llama | US West (Oregon) |
| Nova | US East (N. Virginia) |
Note: Nova models cannot be copied to other regions.
Directed node-based workflows: Input → Processing Nodes → Output
Key Elements:
- Nodes: Processing units
- Connections: Data (solid) or Conditional (dotted)
- Expressions: JSONPath-based data extraction (`$.data.field`)
| Category | Node | Purpose |
|---|---|---|
| Logic | Input | Entry point (exactly one) |
| Logic | Output | Return response (multiple allowed) |
| Logic | Condition | Route by logic (first match wins) |
| Logic | Iterator | Process arrays (sequential) |
| Logic | DoWhile | Loop (max 10 iterations) |
| AI | Prompt | Model inference |
| AI | Agent | Agent orchestration (supports multi-turn) |
| AI | Knowledge Base | RAG queries |
| Data | S3 Storage/Retrieval | Read/write S3 |
| Data | Lambda | Custom code |
| Data | Inline Code | Python in-flow (max 5 nodes) |
Key Details:
- Iterator is SEQUENTIAL, not parallel
- Inline Code NOT supported in async execution
- S3 Retrieval requires UTF-8 encoding
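Flows are invoked through bedrock-agent-runtime and stream node outputs back as events. A sketch; the node names are the console defaults and may differ in your flow:

```python
def build_flow_inputs(document: str) -> list:
    """Input list for invoke_flow, addressed to the flow's input node."""
    return [{
        "nodeName": "FlowInputNode",       # console default name (assumption)
        "nodeOutputName": "document",
        "content": {"document": document},
    }]

def run_flow(flow_id: str, alias_id: str, document: str):
    """Yield output documents from the flow's event stream."""
    import boto3
    client = boto3.client("bedrock-agent-runtime")
    resp = client.invoke_flow(
        flowIdentifier=flow_id,
        flowAliasIdentifier=alias_id,
        inputs=build_flow_inputs(document),
    )
    for event in resp["responseStream"]:   # streamed node/output events
        if "flowOutputEvent" in event:
            yield event["flowOutputEvent"]["content"]["document"]
```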
Agent nodes can pause for user input:
- First call: no `executionId`
- Agent pauses: `INPUT_REQUIRED` event
- Resume: call again with the `executionId` plus the user's response
- Complete: `SUCCESS` event
| Approach | Best For |
|---|---|
| Direct API | Simple calls, max control |
| Agents | Autonomous reasoning, tool use |
| Flows | Deterministic pipelines, visual dev |
Hybrid: Embed Agent nodes within Flows.
- Variables: `{{variable_name}}` placeholders
- Variants: Alternative configurations
- Versioning: Immutable snapshots
Create → Add Variables → Configure Model → Create Variants → Test → Save Version → Deploy
Bedrock can automatically rewrite prompts for better results.
Best Practice: Optimize in English only.
Supported Models: Claude 3.x, Nova, Llama 3.x, Mistral Large
Dynamically routes requests between models in same family:
- Analyzes prompt
- Predicts quality for each model
- Routes to best quality/cost combo
Limitations:
- English only
- Same model family only
- No application-specific learning
[Context/Background]
[Input Content]
[Task/Instruction]
[Output Format]
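The four-part ordering above (context, input, task, format) can be captured in a small template helper. A sketch; the `<input>` delimiter tags are an illustrative choice, not a Bedrock requirement:

```python
def build_prompt(context: str, input_content: str, task: str, output_format: str) -> str:
    """Assemble a prompt in the recommended order:
    context -> input content -> task -> output format."""
    return (
        f"{context}\n\n"
        f"<input>\n{input_content}\n</input>\n\n"  # delimit the raw input
        f"{task}\n\n"
        f"Format: {output_format}"
    )
```

Putting the instruction after the input tends to keep long inputs from burying the task.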
| Technique | When to Use |
|---|---|
| Zero-Shot | Simple tasks |
| Few-Shot | Complex output, calibration needed |
| Chain-of-Thought | Reasoning tasks |
| Model | Format |
|---|---|
| Claude | <example></example> tags |
| Titan | User: {}\nBot: |
| Llama | Meta's prompting guide |
Need inference?
├── Single call, simple → InvokeModel / Converse
├── Batch processing → CreateModelInvocationJob
├── Streaming → InvokeModelWithResponseStream / ConverseStream
└── Tool use → Converse API
- Global cross-region: ~10% savings
- Flex tier: Non-time-sensitive workloads
- Prompt caching: Repeated long contexts
- Batch inference: Large-scale processing
- Binary vectors: Lower storage costs
| Feature | InvokeModel | Converse | Notes |
|---|---|---|---|
| Text Generation | Yes | Yes | |
| Embeddings | Yes | No | |
| Image Generation | Yes | No | |
| Tool Use | No | Yes | |
| Multi-turn | Manual | Built-in | |
| Guardrails | Yes | Yes | |
| Scenario | Solution |
|---|---|
| Need RAG with citations | Knowledge Bases + RetrieveAndGenerate |
| Block harmful content | Guardrails with Content Filters |
| Detect hallucinations | Contextual Grounding Check |
| Autonomous task execution | Agents with Action Groups |
| Reduce model costs | Distillation (smaller student) |
| Domain-specific knowledge | Continued Pre-Training |
| Visual workflow | Bedrock Flows |
| Prompt versioning | Prompt Management |
Generated from official AWS Bedrock documentation