Infrastructure Verification Skill
Tech Stack: AWS CLI, Terraform, VPC, CloudWatch, bash
Source: Extracted from PDF S3 upload timeout investigation (2026-01-05) and Infrastructure-Application Contract principle.
When to Use This Skill
Use the infrastructure-verification skill when:
- ✓ Before deploying Lambda-in-VPC code
- ✓ Investigating Lambda connection timeouts
- ✓ Debugging deterministic failure patterns (first N succeed, last M fail)
- ✓ Validating network path to AWS services (S3, DynamoDB, RDS)
- ✓ After adding VPC endpoints
- ✓ Before concurrent Lambda executions
DO NOT use this skill for:
- ✗ Application code debugging (use error-investigation)
- ✗ Performance optimization (different focus)
- ✗ IAM permission issues (use AWS CLI directly)
Core Verification Principles
Principle 1: Infrastructure Dependency Validation
From CLAUDE.md Principle #15:
"Before deploying code that depends on AWS infrastructure (S3, VPC endpoints, NAT Gateway), verify infrastructure exists and is correctly configured. Network path issues cause deterministic failure patterns."
When to validate:
- Before deploying Lambda functions that make AWS service calls
- After Terraform infrastructure changes
- When investigating Lambda timeout patterns
- Before increasing concurrency limits
Principle 2: Pattern Recognition
Failure Pattern Types:
| Pattern | Root Cause | Investigation Priority |
|---|---|---|
| First N succeed, last M fail | Infrastructure bottleneck (NAT, connection limits) | HIGH - VPC endpoint missing |
| Random scattered failures | Performance issue (slow API, memory) | MEDIUM - Optimize code |
| All operations fail | Configuration issue (permissions, endpoint) | HIGH - Fix config |
| Intermittent failures | Rate limiting, transient network | LOW - Add retries |
A deterministic pattern (first N succeed, last M fail) is the strongest signal of an infrastructure bottleneck.
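The table can be applied mechanically to an ordered list of outcomes. The following is a minimal sketch, not part of the source investigation: `classify_pattern` is a hypothetical helper that takes one character per invocation in start order (`S` for success, `F` for failure). A single run cannot separate random from intermittent failures, so both are reported as `scattered`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify an ordered outcome string such as "SSSSSFFFFF".
classify_pattern() {
  local seq=$1
  # Reject empty input or anything other than S/F.
  if [[ -z $seq || $seq =~ [^SF] ]]; then
    echo "invalid"
    return 1
  fi
  if [[ $seq =~ ^S+$ ]]; then
    echo "all-success"
  elif [[ $seq =~ ^F+$ ]]; then
    echo "all-fail"             # configuration issue row in the table
  elif [[ $seq =~ ^S+F+$ ]]; then
    echo "deterministic-split"  # first N succeed, last M fail
  else
    echo "scattered"            # random or intermittent; needs more runs
  fi
}

# Example: five successes followed by five failures.
classify_pattern "SSSSSFFFFF"
```

A `deterministic-split` result is the case that should trigger the infrastructure workflows in this skill before any application debugging.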
Verification Workflows
Workflow 1: VPC Endpoint Verification
Use when: Lambda-in-VPC needs to access S3 or DynamoDB
Steps:
```bash
# 1. Check if VPC endpoint exists
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.ap-southeast-1.s3" \
  --query 'VpcEndpoints[*].{ID:VpcEndpointId,State:State,Service:ServiceName}' \
  --output table

# Expected output (if endpoint exists):
# ----------------------------------------------------------------
# |                     DescribeVpcEndpoints                     |
# +----------+-----------+---------------------------------------+
# |    ID    |   State   |                Service                |
# +----------+-----------+---------------------------------------+
# | vpce-xxx | available | com.amazonaws.ap-southeast-1.s3       |
# +----------+-----------+---------------------------------------+

# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)

# 2. Verify endpoint state
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].State' \
  --output text

# Expected: "available"
# If "pending" → Wait for creation
# If "failed" → Check Terraform logs

# 3. Verify route table attachment
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].RouteTableIds' \
  --output table

# Expected: List of route table IDs (must include Lambda subnet route tables)

# 4. Check Lambda subnet route tables
# (loop over every subnet; describe-subnets does not report route tables)
for SUBNET_ID in $(aws lambda get-function-configuration \
    --function-name my-function \
    --query 'VpcConfig.SubnetIds' \
    --output text); do
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
    --query 'RouteTables[].RouteTableId' \
    --output text
done
# Empty output for a subnet means it has no explicit association
# and uses the VPC's main route table.

# Compare: Lambda subnets' route tables should be in VPC endpoint's RouteTableIds

# 5. Verify S3 prefix list in route tables
# (RouteTables[0] is just the first route table in the VPC; substitute a
#  Lambda subnet's route table from step 4 if they differ)
ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[0].RouteTableId' \
  --output text)

aws ec2 describe-route-tables \
  --route-table-ids "$ROUTE_TABLE_ID" \
  --query 'RouteTables[*].Routes[?GatewayId==`vpce-xxx`]'

# Expected: Route with DestinationPrefixListId (S3 prefix list)
```
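Verification checklist:
- VPC endpoint exists for the service (step 1)
- Endpoint state is "available" (step 2)
- Endpoint is attached to route tables (step 3)
- Lambda subnets' route tables appear in the endpoint's RouteTableIds (step 4)
- Route tables contain an S3 prefix list route through the endpoint (step 5)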
Common issues:
- Missing VPC endpoint → Create with Terraform
- State "pending" → Wait 2-3 minutes
- Route tables not attached → Update Terraform route_table_ids
- Lambda subnets not covered → Verify subnet route table IDs
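The comparison in step 4 is easy to get wrong by eye when several subnets are involved. As a rough sketch (the function name and the `rtb-...` IDs are hypothetical, not from the source), the helper below prints every route table that a Lambda subnet uses but the endpoint is not attached to.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print route tables that Lambda subnets use but the
# endpoint is not attached to. Both arguments are whitespace-separated IDs.
missing_route_tables() {
  local needed=$1 attached rtb
  # Normalize tabs/newlines from `--output text` into single spaces.
  attached=" $(printf '%s ' $2)"
  for rtb in $needed; do
    case "$attached" in
      *" $rtb "*) ;;        # covered by the endpoint
      *) echo "$rtb" ;;     # not attached
    esac
  done
}

# Example with placeholder IDs: rtb-bbb is used by a subnet but not attached.
missing_route_tables "rtb-aaa rtb-bbb" "rtb-aaa rtb-ccc"
```

Pass the route table IDs collected in step 4 as the first argument and the endpoint's `RouteTableIds` from step 3 as the second. Empty output means every Lambda subnet is covered.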
Workflow 2: NAT Gateway Diagnosis
Use when: Investigating Lambda connection timeouts with external services
Steps:
```bash
# 1. Check NAT Gateway exists
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-xxx" \
  --query 'NatGateways[*].{ID:NatGatewayId,State:State,PublicIp:NatGatewayAddresses[0].PublicIp}' \
  --output table

# Expected: State "available"

# 2. Check route tables using NAT Gateway
# (RouteTableId lives on the route table, not on each route)
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[].{RouteTable:RouteTableId,NatRoutes:Routes[?NatGatewayId!=`null`].[DestinationCidrBlock,NatGatewayId]}' \
  --output json

# Expected: Route 0.0.0.0/0 → nat-xxx (default route through NAT)

# 3. Analyze connection saturation pattern
# Run this during concurrent Lambda executions
# (timestamps are in milliseconds; convert to seconds before formatting)
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '5 minutes ago' +%s)000" \
  --filter-pattern "START RequestId" \
  --query 'events[*].timestamp' \
  --output text | tr '\t' '\n' | while read -r ts; do
    date -d "@$((ts / 1000))" '+%H:%M:%S'
  done

# Check execution pattern:
# - All start within 1 second → Concurrent execution
# - Some time out after ~600s → Consistent with NAT Gateway saturation

# 4. Check for connection timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ConnectTimeoutError" \
  --query 'events[*].message' \
  --output text

# If errors found → Possible NAT Gateway connection saturation
# (confirm against the execution pattern from step 3)

# 5. Calculate concurrent connection demand
CONCURRENT_LAMBDAS=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '1 minute ago' +%s)000" \
  --filter-pattern "START RequestId" \
  --query 'length(events)' \
  --output text)

echo "Concurrent Lambdas: $CONCURRENT_LAMBDAS"
echo "NAT Gateway connection limit: ~55,000 per unique destination (but establishment rate limited)"
```
NAT Gateway saturation indicators:
- ✅ Deterministic pattern (first N succeed, last M fail)
- ✅ ConnectTimeoutError in logs
- ✅ Long execution times (~600s in the source investigation; botocore's default connect timeout is 60s per attempt, so the total includes retries)
- ✅ Timeline shows concurrent starts → split success/failure
Solution: Add VPC Gateway Endpoint for S3/DynamoDB to bypass NAT
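The log-based indicators can be cross-checked against the NAT Gateway's own CloudWatch metrics. The sketch below is an assumption layered on the source workflow: it reads the `ErrorPortAllocation` metric in the `AWS/NATGateway` namespace, which the original investigation does not mention. The function name is hypothetical, `nat-xxx` is a placeholder, and GNU `date` is assumed, as in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum NAT Gateway port allocation errors over a window.
# Assumes GNU date and AWS CLI credentials with CloudWatch read access.
nat_port_allocation_errors() {
  local nat_id=$1 minutes=${2:-15}
  aws cloudwatch get-metric-statistics \
    --namespace AWS/NATGateway \
    --metric-name ErrorPortAllocation \
    --dimensions "Name=NatGatewayId,Value=$nat_id" \
    --start-time "$(date -u -d "$minutes minutes ago" +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 \
    --statistics Sum \
    --query 'sum(Datapoints[].Sum)' \
    --output text
}

# Usage (placeholder ID): nat_port_allocation_errors nat-xxx 30
```

A nonzero sum during the failure window would support the saturation hypothesis. A zero sum does not rule out other NAT-related bottlenecks, so treat it as one signal alongside the log pattern.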
Workflow 3: Network Path Validation
Use when: Verifying Lambda can reach AWS services
Steps:
```bash
# 1. Identify Lambda VPC configuration
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.{VpcId:VpcId,SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds}' \
  --output json

# Save VPC ID, Subnet IDs, Security Group IDs

# 2. Check security group egress rules
aws ec2 describe-security-groups \
  --group-ids sg-xxx \
  --query 'SecurityGroups[*].IpPermissionsEgress[*].{Proto:IpProtocol,Port:FromPort,Dest:IpRanges[0].CidrIp}' \
  --output table

# Expected: 0.0.0.0/0 allowed (all egress)
# If restricted → Add rule for destination service

# 3. Check route table for Lambda subnet
SUBNET_ID=$(aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds[0]' \
  --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'RouteTables[0].RouteTableId' \
  --output text)
# "None" here means the subnet has no explicit association and uses the
# VPC's main route table (filter: Name=association.main,Values=true).

aws ec2 describe-route-tables \
  --route-table-ids "$ROUTE_TABLE_ID" \
  --query 'RouteTables[*].Routes[*].[DestinationCidrBlock,DestinationPrefixListId,GatewayId,NatGatewayId]' \
  --output table

# Expected routes:
# - VPC CIDR → local (VPC internal)
# - 0.0.0.0/0 → nat-xxx (internet via NAT)
# - pl-xxx → vpce-xxx (S3 via gateway endpoint, if attached)

# 4. Test actual network path (requires test Lambda invocation)
# Deploy temporary test Lambda:
# - Attempts connection to S3
# - Logs connection details
# - Reports success/failure

# 5. Analyze test results
aws logs tail /aws/lambda/network-test --since 1m

# Look for:
# - Connection established (success)
# - Connection timeout on some invocations (NAT saturated)
# - Connection timeout on every invocation (security group or NACL dropping traffic)
# - DNS resolution failure (VPC DNS issue)
```
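The expected-routes check in step 3 can be reduced to a single answer: which path will S3 traffic take from this route table? The following is a small sketch under stated assumptions: `s3_path_for_routes` is a hypothetical helper, the IDs are placeholders, and each input row is a destination followed by a target, separated by whitespace.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "destination target" rows on stdin and report
# which path S3 traffic would take from that route table.
s3_path_for_routes() {
  local dest target path="none"
  while read -r dest target; do
    if [[ $dest == pl-* && $target == vpce-* ]]; then
      path="gateway-endpoint"   # prefix-list route is more specific than 0.0.0.0/0
      break
    elif [[ $dest == "0.0.0.0/0" && $target == nat-* ]]; then
      path="nat-gateway"
    fi
  done
  echo "$path"
}

# Example rows with placeholder IDs.
printf '%s\n' "10.0.0.0/16 local" "0.0.0.0/0 nat-xxx" "pl-xxx vpce-xxx" \
  | s3_path_for_routes
```

Rows in this shape can be produced from `aws ec2 describe-route-tables` with `--query 'RouteTables[0].Routes[].[DestinationCidrBlock || DestinationPrefixListId, GatewayId || NatGatewayId]' --output text`. The helper does not check that the prefix list actually belongs to S3, so confirm that separately when more than one gateway endpoint exists.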
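Network path checklist:
- Lambda VPC ID, subnet IDs, and security group IDs identified (step 1)
- Security group egress allows the destination service (step 2)
- Lambda subnet route table has a route to the service via NAT or VPC endpoint (step 3)
- Test Lambda connects successfully (steps 4-5)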
Workflow 4: Post-Deployment Infrastructure Validation
Use when: After deploying infrastructure changes (VPC endpoints, security groups)
Steps:
```bash
# 1. Verify Terraform outputs
cd terraform
terraform output s3_vpc_endpoint_id     # Should return vpce-xxx
terraform output s3_vpc_endpoint_state  # Should return "available"

# 2. Run smoke test Lambda invocation
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
  --function-name my-function \
  --cli-binary-format raw-in-base64-out \
  --payload '{"test": true}' \
  /tmp/response.json

# Check response
jq . /tmp/response.json

# 3. Verify CloudWatch logs show success
# (add --follow to stream; it blocks until interrupted)
aws logs tail /aws/lambda/my-function --since 1m

# Expected:
# - No ConnectTimeoutError
# - Operation completes in expected time (2-3s not 600s)
# - Success message logged

# 4. Test concurrent execution (simulate production load)
for i in {1..10}; do
  aws lambda invoke \
    --function-name my-function \
    --cli-binary-format raw-in-base64-out \
    --payload "{\"id\": $i}" \
    --invocation-type Event \
    "/tmp/response_$i.json" &
done
wait

# 5. Analyze concurrent execution results
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '5 minutes ago' +%s)000" \
  --filter-pattern "ConnectTimeoutError" \
  --query 'length(events)' \
  --output text

# Expected: 0 (no timeout errors)
# If > 0 → Infrastructure issue still exists

# 6. Verify 100% success rate
# (relies on the function logging a ✅ marker on success)
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '5 minutes ago' +%s)000" \
  --filter-pattern "✅" \
  --query 'length(events)' \
  --output text

# Expected: 10 (all concurrent executions succeeded)
```
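Steps 5 and 6 each produce a count, and the pass condition combines them: zero timeouts and as many successes as invocations. A minimal sketch of that gate, suitable for a CI step, might look like this. The function name and messages are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical gate: fail when any timeout was logged or a success is missing.
check_concurrent_run() {
  local timeouts=$1 successes=$2 expected=$3
  if (( timeouts > 0 )); then
    echo "FAIL: $timeouts ConnectTimeoutError event(s)"
    return 1
  fi
  if (( successes < expected )); then
    echo "FAIL: only $successes of $expected invocations succeeded"
    return 1
  fi
  echo "PASS: $successes of $expected invocations succeeded"
}

# Example with the counts from a clean run of 10 invocations.
check_concurrent_run 0 10 10
```

Feed it the outputs of the two `filter-log-events` commands and the loop size. The nonzero exit status on failure lets a pipeline stop before promoting the deployment.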
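Post-deployment checklist:
- Terraform outputs return the endpoint ID and state "available" (step 1)
- Smoke test invocation returns a valid response (step 2)
- CloudWatch logs show no ConnectTimeoutError and normal duration (step 3)
- Concurrent test produces 0 timeout errors (steps 4-5)
- All concurrent executions log success (step 6)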
Common Infrastructure Issues
Issue 1: Missing S3 VPC Endpoint
Symptom:
- Lambda timeout after 600s
- Error: ConnectTimeoutError: Connect timeout on endpoint URL: "https://bucket.s3.region.amazonaws.com/..."
- Pattern: First N concurrent operations succeed, last M timeout
Diagnosis:
```bash
# Check for S3 VPC endpoint
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.region.s3"

# If empty → No endpoint (S3 traffic goes through NAT)
```
Fix:
```hcl
# terraform/s3_vpc_endpoint.tf
data "aws_route_tables" "vpc_route_tables" {
  vpc_id = data.aws_vpc.default.id
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = data.aws_route_tables.vpc_route_tables.ids

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })

  tags = {
    Name = "s3-endpoint"
  }
}

output "s3_vpc_endpoint_id" {
  value = aws_vpc_endpoint.s3.id
}

output "s3_vpc_endpoint_state" {
  value = aws_vpc_endpoint.s3.state
}
```
Verification:
```bash
cd terraform
terraform apply
terraform output s3_vpc_endpoint_state  # Should be "available"

# Test Lambda invocation
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
  --function-name my-function \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  /tmp/response.json
aws logs tail /aws/lambda/my-function --since 1m
# Expected: No timeout, completes in 2-3s
```
Issue 2: NAT Gateway Connection Saturation
Symptom:
- Deterministic failure pattern (first 5 succeed, last 5 timeout)
- All timeouts occur after ~10 minutes (boto3 default + retries)
- Timeline analysis shows concurrent Lambda starts
Diagnosis:
```bash
# Check timeline of Lambda executions
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '30 minutes ago' +%s)000" \
  --filter-pattern "START RequestId" \
  | jq -r '.events[] | .timestamp as $ts | ($ts/1000 | strftime("%H:%M:%S")) + " " + (.message | split(" ")[2])'

# Look for:
# - All start within 1 second (concurrent)

# Check which RequestIds have errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ConnectTimeoutError" \
  | jq -r '.events[].message' | grep -o "RequestId: [a-z0-9-]*"

# Pattern: Last N RequestIds consistently fail
```
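The "all start within 1 second" check can be made explicit by measuring the spread of the START timestamps. This is a minimal sketch with a hypothetical function name; it expects CloudWatch timestamps in milliseconds as arguments, and the values in the example are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the spread (max - min) of millisecond timestamps.
start_spread_ms() {
  (( $# > 0 )) || return 1
  local min=$1 max=$1 ts
  for ts in "$@"; do
    (( ts < min )) && min=$ts
    (( ts > max )) && max=$ts
  done
  echo $(( max - min ))
}

# Example: three START events spanning 400 ms in total.
start_spread_ms 1767600000000 1767600000150 1767600000400
```

A spread of 1000 or less matches the concurrent-start condition described above. The timestamps can be taken from `filter-log-events` with `--query 'events[*].timestamp' --output text`.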
Root Cause:
- NAT Gateway has limited connection establishment rate
- Concurrent Lambdas try to establish S3 connections simultaneously
- First N connections succeed → Upload completes in 2-3s
- Last M connections queued → Eventually timeout after 600s
Fix: Add S3 VPC Gateway Endpoint (see Issue 1)
Why this works:
- VPC Gateway Endpoint bypasses NAT Gateway
- S3 traffic routed directly within AWS network
- Not subject to NAT Gateway connection establishment limits
- Free (Gateway endpoints have no hourly charge)
Issue 3: Security Group Blocking Egress
Symptom:
- Lambda unable to connect to AWS service
- Error: connection timeout (security groups drop blocked traffic silently, so "connection refused" is unlikely)
- All invocations fail (no first-N-succeed, last-M-fail split)
Diagnosis:
```bash
# Check security group egress rules
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SecurityGroupIds[0]' \
  --output text | xargs -I {} aws ec2 describe-security-groups --group-ids {}

# Look for egress rules allowing HTTPS (port 443)
# Expected: 0.0.0.0/0 or specific AWS service prefix list
```
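Scanning the full `describe-security-groups` output for a port 443 rule is tedious when a group has many rules. As an illustrative sketch (the function name is hypothetical), the helper below reads one egress rule per line as protocol, from-port, and to-port, and succeeds if any rule covers TCP 443. It checks ports only, not the destination CIDR or prefix list.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "protocol from_port to_port" rows on stdin and
# succeed if any rule covers TCP port 443. Protocol "-1" means all traffic.
allows_https_egress() {
  local proto from to
  while read -r proto from to; do
    if [[ $proto == "-1" ]]; then
      return 0
    fi
    if [[ $proto == "tcp" && $from =~ ^[0-9]+$ && $to =~ ^[0-9]+$ ]] \
        && (( from <= 443 && 443 <= to )); then
      return 0
    fi
  done
  return 1
}

# Example rows: an HTTP-only rule followed by an HTTPS rule.
printf '%s\n' "tcp 80 80" "tcp 443 443" | allows_https_egress && echo "443 allowed"
```

Rows in this shape come from `aws ec2 describe-security-groups` with `--query 'SecurityGroups[].IpPermissionsEgress[].[IpProtocol,FromPort,ToPort]' --output text`. All-traffic rules have no port range and appear as `-1 None None`.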
Fix:
```hcl
# terraform/security_groups.tf
resource "aws_security_group_rule" "lambda_egress_https" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.lambda.id
}
```
Issue 4: Route Table Not Attached to VPC Endpoint
Symptom:
- VPC endpoint exists and is "available"
- Lambda still times out connecting to S3
- Deterministic or random failures
Diagnosis:
```bash
# Check VPC endpoint route table attachment
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].RouteTableIds' \
  --output table

# Get Lambda subnet route table
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds[0]' \
  --output text | xargs -I {} aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values={}" \
    --query 'RouteTables[0].RouteTableId' \
    --output text

# Compare: Lambda's route table should be in endpoint's RouteTableIds
```
Fix:
```hcl
# terraform/s3_vpc_endpoint.tf
data "aws_route_tables" "vpc_route_tables" {
  vpc_id = data.aws_vpc.default.id
}

resource "aws_vpc_endpoint" "s3" {
  # ... other config ...

  # Attach to ALL route tables (includes Lambda subnets)
  route_table_ids = data.aws_route_tables.vpc_route_tables.ids
}
```
Integration with Other Skills
With error-investigation
- Use infrastructure-verification BEFORE error-investigation when:
- Investigating Lambda timeout patterns
- Debugging connection failures
- Analyzing deterministic failure patterns
- Use error-investigation AFTER infrastructure-verification when:
- Infrastructure confirmed correct but errors persist
- Need to analyze application logs
- Debugging business logic failures
With deployment skill
- Use infrastructure-verification:
- BEFORE deploying Lambda-in-VPC code
- AFTER deploying infrastructure changes (Terraform apply)
- During post-deployment validation
- Complements deployment smoke tests with infrastructure-specific checks
With testing-workflow
- Infrastructure verification is a form of pre-deployment testing
- Validates infrastructure-application contract (CLAUDE.md Principle #15)
- Catches configuration issues before code deployment
Quick Reference
VPC Endpoint Types
| Type | Services | Cost | Use Case |
|---|---|---|---|
| Gateway | S3, DynamoDB | FREE | High-throughput data access |
| Interface | Most AWS services | ~$7.50/month per AZ, plus data processing | Other services (Secrets Manager, etc.) |
NAT Gateway Limits
| Limit | Value | Impact |
|---|---|---|
| Concurrent connections | 55,000 per unique destination | Theoretical max |
| Connection establishment rate | Limited | Causes saturation with concurrent Lambdas |
| Data transfer cost | $0.045/GB | Expensive for large transfers |
Recommendation: Use VPC Gateway Endpoints for S3/DynamoDB (no hourly or data processing charge, no NAT connection limits, and S3/DynamoDB traffic stays off the NAT Gateway)
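To put the data transfer row in concrete terms, the cost is the volume in GB multiplied by the per-GB rate. A quick sketch, assuming the $0.045/GB figure from the table (actual rates vary by region) and a hypothetical function name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: NAT Gateway data processing cost for a volume in GB.
# The default rate is the $0.045/GB figure from the NAT Gateway Limits table.
nat_processing_cost() {
  local gb=$1 rate=${2:-0.045}
  awk -v gb="$gb" -v rate="$rate" 'BEGIN { printf "%.2f\n", gb * rate }'
}

# Example: 1000 GB per month of S3 traffic through the NAT Gateway.
nat_processing_cost 1000
```

The same volume through an S3 Gateway Endpoint incurs no endpoint charge, which is the basis for the recommendation above.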
Common AWS CLI Commands
```bash
# VPC endpoint
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-xxx

# NAT Gateway
aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxx

# Security groups
aws ec2 describe-security-groups --group-ids sg-xxx

# Route tables
aws ec2 describe-route-tables --route-table-ids rtb-xxx

# Lambda VPC config
aws lambda get-function-configuration --function-name my-function --query 'VpcConfig'
```
File Organization
.claude/skills/infrastructure-verification/
└── SKILL.md # This file (complete skill)
References