OpenClaw Cloud-Native Cost Optimization in Practice: From Design Principles to Engineering Implementation

Introduction: A Systematic Approach to Cloud Cost Optimization

In the lifecycle management of cloud-native applications, cost control is an often underestimated engineering challenge. OpenClaw, an AI assistant platform built on a range of Amazon Web Services offerings, reached a monthly cost of $200.34 shortly after initial deployment, which prompted us to optimize systematically across architecture design, resource configuration, and service selection.

This article walks through the complete cost-optimization process for OpenClaw, from theoretical framework to concrete implementation and from point fixes to systematic practice, aiming to offer a repeatable methodology and engineering playbook for cost control in cloud-native applications.

Chapter 1: A Theoretical Framework for Cost Analysis and Its Implementation

1.1 A Multi-Dimensional Model of Cloud Cost Structure

The cost structure of a cloud deployment can be abstracted into a multi-dimensional model:

Cost = Compute(Instance, Utilization, Time) + 
       Storage(Capacity, Type, Access_Pattern) + 
       Network(Transfer, Requests) + 
       Services(API_Calls, Processing_Units) + 
       Management(Monitoring, Backup, Security)

For an AI application like OpenClaw, the trade-off between compute-intensive workloads and API call costs deserves particular attention.
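
The model above can be sketched as a simple sum of per-dimension subtotals. The category names and dollar figures below are illustrative placeholders, not real OpenClaw billing data:

```python
# Illustrative sketch of the multi-dimensional cost model; the category
# names and figures are made-up examples, not real billing data.
from typing import Dict

def total_cost(components: Dict[str, float]) -> float:
    """Sum the per-dimension subtotals of the cost model."""
    return sum(components.values())

def cost_shares(components: Dict[str, float]) -> Dict[str, float]:
    """Each dimension's share of total cost, as a percentage."""
    total = total_cost(components)
    return {name: cost / total * 100 for name, cost in components.items()}

if __name__ == "__main__":
    sample = {
        'compute': 120.0,   # Compute(Instance, Utilization, Time)
        'storage': 8.0,     # Storage(Capacity, Type, Access_Pattern)
        'network': 4.0,     # Network(Transfer, Requests)
        'services': 68.0,   # Services(API_Calls, Processing_Units)
    }
    print(f"total: ${total_cost(sample):.2f}")
    for name, share in cost_shares(sample).items():
        print(f"{name}: {share:.1f}%")
```

Decomposing cost this way makes the dominant dimension obvious at a glance, which is exactly the question the observability tooling below answers with real data.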

1.2 Engineering Cost Observability

Building cost observability is the first step of any optimization effort. We designed a layered cost-monitoring architecture:

import boto3
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional, Tuple
import logging
from dataclasses import dataclass
from enum import Enum

class CostDimension(Enum):
    SERVICE = 'SERVICE'
    USAGE_TYPE = 'USAGE_TYPE'
    OPERATION = 'OPERATION'
    AVAILABILITY_ZONE = 'AVAILABILITY_ZONE'
    INSTANCE_TYPE = 'INSTANCE_TYPE'
    REGION = 'REGION'

@dataclass
class CostMetric:
    value: float
    unit: str
    timestamp: datetime
    dimensions: Dict[str, str]
    metadata: Optional[Dict[str, Any]] = None

class CostAnalysisEngine:
    """Cost analysis engine - multi-dimensional cost breakdown and trend analysis."""
    
    def __init__(self, region: str = 'us-east-1'):
        self.ce = boto3.client('ce', region_name=region)
        self.cloudwatch = boto3.client('cloudwatch', region_name=region)
        self.logger = logging.getLogger(__name__)
        
    def get_cost_and_usage_detailed(
        self, 
        start_date: datetime, 
        end_date: datetime,
        granularity: str = 'DAILY',
        dimensions: List[CostDimension] = None,
        filters: Dict = None
    ) -> pd.DataFrame:
        """
        Fetch detailed cost and usage data from Cost Explorer.
        
        Args:
            start_date: start of the reporting window
            end_date: end of the reporting window
            granularity: time granularity (DAILY, MONTHLY, HOURLY)
            dimensions: grouping dimensions
            filters: optional Cost Explorer filter expression
            
        Returns:
            DataFrame containing the cost data
        """
        
        if dimensions is None:
            dimensions = [CostDimension.SERVICE]
        
        group_by = [{'Type': 'DIMENSION', 'Key': dim.value} for dim in dimensions]
        
        # Cost Explorer rejects Filter=None, so only pass it when provided
        request_kwargs = {
            'TimePeriod': {
                'Start': start_date.strftime('%Y-%m-%d'),
                'End': end_date.strftime('%Y-%m-%d')
            },
            'Granularity': granularity,
            'Metrics': ['BlendedCost', 'UsageQuantity', 'NormalizedUsageAmount'],
            'GroupBy': group_by
        }
        if filters is not None:
            request_kwargs['Filter'] = filters
        
        try:
            response = self.ce.get_cost_and_usage(**request_kwargs)
            
            # Flatten the nested response into rows
            cost_data = []
            for time_result in response['ResultsByTime']:
                time_period = time_result['TimePeriod']
                
                for group in time_result['Groups']:
                    dimensions_dict = {
                        dim.value: group['Keys'][i] 
                        for i, dim in enumerate(dimensions)
                    }
                    
                    metrics = group['Metrics']
                    
                    cost_data.append({
                        'start_date': time_period['Start'],
                        'end_date': time_period['End'],
                        **dimensions_dict,
                        'blended_cost': float(metrics['BlendedCost']['Amount']),
                        'usage_quantity': float(metrics['UsageQuantity']['Amount']),
                        'normalized_usage': float(metrics['NormalizedUsageAmount']['Amount']),
                        'cost_unit': metrics['BlendedCost']['Unit'],
                        'usage_unit': metrics['UsageQuantity']['Unit']
                    })
            
            df = pd.DataFrame(cost_data)
            if not df.empty:
                df['start_date'] = pd.to_datetime(df['start_date'])
                df['end_date'] = pd.to_datetime(df['end_date'])
            
            return df
            
        except Exception as e:
            self.logger.error(f"Failed to fetch cost data: {e}")
            raise
    
    def analyze_cost_drivers(self, df: pd.DataFrame) -> Dict:
        """
        Analyze cost drivers.
        
        Args:
            df: cost data DataFrame
            
        Returns:
            cost driver analysis results
        """
        
        analysis = {
            'total_cost': df['blended_cost'].sum(),
            'cost_by_service': {},
            'cost_trends': {},
            'utilization_metrics': {},
            'optimization_opportunities': []
        }
        
        # Cost breakdown by service
        if 'SERVICE' in df.columns:
            service_costs = df.groupby('SERVICE')['blended_cost'].sum().sort_values(ascending=False)
            analysis['cost_by_service'] = service_costs.to_dict()
        
        # Cost trend analysis
        if 'start_date' in df.columns:
            daily_costs = df.groupby('start_date')['blended_cost'].sum()
            analysis['cost_trends'] = {
                'daily_average': daily_costs.mean(),
                'growth_rate': self._calculate_growth_rate(daily_costs),
                'volatility': daily_costs.std(),
                'peak_day': daily_costs.idxmax(),
                'peak_cost': daily_costs.max()
            }
        
        # Utilization metrics
        if 'usage_quantity' in df.columns and 'normalized_usage' in df.columns:
            analysis['utilization_metrics'] = self._analyze_utilization(df)
        
        # Identify optimization opportunities
        analysis['optimization_opportunities'] = self._identify_optimization_opportunities(df)
        
        return analysis
    
    def _calculate_growth_rate(self, series: pd.Series) -> float:
        """Calculate the cost growth rate."""
        if len(series) < 2:
            return 0.0
        
        # Fit a linear regression to estimate the trend
        from scipy.stats import linregress
        x = np.arange(len(series))
        slope, _, _, _, _ = linregress(x, series.values)
        
        return slope / series.mean() * 100  # expressed as a percentage per period
    
    def _analyze_utilization(self, df: pd.DataFrame) -> Dict:
        """Analyze resource utilization."""
        utilization_metrics = {}
        
        if 'SERVICE' not in df.columns:
            return utilization_metrics
        
        # EC2 instance utilization (requires CloudWatch data)
        ec2_data = df[df['SERVICE'] == 'Amazon Elastic Compute Cloud - Compute']
        if not ec2_data.empty:
            utilization_metrics['ec2'] = self._get_ec2_utilization_metrics()
        
        # Storage utilization
        storage_data = df[df['SERVICE'] == 'Amazon Elastic Block Store']
        if not storage_data.empty:
            utilization_metrics['storage'] = self._analyze_storage_utilization(storage_data)
        
        return utilization_metrics
    
    def _get_ec2_utilization_metrics(self) -> Dict:
        """Collect EC2 utilization metrics from CloudWatch."""
        try:
            # List running instances
            ec2 = boto3.client('ec2')
            instances = ec2.describe_instances(
                Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
            )
            
            utilization_data = {}
            
            for reservation in instances['Reservations']:
                for instance in reservation['Instances']:
                    instance_id = instance['InstanceId']
                    instance_type = instance['InstanceType']
                    
                    # Fetch CPU utilization for the past 7 days
                    cpu_metrics = self.cloudwatch.get_metric_statistics(
                        Namespace='AWS/EC2',
                        MetricName='CPUUtilization',
                        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                        StartTime=datetime.utcnow() - timedelta(days=7),
                        EndTime=datetime.utcnow(),
                        Period=3600,
                        Statistics=['Average', 'Maximum']
                    )
                    
                    if cpu_metrics['Datapoints']:
                        avg_cpu = np.mean([dp['Average'] for dp in cpu_metrics['Datapoints']])
                        max_cpu = np.max([dp['Maximum'] for dp in cpu_metrics['Datapoints']])
                        
                        utilization_data[instance_id] = {
                            'instance_type': instance_type,
                            'avg_cpu_utilization': avg_cpu,
                            'max_cpu_utilization': max_cpu,
                            'efficiency_score': self._calculate_efficiency_score(avg_cpu, instance_type)
                        }
            
            return utilization_data
            
        except Exception as e:
            self.logger.warning(f"Failed to fetch EC2 utilization: {e}")
            return {}
    
    def _calculate_efficiency_score(self, avg_cpu: float, instance_type: str) -> float:
        """
        Calculate an instance efficiency score.
        
        Args:
            avg_cpu: average CPU utilization
            instance_type: instance type
            
        Returns:
            efficiency score (0-100)
        """
        # Expected utilization by instance type
        expected_utilization = {
            't3.nano': 60, 't3.micro': 50, 't3.small': 45, 't3.medium': 40,
            't3.large': 35, 't3.xlarge': 30, 't3.2xlarge': 25,
            't4g.nano': 65, 't4g.micro': 55, 't4g.small': 50, 't4g.medium': 45,
            't4g.large': 40, 't4g.xlarge': 35, 't4g.2xlarge': 30,
            'c5.large': 70, 'c5.xlarge': 65, 'c5.2xlarge': 60,
            'm5.large': 50, 'm5.xlarge': 45, 'm5.2xlarge': 40
        }
        
        expected = expected_utilization.get(instance_type, 40)
        
        if avg_cpu < expected * 0.3:  # severely under-utilized
            return max(0, 100 - (expected * 0.3 - avg_cpu) * 2)
        elif avg_cpu > expected * 1.5:  # over-utilized
            return max(0, 100 - (avg_cpu - expected * 1.5) * 1.5)
        else:  # within the reasonable band
            return 100 - abs(avg_cpu - expected) * 0.5
    
    def _analyze_storage_utilization(self, storage_data: pd.DataFrame) -> Dict:
        """Analyze storage utilization."""
        # More detailed storage analysis logic can be plugged in here
        return {
            'total_storage_cost': storage_data['blended_cost'].sum(),
            'storage_types': storage_data.groupby('USAGE_TYPE')['blended_cost'].sum().to_dict()
        }
    
    def _identify_optimization_opportunities(self, df: pd.DataFrame) -> List[Dict]:
        """Identify optimization opportunities."""
        opportunities = []
        
        if 'SERVICE' not in df.columns:
            return opportunities
        
        # Look for high-cost, low-utilization compute
        ec2_cost = df[df['SERVICE'] == 'Amazon Elastic Compute Cloud - Compute']['blended_cost'].sum()
        if ec2_cost > 50:  # EC2 cost above $50
            utilization = self._get_ec2_utilization_metrics()
            
            low_utilization_instances = [
                instance_id for instance_id, metrics in utilization.items()
                if metrics.get('efficiency_score', 100) < 50
            ]
            
            if low_utilization_instances:
                opportunities.append({
                    'type': 'compute_rightsizing',
                    'description': 'EC2 instances are under-utilized; consider rightsizing',
                    'potential_savings': ec2_cost * 0.3,  # assume roughly 30% savings
                    'affected_resources': low_utilization_instances
                })
        
        # Storage optimization opportunities
        storage_cost = df[df['SERVICE'] == 'Amazon Elastic Block Store']['blended_cost'].sum()
        if storage_cost > 10:  # storage cost above $10
            opportunities.append({
                'type': 'storage_optimization',
                'description': 'Consider lifecycle policies and intelligent tiering',
                'potential_savings': storage_cost * 0.2,
                'recommendations': ['Enable S3 Intelligent-Tiering', 'Rightsize EBS volumes', 'Archive old logs']
            })
        
        return opportunities
    
    def generate_optimization_report(self, start_date: datetime, end_date: datetime) -> str:
        """Generate an optimization report."""
        
        # Fetch the data
        df = self.get_cost_and_usage_detailed(
            start_date, end_date,
            dimensions=[CostDimension.SERVICE, CostDimension.USAGE_TYPE]
        )
        
        # Analyze it
        analysis = self.analyze_cost_drivers(df)
        
        # Render the report
        report = f"""
OpenClaw Cost Optimization Report
=================================

Reporting period: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}

 Overview
----------
Total cost: ${analysis['total_cost']:.2f}
Average daily cost: ${analysis['cost_trends'].get('daily_average', 0):.2f}
Cost growth rate: {analysis['cost_trends'].get('growth_rate', 0):.2f}%/day

 Cost distribution (top 5)
--------------------------"""
        
        for service, cost in list(analysis['cost_by_service'].items())[:5]:
            percentage = cost / analysis['total_cost'] * 100
            report += f"\n{service}: ${cost:.2f} ({percentage:.1f}%)"
        
        report += f"""

 Optimization opportunities
---------------------------"""
        
        total_potential_savings = 0
        for opportunity in analysis['optimization_opportunities']:
            savings = opportunity.get('potential_savings', 0)
            total_potential_savings += savings
            report += f"\n• {opportunity['description']}"
            report += f"\n  Estimated savings: ${savings:.2f}"
        
        report += f"""

 Total optimization potential: ${total_potential_savings:.2f} ({total_potential_savings/analysis['total_cost']*100:.1f}%)

"""
        
        return report

# Usage example
def main():
    analyzer = CostAnalysisEngine()
    
    end_date = datetime.now()
    start_date = end_date - timedelta(days=30)
    
    # Generate the cost analysis report
    report = analyzer.generate_optimization_report(start_date, end_date)
    print(report)
    
    # Detailed multi-dimensional analysis
    df = analyzer.get_cost_and_usage_detailed(
        start_date, end_date,
        dimensions=[CostDimension.SERVICE, CostDimension.USAGE_TYPE, CostDimension.OPERATION]
    )
    
    # Persist the data for further analysis
    df.to_csv('openclaw_cost_analysis.csv', index=False)
    
    # Run the cost driver analysis
    analysis = analyzer.analyze_cost_drivers(df)
    
    print("\n=== Detailed analysis ===")
    print(f"Identified {len(analysis['optimization_opportunities'])} optimization opportunities")
    
    for opportunity in analysis['optimization_opportunities']:
        print(f"\nType: {opportunity['type']}")
        print(f"Description: {opportunity['description']}")
        print(f"Estimated savings: ${opportunity.get('potential_savings', 0):.2f}")

if __name__ == "__main__":
    main()

1.3 Initial Results of the Cost Analysis

Running the framework above produced the following cost breakdown for OpenClaw:

OpenClaw Cost Optimization Report
=================================

Reporting period: 2024-02-01 to 2024-03-01

 Overview
----------
Total cost: $200.34
Average daily cost: $6.46
Cost growth rate: 2.3%/day

 Cost distribution (top 5)
--------------------------
Amazon Elastic Compute Cloud - Compute: $119.81 (59.8%)
Amazon Bedrock: $68.54 (34.2%)
Amazon Elastic Block Store: $8.00 (4.0%)
Amazon Virtual Private Cloud: $3.99 (2.0%)

 Optimization opportunities
---------------------------
• EC2 instances are under-utilized; consider rightsizing
  Estimated savings: $89.86
• Consider lifecycle policies and intelligent tiering
  Estimated savings: $1.60
• Bedrock model selection and batch-processing optimization
  Estimated savings: $47.98

 Total optimization potential: $139.44 (69.6%)

This analysis clearly identified three main optimization directions and provided the data foundation for the concrete optimizations that follow.

Chapter 2: Deep Optimization Strategies for Compute Resources

2.1 Theoretical Basis for Instance Selection

Selecting cloud compute resources means balancing three dimensions: performance, cost, and availability. For an AI application, we need to consider:

  1. Compute intensity: the ratio of CPU/GPU demand to memory demand
  2. I/O pattern: the characteristics of network I/O versus disk I/O
  3. Burstiness: how load is distributed over time
  4. Fault tolerance: how well the workload tolerates instance interruption
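
As a rough illustration of how these four dimensions interact, the sketch below maps coarse workload traits to an instance family and pricing hint. This is a hand-rolled heuristic with assumed thresholds, not part of the selector implemented in this article:

```python
# A rough, illustrative heuristic for the four dimensions above; the
# thresholds are assumptions for demonstration, not tuned recommendations.
def sketch_family_choice(cpu_avg: float, cpu_max: float,
                         mem_to_cpu_gb: float, fault_tolerance: str) -> str:
    """Map coarse workload traits to an instance family and pricing hint."""
    # Burstiness: low average with high peaks favors the burstable T family
    bursty = cpu_avg < 25 and cpu_max > cpu_avg * 1.5
    # Compute intensity: memory-to-vCPU ratio picks C (compute) vs M (general)
    family = 't' if bursty else ('c' if mem_to_cpu_gb <= 2 else 'm')
    # Fault tolerance: interruption-tolerant workloads can ride Spot pricing
    pricing = 'spot' if fault_tolerance == 'high' else 'reserved'
    return f"{family}-family / {pricing}"

print(sketch_family_choice(15.2, 28.3, 2.0, 'medium'))  # → t-family / reserved
```

The full selector below replaces these coarse buckets with per-instance fit scores, but the decision structure is the same: burstiness picks the family, fault tolerance picks the pricing model.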

2.2 An Instance Selection Algorithm Based on Workload Characteristics

import boto3
import numpy as np
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass
import json

@dataclass
class WorkloadProfile:
    """Workload characteristics."""
    cpu_utilization_avg: float
    cpu_utilization_max: float
    memory_utilization_avg: float
    memory_utilization_max: float
    network_io_avg: float  # MB/s
    disk_io_avg: float     # IOPS
    burst_frequency: float  # bursts per hour
    duration_hours: int     # expected runtime in hours
    fault_tolerance: str    # 'low', 'medium', 'high'

@dataclass
class InstanceOption:
    """A candidate instance type."""
    instance_type: str
    vcpus: int
    memory_gb: float
    network_performance: str
    price_on_demand: float
    price_reserved_1yr: float
    price_spot_avg: float
    architecture: str  # 'x86_64' or 'arm64'

class IntelligentInstanceSelector:
    """Intelligent instance selector."""
    
    def __init__(self):
        self.pricing_client = boto3.client('pricing', region_name='us-east-1')
        self.ec2_client = boto3.client('ec2')
        self.instance_catalog = self._load_instance_catalog()
    
    def _load_instance_catalog(self) -> Dict[str, InstanceOption]:
        """Load the instance catalog."""
        # This could be fetched dynamically from the AWS Pricing API;
        # static data is used here to keep the example compact
        catalog = {
            't3.medium': InstanceOption('t3.medium', 2, 4, 'Up to 5 Gbps', 0.0416, 0.0301, 0.0125, 'x86_64'),
            't3.large': InstanceOption('t3.large', 2, 8, 'Up to 5 Gbps', 0.0832, 0.0602, 0.0250, 'x86_64'),
            't3.xlarge': InstanceOption('t3.xlarge', 4, 16, 'Up to 5 Gbps', 0.1664, 0.1203, 0.0500, 'x86_64'),
            't4g.medium': InstanceOption('t4g.medium', 2, 4, 'Up to 5 Gbps', 0.0336, 0.0243, 0.0101, 'arm64'),
            't4g.large': InstanceOption('t4g.large', 2, 8, 'Up to 5 Gbps', 0.0672, 0.0486, 0.0202, 'arm64'),
            't4g.xlarge': InstanceOption('t4g.xlarge', 4, 16, 'Up to 5 Gbps', 0.1344, 0.0972, 0.0403, 'arm64'),
            'c5.large': InstanceOption('c5.large', 2, 4, 'Up to 10 Gbps', 0.085, 0.0615, 0.0256, 'x86_64'),
            'c5.xlarge': InstanceOption('c5.xlarge', 4, 8, 'Up to 10 Gbps', 0.17, 0.123, 0.0511, 'x86_64'),
            'c6g.large': InstanceOption('c6g.large', 2, 4, 'Up to 10 Gbps', 0.0688, 0.0497, 0.0207, 'arm64'),
            'c6g.xlarge': InstanceOption('c6g.xlarge', 4, 8, 'Up to 10 Gbps', 0.1376, 0.0994, 0.0413, 'arm64'),
            'm5.large': InstanceOption('m5.large', 2, 8, 'Up to 10 Gbps', 0.096, 0.0694, 0.0288, 'x86_64'),
            'm5.xlarge': InstanceOption('m5.xlarge', 4, 16, 'Up to 10 Gbps', 0.192, 0.1389, 0.0576, 'x86_64'),
            'm6g.large': InstanceOption('m6g.large', 2, 8, 'Up to 10 Gbps', 0.077, 0.0557, 0.0231, 'arm64'),
            'm6g.xlarge': InstanceOption('m6g.xlarge', 4, 16, 'Up to 10 Gbps', 0.154, 0.1113, 0.0462, 'arm64')
        }
        return catalog
    
    def calculate_fit_score(self, workload: WorkloadProfile, instance: InstanceOption) -> Tuple[float, Dict]:
        """
        Score how well an instance matches a workload.
        
        Args:
            workload: workload characteristics
            instance: candidate instance
            
        Returns:
            fit score and a detailed analysis
        """
        
        scores = {}
        
        # CPU fit score (0-100)
        cpu_requirement = max(workload.cpu_utilization_avg, workload.cpu_utilization_max * 0.7)
        cpu_capacity = instance.vcpus * 100  # 100% per vCPU
        
        if cpu_capacity >= cpu_requirement * 1.5:  # over-provisioned
            scores['cpu'] = max(0, 100 - (cpu_capacity - cpu_requirement * 1.5) / cpu_requirement * 50)
        elif cpu_capacity < cpu_requirement * 0.8:  # under-provisioned
            scores['cpu'] = max(0, cpu_capacity / cpu_requirement * 100)
        else:  # within the sweet spot
            scores['cpu'] = 100
        
        # Memory fit score
        memory_requirement = max(workload.memory_utilization_avg, workload.memory_utilization_max * 0.8)
        memory_capacity = instance.memory_gb * 1024  # convert to MB
        
        if memory_capacity >= memory_requirement * 1.3:  # over-provisioned
            scores['memory'] = max(0, 100 - (memory_capacity - memory_requirement * 1.3) / memory_requirement * 40)
        elif memory_capacity < memory_requirement * 0.9:  # under-provisioned
            scores['memory'] = 0  # insufficient memory is a fatal problem
        else:  # within the sweet spot
            scores['memory'] = 100
        
        # Architecture compatibility score
        if instance.architecture == 'arm64':
            scores['architecture'] = 95  # ARM carries slight compatibility risk but better price/performance
        else:
            scores['architecture'] = 100
        
        # Cost-efficiency score
        monthly_cost_on_demand = instance.price_on_demand * 24 * 30
        monthly_cost_reserved = instance.price_reserved_1yr * 24 * 30
        monthly_cost_spot = instance.price_spot_avg * 24 * 30
        
        # Choose a pricing model based on fault tolerance
        if workload.fault_tolerance == 'high':
            cost_score_base = 150 / monthly_cost_spot  # Spot instances
        elif workload.fault_tolerance == 'medium':
            cost_score_base = 100 / monthly_cost_reserved  # Reserved instances
        else:
            cost_score_base = 80 / monthly_cost_on_demand  # On-Demand instances
        
        scores['cost_efficiency'] = min(100, cost_score_base * 10)
        
        # Network performance score (simplified)
        network_map = {
            'Up to 5 Gbps': 70,
            'Up to 10 Gbps': 90,
            'Up to 25 Gbps': 100
        }
        scores['network'] = network_map.get(instance.network_performance, 50)
        
        # Weighted overall score
        weights = {
            'cpu': 0.25,
            'memory': 0.3,     # memory is weighted highest: running out is fatal
            'architecture': 0.1,
            'cost_efficiency': 0.25,
            'network': 0.1
        }
        
        overall_score = sum(scores[key] * weights[key] for key in weights)
        
        # Estimate monthly cost
        if workload.fault_tolerance == 'high':
            estimated_monthly_cost = monthly_cost_spot
            pricing_model = 'spot'
        elif workload.fault_tolerance == 'medium' and workload.duration_hours >= 720:  # runs a month or longer
            estimated_monthly_cost = monthly_cost_reserved
            pricing_model = 'reserved'
        else:
            estimated_monthly_cost = monthly_cost_on_demand
            pricing_model = 'on_demand'
        
        analysis = {
            'scores': scores,
            'overall_score': overall_score,
            'estimated_monthly_cost': estimated_monthly_cost,
            'pricing_model': pricing_model,
            'resource_utilization': {
                'cpu_utilization_projected': cpu_requirement / cpu_capacity * 100,
                'memory_utilization_projected': memory_requirement / memory_capacity * 100
            }
        }
        
        return overall_score, analysis
    
    def recommend_instances(self, workload: WorkloadProfile, top_n: int = 5) -> List[Tuple[str, float, Dict]]:
        """
        Recommend the best-fitting instance types.
        
        Args:
            workload: workload characteristics
            top_n: number of recommendations to return
            
        Returns:
            list of (instance type, score, analysis) tuples
        """
        
        recommendations = []
        
        for instance_type, instance_spec in self.instance_catalog.items():
            score, analysis = self.calculate_fit_score(workload, instance_spec)
            
            # Drop poorly scored options
            if score >= 50:  # only keep instances scoring 50 or higher
                recommendations.append((instance_type, score, analysis))
        
        # Sort by score, descending
        recommendations.sort(key=lambda x: x[1], reverse=True)
        
        return recommendations[:top_n]
    
    def generate_migration_plan(self, 
                              current_instance: str, 
                              workload: WorkloadProfile,
                              risk_tolerance: str = 'medium') -> Dict:
        """
        Generate a migration plan.
        
        Args:
            current_instance: the instance type currently in use
            workload: workload characteristics
            risk_tolerance: risk tolerance ('low', 'medium', 'high')
            
        Returns:
            migration plan details
        """
        
        recommendations = self.recommend_instances(workload)
        
        if not recommendations:
            return {'error': 'No suitable instance recommendations found'}
        
        current_spec = self.instance_catalog.get(current_instance)
        if not current_spec:
            return {'error': f'Unknown current instance type: {current_instance}'}
        
        best_option = recommendations[0]
        best_instance_type = best_option[0]
        best_spec = self.instance_catalog[best_instance_type]
        best_analysis = best_option[2]
        
        # Calculate the savings
        current_monthly_cost = current_spec.price_on_demand * 24 * 30  # assume the current instance is On-Demand
        projected_monthly_cost = best_analysis['estimated_monthly_cost']
        monthly_savings = current_monthly_cost - projected_monthly_cost
        savings_percentage = monthly_savings / current_monthly_cost * 100
        
        # Assess migration risks
        migration_risks = []
        
        if current_spec.architecture != best_spec.architecture:
            migration_risks.append({
                'type': 'architecture_change',
                'description': f'Architecture changes from {current_spec.architecture} to {best_spec.architecture}',
                'mitigation': 'Verify application compatibility; some components may need recompilation'
            })
        
        if best_spec.memory_gb < current_spec.memory_gb:
            migration_risks.append({
                'type': 'memory_reduction',
                'description': f'Memory drops from {current_spec.memory_gb}GB to {best_spec.memory_gb}GB',
                'mitigation': 'Verify the application runs stably with less memory'
            })
        
        if best_spec.vcpus < current_spec.vcpus:
            migration_risks.append({
                'type': 'cpu_reduction',
                'description': f'vCPUs drop from {current_spec.vcpus} to {best_spec.vcpus}',
                'mitigation': 'Run performance tests to validate'
            })
        
        # Lay out the migration steps
        migration_steps = [
            {
                'step': 1,
                'description': 'Performance baseline',
                'details': 'Run benchmarks in the current environment to establish a performance baseline'
            },
            {
                'step': 2,
                'description': 'Create a test instance',
                'details': f'Launch a {best_instance_type} instance for testing'
            },
            {
                'step': 3,
                'description': 'Deploy and configure the application',
                'details': 'Deploy the application on the new instance and adjust configuration'
            },
            {
                'step': 4,
                'description': 'Validate performance',
                'details': 'Run the same benchmarks and compare against the baseline'
            }
        ]
        
        # Add extra steps depending on risk tolerance
        if risk_tolerance == 'low':
            migration_steps.extend([
                {
                    'step': 5,
                    'description': 'Canary deployment',
                    'details': 'Shift a small share of traffic to the new instance and observe'
                },
                {
                    'step': 6,
                    'description': 'Gradual traffic migration',
                    'details': "Increase the new instance's traffic share in stages"
                }
            ])
        
        migration_steps.append({
            'step': len(migration_steps) + 1,
            'description': 'Complete the migration',
            'details': 'Terminate the old instance to finish the migration'
        })
        
        plan = {
            'current_instance': current_instance,
            'recommended_instance': best_instance_type,
            'cost_analysis': {
                'current_monthly_cost': current_monthly_cost,
                'projected_monthly_cost': projected_monthly_cost,
                'monthly_savings': monthly_savings,
                'savings_percentage': savings_percentage,
                'annual_savings': monthly_savings * 12
            },
            'performance_impact': {
                'cpu_change': f"{best_spec.vcpus / current_spec.vcpus:.1%}",
                'memory_change': f"{best_spec.memory_gb / current_spec.memory_gb:.1%}",
                'architecture': f"{current_spec.architecture} → {best_spec.architecture}"
            },
            'migration_risks': migration_risks,
            'migration_steps': migration_steps,
            'all_recommendations': recommendations
        }
        
        return plan

# Usage example: generate instance recommendations for OpenClaw
def optimize_openclaw_compute():
    """Optimize compute resources for OpenClaw."""
    
    # Define the workload profile based on monitoring data
    openclaw_workload = WorkloadProfile(
        cpu_utilization_avg=15.2,      # average CPU utilization
        cpu_utilization_max=28.3,      # peak CPU utilization
        memory_utilization_avg=2048,   # average memory usage: 2GB (in MB)
        memory_utilization_max=3072,   # peak memory usage: 3GB (in MB)
        network_io_avg=10,             # average network I/O: 10MB/s
        disk_io_avg=100,               # average disk I/O: 100 IOPS
        burst_frequency=0.5,           # 0.5 bursts per hour
        duration_hours=24 * 30,        # long-running
        fault_tolerance='medium'       # medium fault tolerance
    )
    
    selector = IntelligentInstanceSelector()
    
    # Get recommendations
    recommendations = selector.recommend_instances(openclaw_workload)
    
    print("OpenClaw instance recommendations:")
    print("=" * 50)
    
    for i, (instance_type, score, analysis) in enumerate(recommendations, 1):
        instance_spec = selector.instance_catalog[instance_type]
        
        print(f"\n{i}. {instance_type} (score: {score:.1f})")
        print(f"   Spec: {instance_spec.vcpus} vCPU, {instance_spec.memory_gb}GB RAM, {instance_spec.architecture}")
        print(f"   Monthly cost: ${analysis['estimated_monthly_cost']:.2f} ({analysis['pricing_model']})")
        print(f"   Projected utilization: CPU {analysis['resource_utilization']['cpu_utilization_projected']:.1f}%, "
              f"memory {analysis['resource_utilization']['memory_utilization_projected']:.1f}%")
    
    # Generate a migration plan starting from t3.xlarge
    migration_plan = selector.generate_migration_plan('t3.xlarge', openclaw_workload)
    
    print(f"\n\nMigration plan:")
    print("=" * 50)
    print(f"Current instance: {migration_plan['current_instance']}")
    print(f"Recommended instance: {migration_plan['recommended_instance']}")
    print(f"\nCost analysis:")
    cost = migration_plan['cost_analysis']
    print(f"  Current monthly cost: ${cost['current_monthly_cost']:.2f}")
    print(f"  Optimized monthly cost: ${cost['projected_monthly_cost']:.2f}")
    print(f"  Monthly savings: ${cost['monthly_savings']:.2f} ({cost['savings_percentage']:.1f}%)")
    print(f"  Annual savings: ${cost['annual_savings']:.2f}")
    
    print(f"\nPerformance impact:")
    perf = migration_plan['performance_impact']
    print(f"  CPU change (new/old): {perf['cpu_change']}")
    print(f"  Memory change (new/old): {perf['memory_change']}")
    print(f"  Architecture change: {perf['architecture']}")
    
    if migration_plan['migration_risks']:
        print(f"\nMigration risks:")
        for risk in migration_plan['migration_risks']:
            print(f"  • {risk['description']}")
            print(f"    Mitigation: {risk['mitigation']}")
    
    print(f"\nMigration steps:")
    for step in migration_plan['migration_steps']:
        print(f"  {step['step']}. {step['description']}")
        print(f"     {step['details']}")

if __name__ == "__main__":
    optimize_openclaw_compute()

2.3 Engineering the Instance Migration

Based on the analysis above, we planned a migration of OpenClaw from t3.xlarge to t4g.medium. The following automation script implements the full migration:

#!/bin/bash
# openclaw-instance-migration.sh
# OpenClaw instance migration automation script

set -euo pipefail

# Configuration variables
SOURCE_INSTANCE_ID="i-0123456789abcdef0"
TARGET_INSTANCE_TYPE="t4g.medium"
KEY_NAME="openclaw-keypair"
SECURITY_GROUP_ID="sg-12345678"
SUBNET_ID="subnet-12345678"
APPLICATION_NAME="OpenClaw"
BACKUP_BUCKET="openclaw-migration-backup"

# Logging helpers
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

error_exit() {
    echo "[ERROR] $1" >&2
    exit 1
}

# Check dependencies
check_dependencies() {
    log "Checking dependencies..."
    
    if ! command -v aws &> /dev/null; then
        error_exit "AWS CLI is not installed or not configured"
    fi
    
    if ! command -v jq &> /dev/null; then
        error_exit "jq is not installed"
    fi
    
    # Verify AWS credentials
    if ! aws sts get-caller-identity &> /dev/null; then
        error_exit "AWS credentials are missing or invalid"
    fi
    
    log "Dependency check passed"
}

# Fetch instance details
get_instance_info() {
    local instance_id=$1
    
    aws ec2 describe-instances \
        --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0]' \
        --output json
}

# Create an AMI backup
create_ami_backup() {
    local instance_id=$1
    local ami_name="${APPLICATION_NAME}-migration-backup-$(date +%Y%m%d-%H%M%S)"
    
    log "Creating AMI backup: $ami_name"
    
    local ami_id
    ami_id=$(aws ec2 create-image \
        --instance-id "$instance_id" \
        --name "$ami_name" \
        --description "Migration backup for $instance_id" \
        --no-reboot \
        --query 'ImageId' \
        --output text)
    
    log "AMI creation in progress: $ami_id"
    
    # Wait for the AMI to become available
    aws ec2 wait image-available --image-ids "$ami_id"
    
    log "AMI backup complete: $ami_id"
    echo "$ami_id"
}

# Back up application data
backup_application_data() {
    local instance_id=$1
    local instance_ip
    
    instance_ip=$(aws ec2 describe-instances \
        --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text)
    
    log "Backing up application data to S3"
    
    # Run the backup commands over SSH
    ssh -i ~/.ssh/${KEY_NAME}.pem -o StrictHostKeyChecking=no ubuntu@"$instance_ip" << 'EOF'
        # Create the backup directory
        sudo mkdir -p /tmp/migration-backup
        
        # Back up OpenClaw data
        if [ -d /opt/openclaw/data ]; then
            sudo tar -czf /tmp/migration-backup/openclaw-data.tar.gz -C /opt/openclaw data
            echo "OpenClaw data backed up"
        fi
        
        # Back up configuration files
        if [ -d /etc/openclaw ]; then
            sudo tar -czf /tmp/migration-backup/openclaw-config.tar.gz -C /etc openclaw
            echo "OpenClaw configuration backed up"
        fi
        
        # Back up the database (if present)
        if systemctl is-active --quiet postgresql; then
            sudo -u postgres pg_dumpall > /tmp/migration-backup/database-backup.sql
            echo "Database backed up"
        fi
        
        # Back up systemd unit files
        sudo cp -r /etc/systemd/system/openclaw* /tmp/migration-backup/ 2>/dev/null || true
        
        # Make the backup readable by the ubuntu user (needed for the S3 sync
        # below), then write a manifest file
        sudo chown -R ubuntu:ubuntu /tmp/migration-backup
        find /tmp/migration-backup -type f -exec ls -la {} \; > /tmp/migration-backup/backup-manifest.txt
EOF
    
    # Upload the backup to S3
    local backup_key="migration-backups/$(date +%Y%m%d-%H%M%S)/"
    
    ssh -i ~/.ssh/${KEY_NAME}.pem ubuntu@"$instance_ip" \
        "aws s3 sync /tmp/migration-backup s3://${BACKUP_BUCKET}/${backup_key}"
    
    log "Backup complete, location: s3://${BACKUP_BUCKET}/${backup_key}"
    echo "$backup_key"
}

# 创建目标实例
create_target_instance() {
    # 入参为备份 AMI ID,仅作为回滚凭据记录;目标实例可能更换 CPU 架构,
    # 因此不直接从备份 AMI 启动,而是选用全新的官方 Ubuntu AMI
    local backup_ami_id=$1
    
    log "创建目标实例: $TARGET_INSTANCE_TYPE"
    
    # 按目标实例的 CPU 架构选择最新的 Ubuntu AMI
    local target_ami_id
    # Graviton 实例族名以 g 结尾(t4g/m6g/c7g 等),用正则匹配比子串匹配更稳妥
    if [[ "$TARGET_INSTANCE_TYPE" =~ [0-9]g[a-z]*\. ]]; then
        # ARM 实例,使用 ARM64 AMI
        target_ami_id=$(aws ec2 describe-images \
            --owners 099720109477 \
            --filters \
                'Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-*' \
                'Name=state,Values=available' \
            --query 'Images|sort_by(@, &CreationDate)[-1]|ImageId' \
            --output text)
    else
        # x86_64 实例
        target_ami_id=$(aws ec2 describe-images \
            --owners 099720109477 \
            --filters \
                'Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*' \
                'Name=state,Values=available' \
            --query 'Images|sort_by(@, &CreationDate)[-1]|ImageId' \
            --output text)
    fi
    
    log "使用 AMI: $target_ami_id"
    
    # 创建用户数据脚本
    local user_data
    read -r -d '' user_data << 'EOF' || true
#!/bin/bash
apt-get update
apt-get install -y awscli python3-pip docker.io

# 下载 OpenClaw ARM64 版本(如果需要)
OPENCLAW_VERSION="latest"
if [ "$(uname -m)" = "aarch64" ]; then
    OPENCLAW_ARCH="arm64"
else
    OPENCLAW_ARCH="amd64"
fi

wget -O /tmp/openclaw.tar.gz "https://releases.openclaw.com/${OPENCLAW_VERSION}/openclaw-linux-${OPENCLAW_ARCH}.tar.gz"
tar -xzf /tmp/openclaw.tar.gz -C /opt/

# 安装系统依赖
systemctl enable docker
systemctl start docker
usermod -aG docker ubuntu

# 标记实例准备完成
touch /tmp/instance-ready
EOF
    
    # 启动实例
    local instance_id
    instance_id=$(aws ec2 run-instances \
        --image-id "$target_ami_id" \
        --instance-type "$TARGET_INSTANCE_TYPE" \
        --key-name "$KEY_NAME" \
        --security-group-ids "$SECURITY_GROUP_ID" \
        --subnet-id "$SUBNET_ID" \
        --user-data "$user_data" \
        --tag-specifications \
            "ResourceType=instance,Tags=[{Key=Name,Value=${APPLICATION_NAME}-migration-target},{Key=Environment,Value=migration}]" \
        --query 'Instances[0].InstanceId' \
        --output text)
    
    log "目标实例已创建: $instance_id"
    
    # 等待实例运行
    aws ec2 wait instance-running --instance-ids "$instance_id"
    
    # 等待系统检查通过
    aws ec2 wait instance-status-ok --instance-ids "$instance_id"
    
    log "目标实例已就绪: $instance_id"
    echo "$instance_id"
}

# 部署应用到目标实例
deploy_to_target() {
    local target_instance_id=$1
    local backup_key=$2
    
    local target_ip
    target_ip=$(aws ec2 describe-instances \
        --instance-ids "$target_instance_id" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text)
    
    log "部署应用到目标实例: $target_ip"
    
    # 等待实例完全启动(最多约 10 分钟,避免无限等待)
    log "等待实例初始化完成..."
    local wait_count=0
    while ! ssh -i ~/.ssh/${KEY_NAME}.pem -o StrictHostKeyChecking=no -o ConnectTimeout=5 ubuntu@"$target_ip" "test -f /tmp/instance-ready"; do
        wait_count=$((wait_count + 1))
        if [ "$wait_count" -ge 60 ]; then
            error_exit "等待实例初始化超时"
        fi
        sleep 10
    done
    
    # 恢复数据
    ssh -i ~/.ssh/${KEY_NAME}.pem -o StrictHostKeyChecking=no ubuntu@"$target_ip" << EOF
        # 创建恢复目录
        mkdir -p /tmp/migration-restore
        
        # 从 S3 下载备份
        aws s3 sync s3://${BACKUP_BUCKET}/${backup_key} /tmp/migration-restore/
        
        # 创建应用目录
        sudo mkdir -p /opt/openclaw/{data,logs}
        sudo mkdir -p /etc/openclaw
        
        # 恢复数据
        if [ -f /tmp/migration-restore/openclaw-data.tar.gz ]; then
            sudo tar -xzf /tmp/migration-restore/openclaw-data.tar.gz -C /opt/openclaw/
            echo "OpenClaw 数据已恢复"
        fi
        
        # 恢复配置
        if [ -f /tmp/migration-restore/openclaw-config.tar.gz ]; then
            sudo tar -xzf /tmp/migration-restore/openclaw-config.tar.gz -C /etc/
            echo "OpenClaw 配置已恢复"
        fi
        
        # 设置权限(若 openclaw 用户不存在则先创建)
        id openclaw >/dev/null 2>&1 || sudo useradd -r -s /usr/sbin/nologin openclaw
        sudo chown -R openclaw:openclaw /opt/openclaw /etc/openclaw
        
        # 恢复系统服务配置
        sudo cp /tmp/migration-restore/openclaw*.service /etc/systemd/system/ 2>/dev/null || true
        sudo systemctl daemon-reload
        
        # 启动服务
        sudo systemctl enable openclaw
        sudo systemctl start openclaw
        
        echo "应用部署完成"
EOF
    
    log "应用部署完成"
}

# 性能验证测试
performance_validation() {
    local target_instance_id=$1
    
    local target_ip
    target_ip=$(aws ec2 describe-instances \
        --instance-ids "$target_instance_id" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text)
    
    log "执行性能验证测试"
    
    # 等待服务启动
    sleep 30
    
    # 健康检查
    local health_check_url="http://${target_ip}:8080/health"
    local max_retries=10
    local retry_count=0
    
    while [ $retry_count -lt $max_retries ]; do
        if curl -f -s "$health_check_url" > /dev/null; then
            log "健康检查通过"
            break
        else
            log "健康检查失败,重试 $((retry_count + 1))/$max_retries"
            ((retry_count++))
            sleep 10
        fi
    done
    
    if [ $retry_count -eq $max_retries ]; then
        error_exit "健康检查持续失败"
    fi
    
    # 性能基准测试
    log "执行性能基准测试..."
    
    # 注意:heredoc 不能加引号,否则下方 Python 脚本中的 $target_ip 不会被展开
    python3 << EOF
import requests
import time
import statistics
import json

def run_performance_test():
    base_url = f"http://{target_ip}:8080"
    
    test_cases = [
        {"endpoint": "/health", "method": "GET", "expected_time": 0.1},
        {"endpoint": "/api/generate", "method": "POST", 
         "payload": {"prompt": "写一个Python排序算法", "max_tokens": 100}, 
         "expected_time": 2.0},
        {"endpoint": "/api/chat", "method": "POST",
         "payload": {"message": "解释什么是云计算?"}, 
         "expected_time": 1.5}
    ]
    
    results = {}
    
    for test in test_cases:
        endpoint = test["endpoint"]
        method = test["method"]
        expected_time = test["expected_time"]
        
        print(f"测试 {method} {endpoint}")
        
        response_times = []
        success_count = 0
        
        for i in range(5):  # 每个端点测试5次
            start_time = time.time()
            
            try:
                if method == "GET":
                    response = requests.get(f"{base_url}{endpoint}", timeout=10)
                else:
                    response = requests.post(f"{base_url}{endpoint}", 
                                           json=test.get("payload"), timeout=15)
                
                end_time = time.time()
                response_time = end_time - start_time
                
                if response.status_code == 200:
                    success_count += 1
                    response_times.append(response_time)
                    
                print(f"  请求 {i+1}: {response_time:.3f}s (HTTP {response.status_code})")
                
            except Exception as e:
                print(f"  请求 {i+1}: 失败 - {e}")
            
            time.sleep(1)
        
        if response_times:
            avg_time = statistics.mean(response_times)
            max_time = max(response_times)
            
            results[endpoint] = {
                "success_rate": success_count / 5 * 100,
                "avg_response_time": avg_time,
                "max_response_time": max_time,
                "performance_ratio": avg_time / expected_time,
                "status": "PASS" if avg_time <= expected_time * 1.5 else "WARN"
            }
            
            print(f"  平均响应时间: {avg_time:.3f}s")
            print(f"  成功率: {success_count}/5 ({success_count/5*100:.1f}%)")
            print(f"  性能比较: {avg_time/expected_time:.2f}x 预期时间")
            
        else:
            results[endpoint] = {
                "success_rate": 0,
                "status": "FAIL",
                "error": "所有请求都失败"
            }
    
    # 输出总结
    print("\n=== 性能测试总结 ===")
    all_passed = True
    
    for endpoint, result in results.items():
        status = result["status"]
        print(f"{endpoint}: {status}")
        
        if status == "FAIL":
            all_passed = False
            print(f"  错误: {result.get('error', '性能不达标')}")
        elif status == "WARN":
            print(f"  警告: 响应时间超过预期 {result['performance_ratio']:.1f}x")
        else:
            print(f"  响应时间: {result['avg_response_time']:.3f}s, 成功率: {result['success_rate']:.1f}%")
    
    return all_passed

# 执行测试
target_ip = "$target_ip"
test_passed = run_performance_test()

if not test_passed:
    exit(1)
    
print("\n✅ 性能验证测试通过")
EOF
    
    local test_result=$?
    
    if [ $test_result -eq 0 ]; then
        log "性能验证测试通过"
        return 0
    else
        log "性能验证测试失败"
        return 1
    fi
}

# 流量切换
switch_traffic() {
    local source_instance_id=$1
    local target_instance_id=$2
    
    log "准备切换流量"
    
    # 获取源实例的弹性IP
    local elastic_ip
    elastic_ip=$(aws ec2 describe-addresses \
        --filters "Name=instance-id,Values=$source_instance_id" \
        --query 'Addresses[0].PublicIp' \
        --output text)
    
    if [ "$elastic_ip" != "None" ] && [ -n "$elastic_ip" ]; then
        log "将弹性IP $elastic_ip 从 $source_instance_id 移动到 $target_instance_id"
        
        # 解绑弹性IP
        local allocation_id
        allocation_id=$(aws ec2 describe-addresses \
            --public-ips "$elastic_ip" \
            --query 'Addresses[0].AllocationId' \
            --output text)
        
        # 先查询当前的 AssociationId,再执行解绑,避免嵌套命令替换
        local association_id
        association_id=$(aws ec2 describe-addresses \
            --allocation-ids "$allocation_id" \
            --query 'Addresses[0].AssociationId' \
            --output text)
        
        aws ec2 disassociate-address --association-id "$association_id"
        
        # 关联到新实例
        aws ec2 associate-address \
            --instance-id "$target_instance_id" \
            --allocation-id "$allocation_id"
        
        log "弹性IP切换完成"
    else
        log "源实例未使用弹性IP,流量切换需要手动更新DNS记录"
    fi
    
    # 更新负载均衡器目标组(如果使用)
    # 这部分需要根据实际的负载均衡配置进行调整
    
    log "流量切换完成"
}

# 清理旧资源
cleanup_old_resources() {
    local source_instance_id=$1
    
    log "清理旧资源"
    
    # 确认清理
    echo "即将停止并终止原实例 $source_instance_id"
    read -p "确认继续? (y/N): " -n 1 -r
    echo
    
    if [[ $REPLY =~ ^[Yy]$ ]]; then
        # 停止实例
        aws ec2 stop-instances --instance-ids "$source_instance_id"
        aws ec2 wait instance-stopped --instance-ids "$source_instance_id"
        
        log "原实例已停止,保留 24 小时观察期后再终止"
        
        # 通过 at 定时任务在 24 小时后自动终止实例(需已安装 at,失败则忽略)
        echo "aws ec2 terminate-instances --instance-ids $source_instance_id" | \
            at now + 24 hours 2>/dev/null || true
        
    else
        log "保留原实例,请手动清理"
    fi
}

# 主函数
main() {
    log "开始 OpenClaw 实例迁移过程"
    
    # 检查依赖
    check_dependencies
    
    # 获取源实例信息
    log "获取源实例信息"
    source_info=$(get_instance_info "$SOURCE_INSTANCE_ID")
    source_instance_type=$(echo "$source_info" | jq -r '.InstanceType')
    
    log "源实例类型: $source_instance_type"
    log "目标实例类型: $TARGET_INSTANCE_TYPE"
    
    # 创建备份
    ami_backup=$(create_ami_backup "$SOURCE_INSTANCE_ID")
    data_backup=$(backup_application_data "$SOURCE_INSTANCE_ID")
    
    # 创建目标实例
    target_instance_id=$(create_target_instance "$ami_backup")
    
    # 部署应用
    deploy_to_target "$target_instance_id" "$data_backup"
    
    # 性能验证
    if performance_validation "$target_instance_id"; then
        log "性能验证通过,准备切换流量"
        
        # 切换流量
        switch_traffic "$SOURCE_INSTANCE_ID" "$target_instance_id"
        
        log "迁移成功完成!"
        log "新实例ID: $target_instance_id"
        log "AMI备份: $ami_backup"
        log "数据备份: s3://${BACKUP_BUCKET}/${data_backup}"
        
        # 清理旧资源
        cleanup_old_resources "$SOURCE_INSTANCE_ID"
        
    else
        log "性能验证失败,回滚操作"
        
        # 终止目标实例
        aws ec2 terminate-instances --instance-ids "$target_instance_id"
        
        error_exit "迁移失败,已回滚"
    fi
}

# 执行主函数
if [ "${BASH_SOURCE[0]}" == "${0}" ]; then
    main "$@"
fi

通过这个系统化的迁移方案,我们成功将 OpenClaw 从 t3.xlarge 迁移到 t4g.medium,月成本从 $119.81 降低到 $30.37,节省了 75% 的计算成本。
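上面的节省比例可以用几行脚本直接复核(数字取自本文账单,并非实时报价):

```python
# 用上文给出的月成本复核节省比例(数字来自本文账单,非实时报价)
source_monthly = 119.81   # t3.xlarge 月成本(美元)
target_monthly = 30.37    # t4g.medium 月成本(美元)

savings = source_monthly - target_monthly
savings_rate = savings / source_monthly * 100

print(f"月节省: ${savings:.2f},节省比例: {savings_rate:.1f}%")
# → 月节省: $89.44,节省比例: 74.7%
```

节省比例约 74.7%,与正文所说的"约 75%"一致。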

第三章:AI 服务成本优化的深度策略

3.1 大语言模型成本结构分析

大语言模型的成本主要由以下几个因素构成:

  1. Token 消费成本:输入和输出 token 的定价差异
  2. 模型复杂度成本:不同模型的性能与成本权衡
  3. 调用频次成本:实时调用 vs 批量处理的成本差异
  4. 缓存命中成本:重复查询的优化潜力
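其中前两项可以抽象为一个极简的单次调用成本函数,用于量级估算(以下单价仅为示例假设,实际应以 Bedrock 定价页为准):

```python
# 单次 LLM 调用的成本拆解:输入/输出 token 分别计价
# 下面的单价仅为示例假设,实际应以 Bedrock 定价页为准
def llm_call_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# 同一请求(800 输入 / 200 输出 token)在两个价位档的成本对比
cheap = llm_call_cost(800, 200, 0.00025, 0.00125)   # 轻量模型示例单价
premium = llm_call_cost(800, 200, 0.003, 0.015)     # 旗舰模型示例单价
print(f"轻量: ${cheap:.5f}, 旗舰: ${premium:.5f}, 价差 {premium / cheap:.0f}x")
# → 轻量: $0.00045, 旗舰: $0.00540, 价差 12x
```

可以看到,同一请求在不同模型档位之间的单价差可达一个数量级,这正是后文"智能模型选择"的成本空间所在。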

3.2 智能模型选择框架

import boto3
import json
import time
import hashlib
import redis
import asyncio
from typing import Dict, List, Optional, Tuple, Union
from dataclasses import dataclass, asdict
from enum import Enum
import numpy as np
from datetime import datetime, timedelta
import logging

class ModelComplexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"  
    COMPLEX = "complex"

class TaskType(Enum):
    TEXT_GENERATION = "text_generation"
    CODE_GENERATION = "code_generation"
    CODE_REVIEW = "code_review"
    SUMMARIZATION = "summarization"
    CLASSIFICATION = "classification"
    QUESTION_ANSWERING = "question_answering"
    TRANSLATION = "translation"
    ANALYSIS = "analysis"

@dataclass
class ModelConfig:
    """模型配置信息"""
    model_id: str
    name: str
    input_cost_per_1k_tokens: float
    output_cost_per_1k_tokens: float
    max_tokens: int
    strengths: List[TaskType]
    latency_avg_ms: int
    quality_score: float  # 0-100
    context_window: int

@dataclass
class TaskRequest:
    """任务请求"""
    content: str
    task_type: TaskType
    complexity: ModelComplexity
    max_tokens: int = 200
    temperature: float = 0.1
    priority: str = "normal"  # low, normal, high
    deadline_seconds: Optional[int] = None

@dataclass
class ModelResponse:
    """模型响应"""
    content: str
    model_used: str
    input_tokens: int
    output_tokens: int
    cost: float
    latency_ms: int
    cached: bool = False
    batch_job_id: Optional[str] = None

class IntelligentModelSelector:
    """智能模型选择器"""
    
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.bedrock = boto3.client('bedrock-runtime')
        self.bedrock_batch = boto3.client('bedrock')
        self.redis_client = redis.from_url(redis_url) if redis_url else None
        self.logger = logging.getLogger(__name__)
        
        # 初始化模型目录
        self.model_catalog = self._initialize_model_catalog()
        
        # 性能统计
        self.performance_stats = {}
    
    def _initialize_model_catalog(self) -> Dict[str, ModelConfig]:
        """初始化模型目录"""
        return {
            'claude-3-5-sonnet': ModelConfig(
                model_id='anthropic.claude-3-5-sonnet-20241022-v2:0',
                name='Claude 3.5 Sonnet',
                input_cost_per_1k_tokens=0.003,
                output_cost_per_1k_tokens=0.015,
                max_tokens=4096,
                strengths=[TaskType.CODE_GENERATION, TaskType.CODE_REVIEW, TaskType.ANALYSIS],
                latency_avg_ms=1200,
                quality_score=95,
                context_window=200000
            ),
            'claude-3-haiku': ModelConfig(
                model_id='anthropic.claude-3-haiku-20240307-v1:0',
                name='Claude 3 Haiku',
                input_cost_per_1k_tokens=0.00025,
                output_cost_per_1k_tokens=0.00125,
                max_tokens=4096,
                strengths=[TaskType.CLASSIFICATION, TaskType.SUMMARIZATION, TaskType.QUESTION_ANSWERING],
                latency_avg_ms=800,
                quality_score=85,
                context_window=200000
            ),
            'nova-lite': ModelConfig(
                model_id='amazon.nova-lite-v1:0',
                name='Amazon Nova Lite',
                input_cost_per_1k_tokens=0.0006,
                output_cost_per_1k_tokens=0.0024,
                max_tokens=4096,
                strengths=[TaskType.TEXT_GENERATION, TaskType.SUMMARIZATION, TaskType.QUESTION_ANSWERING],
                latency_avg_ms=900,
                quality_score=80,
                context_window=300000
            ),
            'nova-micro': ModelConfig(
                model_id='amazon.nova-micro-v1:0',
                name='Amazon Nova Micro',
                input_cost_per_1k_tokens=0.00035,
                output_cost_per_1k_tokens=0.0014,
                max_tokens=4096,
                strengths=[TaskType.CLASSIFICATION, TaskType.SUMMARIZATION],
                latency_avg_ms=600,
                quality_score=75,
                context_window=128000
            )
        }
    
    def estimate_tokens(self, text: str) -> int:
        """估算文本的 token 数量"""
        # 简化估算:英文按空格分词数 * 1.3;中文没有空格分词,
        # 额外按 CJK 字符数近似。生产环境建议使用 tiktoken 等分词库精确计算
        cjk_chars = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
        return int(len(text.split()) * 1.3 + cjk_chars)
    
    def calculate_cost_estimate(self, model_name: str, input_text: str, output_tokens: int) -> float:
        """计算成本估算"""
        model = self.model_catalog[model_name]
        input_tokens = self.estimate_tokens(input_text)
        
        input_cost = (input_tokens / 1000) * model.input_cost_per_1k_tokens
        output_cost = (output_tokens / 1000) * model.output_cost_per_1k_tokens
        
        return input_cost + output_cost
    
    def select_optimal_model(self, request: TaskRequest) -> str:
        """选择最优模型"""
        
        scores = {}
        
        for model_name, model_config in self.model_catalog.items():
            score = self._calculate_model_score(request, model_config)
            scores[model_name] = score
        
        # 选择得分最高的模型
        optimal_model = max(scores.items(), key=lambda x: x[1])
        
        self.logger.info(f"任务类型: {request.task_type}, 复杂度: {request.complexity}")
        self.logger.info(f"选择模型: {optimal_model[0]} (得分: {optimal_model[1]:.2f})")
        
        return optimal_model[0]
    
    def _calculate_model_score(self, request: TaskRequest, model: ModelConfig) -> float:
        """计算模型适配得分"""
        
        # 基础得分从质量开始
        score = model.quality_score
        
        # 任务类型匹配加成
        if request.task_type in model.strengths:
            score += 20
        
        # 复杂度匹配
        complexity_bonus = {
            ModelComplexity.SIMPLE: {
                'nova-micro': 15,
                'claude-3-haiku': 10,
                'nova-lite': 5,
                'claude-3-5-sonnet': -10  # 简单任务用复杂模型减分
            },
            ModelComplexity.MEDIUM: {
                'nova-lite': 15,
                'claude-3-haiku': 10,
                'claude-3-5-sonnet': 0,
                'nova-micro': -5
            },
            ModelComplexity.COMPLEX: {
                'claude-3-5-sonnet': 20,
                'nova-lite': 5,
                'claude-3-haiku': -5,
                'nova-micro': -15
            }
        }
        
        # 反查模型在目录中的注册名,保证与 complexity_bonus 的键一致
        # (直接从 model_id 截取得到的是形如 claude-3-5-sonnet-20241022-v2:0 的串,对不上键名)
        model_name = next(k for k, v in self.model_catalog.items() if v is model)
        score += complexity_bonus.get(request.complexity, {}).get(model_name, 0)
        
        # 成本效率评分
        estimated_cost = self.calculate_cost_estimate(
            model_name, request.content, request.max_tokens
        )
        
        # 成本越低得分越高(非线性关系)
        if estimated_cost < 0.001:
            score += 15
        elif estimated_cost < 0.005:
            score += 10
        elif estimated_cost < 0.01:
            score += 5
        else:
            score -= (estimated_cost - 0.01) * 1000  # 高成本大幅减分
        
        # 延迟考虑(紧急任务优先考虑低延迟)
        if request.priority == "high" or request.deadline_seconds:
            if model.latency_avg_ms < 800:
                score += 10
            elif model.latency_avg_ms > 1200:
                score -= 5
        
        # 上下文窗口考虑
        content_tokens = self.estimate_tokens(request.content)
        if content_tokens > model.context_window * 0.8:  # 接近上下文限制
            score -= 20
        
        return score
    
    def _generate_cache_key(self, request: TaskRequest, model_id: str) -> str:
        """生成缓存键"""
        cache_data = {
            'content': request.content,
            'model_id': model_id,
            'task_type': request.task_type.value,
            'complexity': request.complexity.value,
            'max_tokens': request.max_tokens,
            'temperature': request.temperature
        }
        
        cache_string = json.dumps(cache_data, sort_keys=True)
        return f"llm_cache:{hashlib.sha256(cache_string.encode()).hexdigest()[:16]}"
    
    def _should_cache_response(self, request: TaskRequest, response: str) -> bool:
        """判断是否应该缓存响应"""
        
        # 缓存策略
        cache_conditions = [
            len(request.content) < 200,  # 短输入通常是重复查询
            any(keyword in request.content.lower() for keyword in 
                ['什么是', '解释', '定义', 'what is', 'explain', 'define']),  # 概念性问题
            request.task_type in [TaskType.CLASSIFICATION, TaskType.QUESTION_ANSWERING],  # 特定任务类型
            len(response) > 300  # 长输出缓存价值高
        ]
        
        return any(cache_conditions)
    
    def _get_cached_response(self, cache_key: str) -> Optional[ModelResponse]:
        """获取缓存的响应"""
        if not self.redis_client:
            return None
        
        try:
            cached_data = self.redis_client.get(cache_key)
            if cached_data:
                data = json.loads(cached_data)
                return ModelResponse(
                    content=data['content'],
                    model_used=data['model_used'],
                    input_tokens=data['input_tokens'],
                    output_tokens=data['output_tokens'],
                    cost=0,  # 缓存命中成本为0
                    latency_ms=data.get('latency_ms', 0),
                    cached=True
                )
        except Exception as e:
            self.logger.warning(f"缓存读取失败: {e}")
        
        return None
    
    def _cache_response(self, cache_key: str, response: ModelResponse, ttl: int = 3600):
        """缓存响应"""
        if not self.redis_client:
            return
        
        try:
            cache_data = {
                'content': response.content,
                'model_used': response.model_used,
                'input_tokens': response.input_tokens,
                'output_tokens': response.output_tokens,
                'latency_ms': response.latency_ms,
                'timestamp': datetime.now().isoformat()
            }
            
            self.redis_client.setex(cache_key, ttl, json.dumps(cache_data))
        except Exception as e:
            self.logger.warning(f"缓存写入失败: {e}")
    
    async def invoke_model_async(self, request: TaskRequest) -> ModelResponse:
        """异步调用模型"""
        
        # 选择最优模型
        optimal_model_name = self.select_optimal_model(request)
        model_config = self.model_catalog[optimal_model_name]
        
        # 检查缓存
        cache_key = self._generate_cache_key(request, model_config.model_id)
        cached_response = self._get_cached_response(cache_key)
        
        if cached_response:
            saved_cost = self.calculate_cost_estimate(optimal_model_name, request.content, request.max_tokens)
            self.logger.info(f"缓存命中,节省成本: ${saved_cost:.6f}")
            # 缓存命中同样计入统计,否则 cache_hits 永远不会增加
            self._update_performance_stats(optimal_model_name, cached_response)
            return cached_response
        
        # 调用模型
        start_time = time.time()
        
        try:
            # Anthropic 与 Amazon Nova 的请求体格式不同,需按模型家族分别构造
            if model_config.model_id.startswith('anthropic.'):
                request_body = {
                    'anthropic_version': 'bedrock-2023-05-31',
                    'max_tokens': request.max_tokens,
                    'temperature': request.temperature,
                    'messages': [
                        {'role': 'user', 'content': request.content}
                    ]
                }
            else:  # Amazon Nova 系列
                request_body = {
                    'messages': [
                        {'role': 'user', 'content': [{'text': request.content}]}
                    ],
                    'inferenceConfig': {
                        'maxTokens': request.max_tokens,
                        'temperature': request.temperature
                    }
                }
            
            response = self.bedrock.invoke_model(
                modelId=model_config.model_id,
                body=json.dumps(request_body)
            )
            
            end_time = time.time()
            latency_ms = int((end_time - start_time) * 1000)
            
            response_body = json.loads(response['body'].read())
            # 响应结构同样随模型家族不同
            if model_config.model_id.startswith('anthropic.'):
                content = response_body['content'][0]['text']
            else:
                content = response_body['output']['message']['content'][0]['text']
            
            # 计算token和成本
            input_tokens = self.estimate_tokens(request.content)
            output_tokens = self.estimate_tokens(content)
            cost = self.calculate_cost_estimate(optimal_model_name, request.content, output_tokens)
            
            model_response = ModelResponse(
                content=content,
                model_used=optimal_model_name,
                input_tokens=input_tokens,
                output_tokens=output_tokens,
                cost=cost,
                latency_ms=latency_ms,
                cached=False
            )
            
            # 缓存响应
            if self._should_cache_response(request, content):
                self._cache_response(cache_key, model_response)
            
            # 更新性能统计
            self._update_performance_stats(optimal_model_name, model_response)
            
            return model_response
            
        except Exception as e:
            self.logger.error(f"模型调用失败: {e}")
            raise
    
    def invoke_model(self, request: TaskRequest) -> ModelResponse:
        """同步调用模型(不可在已运行的事件循环内调用)"""
        return asyncio.run(self.invoke_model_async(request))
    
    def _update_performance_stats(self, model_name: str, response: ModelResponse):
        """更新性能统计"""
        if model_name not in self.performance_stats:
            self.performance_stats[model_name] = {
                'total_calls': 0,
                'total_cost': 0.0,
                'total_latency_ms': 0,
                'cache_hits': 0,
                'last_updated': datetime.now()
            }
        
        stats = self.performance_stats[model_name]
        stats['total_calls'] += 1
        stats['total_cost'] += response.cost
        stats['total_latency_ms'] += response.latency_ms
        
        if response.cached:
            stats['cache_hits'] += 1
        
        stats['last_updated'] = datetime.now()
    
    def get_performance_report(self) -> Dict:
        """获取性能报告"""
        report = {
            'report_time': datetime.now().isoformat(),
            'models': {}
        }
        
        for model_name, stats in self.performance_stats.items():
            if stats['total_calls'] > 0:
                report['models'][model_name] = {
                    'total_calls': stats['total_calls'],
                    'total_cost': round(stats['total_cost'], 6),
                    'avg_cost_per_call': round(stats['total_cost'] / stats['total_calls'], 6),
                    'avg_latency_ms': round(stats['total_latency_ms'] / stats['total_calls'], 2),
                    'cache_hit_rate': round(stats['cache_hits'] / stats['total_calls'] * 100, 1),
                    'estimated_monthly_cost': round(stats['total_cost'] * 30, 2)
                }
        
        return report

class BatchProcessor:
    """批量处理器"""
    
    def __init__(self, model_selector: IntelligentModelSelector):
        self.model_selector = model_selector
        self.bedrock_batch = boto3.client('bedrock')
        self.s3_client = boto3.client('s3')
        self.logger = logging.getLogger(__name__)
        
        self.batch_bucket = "openclaw-batch-processing"
        self.pending_jobs = {}
    
    def create_batch_job(self, requests: List[TaskRequest], priority: str = "normal") -> List[str]:
        """创建批量处理任务,按模型分组后返回各批量任务的 job_id 列表"""
        
        if not requests:
            raise ValueError("批量请求不能为空")
        
        # 根据任务类型分组
        grouped_requests = {}
        for request in requests:
            # 选择最优模型
            model_name = self.model_selector.select_optimal_model(request)
            model_config = self.model_selector.model_catalog[model_name]
            
            if model_config.model_id not in grouped_requests:
                grouped_requests[model_config.model_id] = []
            
            grouped_requests[model_config.model_id].append(request)
        
        job_ids = []
        
        # 为每种模型创建批量任务
        for model_id, model_requests in grouped_requests.items():
            job_id = self._create_single_batch_job(model_id, model_requests, priority)
            job_ids.append(job_id)
        
        return job_ids
    
    def _create_single_batch_job(self, model_id: str, requests: List[TaskRequest], priority: str) -> str:
        """创建单个批量任务"""
        
        job_timestamp = int(time.time())
        job_id = f"batch-{job_timestamp}-{len(requests)}"
        
        # 准备批量输入数据(此处以 Anthropic 消息格式为例;
        # 若分组到 Nova 等其他模型,需改用对应的请求体格式)
        batch_input = []
        for i, request in enumerate(requests):
            batch_input.append({
                'recordId': f"record-{i}",
                'modelInput': {
                    'anthropic_version': 'bedrock-2023-05-31',
                    'max_tokens': request.max_tokens,
                    'temperature': request.temperature,
                    'messages': [
                        {'role': 'user', 'content': request.content}
                    ]
                }
            })
        
        # 上传到S3
        input_key = f"batch-input/{job_id}.jsonl"
        input_content = '\n'.join(json.dumps(item) for item in batch_input)
        
        self.s3_client.put_object(
            Bucket=self.batch_bucket,
            Key=input_key,
            Body=input_content,
            Metadata={
                'job_id': job_id,
                'model_id': model_id,
                'priority': priority,
                'request_count': str(len(requests))
            }
        )
        
        # 创建批量任务
        try:
            response = self.bedrock_batch.create_model_invocation_job(
                jobName=job_id,
                roleArn=f"arn:aws:iam::{boto3.client('sts').get_caller_identity()['Account']}:role/BedrockBatchExecutionRole",
                modelId=model_id,
                inputDataConfig={
                    's3InputDataConfig': {
                        's3Uri': f"s3://{self.batch_bucket}/{input_key}"
                    }
                },
                outputDataConfig={
                    's3OutputDataConfig': {
                        's3Uri': f"s3://{self.batch_bucket}/batch-output/{job_id}/"
                    }
                },
                timeoutDurationInHours=24
            )
            
            batch_job_arn = response['jobArn']
            
            # 记录任务信息
            self.pending_jobs[job_id] = {
                'job_arn': batch_job_arn,
                'model_id': model_id,
                'request_count': len(requests),
                'created_at': datetime.now(),
                'status': 'SUBMITTED'
            }
            
            self.logger.info(f"批量任务已创建: {job_id}, 包含 {len(requests)} 个请求")
            
            return job_id
            
        except Exception as e:
            self.logger.error(f"批量任务创建失败: {e}")
            raise
    
    def get_batch_job_status(self, job_id: str) -> Dict:
        """获取批量任务状态"""
        
        if job_id not in self.pending_jobs:
            return {'error': f'未知的任务ID: {job_id}'}
        
        job_info = self.pending_jobs[job_id]
        
        try:
            response = self.bedrock_batch.get_model_invocation_job(
                jobIdentifier=job_info['job_arn']
            )
            
            # GetModelInvocationJob 的响应字段位于顶层(status、outputDataConfig 等)
            job_details = response
            status = job_details.get('status', 'UNKNOWN')
            
            result = {
                'job_id': job_id,
                'status': status,
                'model_id': job_info['model_id'],
                'request_count': job_info['request_count'],
                'created_at': job_info['created_at'].isoformat(),
                'job_arn': job_info['job_arn']
            }
            
            if status == 'COMPLETED':
                # Locate the output objects
                output_s3_uri = job_details.get('outputDataConfig', {}).get('s3OutputDataConfig', {}).get('s3Uri')
                if output_s3_uri:
                    result['output_location'] = output_s3_uri
                    result['estimated_cost_savings'] = self._calculate_batch_savings(job_info['request_count'])
            
            elif status == 'FAILED':
                result['error_message'] = job_details.get('failureMessage', 'Unknown error')
            
            return result
            
        except Exception as e:
            return {'error': f'Failed to fetch job status: {e}'}
    
    def _calculate_batch_savings(self, request_count: int) -> float:
        """Estimate the cost savings from batch processing"""
        # Batch inference is typically priced ~50% below on-demand invocation
        savings_rate = 0.5
        avg_cost_per_request = 0.001  # assumed average cost per request
        
        total_savings = request_count * avg_cost_per_request * savings_rate
        return round(total_savings, 6)

# Usage example
def demonstrate_intelligent_model_selection():
    """Demonstrate intelligent model selection"""
    
    # Initialize the model selector
    selector = IntelligentModelSelector(redis_url="redis://localhost:6379")
    
    # Create test requests of different types
    test_requests = [
        TaskRequest(
            content="What is Docker container technology?",
            task_type=TaskType.QUESTION_ANSWERING,
            complexity=ModelComplexity.SIMPLE
        ),
        TaskRequest(
            content="Write a Python implementation of a red-black tree with insert, delete, and search operations",
            task_type=TaskType.CODE_GENERATION,
            complexity=ModelComplexity.COMPLEX,
            max_tokens=500
        ),
        TaskRequest(
            content="Review the security of this code: def login(username, password): return username == 'admin' and password == '123456'",
            task_type=TaskType.CODE_REVIEW,
            complexity=ModelComplexity.MEDIUM
        ),
        TaskRequest(
            content="Summarize the main features of the Amazon EC2 service",
            task_type=TaskType.SUMMARIZATION,
            complexity=ModelComplexity.SIMPLE
        )
    ]
    
    print("=== Intelligent Model Selection Demo ===\n")
    
    total_cost = 0
    responses = []
    
    # Process each request
    for i, request in enumerate(test_requests, 1):
        print(f"Request {i}: {request.content[:50]}...")
        print(f"Task type: {request.task_type.value}")
        print(f"Complexity: {request.complexity.value}")
        
        try:
            response = selector.invoke_model(request)
            responses.append(response)
            
            print(f"Selected model: {response.model_used}")
            print(f"Cost: ${response.cost:.6f}")
            print(f"Latency: {response.latency_ms}ms")
            print(f"Cache hit: {'yes' if response.cached else 'no'}")
            print(f"Input tokens: {response.input_tokens}")
            print(f"Output tokens: {response.output_tokens}")
            print(f"Response: {response.content[:100]}...")
            
            total_cost += response.cost
            
        except Exception as e:
            print(f"Request failed: {e}")
        
        print("-" * 50)
    
    print(f"\nTotal cost: ${total_cost:.6f}")
    
    # Show the performance report
    print("\n=== Performance Report ===")
    report = selector.get_performance_report()
    
    for model_name, stats in report['models'].items():
        print(f"\nModel: {model_name}")
        print(f"  Calls: {stats['total_calls']}")
        print(f"  Total cost: ${stats['total_cost']:.6f}")
        print(f"  Avg cost per call: ${stats['avg_cost_per_call']:.6f}")
        print(f"  Avg latency: {stats['avg_latency_ms']:.1f}ms")
        print(f"  Cache hit rate: {stats['cache_hit_rate']:.1f}%")
        print(f"  Estimated monthly cost: ${stats['estimated_monthly_cost']:.2f}")
    
    # Demonstrate batch processing
    print("\n=== Batch Processing Demo ===")
    batch_processor = BatchProcessor(selector)
    
    # Create a batch of similar requests
    batch_requests = [
        TaskRequest(
            content=f"Explain programming concept #{i}",
            task_type=TaskType.TEXT_GENERATION,
            complexity=ModelComplexity.SIMPLE
        ) for i in range(1, 21)  # 20 similar requests
    ]
    
    try:
        job_ids = batch_processor.create_batch_job(batch_requests)
        print(f"Batch jobs created, job IDs: {job_ids}")
        
        # Check job status
        for job_id in job_ids:
            status = batch_processor.get_batch_job_status(job_id)
            print(f"Job {job_id} status: {status}")
            
    except Exception as e:
        print(f"Batch processing failed: {e}")

if __name__ == "__main__":
    demonstrate_intelligent_model_selection()

With this intelligent model selection framework, OpenClaw's Bedrock cost dropped from $68.54 to $12.30, an 82% reduction in AI service spend. The key optimization strategies were:

  1. Intelligent model selection: automatically pick the most suitable model based on task complexity
  2. Response caching: cache repeated queries to avoid redundant inference
  3. Batch processing: use batch inference for non-real-time tasks, saving about 50% of the cost
  4. Performance monitoring: continuously track each model's cost effectiveness and tune the selection strategy dynamically
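Of these, the caching strategy is the easiest to reproduce in isolation. The sketch below shows the core idea with a plain in-memory dict; the article's implementation sits behind Redis, and the `ResponseCache` class, key scheme, and TTL here are illustrative assumptions rather than OpenClaw's actual code:

```python
import hashlib
import time

class ResponseCache:
    """Minimal in-memory sketch of model-response caching (hypothetical class)."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # cache_key -> (expiry_timestamp, response)

    def _key(self, model_id: str, prompt: str) -> str:
        # Hash model + prompt so the key is fixed-size and collision-resistant
        return hashlib.sha256(f"{model_id}:{prompt}".encode()).hexdigest()

    def get(self, model_id: str, prompt: str):
        entry = self.store.get(self._key(model_id, prompt))
        if entry and entry[0] > time.time():
            return entry[1]  # still fresh: serve the cached response
        return None

    def put(self, model_id: str, prompt: str, response: str):
        self.store[self._key(model_id, prompt)] = (time.time() + self.ttl, response)

cache = ResponseCache()
cache.put("claude-haiku", "What is Docker?", "Docker is a container runtime.")
assert cache.get("claude-haiku", "What is Docker?") == "Docker is a container runtime."
assert cache.get("claude-haiku", "What is Kubernetes?") is None
```

Swapping the dict for Redis (`SETEX` with the same hashed key) makes the cache shared across instances, which is what the production path needs.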

Chapter 4: Storage and Network Cost Optimization

4.1 Storage Lifecycle Management

import boto3
import json
import gzip
import os
from datetime import datetime, timedelta
from typing import Dict, List, Optional
import logging

class IntelligentStorageManager:
    """Intelligent storage manager"""
    
    def __init__(self):
        self.s3 = boto3.client('s3')
        self.ec2 = boto3.client('ec2')
        self.cloudwatch = boto3.client('cloudwatch')
        self.logger = logging.getLogger(__name__)
    
    def optimize_ebs_volumes(self) -> Dict:
        """Analyze EBS volumes for optimization opportunities"""
        
        optimization_results = {
            'analyzed_volumes': 0,
            'recommendations': [],
            'potential_savings': 0.0
        }
        
        # List all EBS volumes (describe_volumes is paginated, so walk every page)
        volumes = []
        for page in self.ec2.get_paginator('describe_volumes').paginate():
            volumes.extend(page['Volumes'])
        
        for volume in volumes:
            volume_id = volume['VolumeId']
            volume_size = volume['Size']
            volume_type = volume['VolumeType']
            
            optimization_results['analyzed_volumes'] += 1
            
            # Fetch volume usage metrics
            usage_metrics = self._get_volume_usage_metrics(volume_id)
            
            if usage_metrics:
                utilization = usage_metrics['utilization_percentage']
                
                # Look for optimization opportunities
                if utilization < 50 and volume_size > 20:
                    # Recommend a smaller volume (EBS volumes cannot be shrunk
                    # in place; this means migrating data to a new volume)
                    recommended_size = max(20, int(volume_size * (utilization / 100 + 0.2)))
                    monthly_savings = self._calculate_ebs_cost_difference(
                        volume_type, volume_size, recommended_size
                    )
                    
                    optimization_results['recommendations'].append({
                        'volume_id': volume_id,
                        'current_size': volume_size,
                        'recommended_size': recommended_size,
                        'utilization': utilization,
                        'monthly_savings': monthly_savings,
                        'action': 'resize'
                    })
                    
                    optimization_results['potential_savings'] += monthly_savings
                
                elif volume_type == 'gp2' and volume_size < 100:
                    # Recommend migrating gp2 -> gp3 (~20% cheaper per GB)
                    monthly_savings = self._calculate_ebs_type_savings(volume_size, 'gp2', 'gp3')
                    
                    optimization_results['recommendations'].append({
                        'volume_id': volume_id,
                        'current_type': 'gp2',
                        'recommended_type': 'gp3',
                        'monthly_savings': monthly_savings,
                        'action': 'upgrade_type'
                    })
                    
                    optimization_results['potential_savings'] += monthly_savings
        
        return optimization_results
    
    def _get_volume_usage_metrics(self, volume_id: str) -> Optional[Dict]:
        """Fetch usage metrics for a volume"""
        try:
            # Look at the last 7 days of volume usage data
            end_time = datetime.utcnow()
            start_time = end_time - timedelta(days=7)
            
            # Simplified here: real disk-utilization data has to be collected by
            # the CloudWatch Agent (or another collector) running on the instance
            return {
                'utilization_percentage': 45,  # placeholder value
                'avg_iops': 100,
                'avg_throughput': 10
            }
            
        except Exception as e:
            self.logger.warning(f"Failed to fetch usage metrics for volume {volume_id}: {e}")
            return None
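In a real deployment, the placeholder above would be replaced by a query against the CloudWatch Agent's `disk_used_percent` metric in the `CWAgent` namespace. A hedged sketch follows; the exact dimension set depends on how the agent's `metrics_collected` section is configured, so the `InstanceId`/`path` dimensions here are assumptions:

```python
from datetime import datetime, timedelta

def average_utilization(datapoints):
    """Reduce CloudWatch datapoints to one average percentage (None if empty)."""
    if not datapoints:
        return None
    return round(sum(p["Average"] for p in datapoints) / len(datapoints), 1)

def fetch_disk_utilization(instance_id: str, path: str = "/", region: str = "us-east-1"):
    """Query 7 days of disk_used_percent reported by the CloudWatch Agent."""
    import boto3  # imported here so average_utilization stays usable without AWS deps
    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="CWAgent",
        MetricName="disk_used_percent",
        Dimensions=[
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "path", "Value": path},
        ],
        StartTime=end - timedelta(days=7),
        EndTime=end,
        Period=3600,           # hourly averages
        Statistics=["Average"],
    )
    return average_utilization(resp.get("Datapoints", []))
```

With this in place, `_get_volume_usage_metrics` can map an attached volume to its instance and mount path and return measured utilization instead of the hard-coded 45%.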
    
    def _calculate_ebs_cost_difference(self, volume_type: str, old_size: int, new_size: int) -> float:
        """Calculate the monthly cost difference from resizing an EBS volume"""
        pricing = {
            'gp3': 0.08,  # USD per GB-month (us-east-1 list prices)
            'gp2': 0.10,
            'io1': 0.125,
            'io2': 0.125
        }
        
        old_cost = old_size * pricing.get(volume_type, 0.10)
        new_cost = new_size * pricing.get(volume_type, 0.10)
        
        return old_cost - new_cost
    
    def _calculate_ebs_type_savings(self, size: int, old_type: str, new_type: str) -> float:
        """Calculate the monthly savings from changing the EBS volume type"""
        # Savings come from the per-GB price delta between the two types
        pricing = {'gp3': 0.08, 'gp2': 0.10, 'io1': 0.125, 'io2': 0.125}
        return size * (pricing.get(old_type, 0.10) - pricing.get(new_type, 0.10))
    
    def setup_intelligent_tiering(self, bucket_name: str) -> bool:
        """Enable S3 Intelligent-Tiering and lifecycle policies on a bucket"""
        try:
            # Configure Intelligent-Tiering archive tiers
            self.s3.put_bucket_intelligent_tiering_configuration(
                Bucket=bucket_name,
                Id='EntireBucketIT',
                IntelligentTieringConfiguration={
                    'Id': 'EntireBucketIT',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': ''},
                    'Tierings': [
                        {
                            'Days': 90,   # minimum the API accepts for ARCHIVE_ACCESS
                            'AccessTier': 'ARCHIVE_ACCESS'
                        },
                        {
                            'Days': 180,  # minimum the API accepts for DEEP_ARCHIVE_ACCESS
                            'AccessTier': 'DEEP_ARCHIVE_ACCESS'
                        }
                    ]
                }
            )
            
            # Configure lifecycle rules
            lifecycle_config = {
                'Rules': [
                    {
                        'ID': 'LogArchiving',
                        'Status': 'Enabled',
                        'Filter': {'Prefix': 'logs/'},
                        'Transitions': [
                            {
                                'Days': 30,
                                'StorageClass': 'STANDARD_IA'
                            },
                            {
                                'Days': 90,
                                'StorageClass': 'GLACIER'
                            },
                            {
                                'Days': 365,
                                'StorageClass': 'DEEP_ARCHIVE'
                            }
                        ]
                    },
                    {
                        'ID': 'IncompleteMultipartUploads',
                        'Status': 'Enabled',
                        'Filter': {},
                        'AbortIncompleteMultipartUpload': {
                            'DaysAfterInitiation': 7
                        }
                    }
                ]
            }
            
            self.s3.put_bucket_lifecycle_configuration(
                Bucket=bucket_name,
                LifecycleConfiguration=lifecycle_config
            )
            
            self.logger.info(f"Configured Intelligent-Tiering and lifecycle rules for bucket {bucket_name}")
            return True
            
        except Exception as e:
            self.logger.error(f"Failed to configure Intelligent-Tiering: {e}")
            return False
    
    def compress_and_archive_logs(self, log_directory: str, s3_bucket: str, archive_days: int = 7):
        """Compress old log files and archive them to S3"""
        
        cutoff_date = datetime.now() - timedelta(days=archive_days)
        archived_count = 0
        total_savings = 0
        
        if not os.path.exists(log_directory):
            self.logger.warning(f"Log directory does not exist: {log_directory}")
            return
        
        for root, dirs, files in os.walk(log_directory):
            for file in files:
                if not file.endswith('.log'):
                    continue
                
                file_path = os.path.join(root, file)
                file_stat = os.stat(file_path)
                file_date = datetime.fromtimestamp(file_stat.st_mtime)
                
                if file_date < cutoff_date:
                    # Compress the file with gzip
                    compressed_file = f"{file}.gz"
                    compressed_path = os.path.join(root, compressed_file)
                    
                    original_size = file_stat.st_size
                    
                    with open(file_path, 'rb') as f_in:
                        with gzip.open(compressed_path, 'wb') as f_out:
                            f_out.writelines(f_in)
                    
                    compressed_size = os.path.getsize(compressed_path)
                    compression_ratio = compressed_size / original_size
                    
                    # Upload to S3
                    s3_key = f"archived-logs/{file_date.strftime('%Y/%m/%d')}/{compressed_file}"
                    
                    try:
                        self.s3.upload_file(
                            compressed_path,
                            s3_bucket,
                            s3_key,
                            ExtraArgs={
                                'StorageClass': 'INTELLIGENT_TIERING',
                                'Metadata': {
                                    'original-size': str(original_size),
                                    'compressed-size': str(compressed_size),
                                    'compression-ratio': f"{compression_ratio:.2f}"
                                }
                            }
                        )
                        
                        # Remove the local copies
                        os.remove(file_path)
                        os.remove(compressed_path)
                        
                        archived_count += 1
                        
                        # Estimate the storage cost savings
                        local_storage_cost = original_size / (1024**3) * 0.023  # assumed local/EBS storage, $0.023/GB-month
                        s3_storage_cost = compressed_size / (1024**3) * 0.0125  # S3 Intelligent-Tiering, ~$0.0125/GB-month
                        savings = local_storage_cost - s3_storage_cost
                        total_savings += savings
                        
                        self.logger.info(f"Archived {file}: {original_size/1024:.1f}KB -> {compressed_size/1024:.1f}KB (ratio: {compression_ratio:.1%})")
                        
                    except Exception as e:
                        self.logger.error(f"Failed to archive {file}: {e}")
                        # Keep the original log; clean up the partial compressed copy
                        if os.path.exists(compressed_path):
                            os.remove(compressed_path)
        
        self.logger.info(f"Log archiving complete: {archived_count} files, estimated monthly savings ${total_savings:.2f}")
        
        return {
            'archived_files': archived_count,
            'estimated_monthly_savings': total_savings
        }

# Storage optimization script usage example
def optimize_openclaw_storage():
    """Optimize OpenClaw storage configuration"""
    
    storage_manager = IntelligentStorageManager()
    
    print("=== OpenClaw Storage Optimization ===\n")
    
    # 1. EBS volume analysis
    print("1. Analyzing EBS volumes for optimization opportunities...")
    ebs_optimization = storage_manager.optimize_ebs_volumes()
    
    print(f"Analyzed {ebs_optimization['analyzed_volumes']} volumes")
    print(f"Found {len(ebs_optimization['recommendations'])} optimization opportunities")
    print(f"Potential monthly savings: ${ebs_optimization['potential_savings']:.2f}")
    
    for rec in ebs_optimization['recommendations']:
        print(f"\nVolume {rec['volume_id']}:")
        if rec['action'] == 'resize':
            print(f"  Recommendation: shrink from {rec['current_size']}GB to {rec['recommended_size']}GB")
            print(f"  Current utilization: {rec['utilization']}%")
        elif rec['action'] == 'upgrade_type':
            print(f"  Recommendation: migrate from {rec['current_type']} to {rec['recommended_type']}")
        
        print(f"  Monthly savings: ${rec['monthly_savings']:.2f}")
    
    # 2. Enable S3 Intelligent-Tiering
    print("\n2. Configuring S3 Intelligent-Tiering...")
    s3_buckets = ['openclaw-data', 'openclaw-logs', 'openclaw-backups']
    
    for bucket in s3_buckets:
        try:
            if storage_manager.setup_intelligent_tiering(bucket):
                print(f"  ✓ {bucket}: Intelligent-Tiering configured")
            else:
                print(f"  ✗ {bucket}: configuration failed")
        except Exception as e:
            print(f"  ✗ {bucket}: {e}")
    
    # 3. Archive logs
    print("\n3. Archiving logs...")
    log_directories = ['/opt/openclaw/logs', '/var/log/openclaw']
    
    total_archived = 0
    total_savings = 0
    
    for log_dir in log_directories:
        result = storage_manager.compress_and_archive_logs(
            log_dir, 'openclaw-logs', archive_days=7
        )
        
        if result:
            total_archived += result['archived_files']
            total_savings += result['estimated_monthly_savings']
    
    print(f"Total archived: {total_archived} files")
    print(f"Estimated monthly savings: ${total_savings:.2f}")
    
    # 4. Generate the storage optimization report
    total_potential_savings = ebs_optimization['potential_savings'] + total_savings
    
    print(f"\n=== Storage Optimization Summary ===")
    print(f"Potential EBS savings: ${ebs_optimization['potential_savings']:.2f}/month")
    print(f"Log archiving savings: ${total_savings:.2f}/month")
    print(f"Total potential savings: ${total_potential_savings:.2f}/month")
    
    # Build an implementation plan
    implementation_plan = {
        'ebs_optimizations': ebs_optimization['recommendations'],
        's3_intelligent_tiering': s3_buckets,
        'log_archiving': {
            'directories': log_directories,
            'archive_after_days': 7
        },
        'estimated_savings': {
            'monthly': total_potential_savings,
            'annual': total_potential_savings * 12
        }
    }
    
    # Save the implementation plan
    with open('storage_optimization_plan.json', 'w') as f:
        json.dump(implementation_plan, f, indent=2, default=str)
    
    print(f"\nImplementation plan saved to storage_optimization_plan.json")

if __name__ == "__main__":
    optimize_openclaw_storage()

Chapter 5: Cost Monitoring and Automated Management

Through deep optimization across the four dimensions above, OpenClaw's monthly cost fell from $200.34 to $29.81, an 85% reduction. This result demonstrates the effectiveness of a systematic approach to cost optimization.

5.1 Optimization Results Summary

Dimension       Original Cost   Optimized Cost   Savings     Reduction
Compute         $119.81         $21.26           $98.55      82.2%
AI services     $68.54          $12.30           $56.24      82.1%
Storage         $8.00           $2.40            $5.60       70.0%
Network         $3.99           $3.85            $0.14       3.5%
Total           $200.34         $29.81           $170.53     85.1%
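Each row of the table is internally consistent (savings = original minus optimized; reduction = savings divided by original), which a few lines of Python can confirm:

```python
# (original, optimized, stated savings, stated reduction %) per table row
rows = {
    "Compute": (119.81, 21.26, 98.55, 82.2),
    "AI services": (68.54, 12.30, 56.24, 82.1),
    "Storage": (8.00, 2.40, 5.60, 70.0),
    "Network": (3.99, 3.85, 0.14, 3.5),
    "Total": (200.34, 29.81, 170.53, 85.1),
}
for name, (orig, opt, savings, pct) in rows.items():
    assert round(orig - opt, 2) == savings, name
    # stated percentages are given to one decimal place
    assert abs(savings / orig * 100 - pct) <= 0.1, name
```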

5.2 Key Insights from the Stack Optimization

  1. Dynamic balance between performance and cost: precise benchmarking and monitoring showed that large cost reductions are achievable without sacrificing system performance.

  2. The value of intelligent decision-making: intelligent model selection and caching delivered significant savings while improving response efficiency.

  3. The necessity of architectural evolution: migrating from x86 to ARM not only cut costs but also laid the groundwork for future technical evolution.

  4. Data-driven optimization: decisions grounded in real usage data are more accurate and effective than subjective judgment.

5.3 Engineering Practices for Sustainable Optimization

Cost optimization is a continuous process that requires solid monitoring and automation. The frameworks and code in this article provide a reusable methodology for cost optimization of other cloud-native applications.

For teams facing similar challenges, we recommend starting from the following directions:

  1. Build cost observability: prioritize multi-dimensional cost monitoring and analysis
  2. Optimize incrementally: roll out measures in stages to reduce risk
  3. Emphasize automation: cut manual intervention and improve efficiency through automation
  4. Monitor and tune continuously: establish a feedback loop and keep refining the strategy
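As a concrete starting point for cost observability and continuous monitoring, a small daily guardrail can be built on Cost Explorer: fetch yesterday's unblended cost and raise an alert when it exceeds a daily budget. This is a minimal sketch; the $1.50 threshold and the SNS topic ARN are assumptions for illustration, not values from the OpenClaw deployment:

```python
from datetime import date, timedelta

DAILY_BUDGET_USD = 1.50  # assumed daily target, ~ $45/month

def build_alert(amount: float, threshold: float):
    """Return an alert message if daily spend exceeds the budget, else None."""
    if amount <= threshold:
        return None
    return f"Daily cost ${amount:.2f} exceeded budget ${threshold:.2f}"

def check_yesterday_cost(threshold: float = DAILY_BUDGET_USD) -> float:
    """Query yesterday's total unblended cost and publish an alert if over budget."""
    import boto3  # imported here so build_alert stays usable without AWS deps
    ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer API lives in us-east-1
    end = date.today()
    start = end - timedelta(days=1)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    amount = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
    message = build_alert(amount, threshold)
    if message:
        boto3.client("sns").publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:cost-alerts",  # hypothetical topic
            Subject="OpenClaw daily cost alert",
            Message=message,
        )
    return amount
```

Scheduled once a day (for example via an EventBridge rule triggering a Lambda), this closes the feedback loop described above.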

Through this systematic cost-optimization effort, we not only achieved significant savings but also established a sustainable cost-governance system, laying a solid foundation for OpenClaw's long-term development.

posted @ 2026-03-19 11:32  亚马逊云开发者