HBase Notes (10): Cluster Monitoring

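1. Context monitoring implementations:

GangliaContext:                      pushes metrics to Ganglia

FileContext:                         writes metrics to a file

TimeStampingFileContext:             writes metrics to a file, with timestamps

CompositeContext:                    combines multiple implementations

NullContext:                         no monitoring

NullContextWithUpdateThread:         no monitoring output, but starts the thread that aggregates statistics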


2. HMaster metrics

cluster requests      number of requests across the cluster

split time            time spent splitting write-ahead logs

split size            size of the write-ahead logs that were split


3. HRegionServer metrics

block cache:          count, size, free, evicted

compaction:           size, time, request size

memstore:             size, flush queue size, flush size, flush time

stores:               store files, stores, file index

I/O:                  fs read latency, fs write latency, fs sync latency

Other:                read request count, write request count


4. RPC monitoring

RPC Processing Time

RPC Queue Time


5. JVM monitoring

Heap

GC

Thread

System event

 

6. Info monitoring

date   version  revision url  user hdfsDate  hdfsVersion  hdfsRevision  hdfsUrl  hdfsUser


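7. Ganglia architecture

gmond   collects data on each monitored node

gmetad  runs on a single node and pulls the data for the whole cluster from the gmond daemons

Web front end   displays the data

After installation, edit hadoop-metrics.properties or hadoop-metrics2.properties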
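As a rough illustration of that last step, the sketch below renders the kind of entries the legacy hadoop-metrics.properties file uses to point one metrics context at Ganglia. The class name org.apache.hadoop.metrics.ganglia.GangliaContext31, the hbase/jvm/rpc prefixes, the 10-second period and the address gmond-host:8649 are assumptions for illustration, not values taken from these notes; check them against the sample file shipped with your HBase version. The hadoop-metrics2.properties format uses different keys and is not covered here.

```python
def render_ganglia_context(prefix, servers, period_seconds=10,
                           context_class="org.apache.hadoop.metrics.ganglia.GangliaContext31"):
    """Return hadoop-metrics.properties lines for one metrics context.

    prefix         -- context name, e.g. "hbase", "jvm" or "rpc" (assumed names)
    servers        -- "host:port" of the gmond that should receive the data
    period_seconds -- how often the context emits its metrics
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return "\n".join([
        f"{prefix}.class={context_class}",
        f"{prefix}.period={period_seconds}",
        f"{prefix}.servers={servers}",
    ])


if __name__ == "__main__":
    # gmond-host:8649 is a hypothetical address.
    for prefix in ("hbase", "jvm", "rpc"):
        print(render_ganglia_context(prefix, "gmond-host:8649"))
        print()
```

Generating the block per prefix keeps the three contexts consistent; the output is meant to be pasted into the properties file, after which the daemons need a restart to pick it up.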


8. JMX monitoring configuration:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.port=10101 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote.port=10103  -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HADOOP_SECONDARYNAMENODE_OPTS"
export HBASE_MASTER_OPTS="-Dcom.sun.management.jmxremote.port=11101 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HBASE_MASTER_OPTS"
export HBASE_REGIONSERVER_OPTS="-Dcom.sun.management.jmxremote.port=11102 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HBASE_REGIONSERVER_OPTS"
export HBASE_ZOOKEEPER_OPTS="-Dcom.sun.management.jmxremote.port=11103 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HBASE_ZOOKEEPER_OPTS"

export HBASE_THRIFT_OPTS="-Dcom.sun.management.jmxremote.port=11104 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false  $HBASE_THRIFT_OPTS"
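After the daemons are restarted with these options, a quick way to see whether each JMX port is actually listening is a plain TCP connect. The sketch below is only that: it does not speak the JMX protocol, and the process labels and the localhost target are assumptions for illustration. The port numbers are the ones from the export lines above.

```python
import socket

# Ports copied from the export lines above; the labels are only for display.
JMX_PORTS = {
    "NameNode": 10101,
    "DataNode": 10102,
    "SecondaryNameNode": 10103,
    "HMaster": 11101,
    "HRegionServer": 11102,
    "ZooKeeper": 11103,
    "Thrift": 11104,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


def check_jmx_ports(host, ports=None):
    """Map each process label to whether its JMX port accepts connections."""
    ports = JMX_PORTS if ports is None else ports
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_open in check_jmx_ports("localhost").items():
        print(f"{name:18s} {'open' if is_open else 'closed'}")
```

An open port only shows that something is listening; a JMX client such as jconsole is still needed to browse the MBeans.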


9. JVM monitoring:

 

 

ClassLoading:  LoadedClassCount,  TotalLoadedClassCount,    UnloadedClassCount   

 

Compilation: Name,  CompilationTimeMonitoringSupported,       TotalCompilationTime

 

GarbageCollector -> PS MarkSweep:  Name, CollectionCount, CollectionTime, LastGcInfo, MemoryPoolNames, Valid

GarbageCollector -> PS Scavenge:  Name, CollectionCount, CollectionTime, LastGcInfo, MemoryPoolNames, Valid

Memory:  HeapMemoryUsage (init, max, committed, used), NonHeapMemoryUsage (init, max, committed, used), ObjectPendingFinalizationCount

 

MemoryManager -> CodeCacheManager:  Name, MemoryPoolNames

MemoryPool -> Code Cache:  Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded

MemoryPool -> PS Eden Space:  Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded

MemoryPool -> PS Survivor Space:  Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded

MemoryPool -> PS Old Gen:  Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded

MemoryPool -> PS Perm Gen:  Name, Type, UsageThresholdSupported, CollectionUsageThresholdSupported, MemoryManagerNames, Usage (init, max, committed, used), PeakUsage (init, max, committed, used), UsageThreshold, UsageThresholdCount, UsageThresholdExceeded, CollectionUsage (init, max, committed, used), CollectionUsageThreshold, CollectionUsageThresholdCount, CollectionUsageThresholdExceeded

OperatingSystem:  Name, Arch, AvailableProcessors, CommittedVirtualMemorySize, FreePhysicalMemorySize, FreeSwapSpaceSize, MaxFileDescriptorCount,OpenFileDescriptorCount,ProcessCpuLoad,ProcessCpuTime, SystemCpuLoad, SystemLoadAverage, TotalPhysicalMemorySize, TotalSwapSpaceSize, Version

 

Runtime:  Name, BootClassPathSupported, BootClassPath, ClassPath, InputArguments, LibraryPath, ManagementSpecVersion, SpecName,SpecVendor,SpecVersion, StartTime,SystemProperties,Uptime,VmName,VmVendor,VmVersion

 

Threading:  CurrentThreadCpuTimeSupported, AllThreadIds, CurrentThreadCpuTime, CurrentThreadUserTime, ObjectMonitorUsageSupported, PeakThreadCount, SynchronizerUsageSupported, ThreadAllocatedMemoryEnabled, ThreadAllocatedMemorySupported, ThreadContentionMonitoringEnabled, ThreadContentionMonitoringSupported, ThreadCount, ThreadCpuTimeEnabled, ThreadCpuTimeSupported, TotalStartedThreadCount

 

java.nio.BufferPool -> direct:  Name, TotalCapacity, Count, MemoryUsed

java.nio.BufferPool -> mapped:  Name, TotalCapacity, Count, MemoryUsed
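Besides a JMX client, Hadoop and HBase daemons usually serve the same MBeans as JSON under the /jmx path of their web UI, which is convenient for scripts. The sketch below reads such a document and computes heap usage from the java.lang:type=Memory bean listed above. The sample payload is made up for illustration, and the URL passed to fetch_jmx is whatever your daemon exposes; neither comes from these notes.

```python
import json
from urllib.request import urlopen

# Illustrative payload in the {"beans": [...]} shape; the numbers are invented.
SAMPLE_PAYLOAD = json.loads("""
{"beans": [
  {"name": "java.lang:type=Memory",
   "HeapMemoryUsage": {"init": 134217728, "max": 1073741824,
                       "committed": 536870912, "used": 268435456},
   "ObjectPendingFinalizationCount": 0},
  {"name": "java.lang:type=Threading", "ThreadCount": 42, "PeakThreadCount": 57}
]}
""")


def fetch_jmx(url, timeout=5.0):
    """Download and decode a /jmx JSON document from the given URL."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_bean(payload, name):
    """Return the first bean whose "name" equals name, or None."""
    for bean in payload.get("beans", []):
        if bean.get("name") == name:
            return bean
    return None


def heap_usage_percent(payload):
    """Heap used as a percentage of max, or None if it cannot be computed."""
    memory = find_bean(payload, "java.lang:type=Memory")
    if memory is None:
        return None
    heap = memory["HeapMemoryUsage"]
    if heap["max"] <= 0:  # the JVM reports -1 when max is undefined
        return None
    return 100.0 * heap["used"] / heap["max"]


if __name__ == "__main__":
    print(heap_usage_percent(SAMPLE_PAYLOAD))
```

In real use, SAMPLE_PAYLOAD would be replaced by the result of fetch_jmx for the daemon being checked.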


10. Attributes common to all Hadoop processes

JvmMetrics: GcCount, GcCountPS MarkSweep, GcCountPS Scavenge, GcTimeMillis,GcTimeMillisPS MarkSweep,  GcTimeMillisPS Scavenge, LogError,LogFatal,  LogInfo, LogWarn, MemHeapCommittedM, MemHeapUsedM,MemMaxM, MemNonHeapCommittedM, MemNonHeapUsedM, ThreadsBlocked, ThreadsNew, ThreadsRunnable, ThreadsTerminated, ThreadsTimedWaiting, ThreadsWaiting,  tag.Context, tag.Hostname, tag.ProcessName , tag.SessionId

MetricsSystemStats :DroppedPubAll, NumActiveSinks, NumActiveSources, NumAllSinks, NumAllSources, PublishAvgTime, PublishNumOps, SnapshotAvgTime, SnapshotNumOps, tag.Context,  tag.Hostname

 

StartupProgress: ElapsedTime, LoadingEditsCount, LoadingEditsElapsedTime, LoadingEditsPercentComplete, LoadingEditsTotal,  LoadingFsImageCount, LoadingFsImageElapsedTime, LoadingFsImagePercentComplete, LoadingFsImageTotal,PercentComplete, SafeModeCount, SafeModeElapsedTime, SafeModePercentComplete, SafeModeTotal, SavingCheckpointCount, SavingCheckpointElapsedTime, SavingCheckpointPercentComplete, SavingCheckpointTotal, tag.Hostname

UgiMetrics (User and group):  LoginFailureAvgTime, LoginFailureNumOps, LoginSuccessAvgTime, LoginSuccessNumOps, tag.Context, tag.Hostname
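Several JvmMetrics attributes are more useful as derived values than as raw numbers. A minimal sketch, assuming GcTimeMillis is a cumulative counter and that two samples were taken interval_seconds apart: it estimates the share of wall-clock time spent in garbage collection and the heap fill level from MemHeapUsedM and MemMaxM. The sample dictionaries are invented.

```python
def gc_time_fraction(prev, curr, interval_seconds):
    """Fraction of wall-clock time spent in GC between two JvmMetrics samples.

    Returns None when the counter went backwards, which usually means the
    process restarted between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_ms = curr["GcTimeMillis"] - prev["GcTimeMillis"]
    if delta_ms < 0:
        return None
    return delta_ms / (interval_seconds * 1000.0)


def heap_used_fraction(sample):
    """MemHeapUsedM divided by MemMaxM, or None if MemMaxM is not positive."""
    if sample["MemMaxM"] <= 0:
        return None
    return sample["MemHeapUsedM"] / sample["MemMaxM"]


if __name__ == "__main__":
    # Invented samples taken 60 seconds apart.
    earlier = {"GcTimeMillis": 1000, "MemHeapUsedM": 400.0, "MemMaxM": 2048.0}
    later = {"GcTimeMillis": 1600, "MemHeapUsedM": 512.0, "MemMaxM": 2048.0}
    print(gc_time_fraction(earlier, later, 60))
    print(heap_used_fraction(later))
```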


11. NameNode monitoring:

FSNamesystem: BlockCapacity, BlocksTotal, CapacityRemaining, CapacityTotal,CapacityUsed,CapacityUsedNonDFS,CorruptBlocks, ExcessBlocks, ExpiredHeartbeats, FilesTotal,LastCheckpointTime, LastWrittenTransactionId, MillisSinceLastLoadedEdits, MissingBlocks, PendingDataNodeMessageCount, PendingDeletionBlocks, PendingReplicationBlocks, PostponedMisreplicatedBlocks, ScheduledReplicationBlocks, Snapshots, SnapshottableDirectories, StaleDataNodes, TotalFiles, TotalLoad, TransactionsSinceLastCheckpoint, TransactionsSinceLastLogRoll, UnderReplicatedBlocks, tag.Context, tag.HAState, tag.Hostname

FSNamesystemState: BlocksTotal, CapacityRemaining, CapacityTotal, CapacityUsed, FSState, FilesTotal, NumDeadDataNodes, NumStaleDataNodes, ScheduledReplicationBlocks, TotalLoad, UnderReplicatedBlocks

NameNodeActivity: AddBlockOps, AllowSnapshotOps, BlockReportAvgTime, BlockReportNumOps, CreateFileOps, CreateSnapshotOps, CreateSymlinkOps, DeleteFileOps,  DeleteSnapshotOps, DisallowSnapshotOps, FileInfoOps, FilesAppended, FilesCreated, FilesDeleted, FilesInGetListingOps, FilesRenamed, FsImageLoadTime, GetAdditionalDatanodeOps, GetBlockLocations, GetLinkTargetOps, GetListingOps, ListSnapshottableDirOps, RenameSnapshotOps,SafeModeTime , SnapshotDiffReportOps, SyncsAvgTime, TransactionsAvgTime, TransactionsBatchedInSync, TransactionsNumOps, tag.Context, tag.Hostname, tag.ProcessName

NameNodeInfo:BlockPoolId, BlockPoolUsedSpace, ClusterId, DeadNodes, DecomNodes, DistinctVersionCount, DistinctVersions,Free,  JournalTransactionInfo, LiveNodes, NameDirStatuses, NonDfsUsedSpace, NumberOfMissingBlocks, PercentBlockPoolUsed, PercentRemaining, PercentUsed,Safemode, Threads, Total, TotalBlocks,TotalFiles, UpgradeFinalized, Used, Version 

RpcActivityForPort9000: CallQueueLength,NumOpenConnections, ReceivedBytes,RpcAuthenticationFailures, RpcAuthenticationSuccesses, RpcAuthorizationFailures, RpcAuthorizationSuccesses, RpcProcessingTimeAvgTime,RpcProcessingTimeNumOps,   RpcQueueTimeAvgTime, RpcQueueTimeNumOps, SentBytes, tag.Context, tag.Hostname, tag.port

RpcDetailedActivityForPort9000: AddBlockAvgTime, AddBlockNumOps, BlockReceivedAndDeletedAvgTime, BlockReceivedAndDeletedNumOps, BlockReportAvgTime, BlockReportNumOps, CommitBlockSynchronizationAvgTime, CommitBlockSynchronizationNumOps, CompleteAvgTime, CompleteNumOps, CreateAvgTime, CreateNumOps, DeleteAvgTime, DeleteNumOps, FsyncAvgTime, FsyncNumOps, GetBlockLocationsAvgTime, GetBlockLocationsNumOps, GetEditLogManifestAvgTime, GetEditLogManifestNumOps, GetFileInfoAvgTime, GetFileInfoNumOps, GetListingAvgTime, GetListingNumOps, GetServerDefaultsAvgTime, GetServerDefaultsNumOps, GetTransactionIdAvgTime, GetTransactionIdNumOps, MkdirsAvgTime, MkdirsNumOps, RecoverLeaseAvgTime, RecoverLeaseNumOps, RegisterDatanodeAvgTime, RegisterDatanodeNumOps, RenameAvgTime, RenameNumOps, RenewLeaseAvgTime, RenewLeaseNumOps, RollEditLogAvgTime, RollEditLogNumOps, SendHeartbeatAvgTime, SendHeartbeatNumOps, SetSafeModeAvgTime, SetSafeModeNumOps, SetTimesAvgTime, SetTimesNumOps, UpdateBlockForPipelineAvgTime, UpdateBlockForPipelineNumOps, UpdatePipelineAvgTime, UpdatePipelineNumOps, VersionRequestAvgTime, VersionRequestNumOps, tag.Context, tag.Hostname, tag.port

JvmMetrics:

MetricsSystemStats :

StartupProgress

UgiMetrics (User and group)
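As one possible use of the FSNamesystem attributes listed in this section, the sketch below turns a single sample into a list of warnings. The 80% capacity threshold and the sample values are arbitrary assumptions chosen for illustration, not recommendations from these notes.

```python
def namenode_warnings(fs, max_used_fraction=0.8):
    """Return human-readable warnings for one FSNamesystem sample (a dict)."""
    warnings = []
    # Any non-zero value for these block counters deserves a look.
    for key in ("MissingBlocks", "CorruptBlocks", "UnderReplicatedBlocks"):
        value = fs.get(key, 0)
        if value > 0:
            warnings.append(f"{key}={value}")
    total = fs.get("CapacityTotal", 0)
    if total > 0:
        used_fraction = fs.get("CapacityUsed", 0) / total
        if used_fraction > max_used_fraction:
            warnings.append(f"CapacityUsed at {used_fraction:.0%} of CapacityTotal")
    return warnings


if __name__ == "__main__":
    # Invented sample values.
    sample = {
        "MissingBlocks": 0,
        "CorruptBlocks": 2,
        "UnderReplicatedBlocks": 15,
        "CapacityTotal": 1000,
        "CapacityUsed": 900,
    }
    for line in namenode_warnings(sample):
        print(line)
```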


12. DataNode monitoring:

DataNodeActivity:BlockChecksumOpAvgTime, BlockChecksumOpNumOps,BlockReportsAvgTime,BlockReportsNumOps,BlockVerificationFailures,BlocksGetLocalPathInfo, BlocksRead, BlocksRemoved, BlocksReplicated, BlocksVerified, BlocksWritten, BytesRead,BytesWritten, CopyBlockOpAvgTime,CopyBlockOpNumOps,FlushNanosAvgTime,FlushNanosNumOps,FsyncCount,  FsyncNanosAvgTime,  FsyncNanosNumOps,  PacketAckRoundTripTimeNanosAvgTime,   PacketAckRoundTripTimeNanosNumOps, ReadBlockOpAvgTime, ReadBlockOpNumOps

DataNodeInfo:ClusterId,HttpPort,NamenodeAddresses,RpcPort,Version,VolumeInfo,XceiverCount

FSDatasetState:Capacity,DfsUsed,NumFailedVolumes,Remaining,StorageInfo

RpcActivityForPort50020:CallQueueLength,NumOpenConnections, ReceivedBytes,RpcAuthenticationFailures, RpcAuthenticationSuccesses, RpcAuthorizationFailures, RpcAuthorizationSuccesses, RpcProcessingTimeAvgTime,RpcProcessingTimeNumOps,   RpcQueueTimeAvgTime,  RpcQueueTimeNumOps, SentBytes, tag.Context, tag.Hostname,  tag.port

RpcDetailedActivityForPort50020:tag.Context, tag.Hostname, tag.port

 

JvmMetrics

MetricsSystemStats :

StartupProgress: 

UgiMetrics (User and group): 

 

13. SecondaryNameNode monitoring:

 

JvmMetrics:

MetricsSystemStats :

StartupProgress: 

UgiMetrics (User and group):  


14. HMaster monitoring:

IPC:ProcessCallTime ,QueueCallTime ,authenticationFailures,authenticationSuccesses,authorizationFailures,authorizationSuccesses,numActiveHandler,numCallsInGeneralQueue,numCallsInPriorityQueue,numCallsInReplicationQueue,numOpenConnections,queueSize,receivedBytes,sentBytes,tag.Context,tag.Hostname

AssignmentManger:Assign ,BulkAssign ,ritCount,ritCountOverThreshold,ritOldestAge,tag.Context,tag.Hostname

Balancer:BalancerCluster ,miscInvocationCount,tag.Context,tag.Hostname

FileSystem:HlogSplitSize ,HlogSplitTime ,MetaHlogSplitSize ,MetaHlogSplitTime ,tag.Context,tag.Hostname

Server:averageLoad,clusterRequests,masterActiveTime,masterStartTime,numDeadRegionServers,numRegionServers,tag.Context,tag.Hostname,tag.clusterId,tag.deadRegionServers,tag.isActiveMaster,tag.liveRegionServers,tag.serverName,tag.zookeeperQuorum

 

JvmMetrics:

MetricsSystemStats :

StartupProgress: 

UgiMetrics (User and group):  

 

15. HRegionServer monitoring:

IPC:ProcessCallTime ,QueueCallTime ,authenticationFailures,authenticationSuccesses,authorizationFailures,authorizationSuccesses,numActiveHandler,numCallsInGeneralQueue,numCallsInPriorityQueue,numCallsInReplicationQueue,numOpenConnections,queueSize,receivedBytes,sentBytes,tag.Context,tag.Hostname

Regions:tablename_get(75th_percentile,    95th_percentile, 99th_percentile, max, mean, median, min, num_ops),  tablename_scanNext(75th_percentile,    95th_percentile, 99th_percentile, max, mean, median, min, num_ops),  coprocessorExecutionStatistics, region_appendCount,   region_compactionsCompletedCount,  region_deleteCount,  region_incrementCount,  region_memStoreSize,  region_mutateCount,  region_numBytesCompactedCount,  region_numFilesCompactedCount,  region_storeCount,  region_storeFileCount,  region_storeFileSize

Replication:tag.Context,tag.Hostname

Server:Append  ,Delete ,Get ,Increment ,Mutate ,Replay ,blockCacheCount,blockCacheEvictionCount,blockCacheExpressHitPercent,blockCacheFreeSize, blockCacheHitCount,blockCacheMissCount,blockCacheSize,blockCountHitPercent,checkMutateFailedCount,checkMutatePassedCount,compactedCellsCount,compactedCellsSize,compactionQueueLength,flushQueueLength,flushedCellsCount,flushedCellsSize,hlogFileCount,hlogFileSize,majorCompactedCellsCount,majorCompactedCellsSize,memStoreSize,mutationsWithoutWALCount,mutationsWithoutWALSize,percentFilesLocal,readRequestCount,regionCount,regionServerStartTime,slowAppendCount,slowDeleteCount,slowGetCount,slowIncrementCount,slowPutCount,staticBloomSize,staticIndexSize,storeCount,storeFileCount,storeFileIndexSize,storeFileSize,totalRequestCount,updatesBlockedTime,writeRequestCount,tag.Context,tag.Hostname,tag.clusterId, tag.serverName,tag.zookeeperQuorum

WAL:AppendSize ,AppendTime ,SyncTime ,appendCount,slowAppendCount,tag.Context,tag.Hostname

 

JvmMetrics:

MetricsSystemStats :

StartupProgress: 

UgiMetrics (User and group):  
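Two values that are often derived from the Server attributes of a region server are the block cache hit ratio and the request rate. A minimal sketch, assuming blockCacheHitCount, blockCacheMissCount and totalRequestCount are cumulative counters sampled interval_seconds apart; the sample numbers are invented.

```python
def block_cache_hit_ratio(server):
    """Hits divided by hits plus misses, or None before any cache access."""
    hits = server["blockCacheHitCount"]
    misses = server["blockCacheMissCount"]
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def requests_per_second(prev, curr, interval_seconds):
    """Average request rate between two samples of totalRequestCount.

    Returns None when the counter went backwards (for example after a restart).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr["totalRequestCount"] - prev["totalRequestCount"]
    if delta < 0:
        return None
    return delta / interval_seconds


if __name__ == "__main__":
    # Invented samples taken 30 seconds apart.
    earlier = {"totalRequestCount": 1000,
               "blockCacheHitCount": 800, "blockCacheMissCount": 100}
    later = {"totalRequestCount": 4000,
             "blockCacheHitCount": 900, "blockCacheMissCount": 100}
    print(block_cache_hit_ratio(later))
    print(requests_per_second(earlier, later, 30))
```

Because the ratio is computed from lifetime counters, it reacts slowly; applying the same formula to the difference between two samples gives the hit ratio for just that interval.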

 

16. ZooKeeper monitoring:

ReplicatedServer_id1:Name,QuorumSize

replica.0:Name,QuorumAddress

replica.1:Name,QuorumAddress

replica.2:Name,QuorumAddress

Leader:AvgRequestLatency,ClientPort,CurrentZxid,MaxClientCnxnsPerHost,MaxRequestLatency,MaxSessionTimeout,MinRequestLatency,MinSessionTimeout,NumAliveConnections,OutstandingRequests,PacketsReceived,PacketsSent,StartTime,TickTime,Version

InMemoryDataTree:LastZxid,NodeCount,WatchCount

Connection:AvgLatency,EphemeralNodes,LastCxid,LastLatency,LastOperation,LastResponseTime,LastZxid,MaxLatency,MinLatency,OutstandingRequests, PacketsReceived,PacketsSent,SessionId,SessionTimeout,SourceIP,StartedTime


17. Thrift Server monitoring:

ThriftOne:  BatchGet, BatchMutate, SlowThriftCall, ThriftCall, TimeInQueue, callQueueLen, tag.Hostname, tag.Context

ThriftTwo:  same as ThriftOne

JvmMetrics

 

MetricsSystemStats : 

UgiMetrics (User and group):  

 


 

posted @ 2014-11-14 15:59  lihui1625