[Hadoop Source Code Walkthrough] [RPC Operations] The Block Deletion Flow
How HDFS deletes blocks
The background for this article is a problem I ran into at work: during an upgrade, deleted data could not be kept in the Trash. Before writing this I only had a rough understanding of how the DataNode works, so please correct me where the details are off.
First, compare how I delete a file with how Hadoop deletes a file.
// My delete
File deleteFile = new File("C:/data/test");
deleteFile.delete();

// Hadoop's delete (excerpt)
public void removeBlock(BlockInfo block) {
  assert namesystem.hasWriteLock();
  // No need to ACK blocks that are being removed entirely
  // from the namespace, since the removal of the associated
  // file already removes them from the block map below.
  block.setNumBytes(BlockCommand.NO_ACK);
  addToInvalidates(block);
  removeBlockFromMap(block);
  // Remove the block from pendingReconstruction and neededReconstruction
  PendingBlockInfo remove = pendingReconstruction.remove(block);
  if (remove != null) {
    DatanodeStorageInfo.decrementBlocksScheduled(remove.getTargets()
        .toArray(new DatanodeStorageInfo[remove.getTargets().size()]));
  }
  neededReconstruction.remove(block, LowRedundancyBlocks.LEVEL);
  postponedMisreplicatedBlocks.remove(block);
}
// To make removal efficient, Hadoop implements its own data structure
protected LinkedElement<T> removeElem(final T key) {
  LinkedElement<T> found = null;
  final int hashCode = key.hashCode();
  final int index = getIndex(hashCode);
  if (entries[index] == null) {
    return null;
  } else if (hashCode == entries[index].hashCode &&
      entries[index].element.equals(key)) {
    // remove the head of the bucket linked list
    modification++;
    size--;
    found = entries[index];
    entries[index] = found.next;
  } else {
    // head != null and key is not equal to head
    // search the element
    LinkedElement<T> prev = entries[index];
    for (found = prev.next; found != null;) {
      if (hashCode == found.hashCode &&
          found.element.equals(key)) {
        // found the element, remove it
        modification++;
        size--;
        prev.next = found.next;
        found.next = null;
        break;
      } else {
        prev = found;
        found = found.next;
      }
    }
  }
  return found;
}
As you can see, Hadoop's block-deletion logic is truly elegant (read: about as complicated as complicated gets).
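For orientation, this removeElem appears to come from org.apache.hadoop.util.LightWeightHashSet, the lightweight open-hashing set that backs, among other things, the per-DataNode invalidate queues we will meet later. The excerpt references entries, getIndex, and LinkedElement without showing them, so here is a minimal self-contained sketch of those missing pieces, modeled on the excerpt rather than copied from Hadoop:

// A sketch of the structure removeElem operates on: a bucket array of
// hand-rolled chains, with each node caching its element's hash code.
// Names mirror the excerpt; the class itself is illustrative.
class BucketSetSketch<T> {

  // One bucket-chain node: the element, its cached hash, and the next link.
  static final class LinkedElement<T> {
    final T element;
    final int hashCode;     // cached so bucket scans never recompute it
    LinkedElement<T> next;  // the chain pointer removeElem rewires

    LinkedElement(T element) {
      this.element = element;
      this.hashCode = element.hashCode();
    }
  }

  private final LinkedElement<T>[] entries;

  @SuppressWarnings("unchecked")
  BucketSetSketch(int capacity) {
    // capacity is kept a power of two so getIndex can mask instead of modulo
    entries = (LinkedElement<T>[]) new LinkedElement[capacity];
  }

  private int getIndex(int hashCode) {
    return hashCode & (entries.length - 1);
  }

  // Insert at the head of the bucket chain: O(1), no rehashing of other
  // elements (duplicate handling omitted in this sketch).
  void add(T key) {
    LinkedElement<T> e = new LinkedElement<>(key);
    int index = getIndex(e.hashCode);
    e.next = entries[index];
    entries[index] = e;
  }
}

With the bucket array and cached hash codes in place, removeElem is just a chain walk that rewires one next pointer: no rehashing and no extra allocation on the delete path.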
1. The RPC Flow
Now, back to business: let's follow a Delete request end to end to see how HDFS removes data.
Every HDFS request enters through the famous NameNodeRpcServer class, so let's start by reading the NameNodeRpcServer#delete method.
@Override // ClientProtocol
public boolean delete(String src, boolean recursive) throws IOException {
  checkNNStartup();
  if (stateChangeLog.isDebugEnabled()) {
    stateChangeLog.debug("*DIR* Namenode.delete: src=" + src
        + ", recursive=" + recursive);
  }
  namesystem.checkOperation(OperationCategory.WRITE); // Standby nodes do not accept writes
  // The RetryCache holds the results of recently processed requests so that
  // retries can be answered without re-executing them.
  CacheEntry cacheEntry = getCacheEntry();
  if (cacheEntry != null && cacheEntry.isSuccess()) {
    return true; // Return previous response
  }
  boolean ret = false;
  try {
    // The method that does the real work
    ret = namesystem.delete(src, recursive, cacheEntry != null);
  } finally {
    // Record the outcome in the RetryCache
    RetryCache.setState(cacheEntry, ret);
  }
  if (ret)
    metrics.incrDeleteFileOps();
  return ret;
}
When handling a delete request, the NameNode first checks two preconditions:
- the NN's state: it must be active and fully started
- the request has not been processed before, or an earlier retry did not succeed (see the sketch after this list)
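To make the second precondition concrete, here is a minimal sketch of the retry-cache idiom, assuming a (clientId, callId) key. It is illustrative and far simpler than Hadoop's actual RetryCache, which also tracks in-progress calls and expires entries; the point is that a delete retried after a dropped response must not fail just because the first attempt already removed the file:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative retry-cache sketch, not Hadoop's RetryCache API.
class RetryCacheSketch {
  private final Map<String, Boolean> results = new ConcurrentHashMap<>();

  boolean delete(String clientId, int callId, String src) {
    String key = clientId + ":" + callId;   // identifies one logical client call
    Boolean cached = results.get(key);
    if (Boolean.TRUE.equals(cached)) {
      return true;                          // a retry of an already-successful call
    }
    boolean ret = doDelete(src);            // the real, non-idempotent operation
    results.put(key, ret);                  // remember the outcome for future retries
    return ret;
  }

  // Stand-in for the real delete; always "succeeds" in this sketch.
  private boolean doDelete(String src) {
    return true;
  }
}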
Next, let's follow the code down into the FSNamesystem#delete method.
/**
 * Remove the indicated file from namespace.
 *
 * @see ClientProtocol#delete(String, boolean) for detailed description and
 * description of exceptions
 */
boolean delete(String src, boolean recursive, boolean logRetryCache)
    throws IOException {
  final String operationName = "delete";
  BlocksMapUpdateInfo toRemovedBlocks = null;
  checkOperation(OperationCategory.WRITE);
  final FSPermissionChecker pc = getPermissionChecker();
  FSPermissionChecker.setOperationType(operationName);
  boolean ret = false;
  try {
    // Phase one: a separate write-lock acquisition for the metadata update
    writeLock();
    try {
      checkOperation(OperationCategory.WRITE);
      checkNameNodeSafeMode("Cannot delete " + src);
      // Remove the metadata held by the NN; concurrency is controlled by
      // the FSNamesystemLock
      toRemovedBlocks = FSDirDeleteOp.delete(
          this, pc, src, recursive, logRetryCache);
      ret = toRemovedBlocks != null;
    } finally {
      writeUnlock(operationName);
    }
  } catch (AccessControlException e) {
    logAuditEvent(false, operationName, src);
    throw e;
  }
  getEditLog().logSync();
  logAuditEvent(ret, operationName, src);
  if (toRemovedBlocks != null) {
    // This is where the blocks really get deleted.
    // The FGL branch is Huawei Cloud's fine-grained locking scheme; see HDFS-14703
    if (getFSDirectory().isFGLEnabled()) {
      removeBlocksWithFGL(toRemovedBlocks, src, pc); // Incremental deletion of blocks
    } else {
      // Block information is removed here
      removeBlocks(toRemovedBlocks); // Incremental deletion of blocks
    }
    src = toRemovedBlocks.getFileRemoved() != 0
        ? src + " (filecount=" + toRemovedBlocks.getFileRemoved() + ")" : src;
  }
  return ret;
}
HDFS carries out a delete in stages: it first takes the write lock to remove the metadata, and once that is done it takes the lock again to remove the block information.
A bit of personal commentary here. HDFS takes a global-lock approach: all read and write operations share one lock, so 100% of requests contend on it, and this is a key HDFS performance bottleneck. The community did offer fine-grained locking early on, but that design attached a lock to every single INode, which on a very large cluster quickly exhausts resources, so the community rolled the scheme back in favor of the global lock. How, then, to reconcile the two? Alluxio 2.0 offers one answer: a LockPool. There is a shared pool of lock resources, and an inode no longer owns its own lock; instead, a thread that needs to lock an inode borrows a lock from the pool, incrementing its reference count, and decrements the count again when it unlocks. Huawei Cloud's implementation is different: it first takes the global lock to locate the corresponding fine-grained lock (FGL), then releases the global lock and performs the rest of the operation under the fine-grained lock. You can weigh the trade-offs between the two approaches yourself.
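To make the LockPool idea concrete, here is a minimal sketch assuming reference-counted borrowing from a shared pool; it is illustrative, not Alluxio's actual implementation. An idle inode holds no lock object: a lock exists only while at least one thread references it.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative lock-pool sketch with reference counting.
class LockPoolSketch {
  private static final class Entry {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    int refCount; // mutated only inside the map's atomic compute callbacks
  }

  private final ConcurrentHashMap<Long, Entry> pool = new ConcurrentHashMap<>();

  // Borrow the lock for an inode, creating it on first use.
  ReentrantReadWriteLock acquire(long inodeId) {
    Entry e = pool.compute(inodeId, (id, cur) -> {
      if (cur == null) {
        cur = new Entry();
      }
      cur.refCount++;
      return cur;
    });
    return e.lock;
  }

  // Return the lock; the entry is reclaimed once its reference count hits zero.
  void release(long inodeId) {
    pool.computeIfPresent(inodeId, (id, cur) -> --cur.refCount == 0 ? null : cur);
  }
}

A caller pairs acquire/lock with unlock/release, in that order, so a lock object is never reclaimed while still held; memory usage scales with the number of inodes currently being locked rather than the number of inodes in the namespace.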
Moving on, we will skip the metadata-manipulation part, since this article focuses on block deletion, and go straight to the FSNamesystem#removeBlocks method.
/**
 * From the given list, incrementally remove the blocks from blockManager
 * Writelock is dropped and reacquired every BLOCK_DELETION_INCREMENT to
 * ensure that other waiters on the lock can get in. See HDFS-2938
 *
 * @param blocks
 *          An instance of {@link BlocksMapUpdateInfo} which contains a list
 *          of blocks that need to be removed from blocksMap
 */
void removeBlocks(BlocksMapUpdateInfo blocks) {
  List<BlockInfo> toDeleteList = blocks.getToDeleteList();
  Iterator<BlockInfo> iter = toDeleteList.iterator();
  while (iter.hasNext()) {
    writeLock();
    try {
      for (int i = 0; i < blockDeletionIncrement && iter.hasNext(); i++) {
        blockManager.removeBlock(iter.next());
      }
    } finally {
      writeUnlock("removeBlocks");
    }
  }
}
There is not much to add here: the loop releases and re-acquires the write lock every blockDeletionIncrement blocks (configurable via dfs.namenode.block.deletion.increment) so that other waiters can get in. Let's continue down into the BlockManager#removeBlock method.
public void removeBlock(BlockInfo block) {
  assert namesystem.hasWriteLock();
  // No need to ACK blocks that are being removed entirely
  // from the namespace, since the removal of the associated
  // file already removes them from the block map below.
  block.setNumBytes(BlockCommand.NO_ACK);
  // Adding the block to the invalidates queue is what schedules its deletion;
  // this is the key step
  addToInvalidates(block);
  removeBlockFromMap(block);
  // Remove the block from pendingReconstruction and neededReconstruction
  // pendingReconstruction holds blocks whose replication commands have been
  // generated and are waiting to be sent to DNs
  // neededReconstruction holds blocks whose replication commands are yet to
  // be generated
  PendingBlockInfo remove = pendingReconstruction.remove(block);
  if (remove != null) {
    DatanodeStorageInfo.decrementBlocksScheduled(remove.getTargets()
        .toArray(new DatanodeStorageInfo[remove.getTargets().size()]));
  }
  neededReconstruction.remove(block, LowRedundancyBlocks.LEVEL);
  // Also drop the block from the postponedMisreplicatedBlocks queue,
  // clearing out operations on the block that have not yet had a chance to run
  postponedMisreplicatedBlocks.remove(block);
}
This method adds the block to the deletion queue (invalidates), then removes it from the block map and from the pending and needed reconstruction queues, so that no useless replication work is scheduled for it.
Now let's look at the BlockManager#addToInvalidates method:
/**
 * Adds block to list of blocks which will be invalidated on all its
 * datanodes.
 */
private void addToInvalidates(BlockInfo storedBlock) {
  if (!isPopulatingReplQueues()) {
    return;
  }
  StringBuilder datanodes = blockLog.isDebugEnabled()
      ? new StringBuilder() : null;
  for (DatanodeStorageInfo storage : blocksMap.getStorages(storedBlock)) {
    if (storage.getState() != State.NORMAL) {
      continue;
    }
    final DatanodeDescriptor node = storage.getDatanodeDescriptor();
    final Block b = getBlockOnStorage(storedBlock, storage);
    if (b != null) {
      // This is where the block is added to the invalidate queue
      invalidateBlocks.add(b, node, false);
      if (datanodes != null) {
        datanodes.append(node).append(" ");
      }
    }
  }
  if (datanodes != null && datanodes.length() != 0) {
    blockLog.debug("BLOCK* addToInvalidates: {} {}", storedBlock, datanodes);
  }
}
After a block has been added to the InvalidateBlocks queue, it waits for the DataNode's next heartbeat: the NameNode piggybacks the deletion command for the block onto the heartbeat response.
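Here is a compressed sketch of that piggybacking pattern, with illustrative names; in real HDFS the drained batch is wrapped in a BlockCommand with action DNA_INVALIDATE, and the per-heartbeat cap corresponds to dfs.block.invalidate.limit:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a per-datanode queue of blocks to invalidate, drained
// in bounded batches when that datanode's heartbeat arrives.
class InvalidateQueueSketch {
  private final Map<String, Deque<Long>> pendingInvalidates = new HashMap<>();
  private final int blockInvalidateLimit = 1000; // cap per heartbeat response

  // NameNode side of addToInvalidates: queue a block for one datanode.
  synchronized void addToInvalidates(String datanodeId, long blockId) {
    pendingInvalidates.computeIfAbsent(datanodeId, k -> new ArrayDeque<>())
        .add(blockId);
  }

  // Called while handling a heartbeat: drain up to the limit and hand the
  // batch back as part of the heartbeat response.
  synchronized List<Long> pollInvalidateWork(String datanodeId) {
    Deque<Long> queue = pendingInvalidates.get(datanodeId);
    if (queue == null || queue.isEmpty()) {
      return Collections.emptyList();
    }
    List<Long> toDelete = new ArrayList<>();
    for (int i = 0; i < blockInvalidateLimit && !queue.isEmpty(); i++) {
      toDelete.add(queue.poll());
    }
    return toDelete;
  }
}

The cap matters: deleting a huge directory spreads its block invalidations over many heartbeats instead of stalling a DataNode with one enormous command.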
How do the scheduled deletions actually remove blocks from disk? For that we need to read the FsDatasetAsyncDiskService class. It implements the DataNode's disk-side operations such as deletion, and it runs deletions on a thread pool rather than creating a new thread for every delete.
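Before reading the real class, here is a minimal sketch of the per-volume pattern it uses, assuming one small pool per disk volume so that a slow disk cannot stall deletions on the others; the pool sizing and method names are illustrative, not FsDatasetAsyncDiskService's API:

import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of asynchronous, per-volume replica deletion.
class AsyncDiskDeleteSketch {
  private static final int THREADS_PER_VOLUME = 4; // illustrative sizing

  private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

  private ExecutorService executorFor(String volume) {
    // One small fixed pool per volume, created lazily on first use.
    return executors.computeIfAbsent(volume,
        v -> Executors.newFixedThreadPool(THREADS_PER_VOLUME));
  }

  // Queue the replica's files for deletion instead of unlinking them on the
  // caller's (e.g. heartbeat-handling) thread.
  void deleteAsync(String volume, File blockFile, File metaFile) {
    executorFor(volume).execute(() -> {
      // Non-short-circuit & so both the meta file and the block file are
      // always attempted.
      boolean ok = metaFile.delete() & blockFile.delete();
      if (!ok) {
        System.err.println("Failed to fully delete replica " + blockFile);
      }
    });
  }
}

Keeping a separate pool per volume also means deletion throughput scales with the number of disks in the DataNode.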