[Hadoop Source Code Walkthrough] [RPC Operations] The Block Deletion Flow

The block deletion mechanism of HDFS

This article grew out of an issue I hit at work where data could not be kept in the Trash during an upgrade. Before writing it I only had a rough understanding of how the DN works, so please correct me where the details are not accurate.

Let's compare how I delete a file with how Hadoop deletes a file.

// How I delete a file
File deleteFile = new File("C:/data/test");
deleteFile.delete();

// How Hadoop deletes a file (excerpt)
public void removeBlock(BlockInfo block) {
    assert namesystem.hasWriteLock();
    // No need to ACK blocks that are being removed entirely
    // from the namespace, since the removal of the associated
    // file already removes them from the block map below.
    block.setNumBytes(BlockCommand.NO_ACK);
    addToInvalidates(block);
    removeBlockFromMap(block);
    // Remove the block from pendingReconstruction and neededReconstruction
    PendingBlockInfo remove = pendingReconstruction.remove(block);
    if (remove != null) {
      DatanodeStorageInfo.decrementBlocksScheduled(remove.getTargets()
          .toArray(new DatanodeStorageInfo[remove.getTargets().size()]));
    }
    neededReconstruction.remove(block, LowRedundancyBlocks.LEVEL);
    postponedMisreplicatedBlocks.remove(block);
}

// A custom data structure is implemented to make deletion more efficient
protected LinkedElement<T> removeElem(final T key) {
    LinkedElement<T> found = null;
    final int hashCode = key.hashCode();
    final int index = getIndex(hashCode);
    if (entries[index] == null) {
        return null;
    } else if (hashCode == entries[index].hashCode &&
                entries[index].element.equals(key)) {
        // remove the head of the bucket linked list
        modification++;
        size--;
        found = entries[index];
        entries[index] = found.next;
    } else {
        // head != null and key is not equal to head
        // search the element
        LinkedElement<T> prev = entries[index];
        for (found = prev.next; found != null;) {
            if (hashCode == found.hashCode &&
                found.element.equals(key)) {
                // found the element, remove it
                modification++;
                size--;
                prev.next = found.next;
                found.next = null;
                break;
            } else {
                prev = found;
                found = found.next;
            }
        }
    }
    return found;
}

As you can see, Hadoop's block deletion logic is truly "elegant" (read: complexity holding the door open for complexity, about as complex as it gets).

1. The RPC flow

Now let's get back to the point and follow a complete Delete request to see how HDFS removes data.
The entry point for HDFS requests is the well-known NameNodeRpcServer class, so let's start by reading the NameNodeRpcServer#delete method.

@Override // ClientProtocol
public boolean delete(String src, boolean recursive) throws IOException {
    checkNNStartup();
    if (stateChangeLog.isDebugEnabled()) {
        stateChangeLog.debug("*DIR* Namenode.delete: src=" + src
                            + ", recursive=" + recursive);
    }
    namesystem.checkOperation(OperationCategory.WRITE); // Standby nodes do not accept write operations
    // The RetryCache remembers the outcome of previously processed requests so retries can be answered directly.
    CacheEntry cacheEntry = getCacheEntry();
    if (cacheEntry != null && cacheEntry.isSuccess()) {
        return true; // Return previous response
    }

    boolean ret = false;
    try {
        // The actual handling logic
        ret = namesystem.delete(src, recursive, cacheEntry != null);
    } finally {
        // Record the result in the RetryCache
        RetryCache.setState(cacheEntry, ret);
    }
    if (ret)
        metrics.incrDeleteFileOps();
    return ret;
}

Before handling a delete request, the NameNode first checks two preconditions:

  1. The NN state: it must be active and fully started.
  2. The request has not been handled before, or the cached retry entry for it did not succeed (a minimal sketch of the retry-cache idea is shown below).
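
To make the second precondition concrete, here is a minimal sketch of the retry-cache idea, assuming a cache keyed by (clientId, callId); the class and method names are illustrative and not Hadoop's actual RetryCache API.

import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a retry cache: results are remembered per (clientId, callId)
// so a retried RPC returns the previous outcome instead of re-running the operation.
public class SimpleRetryCache {
    // Cached outcome of a previously processed call.
    public static final class Entry {
        private final boolean success;
        Entry(boolean success) { this.success = success; }
        public boolean isSuccess() { return success; }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

    // Look up a previous result for this client/call pair, if any.
    public Entry lookup(String clientId, int callId) {
        return cache.get(clientId + "#" + callId);
    }

    // Record the outcome so a later retry can short-circuit.
    public void record(String clientId, int callId, boolean success) {
        cache.put(clientId + "#" + callId, new Entry(success));
    }
}

A handler would call lookup() first and return the cached result if it exists, run the real operation otherwise, and record() the outcome in a finally block, which is exactly the shape of the delete() method above.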

After that, let's keep following the code down into the FSNamesystem#delete method.

/**
* Remove the indicated file from namespace.
*
* @see ClientProtocol#delete(String, boolean) for detailed description and 
* description of exceptions
*/
boolean delete(String src, boolean recursive, boolean logRetryCache)
    throws IOException {
    final String operationName = "delete";
    BlocksMapUpdateInfo toRemovedBlocks = null;
    checkOperation(OperationCategory.WRITE);
    final FSPermissionChecker pc = getPermissionChecker();
    FSPermissionChecker.setOperationType(operationName);
    boolean ret = false;
    try {
        // An exclusive write lock is taken here
        writeLock();
        try {
            checkOperation(OperationCategory.WRITE);
            checkNameNodeSafeMode("Cannot delete " + src);
            // Remove the metadata kept on the NN; concurrency is controlled by FSNamesystemLock
            toRemovedBlocks = FSDirDeleteOp.delete(
                this, pc, src, recursive, logRetryCache);
            ret = toRemovedBlocks != null;
        } finally {
            writeUnlock(operationName);
        }
    } catch (AccessControlException e) {
        logAuditEvent(false, operationName, src);
        throw e;
    }
    getEditLog().logSync();
    logAuditEvent(ret, operationName, src);
    if (toRemovedBlocks != null) {
        // This is where blocks actually get removed
        // The FGL (fine-grained lock) scheme contributed by Huawei Cloud; see HDFS-14703 for details
        if (getFSDirectory().isFGLEnabled()) {
            removeBlocksWithFGL(toRemovedBlocks, src, pc); // Incremental deletion of blocks
        } else {
            // Block information is removed here
            removeBlocks(toRemovedBlocks); // Incremental deletion of blocks
        }
        src = toRemovedBlocks.getFileRemoved() != 0
            ? src + " (filecount=" + toRemovedBlocks.getFileRemoved() + ")" : src;
    }
    return ret;
}

HDFS performs deletion in steps: it takes the lock once to delete the metadata, and once that is done it takes the lock again to delete the block information.

A small digression here. HDFS uses the notion of a single global lock: all read and write operations, 100% of requests, share one lock, and this is a key performance bottleneck of HDFS. The community did offer fine-grained locking early on, but that design attached a separate lock to every INode, which on a very large cluster quickly exhausts resources, so the community rolled the approach back and returned to the global lock. How can the two be reconciled? Alluxio 2.0 provides one answer: a LockPool. There is a pool of lock resources; an INode no longer carries (allocates) its own lock. Whenever something needs to lock an INode, it borrows a lock from the pool and the lock's reference count is incremented; when it unlocks, the reference count is decremented. Huawei Cloud's implementation, on the other hand, first takes the global lock to locate the corresponding fine-grained lock (FGL), then releases the global lock and performs the rest of the operation under the FGL. You can weigh the pros and cons of the two approaches yourself; a minimal sketch of the lock-pool idea follows.
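
Here is a minimal sketch of the lock-pool idea, assuming locks are shared per inode id and reference-counted; class and method names are illustrative, not Alluxio's or Huawei's actual implementation.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a reference-counted lock pool: instead of giving every inode its own
// permanent lock, a lock is borrowed when needed and dropped from the pool once
// the last user releases it, so live locks scale with concurrency, not inode count.
public class LockPool {
    private static final class PooledLock {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        int refCount = 0;
    }

    private final Map<Long, PooledLock> pool = new HashMap<>();

    // Borrow the lock for an inode id, incrementing its reference count.
    public synchronized ReentrantReadWriteLock acquire(long inodeId) {
        PooledLock p = pool.computeIfAbsent(inodeId, k -> new PooledLock());
        p.refCount++;
        return p.lock;
    }

    // Return the lock; when nobody references it any more it is removed from the pool.
    public synchronized void release(long inodeId) {
        PooledLock p = pool.get(inodeId);
        if (p != null && --p.refCount == 0) {
            pool.remove(inodeId);
        }
    }
}

A caller would acquire(inodeId), lock the returned lock's read or write side, do its work, unlock, and finally release(inodeId) so the pool can reclaim idle locks.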

Let's keep digging. We will skip the metadata-manipulation part here, since this article focuses on block deletion, and go straight to the FSNamesystem#removeBlocks method.

/**
* From the given list, incrementally remove the blocks from blockManager
* Writelock is dropped and reacquired every BLOCK_DELETION_INCREMENT to
* ensure that other waiters on the lock can get in. See HDFS-2938
*
* @param blocks
*          An instance of {@link BlocksMapUpdateInfo} which contains a list
*          of blocks that need to be removed from blocksMap
*/
void removeBlocks(BlocksMapUpdateInfo blocks) {
    List<BlockInfo> toDeleteList = blocks.getToDeleteList();
    Iterator<BlockInfo> iter = toDeleteList.iterator();
    while (iter.hasNext()) {
        writeLock();
        try {
            for (int i = 0; i < blockDeletionIncrement && iter.hasNext(); i++) {
                blockManager.removeBlock(iter.next());
            }
        } finally {
            writeUnlock("removeBlocks");
        }
    }
}

There is not much to explain here, so let's continue down into the BlockManager#removeBlock method.

public void removeBlock(BlockInfo block) {
    assert namesystem.hasWriteLock();
    // No need to ACK blocks that are being removed entirely
    // from the namespace, since the removal of the associated
    // file already removes them from the block map below.
    block.setNumBytes(BlockCommand.NO_ACK);
    // Adding the block to the invalidates queue is what schedules its deletion; this is the key step
    addToInvalidates(block);
    removeBlockFromMap(block);
    // Remove the block from pendingReconstruction and neededReconstruction
    // pendingReconstruction holds blocks whose reconstruction commands have already been generated and are waiting to be sent to DNs
    // neededReconstruction holds blocks for which reconstruction commands have yet to be generated
    PendingBlockInfo remove = pendingReconstruction.remove(block);
    if (remove != null) {
        DatanodeStorageInfo.decrementBlocksScheduled(remove.getTargets()
            .toArray(new DatanodeStorageInfo[remove.getTargets().size()]));
    }
    neededReconstruction.remove(block, LowRedundancyBlocks.LEVEL);
    // Remove the block from the postponedMisreplicatedBlocks queue;
    // this mainly clears out work on the block that has not been processed yet
    postponedMisreplicatedBlocks.remove(block);
}

This method adds the block to the deletion queue (invalidates), removes it from the block map, and removes it from the pending and needed reconstruction queues so that no useless replication work is done for it.
Next, let's look at the BlockManager#addToInvalidates method:

/**
* Adds block to list of blocks which will be invalidated on all its
* datanodes.
*/
private void addToInvalidates(BlockInfo storedBlock) {
    if (!isPopulatingReplQueues()) {
        return;
    }
    StringBuilder datanodes = blockLog.isDebugEnabled()
        ? new StringBuilder() : null;
    for (DatanodeStorageInfo storage : blocksMap.getStorages(storedBlock)) {
        if (storage.getState() != State.NORMAL) {
            continue;
        }
        final DatanodeDescriptor node = storage.getDatanodeDescriptor();
        final Block b = getBlockOnStorage(storedBlock, storage);
        if (b != null) {
            // This is where the block is added to the per-datanode invalidation queue.
            invalidateBlocks.add(b, node, false);
            if (datanodes != null) {
                datanodes.append(node).append(" ");
            }
        }
    }
    if (datanodes != null && datanodes.length() != 0) {
        blockLog.debug("BLOCK* addToInvalidates: {} {}", storedBlock, datanodes);
    }
}
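
Conceptually, the invalidateBlocks structure that add() writes into maps each DataNode to the set of blocks waiting to be deleted on it. Below is a simplified sketch of that idea; the class, field, and method names are illustrative and not Hadoop's actual InvalidateBlocks class.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified view of the "blocks to invalidate" bookkeeping: for every DataNode
// we keep the set of block ids that should be deleted on it, and the NameNode
// drains a limited batch of these when it answers that node's heartbeat.
public class SimpleInvalidateBlocks {
    private final Map<String, Set<Long>> node2blocks = new HashMap<>();

    // Queue a block for deletion on the given datanode.
    public synchronized void add(String datanodeUuid, long blockId) {
        node2blocks.computeIfAbsent(datanodeUuid, k -> new HashSet<>()).add(blockId);
    }

    // Drain up to 'limit' block ids for a datanode, e.g. when processing its heartbeat.
    public synchronized Set<Long> pollBatch(String datanodeUuid, int limit) {
        Set<Long> batch = new HashSet<>();
        Set<Long> queued = node2blocks.get(datanodeUuid);
        if (queued == null) {
            return batch;
        }
        for (Long id : queued) {
            if (batch.size() >= limit) {
                break;
            }
            batch.add(id);
        }
        queued.removeAll(batch);
        if (queued.isEmpty()) {
            node2blocks.remove(datanodeUuid);
        }
        return batch;
    }
}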

After the block is added to the InvalidateBlocks queue, it waits there until the DN's next heartbeat is processed, at which point the deletion command is sent down to the DataNode.
How does the block then actually get deleted on disk? For that we need to read the FsDatasetAsyncDiskService class, which performs the DN's disk operations such as deletion; it uses a thread pool instead of creating a new thread for every deletion. A rough sketch of that design follows.
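
As a rough illustration of that design (a sketch only, not the real FsDatasetAsyncDiskService API), deletions can be handed to a per-volume thread pool so slow disk I/O never blocks the thread that received the command; all names below are hypothetical.

import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of asynchronous block-file deletion: each volume (disk) gets its own
// small thread pool, and a delete request is just a task submitted to the pool
// for the volume that holds the block file.
public class AsyncBlockDeleter {
    private final Map<String, ExecutorService> volumeExecutors = new ConcurrentHashMap<>();

    // Submit an asynchronous deletion of a block file (and its meta file) on a volume.
    public void deleteAsync(String volume, File blockFile, File metaFile) {
        ExecutorService executor = volumeExecutors.computeIfAbsent(
            volume, v -> Executors.newFixedThreadPool(4));
        executor.submit(() -> {
            // The actual disk I/O happens here, off the caller's thread.
            boolean ok = blockFile.delete() && metaFile.delete();
            if (!ok) {
                System.err.println("Failed to delete " + blockFile + " on volume " + volume);
            }
        });
    }

    // Shut down all per-volume pools, e.g. when the DataNode stops.
    public void shutdown() {
        volumeExecutors.values().forEach(ExecutorService::shutdown);
    }
}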

posted @ 2022-12-03 17:29  默默Coding