Ray User Guide: Tasks

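Ray lets any function run asynchronously on separate Python workers. Such functions are called Ray remote functions, and their asynchronous invocations are called Ray tasks. Here is an example: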

import ray
import time


# A regular Python function.
def normal_function():
    return 1


# By adding the `@ray.remote` decorator, a regular Python function
# becomes a Ray remote function.
@ray.remote
def my_function():
    return 1


# To invoke this remote function, use the `remote` method.
# This will immediately return an object ref (a future) and then create
# a task that will be executed on a worker process.
obj_ref = my_function.remote()

# The result can be retrieved with ``ray.get``.
assert ray.get(obj_ref) == 1


@ray.remote
def slow_function():
    time.sleep(10)
    return 1

for _ in range(4):
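    # This call does not block: `remote()` returns immediately and the task
    # starts running asynchronously. `ray.get` is only needed to fetch the result.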
    slow_function.remote()

You can run `ray summary tasks` to see the running and finished tasks and their counts.

Specifying resource requirements

@ray.remote(num_cpus=4, num_gpus=2)
def my_function():
    return 1

# Override the default resource requirements for this call.
my_function.options(num_cpus=3).remote()

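Passing object refs to Ray tasks

Besides plain values, object refs can also be passed to remote functions. When the task runs, the argument inside the function body is the underlying value.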

@ray.remote
def function_with_an_argument(value):
    return value + 1

obj_ref1 = my_function.remote()
assert ray.get(obj_ref1) == 1
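# Pass the object ref to another task with `.remote()`; inside the task
# the argument arrives as the value 1.
obj_ref2 = function_with_an_argument.remote(obj_ref1)
assert ray.get(obj_ref2) == 2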

Note that if the two tasks run on different machines, the result of the first task is sent over the network to the machine where the second task runs.

Waiting for partial results

object_refs = [slow_function.remote() for _ in range(2)]
# Return as soon as one of the tasks finished execution.
ready_refs, remaining_refs = ray.wait(object_refs, num_returns=1, timeout=None)

Multiple return values

By default, a task returns a single object ref. However, tasks can be configured to return multiple object refs.

@ray.remote
def return_single():
    return 0, 1, 2


object_ref = return_single.remote()
assert ray.get(object_ref) == (0, 1, 2)


# However, you can configure Ray tasks to return multiple Object Refs.
@ray.remote(num_returns=3)
def return_multiple():
    return 0, 1, 2


object_ref0, object_ref1, object_ref2 = return_multiple.remote()
assert ray.get(object_ref0) == 0
assert ray.get(object_ref1) == 1
assert ray.get(object_ref2) == 2

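For tasks that return multiple objects, Ray also supports remote generators, which let a task return one object at a time. In addition, Ray supports setting the number of return values dynamically.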

@ray.remote(num_returns=3)
def return_multiple_as_generator():
    for i in range(3):
        yield i


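# NOTE: As with normal functions, these objects are not available until the
# whole task has finished and all return values have been generated.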
a, b, c = return_multiple_as_generator.remote()

Cancelling tasks

A task can be cancelled by calling ray.cancel() on its returned object ref.

@ray.remote
def blocking_operation():
    time.sleep(10e6)


obj_ref = blocking_operation.remote()
ray.cancel(obj_ref)

try:
    ray.get(obj_ref)
except ray.exceptions.TaskCancelledError:
    print("Object reference was cancelled.")

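Scheduling

For each task, Ray chooses a node to run it on. The scheduling decision is based on factors such as the task's resource requirements, the specified scheduling strategy, and the locations of the task's arguments.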

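Fault tolerance

By default, Ray retries tasks that fail because of system failures and specified application-level failures. You can change this behavior with the max_retries and retry_exceptions options.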

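Task events

By default, Ray traces the execution of tasks, reporting task status events and profiling events.

You can change this behavior by setting the enable_task_events option to turn tracing off, which reduces the overhead of task execution. Note that nested tasks do not inherit the task events setting from their parent task.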

Nested remote functions

import ray


@ray.remote
def f():
    return 1


@ray.remote
def g():
    # Call f 4 times and return the resulting object refs.
    return [f.remote() for _ in range(4)]


@ray.remote
def h():
    # Call f 4 times, block until those 4 tasks finish,
    # retrieve the results, and return the values.
    return ray.get([f.remote() for _ in range(4)])

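Note that the definition of f must come before the definitions of g and h: as soon as g is defined, it is pickled and shipped to the workers, so if f is not defined yet, the definition is incomplete.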

Yielding resources while blocked

@ray.remote(num_cpus=1, num_gpus=1)
def g():
    return ray.get(f.remote())

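While a g task is executing, it releases its CPU resources when it blocks in ray.get and reacquires them when ray.get returns. Its GPU resources are not released and are held for the lifetime of the task.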

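Dynamic generators

Use cases for Ray's remote generators:

  1. Reducing peak heap memory usage when returning multiple values from a remote function.
  2. Returning a number of values that has to be decided dynamically.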

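num_returns set by the task caller

The caller should set the number of return values of the remote function with @ray.remote(num_returns=x) or foo.options(num_returns=x).remote(). Ray returns that many object refs to the caller. The remote task should then return the same number of values, usually as a tuple or list. Compared with setting the number of return values dynamically, this approach is simpler and has lower overhead.

Without changing the caller's syntax, the task can use a remote generator to produce the values iteratively. The generator should yield as many values as the caller specified, and the values are stored in Ray's object store one at a time.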

import numpy as np


@ray.remote
def large_values_generator(num_returns):
    for i in range(num_returns):
        yield np.random.randint(
            np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8
        )
        print(f"yielded return value {i}")

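num_returns set by the task executor

In some cases the caller may not know how many values to expect. For example, suppose we write a task that splits its argument into equal-size chunks: the size of the argument may not be known until the task runs, so the number of return values is unknown as well.

In these cases, a remote generator can return a dynamic number of values. An example: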

import numpy as np


@ray.remote(num_returns="dynamic")
def split(array, chunk_size):
    while len(array) > 0:
        yield array[:chunk_size]
        array = array[chunk_size:]


array_ref = ray.put(np.zeros(np.random.randint(1000_000)))
block_size = 1000

# Returns an ObjectRef[DynamicObjectRefGenerator].
dynamic_ref = split.remote(array_ref, block_size)
print(dynamic_ref)
# ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)

i = -1
ref_generator = ray.get(dynamic_ref)
print(ref_generator)
# <ray._raylet.DynamicObjectRefGenerator object at 0x7f7e2116b290>
for i, ref in enumerate(ref_generator):
    # Each DynamicObjectRefGenerator iteration returns an ObjectRef.
    assert len(ray.get(ref)) <= block_size
num_blocks_generated = i + 1
array_size = len(ray.get(array_ref))
assert array_size <= num_blocks_generated * block_size
print(f"Split array of size {array_size} into {num_blocks_generated} blocks of "
      f"size {block_size} each.")
# Split array of size 63153 into 64 blocks of size 1000 each.

# NOTE: The dynamic_ref points to the generated ObjectRefs. Make sure that this
# ObjectRef goes out of scope so that Ray can garbage-collect the internal
# ObjectRefs.
del dynamic_ref

Note that this special ref can also be passed to another remote task, as follows:

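# Import the type used in the annotation below; the module path matches
# the repr printed above (ray._raylet.DynamicObjectRefGenerator).
from ray._raylet import DynamicObjectRefGenerator


@ray.remote
def get_size(ref_generator: DynamicObjectRefGenerator):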
    print(ref_generator)
    num_elements = 0
    for ref in ref_generator:
        array = ray.get(ref)
        assert len(array) <= block_size
        num_elements += len(array)
    return num_elements


# Returns an ObjectRef[DynamicObjectRefGenerator].
dynamic_ref = split.remote(array_ref, block_size)
assert array_size == ray.get(get_size.remote(dynamic_ref))
# (get_size pid=1504184)
# <ray._raylet.DynamicObjectRefGenerator object at 0x7f81c4250ad0>

# This also works, but should be avoided because you have to call an additional
# `ray.get`, which blocks the driver.
ref_generator = ray.get(dynamic_ref)
assert array_size == ray.get(get_size.remote(ref_generator))
# (get_size pid=1504184)
# <ray._raylet.DynamicObjectRefGenerator object at 0x7f81c4251b50>

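Exception handling

If a generator function raises an exception before yielding all of its values, the values it has already stored remain accessible through their object refs. The remaining object refs contain the raised exception. This applies to both static and dynamic num_returns.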

@ray.remote
def generator():
    for i in range(2):
        yield i
    raise Exception("error")


ref1, ref2, ref3, ref4 = generator.options(num_returns=4).remote()
assert ray.get([ref1, ref2]) == [0, 1]
# All remaining ObjectRefs will contain the error.
try:
    ray.get([ref3, ref4])
except Exception as error:
    print(error)

dynamic_ref = generator.options(num_returns="dynamic").remote()
ref_generator = ray.get(dynamic_ref)
ref1, ref2, ref3 = ref_generator
assert ray.get([ref1, ref2]) == [0, 1]
# Generators with num_returns="dynamic" will store the exception in the final
# ObjectRef.
try:
    ray.get(ref3)
except Exception as error:
    print(error)
posted @ 2025-03-28 23:35  xwher