Ray User Guide: Tasks
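Ray lets arbitrary Python functions run asynchronously on separate Python workers. Such functions are called Ray remote functions, and their asynchronous invocations are called Ray tasks. Here is an example: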
import ray
import time
# A regular Python function.
def normal_function():
    return 1
# By adding the `@ray.remote` decorator, a regular Python function
# becomes a Ray remote function.
@ray.remote
def my_function():
    return 1
# To invoke this remote function, use the `remote` method.
# This will immediately return an object ref (a future) and then create
# a task that will be executed on a worker process.
obj_ref = my_function.remote()
# The result can be retrieved with ``ray.get``.
assert ray.get(obj_ref) == 1
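@ray.remote
def slow_function():
    time.sleep(10)
    return 1
for _ in range(4):
    # .remote() submits the task immediately and does not block, so these
    # four tasks run in parallel. ray.get is only needed to fetch a result.
    slow_function.remote()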
Use the ray summary tasks command to see the running and finished tasks and their counts.
Specifying required resources
@ray.remote(num_cpus=4, num_gpus=2)
def my_function():
    return 1
# Override the default resource requirements.
my_function.options(num_cpus=3).remote()
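Passing object refs to Ray tasks
Besides plain values, object refs can also be passed as arguments to remote functions. When the task runs, the argument inside the function body is the underlying value.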
@ray.remote
def function_with_an_argument(value):
    return value + 1
obj_ref1 = my_function.remote()
assert ray.get(obj_ref1) == 1
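# Pass the object ref as an argument to another Ray task via `.remote()`.
obj_ref2 = function_with_an_argument.remote(obj_ref1)
assert ray.get(obj_ref2) == 2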
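Because the second task depends on the output of the first, Ray does not run the second task until the first has finished. If the two tasks run on different machines, the result of the first task is sent over the network to the machine where the second task runs.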
Waiting for partial results
object_refs = [slow_function.remote() for _ in range(2)]
# Return as soon as one of the tasks finished execution.
ready_refs, remaining_refs = ray.wait(object_refs, num_returns=1, timeout=None)
Multiple returns
By default, a Ray task returns a single object ref. However, a task can be configured to return multiple object refs.
@ray.remote
def return_single():
    return 0, 1, 2
object_ref = return_single.remote()
assert ray.get(object_ref) == (0, 1, 2)
# However, you can configure Ray tasks to return multiple Object Refs.
@ray.remote(num_returns=3)
def return_multiple():
    return 0, 1, 2
object_ref0, object_ref1, object_ref2 = return_multiple.remote()
assert ray.get(object_ref0) == 0
assert ray.get(object_ref1) == 1
assert ray.get(object_ref2) == 2
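For tasks that return multiple objects, Ray also supports remote generators, which let a task produce one object at a time. Ray also supports setting the number of return values dynamically.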
@ray.remote(num_returns=3)
def return_multiple_as_generator():
    for i in range(3):
        yield i
# NOTE: These objects are not available until the whole task has finished
# executing and all return values have been generated.
a, b, c = return_multiple_as_generator.remote()
Cancelling tasks
A Ray task can be cancelled by calling ray.cancel() on its returned object ref.
@ray.remote
def blocking_operation():
    time.sleep(10e6)
obj_ref = blocking_operation.remote()
ray.cancel(obj_ref)
try:
    ray.get(obj_ref)
except ray.exceptions.TaskCancelledError:
    print("Object reference was cancelled.")
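Scheduling
For each task, Ray chooses a node to run it on. The scheduling decision is based on factors such as the task's resource requirements, the specified scheduling strategy, and the locations of the task's arguments.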
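Fault tolerance
By default, Ray retries tasks that fail because of system failures and specified application-level failures. You can change this behavior by setting the max_retries and retry_exceptions options.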
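Task events
By default, Ray traces the execution of tasks and reports task status events and profiling events. You can change this behavior by setting the enable_task_events option to disable task events, which reduces the overhead of task execution. Note that nested tasks do not inherit the task events setting from their parent task; set it for each task separately.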
Nested remote functions
import ray
@ray.remote
def f():
    return 1
@ray.remote
def g():
    # Call f 4 times and return the resulting object refs.
    return [f.remote() for _ in range(4)]
@ray.remote
def h():
    # Call f 4 times, block until those 4 tasks finish,
    # retrieve the results, and return the values.
    return ray.get([f.remote() for _ in range(4)])
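Note that the definition of f must come before the definitions of g and h: as soon as g is defined, it is pickled and shipped to the workers, so f must already exist at that point.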
Yielding resources while blocked
@ray.remote(num_cpus=1, num_gpus=1)
def g():
    return ray.get(f.remote())
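While the g task is executing, it releases its CPU resources when it blocks in ray.get() and reacquires them when ray.get() returns. GPU resources are not released; the task keeps them for its entire lifetime.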
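Dynamic generators
Use cases for Ray's remote generators:
- Reducing the maximum heap memory usage when returning multiple values from a remote function.
- Returning a number of values that has to be determined dynamically.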
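num_returns set by the task caller
The caller should use @ray.remote(num_returns=x) or foo.options(num_returns=x).remote() to set the number of return values of a remote function. Ray returns that many object refs to the caller. The remote task should then return the same number of values, usually as a tuple or a list. Compared with setting the number of return values dynamically, this approach has lower complexity and overhead.
Without changing the caller's syntax, a remote generator can be used to produce the values iteratively. The generator should yield the same number of values as the caller specified, and each value is stored in Ray's object store as it is yielded, one at a time.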
import numpy as np
@ray.remote
def large_values_generator(num_returns):
    # Yield one large array per iteration instead of building them all first.
    for i in range(num_returns):
        yield np.random.randint(
            np.iinfo(np.int8).max, size=(100_000_000, 1), dtype=np.int8
        )
        print(f"yielded return value {i}")
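num_returns set by the task executor
In some cases, the caller may not know how many return values to expect. For example, suppose we write a task that splits its argument into equal-size chunks. If the size of the argument is not known until the task runs, the number of return values is not known either.
In this case, a remote generator can be used to return a dynamic number of values. An example follows: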
import numpy as np
@ray.remote(num_returns="dynamic")
def split(array, chunk_size):
    while len(array) > 0:
        yield array[:chunk_size]
        array = array[chunk_size:]
array_ref = ray.put(np.zeros(np.random.randint(1000_000)))
block_size = 1000
# Returns an ObjectRef[DynamicObjectRefGenerator].
dynamic_ref = split.remote(array_ref, block_size)
print(dynamic_ref)
# ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)
i = -1
ref_generator = ray.get(dynamic_ref)
print(ref_generator)
# <ray._raylet.DynamicObjectRefGenerator object at 0x7f7e2116b290>
for i, ref in enumerate(ref_generator):
    # Each DynamicObjectRefGenerator iteration returns an ObjectRef.
    assert len(ray.get(ref)) <= block_size
num_blocks_generated = i + 1
array_size = len(ray.get(array_ref))
assert array_size <= num_blocks_generated * block_size
print(f"Split array of size {array_size} into {num_blocks_generated} blocks of "
      f"size {block_size} each.")
# Split array of size 63153 into 64 blocks of size 1000 each.
# NOTE: The dynamic_ref points to the generated ObjectRefs. Make sure that this
# ObjectRef goes out of scope so that Ray can garbage-collect the internal
# ObjectRefs.
del dynamic_ref
Note that this special ref can also be passed to another remote task, as follows:
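# Import path assumed from the repr printed above
# (<ray._raylet.DynamicObjectRefGenerator ...>).
from ray._raylet import DynamicObjectRefGenerator
@ray.remote
def get_size(ref_generator: DynamicObjectRefGenerator):
    print(ref_generator)
    num_elements = 0
    for ref in ref_generator:
        array = ray.get(ref)
        assert len(array) <= block_size
        num_elements += len(array)
    return num_elements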
# Returns an ObjectRef[DynamicObjectRefGenerator].
dynamic_ref = split.remote(array_ref, block_size)
assert array_size == ray.get(get_size.remote(dynamic_ref))
# (get_size pid=1504184)
# <ray._raylet.DynamicObjectRefGenerator object at 0x7f81c4250ad0>
# This also works, but should be avoided because you have to call an additional
# `ray.get`, which blocks the driver.
ref_generator = ray.get(dynamic_ref)
assert array_size == ray.get(get_size.remote(ref_generator))
# (get_size pid=1504184)
# <ray._raylet.DynamicObjectRefGenerator object at 0x7f81c4251b50>
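Exception handling
If a generator function raises an exception before yielding all of its values, the values it has already stored remain accessible through their object refs, and the remaining object refs contain the raised exception. This applies to both static and dynamic num_returns.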
@ray.remote
def generator():
    for i in range(2):
        yield i
    raise Exception("error")
ref1, ref2, ref3, ref4 = generator.options(num_returns=4).remote()
assert ray.get([ref1, ref2]) == [0, 1]
# All remaining ObjectRefs will contain the error.
try:
    ray.get([ref3, ref4])
except Exception as error:
    print(error)
dynamic_ref = generator.options(num_returns="dynamic").remote()
ref_generator = ray.get(dynamic_ref)
ref1, ref2, ref3 = ref_generator
assert ray.get([ref1, ref2]) == [0, 1]
# Generators with num_returns="dynamic" will store the exception in the final
# ObjectRef.
try:
    ray.get(ref3)
except Exception as error:
    print(error)