cs294-ai-sys2022 Lecture 7 reading
1. Neural Adaptive Video Streaming with Pensieve (SIGCOMM 2017, MIT)
- Motivation
-
Adaptive bitrate (ABR) algorithms are the primary tool that content providers use to optimize video quality. These algorithms run on client-side video players and dynamically choose a bitrate for each video chunk (e.g., a 4-second block). ABR algorithms make bitrate decisions based on observations such as the estimated network throughput and playback buffer occupancy. Their goal is to maximize the user's QoE by adapting the video bitrate to the underlying network conditions. However, selecting the right bitrate can be very challenging due to (1) the variability of network throughput; (2) the conflicting video QoE requirements (high bitrate, minimal rebuffering, smoothness, etc.); (3) the cascading effects of bitrate decisions (e.g., selecting a high bitrate may drain the playback buffer to a dangerous level and cause rebuffering in the future); and (4) the coarse-grained nature of ABR decisions.
-
- Contribution
-
We propose Pensieve, a system that learns ABR algorithms automatically, without using any pre-programmed control rules or explicit assumptions about the operating environment. Pensieve uses modern reinforcement learning (RL) techniques to learn a control policy for bitrate adaptation purely through experience. During training, Pensieve starts knowing nothing about the task at hand. It then gradually learns to make better ABR decisions through reinforcement, in the form of reward signals that reflect video QoE for past decisions.
-
- LEARNING ABR ALGORITHMS
-
RL considers a general setting in which an agent interacts with an environment. At each time step t, the agent observes some state s_t and chooses an action a_t. After applying the action, the state of the environment transitions to s_{t+1} and the agent receives a reward r_t. The goal of learning is to maximize the expected cumulative discounted reward, E[Σ_t γ^t r_t].
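The cumulative discounted reward underlying this objective can be computed with a simple backward recursion; a minimal sketch (the reward values and discount factor here are illustrative, not from the paper):

```python
# Discounted return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
# computed backward: G_t = r_t + gamma * G_{t+1}.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```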
-
The ABR agent observes a set of metrics including the client playback buffer occupancy, past bitrate decisions, and several raw network signals (e.g., throughput measurements) and feeds these values to the neural network, which outputs the action, i.e., the bitrate to use for the next chunk. The resulting QoE is then observed and passed back to the ABR agent as a reward. The agent uses the reward information to train and improve its neural network model.
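The observe-act-reward loop described above can be sketched schematically (the `env`/`agent` method names are assumptions for illustration, not Pensieve's actual interfaces):

```python
# Schematic training episode: observe state, pick a bitrate, receive the
# QoE reward, and store the experience for a later policy update.
def train_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.select_bitrate(state)    # NN forward pass
        state, reward, done = env.step(action)  # download next chunk
        agent.record(state, action, reward)     # store experience
        total_reward += reward
    agent.update()                              # improve the policy
    return total_reward
```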
-
- DESIGN
- Training Methodology
-
Pensieve trains ABR algorithms in a simple simulation environment that faithfully models the dynamics of video streaming with real client applications. Using this chunk-level simulator, Pensieve can "experience" 100 hours of video downloads in only 10 minutes.
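A back-of-envelope sketch of one step of such a chunk-level simulator (parameter names and the 4-second chunk length are illustrative assumptions, not Pensieve's actual simulator code): the download time follows from chunk size and trace throughput, the buffer drains while downloading, and it refills by one chunk's duration on completion.

```python
def simulate_chunk(chunk_size_bytes, throughput_bps, buffer_s, chunk_len_s=4.0):
    """Advance the simulation by one chunk download."""
    download_time = chunk_size_bytes * 8 / throughput_bps   # seconds
    rebuffer = max(download_time - buffer_s, 0.0)           # stall time
    buffer_s = max(buffer_s - download_time, 0.0) + chunk_len_s
    return buffer_s, rebuffer, download_time
```

Because only these per-chunk quantities matter, the simulator can replay hours of throughput traces in minutes rather than streaming video in real time.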
-
- Basic Training Algorithm
-
Inputs: After the download of each chunk t, Pensieve's learning agent takes the state input s_t = (x⃗_t, τ⃗_t, n⃗_t, b_t, c_t, l_t) to its neural networks. x⃗_t is the vector of network throughput measurements for the past k video chunks; τ⃗_t is the vector of download times of the past k video chunks, which represent the time intervals of the throughput measurements; n⃗_t is a vector of the m available sizes for the next video chunk; b_t is the current buffer level; c_t is the number of chunks remaining in the video; and l_t is the bitrate at which the last chunk was downloaded.
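A hypothetical helper for assembling s_t: since the k-step histories (x⃗_t, τ⃗_t) are fixed-length network inputs, shorter histories at the start of a video need zero-padding. The value of K and the field layout are illustrative assumptions, not Pensieve's exact shapes.

```python
import numpy as np

K = 8  # history length k (assumed value for illustration)

def pad_history(values, k=K):
    """Keep the most recent k values, left-padded with zeros."""
    v = list(values)[-k:]
    return np.array([0.0] * (k - len(v)) + v)

def make_state(throughputs, download_times, next_chunk_sizes,
               buffer_s, chunks_left, last_bitrate):
    return (pad_history(throughputs),             # x_t: past throughputs
            pad_history(download_times),          # tau_t: past download times
            np.asarray(next_chunk_sizes, float),  # n_t: m sizes of next chunk
            float(buffer_s),                      # b_t: buffer level
            float(chunks_left),                   # c_t: chunks remaining
            float(last_bitrate))                  # l_t: last chunk's bitrate
```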
-
Policy: Upon receiving s_t, Pensieve's RL agent needs to take an action a_t that corresponds to the bitrate for the next video chunk. The agent selects actions based on a policy, defined as a probability distribution over actions: π(s_t, a_t) → [0, 1]. π(s_t, a_t) is the probability that action a_t is taken in state s_t. In practice, there are intractably many {state, action} pairs, e.g., throughput estimates and buffer occupancies are continuous real numbers. To overcome this, Pensieve uses a neural network (NN) [15] to represent the policy with a manageable number of adjustable parameters, θ, which we refer to as policy parameters.
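A minimal linear-softmax stand-in for the policy π_θ(s_t, a_t): a parameter matrix θ maps a flattened state to a probability distribution over the M candidate bitrates. This is only a sketch of the idea; the actual Pensieve network uses convolutional and fully connected layers, and STATE_DIM and M here are assumed values.

```python
import numpy as np

STATE_DIM, M = 22, 6                   # assumed state and action sizes
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(STATE_DIM, M))   # policy parameters

def policy(state, theta=theta):
    """Return pi(s, a) for each of the M bitrate actions."""
    logits = state @ theta
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```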
- Policy gradient training
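The core policy-gradient idea is to update θ in the direction α · G · ∇_θ log π_θ(s, a). For the linear-softmax sketch above, the gradient of log π(a) with respect to the logits is onehot(a) − π, so ∇_θ = outer(s, onehot(a) − π). A minimal REINFORCE-style step (α and the linear policy form are illustrative assumptions; the actual Pensieve training uses an actor-critic method, which this sketch does not cover):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_gradient_step(theta, state, action, ret, alpha=1e-2):
    """One update: theta <- theta + alpha * G * grad log pi(s, a)."""
    pi = softmax(state @ theta)
    grad_logits = -pi
    grad_logits[action] += 1.0          # onehot(action) - pi
    return theta + alpha * ret * np.outer(state, grad_logits)
```

Intuitively, a positive return G nudges θ so the chosen bitrate becomes more probable in that state, and a negative return makes it less probable.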
-