Talking about WebRTC NetEQ

There are many treasures worth studying in the WebRTC Native code, and one of them is the NetEQ module. The WebRTC glossary explains NetEQ as follows:

A dynamic jitter buffer and error concealment algorithm used for concealing the negative effects of network jitter and packet loss. Keeps latency as low as possible while maintaining the highest voice quality.

In essence, NetEQ is the jitter buffer module of WebRTC's voice engine. It is critical to the playback experience, and it is also quite complex.

The source code discussed in this article is from WebRTC Native, version M78.

Jitter removal

At a constant bit rate: jitter (J) = average arrival interval (which approximates the sending interval) − actual arrival interval of the packet

  • J > 0: positive jitter; packets arrive early and pile up, and the receiver's buffer may overflow

  • J < 0: negative jitter; packets are delayed or lost

Because network packets arrive sometimes faster and sometimes slower, the intervals between them are uneven, so playback sounds choppy. Jitter removal turns this non-uniform delay into a uniform one, such that the total delay of all packets during network transmission equals their total delay after the jitter buffer has processed them.

 


Time point        A    B    C    D
Sent (ms)         30   60   90   120
Arrived (ms)      40   90   100  130
After processing  60   90   120  150

After processing, the playback intervals of A, B, C, and D are identical: the listener perceives some extra latency, but no stutter.
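To make the table concrete, here is a toy sketch (not NetEQ code) that reproduces the schedule above: each packet is played at the later of its arrival time and the previous playout time plus a fixed 30 ms interval.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy jitter removal: play packets at a uniform 30 ms interval after an
// initial buffering delay. Values match the A/B/C/D table above.
int main() {
  const std::vector<int> arrival = {40, 90, 100, 130};  // arrival times (ms)
  const int interval = 30;  // uniform playout interval (ms)
  int playout = 60;         // first playout time, i.e. initial buffering delay
  for (size_t i = 0; i < arrival.size(); ++i) {
    // A packet can be played no earlier than its arrival.
    playout = std::max(playout, arrival[i]);
    std::printf("packet %c: arrived %d ms, played %d ms\n",
                static_cast<char>('A' + i), arrival[i], playout);
    playout += interval;
  }
  return 0;
}
```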

There are two common jitter buffer control algorithms:

  • Static jitter buffer control algorithm: the buffer size is fixed and simple to implement. When network jitter is heavy the packet loss rate is high; when jitter is light the delay is unnecessarily large.

  • Adaptive jitter buffer control algorithm: the buffer size adapts to the currently measured maximum jitter, which is more complex to implement. When network jitter is heavy the packet loss rate stays low; when jitter is light the delay stays small.

A good algorithm naturally pursues both a low packet loss rate and low latency.
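As a minimal sketch of the adaptive idea (hypothetical code, not WebRTC's actual algorithm), the buffer's target delay can track a high percentile of recently observed jitter:

```cpp
#include <algorithm>
#include <cmath>
#include <deque>
#include <vector>

// Hypothetical adaptive sizing: keep a sliding window of recent jitter
// samples and place the target buffer delay near their 95th percentile.
class AdaptiveBufferSizer {
 public:
  void AddJitterSample(double jitter_ms) {
    window_.push_back(std::abs(jitter_ms));
    if (window_.size() > kWindowSize) window_.pop_front();
  }
  double TargetDelayMs() const {
    if (window_.empty()) return kMinDelayMs;
    std::vector<double> sorted(window_.begin(), window_.end());
    const size_t idx = static_cast<size_t>(0.95 * (sorted.size() - 1));
    std::nth_element(sorted.begin(), sorted.begin() + idx, sorted.end());
    return std::max(kMinDelayMs, sorted[idx]);
  }

 private:
  static constexpr size_t kWindowSize = 100;   // samples kept in the window
  static constexpr double kMinDelayMs = 20.0;  // never shrink below this
  std::deque<double> window_;
};
```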

Packet loss concealment

Packet Loss Concealment (PLC), as the name suggests, is the set of countermeasures taken when packets are lost. It divides into sender-side and receiver-side techniques.

Sender

  • Active retransmission: the receiver asks the sender, via signaling, to retransmit lost packets.

  • Passive channel coding: add redundancy while packetizing, which can later be used to recover lost packets.

    • Forward Error Correction (FEC): recover a lost packet from information carried by neighboring packets.
      • Media-dependent: send each frame twice; the second copy is usually encoded at a lower bit rate and quality.
      • Media-independent: every n data packets generate one (or more) parity packets, from which any one of the n packets can be recovered; see the XOR sketch after this list.
    • Interleaving: split and rearrange packet contents so that a single lost packet damages less contiguous data.
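As a minimal sketch of media-independent FEC (illustrative only), a single XOR parity packet over n equal-sized packets can recover exactly one lost packet:

```cpp
#include <cstdint>
#include <vector>

// XOR parity over n equal-sized packets: parity = p0 ^ p1 ^ ... ^ p(n-1).
std::vector<uint8_t> XorParity(
    const std::vector<std::vector<uint8_t>>& packets) {
  std::vector<uint8_t> parity(packets.front().size(), 0);
  for (const auto& p : packets)
    for (size_t i = 0; i < parity.size(); ++i) parity[i] ^= p[i];
  return parity;
}

// If exactly one of the n packets is lost, XOR-ing the parity packet with
// the n-1 surviving packets reproduces the lost packet.
std::vector<uint8_t> RecoverLost(
    const std::vector<std::vector<uint8_t>>& survivors,
    const std::vector<uint8_t>& parity) {
  std::vector<uint8_t> lost = parity;
  for (const auto& p : survivors)
    for (size_t i = 0; i < lost.size(); ++i) lost[i] ^= p[i];
  return lost;
}
```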

Receiving end

  • Insertion: fill the gap with a fixed packet
    • a silence packet
    • a noise packet
    • a copy of the previous packet
  • Interpolation: pattern matching and interpolation generate a similar packet. The algorithm does not understand the packet's actual content; it works only on the signal's characteristics
  • Reconstruction: regenerate the packet from codec and compression parameters. Unlike interpolation, it uses more of the information in the packet, so the result is better

Overall structure

 


In the WebRTC source code, NetEQ sits inside the voice engine, alongside other classic and widely reused modules such as the codecs and the 3A algorithms.

 


In the audio processing pipeline, NetEQ sits at the front of the receive side: it takes in the network packets and feeds the resulting audio to the downstream audio processing algorithms.

 


NetEQ itself is roughly divided into an MCU (Micro Control Unit) module and a DSP module. The MCU is responsible for measuring delay and jitter and for issuing the corresponding control commands; the DSP receives data packets, processes them according to the MCU's commands, and passes the result to the next stage.

MCU module

The MCU module acts as a command center: after a packet is received, it analyzes the packet's statistics and makes command decisions. It mainly includes:

Network delay statistics algorithm

This algorithm lives in neteq/delay_manager.cc. When a packet arrives, NetEQ calls

int DelayManager::Update(uint16_t sequence_number, uint32_t timestamp, int sample_rate_hz)

with the packet's information to update the statistics. The main steps are:

  1. Calculate the interval between now and the last time a packet was pulled from the queue
  2. From the packet's sequence number and timestamp, compute the inter-arrival time in packet intervals, iat_packets:
    • Normal arrival: iat_packets = 1
    • Early (out-of-order) arrival: iat_packets = 0
    • Arrival delayed by n intervals: iat_packets = n
  3. Call CalculateTargetLevel to update the target buffer level (computed from the recent inter-arrival interval probabilities and delay peaks):

int DelayManager::CalculateTargetLevel(int iat_packets, bool reordered)
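A minimal sketch of how the inter-arrival count could be derived from sequence numbers alone (illustrative; the real DelayManager also uses timestamps and more careful reordering logic):

```cpp
#include <cstdint>

// Illustrative inter-arrival count in packet intervals, derived from RTP
// sequence numbers only. The unsigned subtraction wraps correctly across
// the 16-bit boundary (e.g. 65535 -> 0 yields 1).
int PacketIat(uint16_t last_seq, uint16_t new_seq) {
  const uint16_t wrapped = static_cast<uint16_t>(new_seq - last_seq);
  const int16_t delta = static_cast<int16_t>(wrapped);
  if (delta <= 0) return 0;  // out-of-order or duplicate: arrived "early"
  return delta;              // 1 = normal arrival, n = delayed by n intervals
}
```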

Jitter delay statistics algorithm

This algorithm lives in neteq/buffer_level_filter.cc. When audio is pulled for playback, NetEQ calls:

void BufferLevelFilter::Update(size_t buffer_size_samples, int time_stretched_samples)

passing in the current number of samples remaining in the jitter buffer and the number of samples added or removed by acceleration/deceleration, and then updates the statistics. The main steps are:

  1. Compute a smoothed buffer level using a dynamic forgetting factor (chosen according to the measured network delay)
  2. Account for the effect of acceleration and deceleration (time_stretched_samples)
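A minimal sketch of this kind of first-order smoothing (illustrative; the real BufferLevelFilter works in fixed-point Q-format arithmetic):

```cpp
// Illustrative first-order smoothing of the buffer level. A `forgetting`
// factor close to 1.0 smooths more; NetEQ chooses it dynamically from the
// measured network delay.
class SmoothedBufferLevel {
 public:
  void Update(double buffer_samples, int time_stretched_samples,
              double forgetting) {
    // Samples created or consumed by accelerate/decelerate do not reflect
    // network behavior, so remove their contribution first.
    const double adjusted = buffer_samples - time_stretched_samples;
    level_ = forgetting * level_ + (1.0 - forgetting) * adjusted;
  }
  double level() const { return level_; }

 private:
  double level_ = 0.0;
};
```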

Control command decision

This algorithm lives in neteq/decision_logic.cc. When audio is pulled for playback, NetEQ calls:

Operations DecisionLogic::GetDecision(const SyncBuffer& sync_buffer, const Expand& expand, size_t decoder_frame_length, const Packet* next_packet, Modes prev_mode, bool play_dtmf, size_t generated_noise_samples, bool* reset_decoder)

The decision is based on the relationship between the current frame and the previous frame. The main cases, sketched in code after this list, are:

  • Current frame normal + previous frame normal: decide among normal/accelerate/decelerate based on the network delay statistics
  • Current frame normal + previous frame lost: the previous frame was generated by concealment, so smoothing is needed; decide between normal/merge
  • Current frame lost + previous frame normal: start packet loss concealment
  • Current frame lost + previous frame lost: continue packet loss concealment
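A simplified sketch of this decision table (hypothetical names; the real DecisionLogic also weighs buffer levels against the target level, DTMF, and comfort-noise generation):

```cpp
// Hypothetical operation names for the four cases above.
enum class Op { kNormal, kAccelerate, kDecelerate, kMerge, kExpand };

Op GetDecision(bool current_ok, bool previous_ok, int buffer_level,
               int target_level) {
  if (!current_ok) return Op::kExpand;   // conceal this frame (again)
  if (!previous_ok) return Op::kMerge;   // smooth after a concealed frame
  if (buffer_level > target_level) return Op::kAccelerate;  // drain buffer
  if (buffer_level < target_level) return Op::kDecelerate;  // stretch audio
  return Op::kNormal;
}
```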

DSP processing

Time stretching (changing speed without changing pitch)

The code is in neteq/time_stretch.cc:

TimeStretch::ReturnCodes TimeStretch::Process(const int16_t* input, size_t input_len, bool fast_mode, AudioMultiVector* output, size_t* length_change_samples)

Time stretching adjusts the playback duration of speech without changing its pitch, and is the foundation of the accelerate and decelerate operations. WebRTC NetEQ uses the WSOLA (Waveform Similarity Overlap-Add) algorithm; it is too complex for the author to explain fully here, and interested readers can look up dedicated articles on it.

Normal

The code is in neteq/normal.cc:

int Normal::Process(const int16_t* input, size_t length, Modes last_mode, AudioMultiVector* output)

The available data exactly meets the playback requirement and needs no extra processing, except that if the previous packet was a concealment packet, smoothing is applied first.

Accelerate

The code is in neteq/accelerate.cc:

Accelerate::ReturnCodes Accelerate::Process(const int16_t* input, size_t input_length, bool fast_accelerate, AudioMultiVector* output, size_t* length_change_samples)

When the jitter delay is too large, acceleration reduces it without dropping packets. Because too many packets have accumulated, the playback duration of the data is shortened so that packets are consumed faster.
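As a rough sketch of the idea (illustrative only, far simpler than the real WSOLA-based implementation): find the pitch period by autocorrelation, then shorten the signal by one period using a cross-fade:

```cpp
#include <cstdint>
#include <vector>

// Illustrative pitch-period search: the lag in [min_lag, max_lag) that
// maximizes the signal's autocorrelation.
size_t FindPitchPeriod(const std::vector<int16_t>& x, size_t min_lag,
                       size_t max_lag) {
  size_t best_lag = min_lag;
  double best_corr = -1e300;
  for (size_t lag = min_lag; lag < max_lag && 2 * lag <= x.size(); ++lag) {
    double corr = 0.0;
    for (size_t i = 0; i + lag < x.size(); ++i)
      corr += static_cast<double>(x[i]) * x[i + lag];
    if (corr > best_corr) { best_corr = corr; best_lag = lag; }
  }
  return best_lag;
}

// Shorten the signal by one pitch period p: cross-fade x[0..p) into
// x[p..2p), then append the rest. Requires x.size() >= 2 * p.
std::vector<int16_t> AccelerateOnePeriod(const std::vector<int16_t>& x,
                                         size_t p) {
  std::vector<int16_t> out;
  out.reserve(x.size() - p);
  for (size_t i = 0; i < p; ++i) {
    const double w = static_cast<double>(i) / p;  // 0 -> 1 ramp
    out.push_back(static_cast<int16_t>((1.0 - w) * x[i] + w * x[i + p]));
  }
  out.insert(out.end(), x.begin() + 2 * p, x.end());
  return out;
}
```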

Decelerate

The code is in neteq/preemptive_expand.cc:

PreemptiveExpand::ReturnCodes PreemptiveExpand::Process(const int16_t* input, size_t input_length, size_t old_data_length, AudioMultiVector* output, size_t* length_change_samples)

Deceleration is the opposite. When the network is bad and many packets are lost, playback continuity matters most: little or no data has accumulated in the buffer, so the playback duration of the available data is stretched to buy time for new packets to arrive.

Merge

The code is in neteq/merge.cc, in the Merge class.

When the last played frame is not contiguous with the currently decoded frame, the two need to be joined and smoothed: part of the playback time of the two packets is overlapped so that the transition sounds more natural.
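A minimal sketch of the overlap-and-smooth idea (illustrative; the real merge operation also searches for the best alignment and adjusts gain):

```cpp
#include <cstdint>
#include <vector>

// Illustrative linear cross-fade: fade the tail of `a` out while fading the
// head of `b` in over `overlap` samples, hiding the discontinuity between
// them. Requires a.size() >= overlap and b.size() >= overlap.
std::vector<int16_t> CrossFade(const std::vector<int16_t>& a,
                               const std::vector<int16_t>& b, size_t overlap) {
  std::vector<int16_t> out(a.begin(), a.end() - overlap);
  for (size_t i = 0; i < overlap; ++i) {
    const double w = static_cast<double>(i) / overlap;
    out.push_back(static_cast<int16_t>(
        (1.0 - w) * a[a.size() - overlap + i] + w * b[i]));
  }
  out.insert(out.end(), b.begin() + overlap, b.end());
  return out;
}
```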

Packet loss concealment

The code is in neteq/expand.cc:

Expand::Expand(BackgroundNoise* background_noise, SyncBuffer* sync_buffer, RandomVector* random_vector, StatisticsCalculator* statistics, int fs, size_t num_channels)

When the current frame is lost, concealment reconstructs a substitute from the most recent samples using linear prediction, and updates its state as the reference for the next concealed frame. This algorithm, too, is too complex for the author to explain fully here; interested readers can look up dedicated articles on it.
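As a much simpler stand-in for the real algorithm (illustrative only; NetEQ uses linear prediction rather than plain repetition), concealment can be approximated by repeating the last pitch period with decaying gain:

```cpp
#include <cstdint>
#include <vector>

// Illustrative concealment: repeat the last pitch period of `history`,
// attenuating each repetition so that long loss bursts fade toward silence.
// Requires history.size() >= period.
std::vector<int16_t> ConcealFrame(const std::vector<int16_t>& history,
                                  size_t period, size_t frame_len) {
  std::vector<int16_t> out(frame_len);
  double gain = 1.0;
  for (size_t i = 0; i < frame_len; ++i) {
    if (i > 0 && i % period == 0) gain *= 0.75;  // decay per repetition
    out[i] = static_cast<int16_t>(
        gain * history[history.size() - period + (i % period)]);
  }
  return out;
}
```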

Buffers

NetEQ maintains the following buffers during processing:

Jitter buffer

Located in neteq/neteq_impl.h, in the Dependencies struct of the NetEqImpl class:

std::unique_ptr<PacketBuffer> packet_buffer

Used to store the audio data packets received from the network.

Decode buffer

Located in neteq/neteq_impl.h, in the NetEqImpl class:

std::unique_ptr<int16_t[]> decoded_buffer_ RTC_GUARDED_BY(crit_sect_);

Used to store decoded PCM data.

Algorithm buffer

Located in neteq/neteq_impl.h, in the NetEqImpl class:

std::unique_ptr<AudioMultiVector> algorithm_buffer_ RTC_GUARDED_BY(crit_sect_);

Used to store data processed by DSP.

Speech buffer

Located in neteq/neteq_impl.h, in the NetEqImpl class:

std::unique_ptr<SyncBuffer> sync_buffer_ RTC_GUARDED_BY(crit_sect_);

It is essentially a copy of the algorithm buffer's data, with an added marker separating the already-played portion from the rest.

Incoming packet processing

The insertion flow is implemented in neteq/neteq_impl.cc in the InsertPacket method, which delegates to the internal method that does the real work:

int NetEqImpl::InsertPacketInternal(const RTPHeader& rtp_header, rtc::ArrayView<const uint8_t> payload, uint32_t receive_timestamp)

The overall process is as follows:

  1. Put the data into a local PacketList variable
  2. Handle the RTP packet logic
    • Convert between external and internal timestamps
    • NACK (Negative Acknowledgement) handling
    • Detect redundant (RED) packets and split them into individual packets
    • Check the packet type
    • Detect and handle DTMF (Dual-Tone Multi-Frequency) packets
    • Bandwidth estimation
  3. Parse the packets
    • Discard noise packets
    • Unpack headers to obtain packet information
    • Count the normal voice packets, excluding error-correction and redundant packets
  4. Insert the voice packets into the PacketBuffer (the jitter buffer)

Outgoing audio processing

The retrieval flow is implemented in neteq/neteq_impl.cc in the GetAudio method, which delegates to the internal method that does the real work:

int NetEqImpl::GetAudioInternal(AudioFrame* audio_frame, bool* muted, absl::optional<Operations> action_override)

The overall process is as follows:

  1. Check the mute state; if muted, return a muted frame
  2. Obtain the control command decision based on the reception of the current frame and the previous frame
  3. If the decision is not packet loss concealment, decode the packet and put the result into the decode buffer
  4. Run voice activity detection (VAD)
  5. According to the command decision, process the decode buffer and put the result into the algorithm buffer (AudioMultiVector)
  6. Copy the data from the algorithm buffer into the speech buffer (SyncBuffer)
  7. Take out 10 ms of audio data and output it
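Putting the two flows together, a caller drives NetEQ roughly as follows. This is a schematic sketch: InsertPacket and GetAudio are the real public entry points (their internal counterparts are shown above), but the include paths and setup details are assumptions based on the M78 tree and may differ in other versions.

```cpp
// Schematic NetEQ usage. Include paths follow the M78 tree (an assumption;
// they have moved around between versions).
#include "api/array_view.h"
#include "api/audio/audio_frame.h"
#include "api/rtp_headers.h"
#include "modules/audio_coding/neteq/include/neteq.h"

// On every received RTP packet: insert it into the jitter buffer.
void OnRtpPacketReceived(webrtc::NetEq* neteq, const webrtc::RTPHeader& header,
                         rtc::ArrayView<const uint8_t> payload,
                         uint32_t receive_timestamp) {
  neteq->InsertPacket(header, payload, receive_timestamp);
}

// On every 10 ms playout tick: pull one frame of decoded, jitter-corrected
// audio. `muted` reports whether NetEQ produced a muted frame.
void OnPlayoutTick(webrtc::NetEq* neteq, webrtc::AudioFrame* frame) {
  bool muted = false;
  neteq->GetAudio(frame, &muted);
}
```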

Extension: how to design a Jitter Buffer

Based on WebRTC's NetEQ module, the author summarizes the following points to pay attention to when designing a jitter buffer (a minimal interface sketch follows the list):

  • Interface: putPacket/getPacket
  • Packet handling: type checks / splitting into playback frames
  • Out-of-order reception: handle via Sequence/Timestamp
  • Jitter and delay statistics calculation
  • Frame position and capacity adjustment, according to packet reception status
  • Audio parameter changes: Reset
  • Packet loss concealment: reconstruct from historical packets
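A minimal interface sketch reflecting these points (entirely hypothetical, not WebRTC's API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical jitter buffer interface capturing the design points above;
// this is not WebRTC's API.
class JitterBuffer {
 public:
  virtual ~JitterBuffer() = default;

  // putPacket: accept a possibly out-of-order packet; the implementation
  // orders it by sequence number / timestamp and updates jitter statistics.
  virtual void PutPacket(uint16_t seq, uint32_t timestamp,
                         std::vector<uint8_t> payload) = 0;

  // getPacket: return the next playback frame; if it is missing, return a
  // frame reconstructed from historical packets (concealment).
  virtual std::vector<int16_t> GetFrame() = 0;

  // Reset when audio parameters (sample rate, channels, codec) change.
  virtual void Reset() = 0;
};
```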

Finally

That wraps up this introduction to the WebRTC NetEQ module. It is only because Google open-sourced WebRTC that we can take this look inside, and most self-developed voice engines have referenced NetEQ's strategies to some degree. WebRTC truly is source code worth studying in the audio and video field.

This article is reprinted from a post by Nemocdz: https://juejin.im/post/5e1074546fb9a048131aa111

