The Real-time Transport Protocol (RTP) is one of the most widely used protocols in the AV over IP industry. You may find it in digital signage, video conferencing, IPTV, and surveillance video. Because it carries both audio and video, it is useful to understand its purpose. One of the primary reasons that development engineers like the protocol is its simplicity. Yet it still provides the information needed to manage and troubleshoot audio and video flows.



In Figure 1, you can see the position of the RTP header in the overall IP packet. Since the receiver processes the headers from left to right, the destination device reads the application port number in the UDP header before it processes the RTP header. So, what information does the receiver already know, and what does it still need to know, when it is ready to process the RTP header? First, it already knows from the IP address that the packet has arrived at the correct device. It also recognizes which application is to receive the data, based on the UDP port number. However, since the payload is audio or video, the receiver must also identify which of these it is carrying. Additionally, it should be aware of whether any packets have been dropped from the flow. Finally, it must know what source (codec) created the payload: was it left audio, video, right audio, or possibly another source?

We can tell what data RTP carries by looking into the RTP header shown in Figure 2. While all of the fields might be used, I’ll only discuss the ones that are critical to understanding the purpose of RTP.

The payload type (PT) contains a code that indicates the particular form of audio, video, or data in the payload field. The sequence number is a 16-bit integer that increases by one with each packet sent. Using this number, the receiver can detect missing packets as soon as a gap appears. Note that this capability is necessary because the protocols beneath RTP, UDP and IP, do not contain sequence numbers. If TCP were being used, it would supply a sequence number, but TCP is rarely used with RTP. The time stamp is a 32-bit binary number that marks when the payload was sampled; the receiver uses it to present (play) the audio or video with the correct timing. The SSRC and CSRC (Synchronization Source and Contributing Source) identifiers are codes that indicate the specific source that created the payload.
For example, if a movie is being transported, there might be three SSRC codes: one for left audio, one for right audio, and one for video. The receiver accepts RTP packets from all of these sources simultaneously, so it needs a way to identify each stream and deliver its signal to the appropriate speaker or screen.
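
To make the fields above concrete, here is a minimal Python sketch of how a receiver might pull them out of the 12-byte fixed RTP header and count lost packets from the sequence numbers. The names parse_rtp_header, packets_missing, and RtpHeader are my own, chosen for illustration, and the sample packet is made up. A real receiver would also have to handle header extensions, padding, and packets that arrive out of order.

```python
import struct
from collections import namedtuple

# Hypothetical container for the fields discussed in the article.
RtpHeader = namedtuple(
    "RtpHeader",
    "version marker payload_type sequence timestamp ssrc csrcs",
)

def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed RTP header plus any CSRC identifiers."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least a 12-byte header")
    # Network byte order: two single bytes, a 16-bit sequence number,
    # a 32-bit time stamp, and a 32-bit SSRC.
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6            # top two bits of the first byte
    csrc_count = b0 & 0x0F       # low four bits: number of CSRC entries
    marker = b1 >> 7             # top bit of the second byte
    payload_type = b1 & 0x7F     # remaining seven bits: the PT code
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("packet is too short for its CSRC list")
    csrcs = struct.unpack(f"!{csrc_count}I", packet[12:end])
    return RtpHeader(version, marker, payload_type,
                     sequence, timestamp, ssrc, csrcs)

def packets_missing(previous_seq: int, current_seq: int) -> int:
    """Count packets skipped between two sequence numbers of one SSRC.

    The sequence number is 16 bits wide, so it wraps from 65535 to 0.
    This simple version assumes packets arrive in order; a reordered
    packet would show up as a very large gap.
    """
    return (current_seq - previous_seq - 1) % 65536

# Made-up packet: version 2, payload type 96, sequence 1000.
sample = struct.pack("!BBHII", 0x80, 96, 1000, 160000, 0x1234ABCD) + b"payload"
header = parse_rtp_header(sample)
print(header.payload_type, header.sequence, hex(header.ssrc))
print(packets_missing(1000, 1003))
```

A receiver would typically keep the last sequence number seen for each SSRC, so that the left audio, right audio, and video streams are each checked for loss independently.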

In each case where RTP is used, there is usually a separate control protocol that allows the flow to be established and terminated. In voice or video conferencing, the control protocol is often SIP (Session Initiation Protocol). It is within the SIP establishment procedure that the port numbers associated with each RTP flow are exchanged between the source and the destination. In video surveillance, the control protocol is often RTSP (Real Time Streaming Protocol).
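
As a rough illustration of that port exchange, SIP commonly carries the details in a Session Description Protocol (SDP) text body, in which each "m=" line names a media type, a port, and a transport. The sketch below assumes that format. The function name rtp_ports_from_sdp and the sample session text are hypothetical, and the addresses and ports are invented.

```python
def rtp_ports_from_sdp(sdp_text: str) -> list:
    """Return (media, port) pairs for every RTP media line in an SDP body."""
    flows = []
    for line in sdp_text.splitlines():
        line = line.strip()
        if not line.startswith("m="):
            continue
        # Media line layout: m=<media> <port> <transport> <format list>
        parts = line[2:].split()
        if len(parts) < 3:
            continue
        media, port, transport = parts[0], parts[1], parts[2]
        if transport.startswith("RTP/"):
            # The port may be written as "port/count"; keep the first port.
            flows.append((media, int(port.split("/")[0])))
    return flows

# Invented session description offering one audio and one video flow.
SAMPLE_SDP = """v=0
o=alice 2890844526 2890844526 IN IP4 host.example.com
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 H264/90000
"""

for media, port in rtp_ports_from_sdp(SAMPLE_SDP):
    print(media, port)
```

Each pair tells the far end which UDP port to use for that flow, which is how the receiver later knows which application should get a given RTP packet.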

Phil Hippensteel, PhD, is a regular contributor to NewBay Media’s AV Technology and S&VC.