Types of VoIP Protocols used in 2020

VoIP protocols are required for the various components of a communications service to work together smoothly. Virtually every VoIP device uses a standard called the Real-time Transport Protocol (RTP) to transmit audio and video packets between communicating computers. The Internet Engineering Task Force (IETF) defines RTP in RFC 3550. The payload formats for a number of CODECs are defined in RFC 3551, although the International Telecommunication Union (ITU) and other IETF RFCs define further payload format specifications.

RTP addresses issues such as packet order, and it provides mechanisms to help address delay and jitter. These mechanisms include the RTP Control Protocol (RTCP), which is also defined in RFC 3550. In addition, one of the main areas of concern about Internet communications is the potential for eavesdropping. To address security concerns, Secure RTP was created (defined in RFC 3711). This technology provides encryption, authentication, and integrity protection for the audio and video packets that are transmitted between communicating devices.
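As a rough illustration of how RTP supports ordering and jitter handling, the sketch below decodes the 12-byte fixed header laid out in RFC 3550, which carries a sequence number and a timestamp. The function name `parse_rtp_header` and the sample packet values (including the SSRC) are made up for this example.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (RFC 3550 layout)."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the fixed header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,               # top two bits
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,        # e.g. 0 is PCMU in RFC 3551
        "sequence_number": seq,           # lets a receiver restore packet order
        "timestamp": timestamp,           # lets a receiver estimate jitter
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, payload type 0, sequence 1000, timestamp 160.
sample = struct.pack("!BBHII", 0x80, 0, 1000, 160, 0x12345678) + b"\x00" * 160
header = parse_rtp_header(sample)
print(header["sequence_number"], header["timestamp"])
```

A receiver would typically sort or buffer packets by `sequence_number` and compare `timestamp` spacing against arrival times to size its jitter buffer.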

Before audio or video media can be transmitted between two computers, however, various protocols must be employed to find the remote device and to negotiate a communications session. The protocols that are essential to this process are known as “call-signaling protocols,” the most popular of which are H.323 and SIP. Many other protocols help users perform various tasks, and each depends on supporting devices in order to function properly. The following protocols are common to the majority of the devices in use today.


In 1995, researchers set out to solve the problem of how two computers could initiate communication in order to exchange audio and video media streams. H.323 and SIP (Session Initiation Protocol) were the two resulting solutions, and H.323 enjoyed the first commercial success. While both protocols let users do the same thing, which is to establish multimedia communication on a single platform or across multiple platforms, they differ in design: H.323 is a binary protocol, and SIP is a text-based (ASCII) protocol.

When shopping for VoIP systems, you might notice that most new technology includes SIP rather than H.323. Ongoing debates over which system is better often leave H.323 behind, but H.323 is superior in a number of ways. It often brings better interoperability with the PSTN and with video systems, as well as reliable out-of-band transport of DTMF (the tones heard when pressing a button on a telephone). One advantage SIP has over H.323 is its lack of complexity. SIP resembles the HTTP and SMTP protocols, which makes SIP easier for many individuals to work with.

How it Works

An H.323 terminal is an endpoint in a LAN that participates in real-time two-way communications with another H.323 terminal, gateway, or multipoint control unit (MCU). H.323 endpoints are grouped together in zones, and each zone has one gatekeeper that manages all the endpoints in that zone. Each terminal must support audio communication, and it can also support audio with video, audio with data, or a combination of these capabilities.

H.323 can be referred to as an “intelligent endpoint protocol,” which means that all the intelligence required to locate the remote endpoint and to establish media streams between the local and the remote device is an integral part of the protocol. “Device control protocols” are complementary to H.323; the current ones are H.248 and MGCP.

Basic Usage

To understand how H.323 is used, it helps to understand how the gateway works. In VoIP, the gateway usually is a device that offers an IP interface on one side and some sort of legacy telephone interface on the other side. The legacy telephone interface may be complex, such as an interface to a legacy Public Switched Telephone Network (PSTN) switch, or it may be a simple interface that allows the user to connect one or a few traditional telephones. Basically, the gateway converts media provided in one type of network to the format required for another type of network.

Originally, gateways were viewed as monolithic devices that had call control provided by H.323 (or SIP) and hardware required to control the PSTN interface. In 1998, the idea of splitting the gateway into two logical parts was proposed: one part, which contains the call control logic, is called the Media Gateway Controller (MGC) or call agent (CA), and the other part, which interfaces with the PSTN, is called the media gateway (MG). With this functional split, a new interface existed (going between the MGC and MG), driving the necessity to define MGCP and H.248.

The H.323 gateway can provide an interface between H.323 and a PSTN, but it also can provide an interface between H.320, V.70, H.324 and other speech terminals. H.323 uses CODECs to convert between circuit-switched and packet formats, and works with the gatekeeper through RAS protocols to route signals from voice and fax through the network.

Megaco H.248

H.323 was designed for Local Area Networks (LANs) and does not scale easily to larger public networks. Enter Megaco, the result of a joint effort between the Internet Engineering Task Force (IETF) and ITU-T Study Group 16. The IETF published Megaco as RFC 3015, and the ITU-T published the same protocol as Recommendation H.248, so it is known as Megaco, H.248, or Megaco/H.248. Although it is a media gateway control protocol, it is a separate protocol from the Media Gateway Control Protocol (MGCP) described below.

This is a general-purpose standard protocol for handling signaling and session management required during a multimedia conference. It’s also used for control of elements in a physically decomposed multimedia gateway, which enables separation of call control from media conversion.

How it Works

MGCP and Megaco/H.248 are complementary to H.323 and SIP. They are referred to as “device control protocols” because they remove the signaling control from the gateway and hand it to a media gateway controller (MGC, sometimes called a “call agent” or softswitch), which dictates the service logic of communications traffic. Megaco/H.248 has two basic components: terminations and contexts. Terminations represent streams entering or leaving the MG, such as analog telephone lines, RTP streams, or MP3 streams. Terminations have properties, such as the maximum size of a jitter buffer, which the MGC can inspect and modify.

Terminations can be placed into contexts; a context exists when two or more termination streams are mixed and connected together. The MG creates and releases contexts under command of the MGC: a context is created when the first termination is added to it and released when the last termination is removed. A termination may contain more than one stream, so a context may be a multistream context. Audio, video, and data streams may exist in a context among several terminations.
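The context lifecycle described above can be sketched as a small data model. This is only a toy illustration, not a Megaco/H.248 implementation: the class name `MediaGateway`, the termination names, and the integer context IDs are all assumptions made for the example, and the `add` and `subtract` methods are loosely modeled on the protocol's Add and Subtract commands.

```python
class MediaGateway:
    """Toy model of Megaco/H.248 contexts and terminations (illustrative only)."""

    def __init__(self):
        self.contexts = {}    # context id -> set of termination names
        self._next_id = 1

    def add(self, termination, context_id=None):
        """Add a termination; with no context id, a new context is created."""
        if context_id is None:
            context_id = self._next_id
            self._next_id += 1
            self.contexts[context_id] = set()
        self.contexts[context_id].add(termination)
        return context_id

    def subtract(self, termination, context_id):
        """Remove a termination; the context is released when it becomes empty."""
        self.contexts[context_id].discard(termination)
        if not self.contexts[context_id]:
            del self.contexts[context_id]


mg = MediaGateway()
ctx = mg.add("analog-line/1")        # first termination creates the context
mg.add("rtp/1", ctx)                 # second termination joins it
members = set(mg.contexts[ctx])
mg.subtract("analog-line/1", ctx)
mg.subtract("rtp/1", ctx)            # last termination removed: context released
print(members, ctx in mg.contexts)
```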

Megaco/H.248 messages can be encoded either as text or as binary ASN.1, and they carry commands. Most commands are sent from the MGC to the MG, although the “ServiceChange” command can also be sent by the MG. The MG sends the “Notify” command to the MGC to inform the MGC that one of the events the MGC was interested in has occurred.

Basic Usage

Megaco/H.248 is similar to MGCP in its architecture and in its controller-to-gateway relationship, but Megaco/H.248 supports a broader range of networks. This protocol is central to VoP (Voice over Packet) solutions, and it can be integrated easily into products such as Central Office Switches, Gateways (Trunking, Residential and Access), Network Access Servers, Cable Modems, PBXs, IP Phones, Soft Phones, IADs, and Middleboxes to develop a convergent voice and data solution.


MGCP (Media Gateway Control Protocol) is an internal protocol used within a Voice over IP (VoIP) system and is specified in RFC 3435. This simple protocol was developed primarily to address the demands of carrier-based IP telephone networks, and it has become a de facto standard for media gateway control. MGCP complements both H.323 and SIP, which serve as signaling protocols within an IP network. A Media Gateway Controller (MGC) uses MGCP to control the Media Gateway (MG) and handles all the call processing over the IP network.

How it Works

An MGCP system comprises a Call Agent (CA), one MG that performs the conversion of media signals between circuits and packets, and one signaling gateway (SG) when the system is connected to the Public Switched Telephone Network (PSTN). MGCP, which uses SDP, is widely used between the elements of a decomposed multimedia gateway. Such a gateway consists of a CA, which contains the call control “intelligence,” and a media gateway, which provides the media functions, for example conversion from TDM voice to Voice over IP.

Media Gateways feature endpoints on which the CA can create and manage media sessions with other multimedia endpoints. Endpoints are sources and/or sinks of data, and they can be physical or virtual. Creating a physical endpoint requires a hardware installation, while a virtual endpoint can be created in software. Call Agents can create new connections or modify existing connections.

Generally, a media gateway is a network element that provides conversion between the data packets carried over the Internet or other packet networks and the voice signals carried by telephone lines. The Call Agent provides instructions to the endpoints to check for any events and to create signals for existing events. The endpoints are designed in such a way as to automatically communicate changes in service state to the Call Agent. The Call Agent can audit endpoints and the connections on endpoints.

MGCP connections can be point-to-point or multipoint. A point-to-point connection links two endpoints; once the connection is set up, data transfer can take place between them. In a multipoint connection, the connection is set up between an endpoint and a multipoint session. Connections can be created over various types of bearer networks.

Basic Usage

MGCP is popular in VoIP deployments because the MGCP Call Agent works as a seemingly complex software switch for a VoIP network, yet its underlying simplicity is easy to overlook. It really does nothing more than direct the media gateways and signaling gateways that perform all the work.

Every command in the MGCP architecture carries a transaction ID and receives a response that acknowledges it. This arrangement is often understood as a subscription architecture, because the CA tells the MG and the signaling gateways which events it wants them to watch for and report.

MGCP packets are usually carried over UDP, with port 2427 as the default gateway port. MGCP messages are formatted as whitespace-delimited text, and an MGCP packet is either a command that begins with a four-letter verb or a response that begins with a three-digit code.
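To make the command/response distinction concrete, the following sketch classifies the first line of an MGCP message. It is a simplification under stated assumptions: the function name `classify_mgcp` is invented here, the endpoint name and transaction ID are sample values, and the regular expressions cover only the common first-line shape rather than the full RFC 3435 grammar.

```python
import re

# First line of a command: verb, transaction id, endpoint name, protocol version.
COMMAND = re.compile(r"^([A-Z]{4}) (\d{1,9}) (\S+) MGCP (\d+\.\d+)")
# First line of a response: three-digit code, transaction id, optional comment.
RESPONSE = re.compile(r"^(\d{3}) (\d{1,9})(?: (.*))?$")


def classify_mgcp(first_line: str) -> dict:
    """Return a small description of an MGCP command or response line."""
    line = first_line.strip()
    match = COMMAND.match(line)
    if match:
        return {
            "kind": "command",
            "verb": match.group(1),
            "transaction_id": int(match.group(2)),
            "endpoint": match.group(3),
            "version": match.group(4),
        }
    match = RESPONSE.match(line)
    if match:
        return {
            "kind": "response",
            "code": int(match.group(1)),
            "transaction_id": int(match.group(2)),
            "comment": match.group(3) or "",
        }
    raise ValueError("not an MGCP command or response line")


command = classify_mgcp("CRCX 1204 aaln/1@gw.example.net MGCP 1.0")
response = classify_mgcp("200 1204 OK")
print(command["verb"], response["code"])
```

Matching a response to its command by `transaction_id` is what lets the Call Agent pair each acknowledgement with the request it sent.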


MIME, or Multipurpose Internet Mail Extensions, refers to an official Internet standard that defines how messages must be formatted so that they can be exchanged among various email systems. The headers that describe MIME messages are defined by RFC 2045, and the extensions that permit non-US-ASCII text in Internet mail header fields are defined by RFC 2047. Finally, the MIME conformance criteria and examples are given in RFC 2049.

MIME is a very flexible format that permits virtually any type of file or document in a message, including text, images, audio, video, or other data. MIME commonly uses base64 encoding so that non-text content survives transports that were designed for text. Ironically, it achieves this by encoding non-text data as text.
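A minimal sketch of the base64 idea, using Python's standard `base64` module: arbitrary bytes are turned into ASCII text and then recovered unchanged. The sample bytes are arbitrary values chosen for the example.

```python
import base64

# Arbitrary binary data, including bytes that are not printable text.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

encoded = base64.b64encode(raw).decode("ascii")   # text-safe representation
decoded = base64.b64decode(encoded)               # original bytes recovered

print(encoded)
print(decoded == raw)
```

The cost of this safety is size: every three input bytes become four output characters.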

How it Works

A MIME type is a combination of a type and a subtype, and the charset parameter of a text type identifies the character encoding. Internet protocols such as HTTP use the Content-Type header and the MIME type registry. MIME enables messages to have a tree structure, and it offers many features that are considered essential for modern email usage:

  • Support for character sets other than ASCII, required for sending email in languages other than English.
  • A content type labeling system, which allows multimedia content to be handled intelligently by computer programs.
  • Support for content in email messages that is not text, which allows email to contain multimedia content including images, audio, office documents, and more.
  • Support for compound documents, which allows a single email message to contain multiple parts (multiple images, file attachments, and so on).
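The features listed above can be seen in a short sketch that uses Python's standard `email` package to build a compound message with a text part and a binary attachment. The addresses, subject, file name, and the placeholder "audio" bytes are all invented for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Voicemail recording"
msg["From"] = "pbx@example.com"
msg["To"] = "user@example.com"

# A text part, labeled with a content type and character set.
msg.set_content("A new voicemail is attached.")

# A non-text part; the bytes below are placeholder data, not real audio.
fake_audio = bytes(range(16))
msg.add_attachment(fake_audio, maintype="audio", subtype="wav",
                   filename="message.wav")

# Walking the message shows the tree: a multipart container and its parts.
part_types = [part.get_content_type() for part in msg.walk()]
attachment = next(msg.iter_attachments())
print(part_types)
```

Adding the attachment converts the message into a multipart container, which is the tree structure that lets a single message hold several labeled parts.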

Basic Usage

The MIME format is very similar to the format of the information that is exchanged between a Web browser and a Web server. This related format is specified as part of the Hypertext Transfer Protocol (HTTP). Virtually all human-written Internet email and a fairly large proportion of automated email are transmitted via SMTP in MIME format.

Internet email has come a long way since RFC 822 was published in 1982. Today, all the mainstream email programs are fully compatible with the MIME standard for email, which allows for some advanced features and interoperability. The user-visible features that depend on MIME include styled text, text in non-Roman alphabets, file attachments, and multimedia content.


Remote Voice Protocol (RVP or RVP/IP) is a proprietary specification developed by MCK Communications for transporting digital telephony sessions over packet- or circuit-based data networks. The protocol is used primarily in MCK’s Extender product family, which extends PBX services over Wide Area Networks (WANs).

How it Works

RVP provides facilities for connection establishment and configuration between a client (or remote station set) device and a server (or phone switch) device. When a remote caller attempts to make a connection with the PBX, the MCK Extender initiates a TCP session to the Extender PBX gateway. The session is initiated from a high-numbered TCP port to TCP port 2698.

RVP/IP uses Transmission Control Protocol (TCP) to transport signaling and control data, and User Datagram Protocol (UDP) to transport voice data. Both TCP and UDP work in conjunction with IP to ensure that packets reach their intended destinations. The signaling occurs through the TCP session and the voice is transferred via the UDP session. RVP over IP depends on the network configuration and the level of Quality-of-Service (QoS).

The devices communicate as client and server, with the MCK Extender products functioning as clients. The client that initiates the RVP over IP session opens a TCP source port numbered 1024 or higher and then sends a request to TCP port 2698. Voice and network parameters make up the data packets. The voice parameters comprise the voice path, voice compression algorithm, DTMF encoding, comfort noise generator, echo cancellation, and silence detection. The network parameters comprise packet size and jitter buffer.

The remote MCK Extender starts the UDP stream upon the successful establishment of the TCP session. The UDP stream uses ports from 12288 (0x3000) up to 12543 (0x30FF). The UDP listening port is 2698. RVP over IP reduces network traffic congestion and packet loss by employing a “packetizer” that places several voice samples in a single data packet. The CODEC and packet size determine the interval at which voice is transmitted.
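The relationship between CODEC, packet size, and transmission interval is simple arithmetic, sketched below. This is a generic calculation rather than anything specific to RVP: the function name `packet_interval_ms` is invented here, and the 64 kbit/s figure is the standard G.711 rate used purely as a sample input.

```python
def packet_interval_ms(payload_bytes: int, codec_bitrate_bps: int) -> float:
    """Milliseconds of audio carried by one packet of the given payload size."""
    return payload_bytes * 8 * 1000 / codec_bitrate_bps


# Sample input: a 64 kbit/s codec with 160 bytes of voice per packet.
interval = packet_interval_ms(160, 64_000)
packets_per_second = 1000 / interval
print(interval, packets_per_second)
```

Larger packets mean fewer packets per second and less header overhead, at the price of more delay before each packet can be sent.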

Basic Usage

RVP Control Protocol was originally developed for point-to-point applications, so most of its functionality is unnecessary when using TCP/IP. During the RVP/IP session, one class of RVP/IP control message is exchanged. RVPCP ADD VOICE (operation code 12) packet takes a single parameter of type and the server responds with a single packet containing the code RVPCP ADD VOICE ACK (operation code 13). If RVP/IP is operating in “dynamic voice” mode, this exchange must be repeated whenever the voice channel needs to be reestablished, i.e., whenever the connection is broken.


Session Announcement Protocol (SAP) is an announcement protocol that is used by session directory clients to assist the advertisement of multicast multimedia conferences and other multicast sessions. It also is used to communicate the relevant session setup information to prospective participants.

How it Works

A SAP announcer periodically multicasts an announcement packet to a well-known multicast address and port. The announcement is multicast with the same scope as the session it is announcing, ensuring that the recipients of the announcement can also be potential recipients of the session the announcement describes (bandwidth and other such constraints permitting). This is also important for the scalability of the protocol, as it keeps local session announcements local.

A SAP listener learns of the multicast scopes it is within (for example, using the Multicast-Scope Zone Announcement Protocol) and listens on the well-known SAP address and port for those scopes. In this manner, it will eventually learn of all the sessions being announced, allowing those sessions to be joined.
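As an illustration of what a listener receives, the sketch below builds and parses the fixed part of a SAP packet as laid out in RFC 2974. It is a simplified sketch under stated assumptions: the function names are invented, the origin address and hash are sample values, and encrypted or compressed payloads are not handled. The address and port constants are the IPv4 global-scope values given in that RFC.

```python
import socket
import struct

SAP_IPV4_GROUP = "224.2.127.254"   # well-known address for global-scope sessions
SAP_PORT = 9875                    # well-known SAP port


def build_sap(origin_ipv4: str, msg_id_hash: int, sdp_text: str) -> bytes:
    """Build an unauthenticated IPv4 SAP announcement carrying SDP."""
    flags = 1 << 5                 # version 1; all other flag bits zero
    header = struct.pack("!BBH4s", flags, 0, msg_id_hash,
                         socket.inet_aton(origin_ipv4))
    return header + b"application/sdp\x00" + sdp_text.encode("utf-8")


def parse_sap(packet: bytes) -> dict:
    """Parse a SAP packet (no support for encryption or compression)."""
    flags, auth_len, msg_id_hash = struct.unpack("!BBH", packet[:4])
    is_ipv6 = bool(flags & 0x10)
    addr_len = 16 if is_ipv6 else 4
    family = socket.AF_INET6 if is_ipv6 else socket.AF_INET
    origin = socket.inet_ntop(family, packet[4:4 + addr_len])
    rest = packet[4 + addr_len + auth_len * 4:]   # skip authentication data
    payload_type = "application/sdp"              # implied when field is absent
    if not rest.startswith(b"v=0"):
        type_bytes, _, rest = rest.partition(b"\x00")
        payload_type = type_bytes.decode("ascii")
    return {
        "version": flags >> 5,
        "deletion": bool(flags & 0x04),
        "msg_id_hash": msg_id_hash,
        "origin": origin,
        "payload_type": payload_type,
        "payload": rest.decode("utf-8"),
    }


packet = build_sap("192.0.2.10", 0x1234, "v=0\r\ns=Example session\r\n")
info = parse_sap(packet)
print(info["origin"], info["payload_type"])
```

A real listener would join `SAP_IPV4_GROUP` on `SAP_PORT` with a UDP socket and feed each received datagram to a parser like this one.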

Basic Usage

It is to be expected that sessions may be announced by a number of different mechanisms, not only SAP. For example, a session description may be placed on a web page, sent by email or conveyed in a session initiation protocol. To ease interoperability with these other mechanisms, application level security is employed, rather than using IPsec authentication headers.



The Session Description Protocol (SDP) is an IETF standard that allows a multimedia device to describe the kinds of media that it has to offer or that it wishes to accept. As part of this description, the device indicates the type of media (audio, video, text, etc.), the IP ports used, the protocols used (e.g., T.120), and other information necessary for a device to receive the specified media and understand how to handle that media.

SDP has been published by the IETF as RFC 4566, which has since been superseded by RFC 8866. There are additional RFCs that document extensions or enhancements to SDP.

How it Works

The owner of a conference advertises it over a network by sending multicast messages that contain a description of the session, e.g., the name of the owner, the name of the session, the coding, and the timing. SDP does not carry the media itself; it simply provides a description that allows two endpoints to agree on a media type and format.

The recipients of the SDP message then decide whether to participate in the session. An SDP description is generally carried in the body of a Session Initiation Protocol (SIP) message.
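To show what such a description looks like, the sketch below parses a small SDP body into session-level fields and media sections. It is deliberately simplified: the function name `parse_sdp` is invented, the session values are sample data, and details such as repeated fields or port ranges are not handled.

```python
def parse_sdp(text: str) -> dict:
    """Parse SDP 'key=value' lines into session fields and media sections."""
    session = {"media": []}
    current = session                      # lines before the first m= are session-level
    for line in text.splitlines():
        if len(line) < 2 or line[1] != "=":
            continue                       # skip anything that is not key=value
        key, value = line[0], line[2:]
        if key == "m":
            media, port, proto, *formats = value.split()
            current = {"media": media, "port": int(port),
                       "proto": proto, "formats": formats, "attributes": []}
            session["media"].append(current)
        elif key == "a":
            current.setdefault("attributes", []).append(value)
        else:
            current[key] = value
    return session


# Sample description; names, addresses, and ports are illustrative only.
sample_sdp = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890842807 IN IP4 192.0.2.5",
    "s=SDP Seminar",
    "c=IN IP4 233.252.0.12/127",
    "t=2873397496 2873404696",
    "m=audio 49170 RTP/AVP 0",
    "m=video 51372 RTP/AVP 99",
    "a=rtpmap:99 h263-1998/90000",
])
session = parse_sdp(sample_sdp)
print(session["s"], [m["media"] for m in session["media"]])
```

Each `m=` line opens a media section, so the attribute on the last line belongs to the video stream rather than to the session as a whole.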

SDP started off as a component of SAP, but found other uses in conjunction with RTP, SIP, and as a standalone format as described above.

There are five terms related to SDP:

  1. Conference: Two or more communicating users who utilize communication devices to meet rather than meeting in person.
  2. Session: The flowing stream of data between an open multimedia sender and receiver.
  3. Session Announcement: A session announcement is a session description conveyed to users who may or may not expect the announcement.
  4. Session Advertisement: Same as session announcement.
  5. Session Description: The information included in the session announcement or advertisement.

Basic Usage

SDP is well suited to informing business partners, clients, and other large groups of interconnected individuals about upcoming events. As with any software, however, there is a learning and usability curve. The SDP offer/answer model is where most SIP interoperability issues occur, and following the RFC may or may not resolve your issues. Problems in session advertising may also stem from differing endpoint types and control protocols.

Within a VoIP architecture, there are a number of different media endpoints, which may be controlled by MGCP, H.248, or SIP. For all combinations of endpoints on a given connection, there is a need to ensure that CODEC negotiation takes place and that the differing uses of SDP are reconciled.


Christian Huitema and Mauricio Arango published the Simple Gateway Control Protocol (SGCP) in 1998 as part of the development of the “Call Agent Architecture” at Telcordia. In this architecture, a central server called the “Call Agent” or “Softswitch” controls media gateways and receives telephony signaling requests through a “signaling gateway.” Basically, SGCP handles the communication between the call agent and the gateways.

How it Works

SGCP was designed to be a simple “remote control” standard that works alongside the Session Initiation Protocol (SIP), enabling the Call Agent to relay calls between a VoIP network using H.323 or SIP and a traditional telephone network. SGCP commands are encoded with a syntax somewhat comparable to SIP or HTTP headers. They carry a payload describing the voice over IP media stream. This payload is encoded using the same Session Description Protocol (SDP) that SIP uses.

SGCP was merged with the IPDC proposal sponsored by Level3 Communications. This led to the definition of the Media Gateway Control Protocol, jointly submitted to the IETF by the authors of SGCP and IPDC in November 1998.

Basically, SGCP assumes a connection model in which the basic constructs are endpoints and connections. Connections may be either point-to-point or multipoint. A point-to-point connection is an association between two endpoints with the purpose of transmitting data between these endpoints. Once this association is established for both endpoints, data transfer between them can take place. A multipoint connection is established by connecting the endpoint to a multipoint session.

Basic Usage

SGCP is designed as an internal protocol within a distributed system that appears to the outside as a single VoIP gateway. In reality, this system is composed of a call agent, which may or may not be distributed over several computer platforms, and a set of gateways. SGCP is used to control the gateways remotely, instructing them to forward the voice signals received on a circuit towards another gateway.

SGCP commands (Create Connection and Modify Connection) carry an SDP payload, where the VoIP parameters such as supported encoding, RTP options, UDP port and IP address are defined. In some network configurations, gateways expect to carry the voice packets over an ATM or a frame relay network. SGCP can easily be extended to provide signaling for these gateways.

Through the interface, the call agent can ask the gateway to collect digits dialed by the user. This facility is intended to be used with access gateways to collect the numbers that a user dials; it may also be used with trunking gateways and access gateways alike, to collect the access codes, credit card numbers and other numbers requested by call control services.

An alternative procedure would ask the gateway to notify the call agent of the dialed digits, as soon as they are dialed. However, such a procedure generates a large number of interactions. It is preferable to accumulate the dialed numbers in a buffer, and to transmit them in a single message. The problem with this accumulation approach, however, is that it is hard for the gateway to predict how many numbers it needs to accumulate before transmission.

The solution to this problem is to load the gateway with a digit map that corresponds to the dial plan. The call agent provides digit maps to the gateway whenever the call agent instructs the gateway to listen for digits.
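The digit-map idea can be sketched as follows: the gateway checks the digits collected so far against each alternative in the map and reports them only when one alternative is fully matched. This is a simplified sketch under stated assumptions: the function names and the sample map are invented, and only literal digits, `x` (any digit), and bracketed ranges are supported, not timers or repetition.

```python
import re


def parse_pattern(pattern: str) -> list:
    """Turn one digit-map alternative into a list of allowed-character sets."""
    positions = []
    for token in re.findall(r"\[[^\]]+\]|.", pattern):
        if token == "x":
            positions.append(set("0123456789"))
        elif token.startswith("["):
            body, chars, i = token[1:-1], set(), 0
            while i < len(body):
                if i + 2 < len(body) and body[i + 1] == "-":
                    # A range such as 1-7 expands to every character in between.
                    chars.update(chr(c) for c in
                                 range(ord(body[i]), ord(body[i + 2]) + 1))
                    i += 3
                else:
                    chars.add(body[i])
                    i += 1
            positions.append(chars)
        else:
            positions.append({token})
    return positions


def classify(digit_map: str, dialed: str) -> str:
    """Return 'complete', 'partial', or 'no-match' for the digits so far."""
    partial = False
    for alternative in digit_map.strip("()").split("|"):
        positions = parse_pattern(alternative)
        if len(dialed) > len(positions):
            continue
        if all(d in allowed for d, allowed in zip(dialed, positions)):
            if len(dialed) == len(positions):
                return "complete"      # send the digits to the call agent now
            partial = True             # keep accumulating
    return "partial" if partial else "no-match"


# Hypothetical dial plan: operator, 4-digit extensions, 8-digit local, star codes.
dial_plan = "(0|[1-7]xxx|8xxxxxxx|*xx)"
results = [classify(dial_plan, d) for d in ["0", "12", "1234", "9", "*12"]]
print(results)
```

With a map like this loaded, the gateway can buffer digits locally and send a single message as soon as the result becomes "complete".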


SIP is an application-layer control protocol that allows users to create, modify, and terminate sessions with one or more participants. It can be used to create two-party, multiparty, or multicast sessions that include Internet telephone calls, multimedia distribution, and multimedia conferences.

The Session Initiation Protocol (SIP) was first drafted within the IETF in 1996, but the first recognized standard, RFC 2543, was not published until 1999. SIP was revised over the years and re-published in 2002 as RFC 3261, which is the currently recognized standard for SIP. These delays in the standards process delayed market adoption of SIP, which is one reason H.323 came to be considered the VoIP connectivity standard.

Today, H.323 still commands the bulk of the VoIP deployments in the service provider market for voice transit, especially for transporting voice calls internationally. H.323 is also widely used in room-based video conferencing systems and is the preferred protocol for IP-based video systems. SIP has, most recently, become more popular for use in instant messaging systems.

How it Works

Like HTTP or SMTP, SIP works in the Application layer of the Open Systems Interconnection (OSI) communications model, the layer that provides communication services to applications. SIP can establish multimedia sessions or Internet telephony calls, and modify or terminate them. The protocol can also invite participants to unicast or multicast sessions that do not necessarily involve the initiator. Because SIP supports name mapping and redirection services, it makes it possible for users to initiate and receive communications and services from any location, and for networks to identify the users wherever they are.

SIP is a request-response protocol, dealing with requests from clients and responses from servers. Participants are identified by SIP URIs, and requests can be sent over several transport protocols. SIP determines the end system that will be used for any given session, the communication media and its parameters, and the recipient’s response to the call. Once these actions have been executed, SIP establishes the call parameters at the caller and recipient ends and handles all transfer and termination.
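The request-response exchange can be illustrated with a short sketch that builds a bare INVITE request and reads the start line of either a request or a response. It is an illustration only: the function names, host names, tag, and Call-ID are sample values, and a real user agent would also need an SDP body, retransmission handling, and much more.

```python
def build_invite(caller: str, callee: str, call_id: str, branch: str) -> str:
    """Build a minimal, body-less SIP INVITE as text (illustrative only)."""
    lines = [
        f"INVITE {callee} SIP/2.0",
        f"Via: SIP/2.0/UDP client.example.com;branch={branch}",
        "Max-Forwards: 70",
        f"From: <{caller}>;tag=1928301774",
        f"To: <{callee}>",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <{caller}>",
        "Content-Length: 0",
        "",                      # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_start_line(message: str) -> dict:
    """Classify a SIP message as a request or a response from its first line."""
    first = message.split("\r\n", 1)[0]
    parts = first.split(" ", 2)
    if parts[0] == "SIP/2.0":
        return {"kind": "response", "status": int(parts[1]), "reason": parts[2]}
    return {"kind": "request", "method": parts[0],
            "uri": parts[1], "version": parts[2]}


invite = build_invite("sip:alice@example.com", "sip:bob@example.org",
                      "a84b4c76e66710", "z9hG4bK776asdhds")
request = parse_start_line(invite)
ringing = parse_start_line("SIP/2.0 180 Ringing\r\n\r\n")
print(request["method"], ringing["status"])
```

The text format is what gives SIP its resemblance to HTTP: a start line, a block of headers, a blank line, and an optional body.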

Although SIP is as old as H.323 as an initiation protocol, SIP wasn’t designed to address many problems within legacy communication systems. Additionally, since H.323 has been the industry standard, many more people are familiar with this protocol. Although SIP has been marketed as easy to use and to debug, the reality is that there is the same amount of complexity involved in this standard as any other standard within VoIP.

SIP does appear to be easier to develop and troubleshoot, but these attributes don’t make the protocol easier to use. Instead, these abilities have resulted in a number of non-standard SIP variations and a number of non-standard extensions for these developments.

Basic Usage

SIP can run on TCP, UDP, or SCTP, and it supports five facets of establishing and terminating multimedia communications:

  • It determines the end system that will be used for communication;
  • It determines the willingness of the called party to engage in communications;
  • It determines the media and its parameters;
  • It sets up the session by “ringing” and by establishing session parameters on both ends;
  • It manages sessions, including transfer and termination of sessions, modification of session parameters, and invocation of services.

SIP provides a suite of security services, which include denial-of-service prevention, authentication (both user to user and proxy to user), integrity protection, and encryption and privacy services.

Skinny (SCCP)

The word “skinny” often refers to a scaled-down device that deliberately offers fewer features or functions than the “fat” version of the same device. In VoIP, the Skinny Client Control Protocol (SCCP, also known as Skinny) is a “lite” proprietary protocol that Cisco uses with its “fat” telephone equipment systems. Skinny reduces the processing load on the client hardware.

How it Works

In this system, Cisco allows Skinny clients to communicate with H.323 VoIP systems, because the H.323 processing is handled by an intervening Call Manager device.

The Skinny client and the Call Manager use a simple messaging set, the Skinny Client Control Protocol (SCCP), to communicate with each other over TCP/IP. Skinny systems use a proxy for the H.225 and H.245 signaling and use RTP/UDP/IP for audio. The Skinny client (e.g., an Ethernet phone) uses TCP/IP to place and receive calls and RTP/UDP/IP to exchange audio with another Skinny client or an H.323 terminal. Skinny messages are carried over TCP and use port 2000. Skinny gateways are a series of digital gateways that include the DT-24+, the DT-30+, and the WS-X6608-x1 Catalyst voice module.

The end station of a LAN or IP-based PBX must be simple to use, familiar, and relatively cheap. A full H.323 implementation is comparatively expensive. An H.323 proxy can be used to communicate with the Skinny client using SCCP. In such a case the telephone is a Skinny client over IP, in the context of H.323, and the proxy handles the H.225 and H.245 signaling.

When calling a non-Skinny client, the clients establish a connection through the Call Manager using TCP and then the two endpoints communicate using UDP. When Skinny phones connect to each other, they use RTP over UDP. Some vendors in addition to Cisco also support SCCP, and Cisco Call Manager 4.0 supports a secure version of SCCP, which uses Transport Layer Security (TLS) to encrypt communications and provide for confidentiality of voice conversations.

Basic Usage

If you already maintain a Cisco system, the changeover might prove seamless. However, the use of this system limits the use of open source systems, and it locks you into proprietary software that may be subject to budget-pinching upgrades and licenses. On the other hand, the Cisco Call Manager is an H.323 proxy that communicates with Skinny clients. This may result in much less overhead than with H.323 alone, especially for a business that is connected to a company Local Area Network (LAN) or Wide Area Network (WAN).