Configuring the network parameters for your workers can dramatically improve your users’ experience of your game.
The Worker SDK allows you to configure parameters of the network stack that workers use to communicate with the Runtime. This gives you control over the trade-off between bandwidth overhead and latency, the upper bound on worker memory usage, and disconnection timeouts. You can even choose whether or not your data is encrypted on the wire.
Each network stack comes with a default set of parameters which we believe should work well in the majority of use cases. This page outlines when and why it’s worth explicitly setting and/or experimenting with different values for these parameters.
Choosing a network stack
First, you need to choose a network connection type to use when creating a connection object. Despite the fact that each network connection type is named after the underlying transport protocol, they each correspond to an entirely different implementation of the network stack. There are three options to choose from, each with their own strengths and weaknesses:
TCP
You should use TCP for server-workers by default.
You should not use TCP for client-workers unless efficient use of bandwidth is much more important than latency for your use case.
RakNet
The RakNet network stack uses the RakNet third-party game networking library for reliable transport. The RakNet reliable transport protocol is built on top of UDP and performs better than TCP on unreliable networks like those which client-workers typically connect over.
You can use RakNet for client-workers to improve latency on unreliable networks.
KCP
The KCP network stack uses the KCP third-party library for reliable transport. The KCP reliable transport protocol is also built on top of UDP and is designed specifically to reduce latency on particularly unreliable networks like Wi-Fi and 3G/4G. All data sent over a KCP connection is encrypted using DTLS by default.
The KCP network stack is much more configurable and flexible than TCP and RakNet. You can see this in our guide to configuring the network stack.
We recommend that you use KCP for client-workers to improve latency on unreliable networks. It can be particularly effective for client-workers that connect over wireless networks.
Configuring the network stack
In order to inform your choice of values for various network parameters, you should focus on five key outputs:
|Bandwidth overhead||the amount of data delivered between the worker and the Runtime in addition to your game data||lower is better|
|Throughput||the amount of data delivered between the worker and the Runtime per unit time||higher is better|
|Latency||the time it takes for the data to be delivered between the worker and the Runtime||lower is better|
|Memory usage||the amount of RAM consumed by your worker's connection for network activity||lower is better|
|CPU usage||the amount of CPU time consumed by your worker's connection for network activity||lower is better|
In order to make your networked game interactions feel responsive, you should aim to minimize latency whilst staying within the available bandwidth for your client-worker (which may vary significantly depending on the user’s network setup and ISP). The downstream bandwidth available (from Runtime to worker) to a client-worker is typically much greater than the upstream bandwidth available (from worker to Runtime).
The total amount of data throughput you can achieve with a given amount of available bandwidth is greater if your bandwidth overhead is lower. If the total bandwidth required by your game starts to exceed the client’s available bandwidth, your game data (such as commands and component updates) will be delayed, resulting in a poor user experience.
The effect of different network parameters on performance outputs
|Parameter change||Bandwidth overhead||Latency||Throughput||Memory usage||CPU usage|
|Increasing KCP or TCP flow control window sizes||may decrease||may increase||increases upper bound|
|Increasing KCP or TCP multiplex level||may decrease||may increase||increases upper bound||may increase|
|Increasing KCP minimum retransmission timeout||may decrease||may increase|
|Enabling KCP fast retransmission||may increase||may decrease||may decrease||may increase|
|Enabling KCP early retransmission||may increase||may decrease||may decrease||may increase|
|Enabling KCP non-concessional flow control||may decrease or increase||may decrease or increase|
|Increasing KCP update interval||increases||may increase||increases|
|Enabling KCP erasure coding||increases||may decrease||increases||increases|
|Increasing ratio of KCP erasure codec recovery packets to original packets||increases||may decrease||increases||increases|
|Enabling TCP TCP_NODELAY||may increase||decreases|
Flow control window sizes
Both KCP and TCP allow you to specify flow control parameters. Flow control is a technique employed by transport protocols to control the rate of data transfer in each direction. Flow control windows keep track of how much data has been sent but not yet processed by the receiver.
KCP allows you to specify send and receive window sizes in units of KCP packets. TCP allows you to specify send and receive buffer sizes in units of bytes. These buffers also act as flow control windows.
Generally, you should aim to specify window sizes large enough to avoid the possibility of flow control interrupting and delaying data transfer, since this could translate into noticeable delays in real-time gameplay. To elaborate, if your worker’s receive window is too small, the Runtime may fill up the entire window. The Runtime will then have to wait until the worker sends it a packet notifying it that the worker has freed up some space in its receive window. The same is true of the worker’s send window size, but in the other direction. When flow control kicks in like this, it can decrease throughput and increase latency.
However, specifying larger flow control windows increases the upper bound on memory usage. Also, in the case of KCP, specifying too large a receive window may result in the underlying UDP socket buffer running out of space when large amounts of data are sent at once, leading to additional packets being dropped.
To calculate a lower bound on what window sizes you need, you need to consider how much data your workers send and receive. You can use a concept known as the bandwidth-delay product to help you inform your decision. The bandwidth-delay product is the result of multiplying the following two quantities:
- bandwidth or data-link capacity of the route between the worker and the Runtime.
- round-trip time of a packet between the worker and the Runtime.
The bandwidth-delay product represents the maximum amount of data which can be in transit between the worker and the Runtime at a time. Window sizes smaller than the bandwidth-delay product will not be able to take advantage of all the available bandwidth.
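As a rough illustration of the bandwidth-delay product in bytes (the unit TCP buffer sizes use), the calculation can be sketched as follows. The function name and the 10 Mbit/s link are hypothetical values chosen for the example; they are not part of the Worker SDK.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_second: int, round_trip_ms: int) -> float:
    """Maximum amount of data (in bytes) that can be in transit at one time.

    bits/s * ms = bits * 1000, so dividing by 8 * 1000 yields bytes.
    """
    return bandwidth_bits_per_second * round_trip_ms / 8000


# Hypothetical link: 10 Mbit/s of available bandwidth, 50 ms round-trip time.
bdp_bytes = bandwidth_delay_product_bytes(10_000_000, 50)
print(bdp_bytes)  # 62500.0
```

Under these assumed numbers, a TCP send or receive buffer smaller than about 62,500 bytes could not keep the link full.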
Since KCP window sizes are specified as a number of KCP packets, the bandwidth-delay product (in bytes) cannot directly inform the choice of KCP window sizes. Luckily, we can calculate a similar quantity in units of KCP packets. The amount of data each packet can hold depends on the maximum transmission unit (MTU) of the underlying network. However, if most of the component updates your game sends and receives are relatively small, like position updates tend to be, there is likely to be a roughly 1-to-1 correspondence between the number of component updates and the number of KCP packets.
Example: calculating minimum KCP window sizes
A client-worker is interested in 50 entities around a player, each of which is receiving updates of 30 bytes (or 1 packet) at a rate of 60 updates per second. The average round-trip time between the client-worker and the Runtime is 50 milliseconds. From this information, we derive the following quantities:
- minimum number of packets received per second: 50 * 60 = 3000
- minimum number of packets received per round-trip duration (the bandwidth-delay product): 3000 * 50 / 1000 = 150
Therefore, the receive window needs to be at least 150 packets.
Note that this is a strict lower bound since you also need to consider all other sources of packets being sent between the client-worker and the Runtime, such as those for other component updates, events, commands, entity checkouts etc.
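The arithmetic in this example can be reproduced with a small helper, which makes it easy to try other entity counts, update rates, and round-trip times. This is only a sketch: the function name is hypothetical, and it assumes one KCP packet per component update, as the example does.

```python
import math


def min_kcp_window_packets(packets_per_second: float, round_trip_ms: float) -> int:
    """Lower bound on a KCP window size, in packets, for a given packet rate.

    This is the bandwidth-delay product expressed in packets rather than bytes.
    """
    return math.ceil(packets_per_second * round_trip_ms / 1000)


entities_of_interest = 50
updates_per_second_per_entity = 60
round_trip_ms = 50

packets_per_second = entities_of_interest * updates_per_second_per_entity
receive_window = min_kcp_window_packets(packets_per_second, round_trip_ms)
print(packets_per_second, receive_window)  # 3000 150
```

Because other traffic shares the same connection, a value chosen in practice would sit above this bound.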
Multiplex level
Both TCP and KCP allow you to specify a multiplex level. The multiplex level specifies the number of independent, reliable, ordered streams used by the transport layer to send data relating to different entities. Where possible, updates corresponding to different entities are sent on different streams to avoid delayed updates for one entity affecting other entities. Therefore, increasing the multiplex level may decrease latency.
Increasing the multiplex level increases the upper bound on memory used by the worker’s network connection because each stream has its own send and receive window. Therefore, the upper bound on memory usage is proportional to the multiplex level multiplied by the sum of the send and receive window sizes.
Having more multiplexed streams also increases the total amount of work that needs to be done, so it may increase CPU usage.
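The proportionality between multiplex level, window sizes, and memory can be sketched as below. The function name and the per-packet size of 1400 bytes are assumptions made for illustration; the real constant depends on the network stack's internals and the MTU, so treat the result as an order-of-magnitude estimate only.

```python
def window_memory_upper_bound_bytes(
    multiplex_level: int,
    send_window_packets: int,
    receive_window_packets: int,
    max_packet_bytes: int,
) -> int:
    """Estimate the worst-case memory held in flow control windows.

    Each multiplexed stream has its own send and receive window, so the
    bound scales with multiplex_level * (send window + receive window).
    """
    return multiplex_level * (send_window_packets + receive_window_packets) * max_packet_bytes


# Hypothetical configuration: 4 streams, 150-packet windows, 1400-byte packets.
estimate = window_memory_upper_bound_bytes(4, 150, 150, 1400)
print(estimate)  # 1680000
```

Doubling either the multiplex level or both window sizes doubles this estimate, which is why the two parameters are worth tuning together.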
Minimum retransmission timeout
KCP allows you to specify a minimum retransmission timeout. When the network connection is first established, it has no estimate of the typical round-trip time for a packet, so it must guess how long to wait before treating a packet as lost and retransmitting it. The time it waits initially is the minimum retransmission timeout. You can configure this value explicitly for KCP.
When the connection has calculated a smoothed round-trip time from some round-trip time samples, it calculates the retransmission timeout based on how long it takes for most packets to be acknowledged. However, the calculated retransmission timeout is still bounded by the configurable minimum retransmission timeout.
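To make the role of the minimum concrete, here is a minimal sketch of a retransmission timeout estimator. It assumes the smoothing formulas from RFC 6298 (the standard TCP estimator); the KCP network stack may compute its smoothed round-trip time differently, and the class name is hypothetical. The part that mirrors the text is the final clamp against the configured minimum.

```python
class RetransmissionTimeoutEstimator:
    """RFC 6298-style estimator, used here only as an assumed stand-in."""

    def __init__(self, min_rto_ms: float):
        self.min_rto_ms = min_rto_ms
        self.smoothed_rtt_ms = None
        self.rtt_variation_ms = None

    def on_rtt_sample(self, rtt_ms: float) -> None:
        if self.smoothed_rtt_ms is None:
            # First sample: initialise the estimate directly.
            self.smoothed_rtt_ms = rtt_ms
            self.rtt_variation_ms = rtt_ms / 2
        else:
            deviation = abs(self.smoothed_rtt_ms - rtt_ms)
            self.rtt_variation_ms = 0.75 * self.rtt_variation_ms + 0.25 * deviation
            self.smoothed_rtt_ms = 0.875 * self.smoothed_rtt_ms + 0.125 * rtt_ms

    def rto_ms(self) -> float:
        if self.smoothed_rtt_ms is None:
            # No samples yet: wait for the configured minimum.
            return self.min_rto_ms
        calculated = self.smoothed_rtt_ms + 4 * self.rtt_variation_ms
        # The calculated value is always bounded below by the minimum.
        return max(self.min_rto_ms, calculated)


estimator = RetransmissionTimeoutEstimator(min_rto_ms=10)
initial_rto = estimator.rto_ms()
estimator.on_rtt_sample(40)
rto_after_sample = estimator.rto_ms()
print(initial_rto, rto_after_sample)  # 10 120.0
```

With a minimum of 100 ms instead, the same 40 ms sample would still produce 120 ms, but a 10 ms sample would be clamped up to 100 ms.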
You can reduce the latency of retransmitted packets on unreliable networks by configuring the minimum retransmission timeout to be similar to the round-trip time between the worker and the Runtime. Round-trip times may be as little as 5-10ms for client-workers on networks which are physically or logically located close to the Runtime. However, if you choose this value for all client-workers then it may result in a (very) temporary increase in bandwidth overhead if packets are falsely detected as lost on connections with longer round-trip times.
Fast and early retransmission
KCP allows you to optionally enable these two boolean parameters to decrease latency. Both enable strategies which try to reduce the amount of time it takes to retransmit packets which are lost on unreliable networks.
- Enabling fast retransmission reduces the additional delay added to the retransmission schedule of packets when they are detected as lost multiple times in a row.
- Enabling early retransmission results in the following behaviour: when acknowledgements are received for packets 2 and 3 but no acknowledgement has been received for packet 1, packet 1 will be retransmitted early with the assumption that it was lost. This works well if there is not much jitter (variance in packet round-trip times) in the network.
Because enabling both of these parameters results in more aggressive retransmission behaviour, they both increase bandwidth overhead and CPU usage.
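The early retransmission rule can be illustrated with a short sketch. It assumes that a packet becomes a candidate once two later packets have been acknowledged, which matches the case of packets 2 and 3 being acknowledged before packet 1; the actual threshold used by the KCP network stack is not stated here, so `later_acks_needed` is a hypothetical parameter.

```python
def packets_to_retransmit_early(in_flight, acked, later_acks_needed=2):
    """Return unacknowledged sequence numbers that look lost.

    A packet is a candidate when at least `later_acks_needed` packets
    sent after it have already been acknowledged.
    """
    candidates = []
    for sequence in sorted(in_flight):
        if sequence in acked:
            continue
        later_acks = sum(1 for other in acked if other > sequence)
        if later_acks >= later_acks_needed:
            candidates.append(sequence)
    return candidates


print(packets_to_retransmit_early({1, 2, 3}, {2, 3}))  # [1]
print(packets_to_retransmit_early({1, 2, 3}, {2}))     # []
```

On a network with high jitter, packet 1 might simply be late rather than lost, in which case the early retransmission is wasted bandwidth.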
KCP non-concessional flow control
KCP allows you to optionally enable a boolean parameter which disables an algorithm which reduces the size of flow control windows when packet loss is detected. The algorithm assumes that the packet loss is due to congestion in the network, which may or may not be the case.
If you enable non-concessional flow control and congestion is detected on the network via packet loss, data will continue to be sent at a high rate, which may result in further packet loss. This may lead to temporarily decreased throughput and increased latency.
However, if the packet loss was caused by other factors, such as interference on wireless networks, continuing to send data at a high rate may lead to temporarily decreased latency and increased throughput.
KCP update interval
KCP allows you to specify the frequency at which KCP performs network I/O (sending and receiving). Each packet experiences an artificial delay, with the average delay being about half the update interval.
Increasing the update interval increases latency but decreases CPU usage as there is a CPU overhead associated with each update.
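The trade-off can be put into numbers with a small sketch. The function name is hypothetical; the average delay of half the interval comes from the description above, and the updates-per-second figure is simply the reciprocal of the interval, used as a rough proxy for how often the per-update CPU cost is paid.

```python
def update_interval_tradeoff(update_interval_ms: float) -> dict:
    """Summarise the latency and update-rate effect of a KCP update interval."""
    return {
        # A packet waits, on average, half an interval before the next update.
        "average_added_delay_ms": update_interval_ms / 2,
        # Each update carries a CPU overhead, so fewer updates means less CPU.
        "updates_per_second": 1000 / update_interval_ms,
    }


print(update_interval_tradeoff(10))  # {'average_added_delay_ms': 5.0, 'updates_per_second': 100.0}
print(update_interval_tradeoff(20))  # {'average_added_delay_ms': 10.0, 'updates_per_second': 50.0}
```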
KCP erasure coding
KCP allows you to enable and configure a technique known as erasure coding.
Enabling erasure coding increases the average bandwidth overhead associated with delivering each packet. This is because enabling erasure coding results in additional, redundant packets being sent along with those packets containing game data. In general, Noriginal consecutive “original” packets and Nrecovery subsequent recovery packets are grouped into a batch. The size of each recovery packet is equal to the size of the largest original packet in that particular batch. Erasure coding is therefore most bandwidth-efficient when “original” packets are similar in size (for example, when most packets are fixed size position updates).
The benefit is that all the original data encoded within that batch can be recovered, provided that Noriginal packets arrive from the batch of Noriginal + Nrecovery packets.
This can help to significantly decrease worst-case latency on unreliable networks which experience packet loss. In fact, varying the ratio of “recovery” packets to “original” packets is one of the most direct ways you can trade off bandwidth overhead vs. latency.
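A back-of-the-envelope model can help when choosing the ratio. The sketch below computes the extra bandwidth added by recovery packets (each as large as the largest original packet in the batch) and the probability that a batch is fully recoverable. It assumes every packet is lost independently with the same probability, which real networks often violate because losses tend to arrive in bursts; both function names are hypothetical.

```python
from math import comb


def erasure_overhead_ratio(original_packet_sizes, n_recovery):
    """Recovery bytes sent per byte of original data in one batch."""
    return n_recovery * max(original_packet_sizes) / sum(original_packet_sizes)


def batch_recovery_probability(n_original, n_recovery, loss_rate):
    """Probability that at least n_original of the batch's packets arrive.

    Assumes independent packet loss with probability loss_rate.
    """
    n_total = n_original + n_recovery
    delivery_rate = 1 - loss_rate
    return sum(
        comb(n_total, delivered)
        * delivery_rate ** delivered
        * loss_rate ** (n_total - delivered)
        for delivered in range(n_original, n_total + 1)
    )


# Four equally sized 30-byte packets with two recovery packets: 50% overhead.
print(erasure_overhead_ratio([30, 30, 30, 30], 2))  # 0.5
# Unevenly sized packets make the same two recovery packets more expensive.
print(erasure_overhead_ratio([10, 30, 30, 50], 2))
# Compare recoverability at 10% loss without and with recovery packets.
print(batch_recovery_probability(4, 0, 0.1))
print(batch_recovery_probability(4, 2, 0.1))
```

Running this for a few candidate ratios shows how quickly recoverability improves relative to the bandwidth spent, under the stated independence assumption.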
You can use metrics to see what benefit erasure coding is providing for your worker and tweak your parameters accordingly. One of the following metrics will be incremented for each batch according to the given conditions:
erasure_coding_completed_batches: Ndelivered == (Noriginal + Nrecovery)
erasure_coding_recovered_batches: Noriginal <= Ndelivered < (Noriginal + Nrecovery)
erasure_coding_unrecoverable_batches: Ndelivered < Noriginal
The receiving end of the erasure codec records these metrics. Therefore, they correspond to the downstream traffic from Runtime to worker. In general, the higher the number of recovered batches compared to complete or unrecoverable batches, the more value erasure coding is providing.
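The three conditions above map onto a simple classification, sketched below together with one possible way of summarising the counters. The function names and the "recovered share" summary are illustrative assumptions rather than anything the SDK reports, and the sketch ignores the timing of when a batch is finally judged.

```python
def classify_batch(n_delivered, n_original, n_recovery):
    """Return the name of the metric incremented for a batch."""
    if n_delivered == n_original + n_recovery:
        return "erasure_coding_completed_batches"
    if n_delivered >= n_original:
        return "erasure_coding_recovered_batches"
    return "erasure_coding_unrecoverable_batches"


def recovered_share(completed, recovered, unrecoverable):
    """Fraction of all batches that needed, and got, erasure-code recovery."""
    total = completed + recovered + unrecoverable
    return recovered / total if total else 0.0


print(classify_batch(6, 4, 2))  # erasure_coding_completed_batches
print(classify_batch(4, 4, 2))  # erasure_coding_recovered_batches
print(classify_batch(3, 4, 2))  # erasure_coding_unrecoverable_batches
print(recovered_share(completed=900, recovered=80, unrecoverable=20))  # 0.08
```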
A batch is only deemed unrecoverable if it is the oldest batch currently being held in memory for which the data has not yet been recovered. You can increase the window size parameter (specified in number of batches) to increase the average length of time an incomplete batch is kept in memory. This will increase memory usage but may improve the chance of a batch being recovered later if one of its packets was delayed.
Looking at raw packet loss statistics may help you to configure the erasure codec. KCP reports a histogram metric called kcp_packet_send_count. If a packet has been sent more than once, it implies that the packet has been detected as lost. You can derive the overall percentage of packets which have been detected as lost from this metric. The metric is currently only reported via metrics ops.
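Deriving the loss percentage could look like the following sketch. It assumes the histogram has already been read out of the metrics op into a plain mapping from "number of times sent" to "number of packets"; that representation and the function name are assumptions, since the exact shape of the reported histogram is not described here.

```python
def detected_loss_percentage(send_count_histogram):
    """Percentage of packets sent more than once, i.e. detected as lost.

    send_count_histogram maps a send count (1, 2, 3, ...) to the number
    of packets that were sent that many times (assumed representation).
    """
    total_packets = sum(send_count_histogram.values())
    if total_packets == 0:
        return 0.0
    resent_packets = sum(
        packets for send_count, packets in send_count_histogram.items() if send_count > 1
    )
    return 100.0 * resent_packets / total_packets


# Hypothetical sample: 950 packets sent once, 40 sent twice, 10 sent three times.
print(detected_loss_percentage({1: 950, 2: 40, 3: 10}))  # 5.0
```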
TCP_NODELAY
TCP allows you to optionally enable TCP_NODELAY, a TCP-specific option to disable Nagle’s algorithm. Nagle’s algorithm artificially delays and merges outgoing small packets to reduce bandwidth overhead. Therefore, enabling TCP_NODELAY decreases latency but increases bandwidth overhead.
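For readers unfamiliar with the option, this is what TCP_NODELAY looks like at the operating system socket level, shown with Python's standard `socket` module. It illustrates the underlying socket option only; it is not how the Worker SDK is configured, and the helper name is hypothetical.

```python
import socket


def make_low_latency_tcp_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value enables TCP_NODELAY, so small writes are sent
    # immediately instead of being held back and merged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_tcp_socket()
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
print(nodelay_enabled)  # True
```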
Disconnection timeouts
If a client-worker loses contact with the Runtime for whatever reason (such as the internet connection being lost), you may want to know so that you can inform the user.
You can configure how quickly a connection is deemed to have lost connectivity by setting HeartbeatParameters on your client-worker for RakNet or KCP respectively. There is currently no equivalent parameter for TCP. These options determine the maximum time after which the network connection should receive an acknowledgement for a heartbeat message it sends to check whether the Runtime is still responding to it. If the timeout expires, the worker’s connection receives a disconnect op.
You should try to pick a timeout which minimizes the number of false positives (detecting a connection is broken when there is temporary congestion, for example) since the connection will be automatically closed when a heartbeat times out. The value of the timeout you choose will probably depend on factors such as how acceptable the user’s experience is when there is a temporary loss of connectivity.
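The effect of the timeout value can be explored with a toy model of heartbeat tracking. This is a simplified sketch with hypothetical names, not the SDK's implementation: it treats any acknowledgement as proof that the Runtime is responding and reports a timeout once the oldest unacknowledged heartbeat is older than the configured limit.

```python
class HeartbeatMonitor:
    """Toy model of heartbeat-based disconnection detection."""

    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.oldest_unacked_sent_ms = None

    def on_heartbeat_sent(self, now_ms: int) -> None:
        # Only the oldest outstanding heartbeat matters for the timeout.
        if self.oldest_unacked_sent_ms is None:
            self.oldest_unacked_sent_ms = now_ms

    def on_acknowledgement(self) -> None:
        # Simplification: any acknowledgement clears all outstanding heartbeats.
        self.oldest_unacked_sent_ms = None

    def timed_out(self, now_ms: int) -> bool:
        if self.oldest_unacked_sent_ms is None:
            return False
        return now_ms - self.oldest_unacked_sent_ms > self.timeout_ms


monitor = HeartbeatMonitor(timeout_ms=5000)
monitor.on_heartbeat_sent(now_ms=0)
print(monitor.timed_out(now_ms=4000))  # False: still within the timeout
print(monitor.timed_out(now_ms=5001))  # True: no acknowledgement in time
```

In this model, a congestion stall of four seconds is tolerated by a five-second timeout but would be a false positive with a three-second one.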