These are the docs for 14.3, an old version of SpatialOS. 14.5 is the newest version.

Network configuration

This page links to the API reference documentation for the Worker SDK in C# in various places, but there are equivalents for each class and parameter in all languages.

Configuring the network parameters for your workers can dramatically improve your users’ experience of your game.

The Worker SDK allows you to configure parameters of the network stack that workers use to communicate with the Runtime. This gives you control over the trade-off between bandwidth overhead and latency, the upper bound on worker memory usage, and disconnection timeouts. You can even choose whether or not your data is encrypted on the wire.

Each network stack comes with a default set of parameters which we believe should work well in the majority of use cases. This page outlines when and why it’s worth explicitly setting and/or experimenting with different values for these parameters.

Choosing a network stack

First, you need to choose a network connection type to use when creating a connection object. Although each network connection type is named after its underlying transport protocol, each corresponds to an entirely different implementation of the network stack. There are five options to choose from, each with its own strengths and weaknesses:

TCP

The TCP network stack uses TCP for reliable transport. It works well on reliable networks like those used by a server-worker to communicate with the Runtime in a cloud deployment.

We have deprecated TCP in favour of the modular version (below). In the next major release, we will remove the TCP network stack and rename modular TCP to TCP.

RakNet

The RakNet network stack uses the RakNet third-party game networking library for reliable transport. The RakNet reliable transport protocol is built on top of UDP and performs better than TCP on unreliable networks like those which client-workers typically connect over.

We have deprecated RakNet. We recommend using modular KCP for client-workers to improve latency on unreliable networks.

KCP

The KCP network stack uses the KCP third-party library for reliable transport. The KCP reliable transport protocol is also built on top of UDP and is designed specifically to reduce latency on particularly unreliable networks like Wi-Fi and 3G/4G. All data sent over a KCP connection is encrypted using DTLS by default, but you can disable encryption by setting SecurityType to Insecure.

The KCP network stack is much more configurable and flexible than TCP and RakNet. You can see this in our guide to configuring the network stack.

We have deprecated KCP in favour of the modular version (below). In the next major release, we will remove the KCP network stack and rename modular KCP to KCP.

Modular KCP

Modular KCP is built on a new and improved network stack. The new stack means you can configure your worker’s network stack using multiple network stack modules, such as erasure coding or compression. Additionally, all connections made from workers on the new stack connect to the same port on a cloud deployment. This differs from the old network stack which required one port for each active connection. Data sent on this stack is encrypted by default with DTLS, but you can disable encryption by setting SecurityType to Insecure.

Similar to the KCP network stack above, the modular KCP network stack uses the KCP third-party library for reliable transport. The KcpTransportParameters correspond to the parameters for the KCP library. The main differences in our use of the KCP library between the modular and regular KCP network stacks are that in the modular KCP network stack:

  • multiple worker messages sent on different multiplexed streams can be packed into each UDP packet. This should noticeably reduce bandwidth on connections with a high multiplex level.
  • there is a flush interval as opposed to an update interval. The flush interval only acts in the outbound direction, while the update interval acts in both inbound and outbound directions. This means there should be less artificial latency when receiving packets from the network. You can also flush the connection manually.
  • the flow control window, in bytes, is applied across multiplexed streams. This allows you to specify a smaller window size (and therefore lower bound on memory usage) for the same maximum throughput (data transfer rate).

We recommend that you use modular KCP for client-workers to improve latency on unreliable networks. It can be particularly effective for client-workers that connect over wireless networks.

Modular TCP

Modular TCP is the new and improved modular stack for TCP. It supports many of the same modular configuration options and improvements as modular KCP.

Similar to the TCP network stack above, the modular TCP network stack uses TCP for reliable transport. It works well on reliable networks like those used by a server-worker to communicate with the Runtime in a cloud deployment.

Improvements of the modular TCP stack compared to the non-modular variant include:

  • all connections made from workers on the new stack connect to the same port on a cloud deployment.
  • it is encrypted by default with TLS, but you can disable encryption by setting SecurityType to Insecure.
  • instead of the ability to switch TCP_NODELAY on or off, there is a configurable flush delay, along with manual flush support. Data sent on a stream within the flush delay can be packed, and you can also enable compression of the packed data.
  • the flow control window, in bytes, is applied across multiplexed streams. This allows you to specify a smaller window size (and therefore lower bound on memory usage) for the same maximum throughput (data transfer rate).

You should use modular TCP for server-workers by default.

You should not use TCP for client-workers unless efficient use of bandwidth is much more important than latency for your use case.

Configuring the network stack

All content within this section should be viewed as guidelines to help you optimize your game. We encourage you to perform your own experiments and come to your own conclusions.

Performance outputs

In order to inform your choice of values for various network parameters, you should focus on five key outputs:

  • Bandwidth overhead: the amount of data delivered between the worker and the Runtime in addition to your game data. Lower is better.
  • Throughput: the amount of data delivered between the worker and the Runtime per unit time. Higher is better.
  • Latency: the time it takes for the data to be delivered between the worker and the Runtime. Lower is better.
  • Memory usage: the amount of RAM consumed by your worker’s connection for network activity. Lower is better.
  • CPU usage: the amount of CPU time consumed by your worker’s connection for network activity. Lower is better.

In order to make your networked game interactions feel responsive, you should aim to minimize latency whilst staying within the available bandwidth for your client-worker (which may vary significantly depending on the user’s network setup and ISP). The downstream bandwidth available (from Runtime to worker) to a client-worker is typically much greater than the upstream bandwidth available (from worker to Runtime).

The total amount of data throughput you can achieve with a given amount of available bandwidth is greater if your bandwidth overhead is lower. If the total bandwidth required by your game starts to exceed the client’s available bandwidth, your game data (such as commands and component updates) will be delayed, resulting in a poor user experience.

The effect of different network parameters on performance outputs

This table paints an over-simplified picture of how each network parameter affects each of the performance outputs. See below for a more in-depth explanation of the effect of each configurable parameter.

Bandwidth overhead Latency Throughput Memory usage CPU usage
Increasing KCP, modular KCP, TCP or modular TCP flow control window sizes might decrease might increase increases upper bound
Increasing KCP or TCP multiplex level might increase might decrease might increase increases upper bound might increase
Increasing modular KCP multiplex level might decrease might increase might increase
Increasing modular TCP multiplex level might increase might decrease might increase
Increasing KCP or modular KCP minimum retransmission timeout might decrease might increase
Enabling KCP or modular KCP fast retransmission might increase might decrease might decrease might increase
Enabling KCP or modular KCP early retransmission might increase might decrease might decrease might increase
Enabling KCP non-concessional flow control or disabling modular KCP congestion control might decrease or increase might decrease or increase
Increasing KCP update interval, modular KCP flush interval, or modular TCP flush delay might decrease increases might increase decreases
Enabling erasure coding increases might decrease increases increases
Increasing ratio of erasure codec recovery packets to original packets increases might decrease increases increases
Enabling compression decreases increases
Enabling TCP TCP_NODELAY might increase decreases

Flow control window sizes

KCP and TCP (both modular and regular variants) allow you to specify flow control parameters. Flow control is a technique employed by transport protocols to control the rate of data transfer in each direction. Flow control windows keep track of how much data has been sent but not yet processed by the receiver.

KCP lets you specify send and receive window sizes in units of KCP packets. TCP allows you to specify send and receive buffer sizes in bytes; these buffers also act as flow control windows. Modular KCP and modular TCP let you specify cross-stream window sizes in bytes.

Generally, you should aim to specify window sizes large enough to avoid the possibility of flow control interrupting and delaying data transfer, since this could translate into noticeable delays in real-time gameplay. To elaborate, if your worker’s receive window is too small, the Runtime may fill up the entire window. The Runtime will then have to wait until the worker sends it a packet notifying it that the worker has freed up some space in its receive window. The same is true of the worker’s send window size, but in the other direction. When flow control kicks in like this, it can decrease throughput and increase latency.

However, specifying larger flow control windows increases the upper bound on memory usage. Also, in the case of KCP, specifying too large a receive window may result in the underlying UDP socket buffer running out of space when large amounts of data are sent at once, leading to additional packets being dropped.

To estimate a lower bound on the window sizes you need, consider how much data your workers send and receive. A concept known as the bandwidth-delay product can help inform your decision. The bandwidth-delay product is the result of multiplying the following two quantities:

  • bandwidth or data-link capacity of the route between the worker and the Runtime.
  • round-trip time of a packet between the worker and the Runtime.

A moving average of the round-trip time is sampled and reported as an internal metric called kcp_smoothed_round_trip_time_seconds by the Worker SDK.

The bandwidth-delay product represents the maximum amount of data which can be in transit between the worker and the Runtime at a time. Window sizes smaller than the bandwidth-delay product will not be able to take advantage of all the available bandwidth.
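As a rough illustration of the arithmetic, the sketch below computes the bandwidth-delay product and the throughput ceiling that a given window imposes (at most one full window per round trip). The function names and the example link figures are hypothetical; in practice the round-trip time could come from the kcp_smoothed_round_trip_time_seconds metric.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Maximum number of bytes that can be in transit between worker and Runtime."""
    return bandwidth_bits_per_s / 8 * rtt_s


def max_throughput_bytes_per_s(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling imposed by a flow control window: at most one full
    window can be sent per round trip before waiting for acknowledgements."""
    return window_bytes / rtt_s


# Hypothetical link: 5 Mbit/s with a 100 ms smoothed round-trip time.
bdp = bandwidth_delay_product_bytes(5_000_000, 0.1)
print(f"bandwidth-delay product: {bdp:.0f} bytes")  # 62500 bytes

# A 32 KiB window is smaller than the bandwidth-delay product, so it cannot
# use all of the available bandwidth.
ceiling = max_throughput_bytes_per_s(32 * 1024, 0.1)
print(f"32 KiB window caps throughput at about {ceiling:.0f} bytes/s")
```

Comparing the ceiling with the link capacity in bytes per second (625000 here) shows how far below the available bandwidth an undersized window leaves you.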

Since KCP window sizes are specified as a number of KCP packets, the bandwidth-delay product (in bytes) cannot directly inform the choice of KCP window sizes. Luckily, we can calculate a similar quantity in units of KCP packets. The amount of data each packet can hold depends on the maximum transmission unit (MTU) of the underlying network. However, if most of the component updates your game sends and receives are relatively small, like position updates tend to be, there is likely to be a roughly 1-to-1 correspondence between the number of component updates and the number of KCP packets.
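The packet-count equivalent can be sketched in two ways, under stated assumptions: from an update rate (assuming roughly one small component update per KCP packet), or from a byte-level bandwidth-delay product (assuming an MTU-dependent payload size per packet, here a hypothetical 1400 bytes). Both helpers are illustrative, not Worker SDK functions.

```python
def kcp_window_packets_from_update_rate(updates_per_second: int, rtt_ms: int) -> int:
    """Packets in flight per round trip, assuming about one small component
    update (for example a position update) per KCP packet."""
    # Integer ceiling division avoids floating point rounding surprises.
    return -(-(updates_per_second * rtt_ms) // 1000)


def kcp_window_packets_from_bdp(bdp_bytes: int, payload_bytes_per_packet: int) -> int:
    """Packets needed to cover a bandwidth-delay product when packets are full.
    payload_bytes_per_packet is an assumption that depends on the path MTU."""
    return -(-bdp_bytes // payload_bytes_per_packet)


# 50 entities each sending 60 position updates per second over a 100 ms round trip.
print(kcp_window_packets_from_update_rate(50 * 60, 100))  # 300
# A 62500-byte bandwidth-delay product with an assumed 1400 bytes of payload per packet.
print(kcp_window_packets_from_bdp(62_500, 1_400))  # 45
```

Either figure is a lower bound; leaving headroom above it reduces the chance of flow control delaying bursts.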

Multiplex level

KCP and TCP (both modular and regular variants) allow you to specify a multiplex level. The multiplex level specifies the number of independent, reliable, ordered streams that the transport layer uses to send data relating to different entities. Where possible, it sends updates corresponding to different entities on different streams to avoid delayed updates for one entity affecting other entities. Therefore, increasing the multiplex level might decrease latency on connections with packet loss.
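To see why more streams can reduce latency under packet loss, the sketch below counts how many later updates are held back by one lost update when each stream delivers in order. The entity-to-stream mapping (entity_id % multiplex_level) is a hypothetical choice for illustration; the SDK's actual assignment is not described here.

```python
def updates_held_back(entity_ids: list[int], lost_index: int, multiplex_level: int) -> int:
    """Count later updates delayed by a single lost update.

    Each stream is reliable and ordered, so every later update on the same
    stream as the lost one must wait for its retransmission. Streams are
    assigned by entity_id % multiplex_level (a hypothetical mapping).
    """
    lost_stream = entity_ids[lost_index] % multiplex_level
    return sum(
        1
        for entity_id in entity_ids[lost_index + 1:]
        if entity_id % multiplex_level == lost_stream
    )


# Three rounds of updates for entities 0..3; the very first update (entity 0) is lost.
updates = [0, 1, 2, 3] * 3
for level in (1, 2, 4):
    print(level, updates_held_back(updates, lost_index=0, multiplex_level=level))
# 1 11
# 2 5
# 4 2
```

With a single stream, every subsequent update waits; with one stream per entity, only that entity's own later updates do.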

On KCP and TCP, but not the modular variants, increasing the multiplex level increases the upper bound on memory used by the worker’s network connection because each stream has its own send and receive window. Therefore, the upper bound on memory usage is proportional to the multiplex level multiplied by the sum of the send and receive window sizes.
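The memory bound described above can be written down directly. This is a simplified model that only counts flow control windows (not other per-connection buffers); for regular KCP you would first convert packet-count windows to bytes using an assumed bytes-per-packet figure. The function name and window sizes are hypothetical.

```python
def window_memory_upper_bound(
    send_window_bytes: int, receive_window_bytes: int, multiplex_level: int, modular: bool
) -> int:
    """Simplified upper bound on memory held by flow control windows.

    Regular KCP/TCP: each multiplexed stream has its own send and receive window.
    Modular KCP/TCP: the byte windows are shared across all streams.
    """
    per_connection = send_window_bytes + receive_window_bytes
    return per_connection if modular else multiplex_level * per_connection


window = 64 * 1024  # hypothetical 64 KiB send and receive windows
print(window_memory_upper_bound(window, window, 32, modular=False))  # 4194304
print(window_memory_upper_bound(window, window, 32, modular=True))   # 131072
```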

For TCP, increasing the multiplex level might increase bandwidth usage, since TCP multiplexing uses separate physical TCP connections, and data sent on different streams cannot be packed or compressed together.

Having more multiplexed streams also increases the total amount of work that needs to be done, so it may increase CPU usage.

Minimum retransmission timeout

KCP and modular KCP let you specify a minimum retransmission timeout. When the network connection is first established, it doesn’t have a value for a typical round-trip time for a packet. It must guess how long to wait before detecting a packet is lost and attempting to retransmit it. The time it waits initially is the minimum retransmission timeout. You can configure this value for KCP by explicitly setting MinRtoMillis.

When the connection has calculated a smoothed round-trip time from some round-trip time samples, it calculates the retransmission timeout based on how long it takes for most packets to be acknowledged. However, the calculated retransmission timeout is still bounded by the configurable minimum retransmission timeout.
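The following sketch shows how a smoothed round-trip time and a retransmission timeout can be derived from samples while still respecting a configured minimum. It is modelled on the estimator in the open-source KCP library (which is close to the RFC 6298 approach); the class, its parameter names, and the 10 ms interval default are assumptions, and the Worker SDK's internals may differ.

```python
class RtoEstimator:
    """Smoothed RTT and retransmission timeout (RTO) estimator (illustrative)."""

    def __init__(self, min_rto_ms: float, interval_ms: float = 10, max_rto_ms: float = 60_000):
        self.min_rto_ms = min_rto_ms
        self.interval_ms = interval_ms  # hypothetical update interval
        self.max_rto_ms = max_rto_ms
        self.srtt_ms = None    # smoothed round-trip time
        self.rttvar_ms = None  # smoothed round-trip time variation
        # With no samples yet, the connection waits for the minimum timeout.
        self.rto_ms = min_rto_ms

    def on_rtt_sample(self, rtt_ms: float) -> float:
        if self.srtt_ms is None:
            self.srtt_ms = rtt_ms
            self.rttvar_ms = rtt_ms / 2
        else:
            delta = abs(rtt_ms - self.srtt_ms)
            self.rttvar_ms = (3 * self.rttvar_ms + delta) / 4
            self.srtt_ms = (7 * self.srtt_ms + rtt_ms) / 8
        rto = self.srtt_ms + max(self.interval_ms, 4 * self.rttvar_ms)
        # The calculated timeout is still bounded below by the configured minimum.
        self.rto_ms = min(max(rto, self.min_rto_ms), self.max_rto_ms)
        return self.rto_ms


est = RtoEstimator(min_rto_ms=10)
print(est.rto_ms)             # 10
print(est.on_rtt_sample(20))  # 60.0
print(est.on_rtt_sample(20))  # 50.0
```

With a larger minimum (for example 100 ms), the same 20 ms samples would leave the timeout pinned at 100 ms, which is why a minimum close to the real round-trip time helps on low-latency connections.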

You can reduce the latency of retransmitted packets on unreliable networks by configuring the minimum retransmission timeout to be similar to the round-trip time between the worker and the Runtime. Round-trip times may be as little as 5-10ms for client-workers on networks which are physically or logically located close to the Runtime. However, if you choose this value for all client-workers then it may result in a (very) temporary increase in bandwidth overhead if packets are falsely detected as lost on connections with longer round-trip times.

Fast and early retransmission

KCP and modular KCP allow you to optionally enable these two boolean parameters to decrease latency. Both enable strategies that try to reduce the amount of time it takes to retransmit packets that are lost on unreliable networks.

  • Enabling fast retransmission reduces the additional delay added to the retransmission schedule of packets when they are detected as lost multiple times in a row.
  • Enabling early retransmission results in the following behaviour: when acknowledgements are received for packets 2 and 3 but no acknowledgement has been received for packet 1, packet 1 will be retransmitted early with the assumption that it was lost. This works well if there is not much jitter (variance in packet round-trip times) in the network.

Because enabling both of these parameters results in more aggressive retransmission behaviour, they both increase bandwidth overhead and CPU usage.
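Both behaviours can be sketched in a few lines. The early retransmission check follows the example above (acknowledgements for packets 2 and 3 trigger a resend of packet 1), with the threshold of two later acknowledgements as an assumption. The timeout schedule is an assumption modelled on the open-source KCP library, where the per-packet timeout roughly doubles after each loss by default but grows by 1.5x in its low-delay mode; the exact growth used by the Worker SDK may differ.

```python
def early_retransmit_candidates(unacked: set[int], acked: set[int], threshold: int = 2) -> list[int]:
    """Unacknowledged packets with at least `threshold` later packets acknowledged."""
    return sorted(
        seq for seq in unacked
        if sum(1 for other in acked if other > seq) >= threshold
    )


def timeout_schedule(initial_rto_ms: float, consecutive_losses: int, fast: bool) -> list[float]:
    """Retransmission timeouts after each consecutive loss of the same packet.

    Assumed growth: about 2x per loss normally, 1.5x with fast retransmission.
    """
    schedule = [initial_rto_ms]
    for _ in range(consecutive_losses):
        growth = 0.5 if fast else 1.0
        schedule.append(schedule[-1] * (1 + growth))
    return schedule


print(early_retransmit_candidates({1}, {2, 3}))  # [1]
print(early_retransmit_candidates({1}, {2}))     # []
print(timeout_schedule(100, 3, fast=False))      # [100, 200.0, 400.0, 800.0]
print(timeout_schedule(100, 3, fast=True))       # [100, 150.0, 225.0, 337.5]
```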

KCP non-concessional flow control / Modular KCP disable congestion control

KCP and modular KCP let you optionally enable a boolean parameter to disable an algorithm that reduces the size of flow control windows when packet loss is detected. The algorithm assumes that the packet loss is due to congestion in the network, which may or may not be the case.

If you enable non-concessional flow control and congestion is detected on the network via packet loss, data will continue to be sent at a high rate, which may result in further packet loss. This may lead to temporarily decreased throughput and increased latency.

However, if the packet loss was caused by other factors, such as interference on wireless networks, continuing to send data at a high rate may lead to temporarily decreased latency and increased throughput.
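To make the trade-off concrete, here is a generic additive-increase/multiplicative-decrease (AIMD) sketch of loss-based congestion control layered on a flow control window. It illustrates the idea of shrinking the window on loss, not the exact algorithm KCP uses; the class and its fields are hypothetical.

```python
class SendWindow:
    """Effective send window with optional loss-based congestion control."""

    def __init__(self, flow_control_window: int, congestion_control: bool = True):
        self.flow_control_window = flow_control_window
        self.congestion_control = congestion_control
        self.congestion_window = flow_control_window

    def on_packet_loss(self) -> None:
        # Assume the loss was caused by congestion and back off.
        self.congestion_window = max(1, self.congestion_window // 2)

    def on_ack(self) -> None:
        # Probe for more capacity, one packet at a time.
        self.congestion_window = min(self.flow_control_window, self.congestion_window + 1)

    @property
    def effective(self) -> int:
        if not self.congestion_control:
            # Non-concessional: keep sending at the full flow control window.
            return self.flow_control_window
        return min(self.flow_control_window, self.congestion_window)


for enabled in (True, False):
    window = SendWindow(128, congestion_control=enabled)
    window.on_packet_loss()
    window.on_packet_loss()
    print(enabled, window.effective)
# True 32
# False 128
```

If the losses came from wireless interference rather than congestion, the quartered window in the first case only slows delivery down, which is the situation where disabling congestion control can help.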

KCP update interval / Modular KCP flush interval

KCP lets you specify the interval at which KCP performs network I/O (sending and receiving). For modular KCP, the equivalent flush interval only applies to sending on the network, not receiving. Each packet experiences an artificial delay, with the average delay being about half the interval.

Increasing the update interval increases latency but decreases CPU usage as there is a CPU overhead associated with each update. It also may decrease bandwidth overhead by allowing time for more messages to be packed into a single packet.
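The "about half the interval" figure can be checked with a small simulation: messages arrive at evenly spread times and each waits until the next tick. The assumption here is that ticks happen at exact multiples of the interval and that a message arriving exactly on a tick is sent immediately; the function is hypothetical.

```python
def mean_flush_delay_us(interval_us: int, arrival_times_us: range) -> float:
    """Average time a message waits for the next update or flush tick."""
    # (-t) % interval is the time from t until the next multiple of interval.
    delays = [(-t) % interval_us for t in arrival_times_us]
    return sum(delays) / len(delays)


# Messages arriving every microsecond for 100 ms, with a 1 ms (1000 us) interval.
print(mean_flush_delay_us(1_000, range(100_000)))  # 499.5, about half the interval
```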

Erasure coding

KCP and modular KCP let you enable and configure a technique known as erasure coding.

Enabling erasure coding increases the average bandwidth overhead associated with delivering each packet. This is because enabling erasure coding results in additional, redundant packets being sent along with those packets containing game data. In general, $N_{\text{original}}$ consecutive “original” packets and $N_{\text{recovery}}$ subsequent recovery packets are grouped into a batch. The size of each recovery packet is equal to the size of the largest original packet in that particular batch. Erasure coding is therefore most bandwidth-efficient when “original” packets are similar in size (for example, when most packets are fixed-size position updates).

The benefit is that all the original data encoded within that batch can be recovered, provided that at least $N_{\text{original}}$ of the $N_{\text{original}} + N_{\text{recovery}}$ packets in the batch arrive.
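The bandwidth cost and recovery condition for a single batch can be sketched as follows. The helper names and packet sizes are hypothetical; the rules (recovery packets sized like the largest original, recovery once enough packets arrive) are the ones described above.

```python
def recovery_overhead(original_sizes: list[int], n_recovery: int) -> float:
    """Extra bytes sent per byte of original data for one erasure coding batch.

    Each recovery packet is as large as the largest original packet in the batch.
    """
    return n_recovery * max(original_sizes) / sum(original_sizes)


def is_recoverable(n_original: int, n_delivered: int) -> bool:
    """All original data is recoverable once any n_original packets of the batch arrive."""
    return n_delivered >= n_original


print(recovery_overhead([40, 40, 40, 40], n_recovery=2))   # 0.5
print(recovery_overhead([40, 40, 40, 400], n_recovery=2))  # about 1.54
print(is_recoverable(4, 4), is_recoverable(4, 3))          # True False
```

The second batch shows why similar-sized packets matter: one large packet makes every recovery packet large, roughly tripling the relative overhead.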

This can help to significantly decrease worst-case latency on unreliable networks which experience packet loss. In fact, varying the ratio of “recovery” packets to “original” packets is one of the most direct ways you can trade off bandwidth overhead vs. latency.

You can use metrics to see what benefit erasure coding is providing for your worker and tweak your parameters accordingly. One of the following metrics will be incremented for each batch according to the given conditions:

  • erasure_coding_completed_batches: $N_{\text{delivered}} = N_{\text{original}} + N_{\text{recovery}}$
  • erasure_coding_recovered_batches: $N_{\text{original}} \le N_{\text{delivered}} < N_{\text{original}} + N_{\text{recovery}}$
  • erasure_coding_unrecoverable_batches: $N_{\text{delivered}} < N_{\text{original}}$

The receiving end of the erasure codec records these metrics. Therefore, they correspond to the downstream traffic from Runtime to worker. In general, the higher the number of recovered batches compared to complete or unrecoverable batches, the more value erasure coding is providing.
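A small sketch of how finished batches map onto these metrics, and how you might summarise them, is shown below. It simplifies by classifying each batch once all its packets are accounted for, ignoring the rule that only the oldest batch held in memory is declared unrecoverable; the delivery counts are made up.

```python
from collections import Counter


def classify_batch(n_original: int, n_recovery: int, n_delivered: int) -> str:
    """Return which erasure coding metric a finished batch would increment."""
    if n_delivered == n_original + n_recovery:
        return "erasure_coding_completed_batches"
    if n_delivered >= n_original:
        return "erasure_coding_recovered_batches"
    return "erasure_coding_unrecoverable_batches"


# Hypothetical deliveries for batches of 4 original + 2 recovery packets.
counts = Counter(classify_batch(4, 2, d) for d in [6, 6, 5, 4, 6, 3])
print(counts)
recovered = counts["erasure_coding_recovered_batches"]
print(f"recovered share: {recovered / sum(counts.values()):.2f}")
```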

A batch is only deemed unrecoverable if it is the oldest batch currently being held in memory for which the data has not yet been recovered. You can increase the window size parameter (specified in number of batches) to increase the average length of time an incomplete batch is kept in memory. This will increase memory usage but may improve the chance of a batch being recovered later if one of its packets was delayed.

Looking at raw packet loss statistics might help you to configure the erasure codec. KCP reports a histogram metric called kcp_packet_send_count. If a packet has been sent more than once, this implies that the packet has been detected as lost. You can derive the overall percentage of packets that have been detected as lost from this metric. The metric is currently only reported via metrics ops.
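As a sketch of that derivation, the function below takes a send-count histogram as a plain mapping from "times a packet was sent" to "number of packets". This flattened form of kcp_packet_send_count is an assumption for illustration; you would need to convert the metric's actual buckets into it.

```python
def loss_rates_from_send_counts(histogram: dict[int, int]) -> tuple[float, float]:
    """Estimate packet loss from a hypothetical send-count histogram.

    Returns (fraction of packets detected as lost at least once,
             fraction of all transmissions detected as lost).
    """
    packets = sum(histogram.values())
    packets_lost_at_least_once = sum(n for sends, n in histogram.items() if sends > 1)
    transmissions = sum(sends * n for sends, n in histogram.items())
    # Every send after the first implies an earlier transmission was detected as lost.
    lost_transmissions = sum((sends - 1) * n for sends, n in histogram.items())
    return packets_lost_at_least_once / packets, lost_transmissions / transmissions


per_packet, per_send = loss_rates_from_send_counts({1: 950, 2: 40, 3: 10})
print(f"{per_packet:.1%} of packets, {per_send:.1%} of transmissions")
```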

Compression

Modular KCP and TCP let you enable compression of data sent on the network, using the zstd library. Compressed data generally requires fewer bytes to represent it than the equivalent uncompressed data, but this comes at the cost of CPU time encoding and decoding the data. Therefore, compression decreases bandwidth but increases CPU usage.

Compression is most effective when there is a sufficiently large amount of data to compress. If the modular KCP flush interval or modular TCP flush delay is too short, or you frequently flush manually, enabling compression may consume CPU for little benefit.

We suggest disabling compression for server-workers, since they typically care more about CPU efficiency than about bandwidth constraints.
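The effect of batch size on compression is easy to demonstrate. The sketch below uses zlib from the Python standard library as a stand-in for zstd (which the network stack actually uses), and a made-up JSON-like update payload; the general point, that tiny payloads barely compress while packed batches compress well, applies to both.

```python
import zlib

# Hypothetical, JSON-like position update; zlib stands in for zstd.
message = b'{"entity_id": %d, "x": 10.5, "y": 3.25, "z": -7.0}'


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(payload)) / len(payload)


single = message % 1
batch = b"".join(message % entity_id for entity_id in range(50))

print(f"single update: {compression_ratio(single):.2f}")      # close to 1: little saving
print(f"50 packed updates: {compression_ratio(batch):.2f}")  # much smaller
```

A longer flush interval or flush delay lets more messages be packed before compression, which is where the CPU cost pays off.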

TCP_NODELAY

TCP allows you to optionally enable TCP_NODELAY, a TCP-specific option to disable Nagle’s algorithm. Nagle’s algorithm artificially delays and merges outgoing small packets to reduce bandwidth overhead. Therefore, enabling TCP_NODELAY decreases latency but increases bandwidth overhead.
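For reference, this is what the option means at the operating-system level, shown with Python's standard socket module. It is only an illustration of the socket option itself; with the Worker SDK you configure the TCP network parameters rather than touching sockets directly.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small writes are sent immediately instead of
# being delayed and merged.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)  # True
sock.close()
```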

Heartbeating

If a client-worker loses contact with the Runtime for whatever reason (such as the internet connection being lost), you may want to know so that you can inform the user.

You can configure how quickly a connection is deemed to have lost connectivity by setting HeartbeatTimeoutMillis on RakNetNetworkParameters (for RakNet) or TimeoutMillis on HeartbeatParameters (for KCP or modular KCP) on your client-worker. There is currently no equivalent parameter for TCP. These options set the maximum time within which the network connection must receive an acknowledgement for a heartbeat message it sends to check whether the Runtime is still responding. If the timeout expires, the worker’s connection receives a disconnect op.
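The timeout logic can be pictured with the hypothetical monitor below, which uses an injected millisecond clock so the behaviour is easy to follow. It is a sketch of the concept only; the SDK performs heartbeating internally and reports the result as a disconnect op.

```python
class HeartbeatMonitor:
    """Hypothetical heartbeat timeout check with an injectable clock (ms)."""

    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.pending_since_ms = None  # send time of the oldest unacknowledged heartbeat

    def on_heartbeat_sent(self, now_ms: int) -> None:
        if self.pending_since_ms is None:
            self.pending_since_ms = now_ms

    def on_heartbeat_acked(self) -> None:
        self.pending_since_ms = None

    def timed_out(self, now_ms: int) -> bool:
        return (
            self.pending_since_ms is not None
            and now_ms - self.pending_since_ms > self.timeout_ms
        )


monitor = HeartbeatMonitor(timeout_ms=10_000)
monitor.on_heartbeat_sent(now_ms=0)
print(monitor.timed_out(now_ms=5_000))   # False: still within the timeout
monitor.on_heartbeat_acked()
monitor.on_heartbeat_sent(now_ms=20_000)
print(monitor.timed_out(now_ms=31_000))  # True: the connection would be closed
```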

You should try to pick a timeout which minimizes the number of false positives (detecting a connection is broken when there is temporary congestion, for example) since the connection will be automatically closed when a heartbeat times out. The value of the timeout you choose will probably depend on factors such as how acceptable the user’s experience is when there is a temporary loss of connectivity.
