Each deployment is associated with a collection of metrics which are graphed using Grafana.
These metrics are split into categories across a number of dashboards, which can help you profile and understand the health of your deployment. Most of the metrics clearly describe what they measure, but it is harder to know how to use these measurements to diagnose your deployment and identify what is causing performance issues. The aim of this document is to highlight the most important metrics for profiling your deployment, and to act as a starting point for deeper investigation.
Metrics can be accessed from the Metrics tab of the Inspector, or from the Deployment Overview page:
This dashboard gives high level information about the resource use of the deployment.
Note that the CPU measurements are given as the number of cores utilized per second.
Any engine load greater than 1 indicates that the worker is getting more work than it can deal with. This means that the message queues from SpatialOS to the worker can fill up rapidly, and performance can degrade over time.
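As a rough illustration of the engine load rule above, the sketch below flags workers whose sampled load is above 1 for a sustained share of a time window. The worker names, the sample data and the `min_fraction` cut-off are assumptions made for this example; they are not part of SpatialOS or Grafana, and the samples would have to be exported from the dashboard by some other means.

```python
from typing import Dict, List


def overloaded_workers(
    load_samples: Dict[str, List[float]],
    threshold: float = 1.0,      # engine load above 1 means more work than the worker can handle
    min_fraction: float = 0.5,   # assumed cut-off: flag if at least half the samples exceed it
) -> List[str]:
    """Return the names of workers whose load exceeds `threshold`
    in at least `min_fraction` of their samples."""
    flagged = []
    for worker, samples in load_samples.items():
        if not samples:
            continue  # no data for this worker, nothing to judge
        over = sum(1 for s in samples if s > threshold)
        if over / len(samples) >= min_fraction:
            flagged.append(worker)
    return sorted(flagged)


# Hypothetical engine load samples, one list per worker.
samples = {
    "physics-0": [0.6, 0.8, 0.7, 0.9],
    "physics-1": [1.2, 1.4, 0.9, 1.6],
}
print(overloaded_workers(samples))
```

Looking at a fraction of the window rather than a single sample avoids flagging a worker for one brief spike, which is less of a concern than a load that stays above 1.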
This dashboard gives a more detailed breakdown of how SpatialOS features, such as messaging and entity finds, are used on entities.
Use the flag record_message_type_for_entity_messaging to get more information on the types of messages sent to and from entities. Messaging is generally quite cheap, and SpatialOS is able to handle many messages.
Of particular use are the Spatial Find Latency and Entity Find Latency metrics, which indicate how long these blocking operations take to return (see Client Debugging for why this is important). Latencies greater than several tenths of a second might be problematic. Entity finds are generally much quicker than spatial finds and therefore block for less time, so prefer them where possible.
If spatial and entity finds are failing, this may indicate that SpatialOS is under too much load to reliably perform these operations.
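To make the latency guidance concrete, here is a minimal sketch that reports what share of find operations took longer than a chosen threshold. The 0.3 second default is an assumed reading of "several tenths of a second", and the latency lists are invented sample data, not values taken from a real deployment.

```python
from typing import List


def slow_find_fraction(latencies_s: List[float], threshold_s: float = 0.3) -> float:
    """Return the fraction of find latencies (in seconds) above `threshold_s`.

    Returns 0.0 when there are no samples.
    """
    if not latencies_s:
        return 0.0
    slow = sum(1 for latency in latencies_s if latency > threshold_s)
    return slow / len(latencies_s)


# Hypothetical latency samples, in seconds.
spatial_find_latencies = [0.05, 0.12, 0.45, 0.60]
entity_find_latencies = [0.01, 0.02, 0.02, 0.03]

print("spatial finds over threshold:", slow_find_fraction(spatial_find_latencies))
print("entity finds over threshold:", slow_find_fraction(entity_find_latencies))
```

Comparing the two fractions side by side gives a quick indication of whether replacing some spatial finds with entity finds is likely to reduce blocking time.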
Spatial Find Fanout measures the number of nodes queried by SpatialOS when performing an entity query.
If many nodes are queried whilst performing an entity query then this may result in performance issues. Reducing the radius of entity queries might reduce this number.
Client debugging dashboard
This dashboard gives the most useful information about your deployment.
The Messaging section details the identity and quantity of messages being sent down to and up from workers. For example, the Upstream Engine Message Size metric:
Use the flag record_update_per_entity_prefab in your deployment configuration to break down these graphs by prefab.
The Entity Queries section is usually the most informative in identifying performance issues. You should generally aim to reduce the number of spatial finds and entity finds to reduce the load on SpatialOS.
Both of these types of find block the entity: when a find is performed on an entity, all of its behaviours are blocked until that find returns. Global finds are performed across workers on different machines and are expensive because they take much longer to return a result. Local finds return more quickly, but may still be expensive because all of the components on the found entities are still serialized and sent.
The Max Mailbox Size by Actor dashboard shows the number of Akka messages currently being processed by different actors in the deployment. You will be able to see World Apps as actors in this list. If the number of messages received by a World App is increasing with time, then it is unable to process those messages as fast as it is receiving them, potentially indicating a bug in your game code.
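One way to check whether a mailbox is "increasing with time" is to fit a straight line through its sampled sizes and look at the slope. The sketch below does this with an ordinary least-squares fit; it assumes the samples are evenly spaced in time, and the sample values are invented for illustration.

```python
from typing import Sequence


def mailbox_growth_rate(sizes: Sequence[float]) -> float:
    """Least-squares slope of mailbox size, in messages per sample interval.

    A clearly positive value suggests the actor is receiving messages
    faster than it can process them. Assumes evenly spaced samples.
    """
    n = len(sizes)
    if n < 2:
        return 0.0  # a trend needs at least two points
    mean_x = (n - 1) / 2
    mean_y = sum(sizes) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(sizes))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical mailbox sizes for two actors, sampled at a fixed interval.
growing = [10, 20, 30, 40]
steady = [12, 11, 12, 11]

print("growing actor slope:", mailbox_growth_rate(growing))
print("steady actor slope:", mailbox_growth_rate(steady))
```

A slope close to zero with some noise is normal; a slope that stays positive over a long window is the pattern worth investigating.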
If QoS is being used, the Dropped State Updates due to QoS metric can help identify whether too many messages are being sent down to the client to be processed.