Base usage

There are two main reports:

  • Trends - the trends report shows how key metrics, aggregated over all apps, change over time.
  • Apps - the apps report shows metrics per application, over a fixed time frame.

The base metrics shown on the charts are described below. The potential problems they can indicate are described on a separate page.

Base Metrics

  • Apps - the count of started Spark applications
  • Failures
    • Apps - how many applications have failed
    • Stages - how many stages have failed
    • Tasks - how many tasks have failed
  • Times
    • App uptime - the time elapsed from application start to end. This number can fluctuate even when code and data remain the same. For example, with dynamic allocation, a job that receives fewer executors takes more wall time to complete. This metric is not a good indicator of efficiency, but it is useful to review if there are external SLOs on the application.
    • Task time - the time actively spent executing tasks. Normally this number is stable, so it can be used to track job efficiency. In some cases it fluctuates due to the environment, for example if you change the kind of hardware, or if spot availability/interruption is particularly bad one day. (See the REST API sketch after this list.)
  • Data Size
    • Shuffle Write - total amount of data written during shuffle.
    • Shuffle Read - total amount of data read from remote executors during shuffle. Typically, it is similar to Shuffle Write. If it is considerably higher, it might suggest excessive retries, or multiple passes over the same data with poor locality.
    • Input - total amount of data read from external sources.
  • Spot interruption - the number of executors lost due to spot interruption by your cloud provider. This metric might not be available in your setup.
  • Executor removed: OOM - the number of executors lost due to out-of-memory errors.
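
Many of these metrics can also be retrieved directly from Spark's monitoring REST API, which is one way to cross-check the charts. The sketch below is a minimal example, not this tool's implementation: it aggregates task time, shuffle sizes, input size, and failed tasks over the stages of a single application. The history server URL and application ID are placeholder assumptions; the field names (executorRunTime, shuffleReadBytes, and so on) come from Spark's documented stage endpoint.

```python
# Minimal sketch: aggregate per-stage metrics for one application via
# Spark's monitoring REST API (/api/v1). URL and app id are placeholders.
import requests

BASE_URL = "http://localhost:18080/api/v1"  # assumption: local history server
APP_ID = "app-20240101000000-0001"          # assumption: example application id

stages = requests.get(f"{BASE_URL}/applications/{APP_ID}/stages").json()

totals = {
    "task_time_ms": sum(s.get("executorRunTime", 0) for s in stages),
    "shuffle_write_bytes": sum(s.get("shuffleWriteBytes", 0) for s in stages),
    "shuffle_read_bytes": sum(s.get("shuffleReadBytes", 0) for s in stages),
    "input_bytes": sum(s.get("inputBytes", 0) for s in stages),
    "failed_tasks": sum(s.get("numFailedTasks", 0) for s in stages),
}

# Shuffle read far above shuffle write can hint at excessive retries or
# multiple passes over the same data with poor locality (see above).
if totals["shuffle_write_bytes"] > 0:
    print("shuffle read/write ratio:",
          round(totals["shuffle_read_bytes"] / totals["shuffle_write_bytes"], 2))

print(totals)
```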

Cost

The cost metric shows the approximate compute cost of the applications. It is only available when Spark runs on Kubernetes, and only in certain supported environments.

The cost is computed using the following approach (a code sketch follows below):

  • For each pod, we determine the node it runs on and compute the full node cost, adding up instance costs, network disk costs, and data transfer costs.
  • We determine what share of the node was used by the pod:
    • time share is pod runtime / node lifetime
    • capacity share is pod CPU requests / node CPU capacity
  • Pod cost is computed as node cost * time share * capacity share.
  • Job cost is computed as the sum of the costs of all its pods.

There are possible indirect costs not included in this approach. For example, a pod might request only part of a node's capacity, yet still block other pods from scheduling on that node due to fragmentation.
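
To make the arithmetic concrete, here is a minimal sketch of the model above. The class and field names are illustrative assumptions; a real implementation would take node prices and pod/node lifetimes from the cloud provider and the Kubernetes API.

```python
# Sketch of the cost model: pod cost = node cost * time share * capacity share.
from dataclasses import dataclass

@dataclass
class Node:
    hourly_cost: float    # instance + network disk + data transfer, per hour
    lifetime_hours: float
    cpu_capacity: float   # total cores on the node

@dataclass
class Pod:
    node: Node
    runtime_hours: float
    cpu_request: float    # cores requested by the pod

def pod_cost(pod: Pod) -> float:
    node_cost = pod.node.hourly_cost * pod.node.lifetime_hours
    time_share = pod.runtime_hours / pod.node.lifetime_hours
    capacity_share = pod.cpu_request / pod.node.cpu_capacity
    return node_cost * time_share * capacity_share

def job_cost(pods: list[Pod]) -> float:
    # Job cost is the sum of the costs of all its pods.
    return sum(pod_cost(p) for p in pods)

# Example: a 16-core node at $0.80/hour that lived 24 hours, running one
# executor pod that requested 4 cores and ran for 2 hours.
node = Node(hourly_cost=0.80, lifetime_hours=24, cpu_capacity=16)
executor = Pod(node=node, runtime_hours=2, cpu_request=4)
print(f"pod cost: ${pod_cost(executor):.2f}")  # 19.2 * (2/24) * (4/16) = $0.40
```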