Metrics

The Scalr Agent generates various metrics that provide insights into its performance.

Configuration

To enable metrics export, set the SCALR_AGENT_OTLP_ENDPOINT environment variable to the host:port address of your OpenTelemetry collector (a gRPC server that accepts OTLP), and enable the SCALR_AGENT_OTLP_METRICS_ENABLED option.
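
As a minimal sketch of the two settings above, the helper below builds the environment for an agent process. The function name otlp_metrics_env and the host otel-collector.internal are hypothetical, 4317 is the conventional OTLP gRPC port, and the value "true" for SCALR_AGENT_OTLP_METRICS_ENABLED is an assumption about the accepted format, not something this page confirms.

```python
def otlp_metrics_env(host: str, port: int = 4317) -> dict[str, str]:
    """Build the environment variables that turn on OTLP metrics export."""
    # The endpoint is documented as host:port, so reject URLs with a scheme.
    if not host or "://" in host:
        raise ValueError(f"expected a bare host name, got {host!r}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return {
        "SCALR_AGENT_OTLP_ENDPOINT": f"{host}:{port}",
        # Assumption: "true" is an accepted value for this option.
        "SCALR_AGENT_OTLP_METRICS_ENABLED": "true",
    }


if __name__ == "__main__":
    # Print shell-style exports for a hypothetical collector host.
    for key, value in otlp_metrics_env("otel-collector.internal").items():
        print(f"export {key}={value}")
```

The returned mapping can be merged into os.environ or passed as the env argument of subprocess.run when launching the agent.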

Naming conventions

Scalr Agent metrics adhere to a consistent naming convention to provide clear context.

Structure

Metrics are prefixed with a component path, such as scalr_agent.core or scalr_agent.app.policy, to identify the source of the metric.

If a unit is necessary for clarity, it's appended to the metric name in full, such as _bytes or _seconds.
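
The structure described above can be sketched as a small parser that splits a metric name into its component path and unit suffix. The names parse_metric_name and MetricName are hypothetical, and the suffix list is an assumption drawn only from the units that appear on this page.

```python
from typing import NamedTuple, Optional

# Assumption: unit suffixes seen in the metrics listed on this page.
UNIT_SUFFIXES = ("seconds", "bytes", "nanocores")


class MetricName(NamedTuple):
    component: str       # e.g. "scalr_agent.app.policy"
    name: str            # last dotted segment, including any unit suffix
    unit: Optional[str]  # None when the name carries no unit suffix


def parse_metric_name(full_name: str) -> MetricName:
    """Split a dotted metric name into component path, name, and unit."""
    component, _, name = full_name.rpartition(".")
    if not component or not name:
        raise ValueError(f"metric name has no component path: {full_name!r}")
    unit = next((u for u in UNIT_SUFFIXES if name.endswith("_" + u)), None)
    return MetricName(component, name, unit)


if __name__ == "__main__":
    print(parse_metric_name("scalr_agent.app.policy.opa_eval_command_duration_seconds"))
    print(parse_metric_name("scalr_agent.core.tasks_total"))
```

Names without a unit suffix, such as scalr_agent.core.tasks_total, come back with unit set to None.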

Data Types

  • Time: All timing metrics are measured in seconds with millisecond precision, aligning with established conventions like those used by Prometheus and OpenTelemetry.
  • Size: All data size metrics are measured in bytes.
  • Boolean: Values are represented as the strings "true" or "false".
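
On the consuming side, these conventions might be handled as in the sketch below. Both helper names are hypothetical, and the strict handling of boolean strings is an assumption about how a consumer would want to treat unexpected values.

```python
def parse_bool(value: str) -> bool:
    """Convert the documented "true"/"false" strings into a Python bool."""
    if value == "true":
        return True
    if value == "false":
        return False
    # Assumption: anything else is treated as an error rather than guessed at.
    raise ValueError(f"expected 'true' or 'false', got {value!r}")


def to_milliseconds(seconds: float) -> int:
    """Convert a timing value in seconds to whole milliseconds."""
    return round(seconds * 1000)


if __name__ == "__main__":
    print(parse_bool("true"), to_milliseconds(0.25))
```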

Labels

All metrics follow Unified Service Tagging, including env, service, and version labels. The service is set to scalr-agent, version contains the version of the Scalr Agent emitting the metrics, and env is always set to prod.

Additional labels:

  • app: The URL endpoint, without the scheme, that the agent is connected to, for example myaccount.scalr.io.
  • hostname: The agent hostname.
  • kube_component: The Kubernetes component (controller or worker). Applies to the agent-k8s and agent-job Helm charts.
  • kube_namespace: The Kubernetes namespace the agent is deployed to.

Metrics prefixed with scalr_agent.app include an additional context-specific label: account_name.
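
The labeling rules above can be sketched as follows. This illustrates the documented rules and is not the agent's own code: the function name build_labels and its parameters are hypothetical, the Kubernetes labels are omitted for brevity, and the sketch assumes the Scalr URL is given with a scheme.

```python
from typing import Optional
from urllib.parse import urlsplit


def build_labels(
    metric_name: str,
    agent_version: str,
    scalr_url: str,
    hostname: str,
    account_name: Optional[str] = None,
) -> dict[str, str]:
    """Assemble the label set described in this section for one metric."""
    labels = {
        "env": "prod",                # always "prod" per this page
        "service": "scalr-agent",
        "version": agent_version,
        # Drop the scheme: "https://myaccount.scalr.io" -> "myaccount.scalr.io"
        "app": urlsplit(scalr_url).netloc,
        "hostname": hostname,
    }
    # Only metrics under scalr_agent.app carry the account_name label.
    if metric_name.startswith("scalr_agent.app."):
        if account_name is None:
            raise ValueError("account_name is required for scalr_agent.app metrics")
        labels["account_name"] = account_name
    return labels


if __name__ == "__main__":
    # The version, hostname, and account values here are made up.
    print(build_labels(
        "scalr_agent.app.policy.evaluate_policy_duration_seconds",
        "1.0.0", "https://myaccount.scalr.io", "agent-1", "myaccount",
    ))
```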

Core Metrics

Metrics produced by the Scalr Agent runtime.

scalr_agent.core.update_status_duration_seconds

Time taken to send a status update to the Scalr platform.
Type: Histogram
Unit: seconds

scalr_agent.core.update_result_duration_seconds

Time taken to send a result update to the Scalr platform.
Type: Histogram
Unit: seconds

scalr_agent.core.cpu_limit_nanocores

CPU limit in nanocores for the container during Plan and Apply Scalr run stages.
Type: Gauge
Unit: nanocores

scalr_agent.core.mem_limit_bytes

Memory limit in bytes for the container during Plan and Apply Scalr run stages.
Type: Gauge
Unit: bytes
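
For dashboards, the two limit gauges above are often easier to read in cores and MiB. A minimal conversion sketch follows; the helper names are hypothetical. One core is 10^9 nanocores and one MiB is 1024 * 1024 bytes.

```python
NANOCORES_PER_CORE = 1_000_000_000
BYTES_PER_MIB = 1024 * 1024


def nanocores_to_cores(nanocores: int) -> float:
    """Convert a cpu_limit_nanocores reading to CPU cores."""
    return nanocores / NANOCORES_PER_CORE


def bytes_to_mib(size_bytes: int) -> float:
    """Convert a mem_limit_bytes reading to mebibytes."""
    return size_bytes / BYTES_PER_MIB


if __name__ == "__main__":
    # Example readings: half a core and 512 MiB.
    print(nanocores_to_cores(500_000_000), bytes_to_mib(536_870_912))
```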

scalr_agent.core.kubernetes_job_max_scheduling_delay_seconds

A real-time gauge showing the maximum scheduling delay across all active jobs handled by the agent controller (using kubernetes_job driver).
Type: ObservableGauge
Unit: seconds

scalr_agent.core.kubernetes_job_startup_latency_seconds

The time taken by an agent worker (using kubernetes_job driver) to pick up an agent task pod created by the agent controller.
Type: Histogram
Unit: seconds

scalr_agent.core.kubernetes_schedule_duration_seconds

The time taken by the agent controller (using kubernetes or kubernetes_job driver) to schedule an agent task pod.
Type: Histogram
Unit: seconds

scalr_agent.core.kubernetes_delegate_duration_seconds

The time taken by an agent worker (using kubernetes or kubernetes_job driver) to pick up an agent task pod created by an agent controller.
Type: Histogram
Unit: seconds

scalr_agent.core.acquire_tasks_duration_seconds

Time taken to acquire tasks on the Scalr Agent.
Type: Histogram
Unit: seconds

scalr_agent.core.tasks_total

Total number of tasks on the Huey worker.
Type: Gauge

scalr_agent.core.run_tasks_total

Total number of Scalr Run tasks on the Huey worker.
Type: Gauge

scalr_agent.core.import_duration_seconds

Time taken for the Python runtime to launch.
Type: Histogram
Unit: seconds

scalr_agent.core.startup_duration_seconds

Time taken for the agent to be ready to accept incoming tasks after launch.
Type: Histogram
Unit: seconds

scalr_agent.core.connect_to_pool_duration_seconds

Time taken to register with the Scalr platform.
Type: Histogram
Unit: seconds

scalr_agent.binary_cache.total_used_size_bytes

Current size of the binary cache.
Type: ObservableGauge
Unit: bytes

scalr_agent.binary_cache.module_cache_usage_size_bytes

Disk space used by each module in the cache.
Type: ObservableGauge
Unit: bytes

scalr_agent.core.cancel_duration_seconds

Time required to cancel a task.
Type: Histogram
Unit: seconds

scalr_agent.core.cancel_errors_total

Counter for failed task cancellation attempts.
Type: Counter


Blob Storage Metrics

Emitted by the HTTP blob client during read, write, and extract operations against Scalr blob storage. Includes I/O transfer metrics for processing configuration versions, Run Stage logs, software binaries, plan files, state files, and similar artifacts.

scalr_agent.core.blobclient.upload_blob_duration_seconds

Duration of full blob uploads (HTTP PUT).
Type: Histogram
Unit: seconds

scalr_agent.core.blobclient.upload_blob_bytes

Total bytes uploaded (HTTP PUT).
Type: Counter
Unit: bytes

scalr_agent.core.blobclient.read_blob_duration_seconds

Duration of blob downloads (HTTP GET).
Type: Histogram
Unit: seconds

scalr_agent.core.blobclient.read_blob_bytes

Total bytes downloaded (HTTP GET).
Type: Counter
Unit: bytes

scalr_agent.core.blobclient.write_blob_duration_seconds

Duration of blob write operations (HTTP PATCH).
Type: Histogram
Unit: seconds

scalr_agent.core.blobclient.write_blob_bytes

Total bytes written to blob (HTTP PATCH).
Type: Counter
Unit: bytes

scalr_agent.core.blobclient.extract_blob_duration_seconds

Duration of blob extraction (download and decompress).
Type: Histogram
Unit: seconds
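
The byte counters and duration histograms above can be combined into an approximate transfer rate. The sketch below assumes your metrics backend exposes the running sum of each duration histogram, and that the paired counter and histogram cover the same operations; both are assumptions, and average_throughput is a hypothetical helper.

```python
from typing import Optional


def average_throughput(bytes_delta: float, duration_sum_delta: float) -> Optional[float]:
    """Return average bytes per second over a scrape window.

    bytes_delta: increase of a *_blob_bytes counter over the window.
    duration_sum_delta: increase of the matching *_blob_duration_seconds
        histogram sum over the same window.
    """
    if bytes_delta < 0 or duration_sum_delta < 0:
        raise ValueError("deltas must be non-negative; was a counter reset?")
    if duration_sum_delta == 0:
        # No transfer time was recorded in this window.
        return None
    return bytes_delta / duration_sum_delta


if __name__ == "__main__":
    # 10 MB uploaded across 4 seconds of cumulative upload time.
    print(average_throughput(10_000_000, 4.0))
```

Because the duration sum adds up time across all transfers, concurrent transfers make this a per-transfer average rather than the aggregate bandwidth of the agent.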


IaC Component Metrics

Metrics for OpenTofu, Terraform, and Terragrunt, emitted by terraform.plan and terraform.apply tasks.

scalr_agent.app.terraform.apply_command_duration_seconds

Time taken to execute the Terraform/OpenTofu apply command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.upload_configuration_version_changes_duration_seconds

Time taken to compress and upload configuration version changes to Scalr.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.init_command_duration_seconds

Time taken to execute the Terraform/OpenTofu init command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.validate_command_duration_seconds

Time taken to execute the Terraform/OpenTofu validate command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.fmt_command_duration_seconds

Time taken to execute the Terraform/OpenTofu fmt command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.get_command_duration_seconds

Time taken to execute the Terraform/OpenTofu get command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.install_providers_duration_seconds

Time taken to install providers before the Terraform/OpenTofu init command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.cache_providers_duration_seconds

Time taken to cache providers after the Terraform/OpenTofu init command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.install_modules_duration_seconds

Time taken to install modules before the Terraform/OpenTofu get command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.cache_modules_duration_seconds

Time taken to cache modules after the Terraform/OpenTofu get command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.plan_command_duration_seconds

Time taken to execute the Terraform/OpenTofu plan command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.show_command_duration_seconds

Time taken to execute the show operation that generates the JSON plan.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.upload_plan_duration_seconds

Time taken to compress and upload plan results back to Scalr in both binary and JSON formats.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.test_command_duration_seconds

Time taken to execute the OpenTofu test command.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.install_binaries_duration_seconds

Time taken to install binaries for IaC operations.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.download_configuration_version_duration_seconds

Time taken to download the configuration version for plan, apply, and test operations.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.download_configuration_version_changes_duration_seconds

Time taken to download configuration version changes for apply operations.
Type: Histogram
Unit: seconds

scalr_agent.app.terraform.provider_cache_freed_size_bytes

Total size in bytes of provider plugin cache entries deleted by garbage collection.
Type: Counter
Unit: bytes

scalr_agent.app.terraform.run_config_dir_used_size_bytes

Total size in bytes of the initialized Scalr Run configuration directory, measured after a Plan or Apply operation completes.
Type: Counter
Unit: bytes

scalr_agent.app.terraform.provider_cache_total_used_size_bytes

Current size of the provider plugins cache directory.
Type: ObservableGauge
Unit: bytes

scalr_agent.app.terraform.provider_cache_usage_count

Number of times each provider plugin in the cache has been used.
Type: ObservableGauge
Unit: count

scalr_agent.app.terraform.provider_cache_usage_size_bytes

Disk space used by each provider plugin in the cache.
Type: ObservableGauge
Unit: bytes


Policy Component Metrics

Emitted by policy.check tasks for pre-init and post-plan stages.


scalr_agent.app.policy.opa_eval_command_duration_seconds

Time taken to call the opa eval command.
Type: Histogram
Unit: seconds

scalr_agent.app.policy.download_tfinput_duration_seconds

Time taken to download tfinput files.
Type: Histogram
Unit: seconds

scalr_agent.app.policy.evaluate_policy_group_duration_seconds

Time taken to evaluate a policy group.
Type: Histogram
Unit: seconds

scalr_agent.app.policy.evaluate_policy_duration_seconds

Time taken to evaluate all files within a policy.
Type: Histogram
Unit: seconds

scalr_agent.app.policy.install_binaries_duration_seconds

Time taken to install binaries for policy operations.
Type: Histogram
Unit: seconds


Cost Estimate Component Metrics

Emitted by cost.estimate tasks during the cost estimation stage.

scalr_agent.app.cost.infracost_breakdown_command_duration_seconds

Time taken to execute the Infracost breakdown command.
Type: Histogram
Unit: seconds

scalr_agent.app.cost.infracost_output_command_duration_seconds

Time taken to execute the Infracost output command.
Type: Histogram
Unit: seconds

scalr_agent.app.cost.download_plan_json_duration_seconds

Time taken to download the plan JSON file as input for the cost estimation workflow.
Type: Histogram
Unit: seconds

scalr_agent.app.cost.estimate_duration_seconds

Time taken for the cost estimation workflow.
Type: Histogram
Unit: seconds

scalr_agent.app.cost.install_binaries_duration_seconds

Time taken to install binaries for cost operations.
Type: Histogram
Unit: seconds


Checkov Component Metrics

Emitted by checkov.analyze tasks during the pre-plan stage.

scalr_agent.app.checkov.download_configuration_version_duration_seconds

Time taken to download the configuration version for Checkov operations.
Type: Histogram
Unit: seconds

scalr_agent.app.checkov.download_external_checks_duration_seconds

Time taken to download external Checkov check files.
Type: Histogram
Unit: seconds

scalr_agent.app.checkov.checkov_analyze_command_duration_seconds

Time taken to execute the Checkov analyze command.
Type: Histogram
Unit: seconds

scalr_agent.app.checkov.install_binaries_duration_seconds

Time taken to install binaries for Checkov operations.
Type: Histogram
Unit: seconds