Introduction to Service Telemetry Framework 1.4
Service Telemetry Framework (STF) collects monitoring data from OpenStack (OSP) or third-party nodes. You can use STF to perform the following tasks:
- Store or archive the monitoring data for historical information.
- View the monitoring data graphically on the dashboard.
- Use the monitoring data to trigger alerts or warnings.
The monitoring data can be either a metric or an event:

- Metric: A numeric measurement of an application or system.
- Event: An irregular and discrete occurrence that happens in a system.
The components of STF use a message bus for data transport. Other modular components that receive and store data are deployed as containers on OpenShift.
- You can install OpenShift on bare metal or on other supported cloud platforms. For more information about how to deploy OpenShift, see the OpenShift Container Platform 4.8 Documentation.
- For more information about STF performance and scaling, see https://access.redhat.com/articles/4907241.
Service Telemetry Framework architecture
Service Telemetry Framework (STF) uses a client-server architecture, in which OpenStack (OSP) is the client and OpenShift is the server.
STF consists of the following components:
- Data collection
  - collectd: Collects infrastructure metrics and events.
  - Ceilometer: Collects OSP metrics and events.
- Transport
  - Apache Qpid Dispatch Router: An AMQP 1.x compatible messaging bus that provides fast and reliable data transport to transfer the metrics to STF for storage.
  - Smart Gateway: A Golang application that takes metrics and events from the AMQP 1.x bus to deliver to ElasticSearch or Prometheus.
- Data storage
  - Prometheus: Time-series data storage that stores STF metrics received from the Smart Gateway.
  - ElasticSearch: Events data storage that stores STF events received from the Smart Gateway.
- Observation
  - Alertmanager: An alerting tool that uses Prometheus alert rules to manage alerts.
  - Grafana: A visualization and analytics application that you can use to query, visualize, and explore data.
The following table describes the application of the client and server components:
Component | Client | Server
---|---|---
An AMQP 1.x compatible messaging bus | yes | yes
Smart Gateway | no | yes
Prometheus | no | yes
ElasticSearch | no | yes
collectd | yes | no
Ceilometer | yes | no
To ensure that the monitoring platform can report operational problems with your cloud, do not install STF on the same infrastructure that you are monitoring.

For client-side metrics, collectd provides infrastructure metrics without project data, and Ceilometer provides OSP platform data based on projects or user workload. Both Ceilometer and collectd deliver data through the message bus by using the Apache Qpid Dispatch Router transport. On the server side, a Golang application called the Smart Gateway takes the data stream from the bus and exposes it as a local scrape endpoint for Prometheus.
If you plan to collect and store events, collectd and Ceilometer deliver event data to the server side by using the Apache Qpid Dispatch Router transport. Another Smart Gateway writes the data to the ElasticSearch datastore.
Server-side STF monitoring infrastructure consists of the following layers:
- Service Telemetry Framework 1.4
- OpenShift 4.7 through 4.8
- Infrastructure platform

Installation size of OpenShift
The size of your OpenShift installation depends on the following factors:
- The infrastructure that you select.
- The number of nodes that you want to monitor.
- The number of metrics that you want to collect.
- The resolution of metrics.
- The length of time that you want to store the data.
Installation of Service Telemetry Framework (STF) depends on an existing OpenShift environment.
For more information about minimum resource requirements when you install OpenShift on bare metal, see Minimum resource requirements in the Installing a cluster on bare metal guide. For installation requirements of the various public and private cloud platforms that you can install on, see the corresponding installation documentation for your cloud platform of choice.
Development environment resource requirements
You can create an all-in-one development environment for STF locally by using CodeReady Containers. The installation process of CodeReady Containers (CRC) is available at https://code-ready.github.io/crc/#installation_gsg.
The default minimum resource requirements for CRC are not enough to run STF. Ensure that your host system has the following resources available:
- 4 physical cores (8 hyperthreaded cores)
- 64 GB of memory
- 80 GB of storage space
After you complete the installation of CRC, use the crc start command to start your environment. The recommended minimum system resources for running STF in CodeReady Containers are 48 GB of memory and 8 virtual CPU cores:
crc start --memory=49152 --cpus=8
If you have an existing environment, delete it, and recreate it to ensure that the resource requests have an effect.
- Enter the crc delete command:

crc delete

- Run the crc start command to create your environment:

crc start --memory=49152 --cpus=8
Preparing your OpenShift environment for Service Telemetry Framework
To prepare your OpenShift environment for Service Telemetry Framework (STF), you must plan for persistent storage, adequate resources, and event storage:
- Ensure that persistent storage is available in your OpenShift cluster for a production-grade deployment. For more information, see Persistent volumes.
- Ensure that enough resources are available to run the Operators and the application containers. For more information, see Resource allocation.
- STF uses ElasticSearch to store events, which requires a larger than normal vm.max_map_count. The vm.max_map_count value is set by default in OpenShift. For more information about how to edit the value of vm.max_map_count, see Node tuning operator.
Observability Strategy in Service Telemetry Framework
Service Telemetry Framework (STF) does not include storage backends and alerting tools. STF uses community operators to deploy Prometheus, Alertmanager, Grafana, and Elasticsearch. STF makes requests to these community operators to create instances of each application configured to work with STF.
Instead of having the Service Telemetry Operator create custom resource requests, you can use your own deployments of these applications or other compatible applications, and scrape the metrics Smart Gateways from your own Prometheus-compatible system for telemetry storage. If you set the observability strategy to use alternative back ends instead, persistent or ephemeral storage is not required for STF.
Persistent volumes
Service Telemetry Framework (STF) uses persistent storage in OpenShift to request persistent volumes so that Prometheus and ElasticSearch can store metrics and events.
When you enable persistent storage through the Service Telemetry Operator, the Persistent Volume Claims (PVCs) requested in an STF deployment result in an access mode of RWO (ReadWriteOnce). If your environment contains pre-provisioned persistent volumes, ensure that volumes of RWO are available in the OpenShift default configured storageClass.
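Before you deploy, you can check which storage class is the cluster default and whether any pre-provisioned volumes offer the RWO access mode. This is a minimal check only, assuming read access to cluster storage resources; the storage class names in your cluster will differ:

$ oc get storageclass
$ oc get pv -o custom-columns=NAME:.metadata.name,ACCESSMODES:.spec.accessModes,STORAGECLASS:.spec.storageClassName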
- For more information about configuring persistent storage for OpenShift, see Understanding persistent storage.
- For more information about recommended configurable storage technology in OpenShift, see Recommended configurable storage technology.
- For more information about configuring persistent storage for Prometheus in STF, see Configuring persistent storage for Prometheus.
- For more information about configuring persistent storage for ElasticSearch in STF, see Configuring persistent storage for ElasticSearch.
Ephemeral storage
You can use ephemeral storage to run Service Telemetry Framework (STF) without persistently storing data in your OpenShift cluster.
If you use ephemeral storage, you might experience data loss if a pod is restarted, updated, or rescheduled onto another node. Use ephemeral storage only for development or testing, and not production environments.
Resource allocation
To enable the scheduling of pods within the OpenShift infrastructure, you need resources for the components that are running. If you do not allocate enough resources, pods remain in a Pending state because they cannot be scheduled.
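If pods remain in a Pending state, you can list them and inspect the scheduling events to see which resource request cannot be satisfied. This is a general troubleshooting sketch; the pod name is a placeholder:

$ oc get pods --namespace service-telemetry --field-selector=status.phase=Pending
$ oc describe pod <pending_pod_name> --namespace service-telemetry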
The amount of resources that you require to run Service Telemetry Framework (STF) depends on your environment and the number of nodes and clouds that you want to monitor.
- For recommendations about sizing for metrics collection, see Service Telemetry Framework Performance and Scaling.
- For information about sizing requirements for ElasticSearch, see https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html.
Node tuning operator
STF uses ElasticSearch to store events, which requires a larger than normal vm.max_map_count. The vm.max_map_count value is set by default in OpenShift.
If your host platform is a typical OpenShift 4 environment, do not make any adjustments. The default node tuning operator is configured to account for ElasticSearch workloads.
If you want to edit the value of vm.max_map_count, you cannot apply node tuning manually using the sysctl command because OpenShift manages nodes directly. To configure values and apply them to the infrastructure, you must use the node tuning operator. For more information, see Using the Node Tuning Operator.
In an OKD deployment, the default node tuning operator specification provides the required profiles for ElasticSearch workloads or pods scheduled on nodes. To view the default cluster node tuning specification, run the following command:
$ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator
The output of the default specification is documented at Default profiles set on a cluster. You can manage the assignment of profiles in the recommend section, where profiles are applied to a node when certain conditions are met. When scheduling ElasticSearch to a node in STF, one of the following profiles is applied:
- openshift-control-plane-es
- openshift-node-es
When scheduling an ElasticSearch pod, there must be a label present that matches tuned.openshift.io/elasticsearch. If the label is present, one of the two profiles is assigned to the pod. No action is required by the administrator if you use the recommended Operator for ElasticSearch. If you use a custom-deployed ElasticSearch with STF, ensure that you add the tuned.openshift.io/elasticsearch label to all scheduled pods.
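As an illustration only, the relevant fragment of a custom ElasticSearch pod template might carry the label as follows; the surrounding manifest is whatever your own deployment uses:

metadata:
  labels:
    # Required so that the node tuning operator applies an ElasticSearch profile
    # to the node where this pod is scheduled.
    tuned.openshift.io/elasticsearch: ""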
- For more information about virtual memory usage by ElasticSearch, see https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html.
- For more information about how the profiles are applied to nodes, see Custom tuning specification.
Installing the core components of Service Telemetry Framework
You can use Operators to load the Service Telemetry Framework (STF) components and objects. Operators manage each of the following STF core and community components:
- Apache Qpid Dispatch Router
- Smart Gateway
- Prometheus and AlertManager
- ElasticSearch
- Grafana
- An OpenShift version inclusive of 4.7 through 4.8 is running.
- You have prepared your OpenShift environment and ensured that there is persistent storage and enough resources to run the STF components on top of the OpenShift environment. For more information, see Service Telemetry Framework Performance and Scaling.
- For more information about Operators, see the Understanding Operators guide.
Deploying Service Telemetry Framework to the OpenShift environment
Deploy Service Telemetry Framework (STF) to collect, store, and monitor events:
- Create a namespace to contain the STF components, for example, service-telemetry:

$ oc new-project service-telemetry

- Create an OperatorGroup in the namespace so that you can schedule the Operator pods:

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: service-telemetry-operator-group
  namespace: service-telemetry
spec:
  targetNamespaces:
    - service-telemetry
EOF

  For more information, see OperatorGroups.
- Before you deploy STF on OpenShift, you must enable the catalog source. Install a CatalogSource that contains the Service Telemetry Operator and the Smart Gateway Operator:

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: infrawatch-operators
  namespace: openshift-marketplace
spec:
  displayName: InfraWatch Operators
  image: quay.io/infrawatch-operators/infrawatch-catalog:nightly
  publisher: InfraWatch
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 30m
EOF

- Validate the creation of your CatalogSource:

$ oc get -nopenshift-marketplace catalogsource infrawatch-operators
NAME                   DISPLAY                TYPE   PUBLISHER    AGE
infrawatch-operators   InfraWatch Operators   grpc   InfraWatch   2m16s

- Validate that the Operators are available from the catalog:

$ oc get packagemanifests | grep InfraWatch
service-telemetry-operator   InfraWatch Operators   7m20s
smart-gateway-operator       InfraWatch Operators   7m20s
- Enable the OperatorHub.io Community Catalog Source to install data storage and visualization Operators:

  Red Hat supports the core Operators and workloads, including Apache Qpid Dispatch Router, AMQ Certificate Manager, Service Telemetry Operator, and Smart Gateway Operator. Red Hat does not support the community Operators or workload components, inclusive of ElasticSearch, Prometheus, Alertmanager, Grafana, and their Operators.

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: operatorhubio-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/operatorhubio/catalog:latest
  displayName: OperatorHub.io Operators
  publisher: OperatorHub.io
EOF

- Subscribe to the AMQ Certificate Manager Operator by using the redhat-operators CatalogSource:

  The AMQ Certificate Manager deploys to the openshift-operators namespace and is then available to all namespaces across the cluster. As a result, on clusters with a large number of namespaces, it can take several minutes for the Operator to be available in the service-telemetry namespace. The AMQ Certificate Manager Operator is not compatible with the dependency management of Operator Lifecycle Manager when you use it with other namespace-scoped operators.

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: amq7-cert-manager-operator
  namespace: openshift-operators
spec:
  channel: 1.x
  installPlanApproval: Automatic
  name: amq7-cert-manager-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
- Validate your ClusterServiceVersion. Ensure that amq7-cert-manager.v1.0.3 displays a phase of Succeeded:

$ oc get csv --namespace openshift-operators --selector operators.coreos.com/amq7-cert-manager-operator.openshift-operators
NAME                       DISPLAY                                         VERSION   REPLACES                   PHASE
amq7-cert-manager.v1.0.3   Red Hat Integration - AMQ Certificate Manager   1.0.3     amq7-cert-manager.v1.0.2   Succeeded

- Subscribe to the AMQ Interconnect Operator by using the redhat-operators CatalogSource:

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: amq7-interconnect-operator
  namespace: service-telemetry
spec:
  channel: 1.10.x
  installPlanApproval: Automatic
  name: amq7-interconnect-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

- Validate your ClusterServiceVersion. Ensure that amq7-interconnect-operator.v1.10.4 displays a phase of Succeeded:

$ oc get csv --selector=operators.coreos.com/amq7-interconnect-operator.service-telemetry
NAME                                 DISPLAY                                  VERSION   REPLACES                             PHASE
amq7-interconnect-operator.v1.10.4   Red Hat Integration - AMQ Interconnect   1.10.4    amq7-interconnect-operator.v1.10.3   Succeeded
- If you plan to store metrics in Prometheus, you must enable the Prometheus Operator. To enable the Prometheus Operator, create the following manifest in your OpenShift environment:

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus
  namespace: service-telemetry
spec:
  channel: beta
  installPlanApproval: Automatic
  name: prometheus
  source: operatorhubio-operators
  sourceNamespace: openshift-marketplace
EOF

- Verify that the ClusterServiceVersion for Prometheus has a phase of Succeeded:

$ oc get csv --selector=operators.coreos.com/prometheus.service-telemetry
NAME                        DISPLAY               VERSION   REPLACES                    PHASE
prometheusoperator.0.47.0   Prometheus Operator   0.47.0    prometheusoperator.0.37.0   Succeeded
- If you plan to store events in ElasticSearch, you must enable the Elastic Cloud on Kubernetes (ECK) Operator. To enable the ECK Operator, create the following manifest in your OpenShift environment:

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: elasticsearch-eck-operator-certified
  namespace: service-telemetry
spec:
  channel: stable
  installPlanApproval: Automatic
  name: elasticsearch-eck-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF

- Verify that the ClusterServiceVersion for Elastic Cloud on Kubernetes has a phase of Succeeded:

$ oc get csv --selector=operators.coreos.com/elasticsearch-eck-operator-certified.service-telemetry
NAME                                         DISPLAY                        VERSION   REPLACES   PHASE
elasticsearch-eck-operator-certified.1.9.1   Elasticsearch (ECK) Operator   1.9.1                Succeeded
- Create the Service Telemetry Operator subscription to manage the STF instances:

$ oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: service-telemetry-operator
  namespace: service-telemetry
spec:
  channel: stable-1.4
  installPlanApproval: Automatic
  name: service-telemetry-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

- Validate the Service Telemetry Operator and the dependent operators:

$ oc get csv --namespace service-telemetry
NAME                                         DISPLAY                                         VERSION          REPLACES                             PHASE
amq7-cert-manager.v1.0.3                     Red Hat Integration - AMQ Certificate Manager   1.0.3            amq7-cert-manager.v1.0.2             Succeeded
amq7-interconnect-operator.v1.10.4           Red Hat Integration - AMQ Interconnect          1.10.4           amq7-interconnect-operator.v1.10.3   Succeeded
elasticsearch-eck-operator-certified.1.9.1   Elasticsearch (ECK) Operator                    1.9.1                                                 Succeeded
prometheusoperator.0.47.0                    Prometheus Operator                             0.47.0           prometheusoperator.0.37.0            Succeeded
service-telemetry-operator.v1.4.1641489191   Service Telemetry Operator                      1.4.1641489191                                        Succeeded
smart-gateway-operator.v4.0.1641489202       Smart Gateway Operator                          4.0.1641489202                                        Succeeded
Creating a ServiceTelemetry object in OpenShift
Create a ServiceTelemetry object in OpenShift so that the Service Telemetry Operator creates the supporting components for a Service Telemetry Framework (STF) deployment. For more information, see Primary parameters of the ServiceTelemetry object.
- To create a ServiceTelemetry object that results in an STF deployment that uses the default values, create a ServiceTelemetry object with an empty spec parameter:

$ oc apply -f - <<EOF
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec: {}
EOF

  To override a default value, define the parameter that you want to override. In this example, enable ElasticSearch by setting enabled to true:

$ oc apply -f - <<EOF
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    events:
      elasticsearch:
        enabled: true
EOF
  Creating a ServiceTelemetry object with an empty spec parameter results in an STF deployment with the following default settings:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  alerting:
    alertmanager:
      receivers:
        snmpTraps:
          enabled: false
          target: 192.168.24.254
      storage:
        persistent:
          pvcStorageRequest: 20G
        strategy: persistent
    enabled: true
  backends:
    events:
      elasticsearch:
        enabled: false
        storage:
          persistent:
            pvcStorageRequest: 20Gi
          strategy: persistent
        version: 7.16.1
    logs:
      loki:
        enabled: false
        flavor: 1x.extra-small
        replicationFactor: 1
        storage:
          objectStorageSecret: test
          storageClass: standard
    metrics:
      prometheus:
        enabled: true
        scrapeInterval: 10s
        storage:
          persistent:
            pvcStorageRequest: 20G
          retention: 24h
          strategy: persistent
  clouds:
    - events:
        collectors:
          - collectorType: collectd
            debugEnabled: false
            subscriptionAddress: collectd/cloud1-notify
          - collectorType: ceilometer
            debugEnabled: false
            subscriptionAddress: anycast/ceilometer/cloud1-event.sample
      metrics:
        collectors:
          - collectorType: collectd
            debugEnabled: false
            subscriptionAddress: collectd/cloud1-telemetry
          - collectorType: ceilometer
            debugEnabled: false
            subscriptionAddress: anycast/ceilometer/cloud1-metering.sample
          - collectorType: sensubility
            debugEnabled: false
            subscriptionAddress: sensubility/cloud1-telemetry
      name: cloud1
  graphing:
    enabled: false
    grafana:
      adminPassword: secret
      adminUser: root
      baseImage: docker.io/grafana/grafana:latest
      disableSignoutMenu: false
      ingressEnabled: false
  highAvailability:
    enabled: false
  observabilityStrategy: use_community
  transports:
    qdr:
      enabled: true
      web:
        enabled: false
  To override these defaults, add the configuration to the spec parameter.

- View the STF deployment logs in the Service Telemetry Operator:

$ oc logs --selector name=service-telemetry-operator

...
--------------------------- Ansible Task Status Event StdOut  -----------------

PLAY RECAP *********************************************************************
localhost                  : ok=57   changed=0    unreachable=0    failed=0    skipped=20   rescued=0    ignored=0
- To determine that all workloads are operating correctly, view the pods and the status of each pod.

  If you set the backends.events.elasticsearch.enabled parameter to true, the notification Smart Gateways report Error and CrashLoopBackOff error messages for a period of time before ElasticSearch starts.

$ oc get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-default-0                                    2/2     Running   0          17m
default-cloud1-ceil-meter-smartgateway-6484b98b68-vd48z   2/2     Running   0          17m
default-cloud1-coll-meter-smartgateway-799f687658-4gxpn   2/2     Running   0          17m
default-cloud1-sens-meter-smartgateway-c7f4f7fc8-c57b4    2/2     Running   0          17m
default-interconnect-54658f5d4-pzrpt                      1/1     Running   0          17m
elastic-operator-66b7bc49c4-sxkc2                         1/1     Running   0          52m
interconnect-operator-69df6b9cb6-7hhp9                    1/1     Running   0          50m
prometheus-default-0                                      2/2     Running   1          17m
prometheus-operator-6458b74d86-wbdqp                      1/1     Running   0          51m
service-telemetry-operator-864646787c-hd9pm               1/1     Running   0          51m
smart-gateway-operator-79778cf548-mz5z7                   1/1     Running   0          51m
Primary parameters of the ServiceTelemetry object
The ServiceTelemetry object comprises the following primary configuration parameters:
- alerting
- backends
- clouds
- graphing
- highAvailability
- transports
You can configure each of these configuration parameters to provide different features in an STF deployment.
The backends parameter
Use the backends parameter to control which storage back ends are available for storage of metrics and events, and to control the enablement of Smart Gateways that the clouds parameter defines. For more information, see The clouds parameter.
Currently, you can use Prometheus as the metrics storage back end and ElasticSearch as the events storage back end.
Enabling Prometheus as a storage back end for metrics
To enable Prometheus as a storage back end for metrics, you must configure the ServiceTelemetry object.
- Configure the ServiceTelemetry object:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true
Configuring persistent storage for Prometheus
Use the additional parameters that are defined in backends.metrics.prometheus.storage.persistent to configure persistent storage options for Prometheus, such as storage class and volume size.
Use storageClass to define the back end storage class. If you do not set this parameter, the Service Telemetry Operator uses the default storage class for the OpenShift cluster.

Use the pvcStorageRequest parameter to define the minimum required volume size to satisfy the storage request. If volumes are statically defined, it is possible that a volume size larger than requested is used. By default, the Service Telemetry Operator requests a volume size of 20G (20 gigabytes).
- List the available storage classes:

$ oc get storageclasses
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
csi-manila-ceph      manila.csi.openstack.org   Delete          Immediate              false                  20h
standard (default)   kubernetes.io/cinder       Delete          WaitForFirstConsumer   true                   20h
standard-csi         cinder.csi.openstack.org   Delete          WaitForFirstConsumer   true                   20h

- Configure the ServiceTelemetry object:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true
        storage:
          strategy: persistent
          persistent:
            storageClass: standard-csi
            pvcStorageRequest: 50G
Enabling ElasticSearch as a storage back end for events
To enable ElasticSearch as a storage back end for events, you must configure the ServiceTelemetry object.
- Configure the ServiceTelemetry object:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    events:
      elasticsearch:
        enabled: true
Configuring persistent storage for ElasticSearch
Use the additional parameters defined in backends.events.elasticsearch.storage.persistent to configure persistent storage options for ElasticSearch, such as storage class and volume size.

Use storageClass to define the back end storage class. If you do not set this parameter, the Service Telemetry Operator uses the default storage class for the OpenShift cluster.

Use the pvcStorageRequest parameter to define the minimum required volume size to satisfy the storage request. If volumes are statically defined, it is possible that a volume size larger than requested is used. By default, the Service Telemetry Operator requests a volume size of 20Gi (20 gibibytes).
- List the available storage classes:

$ oc get storageclasses
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
csi-manila-ceph      manila.csi.openstack.org   Delete          Immediate              false                  20h
standard (default)   kubernetes.io/cinder       Delete          WaitForFirstConsumer   true                   20h
standard-csi         cinder.csi.openstack.org   Delete          WaitForFirstConsumer   true                   20h

- Configure the ServiceTelemetry object:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    events:
      elasticsearch:
        enabled: true
        version: 7.16.1
        storage:
          strategy: persistent
          persistent:
            storageClass: standard-csi
            pvcStorageRequest: 50G
The clouds parameter
Use the clouds parameter to define which Smart Gateway objects deploy, thereby providing the interface for multiple monitored cloud environments to connect to an instance of STF. If a supporting back end is available, then metrics and events Smart Gateways for the default cloud configuration are created. By default, the Service Telemetry Operator creates Smart Gateways for cloud1.
You can create a list of cloud objects to control which Smart Gateways are created for the defined clouds. Each cloud consists of data types and collectors. Data types are metrics or events. Each data type consists of a list of collectors, the message bus subscription address, and a parameter to enable debugging. Available collectors for metrics are collectd, ceilometer, and sensubility. Available collectors for events are collectd and ceilometer. Ensure that the subscription address for each of these collectors is unique for every cloud, data type, and collector combination.
The default cloud1 configuration is represented by the following ServiceTelemetry object, which provides subscriptions and data storage of metrics and events for the collectd, Ceilometer, and Sensubility data collectors for a particular cloud instance:
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
name: stf-default
namespace: service-telemetry
spec:
clouds:
- name: cloud1
metrics:
collectors:
- collectorType: collectd
subscriptionAddress: collectd/telemetry
- collectorType: ceilometer
subscriptionAddress: anycast/ceilometer/metering.sample
- collectorType: sensubility
subscriptionAddress: sensubility/telemetry
debugEnabled: false
events:
collectors:
- collectorType: collectd
subscriptionAddress: collectd/notify
- collectorType: ceilometer
subscriptionAddress: anycast/ceilometer/event.sample
Each item of the clouds parameter represents a cloud instance. A cloud instance consists of three top-level parameters: name, metrics, and events. The metrics and events parameters represent the corresponding back end for storage of that data type. The collectors parameter specifies a list of objects made up of two required parameters, collectorType and subscriptionAddress, and these represent an instance of the Smart Gateway. The collectorType parameter specifies data collected by either collectd, Ceilometer, or Sensubility. The subscriptionAddress parameter provides the Apache Qpid Dispatch Router address to which a Smart Gateway subscribes.
You can use the optional Boolean parameter debugEnabled within the collectors parameter to enable additional console debugging in the running Smart Gateway pod.
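For example, the following sketch, based on the default cloud1 configuration shown earlier, enables debug output for the collectd metrics Smart Gateway only:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  clouds:
    - name: cloud1
      metrics:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/cloud1-telemetry
            # Prints additional console debugging in the running Smart Gateway pod.
            debugEnabled: true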
- For more information about deleting default Smart Gateways, see Deleting the default Smart Gateways.
- For more information about how to configure multiple clouds, see Configuring multiple clouds.
The alerting parameter
Use the alerting parameter to control creation of an Alertmanager instance and the configuration of the storage back end. By default, alerting is enabled. For more information, see Alerts in Service Telemetry Framework.
The graphing parameter
Use the graphing parameter to control the creation of a Grafana instance. By default, graphing is disabled. For more information, see Dashboards in Service Telemetry Framework.
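For example, based on the default manifest shown earlier, the following override enables the Grafana instance while the remaining grafana settings keep their defaults:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  graphing:
    enabled: true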
The highAvailability parameter
Use the highAvailability parameter to control the instantiation of multiple copies of STF components to reduce recovery time of components that fail or are rescheduled. By default, highAvailability is disabled. For more information, see High availability.
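For example, based on the default manifest shown earlier, the following override enables high availability:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  highAvailability:
    enabled: true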
The transports parameter
Use the transports parameter to control the enablement of the message bus for an STF deployment. The only transport currently supported is Apache Qpid Dispatch Router. By default, the qdr transport is enabled.
Accessing user interfaces for STF components
In OpenShift, applications are exposed to the external network through a route. For more information about routes, see Configuring ingress cluster traffic.
In Service Telemetry Framework (STF), HTTPS routes are exposed for each service that has a web-based interface. These routes are protected by OpenShift RBAC, and any user that has a ClusterRoleBinding that enables them to view OpenShift namespaces can log in. For more information about RBAC, see Using RBAC to define and apply permissions.
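As an example only, one way to give a user the required visibility is to bind the built-in view cluster role to that user; whether this is appropriate depends on your cluster's RBAC policy, and the username is a placeholder:

$ oc adm policy add-cluster-role-to-user view <username>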
- Log in to OpenShift.
- Change to the service-telemetry namespace:

$ oc project service-telemetry

- List the available web UI routes in the service-telemetry project:

$ oc get routes | grep web
default-alertmanager-proxy   default-alertmanager-proxy-service-telemetry.apps.infra.watch   default-alertmanager-proxy   web   reencrypt/Redirect   None
default-prometheus-proxy     default-prometheus-proxy-service-telemetry.apps.infra.watch     default-prometheus-proxy     web   reencrypt/Redirect   None
- In a web browser, navigate to https://<route_address> to access the web interface for the corresponding service.
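For example, to print only the host name of the Prometheus route listed above, you can use a JSONPath query; substitute the route name of another service as needed:

$ oc get route default-prometheus-proxy -o jsonpath='{.spec.host}'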
Configuring an alternate observability strategy
To configure STF to skip the deployment of storage, visualization, and alerting back ends, add observabilityStrategy: none to the ServiceTelemetry spec. In this mode, only Apache Qpid Dispatch Router routers and metrics Smart Gateways are deployed, and you must configure an external Prometheus-compatible system to collect metrics from the STF Smart Gateways.
Currently, only metrics are supported when you set observabilityStrategy to none. Events Smart Gateways are not deployed.
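The exact scrape configuration depends on your external monitoring system. The following prometheus.yml fragment is only a sketch: the job name, target host, and port are assumptions, and you must substitute the addresses on which your metrics Smart Gateways are actually exposed:

scrape_configs:
  # Hypothetical job that scrapes the STF metrics Smart Gateways directly.
  # Replace the target with the service address and port exposed in your cluster.
  - job_name: stf-smart-gateways
    static_configs:
      - targets:
          - default-cloud1-coll-meter-smartgateway.service-telemetry.svc:8081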
- Create a ServiceTelemetry object with the property observabilityStrategy: none in the spec parameter. The manifest results in a default deployment of STF that is suitable for receiving telemetry from a single cloud with all metrics collector types.

$ oc apply -f - <<EOF
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  observabilityStrategy: none
EOF
- To verify that all workloads are operating correctly, view the pods and the status of each pod:

$ oc get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
default-cloud1-ceil-meter-smartgateway-59c845d65b-gzhcs   3/3     Running   0          132m
default-cloud1-coll-meter-smartgateway-75bbd948b9-d5phm   3/3     Running   0          132m
default-cloud1-sens-meter-smartgateway-7fdbb57b6d-dh2g9   3/3     Running   0          132m
default-interconnect-668d5bbcd6-57b2l                     1/1     Running   0          132m
interconnect-operator-b8f5bb647-tlp5t                     1/1     Running   0          47h
service-telemetry-operator-566b9dd695-wkvjq               1/1     Running   0          156m
smart-gateway-operator-58d77dcf7-6xsq7                    1/1     Running   0          47h
For more information about configuring additional clouds or changing the set of supported collectors, see Deploying Smart Gateways.
Removing Service Telemetry Framework from the OpenShift environment
Remove Service Telemetry Framework (STF) from an OpenShift environment if you no longer require the STF functionality.
Deleting the namespace
To remove the operational resources for STF from OpenShift, delete the namespace.
- Run the oc delete command:

$ oc delete project service-telemetry
- Verify that the resources have been deleted from the namespace:

$ oc get all
No resources found.
Removing the CatalogSource
If you do not expect to install Service Telemetry Framework (STF) again, delete the CatalogSource. When you remove the CatalogSource, PackageManifests related to STF are automatically removed from the Operator Lifecycle Manager catalog.
- Delete the CatalogSource:

$ oc delete --namespace=openshift-marketplace catalogsource infrawatch-operators
catalogsource.operators.coreos.com "infrawatch-operators" deleted
- Verify that the STF PackageManifests are removed from the platform. If successful, the following command returns no result:

$ oc get packagemanifests | grep InfraWatch
- If you enabled the OperatorHub.io Community Catalog Source during the installation process and you no longer need this catalog source, delete it:

$ oc delete --namespace=openshift-marketplace catalogsource operatorhubio-operators
catalogsource.operators.coreos.com "operatorhubio-operators" deleted
For more information about the OperatorHub.io Community Catalog Source, see Deploying Service Telemetry Framework to the OpenShift environment.
Configuring OpenStack for Service Telemetry Framework
To collect metrics, events, or both, and to send them to the Service Telemetry Framework (STF) storage domain, you must configure the OpenStack (OSP) overcloud to enable data collection and transport.
STF can support both single and multiple clouds. The default configurations in OSP and STF are set up for a single-cloud installation.
- For a single OSP overcloud deployment with default configuration, see Deploying OpenStack overcloud for Service Telemetry Framework.
- To plan your OSP installation and configure STF for multiple clouds, see Configuring multiple clouds.
- As part of an OSP overcloud deployment, you might need to configure additional features in your environment:
  - To deploy data collection and transport to STF on OSP cloud nodes that employ routed L3 domains, such as distributed compute node (DCN) or spine-leaf, see Deploying to non-standard network topologies.
  - To send metrics to both Gnocchi and STF, see Sending metrics to Gnocchi and Service Telemetry Framework.
Deploying OpenStack overcloud for Service Telemetry Framework
As part of the OpenStack (OSP) overcloud deployment, you must configure the data collectors and the data transport to Service Telemetry Framework (STF).
- To collect data through Apache Qpid Dispatch Router, see the amqp1 plug-in.
Retrieving the Apache Qpid Dispatch Router route address
When you configure the OpenStack (OSP) overcloud for Service Telemetry Framework (STF), you must provide the Apache Qpid Dispatch Router route address in the STF connection file.
- Log in to your OpenShift environment.
- In the service-telemetry project, retrieve the Apache Qpid Dispatch Router route address:

$ oc get routes -ogo-template='{{ range .items }}{{printf "%s\n" .spec.host }}{{ end }}' | grep "\-5671"
default-interconnect-5671-service-telemetry.apps.infra.watch
Creating the base configuration for STF
To configure the base parameters to provide a compatible data collection and transport for Service Telemetry Framework (STF), you must create a file that defines the default data collection values.
- Log in to the OpenStack (OSP) undercloud as the stack user.
- Create a configuration file called enable-stf.yaml in the /home/stack directory.

  Setting EventPipelinePublishers and PipelinePublishers to empty lists results in no event or metric data passing to OSP telemetry components, such as Gnocchi or Panko. If you need to send data to additional pipelines, the Ceilometer polling interval of 30 seconds, as specified in ExtraConfig, might overwhelm the OSP telemetry components, and you must increase the interval to a larger value, such as 300. Increasing the value to a longer polling interval results in less telemetry resolution in STF.

  To enable collection of telemetry with STF and Gnocchi, see Sending metrics to Gnocchi and Service Telemetry Framework.
parameter_defaults:
# only send to STF, not other publishers
EventPipelinePublishers: []
PipelinePublishers: []
# manage the polling and pipeline configuration files for Ceilometer agents
ManagePolling: true
ManagePipeline: true
# enable Ceilometer metrics and events
CeilometerQdrPublishMetrics: true
CeilometerQdrPublishEvents: true
# enable collection of API status
CollectdEnableSensubility: true
CollectdSensubilityTransport: amqp1
# enable collection of containerized service metrics
CollectdEnableLibpodstats: true
# set collectd overrides for higher telemetry resolution and extra plugins
# to load
CollectdConnectionType: amqp1
CollectdAmqpInterval: 5
CollectdDefaultPollingInterval: 5
CollectdExtraPlugins:
- vmem
# set standard prefixes for where metrics and events are published to QDR
MetricsQdrAddresses:
- prefix: 'collectd'
distribution: multicast
- prefix: 'anycast/ceilometer'
distribution: multicast
ExtraConfig:
ceilometer::agent::polling::polling_interval: 30
ceilometer::agent::polling::polling_meters:
- cpu
- disk.*
- ip.*
- image.*
- memory
- memory.*
- network.*
- perf.*
- port
- port.*
- switch
- switch.*
- storage.*
- volume.*
# to avoid filling the memory buffers if disconnected from the message bus
# note: this may need an adjustment if there are many metrics to be sent.
collectd::plugin::amqp1::send_queue_limit: 5000
# receive extra information about virtual memory
collectd::plugin::vmem::verbose: true
# provide name and uuid in addition to hostname for better correlation
# to ceilometer data
collectd::plugin::virt::hostname_format: "name uuid hostname"
# provide the human-friendly name of the virtual instance
collectd::plugin::virt::plugin_instance_format: metadata
# set memcached collectd plugin to report its metrics by hostname
# rather than host IP, ensuring metrics in the dashboard remain uniform
collectd::plugin::memcached::instances:
local:
host: "%{hiera('fqdn_canonical')}"
port: 11211
Configuring the STF connection for the overcloud
To configure the Service Telemetry Framework (STF) connection, you must create a file that contains the connection configuration of the Apache Qpid Dispatch Router for the overcloud to the STF deployment. Enable the collection of events and storage of the events in STF and deploy the overcloud. The default configuration is for a single cloud instance with the default message bus topics. For configuration of multiple cloud deployments, see Configuring multiple clouds.
- Retrieve the Apache Qpid Dispatch Router route address. For more information, see Retrieving the Apache Qpid Dispatch Router route address.
- Log in to the OSP undercloud as the stack user.
- Create a configuration file called stf-connectors.yaml in the /home/stack directory.
- In the stf-connectors.yaml file, configure the MetricsQdrConnectors address to connect the Apache Qpid Dispatch Router on the overcloud to the STF deployment. You configure the topic addresses for Sensubility, Ceilometer, and collectd in this file to match the defaults in STF. For more information about customizing topics and cloud configuration, see Configuring multiple clouds.
  - Replace the host parameter with the value of HOST/PORT that you retrieved in Retrieving the Apache Qpid Dispatch Router route address.
stf-connectors.yaml:

resource_registry:
  OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml  (1)

parameter_defaults:
  MetricsQdrConnectors:
    - host: stf-default-interconnect-5671-service-telemetry.apps.infra.watch  (2)
      port: 443
      role: edge
      verifyHostname: false
      sslProfile: sslProfile

  MetricsQdrSSLProfiles:
    - name: sslProfile

  CeilometerQdrEventsConfig:
    driver: amqp
    topic: cloud1-event  (3)

  CeilometerQdrMetricsConfig:
    driver: amqp
    topic: cloud1-metering  (4)

  CollectdAmqpInstances:
    cloud1-notify:  (5)
      notify: true
      format: JSON
      presettle: false
    cloud1-telemetry:  (6)
      format: JSON
      presettle: false

  CollectdSensubilityResultsChannel: sensubility/cloud1-telemetry  (7)
(1) Directly load the collectd service because you are not including the collectd-write-qdr.yaml environment file for multiple cloud deployments.
(2) Replace the host parameter with the value of HOST/PORT that you retrieved in Retrieving the Apache Qpid Dispatch Router route address.
(3) Define the topic for Ceilometer events. The format of this value is anycast/ceilometer/cloud1-event.sample.
(4) Define the topic for Ceilometer metrics. The format of this value is anycast/ceilometer/cloud1-metering.sample.
(5) Define the topic for collectd events. The format of this value is collectd/cloud1-notify.
(6) Define the topic for collectd metrics. The format of this value is collectd/cloud1-telemetry.
(7) Define the topic for collectd-sensubility events. The value is the exact string sensubility/cloud1-telemetry.
Deploying the overcloud
Deploy or update the overcloud with the required environment files so that data is collected and transmitted to Service Telemetry Framework (STF).
- Log in to the OpenStack (OSP) undercloud as the stack user.
- Source the authentication file:

[stack@undercloud-0 ~]$ source stackrc

(undercloud) [stack@undercloud-0 ~]$

- Add the following files to your OSP TripleO deployment to configure data collection and Apache Qpid Dispatch Router:
  - The ceilometer-write-qdr.yaml file to ensure that Ceilometer telemetry and events are sent to STF
  - The qdr-edge-only.yaml file to ensure that the message bus is enabled and connected to STF message bus routers
  - The enable-stf.yaml environment file to ensure that defaults are configured correctly
  - The stf-connectors.yaml environment file to define the connection to STF
- Deploy the OSP overcloud:
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy <other_arguments> --templates /usr/share/openstack-tripleo-heat-templates \
  --environment-file <...other_environment_files...> \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/ceilometer-write-qdr.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/qdr-edge-only.yaml \
  --environment-file /home/stack/enable-stf.yaml \
  --environment-file /home/stack/stf-connectors.yaml
Validating client-side installation
To validate data collection from the Service Telemetry Framework (STF) storage domain, query the data sources for delivered data. To validate individual nodes in the OpenStack (OSP) deployment, use SSH to connect to the console.
Some telemetry data is available only when OSP has active workloads.
- Log in to an overcloud node, for example, controller-0.
- Ensure that the metrics_qdr container is running on the node:

$ sudo podman container inspect --format '{{.State.Status}}' metrics_qdr
running
- Return the internal network address on which Apache Qpid Dispatch Router is running, for example, 172.17.1.44 listening on port 5666:

$ sudo podman exec -it metrics_qdr cat /etc/qpid-dispatch/qdrouterd.conf
listener {
    host: 172.17.1.44
    port: 5666
    authenticatePeer: no
    saslMechanisms: ANONYMOUS
}
- Return a list of connections to the local Apache Qpid Dispatch Router:

$ sudo podman exec -it metrics_qdr qdstat --bus=172.17.1.44:5666 --connections
Connections
  id   host                                                                container                                                                                                   role    dir  security                             authentication  tenant
  ===========================================================================================================================================================================================
  1    default-interconnect-5671-service-telemetry.apps.infra.watch:443   default-interconnect-7458fd4d69-bgzfb                                                                       edge    out  TLSv1.2(DHE-RSA-AES256-GCM-SHA384)   anonymous-user
  12   172.17.1.44:60290                                                  openstack.org/om/container/controller-0/ceilometer-agent-notification/25/5c02cee550f143ec9ea030db5cccba14   normal  in   no-security                          no-auth
  16   172.17.1.44:36408                                                  metrics                                                                                                     normal  in   no-security                          anonymous-user
  899  172.17.1.44:39500                                                  10a2e99d-1b8a-4329-b48c-4335e5f75c84                                                                        normal  in   no-security                          no-auth
  There are four connections:

  - Outbound connection to STF
  - Inbound connection from ceilometer
  - Inbound connection from collectd
  - Inbound connection from our qdstat client

  The outbound STF connection goes to the host that you provided in the MetricsQdrConnectors host parameter and is the route for the STF storage domain. The other hosts are internal network addresses of the client connections to this Apache Qpid Dispatch Router.
- To ensure that messages are delivered, list the links, and view the _edge address in the deliv column for delivery of messages:

$ sudo podman exec -it metrics_qdr qdstat --bus=172.17.1.44:5666 --links
Router Links
  type      dir  conn id  id    peer  class   addr                  phs  cap  pri  undel  unsett  deliv    presett  psdrop  acc  rej  rel      mod  delay  rate
  ===========================================================================================================================================================
  endpoint  out  1        5           local   _edge                      250  0    0      0       2979926  0        0       0    0    2979926  0    0      0
  endpoint  in   1        6                                              250  0    0      0       0        0        0       0    0    0        0    0      0
  endpoint  in   1        7                                              250  0    0      0       0        0        0       0    0    0        0    0      0
  endpoint  out  1        8                                              250  0    0      0       0        0        0       0    0    0        0    0      0
  endpoint  in   1        9                                              250  0    0      0       0        0        0       0    0    0        0    0      0
  endpoint  out  1        10                                             250  0    0      0       911      911      0       0    0    0        0    911    0
  endpoint  in   1        11                                             250  0    0      0       0        911      0       0    0    0        0    0      0
  endpoint  out  12       32          local   temp.lSY6Mcicol4J2Kp       250  0    0      0       0        0        0       0    0    0        0    0      0
  endpoint  in   16       41                                             250  0    0      0       2979924  0        0       0    0    2979924  0    0      0
  endpoint  in   912      1834        mobile  $management           0    250  0    0      0       1        0        0       1    0    0        0    0      0
  endpoint  out  912      1835        local   temp.9Ok2resI9tmt+CT       250  0    0      0       0        0        0       0    0    0        0    0      0
- To list the addresses from OSP nodes to STF, connect to OpenShift to retrieve the Apache Qpid Dispatch Router pod name and list the connections. List the available Apache Qpid Dispatch Router pods:

$ oc get pods -l application=default-interconnect
NAME                                    READY   STATUS    RESTARTS   AGE
default-interconnect-7458fd4d69-bgzfb   1/1     Running   0          6d21h
- Connect to the pod and list the known connections. In this example, there are three edge connections from the OSP nodes with connection id 22, 23, and 24:

$ oc exec -it default-interconnect-7458fd4d69-bgzfb -- qdstat --connections
2020-04-21 18:25:47.243852 UTC
default-interconnect-7458fd4d69-bgzfb

Connections
  id  host                container                                                      role    dir  security                                 authentication  tenant  last dlv      uptime
  =============================================================================================================================================================================================
  5   10.129.0.110:48498  bridge-3f5                                                     edge    in   no-security                              anonymous-user          000:00:00:02  000:17:36:29
  6   10.129.0.111:43254  rcv[default-cloud1-ceil-meter-smartgateway-58f885c76d-xmxwn]   edge    in   no-security                              anonymous-user          000:00:00:02  000:17:36:20
  7   10.130.0.109:50518  rcv[default-cloud1-coll-event-smartgateway-58fbbd4485-rl9bd]   normal  in   no-security                              anonymous-user          -             000:17:36:11
  8   10.130.0.110:33802  rcv[default-cloud1-ceil-event-smartgateway-6cfb65478c-g5q82]   normal  in   no-security                              anonymous-user          000:01:26:18  000:17:36:05
  22  10.128.0.1:51948    Router.ceph-0.redhat.local                                     edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)   anonymous-user          000:00:00:03  000:22:08:43
  23  10.128.0.1:51950    Router.compute-0.redhat.local                                  edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)   anonymous-user          000:00:00:03  000:22:08:43
  24  10.128.0.1:52082    Router.controller-0.redhat.local                               edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)   anonymous-user          000:00:00:00  000:22:08:34
  27  127.0.0.1:42202     c2f541c1-4c97-4b37-a189-a396c08fb079                           normal  in   no-security                              no-auth                 000:00:00:00  000:00:00:00
- To view the number of messages delivered by the network, use each address with the oc exec command:

$ oc exec -it default-interconnect-7458fd4d69-bgzfb -- qdstat --address
2020-04-21 18:20:10.293258 UTC
default-interconnect-7458fd4d69-bgzfb

Router Addresses
  class   addr                                 phs  distrib    pri  local  remote  in           out          thru  fallback
  ==========================================================================================================================
  mobile  anycast/ceilometer/event.sample      0    balanced   -    1      0       970          970          0     0
  mobile  anycast/ceilometer/metering.sample   0    balanced   -    1      0       2,344,833    2,344,833    0     0
  mobile  collectd/notify                      0    multicast  -    1      0       70           70           0     0
  mobile  collectd/telemetry                   0    multicast  -    1      0       216,128,890  216,128,890  0     0
Sending metrics to Gnocchi and Service Telemetry Framework
To send metrics to Service Telemetry Framework (STF) and Gnocchi simultaneously, you must include an environment file in your deployment to enable an additional publisher.
If you need to send data to additional pipelines, the Ceilometer polling interval of 30 seconds, as specified in ExtraConfig, might overwhelm the OSP telemetry components, and you must increase the interval to a larger value, such as 300. Increasing the value to a longer polling interval results in less telemetry resolution in STF.
- You have created a file that contains the connection configuration of the Apache Qpid Dispatch Router for the overcloud to STF. For more information, see Configuring the STF connection for the overcloud.
- Create an environment file named gnocchi-connectors.yaml in the /home/stack directory:

resource_registry:
  OS::TripleO::Services::GnocchiApi: /usr/share/openstack-tripleo-heat-templates/deployment/gnocchi/gnocchi-api-container-puppet.yaml
  OS::TripleO::Services::GnocchiMetricd: /usr/share/openstack-tripleo-heat-templates/deployment/gnocchi/gnocchi-metricd-container-puppet.yaml
  OS::TripleO::Services::GnocchiStatsd: /usr/share/openstack-tripleo-heat-templates/deployment/gnocchi/gnocchi-statsd-container-puppet.yaml
  OS::TripleO::Services::AodhApi: /usr/share/openstack-tripleo-heat-templates/deployment/aodh/aodh-api-container-puppet.yaml
  OS::TripleO::Services::AodhEvaluator: /usr/share/openstack-tripleo-heat-templates/deployment/aodh/aodh-evaluator-container-puppet.yaml
  OS::TripleO::Services::AodhNotifier: /usr/share/openstack-tripleo-heat-templates/deployment/aodh/aodh-notifier-container-puppet.yaml
  OS::TripleO::Services::AodhListener: /usr/share/openstack-tripleo-heat-templates/deployment/aodh/aodh-listener-container-puppet.yaml

parameter_defaults:
  CeilometerEnableGnocchi: true
  CeilometerEnablePanko: false
  GnocchiArchivePolicy: 'high'
  GnocchiBackend: 'rbd'
  GnocchiRbdPoolName: 'metrics'

  EventPipelinePublishers: ['gnocchi://?filter_project=service']
  PipelinePublishers: ['gnocchi://?filter_project=service']
- Add the environment file gnocchi-connectors.yaml to the deployment command. Replace <other_arguments> with files that are applicable to your environment:

$ openstack overcloud deploy <other_arguments> --templates /usr/share/openstack-tripleo-heat-templates \
  --environment-file <...other_environment_files...> \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/ceilometer-write-qdr.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/collectd-write-qdr.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/qdr-edge-only.yaml \
  --environment-file /home/stack/enable-stf.yaml \
  --environment-file /home/stack/stf-connectors.yaml \
  --environment-file /home/stack/gnocchi-connectors.yaml
- To ensure that the configuration was successful, verify the content of the file /var/lib/config-data/puppet-generated/ceilometer/etc/ceilometer/pipeline.yaml on a Controller node. Ensure that the publishers section of the file contains information for both notifier and Gnocchi:

sources:
  - name: meter_source
    meters:
      - "*"
    sinks:
      - meter_sink
sinks:
  - name: meter_sink
    publishers:
      - gnocchi://?filter_project=service
      - notifier://172.17.1.35:5666/?driver=amqp&topic=metering
Deploying to non-standard network topologies
If your nodes are on a separate network from the default InternalApi network, you must make configuration adjustments so that Apache Qpid Dispatch Router can transport data to the Service Telemetry Framework (STF) server instance. This scenario is typical in a spine-leaf or a DCN topology. For more information about DCN configuration, see the Spine Leaf Networking guide.
If you use STF with OpenStack (OSP) Train and plan to monitor your Ceph, Block, or Object Storage nodes, you must make configuration changes that are similar to the configuration changes that you make to the spine-leaf and DCN network configuration. To monitor Ceph nodes, use the CephStorageExtraConfig parameter to define which network interface to load into the Apache Qpid Dispatch Router and collectd configuration files.
CephStorageExtraConfig:
tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('storage')}"
tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('storage')}"
tripleo::profile::base::ceilometer::agent::notification::notifier_host_addr: "%{hiera('storage')}"
Similarly, you must specify the BlockStorageExtraConfig and ObjectStorageExtraConfig parameters if your environment uses Block and Object Storage roles.
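By analogy with the CephStorageExtraConfig example above, a sketch of the equivalent parameters for Block and Object Storage roles looks like the following; verify the hiera network names against your own deployment:

BlockStorageExtraConfig:
  tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('storage')}"
  tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('storage')}"
  tripleo::profile::base::ceilometer::agent::notification::notifier_host_addr: "%{hiera('storage')}"

ObjectStorageExtraConfig:
  tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('storage')}"
  tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('storage')}"
  tripleo::profile::base::ceilometer::agent::notification::notifier_host_addr: "%{hiera('storage')}"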
To deploy a spine-leaf topology, you must create roles and networks, then assign those networks to the available roles. When you configure data collection and transport for STF for an OSP deployment, the default network for roles is InternalApi. For Ceph, Block, and Object Storage roles, the default network is Storage.
Because a spine-leaf configuration can result in different networks being assigned to different Leaf groupings and those names are typically unique, additional configuration is required in the parameter_defaults section of the OSP environment files.
- Document which networks are available for each of the Leaf roles. For examples of network name definitions, see Creating a network data file in the Spine Leaf Networking guide. For more information about the creation of the Leaf groupings (roles) and assignment of the networks to those groupings, see Creating a roles data file in the Spine Leaf Networking guide.
- Add the following configuration example to the ExtraConfig section for each of the leaf roles. In this example, internal_api_subnet is the value defined in the name_lower parameter of your network definition (with _subnet appended to the name for Leaf 0), and is the network to which the ComputeLeaf0 leaf role is connected. In this case, the network identification of 0 corresponds to the Compute role for leaf 0, and represents a value that is different from the default internal API network name.

  For the ComputeLeaf0 leaf role, specify extra configuration to perform a hiera lookup to determine which network interface for a particular network to assign to the collectd AMQP host parameter. Perform the same configuration for the Apache Qpid Dispatch Router listener address parameter.

ComputeLeaf0ExtraConfig:
  tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('internal_api_subnet')}"
  tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('internal_api_subnet')}"
  Additional leaf roles typically replace _subnet with _leafN, where N represents a unique identifier for the leaf.

ComputeLeaf1ExtraConfig:
  tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('internal_api_leaf1')}"
  tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('internal_api_leaf1')}"
The following example configuration applies to a CephStorage leaf role:

CephStorageLeaf0ExtraConfig:
  tripleo::profile::base::metrics::collectd::amqp_host: "%{hiera('storage_subnet')}"
  tripleo::profile::base::metrics::qdr::listener_addr: "%{hiera('storage_subnet')}"
Configuring multiple clouds
You can configure multiple OpenStack (OSP) clouds to target a single instance of Service Telemetry Framework (STF). When you configure multiple clouds, each cloud must send metrics and events on its own unique message bus topic. In the STF deployment, Smart Gateway instances listen on these topics to save information to the common data store. Data that is stored by the Smart Gateway in the data storage domain is filtered by using the metadata that each of the Smart Gateways creates.

To configure the OSP overcloud for a multiple cloud scenario, complete the following tasks:
-
Plan the AMQP address prefixes that you want to use for each cloud. For more information, see Planning AMQP address prefixes.
-
Deploy metrics and events consumer Smart Gateways for each cloud to listen on the corresponding address prefixes. For more information, see Deploying Smart Gateways.
-
Configure each cloud with a unique domain name. For more information, see Setting a unique cloud domain.
-
Create the base configuration for STF. For more information, see Creating the base configuration for STF.
-
Configure each cloud to send its metrics and events to STF on the correct address. For more information, see Creating the OpenStack environment file for multiple clouds.
Planning AMQP address prefixes
By default, OpenStack (OSP) nodes receive data through two data collectors: collectd and Ceilometer. The collectd-sensubility plugin also requires its own unique address. These components send telemetry data or notifications to their respective AMQP addresses, for example, collectd/telemetry, and STF Smart Gateways listen on those AMQP addresses for data. To support multiple clouds and to identify which cloud generated the monitoring data, configure each cloud to send data to a unique address. Add a cloud identifier prefix to the second part of the address. The following list shows some example addresses and identifiers:
-
collectd/cloud1-telemetry
-
collectd/cloud1-notify
-
sensubility/cloud1-telemetry
-
anycast/ceilometer/cloud1-metering.sample
-
anycast/ceilometer/cloud1-event.sample
-
collectd/cloud2-telemetry
-
collectd/cloud2-notify
-
sensubility/cloud2-telemetry
-
anycast/ceilometer/cloud2-metering.sample
-
anycast/ceilometer/cloud2-event.sample
-
collectd/us-east-1-telemetry
-
collectd/us-west-3-telemetry
Deploying Smart Gateways
You must deploy a Smart Gateway for each data collection type for each cloud: one for collectd metrics, one for collectd events, one for Ceilometer metrics, one for Ceilometer events, and one for collectd-sensubility metrics. Configure each of the Smart Gateways to listen on the AMQP address that you define for the corresponding cloud. To define Smart Gateways, configure the clouds parameter in the ServiceTelemetry manifest.
When you deploy STF for the first time, Smart Gateway manifests are created that define the initial Smart Gateways for a single cloud. When you deploy Smart Gateways for multiple cloud support, you deploy multiple Smart Gateways for each of the data collection types that handle the metrics and the events data for each cloud. The initial Smart Gateways are defined in cloud1
with the following subscription addresses:
collector | type | default subscription address
---|---|---
collectd | metrics | collectd/telemetry
collectd | events | collectd/notify
collectd-sensubility | metrics | sensubility/telemetry
Ceilometer | metrics | anycast/ceilometer/metering.sample
Ceilometer | events | anycast/ceilometer/event.sample
-
You have determined your cloud naming scheme. For more information about determining your naming scheme, see Planning AMQP address prefixes.
-
You have created your list of clouds objects. For more information about creating the content for the
clouds
parameter, see The clouds parameter.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Edit the default ServiceTelemetry object and add a clouds parameter with your configuration:

Long cloud names might exceed the maximum pod name length of 63 characters. Ensure that the combination of the ServiceTelemetry name default and the clouds.name does not exceed 19 characters. Cloud names cannot contain special characters, such as -. Limit cloud names to alphanumeric characters (a-z, 0-9). Topic addresses have no character limitation and can be different from the clouds.name value.

$ oc edit stf default
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  ...
spec:
  ...
  clouds:
    - name: cloud1
      events:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/cloud1-notify
          - collectorType: ceilometer
            subscriptionAddress: anycast/ceilometer/cloud1-event.sample
      metrics:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/cloud1-telemetry
          - collectorType: sensubility
            subscriptionAddress: sensubility/cloud1-telemetry
          - collectorType: ceilometer
            subscriptionAddress: anycast/ceilometer/cloud1-metering.sample
    - name: cloud2
      events:
  ...
-
Save the ServiceTelemetry object.
-
Verify that each Smart Gateway is running. This can take several minutes depending on the number of Smart Gateways:
$ oc get po -l app=smart-gateway

NAME                                                      READY   STATUS    RESTARTS   AGE
default-cloud1-ceil-event-smartgateway-6cfb65478c-g5q82   2/2     Running   0          13h
default-cloud1-ceil-meter-smartgateway-58f885c76d-xmxwn   2/2     Running   0          13h
default-cloud1-coll-event-smartgateway-58fbbd4485-rl9bd   2/2     Running   0          13h
default-cloud1-coll-meter-smartgateway-7c6fc495c4-jn728   2/2     Running   0          13h
default-cloud1-sens-meter-smartgateway-8h4tc445a2-mm683   2/2     Running   0          13h
Deleting the default Smart Gateways
After you configure Service Telemetry Framework (STF) for multiple clouds, you can delete the default Smart Gateways if they are no longer in use. The Service Telemetry Operator can remove SmartGateway
objects that were created but are no longer listed in the ServiceTelemetry clouds
list of objects. To enable the removal of SmartGateway objects that are not defined by the clouds
parameter, you must set the cloudsRemoveOnMissing
parameter to true
in the ServiceTelemetry
manifest.
If you do not want to deploy any Smart Gateways, define an empty clouds list by using the clouds: [] parameter.
The cloudsRemoveOnMissing parameter is disabled by default. If you enable the cloudsRemoveOnMissing parameter, you remove any manually created SmartGateway objects in the current namespace without any possibility of restoring them.
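For example, a minimal sketch of the relevant part of the ServiceTelemetry spec when you want the Service Telemetry Operator to remove every Smart Gateway; both values shown are deliberate choices for this scenario, not defaults:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  ...
  cloudsRemoveOnMissing: true
  clouds: []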
-
Define your
clouds
parameter with the list of cloud objects that you want the Service Telemetry Operator to manage. For more information, see The clouds parameter. -
Edit the ServiceTelemetry object and add the cloudsRemoveOnMissing parameter:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  ...
spec:
  ...
  cloudsRemoveOnMissing: true
  clouds:
    ...
-
Save the modifications.
-
Verify that the Operator deleted the Smart Gateways. This can take several minutes while the Operators reconcile the changes:
$ oc get smartgateways
Setting a unique cloud domain
To ensure that Apache Qpid Dispatch Router connections from OpenStack (OSP) to Service Telemetry Framework (STF) are unique and do not conflict, configure the CloudDomain parameter.
-
Create a new environment file, for example,
hostnames.yaml
. -
Set the CloudDomain parameter in the environment file, as shown in the following example:

hostnames.yaml

parameter_defaults:
  CloudDomain: newyork-west-04
  CephStorageHostnameFormat: 'ceph-%index%'
  ObjectStorageHostnameFormat: 'swift-%index%'
  ComputeHostnameFormat: 'compute-%index%'
-
Add the new environment file to your deployment. For more information, see Creating the OpenStack environment file for multiple clouds and Core overcloud parameters in the Overcloud Parameters guide.
Creating the OpenStack environment file for multiple clouds
To label traffic according to the cloud of origin, you must create a configuration with cloud-specific instance names. Create an stf-connectors.yaml
file and adjust the values of CeilometerQdrEventsConfig
, CeilometerQdrMetricsConfig
and CollectdAmqpInstances
to match the AMQP address prefix scheme.
If you enabled container health and API status monitoring, you must also modify the CollectdSensubilityResultsChannel parameter. For more information, see OpenStack API status and containerized services health.
-
You have created your list of clouds objects. For more information about creating the content for the clouds parameter, see the clouds configuration parameter.
-
You have retrieved the Apache Qpid Dispatch Router route address. For more information, see Retrieving the Apache Qpid Dispatch Router route address.
-
You have created the base configuration for STF. For more information, see Creating the base configuration for STF.
-
You have created a unique domain name environment file. For more information, see Setting a unique cloud domain.
-
Log in to the OpenStack undercloud as the
stack
user. -
Create a configuration file called
stf-connectors.yaml
in the/home/stack
directory. -
In the stf-connectors.yaml file, configure the MetricsQdrConnectors address to connect to the Apache Qpid Dispatch Router on the overcloud deployment. Configure the CeilometerQdrEventsConfig, CeilometerQdrMetricsConfig, CollectdAmqpInstances, and CollectdSensubilityResultsChannel topic values to match the AMQP address that you want for this cloud deployment.

stf-connectors.yaml

resource_registry:
  OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml  (1)

parameter_defaults:
  MetricsQdrConnectors:
    - host: stf-default-interconnect-5671-service-telemetry.apps.infra.watch  (2)
      port: 443
      role: edge
      verifyHostname: false
      sslProfile: sslProfile

  MetricsQdrSSLProfiles:
    - name: sslProfile

  CeilometerQdrEventsConfig:
    driver: amqp
    topic: cloud1-event  (3)

  CeilometerQdrMetricsConfig:
    driver: amqp
    topic: cloud1-metering  (4)

  CollectdAmqpInstances:
    cloud1-notify:  (5)
      notify: true
      format: JSON
      presettle: false
    cloud1-telemetry:  (6)
      format: JSON
      presettle: false

  CollectdSensubilityResultsChannel: sensubility/cloud1-telemetry  (7)
1 Directly load the collectd service because you are not including the collectd-write-qdr.yaml environment file for multiple cloud deployments.
2 Replace the host parameter with the value of HOST/PORT that you retrieved in Retrieving the Apache Qpid Dispatch Router route address.
3 Define the topic for Ceilometer events. This value is the address format of anycast/ceilometer/cloud1-event.sample.
4 Define the topic for Ceilometer metrics. This value is the address format of anycast/ceilometer/cloud1-metering.sample.
5 Define the topic for collectd events. This value is the format of collectd/cloud1-notify.
6 Define the topic for collectd metrics. This value is the format of collectd/cloud1-telemetry.
7 Define the topic for collectd-sensubility events. Ensure that this value is the exact string format sensubility/cloud1-telemetry.
-
Ensure that the naming convention in the
stf-connectors.yaml
file aligns with thespec.bridge.amqpUrl
field in the Smart Gateway configuration. For example, configure theCeilometerQdrEventsConfig.topic
field to a value ofcloud1-event
. -
Source the authentication file:
[stack@undercloud-0 ~]$ source stackrc (undercloud) [stack@undercloud-0 ~]$
-
Include the stf-connectors.yaml file and the unique domain name environment file hostnames.yaml in the openstack overcloud deployment command, with any other environment files relevant to your environment:

If you use the collectd-write-qdr.yaml file with a custom CollectdAmqpInstances parameter, data publishes to the custom and default topics. In a multiple cloud environment, the configuration of the resource_registry parameter in the stf-connectors.yaml file loads the collectd service.

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy <other_arguments> --templates /usr/share/openstack-tripleo-heat-templates \
  --environment-file <...other_environment_files...> \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/ceilometer-write-qdr.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/metrics/qdr-edge-only.yaml \
  --environment-file /home/stack/hostnames.yaml \
  --environment-file /home/stack/enable-stf.yaml \
  --environment-file /home/stack/stf-connectors.yaml
-
Deploy the OpenStack overcloud.
-
For information about how to validate the deployment, see Validating client-side installation.
Querying metrics data from multiple clouds
Data stored in Prometheus has a service
label according to the Smart Gateway it was scraped from. You can use this label to query data from a specific cloud.
To query data from a specific cloud, use a PromQL query that matches the associated service label, for example: collectd_uptime{service="default-cloud1-coll-meter"}.
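To run the same query from the command line instead of the Prometheus UI, you can reuse the basic-authentication pattern used later in this guide for the default-prometheus-proxy route; a sketch, assuming the default-prometheus-htpasswd secret holds the credentials:

$ curl -k -G --user "internal:$(oc get secret default-prometheus-htpasswd -ogo-template='{{ .data.password | base64decode }}')" \
  --data-urlencode 'query=collectd_uptime{service="default-cloud1-coll-meter"}' \
  "https://$(oc get route default-prometheus-proxy -ogo-template='{{ .spec.host }}')/api/v1/query"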
Using operational features of Service Telemetry Framework
You can use the following operational features to provide additional functionality to the Service Telemetry Framework (STF):
Dashboards in Service Telemetry Framework
Use the third-party application, Grafana, to visualize the system-level metrics that collectd and Ceilometer gather for each individual host node.
For more information about configuring collectd, see Deploying OpenStack overcloud for Service Telemetry Framework.
You can use two dashboards to monitor a cloud:
- Infrastructure dashboard
-
Use the infrastructure dashboard to view metrics for a single node at a time. Select a node from the upper left corner of the dashboard.
- Cloud view dashboard
-
Use the cloud view dashboard to view panels to monitor service resource usage, API stats, and cloud events. You must enable API health monitoring and service monitoring to provide the data for this dashboard. API health monitoring is enabled by default in the STF base configuration. For more information, see Creating the base configuration for STF.
-
For more information about API health monitoring, see OpenStack API status and containerized services health.
-
For more information about OSP service monitoring, see Resource usage of OpenStack services.
-
Configuring Grafana to host the dashboard
Grafana is not included in the default Service Telemetry Framework (STF) deployment, so you must deploy the Grafana Operator from OperatorHub.io. When you use the Service Telemetry Operator to deploy Grafana, the result is a Grafana instance and the configuration of the default data sources for the local STF deployment.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Deploy the Grafana operator:
$ oc apply -f - <<EOF apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: grafana-operator namespace: service-telemetry spec: channel: alpha installPlanApproval: Automatic name: grafana-operator source: operatorhubio-operators sourceNamespace: openshift-marketplace EOF
-
Verify that the Operator launched successfully. In the command output, if the value of the
PHASE
column isSucceeded
, the Operator launched successfully:$ oc get csv --selector operators.coreos.com/grafana-operator.service-telemetry NAME DISPLAY VERSION REPLACES PHASE grafana-operator.v3.10.3 Grafana Operator 3.10.3 grafana-operator.v3.10.2 Succeeded
-
To launch a Grafana instance, create or modify the ServiceTelemetry object. Set graphing.enabled and graphing.grafana.ingressEnabled to true:

$ oc edit stf default

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
...
spec:
  ...
  graphing:
    enabled: true
    grafana:
      ingressEnabled: true
-
Verify that the Grafana instance deployed:
$ oc get pod -l app=grafana NAME READY STATUS RESTARTS AGE grafana-deployment-7fc7848b56-sbkhv 1/1 Running 0 1m
-
Verify that the Grafana data sources installed correctly:
$ oc get grafanadatasources NAME AGE default-datasources 20h
-
Verify that the Grafana route exists:
$ oc get route grafana-route NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD grafana-route grafana-route-service-telemetry.apps.infra.watch grafana-service 3000 edge None
Overriding the default Grafana container image
The dashboards in Service Telemetry Framework (STF) require features that are available only in Grafana version 8.1.0 and later. By default, the Service Telemetry Operator installs a compatible version. You can override the base Grafana image by specifying the image path to an image registry with graphing.grafana.baseImage
.
-
Ensure that you have the correct version of Grafana:
$ oc get pod -l "app=grafana" -ojsonpath='{.items[0].spec.containers[0].image}' docker.io/grafana/grafana:7.3.10
-
If the running image is older than 8.1.0, patch the ServiceTelemetry object to update the image. Service Telemetry Operator updates the Grafana manifest, which restarts the Grafana deployment:
$ oc patch stf/default --type merge -p '{"spec":{"graphing":{"grafana":{"baseImage":"docker.io/grafana/grafana:8.1.5"}}}}'
-
Verify that a new Grafana pod exists and has a
STATUS
value ofRunning
:$ oc get pod -l "app=grafana" NAME READY STATUS RESTARTS AGE grafana-deployment-fb9799b58-j2hj2 1/1 Running 0 10s
-
Verify that the new instance is running the updated image:
$ oc get pod -l "app=grafana" -ojsonpath='{.items[0].spec.containers[0].image}'
docker.io/grafana/grafana:8.1.5
Importing dashboards
The Grafana Operator can import and manage dashboards by creating GrafanaDashboard
objects. You can view example dashboards at https://github.com/infrawatch/dashboards.
-
Import the infrastructure dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1.3/rhos-dashboard.yaml grafanadashboard.integreatly.org/rhos-dashboard-1.3 created
-
Import the cloud dashboard:
For some panels in the cloud dashboard, you must set the value of the collectd virt
plugin parameterhostname_format
toname uuid hostname
in thestf-connectors.yaml
file. If you do not configure this parameter, affected dashboards remain empty. For more information about thevirt
plugin, see collectd plugins.$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1.3/rhos-cloud-dashboard.yaml grafanadashboard.integreatly.org/rhos-cloud-dashboard-1.3 created
-
Import the cloud events dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1.3/rhos-cloudevents-dashboard.yaml grafanadashboard.integreatly.org/rhos-cloudevents-dashboard created
-
Import the virtual machine dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1.3/virtual-machine-view.yaml grafanadashboard.integreatly.org/virtual-machine-view-1.3 configured
-
Import the memcached dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1.3/memcached-dashboard.yaml grafanadashboard.integreatly.org/memcached-dashboard-1.3 created
-
Verify that the dashboards are available:
$ oc get grafanadashboards

NAME                         AGE
memcached-dashboard-1.3      115s
rhos-cloud-dashboard-1.3     2m12s
rhos-cloudevents-dashboard   2m6s
rhos-dashboard-1.3           2m17s
virtual-machine-view-1.3     2m
-
Retrieve the Grafana route address:
$ oc get route grafana-route -ojsonpath='{.spec.host}' grafana-route-service-telemetry.apps.infra.watch
-
In a web browser, navigate to https://<grafana_route_address>. Replace <grafana_route_address> with the value that you retrieved in the previous step.
-
To view the dashboard, click Dashboards and Manage.
Retrieving and setting Grafana login credentials
Service Telemetry Framework (STF) sets default login credentials when Grafana is enabled. You can override the credentials in the ServiceTelemetry
object.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Retrieve the default username and password from the STF object:
$ oc get stf default -o jsonpath="{.spec.graphing.grafana['adminUser','adminPassword']}"
-
To modify the default values of the Grafana administrator username and password through the ServiceTelemetry object, use the
graphing.grafana.adminUser
andgraphing.grafana.adminPassword
parameters.
Metrics retention time period in Service Telemetry Framework
The default retention time for metrics stored in Service Telemetry Framework (STF) is 24 hours, which provides enough data for trends to develop for the purposes of alerting.
For long-term storage, use systems designed for long-term data retention, for example, Thanos.
-
To adjust STF for additional metrics retention time, see Editing the metrics retention time period in Service Telemetry Framework.
-
For recommendations about Prometheus data storage and estimating storage space, see https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects
-
For more information about Thanos, see https://thanos.io/
Editing the metrics retention time period in Service Telemetry Framework
You can adjust Service Telemetry Framework (STF) for additional metrics retention time.
-
Log in to OpenShift.
-
Change to the service-telemetry namespace:
$ oc project service-telemetry
-
Edit the ServiceTelemetry object:
$ oc edit stf default
-
Add retention: 7d to the storage section of backends.metrics.prometheus.storage to increase the retention period to seven days:

If you set a long retention period, retrieving data from heavily populated Prometheus systems can result in queries returning results slowly.

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: stf-default
  namespace: service-telemetry
spec:
  ...
  backends:
    metrics:
      prometheus:
        enabled: true
        storage:
          strategy: persistent
          retention: 7d
  ...
-
Save your changes and close the object.
-
For more information about the metrics retention time, see Metrics retention time period in Service Telemetry Framework.
Alerts in Service Telemetry Framework
You create alert rules in Prometheus and alert routes in Alertmanager. Alert rules in Prometheus servers send alerts to an Alertmanager, which manages the alerts. Alertmanager can silence, inhibit, or aggregate alerts, and send notifications by using email, on-call notification systems, or chat platforms.
To create an alert, complete the following tasks:
-
Create an alert rule in Prometheus. For more information, see Creating an alert rule in Prometheus.
-
Create an alert route in Alertmanager. You can create an alert route in one of two ways: a standard alert route, or an alert route with templating. For more information, see Creating a standard alert route in Alertmanager and Creating an alert route with templating in Alertmanager.
For more information about alerts or notifications with Prometheus and Alertmanager, see https://prometheus.io/docs/alerting/overview/
To view an example set of alerts that you can use with Service Telemetry Framework (STF), see https://github.com/infrawatch/service-telemetry-operator/tree/master/deploy/alerts
Creating an alert rule in Prometheus
Prometheus evaluates alert rules to trigger notifications. If the rule condition returns an empty result set, the condition is false. Otherwise, the rule is true and it triggers an alert.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Create a PrometheusRule object that contains the alert rule. The Prometheus Operator loads the rule into Prometheus:

$ oc apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: default
    role: alert-rules
  name: prometheus-alarm-rules
  namespace: service-telemetry
spec:
  groups:
    - name: ./openstack.rules
      rules:
        - alert: Collectd metrics receive rate is zero
          expr: rate(sg_total_collectd_msg_received_count[1m]) == 0 (1)
EOF
1 To change the rule, edit the value of the expr
parameter. -
To verify that the Operator loaded the rules into Prometheus, run the
curl
command against the default-prometheus-proxy route with basic authentication:$ curl -k --user "internal:$(oc get secret default-prometheus-htpasswd -ogo-template='{{ .data.password | base64decode }}')" https://$(oc get route default-prometheus-proxy -ogo-template='{{ .spec.host }}')/api/v1/rules {"status":"success","data":{"groups":[{"name":"./openstack.rules","file":"/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml","rules":[{"state":"inactive","name":"Collectd metrics receive count is zero","query":"rate(sg_total_collectd_msg_received_count[1m]) == 0","duration":0,"labels":{},"annotations":{},"alerts":[],"health":"ok","evaluationTime":0.00034627,"lastEvaluation":"2021-12-07T17:23:22.160448028Z","type":"alerting"}],"interval":30,"evaluationTime":0.000353787,"lastEvaluation":"2021-12-07T17:23:22.160444017Z"}]}}
-
For more information on alerting, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
Configuring custom alerts
You can add custom alerts to the PrometheusRule
object that you created in Creating an alert rule in Prometheus.
-
Use the
oc edit
command:$ oc edit prometheusrules prometheus-alarm-rules
-
Edit the PrometheusRules manifest. For a sketch of what an added custom rule might look like, see the example after this procedure. -
Save and close the manifest.
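A sketch of what the PrometheusRule manifest might contain after you add a custom rule; the second alert and its expression are illustrative examples built from the Smart Gateway metrics named elsewhere in this guide, not rules that ship with STF:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: default
    role: alert-rules
  name: prometheus-alarm-rules
  namespace: service-telemetry
spec:
  groups:
    - name: ./openstack.rules
      rules:
        - alert: Collectd metrics receive rate is zero
          expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
        - alert: Ceilometer metrics decode rate is zero
          expr: rate(sg_total_ceilometer_metric_decode_count[1m]) == 0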
-
For more information about how to configure alerting rules, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/.
-
For more information about PrometheusRules objects, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
Creating a standard alert route in Alertmanager
Use Alertmanager to deliver alerts to an external system, such as email, IRC, or other notification channel. The Prometheus Operator manages the Alertmanager configuration as an OpenShift secret. By default, Service Telemetry Framework (STF) deploys a basic configuration that results in no receivers:
alertmanager.yaml: |-
global:
resolve_timeout: 5m
route:
group_by: ['job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'null'
receivers:
- name: 'null'
To deploy a custom Alertmanager route with STF, you must pass an alertmanagerConfigManifest
parameter to the Service Telemetry Operator that results in an updated secret, managed by the Prometheus Operator.
If your alertmanagerConfigManifest contains a custom template to construct the title and text of the sent alert, deploy the contents of the alertmanagerConfigManifest using a base64-encoded configuration. For more information, see Creating an alert route with templating in Alertmanager.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Edit the
ServiceTelemetry
object for your STF deployment:$ oc edit stf default
-
Add the new parameter alertmanagerConfigManifest and the Secret object contents to define the alertmanager.yaml configuration for Alertmanager:

This step loads the default template that the Service Telemetry Operator manages. To verify that the changes are populating correctly, change a value, retrieve the alertmanager-default secret, and verify that the new value is loaded into memory. For example, change the value of the parameter global.resolve_timeout from 5m to 10m.

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true
  alertmanagerConfigManifest: |
    apiVersion: v1
    kind: Secret
    metadata:
      name: 'alertmanager-default'
      namespace: 'service-telemetry'
    type: Opaque
    stringData:
      alertmanager.yaml: |-
        global:
          resolve_timeout: 10m
        route:
          group_by: ['job']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 12h
          receiver: 'null'
        receivers:
          - name: 'null'
-
Verify that the configuration has been applied to the secret:
$ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'

global:
  resolve_timeout: 10m
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'null'
receivers:
  - name: 'null'
-
Run the
curl
command against thealertmanager-proxy
service to retrieve the status andconfigYAML
contents, and verify that the supplied configuration matches the configuration in Alertmanager:$ oc run curl -it --serviceaccount=prometheus-k8s --restart='Never' --image=radial/busyboxplus:curl -- sh -c "curl -k -H \"Content-Type: application/json\" -H \"Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" https://default-alertmanager-proxy:9095/api/v1/status" {"status":"success","data":{"configYAML":"...",...}}
-
Verify that the
configYAML
field contains the changes you expect. -
To clean up the environment, delete the
curl
pod:$ oc delete pod curl pod "curl" deleted
-
For more information about the OpenShift secret and the Prometheus operator, see Prometheus user guide on alerting.
Creating an alert route with templating in Alertmanager
Use Alertmanager to deliver alerts to an external system, such as email, IRC, or other notification channel. The Prometheus Operator manages the Alertmanager configuration as an OpenShift secret. By default, Service Telemetry Framework (STF) deploys a basic configuration that results in no receivers:
alertmanager.yaml: |-
global:
resolve_timeout: 5m
route:
group_by: ['job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'null'
receivers:
- name: 'null'
If the alertmanagerConfigManifest
parameter contains a custom template, for example, to construct the title and text of the sent alert, deploy the contents of the alertmanagerConfigManifest
by using a base64-encoded configuration.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Edit the
ServiceTelemetry
object for your STF deployment:$ oc edit stf default
-
To deploy a custom Alertmanager route with STF, you must pass an alertmanagerConfigManifest parameter to the Service Telemetry Operator that results in an updated secret that is managed by the Prometheus Operator:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true
  alertmanagerConfigManifest: |
    apiVersion: v1
    kind: Secret
    metadata:
      name: 'alertmanager-default'
      namespace: 'service-telemetry'
    type: Opaque
    data:
      alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogMTBtCiAgc2xhY2tfYXBpX3VybDogPHNsYWNrX2FwaV91cmw+CnJlY2VpdmVyczoKICAtIG5hbWU6IHNsYWNrCiAgICBzbGFja19jb25maWdzOgogICAgLSBjaGFubmVsOiAjc3RmLWFsZXJ0cwogICAgICB0aXRsZTogfC0KICAgICAgICAuLi4KICAgICAgdGV4dDogPi0KICAgICAgICAuLi4Kcm91dGU6CiAgZ3JvdXBfYnk6IFsnam9iJ10KICBncm91cF93YWl0OiAzMHMKICBncm91cF9pbnRlcnZhbDogNW0KICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJlY2VpdmVyOiAnc2xhY2snCg==
-
Verify that the configuration has been applied to the secret:
$ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'

global:
  resolve_timeout: 10m
  slack_api_url: <slack_api_url>
receivers:
  - name: slack
    slack_configs:
      - channel: #stf-alerts
        title: |-
          ...
        text: >-
          ...
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack'
-
Run the
curl
command against thealertmanager-proxy
service to retrieve the status andconfigYAML
contents, and verify that the supplied configuration matches the configuration in Alertmanager:$ oc run curl -it --serviceaccount=prometheus-k8s --restart='Never' --image=radial/busyboxplus:curl -- sh -c "curl -k -H \"Content-Type: application/json\" -H \"Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" https://default-alertmanager-proxy:9095/api/v1/status" {"status":"success","data":{"configYAML":"...",...}}
-
Verify that the
configYAML
field contains the changes you expect. -
To clean up the environment, delete the
curl
pod:$ oc delete pod curl pod "curl" deleted
-
For more information about the OpenShift secret and the Prometheus operator, see Prometheus user guide on alerting.
Configuring SNMP traps
You can integrate Service Telemetry Framework (STF) with an existing infrastructure monitoring platform that receives notifications through SNMP traps. To enable SNMP traps, modify the ServiceTelemetry
object and configure the snmpTraps
parameters.
For more information about configuring alerts, see Alerts in Service Telemetry Framework.
-
Know the IP address or hostname of the SNMP trap receiver where you want to send the alerts
-
To enable SNMP traps, modify the
ServiceTelemetry
object:$ oc edit stf default
-
Set the alerting.alertmanager.receivers.snmpTraps parameters:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
...
spec:
  ...
  alerting:
    alertmanager:
      receivers:
        snmpTraps:
          enabled: true
          target: 10.10.10.10
-
Ensure that you set the value of
target
to the IP address or hostname of the SNMP trap receiver.
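To confirm that the trap delivery component started, you can check for the SNMP webhook pod; a quick check, assuming the pod keeps the default-snmp-webhook naming that appears in the validation output later in this guide:

$ oc get pods | grep snmp-webhook

The output should show a default-snmp-webhook pod with a STATUS of Running.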
High availability
With high availability, Service Telemetry Framework (STF) can rapidly recover from failures in its component services. Although OpenShift restarts a failed pod if nodes are available to schedule the workload, this recovery process might take more than one minute, during which time events and metrics are lost. A high availability configuration includes multiple copies of STF components, which reduces recovery time to approximately 2 seconds. To protect against failure of an OpenShift node, deploy STF to an OpenShift cluster with three or more nodes.
STF is not yet a fully fault tolerant system. Delivery of metrics and events during the recovery period is not guaranteed. |
Enabling high availability has the following effects:
-
Three ElasticSearch pods run instead of the default one.
-
The following components run two pods instead of the default one:
-
Apache Qpid Dispatch Router
-
Alertmanager
-
Prometheus
-
Events Smart Gateway
-
Metrics Smart Gateway
-
-
Recovery time from a lost pod in any of these services reduces to approximately 2 seconds.
Configuring high availability
To configure Service Telemetry Framework (STF) for high availability, add highAvailability.enabled: true
to the ServiceTelemetry object in OpenShift. You can set this parameter at installation time or, if you already deployed STF, complete the following steps:
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Use the oc command to edit the ServiceTelemetry object:
$ oc edit stf default
-
Add highAvailability.enabled: true to the spec section:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
...
spec:
  ...
  highAvailability:
    enabled: true
-
Save your changes and close the object.
Ephemeral storage
You can use ephemeral storage to run Service Telemetry Framework (STF) without persistently storing data in your OpenShift cluster.
If you use ephemeral storage, you might experience data loss if a pod is restarted, updated, or rescheduled onto another node. Use ephemeral storage only for development or testing, and not production environments. |
Configuring ephemeral storage
To configure STF components for ephemeral storage, add ...storage.strategy: ephemeral
to the corresponding parameter. For example, to enable ephemeral storage for the Prometheus back end, set backends.metrics.prometheus.storage.strategy: ephemeral
. Components that support configuration of ephemeral storage include alerting.alertmanager
, backends.metrics.prometheus
, and backends.events.elasticsearch
. You can add ephemeral storage configuration at installation time or, if you already deployed STF, complete the following steps:
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Edit the ServiceTelemetry object:
$ oc edit stf default
-
Add the ...storage.strategy: ephemeral parameter to the spec section of the relevant component:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: stf-default
  namespace: service-telemetry
spec:
  alerting:
    enabled: true
    alertmanager:
      storage:
        strategy: ephemeral
  backends:
    metrics:
      prometheus:
        enabled: true
        storage:
          strategy: ephemeral
    events:
      elasticsearch:
        enabled: true
        storage:
          strategy: ephemeral
-
Save your changes and close the object.
Observability Strategy in Service Telemetry Framework
Service Telemetry Framework (STF) does not include storage backends and alerting tools. STF uses community operators to deploy Prometheus, Alertmanager, Grafana, and Elasticsearch. STF makes requests to these community operators to create instances of each application configured to work with STF.
Instead of having Service Telemetry Operator create custom resource requests, you can use your own deployments of these applications or other compatible applications, and scrape the metrics Smart Gateways for delivery to your own Prometheus-compatible system for telemetry storage. If you set the observability strategy to use alternative backends instead, persistent or ephemeral storage is not required for STF.
Configuring an alternate observability strategy
To configure STF to skip the deployment of storage, visualization, and alerting backends, add observabilityStrategy: none
to the ServiceTelemetry spec. In this mode, only Apache Qpid Dispatch Router routers and metrics Smart Gateways are deployed, and you must configure an external Prometheus-compatible system to collect metrics from the STF Smart Gateways.
Currently, only metrics are supported when you set observabilityStrategy to none. Events Smart Gateways are not deployed.
-
Create a ServiceTelemetry object with the property observabilityStrategy: none in the spec parameter. The manifest results in a default deployment of STF that is suitable for receiving telemetry from a single cloud with all metrics collector types.

$ oc apply -f - <<EOF
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  observabilityStrategy: none
EOF
-
To verify that all workloads are operating correctly, view the pods and the status of each pod:
$ oc get pods

NAME                                                      READY   STATUS    RESTARTS   AGE
default-cloud1-ceil-meter-smartgateway-59c845d65b-gzhcs   3/3     Running   0          132m
default-cloud1-coll-meter-smartgateway-75bbd948b9-d5phm   3/3     Running   0          132m
default-cloud1-sens-meter-smartgateway-7fdbb57b6d-dh2g9   3/3     Running   0          132m
default-interconnect-668d5bbcd6-57b2l                     1/1     Running   0          132m
interconnect-operator-b8f5bb647-tlp5t                     1/1     Running   0          47h
service-telemetry-operator-566b9dd695-wkvjq               1/1     Running   0          156m
smart-gateway-operator-58d77dcf7-6xsq7                    1/1     Running   0          47h
For more information about configuring additional clouds or to change the set of supported collectors, see Deploying Smart Gateways
Configuring openshift-monitoring to consume metrics from STF
You can configure openshift-monitoring to consume metrics from STF so that you can use the existing Prometheus deployment for STF data. This configuration is useful in combination with observabilityStrategy: none
as an alternative to the community operators. You must add a label to the namespace where STF is deployed, and create ServiceMonitor objects for each Smart Gateway intended to be scraped.
-
Edit the namespace object:
$ oc edit namespace service-telemetry
-
Add the openshift.io/cluster-monitoring: "true" label under the metadata property:

metadata:
  labels:
    openshift.io/cluster-monitoring: "true"
-
Create a ServiceMonitor object for each Smart Gateway:
$ for collector_type in ceil coll sens; do oc apply -f <(sed -e "s/<<COLLECTOR_TYPE>>/${collector_type}/g" << EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: smart-gateway
  name: default-cloud1-<<COLLECTOR_TYPE>>-meter
  namespace: service-telemetry
spec:
  endpoints:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      interval: 10s
      metricRelabelings:
        - action: labeldrop
          regex: pod
        - action: labeldrop
          regex: namespace
        - action: labeldrop
          regex: instance
        - action: labeldrop
          regex: job
        - action: labeldrop
          regex: publisher
      port: prom-https
      scheme: https
      tlsConfig:
        caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
        serverName: default-cloud1-<<COLLECTOR_TYPE>>-meter.service-telemetry.svc
  namespaceSelector:
    matchNames:
      - service-telemetry
  selector:
    matchLabels:
      app: smart-gateway
      smart-gateway: default-cloud1-<<COLLECTOR_TYPE>>-meter
EOF
); done

servicemonitor.monitoring.coreos.com/default-cloud1-ceil-meter configured
servicemonitor.monitoring.coreos.com/default-cloud1-coll-meter configured
servicemonitor.monitoring.coreos.com/default-cloud1-sens-meter configured
-
To verify the successful configuration of openshift-monitoring, ensure that Smart Gateway metrics appear in Prometheus.
-
Retrieve the route for the openshift-monitoring prometheus:
$ oc get route -n openshift-monitoring prometheus-k8s NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD prometheus-k8s prometheus-k8s-openshift-monitoring.apps.infra.watch prometheus-k8s web reencrypt/Redirect None
-
Visit the host in your browser and log in with OpenShift credentials.
-
Verify that the following targets are visible under the
Status -> Targets
tab:-
service-telemetry/default-cloud1-ceil-meter/0
-
service-telemetry/default-cloud1-coll-meter/0
-
service-telemetry/default-cloud1-sens-meter/0
If there are problems with the configuration, find them on this page.
-
-
Issue the following queries on the
Graph
tab:-
sg_total_collectd_metric_decode_count
-
sg_total_ceilometer_metric_decode_count
-
sg_total_sensubility_metric_decode_count
-
-
There should be one result from each Smart Gateway, as shown in the following example:
If the returned values are 0, STF is not yet receiving that type of metric; as long as a result is returned, however, the configuration of openshift-monitoring is correct. -
sg_total_collectd_metric_decode_count{container="sg-core", endpoint="prom-https", service="default-cloud1-coll-meter", source="SG"}
-
sg_total_ceilometer_metric_decode_count{container="sg-core", endpoint="prom-https", service="default-cloud1-ceil-meter", source="SG"}
-
sg_total_sensubility_metric_decode_count{container="sg-core", endpoint="prom-https", service="default-cloud1-sens-meter", source="SG"}
-
Resource usage of OpenStack services
You can monitor the resource usage of the OpenStack (OSP) services, such as the APIs and other infrastructure processes, to identify bottlenecks in the overcloud by showing services that run out of compute power. Resource usage monitoring is enabled by default.
-
To disable resource usage monitoring, see Disabling resource usage monitoring of OpenStack services.
Disabling resource usage monitoring of OpenStack services
To disable the monitoring of OSP containerized service resource usage, you must set the CollectdEnableLibpodstats
parameter to false
.
-
You have created the
stf-connectors.yaml
file. For more information, see Deploying OpenStack overcloud for Service Telemetry Framework. -
You are using the most current version of OpenStack (OSP) Train.
-
Open the
stf-connectors.yaml
file and add theCollectdEnableLibpodstats
parameter to override the setting inenable-stf.yaml
. Ensure thatstf-connectors.yaml
is called from theopenstack overcloud deploy
command afterenable-stf.yaml
:CollectdEnableLibpodstats: false
-
Continue with the overcloud deployment procedure. For more information, see Deploying the overcloud.
OpenStack API status and containerized services health
You can use the OCI (Open Container Initiative) standard to assess the container health status of each OpenStack (OSP) service by periodically running a health check script. Most OSP services implement a health check that logs issues and returns a binary status. For the OSP APIs, the health checks query the root endpoint and determine the health based on the response time.
Monitoring of OSP container health and API status is enabled by default.
-
To disable OSP container health and API status monitoring, see Disabling container health and API status monitoring.
Disabling container health and API status monitoring
To disable OSP containerized service health and API status monitoring, you must set the CollectdEnableSensubility
parameter to false
.
-
You have created the
stf-connectors.yaml
file in your templates directory. For more information, see Deploying OpenStack overcloud for Service Telemetry Framework. -
You are using the most current version of OpenStack (OSP) Train.
-
Open the
stf-connectors.yaml
and add theCollectdEnableSensubility
parameter to override the setting inenable-stf.yaml
. Ensure thatstf-connectors.yaml
is called from theopenstack overcloud deploy
command afterenable-stf.yaml
:CollectdEnableSensubility: false
-
Continue with the overcloud deployment procedure. For more information, see Deploying the overcloud.
-
For more information about multiple cloud addresses, see Configuring multiple clouds.
Upgrading Service Telemetry Framework to version 1.4
To migrate from Service Telemetry Framework (STF) v1.3 to STF v1.4, you must replace the ClusterServiceVersion
and Subscription
objects in the service-telemetry
namespace on your OpenShift environment.
-
You have upgraded your OpenShift environment to v4.8. STF v1.4 does not run on OpenShift versions less than v4.8.
-
You have backed up your data. Upgrading STF v1.3 to v1.4 results in a brief outage while the Smart Gateways and other components are updated. Additionally, changes to the
ServiceTelemetry
andSmartGateway
objects do not have any effect while the Operators are being replaced.
To upgrade from STF v1.3 to v1.4, complete the following procedures:
Removing STF 1.3 Smart Gateway Operator
Remove the Smart Gateway Operator from STF 1.3.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Retrieve the
Subscription
name of the Smart Gateway Operator. Replaceservice-telemetry
in the selector with the namespace that hosts your STF instance if it is different from the default namespace. Verify that only one subscription is returned:$ oc get sub --selector=operators.coreos.com/smart-gateway-operator.service-telemetry NAME PACKAGE SOURCE CHANNEL smart-gateway-operator-stable-1.3-redhat-operators-openshift-marketplace smart-gateway-operator redhat-operators stable-1.3
-
Delete the Smart Gateway Operator subscription:
$ oc delete sub --selector=operators.coreos.com/smart-gateway-operator.service-telemetry subscription.operators.coreos.com "smart-gateway-operator-stable-1.3-redhat-operators-openshift-marketplace" deleted
-
Retrieve the Smart Gateway Operator ClusterServiceVersion and verify that only one ClusterServiceVersion is returned:
$ oc get csv --selector=operators.coreos.com/smart-gateway-operator.service-telemetry NAME DISPLAY VERSION REPLACES PHASE smart-gateway-operator.v3.0.1635451893 Smart Gateway Operator 3.0.1635451893 Succeeded
-
Delete the Smart Gateway Operator ClusterServiceVersion:
$ oc delete csv --selector=operators.coreos.com/smart-gateway-operator.service-telemetry clusterserviceversion.operators.coreos.com "smart-gateway-operator.v3.0.1635451893" deleted
-
Delete the SmartGateway Custom Resource Definition (CRD). After removal of the CRD, no data flows to STF until the upgrade is completed and the Smart Gateway instances are reinstantiated:
$ oc delete crd smartgateways.smartgateway.infra.watch customresourcedefinition.apiextensions.k8s.io "smartgateways.smartgateway.infra.watch" deleted
Updating the Service Telemetry Operator to 1.4
You must change the subscription channel of the Service Telemetry Operator which manages the STF instances to the stable-1.4
channel.
-
Log in to OpenShift.
-
Change to the
service-telemetry
namespace:$ oc project service-telemetry
-
Patch the Service Telemetry Operator Subscription to use the stable-1.4 channel. Replace the
service-telemetry
in the selector with the namespace that hosts your STF instance if it is different from the default namespace:$ oc patch $(oc get sub --selector=operators.coreos.com/service-telemetry-operator.service-telemetry -oname) --patch $'spec:\n channel: stable-1.4' --type=merge subscription.operators.coreos.com/service-telemetry-operator patched
-
Monitor the output of the
oc get csv
command until the Smart Gateway Operator is installed and Service Telemetry Operator moves through the update phases. When the phase changes toSucceeded
, the Service Telemetry Operator has completed the update:$ watch -n5 oc get csv NAME DISPLAY VERSION REPLACES PHASE amq7-cert-manager.v1.0.3 Red Hat Integration - AMQ Certificate Manager 1.0.3 amq7-cert-manager.v1.0.2 Succeeded amq7-interconnect-operator.v1.10.5 Red Hat Integration - AMQ Interconnect 1.10.5 amq7-interconnect-operator.v1.10.4 Succeeded elasticsearch-eck-operator-certified.1.9.1 Elasticsearch (ECK) Operator 1.9.1 Succeeded prometheusoperator.0.47.0 Prometheus Operator 0.47.0 prometheusoperator.0.37.0 Succeeded service-telemetry-operator.v1.4.1641504218 Service Telemetry Operator 1.4.1641504218 service-telemetry-operator.v1.3.1635451892 Succeeded smart-gateway-operator.v4.0.1641504216 Smart Gateway Operator 4.0.1641504216 Succeeded
-
Validate that all pods are ready and running. Your environment might differ from the following example output:
$ oc get pods

NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-default-0                                    3/3     Running   0          162m
default-cloud1-ceil-event-smartgateway-5599bcfc9d-wp48n   2/2     Running   1          160m
default-cloud1-ceil-meter-smartgateway-c8fdf579c-955kt    3/3     Running   0          160m
default-cloud1-coll-event-smartgateway-97b54b7dc-5zz2v    2/2     Running   0          159m
default-cloud1-coll-meter-smartgateway-774b9988b8-wb5vd   3/3     Running   0          160m
default-cloud1-sens-meter-smartgateway-b98966fbf-rnqwf    3/3     Running   0          159m
default-interconnect-675dd97bc4-dcrzk                     1/1     Running   0          171m
default-snmp-webhook-7854d4889d-wgmgg                     1/1     Running   0          171m
elastic-operator-c54ff8cc-jcg8d                           1/1     Running   6          3h55m
elasticsearch-es-default-0                                1/1     Running   0          160m
interconnect-operator-6bf74c4ffb-hkmbq                    1/1     Running   0          3h54m
prometheus-default-0                                      3/3     Running   1          160m
prometheus-operator-fc64987d-f7gx4                        1/1     Running   0          3h54m
service-telemetry-operator-68d888f767-s5kzh               1/1     Running   0          163m
smart-gateway-operator-584df7959-llxgl                    1/1     Running   0          163m
collectd plugins
You can configure multiple collectd plugins depending on your OpenStack (OSP) Train environment.
The following list of plugins shows the available heat template ExtraConfig
parameters that you can set to override the defaults. Each section provides the general configuration name for the ExtraConfig
option. For example, if there is a collectd plugin called example_plugin
, the format of the plugin title is collectd::plugin::example_plugin
.
Reference the tables of available parameters for specific plugins, such as in the following example:
ExtraConfig:
  collectd::plugin::example_plugin::<parameter>: <value>
Reference the metrics tables of specific plugins for Prometheus or Grafana queries.
collectd::plugin::aggregation
You can aggregate several values into one with the aggregation
plugin. Use the aggregation functions such as sum
, average
, min
, and max
to calculate metrics, for example average and total CPU statistics.
Parameter | Type
---|---
host | String
plugin | String
plugininstance | Integer
agg_type | String
typeinstance | String
sethost | String
setplugin | String
setplugininstance | Integer
settypeinstance | String
groupby | Array of Strings
calculatesum | Boolean
calculatenum | Boolean
calculateaverage | Boolean
calculateminimum | Boolean
calculatemaximum | Boolean
calculatestddev | Boolean
Deploy three aggregation configurations, which results in the generation of the following files:
-
aggregator-calcCpuLoadAvg.conf
-
aggregator-calcCpuLoadMinMax.conf
-
aggregator-calcMemoryTotalMaxAvg.conf
The aggregation configurations use the default CPU and Memory plugin configurations. The following aggregations are created:
-
Calculate average CPU load for all CPU cores grouped by host and state.
-
Calculate the minimum and maximum CPU load, grouped by host and state.
-
Calculate maximum, average, and total for memory grouped by type.
parameter_defaults:
  CollectdExtraPlugins:
    - aggregation
  ExtraConfig:
    collectd::plugin::aggregation::aggregators:
      calcCpuLoadAvg:
        plugin: "cpu"
        agg_type: "cpu"
        groupby:
          - "Host"
          - "TypeInstance"
        calculateaverage: True
      calcCpuLoadMinMax:
        plugin: "cpu"
        agg_type: "cpu"
        groupby:
          - "Host"
          - "TypeInstance"
        calculatemaximum: True
        calculateminimum: True
      calcMemoryTotalMaxAvg:
        plugin: "memory"
        agg_type: "memory"
        groupby:
          - "TypeInstance"
        calculatemaximum: True
        calculateaverage: True
        calculatesum: True
collectd::plugin::amqp
collectd::plugin::amqp1
Use the amqp1
plugin to write values to an amqp1 message bus, for example, Apache Qpid Dispatch Router.
Parameter | Type
---|---
manage_package | Boolean
transport | String
host | String
port | Integer
user | String
password | String
address | String
instances | Hash
retry_delay | Integer
send_queue_limit | Integer
interval | Integer
Use the send_queue_limit
parameter to limit the length of the outgoing metrics queue.
If there is no AMQP1 connection, the plugin continues to queue messages to send, which can result in unbounded memory consumption. The default value is 0, which disables the outgoing metrics queue. |
Increase the value of the send_queue_limit
parameter if metrics are missing.
parameter_defaults:
  CollectdExtraPlugins:
    - amqp1
  ExtraConfig:
    collectd::plugin::amqp1::send_queue_limit: 5000
collectd::plugin::apache
Use the apache
plugin to collect Apache data.
Parameter | Type
---|---
instances | Hash
interval | Integer
manage-package | Boolean
package_install_options | List
parameter_defaults:
  ExtraConfig:
    collectd::plugin::apache:
      localhost:
        url: "http://10.0.0.111/status?auto"
For more information about configuring the apache
plugin, see apache.
collectd::plugin::battery
Use the battery
plugin to report the remaining capacity, power, or voltage of laptop batteries.
Parameter | Type
---|---
values_percentage | Boolean
report_degraded | Boolean
query_state_fs | Boolean
interval | Integer
For more information about configuring the battery
plugin, see battery.
collectd::plugin::bind
Use the bind
plugin to retrieve encoded statistics about queries and responses from a DNS server. The plugin submits the values to collectd.
collectd::plugin::ceph
Use the ceph
plugin to gather data from ceph daemons.
Parameter | Type
---|---
daemons | Array
longrunavglatency | Boolean
convertspecialmetrictypes | Boolean
manage_package | Boolean
package_name | String
parameter_defaults:
  ExtraConfig:
    collectd::plugin::ceph::daemons:
      - ceph-osd.0
      - ceph-osd.1
      - ceph-osd.2
      - ceph-osd.3
      - ceph-osd.4
If an Object Storage Daemon (OSD) is not on every node, you must list the OSDs. |
When you deploy collectd, the ceph plugin is added to the Ceph nodes. Do not add the ceph plugin on Ceph nodes to CollectdExtraPlugins, because this results in a deployment failure.
For more information about configuring the ceph
plugin, see ceph.
collectd::plugins::cgroups
Use the cgroups
plugin to collect information for processes in a cgroup.
Parameter | Type
---|---
ignore_selected | Boolean
interval | Integer
cgroups | List
For more information about configuring the cgroups
plugin, see cgroups.
collectd::plugin::connectivity
Use the connectivity plugin to monitor the state of network interfaces.
If no interfaces are listed, all interfaces are monitored by default. |
Parameter | Type
---|---
interfaces | Array
parameter_defaults:
  ExtraConfig:
    collectd::plugin::connectivity::interfaces:
      - eth0
      - eth1
For more information about configuring the connectivity
plugin, see connectivity.
collectd::plugin::conntrack
Use the conntrack
plugin to track the number of entries in the Linux connection-tracking table. There are no parameters for this plugin.
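Because the plugin takes no parameters, enabling it is a single addition to the plugin list; a minimal sketch that follows the CollectdExtraPlugins pattern used by the other examples in this section:

parameter_defaults:
  CollectdExtraPlugins:
    - conntrack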
collectd::plugin::contextswitch
Use the ContextSwitch
plugin to collect the number of context switches that the system handles.
For more information about configuring the contextswitch
plugin, see contextswitch.
collectd::plugin::cpu
Use the cpu
plugin to monitor the time that the CPU spends in various states, for example, idle, executing user code, executing system code, waiting for IO-operations, and other states.
The cpu plugin collects jiffies, not percentage values. The value of a jiffy depends on the clock frequency of your hardware platform, and therefore is not an absolute time interval unit.
To report a percentage value, set the Boolean parameters reportbycpu and reportbystate to true, and then set the Boolean parameter valuespercentage to true.
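For example, a sketch that enables percentage reporting by setting the three parameters named above, following the ExtraConfig pattern used by the other examples in this section:

parameter_defaults:
  CollectdExtraPlugins:
    - cpu
  ExtraConfig:
    collectd::plugin::cpu::reportbycpu: true
    collectd::plugin::cpu::reportbystate: true
    collectd::plugin::cpu::valuespercentage: true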
Name | Description | Query
---|---|---
idle | Amount of idle time | collectd_cpu_total{…,type_instance=idle}
interrupt | CPU blocked by interrupts | collectd_cpu_total{…,type_instance=interrupt}
nice | Amount of time running low priority processes | collectd_cpu_total{…,type_instance=nice}
softirq | Amount of cycles spent in servicing interrupt requests | collectd_cpu_total{…,type_instance=waitirq}
steal | The percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor | collectd_cpu_total{…,type_instance=steal}
system | Amount of time spent on system level (kernel) | collectd_cpu_total{…,type_instance=system}
user | Jiffies that user processes use | collectd_cpu_total{…,type_instance=user}
wait | CPU waiting on outstanding I/O request | collectd_cpu_total{…,type_instance=wait}
Parameter | Type |
---|---|
reportbystate | Boolean |
valuespercentage | Boolean |
reportbycpu | Boolean |
reportnumcpu | Boolean |
reportgueststate | Boolean |
subtractgueststate | Boolean |
interval | Integer |
parameter_defaults:
  CollectdExtraPlugins:
    - cpu
  ExtraConfig:
    collectd::plugin::cpu::reportbystate: true
For more information about configuring the cpu
plugin, see cpu.
collectd::plugin::cpufreq
- None
collectd::plugin::cpusleep
collectd::plugin::csv
- collectd::plugin::csv::datadir
- collectd::plugin::csv::storerates
- collectd::plugin::csv::interval
collectd::plugin::curl_json
collectd::plugin::curl
collectd::plugin::curl_xml
collectd::plugin::dbi
collectd::plugin::df
Use the df
plugin to collect disk space usage information for file systems.
Name | Description | Query |
---|---|---|
free | Amount of free disk space | collectd_df_df_complex{…, type_instance="free"} |
reserved | Amount of reserved disk space | collectd_df_df_complex{…, type_instance="reserved"} |
used | Amount of used disk space | collectd_df_df_complex{…, type_instance="used"} |
Parameter | Type |
---|---|
devices | Array |
fstypes | Array |
ignoreselected | Boolean |
mountpoints | Array |
reportbydevice | Boolean |
reportinodes | Boolean |
reportreserved | Boolean |
valuesabsolute | Boolean |
valuespercentage | Boolean |
parameter_defaults:
  CollectdExtraPlugins:
    - df
  ExtraConfig:
    collectd::plugin::df::fstypes: ['tmpfs','xfs']
For more information about configuring the df
plugin, see df.
collectd::plugin::disk
Use the disk
plugin to collect performance statistics of hard disks and, if supported, partitions. This plugin is enabled by default.
Parameter | Type |
---|---|
disks | Array |
ignoreselected | Boolean |
udevnameattr | String |
Name | Description |
---|---|
merged | The number of queued operations that could be merged together, for example, one physical disk access that served two or more logical operations. |
time | The average time an I/O operation takes to complete. The values might not be fully accurate. |
io_time | Time spent doing I/O, in milliseconds. You can use this metric as a device load percentage: a value of 1 second matches 100% load. |
weighted_io_time | Measure of both I/O completion time and the backlog that might be accumulating. |
pending_operations | Shows the queue size of pending I/O operations. |
parameter_defaults:
  ExtraConfig:
    collectd::plugin::disk::disks: ["sda"]
    collectd::plugin::disk::ignoreselected: false
For more information about configuring the disk
plugin, see disk.
collectd::plugin::dns
collectd::plugin::entropy
- collectd::plugin::entropy::interval
collectd::plugin::ethstat
- collectd::plugin::ethstat::interfaces
- collectd::plugin::ethstat::maps
- collectd::plugin::ethstat::mappedonly
- collectd::plugin::ethstat::interval
collectd::plugin::exec
- collectd::plugin::exec::commands
- collectd::plugin::exec::commands_defaults
- collectd::plugin::exec::globals
- collectd::plugin::exec::interval
collectd::plugin::fhcount
- collectd::plugin::fhcount::valuesabsolute
- collectd::plugin::fhcount::valuespercentage
- collectd::plugin::fhcount::interval
collectd::plugin::filecount
- collectd::plugin::filecount::directories
- collectd::plugin::filecount::interval
collectd::plugin::fscache
- None
collectd-hddtemp
- collectd::plugin::hddtemp::host
- collectd::plugin::hddtemp::port
- collectd::plugin::hddtemp::interval
collectd::plugin::hugepages
Use the hugepages plugin to collect hugepages information. This plugin is enabled by default.
Parameter | Type | Defaults |
---|---|---|
report_per_node_hp | Boolean | true |
report_root_hp | Boolean | true |
values_pages | Boolean | true |
values_bytes | Boolean | false |
values_percentage | Boolean | false |
parameter_defaults:
  ExtraConfig:
    collectd::plugin::hugepages::values_percentage: true
- For more information about configuring the hugepages plugin, see hugepages.
collectd::plugin::intel_rdt
collectd::plugin::interface
Use the interface
plugin to measure interface traffic in octets, packets per second, and error rate per second. This plugin is enabled by default.
Parameter | Type | Default |
---|---|---|
interfaces | Array | [] |
ignoreselected | Boolean | false |
reportinactive | Boolean | true |
parameter_defaults:
  ExtraConfig:
    collectd::plugin::interface::interfaces:
      - lo
    collectd::plugin::interface::ignoreselected: true
- For more information about configuring the interfaces plugin, see interfaces.
collectd::plugin::ipc
- None
collectd::plugin::ipmi
- collectd::plugin::ipmi::ignore_selected
- collectd::plugin::ipmi::notify_sensor_add
- collectd::plugin::ipmi::notify_sensor_remove
- collectd::plugin::ipmi::notify_sensor_not_present
- collectd::plugin::ipmi::sensors
- collectd::plugin::ipmi::interval
collectd::plugin::iptables
collectd::plugin::irq
- collectd::plugin::irq::irqs
- collectd::plugin::irq::ignoreselected
- collectd::plugin::irq::interval
collectd::plugin::load
Use the load
plugin to collect the system load and an overview of the system use. This plugin is enabled by default.
Parameter | Type |
---|---|
report_relative | Boolean |
parameter_defaults:
  ExtraConfig:
    collectd::plugin::load::report_relative: false
- For more information about configuring the load plugin, see load.
collectd::plugin::logfile
- collectd::plugin::logfile::log_level
- collectd::plugin::logfile::log_file
- collectd::plugin::logfile::log_timestamp
- collectd::plugin::logfile::print_severity
- collectd::plugin::logfile::interval
collectd::plugin::log_logstash
collectd::plugin::madwifi
collectd::plugin::match_empty_counter
collectd::plugin::match_hashed
collectd::plugin::match_regex
collectd::plugin::match_timediff
collectd::plugin::match_value
collectd::plugin::mbmon
collectd::plugin::mcelog
Use the mcelog
plugin to send notifications and statistics that are relevant to Machine Check Exceptions when they occur. Configure mcelog
to run in daemon mode and enable logging capabilities.
Parameter | Type |
---|---|
Mcelogfile | String |
Memory | Hash { mcelogclientsocket[string], persistentnotification[boolean] } |
parameter_defaults:
  CollectdExtraPlugins: mcelog
  CollectdEnableMcelog: true
- For more information about configuring the mcelog plugin, see mcelog.
collectd::plugin::md
collectd::plugin::memcachec
collectd::plugin::memcached
- collectd::plugin::memcached::instances
- collectd::plugin::memcached::interval
collectd::plugin::memory
The memory
plugin provides information about the memory of the system. This plugin is enabled by default.
Parameter | Type |
---|---|
valuesabsolute | Boolean |
valuespercentage | Boolean |
parameter_defaults:
  ExtraConfig:
    collectd::plugin::memory::valuesabsolute: true
    collectd::plugin::memory::valuespercentage: false
- For more information about configuring the memory plugin, see memory.
collectd::plugin::multimeter
collectd::plugin::mysql
- collectd::plugin::mysql::interval
collectd::plugin::netlink
- collectd::plugin::netlink::interfaces
- collectd::plugin::netlink::verboseinterfaces
- collectd::plugin::netlink::qdiscs
- collectd::plugin::netlink::classes
- collectd::plugin::netlink::filters
- collectd::plugin::netlink::ignoreselected
- collectd::plugin::netlink::interval
collectd::plugin::network
- collectd::plugin::network::timetolive
- collectd::plugin::network::maxpacketsize
- collectd::plugin::network::forward
- collectd::plugin::network::reportstats
- collectd::plugin::network::listeners
- collectd::plugin::network::servers
- collectd::plugin::network::interval
collectd::plugin::nfs
- collectd::plugin::nfs::interval
collectd::plugin::notify_nagios
collectd::plugin::ntpd
- collectd::plugin::ntpd::host
- collectd::plugin::ntpd::port
- collectd::plugin::ntpd::reverselookups
- collectd::plugin::ntpd::includeunitid
- collectd::plugin::ntpd::interval
collectd::plugin::numa
- None
collectd::plugin::olsrd
collectd::plugin::openldap
collectd::plugin::openvpn
- collectd::plugin::openvpn::statusfile
- collectd::plugin::openvpn::improvednamingschema
- collectd::plugin::openvpn::collectcompression
- collectd::plugin::openvpn::collectindividualusers
- collectd::plugin::openvpn::collectusercount
- collectd::plugin::openvpn::interval
collectd::plugin::ovs_stats
Use the ovs_stats
plugin to collect statistics of OVS-connected interfaces. The ovs_stats
plugin uses the OVSDB management protocol (RFC7047) monitor mechanism to get statistics from OVSDB.
Parameter | Type |
---|---|
address | String |
bridges | List |
port | Integer |
socket | String |
The following example shows how to enable the ovs_stats
plugin. If you deploy your overcloud with OVS, you do not need to enable the ovs_stats
plugin.
parameter_defaults:
  CollectdExtraPlugins:
    - ovs_stats
  ExtraConfig:
    collectd::plugin::ovs_stats::socket: '/run/openvswitch/db.sock'
- For more information about configuring the ovs_stats plugin, see ovs_stats.
collectd::plugin::pcie_errors
Use the pcie_errors
plugin to poll PCI config space for baseline and Advanced Error Reporting (AER) errors, and to parse syslog for AER events. Errors are reported through notifications.
Parameter | Type |
---|---|
source | Enum (sysfs, proc) |
access | String |
reportmasked | Boolean |
persistent_notifications | Boolean |
parameter_defaults:
  CollectdExtraPlugins:
    - pcie_errors
- For more information about configuring the pcie_errors plugin, see pcie_errors.
collectd::plugin::ping
- collectd::plugin::ping::hosts
- collectd::plugin::ping::timeout
- collectd::plugin::ping::ttl
- collectd::plugin::ping::source_address
- collectd::plugin::ping::device
- collectd::plugin::ping::max_missed
- collectd::plugin::ping::size
- collectd::plugin::ping::interval
collectd::plugin::powerdns
- collectd::plugin::powerdns::servers
- collectd::plugin::powerdns::recursors
- collectd::plugin::powerdns::local_socket
- collectd::plugin::powerdns::interval
collectd::plugin::processes
The processes
plugin provides information about system processes. This plugin is enabled by default.
Parameter | Type |
---|---|
processes | Array |
process_matches | Array |
collect_context_switch | Boolean |
collect_file_descriptor | Boolean |
collect_memory_maps | Boolean |
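The following example is a minimal sketch, using the parameters listed above, that limits collection to a named process and enables context-switch statistics. Because the plugin is enabled by default, only ExtraConfig is shown; the process name nova-compute is illustrative only:
parameter_defaults:
  ExtraConfig:
    # Track only the listed process; the name is an example, not a default.
    collectd::plugin::processes::processes:
      - nova-compute
    collectd::plugin::processes::collect_context_switch: true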
- For more information about configuring the processes plugin, see processes.
collectd::plugin::protocols
- collectd::plugin::protocols::ignoreselected
- collectd::plugin::protocols::values
collectd::plugin::python
collectd::plugin::sensors
collectd::plugin::serial
collectd::plugin::smart
- collectd::plugin::smart::disks
- collectd::plugin::smart::ignoreselected
- collectd::plugin::smart::interval
collectd::plugin::snmp
collectd::plugin::snmp_agent
Use the snmp_agent
plugin as an SNMP subagent to map collectd metrics to relevant OIDs. The snmp agent also requires a running snmpd service.
parameter_defaults:
  CollectdExtraPlugins: snmp_agent
resource_registry:
  OS::TripleO::Services::Snmp: /usr/share/openstack-tripleo-heat-templates/deployment/snmp/snmp-baremetal-puppet.yaml
For more information about how to configure snmp_agent
, see snmp_agent.
collectd::plugin::statsd
- collectd::plugin::statsd::host
- collectd::plugin::statsd::port
- collectd::plugin::statsd::deletecounters
- collectd::plugin::statsd::deletetimers
- collectd::plugin::statsd::deletegauges
- collectd::plugin::statsd::deletesets
- collectd::plugin::statsd::countersum
- collectd::plugin::statsd::timerpercentile
- collectd::plugin::statsd::timerlower
- collectd::plugin::statsd::timerupper
- collectd::plugin::statsd::timersum
- collectd::plugin::statsd::timercount
- collectd::plugin::statsd::interval
collectd::plugin::swap
- collectd::plugin::swap::reportbydevice
- collectd::plugin::swap::reportbytes
- collectd::plugin::swap::valuesabsolute
- collectd::plugin::swap::valuespercentage
- collectd::plugin::swap::reportio
- collectd::plugin::swap::interval
collectd::plugin::sysevent
collectd::plugin::syslog
- collectd::plugin::syslog::log_level
- collectd::plugin::syslog::notify_level
- collectd::plugin::syslog::interval
collectd::plugin::table
- collectd::plugin::table::tables
- collectd::plugin::table::interval
collectd::plugin::tail
- collectd::plugin::tail::files
- collectd::plugin::tail::interval
collectd::plugin::tail_csv
- collectd::plugin::tail_csv::metrics
- collectd::plugin::tail_csv::files
collectd::plugin::target_notification
collectd::plugin::target_replace
collectd::plugin::target_scale
collectd::plugin::target_set
collectd::plugin::target_v5upgrade
collectd::plugin::tcpconns
- collectd::plugin::tcpconns::localports
- collectd::plugin::tcpconns::remoteports
- collectd::plugin::tcpconns::listening
- collectd::plugin::tcpconns::allportssummary
- collectd::plugin::tcpconns::interval
collectd::plugin::ted
collectd::plugin::thermal
- collectd::plugin::thermal::devices
- collectd::plugin::thermal::ignoreselected
- collectd::plugin::thermal::interval
collectd::plugin::threshold
- collectd::plugin::threshold::types
- collectd::plugin::threshold::plugins
- collectd::plugin::threshold::hosts
- collectd::plugin::threshold::interval
collectd::plugin::turbostat
- collectd::plugin::turbostat::core_c_states
- collectd::plugin::turbostat::package_c_states
- collectd::plugin::turbostat::system_management_interrupt
- collectd::plugin::turbostat::digital_temperature_sensor
- collectd::plugin::turbostat::tcc_activation_temp
- collectd::plugin::turbostat::running_average_power_limit
- collectd::plugin::turbostat::logical_core_names
collectd::plugin::unixsock
collectd::plugin::uptime
- collectd::plugin::uptime::interval
collectd::plugin::users
- collectd::plugin::users::interval
collectd::plugin::uuid
- collectd::plugin::uuid::uuid_file
- collectd::plugin::uuid::interval
collectd::plugin::virt
Use the virt
plugin to collect CPU, disk, network load, and other metrics through the libvirt
API for virtual machines on the host.
Parameter | Type |
---|---|
connection | String |
refresh_interval | Hash |
domain | String |
block_device | String |
interface_device | String |
ignore_selected | Boolean |
plugin_instance_format | String |
hostname_format | String |
interface_format | String |
extra_stats | String |
ExtraConfig:
  collectd::plugin::virt::plugin_instance_format: name
For more information about configuring the virt
plugin, see virt.
collectd::plugin::vmem
- collectd::plugin::vmem::verbose
- collectd::plugin::vmem::interval
collectd::plugin::vserver
collectd::plugin::wireless
collectd::plugin::write_graphite
- collectd::plugin::write_graphite::carbons
- collectd::plugin::write_graphite::carbon_defaults
- collectd::plugin::write_graphite::globals
collectd::plugin::write_http
Use the write_http
output plugin to submit values to an HTTP server by using POST requests and encoding metrics with JSON, or by using the PUTVAL
command.
Parameter | Type |
---|---|
ensure | Enum[present, absent] |
nodes | Hash[String, Hash[String, Scalar]] |
urls | Hash[String, Hash[String, Scalar]] |
manage_package | Boolean |
parameter_defaults:
  CollectdExtraPlugins:
    - write_http
  ExtraConfig:
    collectd::plugin::write_http::nodes:
      collectd:
        url: "http://collectd.tld.org/collectd"
        metrics: true
        header: "X-Custom-Header: custom_value"
- For more information about configuring the write_http plugin, see write_http.
collectd::plugin::write_kafka
Use the write_kafka
plugin to send values to a Kafka topic. Configure the write_kafka
plugin with one or more topic blocks. For each topic block, you must specify a unique name and one Kafka producer. You can use the following per-topic parameters inside the topic block:
Parameter | Type |
---|---|
kafka_hosts | Array[String] |
kafka_port | Integer |
topics | Hash |
properties | Hash |
meta | Hash |
parameter_defaults:
  CollectdExtraPlugins:
    - write_kafka
  ExtraConfig:
    collectd::plugin::write_kafka::kafka_hosts:
      - nodeA
      - nodeB
    collectd::plugin::write_kafka::topics:
      some_events:
        format: JSON
For more information about how to configure the write_kafka
plugin, see write_kafka.
collectd::plugin::write_log
- collectd::plugin::write_log::format
collectd::plugin::zfs_arc
- None