6. Alerting

How can we help?
< All Topics
Print

Alerting

Performance Store comes with many built-in alerts. Alerts are fully customizable and custom alerts can be created.

Alerts can be created to trigger on certain events (e.g. CPU usage above 80 percent), and rules can be defined (e.g. CPU usage should be above 80 percent for 3 minutes without any drops below 80 percent). Alerts can be triggered on a change in delta values e.g.:

  • an agent job run duration deviates more than 30 percent from the average run duration
  • a login has been granted sysadmin rights

Alerts can be created for any data available in Performance Store.

Alerting channels

Performance Store is having native support for sending alerts to the following channels:

  • Atlassian Opsgenie
  • Microsoft Teams
  • Email
  • PowerShell

Alerting channels are set up in the dbo.team_alert table.

When sending alerts to a mail target, alerts will be grouped in the same mail to avoid sending one email per alert.

When integrating with Opsgenie, alerts will automatically be closed in Opsgenie when the alert is no longer triggered by Performance Store.

Alerting channel teams

Different alerting channels are defined by alerting channel teams. Each team can have different alert triggers.

By default all alerts are send out to the alerting channels. If alerts should be ignored and not send out to specific teams, rules can be configured in the dbo.team_alert_exclude table:

alert_operator can be: <, > and =

To only get specific alerts or only alerts for specific servers, add them to the dbo.team_alert_include table.

Alerting channel teams are useful for e.g. directing alerts for specific servers to specific teams. One team might prefer to get alerts on Microsoft Teams and email, while another team might prefer to receive alerts only in ServiceNow.

Alerts

Alerts are defined in the dbo.alerts table:

If “received_within_minutes” is 0, an alert is thrown if the alert is triggered consequently within the last [alert_receive_count_trigger] samples.

Example: If an alert should be be thrown if 3 concurrent occurrences has happened, set the value of alert_receive_count_trigger to 3.

Red line is the alert threshold

If “received_within_minutes” is different from 0, an alert is thrown if the alert is triggered [alert_receive_count_trigger] number of times within the last [received_within_minutes] minutes.

Example: If an alerts should be thrown if 3 occurrences has happened within 10 minutes, set the value of received_within_minutes to 10.

Red line is the alert threshold

All alerts contains a link to the Performance Store documentation about the alert. Custom alerts can also include an optional link. This value is set in the link column in dbo.alerts.

An alert is only send once while it is relevant. That means, when the criterias for throwing an alert is met, the alert is thrown. For similar subsequent events, alerts are not thrown even if they are fulfilling the criterias. This is to prevent the same alert to be thrown again and again, e.g. if a SQL Agent job fails, an alert is not send every minute. Before a new alert is thrown for the failed SQL Agent job, a successful execution of the SQL Agent job must have taken place.

If the alerts is still exists after the number of days stated in [retention_days], the alert will be rethrown. This is to ensure, that no alerts will be forgotten. The rethrow will occur every number of days stated by [retention_days].

If [retention_days] is -1, the alert will not be rethrown.

A rethrown alert for a SQL Server with an inactive AD group configured.

Alerts with a priority of Critical or High, will be shown on the dashboard if the alerts are thrown on production servers.

Example on how to see latest “High CPU usage” alerts for “LHG_SQL01\SS01”:

select s.*
from dbo.v_latest_stats s
where 1 = 1
and s.server_name = 'LHG_SQL01\SS01'
and s.name = 'High CPU usage'

To see alerts that is partially fulfilling a criteria, but have not yet been thrown:

select a.*
from dbo.alert_status a
where 1 = 1
and a.server_name = 'LHG_SQL01\SS01'
and a.name = 'High CPU usage'

To see thrown alerts:

select al.*
from dbo.alert_log al
where 1 = 1
and al.server_name = 'LHG_SQL01\SS01'
and al.name = 'High CPU usage'

Custom alerts

Custom alerts can be created by adding rows to dbo.alerts.

When doing so, set the value of is_system to 0.

Alerts are defined by the t-sql statement in the sql column in dbo.alerts. The t-sql statements must have the following 3 columns: server_name, object_name, count

Example

To create a custom alert, alerting if a database is on a SQL Server 2019, but the compatibility level has not been set to 150, use the following t-sql statement in the sql column:

select s.server_name, d.database_name object_name, 1 [count]
from dbo.v_servers s
inner join dbo.v_databases d on d.server_name = s.server_name
where substring(s.sql_server_version, 1, charindex('.', s.sql_server_version) - 1) = 15
and d.compatibility_level != 150
group by s.server_name, d.database_name

The count column defines how many occurrences that should be reported as a part of the alert, but as databases are unique on each SQL Server, the value is set to 1 in the above example.

Example of an alert send to Microsoft Teams

Performance Store comes with the following 3 example custom alerts that are all disabled by default:

  • Full backups older than expected
  • Diff backups older than expected
  • Log backups older than expected

These can freely be modified and enabled.

See Creating a custom alert for a walkthrough in creating a custom alert.

Sending alerts to PowerShell

Alerts can be send to a PowerShell script, and from there integrated into other application and services.

The path to the PowerShell file is set in the dbo.team_alert table in the powershell_script_path column.

When alerts are fired, the PowerShell file is then called.

Example of a PowerShell script file:

param([string]$alertType, [string]$sampleId, [string]$sampleTime, [string]$serverName, [string]$objectName, [string]$count, [string]$name, [string]$link, [string]$priority, [string]$retentionDays)

[string]$message = [string]::Format("AlertType: {0}`r`nSampleId: {1}`r`nSampleTime: {2}`r`nServerName: {3}`r`nObjectName: {4}`r`nCount: {5}`r`nName: {6}`r`nLink: {7}`r`nPriority: {8}`r`nRetentionDays: {9}", $alertType, $sampleId, $sampleTime, $serverName, $objectName, $count, $name, $link, $priority, $retentionDays)

[string]$logFile = "C:\Temp\Alert_$((Get-Date).ToUniversalTime().ToString("yyyyMMdd_HHmmss")).txt"
Write-Output $message | Out-File $logFile

Example of using a PowerShell script file to create a Windows Event log entry on the affected machine:

param([string]$alertType, [string]$sampleId, [string]$sampleTime, [string]$serverName, [string]$objectName, [string]$count, [string]$name, [string]$link, [string]$priority, [string]$retentionDays)
[string]$message = [string]::Format("AlertType: {0}`r`nSampleId: {1}`r`nSampleTime: {2}`r`nServerName: {3}`r`nObjectName: {4}`r`nCount: {5}`r`nName: {6}`r`nLink: {7}`r`nPriority: {8}`r`nRetentionDays: {9}", $alertType, $sampleId, $sampleTime, $serverName, $objectName, $count, $name, $link, $priority, $retentionDays)

function GetServerNameWithoutInstance
{
	param([string]$serverName)

	$serverNameWithoutInstance = $serverName

	if ($serverName.Contains("\"))
	{
		$serverNameWithoutInstance = $serverName.Substring(0, $serverName.IndexOf("\"))
	}
	elseif ($serverName.Contains(","))
	{
		$serverNameWithoutInstance = $serverName.Substring(0, $serverName.IndexOf(","))
	}

	return $serverNameWithoutInstance
}

$serverNameWithoutInstance = GetServerNameWithoutInstance -serverName $serverName

$sourceName = "Performance Store"
$eventId = 3001

$script =
{
	param([string]$sourceName, [string]$serverNameWithoutInstance, [string]$message, [string]$eventId)

	if (-not [System.Diagnostics.EventLog]::SourceExists($sourceName, $serverNameWithoutInstance))
	{
		New-EventLog -ComputerName $serverNameWithoutInstance -LogName "Application" -Source $sourceName
	}

	Write-EventLog -LogName "Application" -Source $sourceName -EntryType Error -EventID $eventId -Message $message
}

Invoke-Command -Computer $serverNameWithoutInstance -ScriptBlock $script -ArgumentList $sourceName, $serverNameWithoutInstance, $message, $eventId

Sending alerts to Opsgenie

Performance Store integrates directly into Opsgenie.

If sending alerts to Opsgenie, new alerts from Performance Store will automatically be created as new alerts in Opsgenie. When an alert is no longer active in Performance Store, the corresponding alert will automatically be closed in Opsgenie.

In many monitoring systems, alerts are handled one-to-one meaning if an alert is created, the alert will be closed when the original alert has been solved. Due to the nature of Performance Store, not only one-to-one alerts will be closed, but also dependent alerts.

The following example shows how dependent alerts works:

A server is experiencing high memory pressure and high CPU usage, and two alerts are thrown – one for memory pressure and one for high CPU usage.

The server crashes and creates a third alert for the server crash.

When the server is up running again, it is no longer experiencing memory pressure or high CPU usage, so all three alerts will be marked as solved even though only one of them (the server crash) was actively handled. In this case, the alerts for memory pressure and high CPU usage is no longer relevant, and is not required to be manually closed, as they have already been automatically closed by Performance Store.

Note, the correct Opsgenie API url must be set in the Settings menu. Depending on the region, the value for “OpsGenieUrl” must be:

Depending on the Opsgenie plan, the value for “Atlassian Opsgenie API rate limit” should be set accordingly in the Settings menu. Default value is 2000 which correspond to a free Opsgenie plan.

Built-in alerts

By default, Performance Store can send alerts if any of the following events occurs:

  • Aborted executions
  • Agent job differences
  • Agent not running
  • Audit not enabled
  • Auto close is on
  • Auto shrink is on
  • Availability Group failover
  • Blockings
  • Collation mismatch
  • Data drive with incorrect cluster size
  • Database differences
  • Database role member added
  • Databases not backed up
  • Deadlocks
  • Diff backups older than expected
  • Disk drives running below 20 % free space within a week
  • Disk drives running empty within a week
  • Disk drives with less than 20 % free space
  • Disk latency too high
  • Drive differences
  • Errors
  • Failed SQL Agent jobs
  • Full backups older than expected
  • High CPU usage
  • High log space usage
  • High row count change
  • High VLF count
  • Inactive groups
  • Inactive users
  • Incorrect memory configuration
  • Incorrect tempdb configuration
  • License increase
  • Log backups older than expected
  • Log drive with incorrect cluster size
  • Login errors
  • Memory pressure
  • Non-copy-only snapshot backups are taken
  • Non-equal file growth detected
  • Non-existing default database for user
  • Non-healthy database availability group state
  • Plan cache can be optimized
  • Primary and secondary placed on same host
  • Primary- and secondary availability groups on same server
  • Redo queue size above threshold
  • Replicas with non-identical service accounts
  • SA user not disabled
  • SA user with default name
  • Server configuration differences
  • Server differences
  • Server role member added
  • Server securables differences
  • Server security differences
  • Server unreachable
  • SQL Server with databases not in availability group
  • Sysadmins added
  • Sysadmins removed
  • Transaction log full
  • Unused CPUs
  • Unused databases
  • Virtual servers with no matching entry in dbo.hosts
  • Virtual servers without a host_name set
  • Wrong Agent Job owner
  • Wrong availability group owner
  • Wrong database owner
  • Wrong mirroring endpoint owner
Table of Contents