Introduction¶

Quick Start¶

First, install urlwatch by following Installation. Once installed:

Run urlwatch once to migrate your old data or start fresh
Use urlwatch --edit to customize jobs and filters (urls.yaml)
Use urlwatch --edit-config to customize settings and reporters (urlwatch.yaml)
Add urlwatch to your crontab (crontab -e) to monitor webpages periodically

The checking interval is defined by how often you run urlwatch. You can use e.g. crontab.guru to figure out the schedule expression for the checking interval, we recommend not more often than 30 minutes (this would be */30 * * * *). If you have never used cron before, check out the crontab command help.

On Windows, cron is not installed by default. Use the Windows Task Scheduler instead, or see this StackOverflow question for alternatives.

How it works¶

Every time you run urlwatch(1), it:

retrieves the output of each job and filters it
compares it with the version retrieved the previous time (“diffing”)
if it finds any differences, it invokes enabled reporters (e.g. text reporter, e-mail reporter, …) to notify you of the changes

Jobs and Filters¶

Each website or shell command to be monitored constitutes a “job”.

The instructions for each such job are contained in a config file in the YAML format. If you have more than one job, you separate them with a line containing only ---.

You can edit the job and filter configuration file using:

urlwatch --edit

If you get an error, set your $EDITOR (or $VISUAL) environment variable in your shell, for example:

export EDITOR=/bin/nano

While you can edit the YAML file manually, using --edit will do sanity checks before activating the new configuration file.

Kinds of Jobs¶

Each job must have exactly one of the following keys, which also defines the kind of job:

url retrieves what is served by the web server (HTTP GET by default),
navigate uses a headless browser to load web pages requiring JavaScript, and
command runs a shell command.

Each job can have an optional name key to define a user-visible name for the job.

You can then use optional keys to finely control various job’s parameters.

Filters¶

You may use the filter key to select one or more Filters to apply to the data after it is retrieved, for example to:

select HTML: css, xpath, element-by-class, element-by-id, element-by-style, element-by-tag
make HTML more readable: html2text, beautify
make PDFs readable: pdf2text
make JSON more readable: format-json
make iCal more readable: ical2text
make binary readable: hexdump
just detect changes: sha1sum
edit text: grep, grepi, strip, sort, striplines

These filters can be chained. As an example, after retrieving an HTML document by using the url key, you can extract a selection with the xpath filter, convert this to text with html2text, use grep to extract only lines matching a specific regular expression, and then sort them:

name: "Sample urlwatch job definition"
url: "https://example.dummy/"
https_proxy: "http://dummy.proxy/"
max_tries: 2
filter:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: pyhtml2text
      unicode_snob: true
      body_width: 0
      inline_links: false
      ignore_links: true
      ignore_images: true
      pad_tables: false
      single_line_break: true
  - grep: "lines I care about"
  - sort:
---

Reporters¶

urlwatch can be configured to do something with its report besides (or in addition to) the default of displaying it on the console.

Reporters are configured in the global configuration file:

urlwatch --edit-config

Examples of reporters:

email (using SMTP)
email using mailgun
slack
discord
pushbullet
telegram
matrix
pushover
stdout
xmpp
shell