Introduction
Quick Start
1. Run urlwatch once to migrate your old data or start fresh
2. Use urlwatch --edit to customize jobs and filters (urls.yaml)
3. Use urlwatch --edit-config to customize settings and reporters (urlwatch.yaml)
4. Add urlwatch to your crontab (crontab -e) to monitor webpages periodically
The checking interval is defined by how often you run urlwatch. You can use e.g. crontab.guru to figure out the schedule expression for the checking interval. We recommend checking no more often than every 30 minutes (this would be */30 * * * *). If you have never used cron before, check out the crontab command help.
On Windows, cron is not installed by default. Use the Windows Task Scheduler instead, or see this StackOverflow question for alternatives.
How it works
Every time you run urlwatch(1), it:
1. retrieves the output of each job and filters it
2. compares it with the version retrieved the previous time (“diffing”)
3. if it finds any differences, it invokes enabled reporters (e.g. text reporter, e-mail reporter, …) to notify you of the changes
Jobs and Filters
Each website or shell command to be monitored constitutes a “job”.
The instructions for each such job are contained in a config file in the YAML format. If you have more than one job, you separate them with a line containing only ---.
You can edit the job and filter configuration file using:
urlwatch --edit
If you get an error, set your $EDITOR (or $VISUAL) environment variable in your shell, for example:
export EDITOR=/bin/nano
While you can edit the YAML file manually, using --edit will do sanity checks before activating the new configuration file.
Kinds of Jobs
Each job must have exactly one of the following keys, which also defines the kind of job:
- url retrieves what is served by the web server (HTTP GET by default),
- navigate uses a headless browser to load web pages requiring JavaScript, and
- command runs a shell command.
Each job can have an optional name key to define a user-visible name for the job.
You can then use further optional keys to fine-tune each job's parameters.
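As a rough illustration of the three kinds of jobs, a jobs file (urls.yaml) might look like the sketch below. The URLs, job names, and the shell command are placeholders made up for this example. The jobs are separated by a line containing only ---.

```yaml
# Hypothetical urls.yaml with one job of each kind (all values are placeholders)
name: "Example page"
url: "https://example.dummy/"
---
name: "Example page that needs JavaScript"
navigate: "https://example.dummy/app"
---
name: "Disk usage"
command: "df -h"
```

Each job carries exactly one of url, navigate, or command, so the kind of job is unambiguous.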
Filters
You may use the filter key to select one or more Filters to apply to the data after it is retrieved, for example to:
- select HTML: css, xpath, element-by-class, element-by-id, element-by-style, element-by-tag
- make HTML more readable: html2text, beautify
- make PDFs readable: pdf2text
- make JSON more readable: format-json
- make iCal more readable: ical2text
- make binary readable: hexdump
- just detect changes: sha1sum
- edit text: grep, grepi, strip, sort, striplines
These filters can be chained. As an example, after retrieving an HTML document by using the url key, you can extract a selection with the xpath filter, convert this to text with html2text, use grep to extract only lines matching a specific regular expression, and then sort them:
name: "Sample urlwatch job definition"
url: "https://example.dummy/"
https_proxy: "http://dummy.proxy/"
max_tries: 2
filter:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: pyhtml2text
      unicode_snob: true
      body_width: 0
      inline_links: false
      ignore_links: true
      ignore_images: true
      pad_tables: false
      single_line_break: true
  - grep: "lines I care about"
  - sort:
---
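Shorter chains are often enough. The sketch below shows a few other filters from the list in use. The URLs and the CSS selector are placeholders, and filters without options are written with an empty value in the same way as sort in the sample job, which is assumed to work for the other option-less filters too.

```yaml
# Hypothetical jobs; URLs and the CSS selector are placeholders
name: "Only detect that something changed"
url: "https://example.dummy/large-page"
filter:
  - sha1sum:
---
name: "Readable JSON"
url: "https://example.dummy/api/status.json"
filter:
  - format-json:
---
name: "Text of one page element"
url: "https://example.dummy/news"
filter:
  - css: "div.headlines"
  - html2text:
  - strip:
```

Filters run in the order listed, so the selection filter (css) comes before the conversion to text.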
Reporters
urlwatch can be configured to do something with its report besides (or in addition to) the default of displaying it on the console.
Reporters are configured in the global configuration file:
urlwatch --edit-config
Examples of reporters:
- email (using SMTP)
- email using mailgun
- slack
- discord
- pushbullet
- telegram
- matrix
- pushover
- stdout
- xmpp
- shell
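As a sketch, the reporter section of urlwatch.yaml might look roughly like the following. The key names are recalled from urlwatch's default configuration and should be checked against the file that urlwatch --edit-config opens. The addresses, host, and port are placeholders.

```yaml
# Sketch of the "report" section in urlwatch.yaml (values are placeholders;
# verify the key names against your own generated configuration file)
report:
  stdout:
    enabled: true      # keep printing the report to the console
  email:
    enabled: true      # additionally send the report by e-mail
    from: "urlwatch@example.dummy"
    to: "you@example.dummy"
    method: smtp
    smtp:
      host: "smtp.example.dummy"
      port: 587
      starttls: true
```

Several reporters can be enabled at the same time, and each enabled reporter receives the same report.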