.. _filters:
.. All code examples here should have a unique URL that maps to
an entry in test/data/filter_documentation_testdata.yaml which
will be used to provide input/output data for the filter example
so that the examples can be verified to be correct automatically.
Filters
=======
.. only:: man
Synopsis
--------
urlwatch --edit
Description
-----------
Each job can have two filter stages configured, each consisting of one or
more filters that are applied in sequence:
* Applied to the downloaded page before diffing the changes (``filter``)
* Applied to the diff result before reporting the changes (``diff_filter``)
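
As a minimal sketch of how the two stages sit side by side in one job
(the URL and the regular expression are placeholders chosen for
illustration, and the ``diff_filter`` assumes unified diff output):

.. code:: yaml

   url: https://example.org/changelog.html
   filter:
     # Stage 1: applied to the downloaded page before diffing
     - html2text
     - strip
   diff_filter:
     # Stage 2: applied to the diff result before reporting;
     # keeps only diff lines starting with "@" or "+"
     - grep: '^[@+]'

Within each stage, the filters run in the order in which they are listed.
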
While creating your filter pipeline, you might want to preview what the
filtered output looks like. You can do so by first configuring your job
and then running urlwatch with the ``--test-filter`` command, passing in
the index (from ``--list``) or the URL/location of the job to be tested:
::

   urlwatch --test-filter 1                    # Test the first job in the list
   urlwatch --test-filter https://example.net/ # Test the job with the given URL

The output of this command is the filtered plaintext of the job. In a
real urlwatch run, this output is the input to the diff algorithm.
The ``filter`` is only applied to new content; the old content was
already filtered when it was retrieved. This means that changes to
``filter`` are not visible when reporting unchanged contents
(see :ref:`configuration_display` for details). It also means that the
diff is computed between the old content, as filtered at the time it was
retrieved, and the new content, as filtered by the current ``filter``.
Once urlwatch has collected at least 2 historic snapshots of a job
(two different states of a webpage) you can use the command-line
option ``--test-diff-filter`` to test your ``diff_filter`` settings;
this will use historic data cached locally.
Built-in filters
----------------
The list of built-in filters can be retrieved using::

   urlwatch --features

At the moment, the following filters are built-in:
- **beautify**: Beautify HTML
- **css**: Filter XML/HTML using CSS selectors
- **csv2text**: Convert CSV to plaintext
- **element-by-class**: Get all HTML elements by class
- **element-by-id**: Get an HTML element by its ID
- **element-by-style**: Get all HTML elements by style
- **element-by-tag**: Get an HTML element by its tag
- **format-json**: Convert to formatted json
- **grep**: Filter only lines matching a regular expression
- **grepi**: Remove lines matching a regular expression
- **hexdump**: Convert binary data to hex dump format
- **html2text**: Convert HTML to plaintext
- **pdf2text**: Convert PDF to plaintext
- **pretty-xml**: Pretty-print XML
- **ical2text**: Convert `iCalendar`_ to plaintext
- **ocr**: Convert text in images to plaintext using Tesseract OCR
- **re.sub**: Replace text with regular expressions using Python's re.sub
- **reverse**: Reverse input items
- **sha1sum**: Calculate the SHA-1 checksum of the content
- **shellpipe**: Filter using a shell command
- **sort**: Sort input items
- **remove-duplicate-lines**: Remove duplicate lines (case sensitive)
- **strip**: Strip leading and trailing whitespace
- **striplines**: Strip leading and trailing whitespace in each line
- **xpath**: Filter XML/HTML using XPath expressions
- **jq**: Filter, transform and extract values from JSON
.. To convert the "urlwatch --features" output, use:
sed -e 's/^ \* \(.*\) - \(.*\)$/- **\1**: \2/'
.. _iCalendar: https://en.wikipedia.org/wiki/ICalendar
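
To illustrate how a few of the filters listed above combine, the
following sketch removes lines matching a pattern with ``grepi`` and then
cleans up whitespace with ``striplines``. The URL and the pattern are
hypothetical and only serve as an example:

.. code:: yaml

   url: https://example.org/status.html
   filter:
     - html2text
     # grepi removes matching lines (the inverse of grep)
     - grepi: "Last updated"
     - striplines

Removing a line that changes on every fetch, such as a timestamp, is one
way to keep it from showing up as a change in every report.
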
Picking out elements from a webpage
-----------------------------------
You can pick out a single HTML element with a built-in filter. For
example, to extract the element whose ``id`` is ``something`` from a
page, you can use the following in your ``urls.yaml``:
.. code:: yaml

   url: http://example.org/idtest.html
   filter:
     - element-by-id: something

Also, you can chain filters, so you can run html2text on the result:
.. code:: yaml

   url: http://example.net/id2text.html
   filter:
     - element-by-id: something
     - html2text

Chaining multiple filters
-------------------------
The example ``urls.yaml`` file also demonstrates the use of built-in
filters. Here, three filters are chained (``html2text``, a line-based
``grep``, and whitespace removal with ``strip``) to extract a single
info field from a webpage:
.. code:: yaml

   url: https://example.net/version.html
   filter:
     - html2text
     - grep: "Current.*version"
     - strip

Extracting only the ``<body>`` tag of a page
--------------------------------------------
If you want to extract only the body tag you can use this filter:
.. code:: yaml

   url: https://example.org/bodytag.html
   filter:
     - element-by-tag: body

Filtering based on an XPath expression
--------------------------------------
To filter based on an XPath expression, you can use the ``xpath`` filter
like so:
.. code:: yaml

   url: https://example.net/xpath.html
   filter:
     - xpath: /html/body/marquee

This filters only the ``