.. _filters:

.. All code examples here should have a unique URL that maps to
   an entry in test/data/filter_documentation_testdata.yaml which
   will be used to provide input/output data for the filter example
   so that the examples can be verified to be correct automatically.

Filters
=======

.. only:: man

   Synopsis
   --------

   urlwatch --edit

   Description
   -----------

Each job can have two filter stages configured, with one or more
filters processed after each other:

* Applied to the downloaded page before diffing the changes (``filter``)
* Applied to the diff result before reporting the changes (``diff_filter``)

While creating your filter pipeline, you might want to preview what the
filtered output looks like. You can do so by first configuring your job
and then running urlwatch with the ``--test-filter`` command, passing in
the index (from ``--list``) or the URL/location of the job to be tested:

::

   urlwatch --test-filter 1   # Test the first job in the list
   urlwatch --test-filter https://example.net/  # Test the job with the given URL

The output of this command is the filtered plaintext of the job. In a
real urlwatch run, this output is the input to the diff algorithm.

The ``filter`` is only applied to new content; the old content was
already filtered when it was retrieved. This means that changes to
``filter`` are not visible when reporting unchanged contents
(see :ref:`configuration_display` for details), and the diff output
will be between (old content with filter at the time old content was
retrieved) and (new content with current filter).

Once urlwatch has collected at least 2 historic snapshots of a job
(two different states of a webpage), you can use the command-line
option ``--test-diff-filter`` to test your ``diff_filter`` settings;
this will use historic data cached locally.

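
As a minimal sketch of the two filter stages, a single job could
combine a ``filter`` chain with a ``diff_filter``. The URL and the
regular expression here are illustrative assumptions (the URL is a
placeholder without a matching test data entry), not taken from a real
job:

.. code:: yaml

   # Hypothetical job; the URL is a placeholder
   url: https://example.net/two-stages.html
   # Stage 1: applied to the downloaded page before diffing
   filter:
     - html2text
     - strip
   # Stage 2: applied to the diff result before reporting;
   # keeps only diff lines that start with "@" or "+"
   diff_filter:
     - grep: '^[@+]'

With such a job, ``--test-filter`` would preview the output of the first
stage, while ``--test-diff-filter`` would preview the second stage once
at least two snapshots have been collected.
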
Built-in filters
----------------

The list of built-in filters can be retrieved using::

   urlwatch --features

At the moment, the following filters are built-in:

- **beautify**: Beautify HTML
- **css**: Filter XML/HTML using CSS selectors
- **csv2text**: Convert CSV to plaintext
- **element-by-class**: Get all HTML elements by class
- **element-by-id**: Get an HTML element by its ID
- **element-by-style**: Get all HTML elements by style
- **element-by-tag**: Get an HTML element by its tag
- **format-json**: Convert to formatted json
- **grep**: Filter only lines matching a regular expression
- **grepi**: Remove lines matching a regular expression
- **hexdump**: Convert binary data to hex dump format
- **html2text**: Convert HTML to plaintext
- **pdf2text**: Convert PDF to plaintext
- **pretty-xml**: Pretty-print XML
- **ical2text**: Convert `iCalendar`_ to plaintext
- **ocr**: Convert text in images to plaintext using Tesseract OCR
- **re.sub**: Replace text with regular expressions using Python's re.sub
- **reverse**: Reverse input items
- **sha1sum**: Calculate the SHA-1 checksum of the content
- **shellpipe**: Filter using a shell command
- **sort**: Sort input items
- **remove-duplicate-lines**: Remove duplicate lines (case sensitive)
- **strip**: Strip leading and trailing whitespace
- **striplines**: Strip leading and trailing whitespace in each line
- **xpath**: Filter XML/HTML using XPath expressions
- **jq**: Filter, transform and extract values from JSON

.. To convert the "urlwatch --features" output, use:
   sed -e 's/^ \* \(.*\) - \(.*\)$/- **\1**: \2/'

.. _iCalendar: https://en.wikipedia.org/wiki/ICalendar

Picking out elements from a webpage
-----------------------------------

You can pick out a single HTML element with the built-in filters. For
example, to extract the element with the ID ``something`` from a page,
you can use the following in your ``urls.yaml``:

.. code:: yaml

   url: http://example.org/idtest.html
   filter:
     - element-by-id: something

You can also chain filters, for example to run ``html2text`` on the
result:

.. code:: yaml

   url: http://example.net/id2text.html
   filter:
     - element-by-id: something
     - html2text

Chaining multiple filters
-------------------------

The example ``urls.yaml`` file also demonstrates the use of built-in
filters. Here, three filters are used (``html2text``, line-based
``grep``, and whitespace removal with ``strip``) to get just a certain
info field from a webpage:

.. code:: yaml

   url: https://example.net/version.html
   filter:
     - html2text
     - grep: "Current.*version"
     - strip

Extracting only the ``<body>`` tag of a page
--------------------------------------------

If you want to extract only the body tag, you can use this filter:

.. code:: yaml

   url: https://example.org/bodytag.html
   filter:
     - element-by-tag: body

Filtering based on an XPath expression
--------------------------------------

To filter based on an XPath expression, you can use the ``xpath``
filter like so:

.. code:: yaml

   url: https://example.net/xpath.html
   filter:
     - xpath: /html/body/marquee

This filters only the ``<marquee>`` elements directly below the
``<body>`` element, which in turn must be below the ``<html>`` element
of the document, stripping out everything else.

See Microsoft’s XPath Examples page for some other examples. You can
also find the XPath of an HTML node in the Chromium/Google Chrome
developer tools by right-clicking on the node and selecting
``copy XPath``.

Filtering based on CSS selectors
--------------------------------

To filter based on a CSS selector, you can use the ``css`` filter like
so:

.. code:: yaml

   url: https://example.net/css.html
   filter:
     - css: ul#groceries > li.unchecked

This would filter only ``<li class="unchecked">`` tags directly below ``