XPathTips

Page history last edited by Nils R Grotnes 14 years, 6 months ago

 

Many XPath tutorials and resources are available on the web. However, they are not always accessible to the average user, and much of what they cover is of little help when using RIP. The goal of this page is to provide tips that are directly applicable to RIP and as easy to understand as possible. If anything here is hard to understand and you can improve it, please do. Note: Please do not plagiarize from existing resources when adding your own tips; do add links to helpful sites.


 


 

Overview

Tweaking the XPath query automatically generated by RIP can improve its selectivity. Knowing a little XPath also allows you to write your own queries from scratch.

 

How XPath analyzes HTML documents

XPath treats an HTML document as a tree, much like the folders (directories) and files on your computer. Each item in the tree is called a "node" rather than a "folder" or "directory."

 

A document like this

 

 

<html>
   <body>
      <div>
         <p>
            It was a 
            <a href="http://www.randomwebsite.com/">dark</a>
            and 
            <a href="http://www.randomwebsite.com/">stormy</a>
            <a href="http://www.randomwebsite.com/">night</a>.
         </p>
         <p>
            The 
            <a href="http://www.randomwebsite.com/">lightning</a>
            scared the 

            <a href="http://www.randomwebsite.com/">dog</a>. Still we held out 
            <a href="http://www.randomwebsite.com/">hope</a>
            that 
            <a href="http://www.randomwebsite.com/">she</a>
            would 
            <a href="http://www.randomwebsite.com/">make it</a>
            through the night.
         </p>
      </div>
   </body>
</html>

 

 

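would be analyzed like this (element nodes only; the text between the tags is omitted for brevity):

html
   body
      div
         p
            a  ("dark")
            a  ("stormy")
            a  ("night")
         p
            a  ("lightning")
            a  ("dog")
            a  ("hope")
            a  ("she")
            a  ("make it")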

 

 

 

Syntax

XPath has both an unabbreviated and an abbreviated syntax. The abbreviated syntax is probably the easiest to use.

 

As the tree analogy suggests, XPath syntax is similar or identical to syntax used elsewhere for navigating tree structures. Thus, as in Unix file paths,

  • . (a single dot) refers to the context node
  • .. (two dots) refers to the parent of the context node

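In RIP, the "context node" is determined by the query. Inside square brackets, a single dot refers to whichever node the query is currently testing. Thus, in a query that begins /html/body/table, a dot inside a following [ ] refers to each table directly inside the body of the document, one at a time.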

 

Likewise, as in most languages used for queries, there are wildcards:

  • * refers to any HTML element (for example, div, p, a)
  • @* refers to any HTML attribute (for example, href, style)
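
To see the dot, double dot, and * wildcard in action outside the browser, here is a minimal sketch using Python's standard xml.etree.ElementTree module. This is an assumption made purely for illustration: it is not the engine RIP uses, it supports only a subset of XPath, and the shortened sample page is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the sample document above.
DOC = """
<html>
  <body>
    <div>
      <p>It was a <a href="http://www.randomwebsite.com/">dark</a> night.</p>
      <p>The <a href="http://www.randomwebsite.com/">lightning</a> scared the dog.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# "." is the context node: here, the element that find() is called on.
context = root.find("./body/div")

# "*" matches any element directly inside the context node.
child_tags = [el.tag for el in context.findall("./*")]

# ".." steps from each matched <a> up to its parent.
parents_of_links = [el.tag for el in root.findall(".//a/..")]

print(child_tags)        # ['p', 'p']
print(parents_of_links)  # ['p', 'p']
```

Note that ElementTree queries start from an element you already hold, which is why the paths begin with a dot rather than a leading slash.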

 

 

 

TO DO: explain / (root) (perhaps unfamiliar to Windows users), //, [ ], @, functions

 

 

RIP can make two distinct selections: what is to be removed, and what must match before the removal is done. The simplest example is the removal of an element with a given attribute, where the element to be removed is followed by square brackets [ ] containing the attribute you want to match, as in //div[@id='extraneous'].

 

You can also test an element other than the one you want to remove, and use it to decide whether the removal happens. For example, the query //div[./div[@id='extraneous']] removes the outer (parent) section, even though the match is made against the attribute of the inner (child) section.
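
The difference between the two kinds of selection can be sketched in Python. Everything here is an assumption for illustration: the page fragment and the remove_all helper are hypothetical, RIP's own removal mechanism is not described on this page, and because xml.etree.ElementTree cannot parse a predicate nested inside another predicate, the outer test is written as an ordinary Python filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: one div can be matched by its own id,
# its parent only through the id of the div nested directly inside it.
PAGE = """
<body>
  <div id="content">Keep me</div>
  <div id="wrapper">
    <div id="extraneous">Ad</div>
  </div>
</body>
"""

def remove_all(root, targets):
    """Detach every element in targets from its parent (imitates a removal)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in targets:
        parent_of[el].remove(el)

# Case 1: //div[@id='extraneous']
# The element that is tested is also the element that is removed.
root1 = ET.fromstring(PAGE)
remove_all(root1, root1.findall(".//div[@id='extraneous']"))
ids_after_case1 = [d.get("id") for d in root1.iter("div")]

# Case 2: //div[./div[@id='extraneous']]
# The child is tested, but the parent is removed.
root2 = ET.fromstring(PAGE)
outer = [d for d in root2.iter("div")
         if d.find("./div[@id='extraneous']") is not None]
remove_all(root2, outer)
ids_after_case2 = [d.get("id") for d in root2.iter("div")]

print(ids_after_case1)  # ['content', 'wrapper']
print(ids_after_case2)  # ['content']
```

In the first case the empty wrapper stays on the page; in the second the wrapper goes, taking its contents with it.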

 

 

Examples

 

 

 

Query by attribute

 

  • Remove sections that have an id matching a certain string, e.g., "extraneous."

 

 

XPath query: //div[@id='extraneous']

 

  • Remove sections that have an id containing a certain string. For example, remove all the sections whose ids contain the string "distraction."

 

 

XPath query: //div[@id[contains(.,'distraction')]]

 

  • Remove any element that has an id containing the string "distraction."

 

 

XPath query: //*[@id[contains(.,'distraction')]]

 

  • Remove a table row containing a cell with the class "UglyJunk" (using parent node syntax):

 

 

XPath query: //td[@class='UglyJunk']/..

 

  • Remove a table row containing a cell with the class "UglyJunk" (using context node syntax):

 

 

XPath query: //tr[./td[@class='UglyJunk']]
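
If you want to check what the attribute queries above select before trying them in RIP, the following Python sketch mirrors them on a hypothetical page fragment. It assumes the standard xml.etree.ElementTree module, which is not what RIP uses and has no contains() function, so the substring tests are imitated with Python's "in" operator.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment used only for this illustration.
PAGE = """
<body>
  <div id="extraneous">Sidebar</div>
  <div id="distraction-top">Banner</div>
  <span id="side-distraction">Widget</span>
  <table>
    <tr><td class="UglyJunk">Ad</td><td>More ad</td></tr>
    <tr><td>Data</td></tr>
  </table>
</body>
"""
root = ET.fromstring(PAGE)

# //div[@id='extraneous'] : exact attribute match, supported natively.
exact = [el.get("id") for el in root.findall(".//div[@id='extraneous']")]

# //div[@id[contains(.,'distraction')]] : substring test, div elements only.
divs_only = [el.get("id") for el in root.iter("div")
             if "distraction" in el.get("id", "")]

# //*[@id[contains(.,'distraction')]] : same test on any element.
any_element = [el.get("id") for el in root.iter()
               if "distraction" in el.get("id", "")]

# //td[@class='UglyJunk']/.. : parent node syntax, supported natively.
rows_via_parent = root.findall(".//td[@class='UglyJunk']/..")

# //tr[./td[@class='UglyJunk']] : context node syntax, imitated with a filter.
rows_via_predicate = [tr for tr in root.iter("tr")
                      if tr.find("./td[@class='UglyJunk']") is not None]

print(exact)        # ['extraneous']
print(divs_only)    # ['distraction-top']
print(any_element)  # ['distraction-top', 'side-distraction']
print(rows_via_parent == rows_via_predicate)  # True
```

The two row queries agree here because every td sits directly inside a tr. The parent node form selects whatever the parent happens to be, while the context node form only ever selects tr elements.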

 

 

 

 

Query by an element's enclosed text

 

At its most basic, this method is like specifying a folder by its contents. Here you're telling RIP to find an element (folder) by the text it surrounds (its contents).

 

  • Remove all links whose complete enclosed text is known. For example, to remove the link <a href="http://www.asdf.com/">blah blah</a>, you could use

 

 

XPath query: //a[text()='blah blah']

 

  • A variation on the previous example: The visible link text is actually enclosed within some additional tag, e.g., <em>: <a href="http://www.asdf.com/"><em>blah blah</em></a>

 

 

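XPath query: //a//*[text()='blah blah']

Note that this query selects the enclosed element (here, the <em>) rather than the link itself. To select the whole link, move the test inside the brackets: //a[.//*[text()='blah blah']]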

 

  • A more useful approach, though, is to query for part of the enclosed text. Thus the same link (<a href="http://www.asdf.com/"><em>blah blah</em></a>) can be removed using

 

 

XPath query: //a[contains(.,'blah')]

 

  • This approach could, in some cases, be safely applied to tables that contain a stable header string. Just check to make sure the table does not also include desired content.

 

 

XPath query: //table[contains(.,'Distraction of the Day')]
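
The three link queries above behave differently because text() looks only at text sitting directly inside an element, while the dot in contains(.,'...') stands for all the text the element encloses, nested tags included. The Python sketch below imitates that distinction; the helper names text_nodes and string_value are hypothetical, and xml.etree.ElementTree is used only for parsing, since it supports neither text() nor contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a plain link, a link wrapping <em>, an unrelated link.
PAGE = """
<body>
  <a href="http://www.asdf.com/">blah blah</a>
  <a href="http://www.asdf.com/"><em>blah blah</em></a>
  <a href="http://www.asdf.com/">something else</a>
</body>
"""
root = ET.fromstring(PAGE)
links = root.findall(".//a")

def text_nodes(el):
    """Text directly inside el, piece by piece (roughly what text() selects)."""
    parts = [el.text] + [child.tail for child in el]
    return [p for p in parts if p]

def string_value(el):
    """All text enclosed by el, including text inside nested elements."""
    return "".join(el.itertext())

# //a[text()='blah blah'] : only the plain link matches.
exact = [i for i, a in enumerate(links) if "blah blah" in text_nodes(a)]

# //a//*[text()='blah blah'] : matches the element nested inside a link.
inner = [el.tag for a in links for el in a.iter()
         if el is not a and "blah blah" in text_nodes(el)]

# //a[contains(.,'blah')] : substring test against all enclosed text.
partial = [i for i, a in enumerate(links) if "blah" in string_value(a)]

print(exact)    # [0]
print(inner)    # ['em']
print(partial)  # [0, 1]
```

The numbers are positions in the list of links, so the contains() form is the only one of the three that catches both the plain link and the one wrapped in <em>.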

 

 

 

 

Query by an element's inner elements

 

This method is like specifying two folders: the first is the one to remove, and the second is the one that has to match before the first is removed.

 

  • Remove any section that has a paragraph containing the given link.

 

 

XPath query: //div[./p/a[@href[contains(.,'http://www.randomwebsite.com/')]]]

 

 

 

 

 

 

 

Q & A

 

 

 

 

 

 

 

 

 

Additional Useful Tools

 

 

DOM Inspector (included with the Developer Tools), Inspect Element (requires DOM Inspector), and ColorZilla (works with DOM Inspector) can help you quickly determine unique characteristics of the item you would like to remove.

 

If you have DOM Inspector installed, you should see it listed on your Tools menu. If it is not installed and you would like to use it, you will (unfortunately) have to reinstall Firefox. When you do, choose the custom installation and check "Developer Tools."

 

While lacking the close integration of Inspect Element and ColorZilla with the DOM Inspector, both the Aardvark extension and the Web Developer extension are excellent analytical tools. You'll definitely want to give them a try.

 

TO DO: Find out whether ColorZilla shows the element information and DOM path in the Status Bar when DOM Inspector is not installed, and edit this section to note it if so.

 

 

 

 

Links

 
