Regular Expressions

Regular expressions are a way of describing advanced pattern matching. This note describes the 4 symbols of the regular expression syntax that are relevant to building Plucker exclusion lists.

. (the period, or decimal point) The wildcard. It matches any single character except the newline character.

* Means 0 or more occurrences of the preceding item. This is most useful in combination with the period wildcard. For example, .* matches any run of characters of any length, including no characters at all.
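The two symbols above can be tried out in Python's re module. This is only a small sketch: the helper name whole_match and the sample strings are made up for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

def whole_match(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# "." stands for exactly one character, but never a newline.
print(whole_match(r"a.c", "abc"))    # one character between a and c
print(whole_match(r"a.c", "ac"))     # nothing between a and c
print(whole_match(r"a.c", "a\nc"))   # a newline between a and c

# ".*" stands for any run of characters, including an empty run.
print(whole_match(r"a.*c", "ac"))
print(whole_match(r"a.*c", "abbbc"))
```

Only the first, fourth and fifth calls report a match.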

\ An 'escape' character. It indicates that the next character should not have its usual special function inside the regular expression, but should instead be treated as a normal character. For example, putting a \ in front of a . makes it an actual period, so an exclusion list entry of www\.plkr\.org would match only the string www.plkr.org. The exclusion list entry www.plkr.org, on the other hand, contains period wildcards, so it matches www.plkr.org as well as wwwaplkrxorg, wwwbplkrdorg, and so on.
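The difference between the escaped and unescaped entries can be checked with a few lines of Python. This is a sketch under assumptions: the helper name whole_match is hypothetical, and re.fullmatch is used here for clarity, which is not necessarily how Plucker itself applies an exclusion list entry.

```python
import re

def whole_match(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

escaped = r"www\.plkr\.org"    # periods are literal periods
unescaped = r"www.plkr.org"    # periods are wildcards

for text in ("www.plkr.org", "wwwaplkrxorg", "wwwbplkrdorg"):
    print(text, whole_match(escaped, text), whole_match(unescaped, text))
```

The escaped entry accepts only www.plkr.org, while the unescaped entry accepts all three strings.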

Beginner tip: If the URL has a ? character, for example http://www.mysite.com/index.shtml?login=tdavidson, don't forget to escape the ? character, as it has a special meaning in regular expression syntax (it makes the preceding item optional). The proper escaping would be http://www\.mysite\.com/index\.shtml\?login=tdavidson
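If you have Python at hand, its re.escape function can do this escaping for you. The sketch below is only an illustration; the variable names are made up, and depending on the Python version re.escape may escape more characters than strictly necessary, which is harmless.

```python
import re

url = "http://www.mysite.com/index.shtml?login=tdavidson"

# Periods escaped by hand, but the ? forgotten: "l?" now means "an optional l".
forgot_question_mark = r"http://www\.mysite\.com/index\.shtml?login=tdavidson"

# re.escape escapes every character that is special in a regular expression.
fully_escaped = re.escape(url)

print(re.fullmatch(forgot_question_mark, url) is not None)
print(re.fullmatch(fully_escaped, url) is not None)
print(fully_escaped)
```

The first check fails because the pattern no longer contains a literal ? to match the one in the URL; the second succeeds.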

Beginner tip: On MSW, if you are spidering a local file on your hard drive, remember to escape the backslashes that separate directories. As mentioned above, a \ means to escape the next character, so we have to 'escape the escape' to get an actual backslash. For example, a URL of C:\windows\myfile.html would be properly escaped as C:\\windows\\myfile.html
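A quick way to convince yourself that the doubled backslashes are right is to test the entry in Python. This is a minimal sketch; the variable names are hypothetical, and raw strings (the r prefix) are used so that Python itself does not interpret the backslashes.

```python
import re

path = r"C:\windows\myfile.html"          # the actual file location
entry = r"C:\\windows\\myfile.html"       # the escaped exclusion list entry

# Each "\\" in the entry stands for one literal backslash in the path.
print(re.fullmatch(entry, path) is not None)

# re.escape produces an equivalent entry automatically.
print(re.fullmatch(re.escape(path), path) is not None)
```

Both checks succeed.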

$ Means to match only if the phrase is at the end of the searched string. This is most useful for file extensions. For example, the exclusion list entry .*\.zip$ starts with .* to mean any number of characters, then \. to mean a literal period, then the letters 'zip' at the end. This will match spam.zip, garbage.zip, and so on, but www.zip2net.com will not be matched.
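The .zip entry can be checked the same way. This sketch assumes the entry is applied from the start of the string with re.match, which is an assumption about the matching mode rather than a statement of how Plucker does it; the helper name is_excluded is made up for illustration.

```python
import re

ZIP_ENTRY = r".*\.zip$"

def is_excluded(entry, url):
    """Return True if the entry matches the url, starting from its first character."""
    return re.match(entry, url) is not None

for url in ("spam.zip", "garbage.zip", "www.zip2net.com"):
    print(url, is_excluded(ZIP_ENTRY, url))
```

The two .zip names are matched; www.zip2net.com is not, because '.zip' does not sit at the end of the string.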