Tue 14 Sep

One-liner to extract a list of link addresses from an HTML file

I'm moving my research group's website to a new server and making some updates at the same time. One of the main things I need to do is make sure links are going to work after the transition. Here is a little one-line shell "script" (if you can call it that) that will extract link addresses from an HTML web page:

wget -q -O - http://www.google.com | tr " " "\n" | grep "href" | cut -f2 -d"\""

wget fetches the page (-q keeps it quiet, -O - writes the content to stdout). tr replaces every space with a newline so each attribute tends to land on its own line, grep keeps only the lines containing "href", and finally cut prints whatever sits between the first pair of double quotes on each line.
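
This breaks down if a line holds more than one link or the href isn't surrounded by spaces. Assuming your grep supports -o (GNU grep does), here is a rough variant that instead pulls out every quoted href value directly:

wget -q -O - http://www.google.com | grep -o 'href="[^"]*"' | cut -f2 -d"\""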

If you want to use a file you have on your local machine, you can use this variant instead:

tr " " "\n" < [file_name.html]| grep "href" | cut -f2 -d"\""

Obligatory disclaimer: HTML is NOT a regular language, and in general it cannot be parsed with regular expressions or the kind of simple text munging done here. This is not guaranteed to work.
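
If you want something that actually parses the HTML, xmllint from libxml2 can do it, assuming it's installed and your version supports --xpath; a sketch along these lines should list the href attributes of the anchor tags:

xmllint --html --xpath '//a/@href' [file_name.html] 2>/dev/null | tr " " "\n" | cut -f2 -d"\""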
