Extract contents from an .html file

A quick and easy way to extract the body from an HTML file:

cat myfile.html | tr -d '\n' | grep -o -E '<\s*body[^>]*>.*<\s*/\s*body\s*>'
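  • cat – print the file contents
  • tr – remove newlines, so the whole document sits on one line (grep matches line by line)
  • grep (with a regular expression) – print only the matching part, from the opening body tag to the closing one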
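
The three steps above can be wrapped in a small shell function. This is only a minimal sketch: it assumes GNU grep (for \s) and a file with a single body element, and the names extract_body and sample are made up for illustration.

```shell
# Hypothetical helper: print the <body>...</body> element of the HTML file $1.
# Assumes GNU grep (for \s) and a single body element in the file.
extract_body() {
  tr -d '\n' < "$1" | grep -o -E '<\s*body[^>]*>.*<\s*/\s*body\s*>'
}

# Build a small throwaway file to try it on.
sample=$(mktemp)
printf '%s\n' '<html>' '<head><title>t</title></head>' \
  '<body class="x">' '<p>Hello</p>' '</body>' '</html>' > "$sample"

extract_body "$sample"
# prints: <body class="x"><p>Hello</p></body>
```

Reading the file with input redirection (<) does the same job as cat here, with one process fewer.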

You can use redirection (">") to send the output to a file instead of standard output:

cat myfile.html | tr -d '\n' | grep -o -E '<\s*body[^>]*>.*<\s*/\s*body\s*>' > output.html
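
One thing to watch for with the redirect: the shell creates the output file even when grep finds nothing, so you can end up with an empty file. grep exits with a non-zero status when nothing matches, which makes this easy to check. The sketch below is one possible way to do that, assuming GNU grep; the names save_body, has_body and no_body are hypothetical.

```shell
# Hypothetical wrapper: save the body of $1 into $2, and report failure
# instead of leaving an empty output file behind.
save_body() {
  if tr -d '\n' < "$1" | grep -o -E '<\s*body[^>]*>.*<\s*/\s*body\s*>' > "$2"; then
    echo "body written to $2"
  else
    rm -f "$2"   # grep matched nothing, so drop the empty file
    echo "no body element found in $1" >&2
    return 1
  fi
}

# Two throwaway inputs: one with a body element, one without.
has_body=$(mktemp)
no_body=$(mktemp)
printf '%s\n' '<html>' '<body>' '<p>Hi</p>' '</body>' '</html>' > "$has_body"
printf '%s\n' '<html>' '<p>fragment only</p>' '</html>' > "$no_body"

save_body "$has_body" "$has_body.out"
save_body "$no_body" "$no_body.out" || echo "extraction failed"
```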
