How To Crawl Websites With Google Sheets




How to scrape web pages with Google Sheets

An automated way to scrape data within Google Sheets seems too good to be true? In this article, I will show you how and go through some examples of how I’ve used this in my role as a Paid Media Marketer.

What Is IMPORTXML?

On Google’s support page they say this about IMPORTXML:

“IMPORTXML can import data from various structured data types including XML, CSV, HTML TSV, and RSS and ATOM XML feed.”

Effectively IMPORTXML is a way of scraping structured data from a web page without needing any HTML, CSS or coding knowledge.

How does IMPORTXML work?

IMPORTXML only requires two variables to extract data from a web page.

  • The URL of the webpage we want to extract the information from.
  • The XPath of the HTML element contains the data you are trying to extract.

The XPath is used to navigate through elements and attributes in a web document and is structured like this: //title.

For example, if we wanted to extract the title of this article: https://benjamindavis.uk/articles/voicesearch.html

We would use IMPORTXML like this:
=IMPORTXML(“https://benjamindavis.uk/articles/voicesearch.html”, “//title”)

An example of using IMPORTXML in Google Sheets

Bear in mind that if you copy the formula above to test it out you may need to re-write the speech marks.

If you want to find out how to extract specific elements that aren’t the page title if you go into Inspect Element on a web page and right-click the element you want to extract you can click copy > XPath which can then be used in the Google Sheets formula.

Screenshot of how to find Xpath data for a HTML element

When would IMPORTXML be useful?

If you had a list of URLs of articles on your website and you wanted all their meta descriptions you could input all the URLs in a column in Google sheets. Then in the next column, you can use =IMPORTXML(URLCell, “//meta[@name=’description’]/@content”) which will then result in a list of your meta descriptions.

Example:

Table showing urls in left column and a decription of the page on the right

What Is IMPORTHTML?

IMPORTHTML looks for a specific table or list in a webpage and copies the data out of it. In HTML this is anything inside <table, <ul> or <ol> tags.

How does IMPORTHTML work?

IMPORTHTML requires 3 variables:

=IMPORTHTML(URL, query_type, index)
  • The website URL of the page that you want to crawl. This needs to be enclosed in speech marks (“)
  • The query type is either “Table” or “List”
  • The index refers to the table on the page you want to access data from. This starts at 1.

How to get indexes of tables & lists

A webpage will most likely contain one or more tables and/or lists. If you want to find out the indexes of them on a page, follow these steps:

  1. Open your browser's developer console. This can be accessed on Chrome by pressing F12 on your keyboard.
  2. Copy the following code and press enter to return the indexes of all the tables on the page:
    var index = 1; [].forEach.call(document.getElementsByTagName("table"), function(elements) { console.log("Index: " + index++, elements); });
    If you are looking for all lists instead you just need to replace "table" with “ul,ol” like this:
    var index = 1; [].forEach.call(document.querySelectorAll("ul,ol"), function(elements) { console.log("Index: " + index++, elements); });
  3. You should then see a big list of tables like this:
    A list of HTML unordered-list elements
    Clicking through the indexes you should be able to identify which is the one you are looking to pull data from. In my example I want to pull the editorial quality review which is index 12.
    A selected unordered-list list element
    To then pull this data into Google sheets I enter this formula:
    =IMPORTHTML("https://en.wikipedia.org/wiki/Wikipedia:About", "List",12)

How to import a table

In this example I will be pulling in this crypto pricing data from https://coinstats.app/coins/.

Table taken from Coinstats showing prices of top 8 Crypto cryptocurrency

First I will use the developer console to get the index of this table. In this example it is just index 1.

Screenshot of a table element from an array

I can then use the URL and Index to pull this table into Google Sheets using this formula:
=IMPORTHTML("https://coinstats.app/coins/", "Table",1)
In Google Sheets this looks like:

A screenshot of a Google Sheets document displaying the IMPORTHTML function in action

How many IMPORTHTMLs can Google Sheets handle?

In the beginning Google limited you to only being able to have 50 IMPORTHTML functions per sheet but in 2015 this limit was removed.
However, with a lot of IMPORTHTML formulas you may notice a drop in speed as each request has to individually crawl an external URL.

Why is my IMPORTHTML formula suddenly not working?

If your formula suddenly stops working it is worth checking if the URL you are fetching from has changed failing that check if the index of the Table or List you are fetching from has changed.


If things are still not fixed perhaps the website owner has blocked website crawlers in their robots.txt file.

Why would you use these formulas?

I have used both these functions to pull in a list of articles on a clients website and then use that data to then pull in their meta titles, descriptions and the category of the article.
This gave me a list of all article URLS and Category which I then used to create a Dynamic Search Ad feed with category labels which I used in Google Search Ads.
The Meta titles and descriptions just gave me context as to what the articles were about.


Another potential use of these formulas is to fetch data from an external source. Similar to my example above using Coinstats cryptocurrency chart if there was some data that you wanted to pull in to then operate on these formulas allow you to do that.

Other ways of scraping data into Google Sheets

If the above methods don’t work for the data you are trying to pull in Google Sheets also has the following formula that allow you to access different data formats.

Formula Name
Description
IMPORTXML
This formula can import data from various structured data types including XML, HTML, TSV, CSV, RSS and ATOM XML feeds.
IMPORTRANGE
This formula can import a range of cells from a specified spreadsheet.
IMPORTFEED
This formula can import an RSS or ATOM feed.
IMPORTDATA
This formula can import data in CSV or TSV format.

Why would you use these formulas?

I have used both these functions to pull in a list of articles on a clients website and then use that data to then pull in their meta titles, descriptions and the category of the article.
This gave me a list of all article URLS and Category which I then used to create a Dynamic Search Ad feed with category labels which I used in Google Search Ads.
The Meta titles and descriptions just gave me context as to what the articles were about.


Another potential use of these formulas is to fetch data from an external source. Similar to my example above using Coinstats cryptocurrency chart if there was some data that you wanted to pull in to then operate on these formulas allow you to do that.



Related Articles

What is a Marketing Strategy?

A marketing strategy is a single comprehensive plan which contains all of a company’s marketing goals and objectives.

Read More

10 marketing KPI’s you should be tracking in 2021

KPIs ensure that you execute a more successful marketing strategy as well as helping you to identify which campaigns and tactics are having the biggest impact for your business.

Read More

Top 10 ways to market your business in 2021

Strategies that you can use in this new era of cookie-less, privacy focused browsers.

Read More


Let’s talk

Start your project with me

Start Talking