HubSpot Blog Post Exports: How to Clean Them Up for WordPress Using Google Sheets

HubSpot makes it fairly straightforward to export blog content, but the HTML that comes out is often full of inline styles, classes and IDs. When that HTML is pasted into WordPress, it can clash with the theme’s styles, create inconsistent typography and generally make templates harder to maintain.

This guide outlines a practical workflow for cleaning HubSpot blog exports in bulk via Google Sheets, using Python, pandas, gspread and BeautifulSoup. The script removes unwanted attributes from key text elements while leaving links and images intact, so content is ready to paste into WordPress with minimal manual editing.

Contents

  1. Why clean HubSpot exports before moving to WordPress?
  2. What you’ll need
  3. Step 1 – Export blog posts from HubSpot
  4. Step 2 – Set up Google Sheets API access
  5. Step 3 – Add the Python script
  6. Step 4 – Run the script and review the output
  7. How the clean_html function works
  8. Performance and real-world usage tips
  9. Versions, assumptions and limitations
  10. Troubleshooting common errors
  11. Taking this further

Why clean HubSpot exports before moving to WordPress?

HubSpot’s blog editor applies its own classes, IDs and inline styles to headings, paragraphs and other elements. That works within HubSpot’s templates, but the same styling can:

  • Override or clash with WordPress theme styles.
  • Make typography and spacing inconsistent between migrated and native posts.
  • Add unnecessary code bloat to the HTML.

Cleaning exported HTML before migration helps:

  • Keep styling under the control of the WordPress theme and block styles.
  • Improve consistency across all posts.
  • Reduce layout bugs caused by legacy HubSpot styling.

Doing this by hand inside each post is time-consuming. A small Python script plus Google Sheets can automate most of the work.

What you’ll need

  • A HubSpot account with permission to export blog posts.
  • A WordPress site where the posts will ultimately live.
  • A Google account with access to Google Sheets.
  • Python 3 (3.10+ recommended) installed locally or on a server.
  • The following Python packages:
    • gspread – for Google Sheets access (gspread docs)
    • pandas – for working with tabular data (pandas docs)
    • beautifulsoup4 – for HTML parsing (BeautifulSoup docs)
    • google-auth – for Google API authentication

Install the packages with:

pip install gspread pandas beautifulsoup4 google-auth
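
Because some behaviour described later depends on which package releases are installed, it can help to confirm the versions after installing. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is illustrative and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip install
PACKAGES = ["gspread", "pandas", "beautifulsoup4", "google-auth"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs to be installed before the main script will import cleanly.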

Step 1 – Export blog posts from HubSpot

The exact menu labels in HubSpot can change over time, but the export flow is broadly:

  1. In HubSpot, go to Content > Blog (or Marketing > Website > Blog, depending on the account layout).
  2. Choose the relevant blog if there are multiple.
  3. Use the actions menu to select Export blog posts.
  4. Select the fields needed (for example “Title”, “Content”, “Meta Description”).
  5. Export as a CSV file and download it.

HubSpot’s official documentation on exporting blog content provides more detail and screenshots: HubSpot – Export your blog posts.

Once the CSV has been downloaded, import it into a Google Sheet. A fresh Sheet with a tab named something like HubSpot Export keeps things organised.

Step 2 – Set up Google Sheets API access

The script uses a Google service account to read from and write to a Sheet. Google’s official quickstart shows the overall process in detail: Google Sheets API – Python quickstart.

In summary:

  1. Create a Google Cloud project.
  2. Enable the Google Sheets API and (optionally) the Google Drive API.
  3. Create a service account for the project and generate a JSON key file, then download it as credentials.json into the project folder.
  4. In Google Sheets, share the Sheet with the service account’s email address (usually something like your-service-account-name@your-project-id.iam.gserviceaccount.com) and give it Editor access, because the script writes to the Spreadsheet.

This gives the Python script permission to read from the “HubSpot Export” tab and write cleaned content into a separate tab in the same Spreadsheet.
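
The address to share the Sheet with is stored in the downloaded key file under the client_email field. As a small sketch, assuming the key was saved as credentials.json as described above, it can be read back like this (the function name service_account_email is hypothetical):

```python
import json
from pathlib import Path


def service_account_email(key_path="credentials.json"):
    """Return the client_email stored in a service account JSON key file."""
    key_data = json.loads(Path(key_path).read_text(encoding="utf-8"))
    return key_data["client_email"]


if __name__ == "__main__":
    if Path("credentials.json").exists():
        print("Share the Sheet with:", service_account_email())
    else:
        print("credentials.json not found in the current folder.")
```

Copying the printed address into the Sheet's Share dialog avoids typos in the long service account email.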

Step 3 – Add the Python script

The script below connects to Google Sheets, reads the exported HubSpot data, cleans the HTML in each cell and writes the results to a new worksheet. It implements:

  • A safer header-handling pattern (the header row is kept as column names, not treated as data).
  • Element-wise cleaning of every cell with DataFrame.applymap, which pandas 2.1 deprecated in favour of the equivalent DataFrame.map.
  • Optional Drive scope to avoid issues when listing worksheets.

import html
import unicodedata

import gspread
import pandas as pd
from bs4 import BeautifulSoup
from google.oauth2.service_account import Credentials

# === Configuration ===
SERVICE_ACCOUNT_FILE = "credentials.json"
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]

SPREADSHEET_ID = "YOUR_SPREADSHEET_ID"  # Replace with your Sheet ID
INPUT_SHEET_NAME = "HubSpot Export"
OUTPUT_SHEET_NAME = "Cleaned Export"


def clean_html(html_content):
    """
    Clean a single cell of HTML.

    - Decodes HTML entities.
    - Normalises Unicode.
    - Strips non-breaking spaces.
    - Removes style, class and id attributes from common text elements.
    - Leaves links, images and structural elements untouched.
    """
    # Skip non-strings and very large cells
    if not isinstance(html_content, str) or len(html_content) > 5000:
        return html_content

    # Decode HTML entities
    html_content = html.unescape(html_content)

    # Normalise Unicode
    html_content = unicodedata.normalize("NFKC", html_content)

    # Drop characters that cannot be encoded as UTF-8 (such as lone surrogates)
    html_content = html_content.encode("utf-8", "ignore").decode("utf-8")

    # Replace non-breaking spaces and trim
    html_content = html_content.replace("\xa0", " ").strip()

    # If it does not look like HTML, return as-is
    if "<" not in html_content or ">" not in html_content:
        return html_content

    # Parse the HTML
    soup = BeautifulSoup(html_content, "html.parser")

    # Only clean common text elements; leave links/images/layout elements alone
    tags_to_clean = ["p", "em", "blockquote", "strong", "b", "span"]
    tags_to_clean += [f"h{i}" for i in range(1, 7)]

    for tag in soup.find_all(tags_to_clean):
        for attr in ["style", "class", "id"]:
            tag.attrs.pop(attr, None)

    return str(soup)


def main():
    # Authenticate with the service account
    creds = Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE,
        scopes=SCOPES,
    )
    client = gspread.authorize(creds)

    # Open the spreadsheet
    spreadsheet = client.open_by_key(SPREADSHEET_ID)

    # Get the input worksheet
    input_worksheet = spreadsheet.worksheet(INPUT_SHEET_NAME)

    # Get or create the output worksheet
    worksheet_titles = [ws.title for ws in spreadsheet.worksheets()]
    if OUTPUT_SHEET_NAME in worksheet_titles:
        output_worksheet = spreadsheet.worksheet(OUTPUT_SHEET_NAME)
    else:
        output_worksheet = spreadsheet.add_worksheet(
            title=OUTPUT_SHEET_NAME,
            rows=1000,
            cols=50,
        )

    # Read all data from the input sheet
    data = input_worksheet.get_all_values()
    if not data:
        print("No data found in input sheet.")
        return

    header, *rows = data
    if not rows:
        print("No data rows found in input sheet.")
        return

    # Build a DataFrame with the header row as column names
    df = pd.DataFrame(rows, columns=header)

    # Apply cleaning to every cell
    # pandas 2.1 added DataFrame.map and deprecated applymap; use whichever exists
    if hasattr(pd.DataFrame, "map"):
        df = df.map(clean_html)
    else:
        df = df.applymap(clean_html)

    # Write the cleaned data back to the output sheet
    output_worksheet.clear()
    output_worksheet.update([df.columns.tolist()] + df.values.tolist())

    print(f"Cleaning complete. Data written to worksheet: {OUTPUT_SHEET_NAME!r}")


if __name__ == "__main__":
    main()

Step 4 – Run the script and review the output

Save the script as clean_hubspot_export.py in the same folder as credentials.json, set SPREADSHEET_ID and the sheet names in the configuration block, then run it from the command line:

python clean_hubspot_export.py

When it completes, the Google Sheet will contain a tab named Cleaned Export (the value of OUTPUT_SHEET_NAME). It has the same column structure as the original export, with cleaned HTML in each cell. If the tab already existed, its previous contents are cleared and overwritten.

The next steps are:

  1. Open a post in WordPress.
  2. Switch to the HTML/Code editor (or use a Custom HTML block).
  3. Copy the cleaned HTML from the relevant cell in the Sheet.
  4. Paste it into WordPress and preview it in the front-end theme.

Headings, paragraphs and emphasis should now respect the theme’s styling, without legacy HubSpot inline styles overriding anything.

How the clean_html function works

The clean_html function is deliberately conservative so that content is made safer for WordPress without breaking layout or media.

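  • It decodes HTML entities using Python’s html.unescape, so entities such as &nbsp; and &rsquo; become plain Unicode characters. Escaped markup such as &lt;p&gt; is decoded too, so posts that display HTML as literal text (code samples, for instance) are worth checking after cleaning.
  • It normalises Unicode with the standard library’s unicodedata.normalize, using form NFKC, to reduce odd character variants that sometimes appear after exports. NFKC also folds compatibility characters (for example, ™ becomes TM).
  • It replaces non-breaking spaces (\xa0) with ordinary spaces and trims leading and trailing whitespace to tidy paragraph text.
  • It only cleans specific tags:
    • Headings: <h1> to <h6>
    • Paragraphs: <p>
    • Inline and quotation tags: <em>, <strong>, <b>, <span>, <blockquote>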
    For these tags it strips style, class and id attributes.
  • It leaves links, images and layout elements such as <a>, <img>, <ul>, <ol>, <div> and <section> untouched. This helps preserve structure and media.
  • It skips obviously non-HTML content and very large cells (over 5,000 characters) to avoid wasting time on plain text columns and to guard against pathological cases.

This approach follows typical patterns described in the BeautifulSoup documentation for cleaning attributes from tags while preserving the underlying HTML: see the Modifying the tree section in the official docs for more examples.
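
To make the attribute-stripping step concrete, here is a minimal sketch that isolates it from the rest of clean_html and runs it on a made-up fragment. The function name strip_text_attrs and the sample markup are illustrative assumptions, not real HubSpot output.

```python
from bs4 import BeautifulSoup

# Same tag list as clean_html: common text elements only
TEXT_TAGS = ["p", "em", "blockquote", "strong", "b", "span"]
TEXT_TAGS += [f"h{i}" for i in range(1, 7)]


def strip_text_attrs(fragment):
    """Remove style, class and id from text tags; leave other tags alone."""
    soup = BeautifulSoup(fragment, "html.parser")
    for tag in soup.find_all(TEXT_TAGS):
        for attr in ("style", "class", "id"):
            tag.attrs.pop(attr, None)
    return str(soup)


# Hypothetical fragment with HubSpot-style attributes
sample = (
    '<h2 class="hs-title" style="color:#333">Title</h2>'
    '<p id="intro">See <a class="cta" href="/pricing">pricing</a>.</p>'
)
cleaned = strip_text_attrs(sample)
print(cleaned)
```

In the printed result the heading and paragraph lose their attributes, while the link keeps both its class and its href because <a> is not in the tag list.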

Performance and real-world usage tips

The example above applies clean_html to every cell in the Sheet. For small to medium exports, that is perfectly acceptable. For larger datasets or very wide sheets, performance can be improved by:

  • Restricting cleaning to the column that contains post body HTML, for example:
    content_column = "Content"  # change to match the export column name
    df[content_column] = df[content_column].apply(clean_html)
  • Splitting extremely large exports across multiple worksheets, then running the script per worksheet.
  • Running the script on a machine with a stable connection to Google’s APIs and avoiding very aggressive reruns (for example, not running the entire migration every few minutes).
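
When it is not obvious which columns hold HTML, they can be detected before cleaning. The sketch below works on the header and rows in the list-of-lists shape returned by get_all_values; the helper name html_columns, the regular expression and the sample rows are assumptions for illustration.

```python
import re

# Rough test for markup: an opening or closing tag such as <p> or </h2>
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>")


def html_columns(header, rows):
    """Return the column names in which at least one cell contains an HTML tag."""
    found = []
    for index, name in enumerate(header):
        if any(index < len(row) and TAG_PATTERN.search(row[index]) for row in rows):
            found.append(name)
    return found


# Made-up sample data in the same shape as the Sheet contents
header = ["Title", "Content", "Meta Description"]
rows = [
    ["First post", '<p class="hs">Hello</p>', "A short summary"],
    ["Second post", "<h2>Heading</h2><p>Body</p>", "Price < 10 and > 5"],
]
print(html_columns(header, rows))
```

The returned names can then be used to apply clean_html to just those columns, leaving titles, dates and other plain-text fields untouched.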

Versions, assumptions and limitations

To keep the example focused, a few assumptions are made:

  • Python version: any reasonably up-to-date Python 3 version should work. Python 3.10+ is recommended.
  • pandas version: the script maps clean_html over every cell of the DataFrame. The method for this is DataFrame.applymap in older pandas and DataFrame.map from pandas 2.1 onwards, where applymap is deprecated; use whichever the installed version provides and check the official pandas documentation for behavioural differences.
  • Google access: the service account must have access to the Sheet, and the Sheets API must be enabled in the Google Cloud project.
  • HTML scope: only a specific set of text-related tags is cleaned. If HubSpot adds important styling to other elements (for example custom cards or layout blocks), extra rules may be needed.
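  • Length cap: cells longer than 5,000 characters are returned unchanged, as a pragmatic safeguard. Many full-length posts exceed that once markup is counted, so raise the value if long posts come back uncleaned. Google Sheets itself limits a cell to 50,000 characters.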
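
Cells over the length cap pass through unchanged without any message, so it can be useful to list them. This is a minimal sketch over the header and rows as returned by get_all_values; the function name oversized_cells and the sample data are hypothetical.

```python
def oversized_cells(header, rows, limit=5000):
    """List (sheet_row, column_name, length) for every cell longer than limit."""
    skipped = []
    # Row 1 of the Sheet is the header, so data rows start at 2
    for sheet_row, row in enumerate(rows, start=2):
        for name, cell in zip(header, row):
            if len(cell) > limit:
                skipped.append((sheet_row, name, len(cell)))
    return skipped


# Made-up sample data: one short post and one over the cap
header = ["Title", "Content"]
rows = [
    ["Short post", "<p>Hi</p>"],
    ["Long post", "<p>" + "x" * 6000 + "</p>"],
]
for sheet_row, name, length in oversized_cells(header, rows):
    print(f"Row {sheet_row}, column {name!r}: {length} characters (left uncleaned)")
```

If the list is long, raising the limit in clean_html is probably simpler than handling those posts by hand.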

For a detailed view of the authentication and authorisation flow used by this pattern, the official Google Sheets API documentation is the best reference point.

Troubleshooting common errors

When working with Google Sheets and external libraries, a few common issues tend to appear. Below are some quick checks that reflect real-world experience with this kind of script:

SpreadsheetNotFound or similar errors

  • Check that the correct SPREADSHEET_ID has been copied from the Sheet URL.
  • Confirm that the Sheet has been shared with the service account’s email address.
  • Verify that the service account JSON file path in SERVICE_ACCOUNT_FILE is correct.

WorksheetNotFound for the input worksheet

  • Make sure that the tab name in Google Sheets matches INPUT_SHEET_NAME exactly, including spaces and capital letters.
  • If the export tab has a different name, either rename it in Sheets or adjust the configuration in the script.

AttributeError related to applymap

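  • applymap is present in older pandas releases and deprecated from pandas 2.1, which introduced DataFrame.map as its replacement. If the installed version raises AttributeError (or a FutureWarning) for applymap, call df.map(clean_html) instead, or pin pandas to a release that still includes applymap.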

Slow performance or suspected rate limits

  • For very large sheets, consider cleaning only the content column instead of every cell.
  • Batching work across multiple runs or worksheets can help avoid hitting Google API quotas in a single burst.
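
If quota errors appear only occasionally, retrying with an increasing delay is one option. The helper below is a generic sketch and not part of gspread; the name call_with_backoff and its parameters are assumptions. With gspread, the exception to retry on would typically be gspread.exceptions.APIError.

```python
import time


def call_with_backoff(func, *, retries=4, base_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In the main script this could wrap the write step, for example call_with_backoff(lambda: output_worksheet.update(values), retry_on=(gspread.exceptions.APIError,)), where values is the list of rows being written.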

For detailed exceptions and usage examples, the gspread documentation is a useful companion to this script.

Taking this further

The basic pattern here is flexible. It can be extended to handle:

  • Extra attribute cleaning for other tags such as <div> and <section>, if layout classes are not required in WordPress.
  • Custom find-and-replace operations for HubSpot-specific markup patterns.
  • Automated checks for heading levels, empty paragraphs or legacy shortcodes.
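
As an illustration of the automated checks mentioned above, the following sketch flags empty paragraphs and skipped heading levels in a single HTML fragment. The function name audit_html and the wording of the messages are hypothetical, and the rules are deliberately simple.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = [f"h{i}" for i in range(1, 7)]


def audit_html(fragment):
    """Return a list of human-readable issues found in an HTML fragment."""
    soup = BeautifulSoup(fragment, "html.parser")
    issues = []

    # Paragraphs with no visible text and no embedded media
    for p in soup.find_all("p"):
        has_text = bool(p.get_text(strip=True))
        has_media = p.find(["img", "iframe", "video"]) is not None
        if not has_text and not has_media:
            issues.append("empty paragraph")

    # Headings that skip a level, such as an h2 followed by an h4
    previous_level = None
    for heading in soup.find_all(HEADING_TAGS):
        level = int(heading.name[1])
        if previous_level is not None and level > previous_level + 1:
            issues.append(f"heading jumps from h{previous_level} to h{level}")
        previous_level = level

    return issues


print(audit_html("<h2>A</h2><p> </p><h4>B</h4><p>text</p>"))
```

Because clean_html turns non-breaking spaces into ordinary spaces, spacer paragraphs that contained only &nbsp; show up here as empty paragraphs.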

Because the heavy lifting is in Python, changes can be tested on a subset of posts first, then rolled out to an entire export with confidence.

Combined with a clear internal process for redirects and URL mapping, this kind of cleaning step helps make HubSpot-to-WordPress migrations both cleaner and more predictable from a technical SEO perspective.

