How to Clean Up HubSpot Blog Post Exports for WordPress Using Google Sheets

Why Clean Up HubSpot HTML Before Importing to WordPress?

When exporting blog posts from HubSpot to migrate them to WordPress, the HTML often contains unnecessary attributes such as inline styles, classes, and IDs. These attributes can:

Cause inconsistent styling in WordPress.
Bloat the HTML, increasing page load times.
Conflict with WordPress themes and custom CSS.

To resolve this, we developed a Python script that cleans up the exported HTML stored in a Google Sheet. This script:

Strips the style, class, and id attributes from text-level elements: p, em, blockquote, strong, b, span, and h1 through h6.
Leaves all other elements, including images and links, untouched.
Processes the whole worksheet in a single run.
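
To make the effect concrete, here is a minimal sketch of the same attribute-stripping idea applied to a single snippet. The sample markup is invented for illustration and only p and span are targeted here to keep it short; the full script in Step 3 covers the complete tag list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample of HubSpot-style markup, invented for this illustration.
sample = (
    '<p class="hs-intro" style="font-size: 14px;">Hello '
    '<span id="x1" style="color: #333;">world</span></p>'
    '<a class="cta" href="https://example.com">Read more</a>'
    '<img class="hero" src="photo.jpg" alt="Photo"/>'
)

soup = BeautifulSoup(sample, "html.parser")

# Strip style, class, and id from the targeted elements only.
for tag in soup.find_all(["p", "span"]):
    for attr in ("style", "class", "id"):
        tag.attrs.pop(attr, None)

cleaned = str(soup)
print(cleaned)
```

The paragraph and span come out bare, while the link and image keep their class attributes because they are not in the list of targeted tags.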

Step 1: Export Your Blog Posts from HubSpot

Go to your HubSpot account.
Navigate to Marketing > Website > Blog.
Click Export Blog Posts and choose the format (CSV or XLSX).
Download the file and open it in Google Sheets.
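
The cleaning script in Step 3 returns any cell longer than 5,000 characters unchanged, so it can help to check the export for long cells before going further. The sketch below is one possible way to do that with the standard library; the function name find_long_cells and the file name hubspot-export.csv are assumptions made for this example, not part of the HubSpot export.

```python
import csv
import io

def find_long_cells(rows, limit=5000):
    """Return (row_number, column_number, length) for each cell longer than limit."""
    hits = []
    for r, row in enumerate(rows, start=1):
        for c, cell in enumerate(row, start=1):
            if len(cell) > limit:
                hits.append((r, c, len(cell)))
    return hits

# In-memory stand-in for an export. With a real file, pass
# csv.reader(open("hubspot-export.csv", newline="", encoding="utf-8")) instead.
demo = io.StringIO(
    'title,body\r\nShort post,"<p>Hi</p>"\r\nLong post,"' + "x" * 6000 + '"\r\n'
)
long_cells = find_long_cells(csv.reader(demo))
print(long_cells)
```

Any cell reported here would pass through the script untouched unless the length limit in clean_html is raised.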

Step 2: Set Up Google Sheets and Install Required Python Libraries

Install the Required Libraries:

pip install gspread pandas beautifulsoup4 google-auth

Enable Google Sheets API:

Go to Google Cloud Console.
Enable the Google Sheets API.
Create a Service Account, generate a JSON key for it, and save the downloaded file as credentials.json next to the script.
Share your Google Sheet with the service account's email address and give it Editor access.
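
If you are unsure which address to share the sheet with, it is stored in the key file under the client_email field. A small sketch for reading it, assuming the key was saved as credentials.json (the helper name service_account_email is made up for this example):

```python
import json

def service_account_email(path="credentials.json"):
    """Return the client_email field from a service account key file."""
    with open(path, encoding="utf-8") as fh:
        key = json.load(fh)
    return key["client_email"]
```

Calling print(service_account_email()) from the folder that holds credentials.json should show the address to paste into the Google Sheets sharing dialog.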

Step 3: Run the Python Script to Clean the HTML

The following Python script:

Connects to Google Sheets.
Reads the HubSpot export.
Cleans up HTML content.
Saves the cleaned version in a new sheet.

The Cleaning Script

import gspread
import pandas as pd
from bs4 import BeautifulSoup
import html
import unicodedata
from google.oauth2.service_account import Credentials

# Authenticate Google Sheets
creds = Credentials.from_service_account_file("credentials.json", scopes=["https://www.googleapis.com/auth/spreadsheets"])
client = gspread.authorize(creds)

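def clean_html(html_content):
    # Leave non-string values untouched. Cells longer than 5,000 characters are
    # also returned unchanged, so raise this limit if your posts are longer.
    if not isinstance(html_content, str) or len(html_content) > 5000:
        return html_content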
    
    html_content = html.unescape(html_content)
    html_content = unicodedata.normalize("NFKC", html_content)
    html_content = html_content.encode("utf-8", "ignore").decode("utf-8")
    
    # Remove non-breaking spaces
    html_content = html_content.replace("\xa0", " ").strip()
    
    if "<" not in html_content or ">" not in html_content:
        return html_content  # Skip non-HTML content

    soup = BeautifulSoup(html_content, "html.parser")

    tags_to_clean = ["p", "em", "blockquote", "strong", "b", "span"] + [f"h{i}" for i in range(1, 7)]

    for tag in soup.find_all(tags_to_clean):
        # Remove specified attributes even if the element has text or nested elements
        for attr in ["style", "class", "id"]:
            if attr in tag.attrs:
                del tag[attr]
    
    return str(soup)

def process_google_sheet(sheet_id, input_worksheet_name, output_worksheet_name):
    sheet = client.open_by_key(sheet_id)
    
    try:
        input_worksheet = sheet.worksheet(input_worksheet_name)
    except gspread.exceptions.WorksheetNotFound:
        print(f"Worksheet '{input_worksheet_name}' not found.")
        return
    
    try:
        output_worksheet = sheet.worksheet(output_worksheet_name)
    except gspread.exceptions.WorksheetNotFound:
        print(f"Worksheet '{output_worksheet_name}' not found. Creating a new one...")
        output_worksheet = sheet.add_worksheet(title=output_worksheet_name, rows="1000", cols="20")
    
    # Get all data
    data = input_worksheet.get_all_values()
    
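    # Convert to Pandas DataFrame (the sheet's header row stays in the data as row 0)
    df = pd.DataFrame(data)

    # Apply cleaning function to all cells
    # (DataFrame.map needs pandas 2.1 or later; use df.applymap on older versions)
    df = df.map(clean_html)

    # Write cleaned data to the output worksheet. Only the values are written:
    # df.columns holds integer positions (0, 1, 2, ...), not the sheet's header.
    output_worksheet.update(df.values.tolist())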
    
    print(f"Processed data from '{input_worksheet_name}' and saved clean HTML to '{output_worksheet_name}'.")

# Example Usage
sheet_id = "YOUR_GOOGLE_SHEET_ID"
input_worksheet_name = "DirtyHTML"
output_worksheet_name = "CleanHTML"

process_google_sheet(sheet_id, input_worksheet_name, output_worksheet_name)
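
After a run, it can be useful to see which style, class, and id attributes are still present, since the script deliberately leaves elements such as div, a, and img alone. The following is a rough audit sketch using only the standard library; the class and function names are invented for this example and are not part of the script above.

```python
from collections import Counter
from html.parser import HTMLParser

class AttributeAudit(HTMLParser):
    """Count (tag, attribute) pairs for style, class, and id."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if name in ("style", "class", "id"):
                self.counts[(tag, name)] += 1

def audit_attributes(html_text):
    """Return a Counter of leftover presentational attributes per tag."""
    parser = AttributeAudit()
    parser.feed(html_text)
    parser.close()
    return parser.counts

# Hypothetical cleaned cell: the paragraph is bare, the wrapper and image are not.
leftovers = audit_attributes(
    '<div class="wrap"><p>Text</p><img class="hero" src="a.jpg"/></div>'
)
print(leftovers)
```

If a tag shows up here that you also want stripped, one option is to add it to tags_to_clean in the script.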

Step 4: Import the Cleaned Content into WordPress

Open your cleaned Google Sheet.
Copy the cleaned HTML content.
Go to WordPress Admin > Posts > Add New.
Switch to the code editor (or the Text tab in the Classic Editor) and paste the cleaned content.
Preview and publish!

To Wrap Up…

Cleaning up HubSpot blog exports is essential before importing them into WordPress. This script automates the process, ensuring your content is clean, lightweight, and WordPress-ready. By leveraging Google Sheets and Python, you can efficiently process bulk content with minimal manual effort.

💡 Need modifications or enhancements? Feel free to tweak the script to suit your workflow or get in touch! 🚀
