Why Clean Up HubSpot HTML Before Importing to WordPress?
When exporting blog posts from HubSpot to migrate them to WordPress, the HTML often contains unnecessary attributes such as inline styles, classes, and IDs. These attributes can:
Table of Contents
- Cause inconsistent styling in WordPress.
- Bloat the HTML, increasing page load times.
- Conflict with WordPress themes and custom CSS.
To resolve this, we developed a script that cleans up exported HTML within Google Sheets. This script:
- Strips unwanted attributes (
style
,class
,id
) from specific HTML elements. - Preserves all other elements, including images and links.
- Processes bulk data efficiently.
Step 1: Export Your Blog Posts from HubSpot
- Go to your HubSpot account.
- Navigate to Marketing > Website > Blog.
- Click Export Blog Posts and choose the format (CSV or XLSX).
- Download the file and open it in Google Sheets.
Step 2: Set Up Google Sheets and Install Required Python Libraries
Install the Required Libraries:
pip install gspread pandas beautifulsoup4 google-auth
Enable Google Sheets API:
- Go to Google Cloud Console.
- Enable the Google Sheets API.
- Create a Service Account and download
credentials.json
. - Share access to your Google Sheet with the service account email as an Editor.
Step 3: Run the Python Script to Clean the HTML
The following Python script:
- Connects to Google Sheets.
- Reads the HubSpot export.
- Cleans up HTML content.
- Saves the cleaned version in a new sheet.
The Cleaning Script
import gspread
import pandas as pd
from bs4 import BeautifulSoup
import html
import unicodedata
from google.oauth2.service_account import Credentials
# Authenticate Google Sheets
creds = Credentials.from_service_account_file("credentials.json", scopes=["https://www.googleapis.com/auth/spreadsheets"])
client = gspread.authorize(creds)
def clean_html(html_content):
if not isinstance(html_content, str) or len(html_content) > 5000:
return html_content # Return as-is if not valid or too large
html_content = html.unescape(html_content)
html_content = unicodedata.normalize("NFKC", html_content)
html_content = html_content.encode("utf-8", "ignore").decode("utf-8")
# Remove non-breaking spaces
html_content = html_content.replace("\xa0", " ").strip()
if "<" not in html_content or ">" not in html_content:
return html_content # Skip non-HTML content
soup = BeautifulSoup(html_content, "html.parser")
tags_to_clean = ["p", "em", "blockquote", "strong", "b", "span"] + [f"h{i}" for i in range(1, 7)]
for tag in soup.find_all(tags_to_clean):
# Remove specified attributes even if the element has text or nested elements
for attr in ["style", "class", "id"]:
if attr in tag.attrs:
del tag[attr]
return str(soup)
def process_google_sheet(sheet_id, input_worksheet_name, output_worksheet_name):
sheet = client.open_by_key(sheet_id)
try:
input_worksheet = sheet.worksheet(input_worksheet_name)
except gspread.exceptions.WorksheetNotFound:
print(f"Worksheet '{input_worksheet_name}' not found.")
return
try:
output_worksheet = sheet.worksheet(output_worksheet_name)
except gspread.exceptions.WorksheetNotFound:
print(f"Worksheet '{output_worksheet_name}' not found. Creating a new one...")
output_worksheet = sheet.add_worksheet(title=output_worksheet_name, rows="1000", cols="20")
# Get all data
data = input_worksheet.get_all_values()
# Convert to Pandas DataFrame
df = pd.DataFrame(data)
# Apply cleaning function to all cells
df = df.map(clean_html)
# Write cleaned data to the output worksheet
output_worksheet.update([df.columns.values.tolist()] + df.values.tolist())
print(f"Processed data from '{input_worksheet_name}' and saved clean HTML to '{output_worksheet_name}'.")
# Example Usage
sheet_id = "YOUR_GOOGLE_SHEET_ID"
input_worksheet_name = "DirtyHTML"
output_worksheet_name = "CleanHTML"
process_google_sheet(sheet_id, input_worksheet_name, output_worksheet_name)
Step 4: Import the Cleaned Content into WordPress
- Open your cleaned Google Sheet.
- Copy the cleaned HTML content.
- Go to WordPress Admin > Posts > Add New.
- Switch to the HTML view and paste the cleaned content.
- Preview and publish!
To Wrap Up…
Cleaning up HubSpot blog exports is essential before importing them into WordPress. This script automates the process, ensuring your content is clean, lightweight, and WordPress-ready. By leveraging Google Sheets and Python, you can efficiently process bulk content with minimal manual effort.
💡 Need modifications or enhancements? Feel free to tweak the script to suit your workflow or get in touch! 🚀
Leave a Reply