In any website migration project, accurately mapping old URLs to new ones is critical for maintaining search engine rankings and minimising disruption. The following Python script can streamline this process by comparing page titles or content for similarity and generating a comprehensive URL mapping file.
Script Overview: Page Title Matching for URL Mapping
This Python script is designed to match old URLs to new URLs based on page titles using fuzzy matching techniques. By automating this process, it saves time and reduces the risk of manual errors during SEO migrations.
Key Features
- Fetch Page Titles: uses `requests` and `BeautifulSoup` to extract page titles from the provided URLs.
- Fuzzy Matching: leverages the `rapidfuzz` library to calculate similarity scores between old and new page titles.
- CSV Input/Output: reads old and new URLs from a CSV file and generates a mapped output file with match scores.
- Custom Threshold: allows users to set a similarity threshold for better control over matches.
Code Implementation
Below is the script for page title-based URL mapping:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from rapidfuzz import fuzz


def fetch_page_title(url):
    """
    Fetches the page title for a given URL.
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # soup.title.string is None for an empty <title>, so guard both.
        title = soup.title.string.strip() if soup.title and soup.title.string else None
        return title
    except Exception as e:
        print(f"Error fetching title for {url}: {e}")
        return None


def map_urls_by_titles(old_urls, new_urls, threshold=80):
    """
    Maps old URLs to new URLs based on their page titles using fuzzy matching.

    Parameters:
        old_urls (list): List of old URLs to map from.
        new_urls (list): List of new URLs to map to.
        threshold (int): Minimum similarity score to consider a match (0-100).

    Returns:
        DataFrame: A mapping of old URLs to new URLs based on page titles and match scores.
    """
    old_titles = {url: fetch_page_title(url) for url in old_urls}
    new_titles = {url: fetch_page_title(url) for url in new_urls}

    # Create mapping by fuzzy matching titles
    mappings = []
    for old_url, old_title in old_titles.items():
        best_match = None
        highest_score = 0
        for new_url, new_title in new_titles.items():
            if old_title and new_title:
                # Calculate similarity score
                score = fuzz.ratio(old_title, new_title)
                if score > highest_score:  # Update best match if score is higher
                    highest_score = score
                    best_match = new_url

        # Only report a match (and its title) if it clears the threshold
        matched = best_match if highest_score >= threshold else None
        mappings.append({
            "Old URL": old_url,
            "Old Title": old_title,
            "New URL": matched,
            "New Title": new_titles.get(matched) if matched else None,
            "Match Score": highest_score
        })

    return pd.DataFrame(mappings)


def read_urls_from_csv(file_path):
    """
    Reads old and new URLs from a CSV file.
    The CSV should have two columns: 'Old URL' and 'New URL'.
    """
    try:
        data = pd.read_csv(file_path)
        old_urls = data['Old URL'].dropna().tolist()
        new_urls = data['New URL'].dropna().tolist()
        return old_urls, new_urls
    except Exception as e:
        print(f"Error reading URLs from CSV: {e}")
        return [], []


if __name__ == "__main__":
    # Input and output file paths
    input_csv = "urls.csv"  # Replace with your CSV file path
    output_csv = "url_mapping.csv"

    # Read URLs from the CSV file
    old_urls, new_urls = read_urls_from_csv(input_csv)

    if not old_urls or not new_urls:
        print("No URLs found in the input file. Please check the CSV format.")
    else:
        # Generate the URL mapping
        url_mapping = map_urls_by_titles(old_urls, new_urls, threshold=80)

        # Save the mapping to a CSV file
        url_mapping.to_csv(output_csv, index=False)
        print(f"URL mapping saved to {output_csv}")
```
Steps to Use the Script
- Install Required Libraries: `pip install requests beautifulsoup4 pandas rapidfuzz`
- Prepare Input CSV: the input CSV should have two columns, `Old URL` and `New URL`:

```
Old URL,New URL
https://oldsite.com/page1,https://newsite.com/page-a
https://oldsite.com/page2,https://newsite.com/page-b
```

- Run the Script: save the script as `seo_migration.py` and execute it with `python seo_migration.py`.
- Review the Output: the script generates a `url_mapping.csv` file with columns for the old URL, old title, new URL, new title, and match score.
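If you prefer to generate the input file programmatically rather than by hand, a small pandas snippet will do (the URLs below are placeholders, not real pages):

```python
import pandas as pd

# Placeholder URLs; swap in your own crawl exports.
rows = {
    "Old URL": ["https://oldsite.com/page1", "https://oldsite.com/page2"],
    "New URL": ["https://newsite.com/page-a", "https://newsite.com/page-b"],
}
pd.DataFrame(rows).to_csv("urls.csv", index=False)
```

In practice the URL lists usually come from a crawl of the old site and a crawl of the staging site, exported to CSV.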
Note: if the output is missing new URLs, it is usually for one of two reasons:
- Threshold Too High: if the best similarity score does not meet the threshold (default 80), the New URL is left as None.
- Titles Not Found: if the script fails to fetch page titles for some URLs (due to timeouts, incorrect URLs, or empty titles), matching cannot proceed for those rows.
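Once the mapping is generated, rows that fell below the threshold can be pulled out for manual review. A minimal sketch, using a hypothetical mapping frame that mirrors the script's output columns:

```python
import pandas as pd

# A tiny hypothetical mapping, mirroring the script's output columns.
mapping = pd.DataFrame({
    "Old URL": ["https://oldsite.com/page1", "https://oldsite.com/page2"],
    "New URL": ["https://newsite.com/page-a", None],
    "Match Score": [95, 42],
})

# Rows with no New URL did not clear the threshold and need manual redirects.
unmatched = mapping[mapping["New URL"].isna()]
print(f"{len(unmatched)} of {len(mapping)} old URLs need manual review")
```

These unmatched rows are the ones worth fixing by hand before the redirect map goes live.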
Extending the Script for Content Matching
To go beyond page titles, the script can also compare other on-page elements such as meta descriptions, H1s, body text, and image alt attributes. Below is a brief overview of this extended functionality:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from rapidfuzz import fuzz


def fetch_page_content(url):
    """
    Fetches on-page content for a given URL, including title, meta description,
    H1, body text, and image alt attributes.
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract relevant content, guarding against missing elements
        title = soup.title.string.strip() if soup.title and soup.title.string else None
        meta_tag = soup.find("meta", attrs={"name": "description"})
        meta_description = meta_tag.get("content", "").strip() if meta_tag else None
        h1 = soup.find("h1")
        h1 = h1.text.strip() if h1 else None
        body_text = " ".join(p.text.strip() for p in soup.find_all("p"))
        images = [img['alt'] for img in soup.find_all("img", alt=True)]

        return {
            "title": title,
            "meta_description": meta_description,
            "h1": h1,
            "body_text": body_text,
            "images": images
        }
    except Exception as e:
        print(f"Error fetching content for {url}: {e}")
        return None


def calculate_similarity(old_content, new_content):
    """
    Calculates a similarity score between two sets of page content.
    """
    total_score = 0
    components = ['title', 'meta_description', 'h1', 'body_text']

    # Compare textual components
    for component in components:
        old = old_content.get(component, "")
        new = new_content.get(component, "")
        if old and new:
            total_score += fuzz.ratio(old, new)

    # Compare images (using alt text similarity)
    old_images = old_content.get("images", [])
    new_images = new_content.get("images", [])
    image_score = sum(
        max(fuzz.ratio(old_img, new_img) for new_img in new_images)
        for old_img in old_images
    ) if old_images and new_images else 0
    total_score += image_score

    return total_score


def map_urls_by_content(old_urls, new_urls, threshold=70):
    """
    Maps old URLs to new URLs based on the closest match of on-page content.

    Parameters:
        old_urls (list): List of old URLs to map from.
        new_urls (list): List of new URLs to map to.
        threshold (int): Minimum similarity score to consider a match.

    Returns:
        DataFrame: A mapping of old URLs to new URLs based on content similarity scores.
    """
    old_contents = {url: fetch_page_content(url) for url in old_urls}
    new_contents = {url: fetch_page_content(url) for url in new_urls}

    mappings = []
    for old_url, old_content in old_contents.items():
        best_match = None
        highest_score = 0
        for new_url, new_content in new_contents.items():
            if old_content and new_content:
                score = calculate_similarity(old_content, new_content)
                if score > highest_score:
                    highest_score = score
                    best_match = new_url

        # Add match details to the mapping
        mappings.append({
            "Old URL": old_url,
            "New URL": best_match if highest_score >= threshold else None,
            "Similarity Score": highest_score
        })

    return pd.DataFrame(mappings)


def read_urls_from_csv(file_path):
    """
    Reads old and new URLs from a CSV file.
    The CSV should have two columns: 'Old URL' and 'New URL'.
    """
    try:
        data = pd.read_csv(file_path)
        old_urls = data['Old URL'].dropna().tolist()
        new_urls = data['New URL'].dropna().tolist()
        return old_urls, new_urls
    except Exception as e:
        print(f"Error reading URLs from CSV: {e}")
        return [], []


if __name__ == "__main__":
    # Input and output file paths
    input_csv = "urls.csv"  # Replace with your CSV file path
    output_csv = "url_mapping.csv"

    # Read URLs from the CSV file
    old_urls, new_urls = read_urls_from_csv(input_csv)

    if not old_urls or not new_urls:
        print("No URLs found in the input file. Please check the CSV format.")
    else:
        # Generate the URL mapping
        url_mapping = map_urls_by_content(old_urls, new_urls, threshold=70)

        # Save the mapping to a CSV file
        url_mapping.to_csv(output_csv, index=False)
        print(f"URL mapping saved to {output_csv}")
```
By analysing these additional components, SEO specialists can ensure even closer matches between old and new URLs during migrations.
Benefits of Automation in SEO Migrations
- Time Efficiency: Automates tedious URL comparisons.
- Accuracy: Reduces human errors by using data-driven methods.
- Customisation: Allows for adjustable thresholds and comparison elements.
Incorporating this script into your SEO workflow can significantly enhance the efficiency and accuracy of your migrations, ensuring a smooth transition and minimal impact on rankings.