
URL Redirect Mapping with Python for Website Migrations


In any SEO and website migration project, accurately mapping old URLs to new ones is critical for maintaining search engine rankings and minimising disruptions. The following Python script can streamline this process by comparing page titles or content for similarity and generating a comprehensive URL mapping file.

Script Overview: Page Title Matching for URL Mapping

This Python script is designed to match old URLs to new URLs based on page titles using fuzzy matching techniques. By automating this process, it saves time and reduces the risk of manual errors during SEO migrations.

Key Features

  • Fetch Page Titles: Uses requests and BeautifulSoup to extract page titles from provided URLs.
  • Fuzzy Matching: Leverages the rapidfuzz library to calculate similarity scores between old and new page titles.
  • CSV Input/Output: Reads old and new URLs from a CSV file and generates a mapped output file with match scores.
  • Custom Threshold: Allows users to set a similarity threshold for better control over matches.

Code Implementation

Below is the script for page title-based URL mapping:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from rapidfuzz import fuzz

def fetch_page_title(url):
    """
    Fetches the page title for a given URL.
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Guard against both a missing <title> tag and an empty one
        title = soup.title.string.strip() if soup.title and soup.title.string else None
        return title
    except Exception as e:
        print(f"Error fetching title for {url}: {e}")
        return None

def map_urls_by_titles(old_urls, new_urls, threshold=80):
    """
    Maps old URLs to new URLs based on their page titles using fuzzy matching.
    
    Parameters:
        old_urls (list): List of old URLs to map from.
        new_urls (list): List of new URLs to map to.
        threshold (int): Minimum similarity score to consider a match (0-100).
    
    Returns:
        DataFrame: A mapping of old URLs to new URLs based on page titles and match scores.
    """
    old_titles = {url: fetch_page_title(url) for url in old_urls}
    new_titles = {url: fetch_page_title(url) for url in new_urls}

    # Create mapping by fuzzy matching titles
    mappings = []
    for old_url, old_title in old_titles.items():
        best_match = None
        highest_score = 0

        for new_url, new_title in new_titles.items():
            if old_title and new_title:
                # Calculate similarity score
                score = fuzz.ratio(old_title, new_title)
                if score > highest_score:  # Update best match if score is higher
                    highest_score = score
                    best_match = new_url

        # Add match details to the mapping
        mappings.append({
            "Old URL": old_url,
            "Old Title": old_title,
            "New URL": best_match if highest_score >= threshold else None,
            "New Title": new_titles.get(best_match, None) if best_match else None,
            "Match Score": highest_score
        })

    return pd.DataFrame(mappings)

def read_urls_from_csv(file_path):
    """
    Reads old and new URLs from a CSV file.

    The CSV should have two columns: 'Old URL' and 'New URL'.
    """
    try:
        data = pd.read_csv(file_path)
        old_urls = data['Old URL'].dropna().tolist()
        new_urls = data['New URL'].dropna().tolist()
        return old_urls, new_urls
    except Exception as e:
        print(f"Error reading URLs from CSV: {e}")
        return [], []

if __name__ == "__main__":
    # Input and output file paths
    input_csv = "urls.csv"  # Replace with your CSV file path
    output_csv = "url_mapping.csv"

    # Read URLs from the CSV file
    old_urls, new_urls = read_urls_from_csv(input_csv)

    if not old_urls or not new_urls:
        print("No URLs found in the input file. Please check the CSV format.")
    else:
        # Generate the URL mapping
        url_mapping = map_urls_by_titles(old_urls, new_urls, threshold=80)

        # Save the mapping to a CSV file
        url_mapping.to_csv(output_csv, index=False)
        print(f"URL mapping saved to {output_csv}")

Steps to Use the Script

  1. Install Required Libraries: pip install requests beautifulsoup4 pandas rapidfuzz
  2. Prepare Input CSV: The input CSV should have two columns, Old URL and New URL:

     Old URL,New URL
     https://oldsite.com/page1,https://newsite.com/page-a
     https://oldsite.com/page2,https://newsite.com/page-b
  3. Run the Script: Save the script as seo_migration.py and execute it: python seo_migration.py
  4. Review the Output: The script generates a url_mapping.csv file with columns for old URL, old title, new URL, new title, and match score.
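If you prefer to generate the input file programmatically, the standard library's csv module can produce a file in the expected shape. The URL pairs below are placeholders:

```python
import csv

# Hypothetical example rows; replace with your real old/new URL pairs
rows = [
    ("https://oldsite.com/page1", "https://newsite.com/page-a"),
    ("https://oldsite.com/page2", "https://newsite.com/page-b"),
]

with open("urls.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Old URL", "New URL"])  # column headers the script expects
    writer.writerows(rows)
```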

Note: If the output is missing URLs, it might be due to:

  1. Threshold Too High: If the similarity score does not meet the threshold (default is 80), the New URL will be None.
  2. Titles Not Found: If the script fails to fetch page titles for the URLs (due to errors like timeouts, incorrect URLs, or empty titles), the matching cannot proceed.
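To surface the rows that need manual review after a run, the unmatched entries can be filtered out of the output with pandas. Shown here with an inline stand-in for pd.read_csv("url_mapping.csv"); the column names match the script's output:

```python
import pandas as pd

# Stand-in for pd.read_csv("url_mapping.csv"); structure matches the script's output
url_mapping = pd.DataFrame({
    "Old URL": ["https://oldsite.com/page1", "https://oldsite.com/page2"],
    "Old Title": ["Page One", "Page Two"],
    "New URL": ["https://newsite.com/page-a", None],
    "New Title": ["Page A", None],
    "Match Score": [95, 42],
})

# Rows with no New URL fell below the threshold or failed to fetch a title
unmatched = url_mapping[url_mapping["New URL"].isna()]
print(unmatched[["Old URL", "Match Score"]])
```

These rows are the ones to map by hand before finalising the redirect file.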

Extending the Script for Content Matching

To go beyond page titles, the script can also compare other on-page elements such as meta descriptions, H1s, body text, and image alt attributes. Below is a brief overview of this extended functionality:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from rapidfuzz import fuzz

def fetch_page_content(url):
    """
    Fetches on-page content for a given URL, including title, meta description, H1s, body text, and images.
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract relevant content
        # Guard against both a missing <title> tag and an empty one
        title = soup.title.string.strip() if soup.title and soup.title.string else None
        meta_description = soup.find("meta", attrs={"name": "description"})
        meta_description = meta_description["content"].strip() if meta_description else None
        h1 = soup.find("h1")
        h1 = h1.text.strip() if h1 else None
        body_text = " ".join([p.text.strip() for p in soup.find_all("p")])
        images = [img['alt'] for img in soup.find_all("img", alt=True)]

        return {
            "title": title,
            "meta_description": meta_description,
            "h1": h1,
            "body_text": body_text,
            "images": images
        }
    except Exception as e:
        print(f"Error fetching content for {url}: {e}")
        return None

def calculate_similarity(old_content, new_content):
    """
    Calculates an average similarity score (0-100) between two sets of page content.
    """
    scores = []
    components = ['title', 'meta_description', 'h1', 'body_text']

    # Compare textual components
    for component in components:
        old = old_content.get(component, "")
        new = new_content.get(component, "")
        if old and new:
            scores.append(fuzz.ratio(old, new))

    # Compare images: for each old alt text, take its best match among the new alts
    old_images = old_content.get("images", [])
    new_images = new_content.get("images", [])
    if old_images and new_images:
        image_scores = [
            max(fuzz.ratio(old_img, new_img) for new_img in new_images)
            for old_img in old_images
        ]
        scores.append(sum(image_scores) / len(image_scores))

    # Averaging keeps the final score on a 0-100 scale,
    # so it can be compared directly against the threshold
    return sum(scores) / len(scores) if scores else 0

def map_urls_by_content(old_urls, new_urls, threshold=70):
    """
    Maps old URLs to new URLs based on the closest match of on-page content.
    
    Parameters:
        old_urls (list): List of old URLs to map from.
        new_urls (list): List of new URLs to map to.
        threshold (int): Minimum similarity score to consider a match.

    Returns:
        DataFrame: A mapping of old URLs to new URLs based on content similarity scores.
    """
    old_contents = {url: fetch_page_content(url) for url in old_urls}
    new_contents = {url: fetch_page_content(url) for url in new_urls}

    mappings = []
    for old_url, old_content in old_contents.items():
        best_match = None
        highest_score = 0

        for new_url, new_content in new_contents.items():
            if old_content and new_content:
                score = calculate_similarity(old_content, new_content)
                if score > highest_score:
                    highest_score = score
                    best_match = new_url

        # Add match details to the mapping
        mappings.append({
            "Old URL": old_url,
            "New URL": best_match if highest_score >= threshold else None,
            "Similarity Score": highest_score
        })

    return pd.DataFrame(mappings)

def read_urls_from_csv(file_path):
    """
    Reads old and new URLs from a CSV file.

    The CSV should have two columns: 'Old URL' and 'New URL'.
    """
    try:
        data = pd.read_csv(file_path)
        old_urls = data['Old URL'].dropna().tolist()
        new_urls = data['New URL'].dropna().tolist()
        return old_urls, new_urls
    except Exception as e:
        print(f"Error reading URLs from CSV: {e}")
        return [], []

if __name__ == "__main__":
    # Input and output file paths
    input_csv = "urls.csv"  # Replace with your CSV file path
    output_csv = "url_mapping.csv"

    # Read URLs from the CSV file
    old_urls, new_urls = read_urls_from_csv(input_csv)

    if not old_urls or not new_urls:
        print("No URLs found in the input file. Please check the CSV format.")
    else:
        # Generate the URL mapping
        url_mapping = map_urls_by_content(old_urls, new_urls, threshold=70)

        # Save the mapping to a CSV file
        url_mapping.to_csv(output_csv, index=False)
        print(f"URL mapping saved to {output_csv}")

By analysing these additional components, SEO specialists can ensure even closer matches between old and new URLs during migrations.

Benefits of Automation in SEO Migrations

  • Time Efficiency: Automates tedious URL comparisons.
  • Accuracy: Reduces human errors by using data-driven methods.
  • Customisation: Allows for adjustable thresholds and comparison elements.

Incorporating this script into your SEO workflow can significantly enhance the efficiency and accuracy of your migrations, ensuring a smooth transition and minimal impact on rankings.
