Detecting low-contrast text in websites using Machine Learning: a guide

By Jérôme Renaux
February 1st, 2021
8 minutes
AI, Artificial Intelligence, Machine Learning, Low-contrast text

In the context of our collaboration with Belgium’s Federal Public Service on Policy and Support, we have been exploring how to leverage Machine Learning to identify accessibility issues on Belgium’s public websites. One such issue is the presence of low-contrast text, which we will focus on in this piece. Low-contrast text is an important accessibility issue because it can significantly hamper readability for people with impaired eyesight or users of text-to-speech assistive technology.

Specifically, the problem statement is the following: given an arbitrary web page, how can we automatically list all the locations where low-contrast text (as defined in the Web Content Accessibility Guidelines) is present? This should work for several common situations:

  • Normal text on top of a uniform background
  • Normal text on top of a background that is an image
  • Text that is part of an image

An example of the latter is the following banner from the Belgian federal government’s website:

Low-contrast text on a website
https://www.federale-regering.be/nl

In this example, the image contains text that we want to detect to assess if its contrast is sufficient.

The following sections will walk you through the different steps involved in solving this problem in Python and discuss the challenges encountered along the way. The main steps are the following:

  • Capturing the web page
  • Finding all the text on the page
  • Computing the contrast

Capturing the web page

The first step consists of deciding what representation of a web page to work on. Two possibilities exist:

  1. Working with the source code of the page
  2. Capturing a visual representation of the entire web page

Option 1 offers a lot of flexibility in navigating the hierarchy of the elements present on the page. The main drawback is that this option isn’t of much help in identifying text present within images.

Given the constraint of handling images and detecting the text inside them, we decided to go with option 2 and use a single image, a screenshot of the web page, as the input for our text detection step.

A screenshot of a web page can easily be obtained by using the Selenium library (and its Python integration). Selenium makes it possible to simulate a browser’s behaviour on a specific web page and render the entire page as your browser would, including HTML, stylesheets and dynamic content. The code below shows how Selenium can be used to generate a screenshot of a web page.

from selenium import webdriver

chrome_options = webdriver.chrome.options.Options()
chrome_options.add_argument("--headless")

# Start the web driver, we choose Chrome in this example
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)  # `url` holds the address of the page to capture

# Take a screenshot, adjusting the size to capture the full length
# of the page
original_size = driver.get_window_size()
required_height = driver.execute_script("return document.body.parentNode.scrollHeight")
driver.set_window_size(original_size["width"], required_height)
driver.save_screenshot(path)  # `path` is the output file for the screenshot

# Close the web driver
driver.close()

At this point, we are now in possession of one big image representing the entire web page. We now need to detect all the text on that image.

Finding all the text on the page

In the course of development, we explored two main approaches for this step:

  • Text detection: detect where text is located on the page, without attempting to read it. In this case, the output is only a set of coordinates for each detected text element, e.g. {x: 12, y: 25, width: 30, height: 12}
  • Text recognition: detect where text is located on the page and identify what is written. In addition to the text coordinates, the output now includes a string representation of the text, e.g. {x: 12, y: 25, width: 30, height: 12, text: "Lorem ipsum"} (see the sketch after this list)
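
To make the distinction concrete, both outputs can be captured by a single data structure such as the hypothetical TextElement sketched below; this structure is only an illustration and is not part of our implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TextElement:
    """A detected text element on the page (illustrative structure only)."""
    x: int
    y: int
    width: int
    height: int
    text: Optional[str] = None  # Only filled in by the text recognition approach

# Text detection output
box = TextElement(x=12, y=25, width=30, height=12)
# Text recognition output
box = TextElement(x=12, y=25, width=30, height=12, text="Lorem ipsum")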

In this section, we will discuss both approaches, as they both presented interesting challenges. Ultimately, your specific use case should determine which one you follow, based on whether you need string representations of the text or not. Note, however, that in our experience, the model used for text detection ended up detecting more text than the text recognition approach, which motivated us to stick to the former.

Text detection

We achieved text detection by using a pre-trained text detection model called EAST. The model artefact can be loaded in OpenCV and used to generate a list of bounding boxes indicating the detected words’ coordinates. Our implementation is adapted from a publicly available guide that provides a very good example of using the model in practice.
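
For illustration, below is a minimal sketch of how the EAST model can be loaded and queried with OpenCV. It assumes the frozen model file frozen_east_text_detection.pb (available online) has been downloaded, ignores the rotation angles that the model also predicts, and is a simplified example rather than our exact implementation.

import cv2
import numpy as np

def detect_text_boxes(img: np.ndarray, conf_threshold: float = 0.5) -> list:
    """Detect text with EAST and return axis-aligned boxes (x, y, w, h)."""
    orig_h, orig_w = img.shape[:2]
    # EAST expects input dimensions that are multiples of 32
    new_w, new_h = max(32, (orig_w // 32) * 32), max(32, (orig_h // 32) * 32)
    ratio_w, ratio_h = orig_w / new_w, orig_h / new_h

    net = cv2.dnn.readNet("frozen_east_text_detection.pb")
    blob = cv2.dnn.blobFromImage(
        img, 1.0, (new_w, new_h), (123.68, 116.78, 103.94),
        swapRB=True, crop=False
    )
    net.setInput(blob)
    scores, geometry = net.forward([
        "feature_fusion/Conv_7/Sigmoid",  # text / no-text scores
        "feature_fusion/concat_3",        # box geometry
    ])

    boxes, confidences = [], []
    num_rows, num_cols = scores.shape[2:4]
    for y in range(num_rows):
        for x in range(num_cols):
            score = float(scores[0, 0, y, x])
            if score < conf_threshold:
                continue
            # The output maps are 4x smaller than the network input;
            # geometry holds the distances from (x, y) to the box edges
            top, right, bottom, left = (float(geometry[0, i, y, x]) for i in range(4))
            end_x, end_y = int(x * 4.0 + right), int(y * 4.0 + bottom)
            w, h = int(left + right), int(top + bottom)
            boxes.append([
                int((end_x - w) * ratio_w), int((end_y - h) * ratio_h),
                int(w * ratio_w), int(h * ratio_h),
            ])
            confidences.append(score)

    # Merge heavily overlapping detections with non-maximum suppression
    indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, 0.4)
    return [boxes[i] for i in np.array(indices).flatten()]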

In our experience, the detection performance is very high. In many cases, all the text on a page is successfully detected (no false negatives). On the other hand, some non-text elements are occasionally detected as well (false positives). The trade-off between the two can be controlled by adjusting the model’s confidence threshold depending on your requirements.

Text recognition

Text recognition on an image is a prevalent task called Optical Character Recognition (OCR). We explored several Python implementations of this technology and decided to go for Tesseract and its Python wrapper Pytesseract. One line of code suffices to run it, and it returns the coordinates of the bounding boxes of each text element, as well as a string representation of the text and a confidence level.

from typing import Any, List

import numpy as np
import pytesseract as tess

def find_boxes(img: np.ndarray) -> List[Any]:
    """Find the bounding boxes of text in an image."""
    data = tess.image_to_data(
        img,
        config="--oem 3",
        output_type=tess.Output.DICT
    )
    boxes = list(zip(
        data["text"],
        data["left"], data["top"],
        data["width"], data["height"],
        data["conf"]
    ))
    return boxes

The main downside we experienced is that the detection performance tends to be significantly lower than with the previous approach, especially for colored text on unusual backgrounds. The models running under the hood of Tesseract seem to have been mostly trained on clean, black-on-white text (with the purpose of automating the digitization of printed documents), and therefore fail to detect wilder instances of text. This is particularly problematic in our case, since unusual combinations of colored text on colored backgrounds (which are likely to exhibit low contrast) are precisely what we want to find.

The performance can be improved in several ways. One way is to increase the resolution of the screenshot by adding an extra option to the code used when taking it:

chrome_options.add_argument(f"--force-device-scale-factor={scale}")

The scale variable can be used to control the resolution and can be, for example, set to 2 to obtain retina-like resolution. Note, however, that the larger the image, the slower all approaches presented here will run.

Another way to help the OCR process is to generate several variants of the initial screenshot and submit them to the OCR before merging the results. In the code example below, we perform text recognition three times:

  • Once on the original image (img)
  • Once on an altered version of the image aimed at enhancing contrast (cv2.convertScaleAbs(img, alpha=3, beta=0))
  • Once on a color-inverted version of the image (255 - img)

Other variants, such as grayscale, are of course possible. A simplified version of the code would look like the following:

variants = [
    img,
    cv2.convertScaleAbs(img, alpha=3, beta=0),
    (255 - img)
]
all_boxes = []
for variant in variants:
    boxes = find_boxes(variant)
    all_boxes.extend(boxes)

Note that much of the text will be detected in more than one variant, leading to many duplicates. Unfortunately, from one image variant to another, the same text might be recognized at slightly different coordinates, complicating the process of weeding out duplicates.

We came up with a relatively simple solution for this problem:

  • For each text element, compute a simple hash of the bounding box by mapping the box to the sum of its coordinates. This allows each box to be efficiently represented by a single number (with a relatively low likelihood of collisions)
  • For any given pair of text elements, compare their hashes and consider them duplicates if the difference between the hashes is lower than a certain threshold. This accounts for the fact that duplicates will have very similar, but not identical, bounding box coordinates.

The full code for handling image variants, including duplicate filtering, is shown below.

from collections import defaultdict
from typing import Tuple

import cv2

def hash_box(box: Tuple[int, int, int, int]) -> int:
    """Map a bounding box to a more or less unique integer."""
    return sum(box)  # Sum of the coordinates

variants = [
    img,
    cv2.convertScaleAbs(img, alpha=3, beta=0),
    (255 - img)
]
hashes = defaultdict(list)  # type: ignore

all_boxes = []
for variant in variants:
    boxes = find_boxes(variant)
    for box in boxes:
        # Hash the coordinates only; box = (text, left, top, width, height, conf)
        box_hash = hash_box(box[1:-1])
        found = False
        for h in hashes[box[0]]:
            if abs(h - box_hash) < 10:
                # Found a similar word with a close hash: stop and skip
                found = True
                break
        if not found:
            all_boxes.append(box)
            hashes[box[0]].append(box_hash)  # Register the hash under this word

While this approach improves recognition performance, having to process multiple image variants incurs a cost in running time. Moreover, the detection rate never got remotely close to that of the EAST model described above.

Computing contrast

Regardless of the approach you adopt, the previous step should result in a list of bounding boxes indicating the text elements’ locations on the page. The last step is to compute the contrast between text and background for each bounding box and list the instances of low-contrast text. The image below shows an example of a detected low-contrast text element:

Low-contrast text example

Specifically, given one such bounding box, the output should be a value indicating the contrast ratio between the foreground and background colors.

The formula for computing the contrast ratio is well documented in the WCAG. The code below shows our Python implementation:

def calculate_contrast_ratio(color1: np.ndarray, color2: np.ndarray) -> float:
    """Compute the WCAG contrast ratio between two RGB colors."""

    def linearize(rgb_value: float) -> float:
        """Linearize a gamma-compressed RGB value."""
        index = rgb_value / 255.0
        if index < 0.03928:
            return index / 12.92
        else:
            return ((index + 0.055) / 1.055) ** 2.4

    def relative_luminance(rgb: np.ndarray) -> float:
        """Calculate the relative luminance of a color."""
        scaled = [
            0.2126 * linearize(rgb[0]),
            0.7152 * linearize(rgb[1]),
            0.0722 * linearize(rgb[2]),
        ]
        return sum(scaled)

    # Compare the luminances themselves (rather than raw channel sums)
    # to reliably determine which color is the lighter one
    lum1 = relative_luminance(color1)
    lum2 = relative_luminance(color2)
    light_lum, dark_lum = max(lum1, lum2), min(lum1, lum2)
    return (light_lum + 0.05) / (dark_lum + 0.05)

This algorithm requires two colors as input: the foreground and background colors. However, in the example above, multiple colors are present due to anti-aliasing: the black of the text, the gray of the background, and several shades of black to gray around the text. How do we know which two colors to use out of the several present?

The heuristic we applied consists of picking the two most frequent colors in the cropped bounding box. In practice, this heuristic works very well and reliably yields the foreground and background colors. It doesn’t always make it possible to tell which of the two colors is the background and which one is the foreground (depending on the size of the font and the size of the cropped background). This, however, is not an issue, as the code above treats both colors interchangeably. The code below shows how the two most common colors are efficiently identified.

from collections import Counter
from typing import List, Optional

def most_common_colors(a: np.ndarray) -> List[Optional[np.ndarray]]:
    """Identify the two most common colors in part of an image."""

    def _condense_colors(a: np.ndarray) -> np.ndarray:
        # Cast up front so combining the channels cannot overflow uint8
        r, g, b = map(np.transpose, a.astype(np.uint32).T)
        return r * 65536 + g * 256 + b

    def _uncondense_color(color: np.ndarray) -> np.ndarray:
        r, remainder = np.divmod(color, 65536)
        return np.array((r,) + np.divmod(remainder, 256))

    # Convert the RGB channels into one, count color
    # occurrences, and pick the two most common ones
    single_channel = _condense_colors(a)
    counts = Counter(single_channel.flatten())
    colors = [v[0] for v in counts.most_common(2)]
    if len(colors) < 2:
        # Handle edge case where the crop contains only one color
        colors += [None] * (2 - len(colors))
    return [_uncondense_color(color) if color is not None else None
            for color in colors]

With that code in hand, you have all the ingredients needed to compute the contrast of the identified text boxes.
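
To illustrate how the pieces fit together, here is a minimal sketch of the final check. It assumes that img is the screenshot as a NumPy array in RGB channel order and that boxes is a list of (x, y, width, height) tuples produced by the detection step; both names are placeholders rather than part of the code shown earlier.

WCAG_THRESHOLD = 4.5  # WCAG AA minimum contrast ratio for normal-size text

low_contrast_boxes = []
for (x, y, w, h) in boxes:
    # Crop the screenshot to the detected text element
    crop = img[y:y + h, x:x + w]
    # Estimate the foreground and background as the two most common colors
    color1, color2 = most_common_colors(crop)
    if color2 is None:
        continue  # The crop contains a single color: nothing to compare
    ratio = calculate_contrast_ratio(color1, color2)
    if ratio < WCAG_THRESHOLD:
        low_contrast_boxes.append((x, y, w, h, ratio))

Note that the 4.5:1 threshold applies to normal-size text; WCAG allows a lower 3:1 ratio for large text, so you may want to adjust the threshold based on the height of the detected box.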

Wrapping it up

In this post, we explained how we tackled the problem of automatically listing all the locations where low-contrast text occurs on an arbitrary web page in Python. Our approach combines several well-known technologies, such as headless browser rendering with Selenium, OCR with Pytesseract, and text detection with a dedicated neural network architecture called EAST, together with our own heuristics, into an end-to-end solution. Depending on your specific requirements, many variations of this solution are possible. We hope that by describing one such approach, you will be better able to make the right decisions for your own project on low-contrast text.
