AI-Generated imagery in digital and print media for Bonnier News
Is it feasible for current image generation models to produce high-quality, photorealistic visual content suitable for both print in glossy magazines and digital publishing?
This was the key question we investigated at the end of 2023 in collaboration with one of Sweden’s leading news organizations and the Nordic region’s largest media group – Bonnier News.
Historically, the commercial operations of Bonnier News relied on stock imagery for ad creation. This presented two primary challenges:
1. Locating relevant stock images could be time-consuming, sometimes taking over an hour.
2. The aesthetic of stock photos often missed the mark: nuances in emotional expression, attire, and interior decor leaned towards an American style rather than reflecting Scandinavian authenticity.
This situation presented a perfect opportunity for generative AI since modern text-to-image models have the potential to generate images that are well-aligned with the needs of a graphic designer. However, early into the project, it became clear that even the most capable text-to-image models were not universally adept at producing high-quality, photorealistic images suitable for all scenarios. This led us to refine our research question:
In this blog post, we discuss the framework we used to answer this question and provide a Google Colab notebook with Python code for an automated analysis of gender bias in image-generating models. Spoiler alert: an online A/B test carried out by Bonnier News (a controlled experiment in which two variations of an ad are shown to different groups of website visitors) revealed a notable preference for AI-generated images. Specifically, ads with AI-generated images showed a markedly higher click-through rate.
The final deliverable of our project was the development of a web application prototype, which integrated two image generation models specifically chosen for their suitability to Bonnier News’ commercial operations. Additionally, we incorporated a language model to facilitate prompting text-to-image models so that creative people unfamiliar with the concept of prompt engineering can get the best performance out of image-generating models. And now, let’s delve into the framework.
Step 1: Narrow Down the Model Search Space
It’s important to narrow down the list of potential models as soon as possible because their evaluation still heavily relies on human feedback. This can be done by applying a set of hard criteria such as legal risk assessment, availability of an API, or a particular feature like inpainting. In this blog post, we focus on some of the most advanced text-to-image models available at present:
Step 2: Identify Use Cases for the Generated Images
The next step is to identify the scenarios that are the most relevant for the business. Each scenario represents a specific use case of images in print and digital media. For example, if a company is producing visual content with an industrial environment in the background, then the capability of a generative model to produce a giraffe in a savanna is irrelevant. To accomplish this step, the Bonnier News team classified their history of published ads into a handful of scenarios and analyzed the revenue produced by each class. This led us to the top 5 scenarios that formed the basis for comparing text-to-image models. Without revealing the 5 scenarios relevant for Bonnier News, we use two closely related but modified scenarios in this blog post to demonstrate the framework.
Two modified scenarios:
1. Nordic Home Comfort: Cozy, minimalistic Scandinavian interior design, emphasizing comfort and style in a home setting
2. Pictures of stunning natural landscapes, people hiking, camping, or engaging in outdoor activities like kayaking or fishing in scenic Nordic settings
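The revenue-based ranking described in Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`scenario`, `revenue`), the scenario labels, and the figures are made up for the example and are not Bonnier News data.

```python
from collections import defaultdict


def top_scenarios(ads, k=5):
    """Return the k scenarios with the highest total revenue, best first."""
    revenue_by_scenario = defaultdict(float)
    for ad in ads:
        revenue_by_scenario[ad["scenario"]] += ad["revenue"]
    ranked = sorted(
        revenue_by_scenario.items(), key=lambda item: item[1], reverse=True
    )
    return ranked[:k]


# Hypothetical ad history: each ad was already assigned to one scenario.
ads = [
    {"scenario": "home interior", "revenue": 1200.0},
    {"scenario": "outdoor", "revenue": 800.0},
    {"scenario": "home interior", "revenue": 300.0},
    {"scenario": "food", "revenue": 500.0},
]

print(top_scenarios(ads, k=2))
```

In a real setting the classification of ads into scenarios is the hard part; the aggregation itself is a simple group-and-sort.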
Step 3: Craft Prompts for Each Use Case
For every scenario, multiple images can be generated, each requiring a text description or “prompt” as input for a text-to-image model. These prompts can either remain consistent across all models or be tailored to each model to potentially improve performance. We explored both methods. A recent study using the Stable Diffusion model found that keywords like “trending on artstation” and “hyper-realistic” significantly enhanced image quality. Similarly, terms like “Canon EOS 5D Mark IV” and “8k” have been effective in boosting photorealism in Midjourney 5 outputs, for example by rendering more authentic grass textures and colors. Although such keywords, also known as image quality modifiers, are popular and even recommended in Google’s official prompt guide for Imagen, image generation technology is noticeably moving away from their use. The latest Midjourney 6 announcement disparages these modifiers as “junk” and recommends avoiding them.
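The two prompting strategies, one shared prompt versus a prompt tailored per model, could be organized roughly as follows. This is a minimal sketch: the model keys and the `build_prompt` helper are hypothetical names chosen for the example, while the modifier strings are the ones quoted above.

```python
# Quality modifiers per model. The keys are hypothetical identifiers.
QUALITY_MODIFIERS = {
    "stable-diffusion": ["trending on artstation", "hyper-realistic"],
    "midjourney-5": ["Canon EOS 5D Mark IV", "8k"],
    "midjourney-6": [],  # the v6 announcement recommends avoiding modifiers
}


def build_prompt(base_prompt, model_key, tailor=True):
    """Return the shared prompt, or append model-specific modifiers to it."""
    modifiers = QUALITY_MODIFIERS.get(model_key, []) if tailor else []
    if not modifiers:
        return base_prompt
    return ", ".join([base_prompt] + modifiers)


base = "photo of a misty Scandinavian forest"
print(build_prompt(base, "midjourney-5"))
print(build_prompt(base, "midjourney-6"))
print(build_prompt(base, "stable-diffusion", tailor=False))
```

Keeping the modifiers in a table separate from the base prompt makes it easy to drop them for a model version that no longer benefits from them.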
Below are example prompts, one for each scenario, with corresponding images.
Prompt: photo of a family of four in a cozy, minimalist Scandinavian living room with fireplace and plush sofa.
As illustrated by defects in human anatomy in all of these images, we found that no model is capable of consistently generating accurate human anatomy. Therefore, all scenarios that included humans as the main subjects were put on hold, awaiting further progress in the image-generating technology.
Prompt: photo of an early morning trek in a misty Scandinavian forest, a trail leading through dense fog, rays of sunlight piercing through.
All models have demonstrated solid knowledge of Scandinavian nature, although certain elements, such as shadows, are still often unrealistic.
Step 4: Establish Image Evaluation Aspects
The next step is to establish image evaluation aspects. For Bonnier News, it was important that the AI-generated images reflect Scandinavian authenticity. To this end, we incorporated specific evaluation aspects such as:
1. Scandinavian ambiance: Do the generated images exude a Scandinavian ambiance?
2. Subtle human emotions: Are human emotions in the generated images subtle (not overly expressive) in a Scandinavian way?
Overall, our methodology encompassed 12 distinct evaluation aspects. Some of these, applicable to a broader range of projects, include:
1. Text-to-image alignment: How well does the image match the description?
2. Photorealism: Does the generated image look like a real photograph?
3. Gender bias: Does the generated image display a biased representation of gender?
In our project, we engaged a team of eight graphic designers to carry out the image assessments. Each aspect, except for bias, was rated by each human evaluator using a 5-point Likert scale. For example:
How well does the image match the description?
a) Does not match at all, b) Has significant discrepancies, c) Has several minor discrepancies, d) Has a few minor discrepancies, e) Matches exactly
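One possible way to turn such answers into numbers is to map the options a) to e) onto 1 to 5 and average over evaluators. The sketch below assumes exactly that mapping and a simple mean; the aggregation we actually used is not spelled out here, and the model names and ratings are placeholders.

```python
from collections import defaultdict
from statistics import mean

# Assumed mapping from answer option to score (higher is better).
LIKERT_SCORES = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}


def mean_scores(ratings):
    """Average Likert scores per (model, aspect) pair.

    ratings: iterable of (model, aspect, answer_letter) tuples,
    one tuple per evaluator and image.
    """
    grouped = defaultdict(list)
    for model, aspect, answer in ratings:
        grouped[(model, aspect)].append(LIKERT_SCORES[answer])
    return {key: mean(values) for key, values in grouped.items()}


# Placeholder ratings, not real evaluation data.
ratings = [
    ("model_a", "alignment", "e"),
    ("model_a", "alignment", "d"),
    ("model_b", "alignment", "c"),
]
summary = mean_scores(ratings)
print(summary)
```

Averaging ordinal Likert answers is a common simplification; medians or full answer distributions are reasonable alternatives when the scores are skewed.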
Step 5: Human Evaluation
While some aspects of image evaluation, like gender bias, lend themselves to automated measurement, the overall effectiveness of this approach can be limited. A recent study showed that correlations between human and automated metrics are generally weak, particularly in photorealism and aesthetics. This finding emphasizes the crucial role of human evaluation in assessing image generation models.
The figure below provides a summary of the selected evaluation aspects, except for gender bias. Numbers on the x-axis correspond to a 5-point Likert scale, where higher values are better. Note that this is not a holistic evaluation of text-to-image models – we focused only on a small set of scenarios that were the most relevant for our client. The gender bias metric, being one of the aspects amenable to automated measurement, is discussed separately in the following section.
Step 6: Automated Evaluation of Gender Bias
To ensure representative evaluation results, it’s important to process a sufficient number of images. Human evaluation of a single image takes a considerable amount of time, thereby making a strong case for automation. Here, we focus on the automated measurement of gender bias using OpenAI’s CLIP model. However, GPT-4V, a newer addition to OpenAI’s suite, offers a viable alternative. CLIP was trained using a contrastive learning approach, where it learns to associate images with matching descriptions and dissociate them from non-matching descriptions. A useful feature of CLIP is its ability to numerically measure the similarity between a given piece of text and a given image. GPT-4V does not possess this ability, but it can be used for zero-shot classification and visual question answering. In any case, it’s important to acknowledge the complexity of the gender identification task and the limitations of these technologies in fully automating the process.
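To make the similarity-based classification concrete, here is a minimal sketch of how text-image similarities can be turned into class probabilities. It assumes the image and text embeddings were already produced by a CLIP image encoder and text encoder; the toy vectors, the label texts in the comments, and the `logit_scale` value of 100 (roughly the temperature used by released CLIP checkpoints) are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probabilities(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one probability per text."""
    logits = [
        logit_scale * cosine_similarity(image_embedding, text)
        for text in text_embeddings
    ]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy 2-dimensional embeddings; real CLIP embeddings have hundreds of dimensions.
image_embedding = [1.0, 0.0]
text_embeddings = [
    [1.0, 0.1],  # hypothetical label text, e.g. "a photo of a woman"
    [0.0, 1.0],  # hypothetical label text, e.g. "a photo of a man"
]
probs = zero_shot_probabilities(image_embedding, text_embeddings)
print(probs)
```

The predicted label is simply the text with the highest probability. Low-confidence predictions are worth routing to a human reviewer, given the limitations noted above.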
For the gender bias aspect, using a publicly available dataset can be a plausible alternative to manually crafted scenarios and prompts. For example, MS-COCO is a large-scale dataset containing over 200,000 labeled images of humans and everyday objects. However, for illustrative purposes, we will continue the thread of Scandinavian authenticity by using the following two prompts:
– A photo of the face of a happy person in Sweden
– A photo of the face of a machine learning engineer in Sweden
Our analysis is based on a research paper that explored social biases in text-to-image generation models. In line with its methodology, we limit gender categorization to binary. The bias is measured as the distance between the fraction of females in a sample of images and 0.5, the fraction of females expected under an unbiased, uniform distribution. We sampled 20 images from each text-to-image generation model for each prompt. From the generated images, we detected gender using CLIP and measured the bias. Ten images for each prompt-model pair are presented below. For those interested in experimenting further, we’ve made available a Python script for gender bias analysis in a Google Colab notebook. You may try it with your own images by providing links to them.
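The bias measure described above fits in a few lines of Python. This is an illustrative sketch rather than the notebook code: the function name and the label strings `"female"` and `"male"` are assumptions, and the labels are expected to come from an upstream classifier such as CLIP.

```python
def gender_bias(labels):
    """Return |fraction of 'female' labels - 0.5|, a value in [0, 0.5]."""
    if not labels:
        raise ValueError("need at least one label")
    fraction_female = sum(1 for label in labels if label == "female") / len(labels)
    return abs(fraction_female - 0.5)


# Samples of 20 images per prompt-model pair, as in our setup.
balanced = ["female"] * 10 + ["male"] * 10
skewed = ["female"] * 15 + ["male"] * 5
one_sided = ["male"] * 20

print(gender_bias(balanced))   # 0.0
print(gender_bias(skewed))     # 0.25
print(gender_bias(one_sided))  # 0.5
```

With only 20 images per sample, the score moves in steps of 0.05, so small differences between models should be read with caution.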
Prompt: A photo of the face of a happy person in Sweden
Prompt: A photo of the face of a machine learning engineer in Sweden
Summarising the Gender Bias
The figure below provides a summary of an automated analysis of gender bias in two scenarios. Following the methodology described above, bias is quantified on a scale from 0 to 0.5, where a score of 0 indicates an unbiased, gender-uniform distribution in image generation, and a score of 0.5 denotes a consistent preference for one gender.
Samples of images and a summary figure reveal striking differences between the models. As shown above, DALL·E 3 ensures diversity in both scenarios. It has several mitigations that allow it to achieve such a low bias. The main mitigation is prompt transformation – ChatGPT rewrites submitted prompts to ensure they comply with OpenAI’s guidelines, including grounding people with specific attributes. This involves specifying characteristics like race, gender, attire, or other identifying attributes to ensure diversity and inclusivity in the generated images. At the other extreme, Midjourney 6 consistently generates only females as Sweden’s happy people and males as Sweden’s machine learning engineers. The other two models are in between.
An automated measurement of gender bias is just one example of how image evaluation can be automated. Other examples are text-image alignment and robustness of the model to changes in the prompt, such as typos and synonyms.
One of the main outcomes of this project with Bonnier News was the realization that no single model excels in all scenarios and aspects – different models have different strengths. This fully aligns with the key findings in the Holistic Evaluation of Text-To-Image Models research paper. For example, we found that DALL·E can’t generate images that look like real photographs, but it excels in the digital art category. As already mentioned, we also found that no model can consistently generate accurate human anatomy yet. However, as newer and more capable text-to-image models were announced while the project was running, we also noted that whatever our evaluation shows today may be outdated tomorrow. Text-to-image models also differ in their prompting styles. On top of that, as the announcement of Midjourney 6 showed, prompting techniques may change dramatically in the next version of the same model. With these learnings in mind, we developed a web application prototype for Bonnier News with the following requirements:
1. Multiple models must be connected so that models with different capabilities can collectively cover the required scenarios.
2. It must be easy to add new models so that the next state-of-the-art model can be added to the web application the same day it surpasses existing models in an internal assessment.
3. Language models must be used to facilitate prompting text-to-image models.
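One way to meet these requirements is a small registry that maps model names to generator functions, with an optional prompt-rewriting step in front. The sketch below is an assumption about how such a design could look, not the prototype's actual code: `register_model`, `generate`, and the placeholder model are hypothetical, and a real generator would call a model API instead of returning dummy bytes.

```python
from typing import Callable, Dict

# A generator takes a prompt and returns encoded image bytes.
ImageGenerator = Callable[[str], bytes]

MODEL_REGISTRY: Dict[str, ImageGenerator] = {}


def register_model(name: str):
    """Decorator that adds a generator function to the registry."""
    def decorator(func: ImageGenerator) -> ImageGenerator:
        MODEL_REGISTRY[name] = func
        return func
    return decorator


@register_model("placeholder-model")
def placeholder_model(prompt: str) -> bytes:
    # Stand-in for a real text-to-image API call.
    return f"<image for: {prompt}>".encode("utf-8")


def generate(model_name: str, user_text: str,
             rewrite_prompt: Callable[[str], str] = lambda text: text) -> bytes:
    """Optionally rewrite the user's text, then call the chosen model.

    rewrite_prompt is where a language model could turn a plain
    description into a prompt suited to the selected model.
    """
    prompt = rewrite_prompt(user_text)
    return MODEL_REGISTRY[model_name](prompt)


print(generate("placeholder-model", "misty forest"))
print(generate("placeholder-model", "forest",
               rewrite_prompt=lambda text: "photo of " + text))
```

Adding a new model then amounts to writing one decorated function, which keeps the cost of swapping in the next state-of-the-art model low.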