Introduction

Artificial intelligence (AI) technologies offer transformative potential, such as enhanced decision-making, personalised services and breakthroughs in understanding complex systems. The use of AI is growing rapidly in modern medicine and education;1–3 however, its use in image generation and interpretation in a medical context is underexplored.4 This is particularly the case for generative adversarial networks, a branch of AI in which two neural networks, a generator and a discriminator, compete to produce synthetic data that can be used to create realistic images, text or sound. AI models based on generative adversarial networks can learn statistical associations between textual prompts and corresponding images within their training data, allowing them to generate realistic and diverse images, including clinical photos in a medical context.

This emerging technology has the potential to offer numerous benefits in medical education, particularly in plastic surgery, which relies greatly on visualisation and clinical exposure.4 The use of AI reduces the need for real patient images to be used for educational purposes. From an ethical perspective, this enhances the safeguarding of patient images and mitigates the risk of infringing patient confidentiality.4,5 From a practical perspective, AI-generated images offer the potential for students to quickly and easily study a wide array of conditions among a diverse patient population. This diverse case exposure is essential for developing proficient and well-rounded healthcare professionals. The availability of AI-generated images may also help to improve access to medical education: regardless of geographical location or resources, all students can access the same breadth and quality of educational material, bridging the educational divide and fostering equal learning opportunities.

The assessment and diagnosis of skin cancers is a highly visual exercise requiring experience gained from exposure to numerous patients or clinical photos. This area of plastic surgery would benefit greatly from the establishment of a vast database of accurate AI-generated images for students to learn from. This would be particularly useful in Australia, which has one of the highest global incidence rates of skin cancer.6 In this study, we assess the ability of three prominent AI models based on generative adversarial networks—DALL-E, Midjourney and BlueWillow—to generate authentic-looking images of skin cancers and evaluate the potential implications of this technology for plastic surgery education.

Methods

This was an experimental study using the generative capabilities of DALL-E (https://openai.com/dall-e-2), Midjourney (https://www.midjourney.com) and BlueWillow (https://www.bluewillow.ai) to artificially create images of three types of skin cancer: squamous cell carcinoma (SCC), basal cell carcinoma (BCC) and melanoma. No images were of real patients. These models use generative adversarial networks to create images rather than using unconsented images from the web, helping to ensure ethical compliance and respect for privacy and confidentiality. The following proactive measures were taken to ensure that the use of AI-generated images complies with ethical and privacy standards:

  • OpenAI, the company that produces DALL-E, has publicly committed to responsible AI development practices, including the ethical sourcing of training data. This entails using datasets that are publicly available, created through ethical partnerships or generated in-house, thus ensuring adherence to legal and ethical standards and minimising biases.7

  • Midjourney, while newer and with less publicly documented dataset sourcing practices, is expected to uphold the AI community’s ethical standards by using datasets that respect privacy norms and consent, focusing on generating diverse and unbiased images.8

  • BlueWillow upholds ethical data practices by ensuring privacy, copyright compliance, and bias minimisation in line with industry standards. Its privacy policy outlines transparent data collection, security measures and user rights, reinforcing its commitment to responsible AI development.9

By using these three models, this study avoids the use of unconsented images from the web. Our approach ensures the AI-generated images not only comply with ethical standards but also respect both company and individual privacy rights, highlighting our commitment to ethical research practices in the advancement of medical education through AI.

Five prompts were given to each AI model for each of the three types of skin cancer. The images generated were evaluated using a Likert scale by two plastic surgery residents (JC and IS) and an experienced specialist plastic surgeon (WMR) (Table 1). Interrater reliability was assessed on the initial rating of each individual reviewer using intraclass correlation coefficients, which range from 0 to 1, with higher values indicating better reliability. Following this, reviewers held a group discussion regarding any differences in Likert scale rating until a consensus rating was achieved.
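For readers wishing to reproduce the reliability analysis, the intraclass correlation coefficient can be computed directly from the matrix of initial ratings. The Python sketch below implements the two-way random-effects, absolute-agreement, single-rater form ICC(2,1), one common choice; the study does not specify which ICC variant was used, so the variant, function name and example ratings here are illustrative only.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater.

    ratings: (n_subjects, k_raters) array of scores, e.g. one row per
    AI-generated image and one column per reviewer.
    """
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)   # per-subject means
    col_means = Y.mean(axis=0)   # per-rater means

    # Partition the total sum of squares into subject, rater and error terms
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA decomposition
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: three images rated by two reviewers.
# Perfect agreement yields 1.0; a constant between-rater offset lowers the
# coefficient because the absolute-agreement form penalises systematic bias.
print(icc2_1([[1, 1], [2, 2], [3, 3]]))  # → 1.0
```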

Table 1.Criteria used to evaluate AI-generated images. Values are the final consensus ratings of the three reviewers on a five-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = neither agree nor disagree, 4 = agree, 5 = strongly agree).

Criteria                                                              DALL-E   Midjourney   BlueWillow
The AI-generated images resemble real-world examples of pathology       3          1            1
The AI-generated images adequately represent the specific pathology     4          1            1
The AI-generated images' details are visible and easily discernible     4          1            1
The AI-generated images are of a high quality                           4          1            1
The AI-generated images are beneficial for educational purposes         4          1            1

A specific set of five prompts per skin cancer type was used to generate the AI images, with the aim of producing clinically accurate and pedagogically useful visuals. The choice of a limited number of prompts was informed by preliminary tests showing that broad or numerous prompts often led to irrelevant or overly generalised images. To counter this, phrases that closely aligned with clinical descriptions of the disease, such as ‘nodular basal cell carcinoma’, ‘ulcerated’, ‘pearly edge’ and ‘telangiectasia’, were selected. These terms were chosen to direct the AI model’s focus towards generating images that accurately represented the critical characteristics of each skin cancer type. Setting a limit of five prompts per type was a strategic measure to ensure the AI model produced specific and relevant images rather than misleading or educationally unsuitable visuals.
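The descriptor-based prompting strategy above can be sketched as a small helper that composes a base lesion description with clinical feature terms and a photographic-context cue. The function name, signature and descriptor lists below are illustrative only and were not part of the study protocol.

```python
def build_prompt(lesion, descriptors, context="clinical photo"):
    """Join a base lesion description, a list of clinical feature
    descriptors and a photographic-context cue into one
    comma-separated image-generation prompt."""
    return ", ".join([lesion, *descriptors, context])

# A BCC-style prompt echoing the clinical descriptors used in the study
bcc_prompt = build_prompt(
    "small reddish pink skin lump on cheek of old woman",
    ["pearly edges", "central ulceration", "telangiectasia"],
)
# → 'small reddish pink skin lump on cheek of old woman, pearly edges,
#    central ulceration, telangiectasia, clinical photo'
```

Keeping the clinical descriptors as an explicit list makes each refinement step auditable: a descriptor can be added or removed between generation rounds, and the exact prompt that produced a given image can be recorded verbatim, as was done in this study.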

Additionally, we uploaded the AI-generated images to AI Dermatologist (https://ai-derm.com), a machine-learning system trained on over three million images. AI-based dermatology diagnosis algorithms such as this have shown remarkable accuracy, comparable to that of dermatologists.10 However, AI Dermatologist states that it is not a diagnostic device and recommends that users still obtain a professional medical opinion. Only the DALL-E images were deemed appropriate for analysis by AI Dermatologist; all images created by Midjourney and BlueWillow had exaggerated, fictionalised features that made them unsuitable for analysis.

Each AI model produced four different images for each prompt given, thus producing a total of 20 images for each type of skin cancer. Prompts were continually refined and altered in an attempt to improve the accuracy of the generated images. The image that most precisely represented the intended diagnosis, as determined by the authors, was selected for inclusion in this study to demonstrate the capabilities of the AI software. The prompt associated with this image was recorded. As an experimental study on openly available data generated from an AI platform, no institutional ethics approval was required.

Results

Overall evaluation of AI models

DALL-E generated the most realistic medical images of skin cancers among the three AI models evaluated. Final author ratings for each AI model using a Likert scale are summarised in Table 1. Initial ratings for DALL-E by the three assessors showed good interrater reliability, with an intraclass correlation coefficient of 0.82. In contrast, the images generated by Midjourney and BlueWillow were considered unsuitable for clinical education and failed to accurately represent real-world pathology. The three assessors agreed, with excellent interrater reliability (intraclass correlation coefficient 0.98), that the images produced by these models were of poor quality.

Cutaneous squamous cell carcinoma

The first prompt was ‘squamous cell carcinoma’. In DALL-E, this prompt yielded microscope images of histology slides rather than a cutaneous lesion. The AI model did not appear to have an adequate understanding of medical terminology, with terms such as ‘carcinoma’, ‘ulcerated’ and ‘necrotic’ failing to produce the desired images. The final prompt was ‘bleeding skin wound on lower lip of face, round, rough edges, clinical photo, old woman’, which produced the result shown in Figure 1. This image was deemed by the authors to be the most accurate reflection of a cutaneous facial SCC of all the images produced by DALL-E. The authors rated DALL-E 4 out of 5 on the Likert scale for similarity to actual clinical pathology. This was corroborated by AI Dermatologist, which identified cancer in Figure 1 with 93 per cent probability and strongly recommended immediate consultation with a dermatologist. The images created by both Midjourney and BlueWillow were hyper-realistic, with exaggerated, fictionalised features that hamper their suitability for clinical education (Figure 2).

Fig 1
Fig 1.Image generated by DALL-E in response to the prompt ‘bleeding skin wound on lower lip of face, round, rough edges, clinical photo, old woman’
Fig 2
Fig 2.Images produced by Midjourney (left) and BlueWillow (right) in response to the prompt ‘bleeding skin wound on lower lip of face, round, rough edges, clinical photo, old woman’

Basal cell carcinoma

The first prompt was ‘basal cell carcinoma of the face’. In response, DALL-E produced images of elderly male and female faces with what appeared to be scarring of their skin. None had the characteristic appearance of a BCC. The second prompt was ‘basal cell skin cancer on face’, which yielded similar results to the first. The third prompt was ‘small reddish pink skin lump, pearly edges, central ulceration, and telangiectasia, on cheek of old woman’. In response, DALL-E produced images of an elderly woman with a red spot on the cheek; however, none of the lesions was elevated or ulcerated, and none had overlying telangiectasia. The fourth prompt was ‘medical photo, small round, elevated, pink lesion on face, woman, bleeding centre’. Figure 3 is one of the images produced in response to this prompt. The last prompt added the words ‘overlying blood vessel’ to the fourth prompt but produced inferior results. Of the 20 images produced by DALL-E, Figure 3 was deemed to represent a facial BCC most accurately. The authors rated DALL-E 3 out of 5 on the Likert scale for BCC accuracy and similarity to actual clinical pathology. AI Dermatologist suggested less correlation with a BCC, stating a 43 per cent chance of the lesion being a benign skin neoplasm and advising a scheduled visit to a dermatologist. The images created by Midjourney and BlueWillow (Figure 4) were not suitable for clinical education or evaluation by AI Dermatologist due to their fictionalised elements.

Fig 3
Fig 3.Image generated by DALL-E in response to the prompt ‘medical photo, small round, elevated, pink lesion on face, woman, bleeding centre’
Fig 4
Fig 4.Images generated by Midjourney (left) and BlueWillow (right) in response to the prompt ‘medical photo, small round, elevated, pink lesion on face, woman, bleeding centre’

Melanoma

The first prompt was ‘facial melanoma in old man’. In response, DALL-E produced high-quality photos of elderly male faces, but none had discernible skin lesions. The term ‘melanoma’ appeared to be recognised by the AI model despite the poor accuracy of the first prompt. Subsequent prompts focused more on describing the appearance of the skin lesion and produced more accurate results. The final prompt was ‘medical photo of one melanoma on left cheek of old man, mixed brown and black in colour, irregular edge’. Figure 5 is one of the images produced in response to this prompt and was deemed the most accurate image of melanoma produced by DALL-E across the five prompts given. The authors rated DALL-E 3 out of 5 on the Likert scale for melanoma accuracy and similarity to actual clinical pathology. AI Dermatologist identified the image with an 89 per cent probability as a precancerous condition and urged immediate dermatological consultation and possible surgical removal. Once again, the images produced by Midjourney and BlueWillow (Figure 6) demonstrated limitations in their suitability for clinical education and evaluation through AI Dermatologist, primarily due to their inherent fictionalised components.

Fig 5
Fig 5.Image generated by DALL-E in response to the prompt ‘medical photo of one melanoma on left cheek of old man, mixed brown and black in colour, irregular edge’
Fig 6
Fig 6.Images generated by Midjourney (left) and BlueWillow (right) in response to the prompt ‘medical photo of one melanoma on left cheek of old man, mixed brown and black in colour, irregular edge’

Discussion

This study demonstrates the current capability of AI image-generating models to produce accurate medical images of skin cancers. Only the images created by DALL-E were deemed to be of reasonable accuracy; however, they remain insufficient on their own to replace traditional patient photographs of conclusively diagnosed skin lesions. Although the images produced by Midjourney and BlueWillow are visually striking, their hyper-realistic and fictionalised features impede their effectiveness in clinical education. These images fail to accurately depict real-world scenarios, which is critical in a clinical learning environment. Furthermore, the heavy reliance of AI models on their training data restricts the versatility and applicability of the generated images, considerably constraining their usefulness for clinical education. Nevertheless, the results presented here outline the potential of these AI models to serve as a useful adjunct to current teaching practices as their realism and accuracy continue to improve.

The future integration of AI-generated images into medical education can bring significant benefits.11,12 This study has focused on skin cancers; however, all learning dependent on clinical examination or medical photography could benefit from AI-generated images. As technology improves, trainees would have access to an extensive collection of medical photos of pertinent clinical cases, enabling them to develop expertise in recognising visual aspects of diseases or lesions. Reducing the need for real patient images reduces the risk of breaching patient confidentiality, an increasingly important ethical consideration given the widespread use of images sourced from social media and the internet. Additionally, in areas where educational resources are limited, the use of AI tools can be transformative by supplying free and readily accessible clinical images.

These AI applications have the potential to generate a wide range of pathological clinical images, thereby furnishing current and future medical practitioners with information to aid their learning. Currently, AI applications in plastic surgery are primarily centred on harnessing machine learning and deep learning techniques. These techniques are backed by extensive image datasets that assist in providing diagnostic information. Dermatology diagnosis software programs powered by AI, such as DermEngine (MetaOptima, Vancouver, Canada), demonstrate high levels of accuracy but can be costly.13 They also require collaboration with trained specialists and are often focused on specific conditions like SCC. In contrast, tools like DALL-E present an exciting and emerging opportunity to enhance and revolutionise medical education. By capitalising on these AI tools, there is a potential to create a broader, more diverse and cost-effective educational platform that is not bound by geographical or financial constraints, thereby democratising access to quality medical education.

To ensure the suitability of AI-generated images for educational purposes, we recommend the implementation of a validation process. Multiple experts in the field should assess the generated images against real patient cases to validate their authenticity and accuracy. Rather than replacing traditional medical education resources, AI-generated images should supplement existing teaching materials. Combining AI-generated images with real-life patient cases, histopathological slides and clinical discussions can create a more comprehensive and dynamic learning environment for medical students and junior doctors.

Using AI to generate realistic images can also potentially help in surgical preparation and skill enhancement. Generating images of abnormalities like tumours can offer a personalised learning experience for surgeons, and assist in disease diagnosis, treatment planning and patient education.

As a field reliant on clinical images, plastic surgery can particularly benefit because AI generated images can help address issues of availability and diversity. AI-generated images can also potentially enhance plastic surgery education by providing realistic surgical simulations and personalised learning modules, making plastic surgery training more accessible and of a higher quality.

While AI-generated images have significant potential to advance healthcare and medical education, it is crucial to recognise their inherent limitations and areas for improvement.

  • First, the efficacy of AI models is contingent on the quality and diversity of the training data. AI algorithms, particularly those employing deep learning, require extensive, varied and unbiased data sets to generate accurate and representative images. Any deficiency in these data sets can result in the production of suboptimal images, thereby limiting their educational and diagnostic utility.

  • Second, AI-generated images may lack the nuanced details inherent in real clinical images. While AI can mimic pathological features, subtleties related to disease progression, response to treatment or patient-specific variations may not be adequately captured. This could potentially limit the applicability of AI-generated images for complex diagnostic processes and providing comprehensive education to medical professionals. Additionally, AI models, in their current state, may not be proficient at generating images of rare or complex pathologies, which may limit their utility in a comprehensive medical education context.

  • Lastly, ethical and legal concerns cannot be overlooked. Questions surrounding the misuse of AI-generated images, data privacy and informed consent, particularly when using real patient data, must be addressed. Clear regulations and guidelines need to be established to ensure the responsible and ethical use of this technology in healthcare.

AI image-recognition diagnostic machine-learning models, such as the AI Dermatologist platform used in this study, should not be seen as a replacement for specialist consultations.14–16 It is imperative to remember that these platforms are fundamentally reliant on the quality and diversity of the data they have been trained on. They can occasionally misdiagnose rare or atypical presentations that fall outside their training sets. Additionally, these platforms lack the capacity to conduct a holistic evaluation of a patient’s health, which is often crucial for accurate diagnosis and treatment. While DALL-E showed some promise in generating images of skin cancers, there is a risk that reliance on AI-generated images for educational purposes could lead to misdiagnosis or incorrect treatment by doctors. The images in this study, particularly those from Midjourney and BlueWillow, often featured exaggerated or fictionalised elements, highlighting the potential for educational risks if these images are interpreted as realistic representations of disease. Such discrepancies underscore the necessity of caution and critical evaluation when incorporating AI-generated images into medical education. It is crucial to emphasise that AI-generated images cannot replace evidence-based clinical diagnostic tools, such as dermatoscopy scoring tools, which provide benchmarks for sensitivity and specificity in diagnostic accuracy. Relying solely on AI-generated images without corroborating with established clinical diagnostic tools could compromise the quality of medical education and patient care. This reinforces the importance of integrating AI tools as supplementary resources rather than definitive educational materials.

The success of AI in generating realistic medical images of skin cancers opens up exciting possibilities for further research. As AI technology continues to evolve, integrating other AI models and technologies into medical education could lead to more sophisticated interactive learning platforms and simulation tools for trainees. Additionally, exploring the use of AI-generated images for surgical planning and intraoperative guidance may revolutionise surgical practices in the future.

Conclusion

Our study demonstrated that only one of the three models tested, DALL-E, exhibits a reasonable level of accuracy in generating realistic medical images of skin cancers, specifically SCC, BCC and melanoma, for educational purposes in plastic surgery. As technology improves, the integration of AI-generated images into medical education has the potential to augment traditional teaching practices.


Conflict of interest

The authors have no conflicts of interest to disclose.

Funding declaration

The authors received no financial support for the research, authorship and/or publication of this article.

Revised: March 29, 2024 AEST