This paper investigates how effectively GPT-4 can analyze and categorize open-ended text feedback from leadership coaching programs compared to human coders. The study examines over 6,000 responses from coaching program participants, coded under three approaches: single-theme tagging, multiple-theme tagging, and multiple-theme tagging with human intervention.
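
To make the three approaches concrete, here is a minimal sketch of how single-theme versus multiple-theme tagging might be implemented with the OpenAI Python SDK. The theme list, prompt wording, model name, and parsing logic are illustrative assumptions, not the paper's actual protocol.

```python
# Illustrative sketch only: the themes, prompts, and model below are
# hypothetical stand-ins, not the paper's actual coding protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical theme set for leadership-coaching feedback.
THEMES = ["communication", "goal setting", "self-awareness",
          "accountability", "work-life balance"]

def tag_response(text: str, multi: bool = False) -> list[str]:
    """Ask the model to assign one theme (or several, if multi=True)."""
    instruction = (
        f"Themes: {', '.join(THEMES)}.\n"
        + ("Assign every theme that applies, comma-separated."
           if multi else "Assign exactly one theme.")
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep outputs stable across repeated coding runs
    )
    raw = resp.choices[0].message.content
    # Keep only labels that exactly match the predefined theme set.
    return [t.strip() for t in raw.split(",") if t.strip() in THEMES]
```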

The research found that GPT-4 matched human consensus coding 55-65% of the time for single-theme assignment, only slightly below the agreement rate between human coders (60-70%). When allowed to assign multiple themes, GPT-4 identified at least one matching theme 85% of the time, but assigned more than twice as many themes as human coders, adding “noise” to the analysis. The most effective approach was a human-in-the-loop process in which experts refined GPT-4’s initial theme set, reducing the number of themes by nearly half while maintaining 75% accuracy.
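
The two agreement measures above are straightforward to compute once GPT-4’s tags and the human coders’ tags are collected. The sketch below shows one plausible way to do so; the data structures and exact-match rules are assumptions for illustration, not the paper’s stated methodology.

```python
# Illustrative computation of the two agreement measures described above;
# the matching rules are assumptions, not the paper's exact methodology.

def single_theme_agreement(gpt_tags: list[str], human_tags: list[str]) -> float:
    """Fraction of responses where GPT-4's single theme equals the
    human consensus theme (the 55-65% figure above)."""
    matches = sum(g == h for g, h in zip(gpt_tags, human_tags))
    return matches / len(human_tags)

def any_overlap_rate(gpt_sets: list[set[str]], human_sets: list[set[str]]) -> float:
    """Fraction of responses where GPT-4's theme set shares at least one
    theme with the human-assigned set (the 85% figure above)."""
    hits = sum(bool(g & h) for g, h in zip(gpt_sets, human_sets))
    return hits / len(human_sets)

# Toy example with hypothetical tags for three and two responses:
print(single_theme_agreement(
    ["goal setting", "accountability", "communication"],
    ["goal setting", "self-awareness", "communication"]))   # 0.67
print(any_overlap_rate(
    [{"goal setting", "communication"}, {"accountability"}],
    [{"communication"}, {"self-awareness"}]))                # 0.5
```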

The authors conclude that while GPT-4 is significantly faster than human coding (completing the task in under an hour versus 2-4 hours for human coders), it works best as a supplementary tool rather than a replacement for human analysis. The paper includes detailed methodology, results, and practical recommendations for balancing AI efficiency with human nuance in qualitative data analysis.

Read the paper