This paper investigates how effectively GPT-4 can analyze and categorize open-ended text feedback from leadership coaching programs compared to human coders. The study examines over 6,000 responses from coaching program participants, coded under three approaches: single-theme tagging, multiple-theme tagging, and multiple-theme tagging with human intervention.
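
To make the three approaches concrete, here is a minimal sketch of how single-theme versus multiple-theme tagging might be implemented with the OpenAI Python SDK. The theme list, prompt wording, model name, and parsing logic are illustrative assumptions, not the paper's actual protocol.

```python
# Illustrative sketch only: the themes, prompts, and model below are
# hypothetical stand-ins, not the paper's actual coding protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical theme set for leadership-coaching feedback.
THEMES = ["communication", "goal setting", "self-awareness",
          "accountability", "work-life balance"]

def tag_response(text: str, multi: bool = False) -> list[str]:
    """Ask the model to assign one theme (or several, if multi=True)."""
    instruction = (
        f"Themes: {', '.join(THEMES)}.\n"
        + ("Assign every theme that applies, comma-separated."
           if multi else "Assign exactly one theme.")
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep outputs stable across repeated coding runs
    )
    raw = resp.choices[0].message.content
    # Keep only labels that exactly match the predefined theme set.
    return [t.strip() for t in raw.split(",") if t.strip() in THEMES]
```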

The research found that GPT-4 matched human consensus coding 55-65% of the time for single-theme assignment, only slightly below the agreement rate between human coders (60-70%). When allowed to assign multiple themes, GPT-4 identified at least one matching theme 85% of the time, but assigned more than twice as many themes as human coders, adding “noise” to the analysis. The most effective approach was a human-in-the-loop process in which experts refined GPT-4’s initial theme set, reducing the number of themes by nearly half while maintaining 75% accuracy.
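
The two agreement measures above are straightforward to compute once GPT-4’s tags and the human coders’ tags are collected. The sketch below shows one plausible way to do so; the data structures and exact-match rules are assumptions for illustration, not the paper’s stated methodology.

```python
# Illustrative computation of the two agreement measures described above;
# the matching rules are assumptions, not the paper's exact methodology.

def single_theme_agreement(gpt_tags: list[str], human_tags: list[str]) -> float:
    """Fraction of responses where GPT-4's single theme equals the
    human consensus theme (the 55-65% figure above)."""
    matches = sum(g == h for g, h in zip(gpt_tags, human_tags))
    return matches / len(human_tags)

def any_overlap_rate(gpt_sets: list[set[str]], human_sets: list[set[str]]) -> float:
    """Fraction of responses where GPT-4's theme set shares at least one
    theme with the human-assigned set (the 85% figure above)."""
    hits = sum(bool(g & h) for g, h in zip(gpt_sets, human_sets))
    return hits / len(human_sets)

# Toy example with hypothetical tags for three and two responses:
print(single_theme_agreement(
    ["goal setting", "accountability", "communication"],
    ["goal setting", "self-awareness", "communication"]))   # 0.67
print(any_overlap_rate(
    [{"goal setting", "communication"}, {"accountability"}],
    [{"communication"}, {"self-awareness"}]))                # 0.5
```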

The authors conclude that while GPT-4 is significantly faster than human coding (completing the task in under an hour versus 2-4 hours for human coders), it works best as a supplementary tool rather than a replacement for human analysis. The paper includes detailed methodology, results, and practical recommendations for balancing AI efficiency with human nuance in qualitative data analysis.

Read the paper