Item Analysis Using Machine Learning Approaches
Qizhou Duan
Since the lawsuit between ETS and Golden Rule, a large body of research investigating item bias based on group membership has been conducted and published. The original lawsuit pointed to a difference in accuracy rates between white Americans and African Americans on an insurance licensure exam in the mid-1970s, with white Americans obtaining much higher accuracy rates on average.
Questions about the fairness of the exam were heatedly debated. To ensure the fairness of tests, especially in high-stakes settings where test scores are used for personnel selection, we must first identify problematic items that are biased against certain demographic groups. In the roughly 40 years since the 1980s, methods such as Mantel-Haenszel DIF analysis, the Simultaneous Item Bias Test, and IRT-based methods have been developed to address test fairness item by item: each checks whether test takers' performance on an item differs across natural groups after the test takers have been matched on ability. This difference in performance is the concept of item bias, which was later termed Differential Item Functioning (DIF).
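To make the matching idea concrete, here is a minimal sketch of the Mantel-Haenszel common odds ratio for a single item, using only the Python standard library. It is an illustration under stated assumptions, not the project's method: the function names, the use of a total score as the matching variable, and the toy counts are all hypothetical. Test takers are grouped into strata by matching score, a 2x2 table (group by correct/incorrect) is formed in each stratum, and the tables are pooled into one odds ratio. The conversion to the delta scale uses the conventional factor of -2.35.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(item_correct, is_focal, matching_score):
    """Pooled odds ratio for one item across matched-score strata.

    Values above 1 mean the reference group has higher odds of a correct
    response than the focal group at the same matching score.
    """
    # Per stratum: [A, B, C, D] = reference correct, reference incorrect,
    # focal correct, focal incorrect.
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for correct, focal, score in zip(item_correct, is_focal, matching_score):
        idx = (2 if focal else 0) + (0 if correct else 1)
        counts[score][idx] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in counts.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n

    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(alpha):
    """Rescale the odds ratio to the delta metric (MH D-DIF)."""
    return -2.35 * math.log(alpha)


# Hypothetical toy data: (matching score, focal group?, n correct, n incorrect)
toy_cells = [
    (1, False, 3, 1),
    (1, True, 2, 2),
    (2, False, 4, 1),
    (2, True, 3, 2),
]
item_correct, is_focal, matching_score = [], [], []
for score, focal, n_right, n_wrong in toy_cells:
    for correct, n in ((1, n_right), (0, n_wrong)):
        item_correct += [correct] * n
        is_focal += [focal] * n
        matching_score += [score] * n

alpha = mantel_haenszel_odds_ratio(item_correct, is_focal, matching_score)
delta = mh_delta(alpha)
print(f"MH odds ratio: {alpha:.3f}, MH D-DIF: {delta:.3f}")
```

In this toy example the reference group outperforms the focal group within both score strata, so the odds ratio exceeds 1 and the delta value is negative, which is the direction conventionally read as the item disadvantaging the focal group. A real analysis would also need a significance test and far larger samples per stratum.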
Test fairness remains one of the central issues in educational and psychological measurement today. Compared to 40 years ago, we are now interested in more areas where unfairness may arise and are more committed to the principles of diversity, equity, and inclusion. To address this ongoing need, we can equip ourselves with more elaborate techniques for detecting DIF. In addition, steady progress in technology and computing power, coupled with advances in computational statistics and cognitive neuroscience, has given us more tools for evaluating fairness.
Current work looks at incorporating multi-modal data with sophisticated statistical methods in DIF detection. For example, researchers have proposed detecting DIF using eye-tracking data and response time. To contribute to the current body of research on detecting DIF, we aim to develop new methods that consider response accuracy, response time, and possibly eye-tracking and heart rate data in the DIF detection procedure.