Yuchen Cao and Xiaorui Shen conducted a systematic review of AI models used in studies detecting depression in social media users and found major flaws.


Artificial intelligence models used to detect depression on social media are often biased and methodologically flawed, according to a study led by Northeastern University computer science graduates.
Yuchen Cao and Xiaorui Shen were graduate students at Northeastern’s Seattle campus when they began examining how machine learning and deep learning models were being used in mental health research, particularly following the COVID-19 pandemic.
Teaming up with peers from several universities, they conducted a systematic review of academic papers using AI to detect depression among social media users. Their findings were published in the Journal of Behavioral Data Science.
“We wanted to see how machine learning or AI or deep learning models were being used for research in this field,” says Cao, now a software engineer at Meta.
Social media platforms like Twitter, Facebook and Reddit offer researchers a trove of user-generated content that reveals emotions, thoughts and mental health patterns. These insights are increasingly being used to train AI tools for detecting signs of depression. But the Northeastern-led review found that many of the underlying models were inadequately tuned and lacked the rigor needed for real-world application.
The team analyzed hundreds of papers and selected 47 relevant studies published after 2010 from databases such as PubMed, IEEE Xplore and Google Scholar. Many of these studies, they found, were authored by experts in medicine or psychology — not computer science — raising concerns about the technical validity of their AI methods.
“Our goal was to explore whether current machine learning models are reliable,” says Shen, also now a software engineer at Meta. “We found that some of the models used were not properly tuned.”
Traditional models such as Support Vector Machines, Decision Trees, Random Forests, eXtreme Gradient Boosting and Logistic Regression were commonly used. Some studies employed deep learning tools like Convolutional Neural Networks, Long Short-Term Memory networks and BERT, a popular language model.
Yet the review uncovered several significant issues:
- Only 28% of studies adequately adjusted hyperparameters, the settings that guide how models learn from data.
- Roughly 17% did not properly divide data into training, validation and test sets, increasing the risk of overfitting.
- Many relied heavily on accuracy as the sole performance metric, despite imbalanced datasets that could skew results and overlook the minority class, in this case users showing signs of depression (a pitfall illustrated in the sketch after this list).
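
To make those three pitfalls concrete, here is a minimal sketch of the workflow the review treats as standard practice: a dedicated train/validation/test split, hyperparameter tuning on the validation set only, and precision, recall and F1 reported alongside accuracy for an imbalanced label distribution. The synthetic data, the logistic regression choice and the scikit-learn workflow are illustrative assumptions, not code from any of the reviewed studies.

```python
# A minimal sketch of the practices the review checks for, using synthetic
# data in place of features extracted from social media posts:
#   1. a dedicated train / validation / test split,
#   2. hyperparameter tuning on the validation set only,
#   3. metrics beyond accuracy for an imbalanced label distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))            # stand-in feature vectors
y = (rng.random(2000) < 0.10).astype(int)  # ~10% positive class: imbalanced

# Split 60/20/20, stratified so the minority class appears in every subset.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Tune one hyperparameter (regularization strength C) against the validation set.
best_c, best_f1 = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    score = f1_score(y_val, model.predict(X_val))
    if score > best_f1:
        best_c, best_f1 = c, score

# Retrain with the chosen setting and report per-class metrics on the
# untouched test set, not accuracy alone.
final = LogisticRegression(C=best_c, class_weight="balanced", max_iter=1000)
final.fit(X_train, y_train)
pred = final.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred, digits=3))
```

With roughly 10% of examples in the positive class, a model that labeled every user "not depressed" would already reach about 90% accuracy, which is why per-class precision and recall matter more than a single accuracy figure.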
“There are some constants or basic standards, which all computer scientists know, like, ‘Before you do A, you should do B,’ which will give you a good result,” Cao says. “But that isn’t something everyone outside of this field knows, and it may lead to bad results or inaccuracy.”
The studies also displayed notable data biases. X (formerly Twitter) was the most common platform used (32 studies), followed by Reddit (8) and Facebook (7). Only eight studies combined data from multiple platforms, and about 90% relied on English-language posts, mostly from users in the U.S. and Europe.
These limitations, the authors argue, reduce the generalizability of findings and fail to reflect the global diversity of social media users.
Another major challenge: linguistic nuance. Only 23% of studies clearly explained how they handled negations and sarcasm, both of which are vital to sentiment analysis and depression detection.
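
As a toy illustration of why that matters, consider a detector that simply looks for depression-related words. The keyword counter below is an invented example, not a method from any of the reviewed studies; it scores a negated sentence exactly the same as an affirmative one.

```python
# Toy keyword-based flagger, invented for illustration only: it treats any
# post containing a "depressive" term as a positive hit, so negation is
# invisible to it.
DEPRESSIVE_TERMS = {"sad", "hopeless", "worthless", "empty"}

def naive_flag(post: str) -> bool:
    words = post.lower().replace(".", " ").split()
    return any(word in DEPRESSIVE_TERMS for word in words)

print(naive_flag("I feel hopeless and empty."))               # True
print(naive_flag("I don't feel hopeless or empty anymore."))  # also True,
# even though the second post asserts the opposite; the negation is ignored.
```

A model trained on simple bag-of-words features inherits the same blind spot unless negation and sarcasm are handled explicitly, which is why the review flags studies that do not document how they dealt with them.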
To assess the transparency of reporting, the team used PROBAST, a standard tool for assessing the risk of bias in prediction-model studies. They found that many studies lacked key details about dataset splits and hyperparameter settings, making the results difficult to reproduce or validate.
Cao and Shen plan to publish follow-up papers using real-world data to test models and recommend improvements.
Sometimes researchers don’t have enough resources or AI expertise to properly tune open-source models, Cao says.
“So [creating] a wiki or a paper tutorial is something I think is important in this field to help collaboration,” he says. “I think that teaching people how to do it is more important than just helping you do it, because resources are always limited.”
The team will present their findings at the International Society for Data Science and Analytics annual meeting in Washington, D.C.