Toxicity dataset
Some endpoints were associated in pairs, as illustrated by the phi-coefficient matrix below. The highest coefficients were NR.AR - NR.AR.LBD (0.76) and NR.ER - NR.ER.LBD (0.74).
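For readers unfamiliar with the phi coefficient, the following is a minimal sketch of how it can be computed for two endpoints from their 2x2 contingency table. The function name and the label vectors are hypothetical and chosen only for illustration; the sketch also assumes that compounds with a missing label for either endpoint have already been dropped.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length lists of binary (0/1) labels."""
    if len(x) != len(y):
        raise ValueError("label lists must have the same length")
    # Cells of the 2x2 contingency table.
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when one endpoint has only actives or only inactives.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical activity labels for eight compounds (1 = active, 0 = inactive).
nr_ar = [1, 1, 0, 0, 1, 0, 0, 0]
nr_ar_lbd = [1, 1, 0, 0, 0, 0, 0, 0]
print(round(phi_coefficient(nr_ar, nr_ar_lbd), 3))
```

A value close to 1 means the two endpoints tend to be active for the same compounds, which is the pattern reported for the receptor and ligand-binding-domain pairs.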

Widespread exposure to chemicals has raised concerns about their toxic impact on public health and the environment. Identifying and quantifying these chemicals in complex samples is not always possible, which makes assessing their toxicity difficult.
To screen chemicals quickly for potential risks to human health, this study aims to predict toxicities from tandem mass spectrometry (MS2) data. To achieve this goal, endocrine-disrupting activity data and other relevant human endpoints from the Tox21 Challenge were collected and combined with mass spectra from MassBank Europe. A k-nearest neighbors (k-NN) algorithm and a spectral network-based algorithm were implemented to predict activity from MS2 mass spectra. For k-NN with 5-fold cross-validation, the highest recall and precision were 47.1% and 44.4%, respectively (both for NR.AR). Implementing a spectral similarity network raised the recall and precision to 81.8% and 75.0%, respectively (both for NR.AR). The spectral networks showed active clusters for the NR.AR, NR.ER, NR.AR.LBD, and NR.ER.LBD endpoints.
The approach was applied to the retrospective analysis of MS2 mass spectra from a wastewater sample, showing potential for toxicity alerts. The predictive capabilities of the model could benefit further from feature selection techniques, network optimization, and integration with datasets from other domains.
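As a reminder of what the recall and precision figures measure, here is a minimal sketch that computes both from true and predicted labels. It assumes the active class is treated as the positive class; the function name and the example labels are hypothetical and are not taken from the study.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the active class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Return 0.0 when a ratio is undefined (no predicted or no true actives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = active, 0 = inactive.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Recall is the fraction of truly active spectra that the model flags, and precision is the fraction of flagged spectra that are truly active.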
The similarity matrix showed groups of chemicals with high cosine similarities. These high scores appear as clusters when the endpoint network is built.
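The cosine similarity between two spectra can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: peaks are summed into fixed-width m/z bins (the 1.0 bin width is an assumption), and the function names and peak lists are hypothetical.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two spectra given as lists of (mz, intensity)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical MS2 peak lists: (m/z, intensity).
spectrum_a = [(91.05, 100.0), (119.08, 40.0)]
spectrum_b = [(91.06, 80.0), (150.10, 20.0)]
print(round(cosine_similarity(spectrum_a, spectrum_b), 3))
```

Computing this score for every pair of spectra gives the similarity matrix; spectra that share many intense fragments score close to 1.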
The network changes with the selected cosine similarity threshold: a higher threshold keeps only the edges between the most similar spectra, so the network becomes sparser.
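The effect of the threshold can be sketched by turning a similarity matrix into an edge list. The function name and the matrix values are hypothetical; the sketch assumes an edge is drawn whenever the similarity is at or above the threshold.

```python
def build_edges(similarity, threshold):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    n = len(similarity)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if similarity[i][j] >= threshold
    ]


# Hypothetical symmetric cosine similarity matrix for four spectra.
similarity = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

print(build_edges(similarity, 0.5))  # three edges among spectra 0, 1 and 2
print(build_edges(similarity, 0.8))  # only the edge between spectra 0 and 1
```

In this toy matrix, spectrum 3 stays isolated at both thresholds, while the cluster formed by spectra 0, 1 and 2 shrinks to a single edge as the threshold rises.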
The mass spectra network can be used to predict the toxicity of unknown mass spectra (white nodes). In this example, red nodes are active and green nodes are inactive for NR.AR. The (+) sign means predicted as active and (-) means predicted as inactive. Each unknown is assigned an activity label by a voting scheme over the activities of its connected nodes.
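One way such a voting scheme could work is a simple majority vote among the labeled neighbors of the unknown node. The text does not specify the exact rule, so the sketch below is an assumption: ties are resolved as inactive, unknowns with no labeled neighbors get no prediction, and all names and data are hypothetical.

```python
from collections import Counter


def predict_by_vote(node, edges, labels):
    """Predict the activity of `node` by majority vote of its labeled neighbors.

    `edges` is a list of (node_a, node_b) pairs and `labels` maps known nodes
    to 1 (active) or 0 (inactive). Returns None if no neighbor is labeled.
    """
    neighbors = set()
    for a, b in edges:
        if a == node:
            neighbors.add(b)
        elif b == node:
            neighbors.add(a)
    votes = Counter(labels[n] for n in neighbors if n in labels)
    if not votes:
        return None
    # Assumption: a tie is resolved as inactive.
    return 1 if votes[1] > votes[0] else 0


# Hypothetical network: "u" is the unknown spectrum.
edges = [("u", "a"), ("u", "b"), ("u", "c"), ("c", "d")]
labels = {"a": 1, "b": 1, "c": 0, "d": 0}
print(predict_by_vote("u", edges, labels))  # two active votes against one inactive
```

A weighted vote, in which each neighbor counts in proportion to its cosine similarity, would be a natural variation on the same idea.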
The network was applied to the analysis of features from an environmental sample. The prediction model pinpointed the mass spectral features of potential interest. Some features were then examined individually for identification and matched active species. This methodology reduced the processing time of MS data and directly presented alerts for suspect features. Although this approach screens large numbers of mass spectra rapidly, further validation and verification studies are still needed before it can be applied on a routine basis.
This approach shows potential for the automated screening of mass spectra features in non-targeted analysis and the prioritization of samples.
The expansion of mass spectra databases, the development of algorithms, and the ongoing integration of multi-domain databases could increase the performance and chemical space coverage of this type of prediction, enabling high-throughput sample analysis in environmental monitoring.
Many works inspired this project, including:
GNPS is a community-based platform that implements a molecular networking approach, similar to the one used in this project, for the identification of MS/MS spectra.
Detective-QSAR predicts physicochemical properties and toxicities from GC-MS data using the XGBoost machine learning method.
MS2Tox is a machine learning tool that predicts ecotoxicity from LC-HRMS data and the CompTox database using an xgbDART method.
Read more about environmental analytical chemistry and mass spectrometry on Kruve Lab's website.
@masterthesis{Soto_2023,
author={Soto, Leonardo},
title={Automated Prediction of the Endocrine Disruptive Potency of Chemicals detected with LC/ESI/HRMS based on Mass Spectral Networks},
url={https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-532906},
institution = {Uppsala University, Department of Chemistry - BMC},
series = {UPKEM E},
number = {390},
year={2023}
}