Letter to the Editor
Authors: Emma L Schymanski (Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 Avenue du Swing, Belvaux, Luxembourg), Evan E Bolton (National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA)
Abstract The exposome, the totality of lifetime exposures, is a new and highly complex paradigm for health and disease. Tackling this challenge requires an effort well beyond single individuals or laboratories, where every piece of the puzzle will be vital. The launch of this new Exposome journal coincides with the evolution of the exposome through its teenage years and into a growing maturity in an increasingly open and FAIR (findable, accessible, interoperable, and reusable) world. This letter discusses how both authors and the Exposome journal alike can help increase the FAIRness of the chemical structural information and the associated metadata in the journal, aiming to capture more details about the chemistry of exposomics. The proposed chemical structure template can serve as an interoperable supplementary format that is made accessible through the website and more findable by linking the DOI of this data file to the article DOI metadata, supporting further reuse. An additional transformations template provides authors with a means to connect predecessor (parent and substrate) molecules to successor (transformation product and metabolite) molecules and thus provide FAIR connections between observed (i.e., experimental) chemical exposures and biological responses, to help improve the public knowledgebase on exposome-related transformations. These connections are vital to extend current biochemical knowledge and to fulfil the current Exposome definition of “the cumulative measure of environmental influences and associated biological responses throughout the lifespan including exposures from the environment, diet, behavior, and endogenous processes”.
Keywords: FAIR, data workflows, Open science, data sharing, chemical information, transformation products
How to Cite: Schymanski, E. & Bolton, E. (2022) “FAIRifying the exposome journal: Templates for chemical structures and transformations”, Exposome. 2(1). doi: https://doi.org/10.1093/exposome/osab006
The “exposome” is a concept first mentioned in 2005 by Wild1 to offer an environmental complement to the genome2 in considering health and disease. Now that the exposome is in its adolescence and “emerging from the primordial swamp” sufficiently to warrant its own journal,2 it is a good time to reflect on what steps are required to enable exposomics to mirror the achievements of genomics. A quick search reveals, for instance, that global investment in genomics is projected into the tens of billions in the coming years,3,4 while the global investment in the exposome or exposomics is rather of the order of tens of millions. Yet, exposomics is an extraordinarily complex paradigm that will certainly require concerted global effort comparable to that of the human genome.5 Although capturing “the cumulative measure of environmental influences and associated biological responses throughout the lifespan including exposures from the environment, diet, behaviour, and endogenous processes”6 may seem unachievable for some, sequencing the human genome was also considered an almost impossible task only a few decades ago. While the success of genomics is arguably due to many factors (including extensive investment), one very significant factor in its success is the open exchange of genomics data and the ecosystem of open resources that has been built around genomics, enabling scientists around the world to achieve extraordinary progress in a relatively short time. Can exposomics achieve the same?
With this letter, we provide some perspectives and guidance on how both authors of articles in Exposome and the Exposome journal itself can contribute to the cumulative efforts needed to tackle the exposomics challenge from a chemical information and chemical informatics standpoint. Exposomics is inherently a data-driven discipline. The interlinking of chemical, disease and reference information is already providing support to exposomics efforts, as shown in Figure 1 using examples from PubChem7 and the Comparative Toxicogenomics Database (CTD),8 as well as from the CompTox Chemicals Dashboard.9,10 Such information gathering and cross-resource integration efforts are much easier if data are both open and FAIR (findable, accessible, interoperable, reusable). Providing guidance and coordinating at a journal level is one way to enable such information gathering; genomics data deposition is mandated in most major journals and this has been key to building the open genomics data resources that are so critical for food-based pathogen surveillance, COVID-19 disease variant tracking, and so much more. If sufficient information for exposomics were available, what could we as a community achieve?
Figure 1. FAIRifying and opening up exposomics information is critical to “big data” exposomics, empowering information discovery and cross-resource integration. (Top, A) Associated disorders and diseases (and references) for a single chemical, 1-chloro-2,4-dinitrobenzene in PubChem,7 with information sourced from the Comparative Toxicogenomics Database (CTD).8 Source: https://pubchem.ncbi.nlm.nih.gov/compound/6#section=Associated-Disorders-and-Diseases. (Bottom, B) Individual chemical-disease endpoint mappings via Name, Chemical Abstracts Service Registry Numbers (CAS RN), CompTox Chemicals Dashboard identifiers (DSSToxID or DTXSIDs), plus total and endpoint-specific reference counts in the context of neurotoxicity, embedded in an Excel macro.10,11
Authors need guidance to properly and uniformly capture and report chemical structure information and transformations, that is, connecting either endogenous or exogenous chemicals with their metabolites—thus helping capture the associated biological responses. The flexible templates provided here (see Sections “Chemical Structure Data” and “Transformations Data”) show how authors can consistently submit this information to the Exposome journal as supplementary materials with their articles. These templates are designed such that authors can include as much or as little information as is available, yet still contribute their knowledge and outcomes to the exposomics “pool” (and beyond) in an open and FAIR manner. The “Chemical Structure Data” template is identical to the template introduced recently in the Journal of Cheminformatics.12
An incredible amount of knowledge relevant for exposomics has already been gathered, yet current studies are based primarily on using public resources to find existing information. To extend exposomics into the future, we need to enable the discovery and reporting of new findings via rapid integration into public resources. Thus, author contributions, no matter how small, will gradually help build the bigger picture needed to unravel and comprehend the exposome. Before we launch into the template descriptions, a few definitions are covered in the next section.
While “FAIR” and “Open” are used somewhat interchangeably in this article as we strongly believe that chemical data should be both where possible, there is a distinction that is particularly relevant for exposomics, as sensitive human data cannot necessarily be made open. Data can be “open” but not “FAIR,” and vice versa. Open science has many facets; of most relevance to this article is open access (OA). OA is a set of principles and a range of practices through which research outputs are distributed online, free of cost or other access barriers.13 The FAIR principles for digital assets, on the other hand, include guidance on how to make data more Findable, Accessible, Interoperable, and Reusable.14,15 For example, if you have open data that is not findable, no one can use it; whereas if you have “FAIR” data that is not “open”, it is not available for integration into open community resources. Thus, the most powerful data are both open and FAIR.
In Table 1, we provide some definitions of chemical and transformation terms used later in this article.
Table 1. Definition of chemical and transformation terms used in this article and/or templates
| Concept | Definition |
|---|---|
| Biosystem | The medium in which the predecessor is transformed into the successor (e.g., environment and human liver) |
| Identifier | An identifier or name that you (the author) have for a chemical structure |
| InChI | IUPAC International Chemical Identifier is a descriptor of a chemical structure16 |
| InChIKey | A 27-character long, layered “hash” of an InChI16 |
| PubChem CID | PubChem Compound Identifier |
| Predecessor | Substrate/parent that is transformed (somehow) into a successor product |
| SMILES | Chemical structure notation expressed as a string |
| Successor | Transformation product/metabolite resulting from transformation (somehow) of a substrate/parent |
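The 27-character InChIKey layout described in Table 1 can be checked mechanically before a file is submitted. The sketch below is illustrative only and is not part of the templates: it assumes the standard InChIKey layout of 14 letters, a hyphen, 10 letters, a hyphen, and one final letter, and the function names are hypothetical. It checks the layout only; it cannot confirm that a key corresponds to a real structure.

```python
import re

# Assumed standard InChIKey layout: 14 letters, hyphen, 10 letters,
# hyphen, 1 letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if text has the 27-character layout of an InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def first_block(inchikey: str) -> str:
    """Return the first 14-character block (the connectivity layer hash)."""
    return inchikey.strip().split("-")[0]


# Illustrative value: the InChIKey of water.
example = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
print(looks_like_inchikey(example), first_block(example))
```

Because the first block depends only on the connectivity of the structure, comparing first blocks is one simple way to group records that differ only in stereochemistry or isotopic labelling.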
Better consideration of chemical factors in the exposome requires high-quality chemical information in research articles. Many exposomics resources are based (mostly) on literature mining using name and synonym matching, which can be notoriously prone to errors. In this section, we provide some guidance on what information authors should consider providing, as well as the pros and cons of various choices. Since this Chemical Structure Data template was presented recently to the Journal of Cheminformatics,12 some of the material in this section overlaps with the previous article.
Authors should consider submitting their chemical structure information with their manuscript as Supplementary Material using the suggested template in comma-separated value (CSV; *.csv) format or, alternatively, in tab-separated value (TSV; *.tsv) or structure data file (SDF; *.sdf) format. These formats ensure maximum interoperability between resources and operating systems. The popular XLS(X) format is not truly interoperable (options to save as CSV or TSV are offered), while the extraction of information from PDF format is difficult without introducing errors. The content below describes the CSV/TSV formats; SDF instructions are available elsewhere17 (however, the SD fields should match the CSV/TSV headers). In our experience so far, CSV often proves most interoperable for the widest audience, although the other formats also have certain advantages.
For CSV/TSV files, the header (first row) indicates the data content of each column; each subsequent row corresponds to a complete chemical record description: chemical structure, chemical names, identifiers, comments, and any other data the authors wish to provide (as additional columns). The interoperable case-insensitive template CSV/TSV column headers (or SDF SD fields) are: SMILES, InChI, and InChIKey for chemical structure; Name and Synonym for chemical names; and Comment for textual comments. Any additional column headers (e.g., for data, additional identifiers, or desired metadata) are up to the author (e.g., the PubChem_CID identifier header in Figure 2). Note that there may be many Synonym and Comment columns in the file to provide space for more chemical names and metadata, respectively.
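As a minimal sketch of the layout just described, the following Python snippet writes a one-record CSV with the template headers, a repeated Synonym column, and an author-defined PubChem_CID column. The function name and the ethanol record are illustrative assumptions, not content from the template file. Using a CSV library rather than manual string joining matters here because InChI strings contain commas and must be quoted.

```python
import csv
import io

# Template headers named in the text, plus an optional author-defined column.
# Synonym (and Comment) may be repeated to hold several entries.
HEADER = ["SMILES", "InChI", "InChIKey", "Name",
          "Synonym", "Synonym", "Comment", "PubChem_CID"]


def build_structure_csv(records):
    """Return CSV text with the template header and one row per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for record in records:
        if len(record) != len(HEADER):
            raise ValueError("each record needs one value per header column")
        writer.writerow(record)  # values containing commas are quoted
    return buffer.getvalue()


# Illustrative record (ethanol); unused cells are left empty.
ethanol = ["CCO", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
           "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "Ethanol",
           "Ethyl alcohol", "", "Illustrative record", "702"]
csv_text = build_structure_csv([ethanol])
print(csv_text)
```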
Figure 2. An example chemical structure data file constructed according to the proposed template18 by taking a subset of the HSDBTPS structure data.19 Image created in RStudio (Version 1.2.5042). The HSDBTPS efforts resulted in the deposition of five new structures to PubChem, all documented in HSDB text snippets: CIDs 146035700, 146035701, 146035702, 146035703, and 146037633.
The author-submitted template file18 should contain at least one of the following columns: SMILES, InChI, Name, or InChIKey. The Name column corresponds to a single primary name for the chemical structure. Each Synonym column corresponds to an additional chemical name (one name entry per column). Each Comment column can be added to provide additional text that may be important to the downstream user. Authors can also provide additional CSV/TSV columns (or SDF SD fields) containing information about their chemical substances (with unique, descriptive headers) for additional context. Chemical database identifiers or registry numbers could be included in this manner (as additional columns or fields), or as a Synonym. Note that chemical records indicating chemical structure with only InChIKey or Name will not contain sufficient information to describe a chemical structure; and can only be mapped to existing entries in destination resources. Batch services are available (e.g., from PubChem7,20 or CompTox9,21) for authors to add, e.g., SMILES and/or InChI to their records, based upon the Name or other identifiers.
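The requirements in this paragraph can be sketched as a small pre-submission check. This is an illustrative sketch only, not a tool provided with the template: the function name and the three status labels are assumptions. It matches headers case-insensitively, requires at least one of SMILES, InChI, Name, or InChIKey, and flags records that carry only a Name or InChIKey, since these can only be mapped to existing entries.

```python
import csv
import io

STRUCTURE_FIELDS = {"smiles", "inchi"}      # sufficient to describe a structure
MAPPING_ONLY_FIELDS = {"inchikey", "name"}  # can only be mapped to existing entries


def check_structure_file(csv_text):
    """Return (row_number, status) pairs for each data row of a template CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    first_row = next(reader, None)
    if first_row is None:
        raise ValueError("file is empty")
    header = [cell.strip().lower() for cell in first_row]  # case-insensitive
    if not (STRUCTURE_FIELDS | MAPPING_ONLY_FIELDS) & set(header):
        raise ValueError("need at least one of SMILES, InChI, Name, InChIKey")
    report = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        filled = {name for name, value in zip(header, row) if value.strip()}
        if filled & STRUCTURE_FIELDS:
            status = "structure"
        elif filled & MAPPING_ONLY_FIELDS:
            status = "mapping-only"
        else:
            status = "empty"
        report.append((row_number, status))
    return report


sample = "Name,smiles,InChIKey\nEthanol,CCO,\nWater,,XLYOFNOQVPJJNP-UHFFFAOYSA-N\n,,\n"
print(check_structure_file(sample))
```

Rows reported as "mapping-only" are candidates for the batch services mentioned above, which can add SMILES or InChI based on the Name or other identifiers.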
Figure 1 in Schymanski and Bolton11 shows the template file, which is available for download18 and as Supplementary Material with this article. Figure 2 shows an example submission according to the proposed template, created by sub-setting the “HSDBTPS” dataset of literature-mined and curated transformation products from the Hazardous Substance Data Bank (HSDB) in PubChem.19,22 This example provides the Name, SMILES, and InChIKey fields as suggested, and an identifier (the PubChem Compound Identifier, CID) as an additional (optional) column (PubChem_CID) with a unique and easily recognizable header that can be processed by other resources as they choose, helping with interoperability.
The advancement of modern science is data driven.23,24 Providing key data in a ready-to-use format assists its reuse in research articles, regulatory reports, or machine-learning data models. Exposomics especially needs access to ready-to-use, high-quality chemical information from individual research articles (e.g., the connection of detected chemicals with the disease endpoint investigated or the aggregation of known metabolites of thousands of common chemicals). For instance, HSDB contains metabolites and metabolism information for 3220 chemicals gathered over 40 years, but these are only available as text snippets that need to be matched to chemical structures by synonyms followed by manual curation (initial efforts have covered only 1/100th of this dataset22). However, as mentioned above, a key challenge in exposomics is to connect chemicals (e.g., of anthropogenic origin, but also endogenous or exogenous chemicals) that are associated with exposures with their biological response. Since metabolism is the most dynamic of the biological responses, and metabolites by definition fall into the same molecular mass category as many anthropogenic chemicals of concern, a key gap in exposomics knowledge is the connection between chemicals and their metabolites. The efforts of many will be needed to help fill this knowledge gap, and the timing could not be better for exposomics, with several recent studies emerging that use in vitro enzymes to investigate parent–metabolite relationships of drugs and other relevant chemicals.25,26
The Transformations template provided here has been designed on the basis of recent efforts to fill the gaps of transformation products in PubChem using literature data,27 in collaboration with the NORMAN Suspect List Exchange (NORMAN-SLE).28-30 Several datasets from a variety of sources have now been processed. Transformations from the NORMAN-SLE, where S## refers to the list number, followed by the list code, include: S60 SWISSPEST19,31,32, S66 EAWAGTPS,33,34 S68 HSDBTPS,21,22 S73 METXBIODB,35,36 S74 REFTPS,37 S78 SLUPESTTPS,38,39 S79 UACCSCEC,40,41 and S81 THSTPS42 (list available from https://git-r3lab.uni.lu/eci/pubchem/-/raw/master/annotations/tps/Transformation_Datasets.txt). Of these, MetXBioDB also contains enzyme information, while the rest are primarily environmental data. Figure 3 shows an example “environmental” dataset compiled from several of these lists, using the proposed template. In addition to the NORMAN-SLE datasets, a dataset of more than 1200 transformations from ChEMBL43 has also been added, including enzyme, gene, and protein information (where available). An example of Transformations with more biological information available is given in Figure 4.
Figure 3. An example of various environmental transformations constructed according to the proposed Transformations template44 (using Name and PubChem CID), taking a subset of transformations from NORMAN-SLE datasets (REFTPS,37 HSDBTPS,20 SLUPESTTPS,38 EAWAGTPS,33 and SWISSPEST1931). Image created in RStudio (Version 1.2.5042).
Figure 4. An example of biological transformations constructed according to the proposed Transformations template44 (using Name and SMILES), taking a subset of transformations from NORMAN-SLE dataset MetXBioDB35 (from BioTransformer36) and the ChEMBL43 datasets on PubChem; both datasets have some degree of enzyme, gene, and/or protein information available.
Information about both the predecessor (parent/precursor) and successor (transformation product/metabolite) must be given for a valid transformation. The template accepts at least one of Name, SMILES, or PubChem CID for each, where SMILES or CID is preferred, and SMILES will be the most interoperable. Note that these need not be consistent; for instance, it is possible to provide SMILES of the successor and a CID of the predecessor if a Name or CID is not available for the successor. It is preferable to give two fields: Figure 3 shows the example of Name and CID, while Figure 4 shows an example of SMILES and Name (top panel on each figure).
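The validity rule above (each side needs at least one of Name, SMILES, or CID, and the two sides need not use the same field) can be expressed as a short check. This is a sketch under stated assumptions: the column names such as predecessor_smiles and successor_cid are hypothetical placeholders, since the exact template headers are not listed in the text, and the preference order SMILES, then CID, then Name is one reading of "SMILES or CID is preferred, and SMILES will be the most interoperable".

```python
# Assumed preference order, most interoperable first.
IDENTIFIER_PRIORITY = ("smiles", "cid", "name")


def preferred_identifier(row, side):
    """Return (kind, value) for the best available identifier, or None.

    `side` is "predecessor" or "successor"; keys such as
    "predecessor_smiles" are hypothetical column names.
    """
    for kind in IDENTIFIER_PRIORITY:
        value = str(row.get(f"{side}_{kind}", "") or "").strip()
        if value:
            return kind, value
    return None


def is_valid_transformation(row):
    """A transformation is valid only if both sides can be identified."""
    return all(preferred_identifier(row, side) is not None
               for side in ("predecessor", "successor"))


# Illustrative row mixing identifier types: CID on one side, SMILES on the other.
row = {"predecessor_cid": "702", "successor_smiles": "CC=O"}
print(is_valid_transformation(row), preferred_identifier(row, "predecessor"))
```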
If available, a brief description of the transformation is useful and can be provided in the “Transformation” field (top panel, Figures 3 and 4). Short, informative descriptions are preferred; the current entries have been either extracted automatically from existing datasets or entered manually. In the future, it may be possible to provide some guidance via an ontology as the public dataset grows to improve the machine readability. Similarly, if information on the biosystem is available (i.e., where the transformation takes place), this can be included in the Biosystem column (see Figures 3 and 4 for examples).
For datasets with biological information, this can be provided (optionally) in the Enzyme, Gene_ID, and Protein_ID columns. At this stage, the template allows flexible input (see Figure 4 for examples), but we recommend that Enzyme entries are provided as one of: Enzyme Commission (EC) number,45–47 such as “EC 2.3.2.23”; gene symbol, such as “CYP1A1”; or enzyme name, such as “Aryl hydrocarbon hydroxylase.” The Gene_ID is expected to be an NCBI Gene48 ID, such as “1543.” The Protein_ID is expected to be either an NCBI Protein49 accession, such as “NP_059488.2” or a UniProt identifier,50 such as “P08684.” If multiple entries for Enzyme, Gene_ID, and Protein_ID are provided, they should be separated by a “pipe” symbol (“|”) or provided as new rows.
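A consumer of these columns might split pipe-separated cells and roughly classify each Enzyme entry, as in the sketch below. The function names and classification labels are illustrative assumptions. In particular, the gene-symbol test (uppercase letters and digits only) is a simple heuristic introduced here, not a rule from the template, and it will misclassify some enzyme names and symbols.

```python
import re

# "EC" followed by four dot-separated fields; "-" allows partial EC numbers.
EC_PATTERN = re.compile(r"EC\s*\d+(\.(\d+|-)){3}", re.IGNORECASE)
# Heuristic assumption: gene symbols are short uppercase alphanumeric tokens.
GENE_SYMBOL_PATTERN = re.compile(r"[A-Z][A-Z0-9]*")


def split_entries(cell):
    """Split a pipe-separated cell into trimmed, non-empty entries."""
    return [part.strip() for part in cell.split("|") if part.strip()]


def classify_enzyme(entry):
    """Return 'ec_number', 'gene_symbol', or 'enzyme_name' for one entry."""
    if EC_PATTERN.fullmatch(entry):
        return "ec_number"
    if GENE_SYMBOL_PATTERN.fullmatch(entry):
        return "gene_symbol"
    return "enzyme_name"


cell = "EC 2.3.2.23| CYP1A1 |Aryl hydrocarbon hydroxylase"
for entry in split_entries(cell):
    print(entry, "->", classify_enzyme(entry))
```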
Finally, the Reference_ID and Reference_Description columns provide the opportunity to credit the original sources of the information. Reference_ID entries should be either PubMed identifiers51 (PMIDs) or Digital Object Identifiers52 (DOIs), preceded with “PMID:” or “DOI:”, respectively, for easy recognition, and separated by a “pipe” (“|”) if multiple IDs exist (they can be mixed—for example, “PMID:33929905|DOI:10.1186/s13321-018-0324-5”). The Reference_Description can be used to provide a free text form of the reference, to describe the data source (if no PMID/DOI available) or to describe evidence of the transformation. Only Reference_ID can be processed automatically. Again, see Figures 3 and 4 and the Transformations template44 for examples.
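Because only Reference_ID is processed automatically, its format is worth checking before submission. The following sketch parses a cell in the format described above ("PMID:" or "DOI:" prefixes, separated by pipes). The function name and the choice to raise an error on unrecognized entries are assumptions made for illustration.

```python
def parse_reference_ids(cell):
    """Parse a Reference_ID cell into a list of (prefix, value) pairs.

    Entries are separated by "|" and must start with "PMID:" or "DOI:".
    """
    references = []
    for part in cell.split("|"):
        part = part.strip()
        if not part:
            continue
        # Split at the first colon only, so DOIs containing colons survive.
        prefix, separator, value = part.partition(":")
        prefix = prefix.strip().upper()
        value = value.strip()
        if not separator or prefix not in ("PMID", "DOI") or not value:
            raise ValueError(f"unrecognized reference entry: {part!r}")
        if prefix == "PMID" and not value.isdigit():
            raise ValueError(f"PMID should be numeric: {part!r}")
        references.append((prefix, value))
    return references


# The mixed example given in the text.
print(parse_reference_ids("PMID:33929905|DOI:10.1186/s13321-018-0324-5"))
```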
So far, about 6000 Transformations have been processed using these templates, from nine different sources (many of these being composite data from several sources themselves, including ChEMBL,43 MetXBioDB,35 and REFTPS37). The Transformations are being integrated into current computational mass spectrometry workflows (such as patRoon53 and as documented in Krier et al.22) and are openly available for all. The summarized files are likewise available for comprehensive efforts such as BioTransformer36 to add this new data to their training set (MetXBioDB35 is the library behind BioTransformer) and likewise improve predictions. Overall, FAIR transformations data will greatly support exposomics, and discussions to extend these templates into fields with formal ontologies and/or other formats such as mzTab54,55 in the future are welcomed. As demonstrated in Figures 5 and 6, one can see the benefits of arranging data in FAIR templates. Figure 5 is an example of a resulting Transformation entry in PubChem, while Figure 6 can be created automatically in CDK Depict using simple code in R to create annotated reaction SMILES from the fields shown in Figure 3 only.
Figure 5. Example “Transformations” table in PubChem for Carbamazepine, demonstrating possible display options (including hyperlinking) for FAIR Transformations. Source: https://pubchem.ncbi.nlm.nih.gov/compound/2554#section=Transformations.
Figure 6. Example reactions corresponding with the last four rows of Figure 3, automatically created and depicted with CDK Depict56 (https://www.simolecule.com/cdkdepict/depict.html) directly from template content shown in Figure 3 (SMILES, Name, Enzyme, and Reference_ID fields).
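The idea of deriving annotated reaction SMILES from template fields can be sketched as follows. This is not the authors' R code: it is a hypothetical Python equivalent that assumes the usual reaction SMILES form "predecessor>>successor" followed by a space and a free-text title, and whether a particular depiction tool renders that title is an assumption. The ethanol to acetaldehyde example is illustrative and does not come from the figures.

```python
def reaction_smiles(predecessor_smiles, successor_smiles, label=""):
    """Build 'predecessor>>successor label' from two SMILES strings."""
    cleaned = []
    for smiles in (predecessor_smiles, successor_smiles):
        smiles = smiles.strip()
        # A SMILES string must be non-empty and contain no spaces or ">".
        if not smiles or ">" in smiles or any(ch.isspace() for ch in smiles):
            raise ValueError(f"not a single-molecule SMILES: {smiles!r}")
        cleaned.append(smiles)
    core = f"{cleaned[0]}>>{cleaned[1]}"
    label = " ".join(label.split())  # collapse whitespace in the title
    return f"{core} {label}" if label else core


def build_label(predecessor_name, successor_name, enzyme="", reference_id=""):
    """Combine template fields into a short title; empty fields are skipped."""
    parts = [f"{predecessor_name} to {successor_name}", enzyme, reference_id]
    return "; ".join(part for part in parts if part)


# Illustrative values only.
label = build_label("Ethanol", "Acetaldehyde", enzyme="Alcohol dehydrogenase")
print(reaction_smiles("CCO", "CC=O", label))
```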
Exposomics is a data-driven science and vast quantities of information will be needed for it to be successful. By making the output of exposomics research available in a more machine-readable way, we can accelerate our progress and rise to the challenge. The templates provided here are a means to make primary outputs FAIR (Findable, Accessible, Interoperable, and Reusable). When authors provide this content as Supplementary Material, it can be readily accessed and utilized, ideally without human intervention. When the journal interlinks these Supplementary Material files with the article DOI and associated metadata, other resources can rapidly find and integrate this content and provide enhanced services for the entire community. Improving the FAIRness of Supplementary Material greatly decreases the effort to combine and aggregate information between papers and improves the correctness of the information over text-mining-based approaches. It also greatly enhances the visibility of the individual works and research outputs. As a young scientific discipline, the exposome should learn from its closely related “elder” disciplines. Genomic approaches gained incredible traction due to the widely encouraged and eventually mandated sharing of information. Let us take these lessons to heart and advance together as a field. We need to share information—and lots of it—to help make sense of the exposome. The use of these facile, ready-to-use templates will help advance exposomics by contributing vital information to complete the exposomics “puzzle.”
Editor's Note
The following Letter to the Editor is rather unique. Based on a dialog between the authors and the Editor, it was decided that the journal would grant an exception and allow an extended Letter to the Editor to facilitate discussion on this topic.
We gratefully acknowledge discussions with the entire PubChem team, especially Jian (Jeff) Zhang and Tiejun Cheng for their joint work on the transformations, as well as Ben Shoemaker, Paul Thiessen, Siqian He, and Asta Gindulyte. We also gratefully acknowledge discussions with Egon Willighagen and the editorial team at the Journal of Cheminformatics (surrounding the lead-up article to this article), and many collaborators who have worked on depositions within PubChem and the NORMAN-SLE. Special mentions go to Frank Menger (SLU, Sweden) and Lidia Belova (University of Antwerp, Belgium), for testing and depositing data using earlier versions of the transformations template (SLUPESTTPS and UACCSCEC, respectively). We are also grateful to Anca Baesu (McGill University, Canada) and Parviel Chirsir (University of Luxembourg), as well as Noelia Ramirez and colleagues (URV, Tarragona, Spain) for their testing and contributions using the existing templates (REFTPS and THSTPS, respectively).
Supplementary material is available at Exposome online. The chemical structure data submission template and transformations template are provided as Supplementary Material and are also available online.18,44,57 All transformations mentioned in this article are openly available on the NORMAN-SLE and PubChem.
E.E.B. is funded by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. E.L.S. acknowledges funding support from the Luxembourg National Research Fund (FNR) for project A18/BM/12341006.
The authors declare no competing interests.
1 Wild CP. Complementing the genome with an “exposome”: The outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 2005; 14:1847–1850. doi: http://doi.org/10.1158/1055-9965.EPI-05-0456
2 Miller GW. Exposome: A new field, a new journal. Exposome 2021; 1. doi: http://doi.org/10.1093/exposome/osab001
3 GlobeNewswire, Inc. Genomics Market to Reach USD 94.66 Billion by 2028; Increasing Genomics’ Application & Rising Government Investments to Amplify Market Growth: Says Fortune Business Insights. Accessed September 5, 2021. https://www.globenewswire.com/news-release/2021/05/20/2233128/0/en/Genomics-Market-to-Reach-USD-94-66-Billion-by-2028-Increasing-Genomics-Application-Rising-Government-Investments-to-Amplify-Market-Growth-Says-Fortune-Business-Insights.html
4 P&S Intelligence. Global Genomics Market to Reach $68 Billion by 2030: P&S Intelligence. Accessed September 5, 2021. https://www.prnewswire.com/news-releases/global-genomics-market-to-reach-68-billion-by-2030-ps-intelligence-301125318.html
5 Vermeulen R, Schymanski EL, Barabási AL, Miller GW. The exposome and health: Where chemistry meets biology. Science. 2020; 367:392–396. doi: http://doi.org/10.1126/science.aay3164
6 Miller GW, Jones DP. The nature of nurture: Refining the definition of the exposome. Toxicol Sci. 2014; 137:1–2. doi: http://doi.org/10.1093/toxsci/kft251
7 Kim S, Chen J, Cheng T, et al. PubChem 2019 update: Improved access to chemical data. Nucleic Acids Res. 2019; 47:D1102–D1109. doi: http://doi.org/10.1093/nar/gky1033
8 Davis AP, Grondin CJ, Johnson RJ, et al. Comparative toxicogenomics database (CTD): Update 2021. Nucleic Acids Res. 2021; 49:D1138–D1143. doi: http://doi.org/10.1093/nar/gkaa891
9 Williams AJ, Grulke CM, Edwards J, et al. The CompTox chemistry dashboard: A community data resource for environmental chemistry. J Cheminform. 2017; 9:61. doi: http://doi.org/10.1186/s13321-017-0247-6
10 Schymanski EL, Baker NC, Williams AJ, et al. Connecting environmental exposure and neurodegeneration using cheminformatics and high resolution mass spectrometry: Potential and challenges. Environ Sci Process Impacts. 2019; 21:1426–1445. doi: http://doi.org/10.1039/C9EM00068B
11 Baker NC, Schymanski EL, Williams AJ. Literature neurotoxicants: Excel Macro File. FigShare. doi: http://doi.org/10.23645/epacomptox.7334603
12 Schymanski EL, Bolton EE. FAIR chemical structures in the Journal of Cheminformatics. J Cheminform. 2021; 13:50. doi: http://doi.org/10.1186/s13321-021-00520-4
13 Peter Suber. Open Access Overview (Definition, Introduction). Accessed July 3, 2021. http://legacy.earlham.edu/~peters/fos/overview.htm
14 GO FAIR. FAIR Principles. Published 2021. Accessed March 23, 2021. https://www.go-fair.org/fair-principles/
15 Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. Comment: The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016; 3:1–9. doi: http://doi.org/10.1038/sdata.2016.18
16 Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I. InChI - the worldwide chemical structure identifier standard. J Cheminform. 2013; 5:7. doi: http://doi.org/10.1186/1758-2946-5-7
17 NCBI/NLM/NIH. PubChem Documentation: Substance SDF Submission. Published 2021. Accessed March 23, 2021. https://pubchem.ncbi.nlm.nih.gov/upload/docs/examples/substance_submission.sdf
18 NCBI/NLM/NIH. Chemical Structure Data Template (CSV). Published 2021. Accessed May 9, 2021. https://ftp.ncbi.nlm.nih.gov/pubchem/Other/Submissions/Chemical_Structure_Data_Template.csv
19 LCSB-ECI, Krier J, Schymanski E, et al. S68 | HSDBTPS | Transformation Products Extracted from HSDB Content in PubChem. Published online June 11, 2020. doi: http://doi.org/10.5281/zenodo.3827487
20 NCBI/NLM/NIH. PubChem Identifier Exchange. Published 2021. Accessed March 23, 2021. https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi
21 United States Environmental Protection Agency. CompTox Batch Search. Published 2021. Accessed March 23, 2021. https://comptox.epa.gov/dashboard/dsstoxdb/batch_search
22 Krier J, Singh RR, Kondić T, et al. Discovering pesticides and their TPs in Luxembourg waters using open cheminformatics approaches. Environ Int. 2022; 158:106885. doi: http://doi.org/10.1016/j.envint.2021.106885
23 Montáns FJ, Chinesta F, Gómez-Bombarelli R, Kutz JN. Data-driven modeling and learning in science and engineering. Comptes Rendus Mécanique. 2019; 347:845–855. doi: http://doi.org/10.1016/j.crme.2019.11.009
24 Clauset A, Larremore DB, Sinatra R. Data-driven predictions in the science of science. Science. 2017; 355:477–480. doi: http://doi.org/10.1126/science.aal4217
25 Liu K, Lee C, Singer G, et al. Enzyme-based chemical identification for metabolomics. FASEB J. 2021; 35:fasebj.2021.35.S1.04277. doi: http://doi.org/10.1096/fasebj.2021.35.S1.04277
26 Ross DH, Seguin RP, Krinsky AM, Xu L. High-throughput measurement and machine learning-based prediction of collision cross sections for drugs and drug metabolites. Bioinformatics 2021. doi: http://doi.org/10.1101/2021.05.13.443945
27 Schymanski EL, Kondić T, Neumann S, Thiessen PA, Zhang J, Bolton EE. Large chemical knowledge bases for exposomics: PubChemLite meets MetFrag. J Cheminform. 2021; 13:19. doi: http://doi.org/10.1186/s13321-021-00489-0
28 NORMAN Network. NORMAN Suspect List Exchange. Accessed June 9, 2019. https://www.norman-network.com/nds/SLE/
29 NORMAN Network. NORMAN Suspect List Exchange on Zenodo. NORMAN Suspect List Exchange: Zenodo Community. Accessed June 9, 2019. https://zenodo.org/communities/norman-sle/
30 NORMAN Network, NCBI/NLM/NIH. NORMAN SLE Classification Browser. Accessed May 7, 2020. https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101
31 Kiefer K, Müller A, Singer H, Hollender J. S60 | SWISSPEST19 | Swiss Pesticides and Metabolites from Kiefer 2019. Published online November 17, 2019. doi: http://doi.org/10.5281/zenodo.3544760
32 Kiefer K, Müller A, Singer H, Hollender J. New relevant pesticide transformation products in groundwater detected using target and suspect screening for agricultural and urban micropollutants with LC-HRMS. Water Res. 2019; 165:114972. doi: http://doi.org/10.1016/j.watres.2019.114972
33 Schollee J, Schymanski E. S66 | EAWAGTPS | Parent-Transformation Product Pairs from Eawag. Published online April 23, 2020. doi: http://doi.org/10.5281/zenodo.3754448
34 Schollée JE, Schymanski EL, Stravs MA, Gulde R, Thomaidis NS, Hollender J. Similarity of high-resolution tandem mass spectrometry spectra of structurally related micropollutants and transformation products. J Am Soc Mass Spectrom. 2017; 28:2692–2704. doi: http://doi.org/10.1007/s13361-017-1797-6
35 Djoumbou-Feunang Y, Schymanski E, Zhang J, Wishart DS. S73 | METXBIODB | Metabolite Reaction Database from BioTransformer. Published online November 5, 2020. doi: http://doi.org/10.5281/zenodo.4056560
36 Djoumbou-Feunang Y, Fiamoncini J, Gil-de-la-Fuente A, Greiner R, Manach C, Wishart DS. BioTransformer: A comprehensive computational tool for small molecule metabolism prediction and metabolite identification. J Cheminform. 2019; 11:2. doi: http://doi.org/10.1186/s13321-018-0324-5
37 Schymanski E. S74 | REFTPS | Transformation products and reactions from literature. Published online December 2020. doi: http://doi.org/10.5281/zenodo.4318838
38 Menger F, Boström G. S78 | SLUPESTTPS | Pesticides and TPs from SLU, Sweden. Published online May 10, 2021. doi: http://doi.org/10.5281/zenodo.4687924
39 Menger F, Boström G, Jonsson O, et al. Identification of pesticide transformation products in surface water using suspect screening combined with national monitoring data. Environ Sci Technol. 2021; 55:10343–10353. doi: http://doi.org/10.1021/acs.est.1c00466
40 Belova L, Caballero-Casero N, van Nuijs ALN, Covaci A. Ion mobility-high-resolution mass spectrometry (IM-HRMS) for the analysis of contaminants of emerging concern (CECs): Database compilation and application to urine samples. Anal Chem. 2021; 93:6428–6436. doi: http://doi.org/10.1021/acs.analchem.1c00142
41 Belova L, Caballero-Casero N, van Nuijs ALN, Covaci A. S79 | UACCSCEC | Collision Cross Section (CCS) Library from UAntwerp. Published online May 10, 2021. doi: http://doi.org/10.5281/zenodo.4704648
42 Merino C, Vinaixa M, Ramirez N. S81 | THSTPS | Thirdhand Smoke Specific Metabolites. Published online September 2, 2021. doi: http://doi.org/10.5281/zenodo.5394629
43 Gaulton A, Hersey A, Nowotka M, et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017; 45:D945–D954. doi: http://doi.org/10.1093/nar/gkw1074
44 NCBI/NLM/NIH. Transformations data template (CSV). Published 2021. Accessed May 25, 2021. https://ftp.ncbi.nlm.nih.gov/pubchem/Other/Submissions/Transformations_Template.csv
45 McDonald AG, Boyce S, Tipton KF. ExplorEnz: The primary source of the IUBMB enzyme list. Nucleic Acids Res. 2009; 37(Database issue):D593–D597. doi: http://doi.org/10.1093/nar/gkn582
46 Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000; 28:304–305. doi: http://doi.org/10.1093/nar/28.1.304
47 Chang A, Jeske L, Ulbrich S, et al. BRENDA, the ELIXIR core data resource in 2021: New developments and updates. Nucleic Acids Res. 2021; 49:D498–D508. doi: http://doi.org/10.1093/nar/gkaa1025
48 Brown GR, Hem V, Katz KS, et al. Gene: A gene-centered information resource at NCBI. Nucleic Acids Res. 2015; 43(Database issue):D36–D42. doi: http://doi.org/10.1093/nar/gku1055
49 Benson DA, Cavanaugh M, Clark K, et al. GenBank. Nucleic Acids Res. 2018; 46(D1):D41–D47. doi: http://doi.org/10.1093/nar/gkx1094
50 The UniProt Consortium. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 2017; 45(D1):D158–D169. doi: http://doi.org/10.1093/nar/gkw1099
51 Sayers EW, Beck J, Bolton EE, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021; 49(D1):D10–D17. doi: http://doi.org/10.1093/nar/gkaa892
52 International DOI Foundation. Frequently Asked Questions about the DOI® System. Accessed September 7, 2021. https://www.doi.org/faq.html
53 Helmus R, ter Laak TL, van Wezel AP, de Voogt P, Schymanski EL. patRoon: Open source software platform for environmental mass spectrometry based non-target screening. J Cheminform. 2021; 13(1). doi: http://doi.org/10.1186/s13321-020-00477-w
54 Griss J, Jones AR, Sachsenberg T, et al. The mzTab data exchange format: Communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. Mol Cell Proteomics. 2014; 13:2765–2775. doi: http://doi.org/10.1074/mcp.O113.036681
55 Hoffmann N, Rein J, Sachsenberg T, et al. mzTab-M: A data standard for sharing quantitative results in mass spectrometry metabolomics. Anal Chem. 2019; 91:3302–3310. doi: http://doi.org/10.1021/acs.analchem.8b04310
56 Mayfield J. CDK Depict Web Interface. Accessed December 6, 2021. https://simolecule.com/cdkdepict/depict.html
57 NCBI/NLM/NIH. PubChem Submissions Template Folder. Published 2021. Accessed May 25, 2021. https://ftp.ncbi.nlm.nih.gov/pubchem/Other/Submissions/