New Algorithm Helps Evaluate, Rank Scientific Literature

April 18, 2013 Matt Shipman

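Keeping up with current scientific literature is a daunting task, considering that hundreds to thousands of papers are published each day. Now researchers from North Carolina State University have developed a computer program to help them evaluate and rank scientific articles in their field.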

The researchers use a text-mining algorithm to prioritize research papers to read and include in their Comparative Toxicogenomics Database (CTD), a public database that manually curates and codes data from the scientific literature describing how environmental chemicals interact with genes to affect human health.

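“There have been 33,000 scientific papers published on heavy metal toxicity alone, going back as far as 1926,” explains Dr. Allan Peter Davis, a biocuration project manager for CTD at NC State and co-lead author of an article on the work. “We simply can’t read and code them all. And, with the help of this new algorithm, we don’t have to.”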

To help select the most relevant papers for inclusion in the CTD, Thomas Wiegers, a research bioinformatician at NC State and the other co-lead author of the report, developed a sophisticated algorithm as part of a text-mining process. The application evaluates the text from thousands of papers and assigns a relevancy score to each document. “The score ranks the set of articles to help separate the wheat from the chaff, so to speak,” Wiegers says.

But how good is the algorithm at determining the best papers? To test that, the researchers text-mined 15,000 articles and sent a representative sample to their team of biocurators to manually read and evaluate on their own, blind to the computer’s score. “The results were impressive,” Davis says. The biocurators concurred with the algorithm 85 percent of the time with respect to the highest-scored papers.
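The ranking and the blind comparison described above can be illustrated with a minimal Python sketch. It is not the CTD team's implementation: how the relevancy score is actually computed is not described here, so the scores, article identifiers, curator judgments, and function names below are all made-up assumptions for illustration only.

```python
from typing import NamedTuple


class Article(NamedTuple):
    article_id: str         # hypothetical identifier
    score: float            # relevancy score from text mining (made-up value)
    curator_relevant: bool  # blind human judgment (made-up value)


def rank_by_score(articles):
    """Return the articles sorted from highest to lowest relevancy score."""
    return sorted(articles, key=lambda a: a.score, reverse=True)


def top_k_agreement(articles, k):
    """Fraction of the k highest-scored articles that curators also judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = rank_by_score(articles)[:k]
    if not top:
        return 0.0
    return sum(a.curator_relevant for a in top) / len(top)


# Illustrative data only; not taken from the study.
sample = [
    Article("A", 0.91, True),
    Article("B", 0.15, False),
    Article("C", 0.78, True),
    Article("D", 0.66, False),
    Article("E", 0.40, True),
]

ranked_ids = [a.article_id for a in rank_by_score(sample)]
agreement = top_k_agreement(sample, k=3)
print(ranked_ids)           # article ids, best score first
print(f"{agreement:.2f}")   # share of the top 3 that curators agreed with
```

In this toy data, article D scores high but was judged irrelevant by the curator, which is the kind of outlier a team would inspect when tuning a scoring algorithm.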

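Using the algorithm to triage papers allows biocurators to focus on the most relevant ones, increasing productivity by 27 percent and novel data content by 100 percent. “It’s a tremendous time-saving step,” Davis explains. “With this, we can allocate our resources much more effectively by having the team focus on the most informative papers.”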

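There are always outliers in these types of experiments: cases where the algorithm assigns a very high score to an article that a human biocurator quickly dismisses as irrelevant. The team that looked at those outliers was often able to see a pattern explaining why the algorithm mistakenly identified a paper as important. “Now, we can go back and tweak the algorithm to account for this and fine-tune the system,” Wiegers says.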

“We’re not at the point yet where a computer can read and extract all the relevant data on its own,” Davis concludes, “but having this text-mining process to direct us toward the most informative articles is a huge first step.”

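The paper, “Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database,” was published online April 17 in PLOS ONE. Co-authors are Dr. Cindy Murphy, a biocuration scientist at NC State; Dr. Carolyn Mattingly, an associate professor of biology at NC State; and Drs. Robin Johnson, Jean Lay, Kelley Lennon-Hopkins, Cindy Saraceni-Richards and Daniela Sciaky of The Mount Desert Island Biological Laboratory. The work was supported by the National Institute of Environmental Health Sciences.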

-shipman-

Note to Editors: The study abstract follows.

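“Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the Comparative Toxicogenomics Database”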

Authors: Allan Peter Davis, Thomas Wiegers, Cynthia Murphy, and Carolyn Mattingly, North Carolina State University; Robin Johnson, Jean Lay, Kelley Lennon-Hopkins, Cynthia Saraceni-Richards, and Daniela Sciaky, The Mount Desert Island Biological Laboratory

Published: April 17, 2013, online in PLOS ONE

Abstract: The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,904 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
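For readers unfamiliar with mean average precision, one of the parameters named in the abstract, the following Python sketch shows the standard definition of the metric. It is an illustration only: the abstract does not say how the authors grouped articles into ranked lists, and the relevance judgments below are invented.

```python
def average_precision(relevance):
    """Average precision for one ranked list.

    `relevance` holds one boolean per article, ordered best score first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant articles seen so far / rank.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the average precision over several ranked lists."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Invented judgments for two hypothetical ranked lists.
list_a = [True, False, True]
list_b = [False, True]
map_value = mean_average_precision([list_a, list_b])
print(f"{map_value:.3f}")
```

A ranking that places relevant articles nearer the top yields a value closer to 1, which is why the metric suits a score meant to prioritize reading order.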