Data mining tool is 'hypothesis generator'

Posted by Emma Woollacott

Researchers at Harvard University and MIT's Broad Institute have developed a data mining tool that can detect meaningful patterns in vast data sets.

The potential applications are broad: the tool can uncover multiple patterns hidden in anything from health information to baseball statistics.

"There are massive data sets that we want to explore, and within them, there may be many relationships that we want to understand," says Broad Institute associate member Pardis Sabeti.

"The human eye is the best way to find these relationships, but these data sets are so vast that we can't do that. This toolkit gives us a way of mining the data to look for relationships."

Current data mining tools tend to fall short when it comes to even-handedly detecting different kinds of patterns in large data collections. The new tool, MINE (Maximal Information-based Nonparametric Exploration), can rank patterns fairly, detecting a wide range of them and characterizing each according to a number of different parameters a researcher might be interested in.

The researchers tested MINE on several large data sets, including one covering the trillions of microorganisms that live in the gut. MINE made more than 22 million comparisons, and homed in on a few hundred patterns of interest that had never before been observed.

"The goal of this statistic is to take data with a lot of different dimensions and many possible correlations and pick out the top ones," says Michael Mitzenmacher, a professor of computer science at Harvard University.

"We view this as an exploration tool – it can find patterns and rank them in an equitable way."
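To give a rough feel for what "compare every pair of variables and rank the results" means in practice, here is a minimal Python sketch. It is not MINE's own statistic: the score used here, a mutual information estimate over equal-frequency bins, is a simple stand-in chosen for illustration, and the function names, the bin count and the toy columns are all assumptions rather than anything taken from the researchers' software.

```python
from collections import Counter
from itertools import combinations
from math import log


def rank_bins(values, bins):
    """Assign each value to one of `bins` equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def binned_mutual_information(xs, ys, bins=4):
    """Stand-in dependence score (NOT MINE's statistic), scaled to [0, 1]."""
    n = len(xs)
    bx, by = rank_bins(xs, bins), rank_bins(ys, bins)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = sum((c / n) * log(c * n / (px[a] * py[b]))
             for (a, b), c in joint.items())
    return mi / log(bins)


def rank_pairs(columns, bins=4):
    """Score every pair of columns and return them strongest first."""
    scored = [(binned_mutual_information(columns[a], columns[b], bins), a, b)
              for a, b in combinations(sorted(columns), 2)]
    return sorted(scored, reverse=True)


# Hypothetical toy data: a straight line, a parabola, and a column built
# so that it is unrelated to x at this bin resolution.
x = [float(i) for i in range(48)]
columns = {
    "x": x,
    "line": [2.0 * v + 1.0 for v in x],
    "parabola": [(v - 23.5) ** 2 for v in x],
    "unrelated": [float((i % 4) * 12 + i // 4) for i in range(48)],
}

ranking = rank_pairs(columns)
for score, a, b in ranking:
    print(f"{a} vs {b}: {score:.2f}")
```

On this toy data the straight line tops the list, and the parabola, which an ordinary linear correlation with x would miss entirely because the curve is symmetric, still scores well above the unrelated column. A crude fixed-bin score like this one does not treat different kinds of patterns as even-handedly as the researchers say MINE does, and three extra columns are a long way from 22 million comparisons, but the loop has the same basic shape.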

This, says the team, means it's possible to search for patterns without knowing what you're looking for in advance. MINE could generate new ideas and connections that no one has thought to look for before. It's especially good at exploring data sets in which a single relationship could contain more than one important pattern.

"Our tool is a hypothesis generator," says Yakir Reshef, a Fulbright scholar at the Weizmann Institute of Science.

"The standard paradigm is hypothesis-driven science, where you come up with a hypothesis based on your personal observations. But by exploring the data, you get ideas for hypotheses that would never have occurred to you otherwise."