Data science (DS) is an emerging discipline that is arguably having a great impact on all human endeavors. Data scientists follow a particular workflow: data acquisition and preparation, exploratory analysis, mathematical modeling, visualization, and the delivery of DS products that can be used in decision making.
Molecular dynamics simulations produce large amounts of complex data. To our knowledge, no programming library supports the data science workflow for these simulations, and the workflow has not been applied to them systematically.
The objective of this research is to develop tools that facilitate the use of the data science workflow in the analysis of molecular dynamics simulations. To this end, we developed a Python package named granules. It is built on Pandas, a package that is very popular among data scientists, and it can help other researchers perform several types of analysis with concise programs. In this work we show how it serves to implement the data science workflow, using a simulation of a PABA-functionalized cellulose crystal as a test case. The simulation results, stored as files in different formats, are loaded into a granules object. Data cleansing is done with a method called select(), which takes a dictionary of the wanted atoms. At this point several exploratory analysis methods provided by Pandas are used to produce statistics, describe distributions, and visualize the data. For example, we show how to create a correlation matrix to produce two different maps, a temporal map and a spatial map, which serve to detect oscillation patterns and relationships between selected atoms.
This is one of the many key features that granules provides to help researchers understand and communicate their final results. It gives users the ability to perform different analyses within the same software.
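As a rough illustration of the selection and correlation steps described above, the following sketch uses plain Pandas on synthetic data. It does not use the granules API. The column names, the `wanted` dictionary, the synthetic trajectory, and the reading of the spatial and temporal maps as atom-by-atom and frame-by-frame correlation matrices are all assumptions made here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a trajectory (NOT real simulation output):
# 50 frames, 5 hypothetical atoms, one coordinate (z) per atom per frame.
rng = np.random.default_rng(0)
n_frames = 50
t = np.arange(n_frames)

records = []
for atom in [1, 2, 3, 4, 5]:
    # Assumed toy motion: atoms 1-2 oscillate in phase, atoms 3-5 in antiphase.
    phase = 0.0 if atom <= 2 else np.pi
    z = np.sin(2 * np.pi * t / 10 + phase) + 0.05 * rng.standard_normal(n_frames)
    records.append(pd.DataFrame({"frame": t, "atom": atom, "z": z}))
traj = pd.concat(records, ignore_index=True)

# Cleansing step: keep only the wanted atoms, described by a dictionary
# mapping a column name to the allowed values (hypothetical convention).
wanted = {"atom": [1, 2, 3, 4]}
mask = pd.Series(True, index=traj.index)
for column, values in wanted.items():
    mask &= traj[column].isin(values)
selected = traj[mask]

# Exploratory statistics per atom (count, mean, std, quartiles, ...).
summary = selected.groupby("atom")["z"].describe()

# Reshape to one row per frame and one column per atom.
wide = selected.pivot(index="frame", columns="atom", values="z")

# "Spatial" map: correlation between atoms over time (atoms x atoms).
spatial_map = wide.corr()

# "Temporal" map: correlation between frames over atoms (frames x frames).
temporal_map = wide.T.corr()
```

Under these assumptions, in-phase atoms appear as strongly positive entries in `spatial_map` and antiphase atoms as strongly negative ones, while periodic structure in `temporal_map` reflects the oscillation period. Either matrix can be passed to a heatmap routine for visualization.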