Application of the Data Science Workflow to Molecular Dynamic Simulations

Authors: Lemuel I. Rivera Cantú, Department of Mathematics, University of Puerto Rico at Humacao, Humacao P.R. 00791; Jordan Caraballo-Vega, NASA Goddard Space Flight Center, Greenbelt, Maryland 20771

Faculty mentor: José O. Sotero Esteva, Department of Mathematics, University of Puerto Rico at Humacao, Humacao P.R. 00791

Data Science (DS) is an emerging discipline that is arguably having a great impact on all human endeavors. Data scientists follow a particular workflow: data acquisition and preparation, exploratory analysis, mathematical modeling, visualization, and the creation of DS products that can be used in decision making.

Molecular dynamics simulations produce a large amount of complex data. To our knowledge, no programming library is available for applying the Data Science workflow to these simulations, nor has the workflow been applied to them systematically.

The objective of this research is to develop tools that facilitate the use of the data science workflow in the analysis of molecular dynamics simulations. To that end, we developed a Python package named granules. It is built on Pandas, a package that is very popular among data scientists, and it can help other researchers perform several types of analysis with concise programs. In this work we show how granules serves to implement the data science workflow, using a simulation of a PABA-functionalized cellulose crystal as a test case. The simulation results, stored as files in different formats, are loaded into a granules object. A method called select() is used to cleanse the data: a dictionary describing the wanted atoms is passed to it. At this point, several exploratory analysis methods provided by Pandas are used to produce statistics, describe distributions, and visualize the data. For example, we show how to create a correlation matrix to produce two different maps, a temporal map and a spatial map, which serve to detect oscillation patterns and relationships between selected atoms.


This is one of the many key features that granules provides to help researchers understand and communicate their final results. It gives users the ability to perform different analyses within the same software.
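
The steps described above (selecting atoms with a dictionary, exploring the data with Pandas, and building correlation maps) can be illustrated with a minimal sketch that uses plain Pandas only. It does not use the granules API, whose signatures are not given here. The helper select_atoms() is a hypothetical stand-in for select(), the trajectory is synthetic, and the column names (frame, atom, element, z) are assumptions made for illustration. The sketch also assumes that the "spatial map" is an atom-by-atom correlation matrix and the "temporal map" is a frame-by-frame correlation matrix, which may differ from how granules defines them.

```python
import numpy as np
import pandas as pd


def select_atoms(atoms: pd.DataFrame, wanted: dict) -> pd.DataFrame:
    """Hypothetical stand-in for select(): keep rows whose values in each
    named column appear in the corresponding list of wanted values."""
    mask = pd.Series(True, index=atoms.index)
    for column, values in wanted.items():
        mask &= atoms[column].isin(values)
    return atoms[mask]


# Synthetic trajectory in long format: one row per (frame, atom).
rng = np.random.default_rng(0)
n_frames, n_atoms = 50, 6
frames = np.repeat(np.arange(n_frames), n_atoms)
atom_ids = np.tile(np.arange(n_atoms), n_frames)
elements = np.tile(np.array(["C", "O", "N", "C", "O", "H"]), n_frames)

# Each atom oscillates along z with its own phase, plus a little noise.
phase = 0.5 * atom_ids
z = np.sin(2 * np.pi * frames / n_frames + phase)
z = z + 0.05 * rng.standard_normal(n_frames * n_atoms)

trajectory = pd.DataFrame(
    {"frame": frames, "atom": atom_ids, "element": elements, "z": z}
)

# Data cleansing: keep only carbon and oxygen atoms.
selected = select_atoms(trajectory, {"element": ["C", "O"]})

# Exploratory analysis: per-atom summary statistics of the z coordinate.
summary = selected.groupby("atom")["z"].describe()

# Wide table: rows are frames, columns are atoms.
wide = selected.pivot(index="frame", columns="atom", values="z")

# Assumed "spatial map": correlation between atoms across all frames.
spatial_map = wide.corr()

# Assumed "temporal map": correlation between frames across all atoms.
temporal_map = wide.T.corr()

print(summary)
print(spatial_map.round(2))
```

Either matrix can then be drawn as a heat map with any plotting library. In this synthetic data the phase shift between atoms shows up as off-diagonal structure in the atom-by-atom matrix.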

Additional Abstract Information

Presenter: Lemuel I. Cantú

Institution: University of Puerto Rico at Humacao

Type: Poster

Subject: Computer Science

Status: Approved

Time and Location

Session: Poster 5
Date/Time: Tue 12:30pm-1:30pm
Session Number: 4047