Erika Ordog, Richard van Krieken, Ayush Khaitan
In this project, we develop a model to predict and rank the thermostability of enzyme variants based on experimental melting temperature data. We use the dataset provided by Novozymes through their Kaggle competition, Novozymes Enzyme Stability Prediction. The dataset contains experimentally measured thermostability (melting temperature) values for natural enzyme sequences as well as for engineered sequences that carry single or multiple mutations relative to the natural sequences.
We identify three candidate predictive frameworks and perform exploratory data analysis with each. After initial optimization, we summarize the performance of each framework and select the best-performing one using Normalized Root Mean Squared Error (NRMSE) and the Spearman Correlation Coefficient. We then optimize the selected framework through cross-validation. Finally, we prepare a project report with visualizations.
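As a rough illustration of the two selection metrics named above, the following minimal sketch computes them in plain Python. It rests on assumptions not stated in the project description: NRMSE is taken here as the RMSE divided by the range of the observed values (normalizing by the mean or the standard deviation is also common), and the function names `nrmse`, `average_ranks`, and `spearman` are hypothetical helpers, not part of the project code.

```python
import math


def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observed values.

    Assumption: range normalization. Other conventions divide by the
    mean or the standard deviation of y_true instead.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("observed values have zero range")
    return math.sqrt(mse) / value_range


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman correlation: the Pearson correlation of the ranks."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up melting temperatures (degrees C), for illustration only.
    tm_true = [48.0, 52.5, 60.1, 45.3, 55.0]
    tm_pred = [50.0, 51.0, 58.0, 47.0, 56.5]
    print("NRMSE:", nrmse(tm_true, tm_pred))
    print("Spearman:", spearman(tm_true, tm_pred))
```

The two metrics answer different questions: NRMSE measures how close the predicted melting temperatures are in absolute terms, while the Spearman coefficient only checks whether variants are ordered correctly, which is the relevant property when the goal is ranking.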