Making protein-protein interaction prediction more reliable with a large-scale dataset at the proteome level


Abstract: Reliable information about protein-protein interactions (PPIs) enables us to better understand biological processes, pathways and functions. However, many experimental problems stand in the way of identifying the complete PPI network of a cell or organism. To compensate for the limitations of current experimental techniques, we previously proposed PSOPIA, a computational method for predicting whether two proteins interact [1]. In PPI prediction, the selection of datasets is a major issue for accurately evaluating the performance of different algorithms [2, 3]. It is generally believed that increasing the size and diversity of the examples makes a dataset more representative and reduces the effects of noise; however, for many algorithms a large-scale dataset is impractical because of its memory and CPU time requirements. In this study, PSOPIA was retrained on a highly imbalanced large-scale dataset containing a diverse set of examples at the proteome level. The dataset consisted of 43,060 high-confidence direct physical PPIs obtained from TargetMine [4] (only 0.13% of the total) and 33,098,951 non-PPIs. The new prediction model achieved an AUC of 0.89 (pAUC at FPR ≤ 0.5% = 0.24), higher than that of the previous PSOPIA model. Furthermore, it was applied to the problem of filtering out protein pairs incorrectly labeled as interacting from a low-confidence human PPI dataset. Demonstrating the performance of PSOPIA at the proteome level, we suggest that a diverse set of large-scale examples is key to more reliable PPI prediction.
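The class imbalance and the two evaluation measures quoted in the abstract can be illustrated with a minimal sketch. This is not PSOPIA code: the helper names `roc_curve_points` and `partial_auc` are hypothetical, the toy scores are invented for illustration, and dividing the partial area by the FPR cutoff (so that a perfect ranking scores 1.0) is an assumption, since the abstract does not state how its pAUC is normalized.

```python
# Class balance reported in the abstract.
n_ppi = 43_060
n_non_ppi = 33_098_951
positive_fraction = n_ppi / (n_ppi + n_non_ppi)
print(f"PPIs are {positive_fraction:.2%} of all pairs")


def roc_curve_points(scores, labels):
    """Return ROC points (fpr, tpr), sweeping the threshold from high to low.

    labels are 1 (interacting) or 0 (non-interacting); tied scores are
    processed together so that each tie yields a single ROC point.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j
    return fpr, tpr


def partial_auc(fpr, tpr, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    Assumption: the area is divided by max_fpr, so 1.0 is a perfect score.
    With max_fpr=1.0 this is the ordinary AUC.
    """
    area = 0.0
    for k in range(1, len(fpr)):
        x0, x1 = fpr[k - 1], fpr[k]
        y0, y1 = tpr[k - 1], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr


# Toy example (invented scores, not data from the study).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_fpr, toy_tpr = roc_curve_points(toy_scores, toy_labels)
print("AUC:", partial_auc(toy_fpr, toy_tpr))
print("pAUC (FPR <= 0.5):", partial_auc(toy_fpr, toy_tpr, max_fpr=0.5))
```

With positives this rare, the low-FPR region matters most: even a small false positive rate applied to roughly 33 million non-PPIs yields far more false positives than there are true PPIs, which is why a pAUC at a strict FPR cutoff is reported alongside the overall AUC.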

Yoichi Murakami and Kenji Mizuguchi

Tokyo University of Information Sciences

Vol. 3, Issue 3
