Making protein-protein interaction prediction more reliable with a large-scale dataset at the proteome level
Abstract: Reliable information about protein-protein interactions (PPIs) helps us to better understand biological processes, pathways and functions. However, experimental limitations make it difficult to identify the complete PPI network of a cell or organism. To compensate for the limitations of current experimental techniques, we previously proposed PSOPIA, a computational method for predicting whether two proteins interact (http://mizuguchilab.org/PSOPIA/). In PPI prediction, the choice of dataset is a major issue when evaluating the performance of different algorithms accurately [2, 3]. It is generally believed that increasing the size and diversity of the examples makes a dataset more representative and reduces the effects of noise; however, for many algorithms, using a large-scale dataset is impractical because of memory and CPU time requirements. In this study, PSOPIA was retrained on a highly imbalanced, large-scale dataset containing a diverse set of examples at the proteome level. The dataset consisted of 43,060 high-confidence direct physical PPIs obtained from TargetMine (only 0.13% of the total) and 33,098,951 non-PPIs. The new prediction model achieved an AUC of 0.89 (pAUC at FPR ≤ 0.5% = 0.24), higher than that of the previous PSOPIA model. Furthermore, the new model was applied to the problem of filtering out protein pairs incorrectly labeled as interacting from a low-confidence human PPI dataset. These results suggest that a large-scale, diverse set of examples is key to more reliable PPI prediction, and demonstrate the performance of PSOPIA at the proteome level.
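The evaluation figures quoted in the abstract (the 0.13% share of PPIs, the AUC, and the partial AUC restricted to FPR ≤ 0.5%) can be illustrated with a minimal Python sketch. This is not the PSOPIA evaluation code. The function names `roc_points` and `partial_auc` are hypothetical, the toy labels and scores are made up for illustration, and dividing the partial area by the FPR cutoff (so that a perfect classifier scores 1.0) is an assumption, since the abstract does not state how the pAUC was normalized.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr) from (0, 0) to (1, 1); tied scores share one point."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every example that shares the current score threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def partial_auc(labels, scores, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    The area is divided by max_fpr (an assumed normalization), so
    max_fpr=1.0 gives the ordinary AUC.
    """
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the FPR cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


# Class balance using the counts reported in the abstract.
n_ppi, n_non_ppi = 43_060, 33_098_951
positive_fraction = n_ppi / (n_ppi + n_non_ppi)

# Toy data (illustrative only): 1 = interacting pair, 0 = non-interacting pair.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]

if __name__ == "__main__":
    print(f"PPI share of all pairs: {positive_fraction:.2%}")
    print(f"AUC: {partial_auc(toy_labels, toy_scores):.2f}")
    print(f"pAUC (FPR <= 0.5): {partial_auc(toy_labels, toy_scores, max_fpr=0.5):.2f}")
```

With positives this rare, even a small false-positive rate corresponds to a very large number of false positives, which is why restricting the area to a low-FPR region such as FPR ≤ 0.5% is informative alongside the full AUC.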