Probabilistic models for topic learning from images and captions in online biomedical literatures


Biomedical images and captions are among the major sources of information in online biomedical publications. They often contain the most important results to be reported and provide rich information about the main themes of published papers. In the data mining and information retrieval community, much effort has gone into using text mining and language modeling algorithms to extract knowledge from the text content of online biomedical publications; however, the problem of knowledge extraction from biomedical images and captions has not yet been fully studied. In this paper, a hierarchical probabilistic topic model with background distribution (HPB) is introduced to uncover latent semantic topics from the co-occurrence patterns of caption words, visual words and biomedical concepts. From downloaded biomedical figures, restricted captions are extracted for each individual image panel. During the indexing stage, the ‘bag-of-words’ representation of captions is supplemented by ontology-based concept indexing to alleviate the synonymy and polysemy problems. As the visual counterpart of text words, visual words are extracted from the corresponding image panels and indexed. The model is estimated via a collapsed Gibbs sampling algorithm. We compare the performance of our model with an extension of the Correspondence LDA (Corr-LDA) model under the same biomedical image annotation scenario using cross-validation. Experimental results demonstrate that our model is able to accurately extract latent patterns from complicated biomedical image-caption pairs and to facilitate knowledge organization and understanding in online biomedical literature.
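As a rough illustration of the estimation step, the sketch below runs collapsed Gibbs sampling for a plain LDA topic model over bag-of-words token ids. It is not the HPB model described above: it has no hierarchy, no background distribution, and no separate treatment of caption words, visual words and concepts. The function name `collapsed_gibbs_lda`, the hyperparameters `alpha` and `beta`, and the idea of mapping all token types into a single integer vocabulary are assumptions made for this example only.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative, not HPB).

    docs: list of documents, each a list of integer token ids in
    range(vocab_size). Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word
```

A call such as `collapsed_gibbs_lda([[0, 1, 2], [3, 4, 5]], num_topics=2, vocab_size=6)` returns per-document topic counts and per-topic word counts; normalizing each row after adding `alpha` or `beta` gives point estimates of the topic mixtures and topic-word distributions.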

Proceedings of the 18th ACM international conference on Information and knowledge management - CIKM '09