Details
Paper ID 10
Difficulty - Medium

Categories

  • NLP
  • Extreme Multi-label Classification
  • Convolutions
  • medium

Abstract - Extreme multi-label text classification (XMTC) refers to the problem of assigning to each document its most relevant subset of class labels from an extremely large label collection, where the number of labels could reach hundreds of thousands or millions. The huge label space raises research challenges such as data sparsity and scalability. Significant progress has been made in recent years by the development of new machine learning methods, such as tree induction with large-margin partitions of the instance spaces and label-vector embedding in the target space. However, deep learning has not been explored for XMTC, despite its big successes in other related areas. This paper presents the first attempt at applying deep learning to XMTC, with a family of new Convolutional Neural Network (CNN) models which are tailored for multi-label classification in particular. With a comparative evaluation of 7 state-of-the-art methods on 6 benchmark datasets where the number of labels is up to 670,000, we show that the proposed CNN approach successfully scaled to the largest datasets, and consistently produced the best or the second best results on all the datasets. On the Wikipedia dataset with over 2 million documents and 500,000 labels in particular, it outperformed the second best method by 11.7% ∼ 15.3% in precision@K and by 11.5% ∼ 11.7% in NDCG@K for K = 1,3,5.

Paper - http://nyc.lti.cs.cmu.edu/yiming/Publications/jliu-sigir17.pdf

Code - https://github.com/siddsax/XML-CNN/blob/master/code/cnn_train.py

Dataset - https://drive.google.com/file/d/0b3lpmihmg6vgu0vtr1pcejfpwjg/view?usp=sharing&resourcekey=0-surjz4z_5tr38jenzf2iwg
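
The abstract reports results in precision@K and NDCG@K for K = 1,3,5. As a rough illustration of what those two ranking metrics measure for a single document, here is a minimal sketch. It assumes binary relevance labels and the definitions commonly used in extreme classification work; it is not taken from the linked code, and the function names (top_k, precision_at_k, ndcg_at_k) and the toy data are hypothetical.

```python
import math


def top_k(scores, k):
    """Return the indices of the k highest-scoring labels, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def precision_at_k(y_true, scores, k):
    """Fraction of the k top-ranked labels that are relevant (y_true is 0/1)."""
    hits = sum(y_true[i] for i in top_k(scores, k))
    return hits / k


def ndcg_at_k(y_true, scores, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    # Discounted gain: a relevant label at 0-based rank r contributes 1/log2(r + 2).
    dcg = sum(
        y_true[i] / math.log2(rank + 2)
        for rank, i in enumerate(top_k(scores, k))
    )
    # Ideal ranking places all relevant labels first, capped at k positions.
    n_relevant = int(sum(y_true))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, n_relevant)))
    return dcg / ideal if ideal > 0 else 0.0


# Toy example (made-up data): 6 labels, 3 of them relevant.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3]
for k in (1, 3, 5):
    print(k, precision_at_k(y_true, scores, k), ndcg_at_k(y_true, scores, k))
```

In an evaluation, these per-document values would be averaged over all test documents. With hundreds of thousands of labels, a real implementation would likely use sparse label matrices and a partial sort instead of the full sort shown here.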