{"title": "Multiple Instance Learning on Structured Data", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 153, "abstract": "Most existing Multiple-Instance Learning (MIL) algorithms assume data instances and/or data bags are independently and identically distributed. But there often exists rich additional dependency/structure information between instances/bags within many applications of MIL. Ignoring this structure information limits the performance of existing MIL algorithms. This paper explores the research problem as multiple instance learning on structured data (MILSD) and formulates a novel framework that considers additional structure information. In particular, an effective and efficient optimization algorithm has been proposed to solve the original non-convex optimization problem by using a combination of Concave-Convex Constraint Programming (CCCP) method and an adapted Cutting Plane method, which deals with two sets of constraints caused by learning on instances within individual bags and learning on structured data. Our method has the nice convergence property, with specified precision on each set of constraints. Experimental results on three different applications, i.e., webpage classification, market targeting, and protein fold identification, clearly demonstrate the advantages of the proposed method over state-of-the-art methods.", "full_text": "Multiple Instance Learning on Structured Data\n\n1Dan Zhang, 2Yan Liu, 1Luo Si, 3Jian Zhang, 4Richard D. Lawrence\n\n1. Computer Science Department, Purdue University, West Lafayette, IN 47906\n\n4. Machine Learning Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598\n1{zhang168, lsi}@cs.purdue.edu, 2yanliu.cs@usc.edu, 3jian.zhang@gmail.com , 4ricklawr@us.ibm.com\n\n2. Computer Science Department, University of Southern California, Los Angeles, CA 90089\n\n3. 
Statistics Department, Purdue University, West Lafayette, IN 47906\n\nAbstract\n\nMost existing Multiple-Instance Learning (MIL) algorithms assume that data instances and/or data bags are independently and identically distributed. But in many applications of MIL there exists rich additional dependency/structure information between instances/bags. Ignoring this structure information limits the performance of existing MIL algorithms. This paper explores this research problem, multiple instance learning on structured data (MILSD), and formulates a novel framework that considers the additional structure information. In particular, an effective and efficient optimization algorithm is proposed to solve the original non-convex optimization problem by combining the Constrained Concave-Convex Procedure (CCCP) with an adapted Cutting Plane method, which deals with the two sets of constraints caused by learning on instances within individual bags and learning on structured data. Our method has a desirable convergence property, with a specified precision on each set of constraints. Experimental results on three different applications, i.e., webpage classification, market targeting, and protein fold identification, clearly demonstrate the advantages of the proposed method over state-of-the-art methods.\n\n1 Introduction\n\nMultiple Instance Learning (MIL) is a variation of the classical learning methods for problems with incomplete knowledge about the instances (or examples) [4]. In a MIL problem, labels are assigned to bags, i.e., sets of instances, rather than to individual instances [1, 4, 5, 13]. MIL has been widely employed in areas such as text mining [1], drug design [4], and localized content-based image retrieval (LCBIR) [13].\nOne major assumption of most existing MIL methods is that instances (and bags) are independently and identically distributed. 
But in many applications, dependencies between instances/bags naturally exist and, if incorporated into models, they can potentially improve the prediction performance significantly. For example, in business analytics, big corporations often analyze the websites of different companies to look for potential partnerships. Since not all of the webpages in a website are useful, we can treat the whole website of a specific company as a bag, and each webpage in this website as an instance. The hyperlinks between webpages provide important information on various relationships between these companies (e.g., supply-demand or joint-selling), and more partner companies can be identified if we follow the hyperlinks of existing partners. Another example is protein fold identification [15], whose goal is to predict protein folds with low conservation in primary sequence, e.g., the Thioredoxin-fold (Trx-fold). MIL algorithms have been applied to identify new Trx-fold proteins, where each protein sequence is considered as a bag, and some of its subsequences are instances. The relational information between protein sequences, such as shared organism locations or similar species origins, can be used to help the prediction tasks.\n\nSeveral recent methods have been proposed to model the dependencies between instances in each bag [11, 12, 21]. However, none of them takes into consideration the relational structure between bags or between instances across bags. Furthermore, most existing MIL research only uses content similarity for modeling structures between instances in each bag, and does not consider other types of relational structure information (e.g., hyperlinks) among instances or bags. 
While much research work [16, 22] on traditional single-instance learning has demonstrated that additional structure information (e.g., hyperlinks) can be very useful, we believe the same holds for MIL.\nGenerally speaking, we summarize three scenarios of structure information in MIL: (1) the relational structures are on the instance level. For example, in the business partner example, the hyperlinks between different webpages can be considered as the relational structure between instances (either in the same bag or across bags). (2) the structure information is available on the bag level. For example, in the protein fold identification task, we can use the phylogenetic tree to capture the evolutionary dependencies between protein sequences. (3) the structure information is available on both the instance level and the bag level. We refer to these three scenarios of learning problems collectively as multiple instance learning on structured data (MILSD).\nIn this paper, we propose a general framework that addresses all three structure learning scenarios for MIL. The model consists of a regularization term that confines the capacity of the classifier, a term that penalizes the difference between the predicted labels of the bags and their true labels, and a graph regularization term based on the structure information. The corresponding optimization problem is non-convex, but we show that it can be expressed as the difference between two convex functions. Then, we employ an iterative method, the Constrained Concave-Convex Procedure (CCCP) [14, 19], to solve this problem. To make the proposed method scalable to large datasets, the Cutting Plane method [8] is adapted to solve the subproblems derived from each CCCP iteration. 
The novelty of the proposed variant of the Cutting Plane method lies in modeling dual sets of constraints, i.e., one from modeling instances in individual bags and the other from the structure information, and in its ability to control the precisions (i.e., ε1 and ε2 in Table 1) on the different sets of constraints separately. The reason why we need to control the precisions separately is that the different sets of constraints are normally derived from different sources and have different forms, so their characteristics, as well as the required optimization precisions, are very likely to be diverse. Furthermore, we prove an upper bound on the convergence rate of the proposed optimization method, which is a significant result given that our optimization scheme for dual constraint sets can also be applied to many other learning problems. Experiments on three applications demonstrate the advantages of the proposed research.\n\n2 Methodology\n\n2.1 Problem Statement and Notation\n\nSuppose we are given a set of n labeled bags {(Bi, Yi), i = 1, 2, ..., n}, u unlabeled bags {Bi, i = n + 1, n + 2, ..., n + u}, and a directed or undirected graph G = (V, E) that depicts the structure between either bags or instances. Here, the instances in the bag Bi are denoted as Bi1, Bi2, ..., Bini ∈ X, where ni is the total number of instances in this bag, and Yi ∈ {−1, 1}. Each node v ∈ V corresponds to either a bag or an instance in either the labeled set or the unlabeled set, and the j-th edge ej = (p, q) ∈ E represents a link from node p to node q. The task is to learn a classifier w1 based on the labeled bags, the unlabeled bags, and the predefined structure graph, so that the unlabeled bags can be correctly classified. The soft label for an instance x is estimated by f(x) = wT x. 
The soft label of the bag Bi can be modeled as f(Bi) = max_{j∈Bi} wT Bij; if f(Bi) > 0, the bag is labeled as positive, and otherwise negative.\n\n2.2 Formulation\n\nOur motivation is that labeled bags should be correctly classified, and the soft labels of the bags or instances defined on the graph G should be as smooth as possible. Specifically, a pair of nodes linked by an edge tend to possess the same label, and therefore the nodes lying on a densely linked subgraph are likely to have the same labels [20]. The general formulation of MILSD is:\n\nmin_w Hr(w) + Hd(w) + HG(w),\n\nwhere Hr(w) is a regularization term based on w that depicts the capacity of the classifier; one possible option, which is also the one used in this paper, is ||w||^2. Hd(w) penalizes the difference between the estimated bag labels and the given labels. In this paper, without loss of generality, the hinge loss is used [3]. So, given a classifier w, Hd(w) is calculated as (C/n) sum_{i=1}^n max{0, 1 − max_{j∈Bi} Yi wT Bij}, where C is the trade-off parameter. HG(w) is a graph regularization term based on the given graph G that enforces smoothness on the soft labels of the nodes in the graph; it can be defined as (µ/|E|) sum_{(p,q)∈E} w(p,q) | f(vp)/sqrt(d(p)) − f(vq)/sqrt(d(q)) |, where vp and vq are two nodes in the graph, w(p,q) is a weight function that measures the weight on the edge (p,q), d(p) and d(q) are the outgoing degrees of the nodes vp and vq respectively [20], and |E| is the number of edges in graph G.\n\n1Without loss of generality, in this paper we only consider linear classifiers. Here, the bias of the classifier is absorbed by the feature vectors. The kernel version [3] of the proposed method can be easily derived. 
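To make the objective concrete, the following is a minimal NumPy sketch that evaluates the three terms on toy data. The bag/edge containers, the unit edge weights w(p,q) = 1 (the canonical weight function), and the degree dictionary are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def bag_score(w, bag):
    # soft label of a bag: max instance score, f(B_i) = max_j w^T B_ij
    return max(w @ x for x in bag)

def bmilsd_objective(w, bags, labels, edges, degrees, C=1.0, mu=0.1):
    """Evaluate the B-MILSD objective on toy data: regularizer
    + bag-level hinge loss + graph smoothness penalty."""
    n = len(labels)
    reg = 0.5 * w @ w
    # hinge loss on labeled bags: max{0, 1 - Y_i * max_j w^T B_ij}
    hinge = sum(max(0.0, 1.0 - y * bag_score(w, b))
                for b, y in zip(bags[:n], labels)) * C / n
    # graph term: |f(B_p)/sqrt(d(p)) - f(B_q)/sqrt(d(q))|, unit edge weights
    graph = sum(abs(bag_score(w, bags[p]) / np.sqrt(degrees[p])
                    - bag_score(w, bags[q]) / np.sqrt(degrees[q]))
                for p, q in edges) * mu / max(len(edges), 1)
    return reg + hinge + graph

# toy example: two 2-D bags joined by one edge
bags = [[np.array([1.0, 0.0]), np.array([0.5, 0.5])],
        [np.array([-1.0, 0.0])]]
labels = [1, -1]
edges = [(0, 1)]
degrees = {0: 1, 1: 1}
w = np.array([1.0, 0.0])
print(round(bmilsd_objective(w, bags, labels, edges, degrees), 3))
```

Here both labeled bags are classified with margin exactly 1, so only the regularizer and the graph term contribute.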
Depending on which of the three scenarios the graph is defined on, we name the formulation where the graph is defined on instances I-MILSD, the formulation where the graph is defined on bags B-MILSD, and the formulation where the graph is defined on both bags and instances BI-MILSD. In particular:\n\n1. For I-MILSD, HG(w) can be defined as (µ/|E|) sum_{(p,q)∈E} w(p,q) | wT xp/sqrt(d(p)) − wT xq/sqrt(d(q)) |, where xp and xq are two instances.\n\n2. For B-MILSD, HG(w) is defined as (µ/|E|) sum_{(p,q)∈E} w(p,q) | max_{j∈Bp} wT Bpj/sqrt(d(p)) − max_{j∈Bq} wT Bqj/sqrt(d(q)) |. The B-MILSD problem can then be formulated as follows:\n\nmin_w (1/2)||w||^2 + (C/n) sum_{i=1}^n ξi + (µ/|E|) sum_{(p,q)∈E} w(p,q) | max_{j∈Bp} wT Bpj/sqrt(d(p)) − max_{j∈Bq} wT Bqj/sqrt(d(q)) |   (1)\n\ns.t. ∀i ∈ {1, 2, ..., n}: Yi max_{j∈Bi} wT Bij ≥ 1 − ξi,\n\nwhere ξi is the hinge loss and µ is the trade-off parameter for the graph term. The formulation proposed in [1] is a special case of the proposed formulation, with µ equal to zero.\n\n3. 
The definition of HG(w) in BI-MILSD can be considered as a combination of the previous two formulations.\n\nIn the following sections, our focus will be on the more challenging problem, B-MILSD; I-MILSD and BI-MILSD can be solved in a similar way, since HG(w) in I-MILSD is convex and the formulation of BI-MILSD can be considered as a combination of B-MILSD and I-MILSD.\n\n2.3 Optimization Procedure with CCCP and Multi-Constraint Cutting Plane\n\nThe formulation in problem (1) combines both the goodness-of-fit for labeled bags and the structure information embedded in the graph. However, since both HG(w) and the constraints in problem (1) are non-convex, the global optimal solution of this problem cannot be attained. To solve this problem, the constrained concave-convex procedure (CCCP) is used; it is an optimization method that deals with concave-convex objective functions with concave-convex constraints [14]. In this paper, without loss of generality, we assume w(p,q) to be a canonical weight function. To employ CCCP, first of all, for each edge (p,q), a non-negative loss variable ζ(p,q) is introduced. Then, problem (1) can be solved iteratively. In particular, given an initial point w(0), CCCP iteratively computes w(t+1) from w(t)2 by replacing max_{j∈Bi} wT Bij with its first-order Taylor expansion at w(t), and solving the resulting quadratic programming problem as follows, until convergence (here u(t)_i = argmax_{j∈{1,...,ni}} (w(t))T Bij):\n\nmin_{w, ξi≥0, ζ(p,q)≥0} (1/2)||w||^2 + (C/n) sum_{i=1}^n ξi + (µ/|E|) sum_{(p,q)∈E} ζ(p,q)   (2)\n\ns.t. ∀i ∈ {1, 2, ..., n}: Yi wT B_{i u(t)_i} ≥ 1 − ξi\n∀(p,q) ∈ E, ∀k ∈ {1, 2, ..., np}: wT Bpk/sqrt(d(p)) − wT B_{q u(t)_q}/sqrt(d(q)) ≤ ζ(p,q)\n∀(p,q) ∈ E, ∀k ∈ {np + 1, ..., np + nq}: wT B_{q(k−np)}/sqrt(d(q)) − wT B_{p u(t)_p}/sqrt(d(p)) ≤ ζ(p,q).\n\n2The superscript t is used to denote that the result is obtained from the t-th CCCP iteration. For example, w(t) is the optimized classifier from the t-th CCCP iteration step.\n\nProblem (2) can be directly solved as a standard quadratic programming problem [2]. However, in many real-world applications, the number of labeled bags as well as the number of links between bags is huge. In this case, we need a way to solve this problem efficiently. Instead of solving this optimization problem directly, we employ the Cutting Plane method [8], which has recently shown its effectiveness and efficiency on similar tasks [6]. But different from the method employed in [6], in this paper we need to deal with two sets of constraints, rather than just one constraint set, with separately specified precisions; a new way to adapt the Cutting Plane method is devised here. Problem (2) is equivalently transformed into the following form:\n\nmin_{w, ξ≥0, ζ≥0} (1/2)||w||^2 + Cξ + µζ   (3)\n\ns.t. ∀c ∈ {0,1}^n: (1/n) wT sum_{i=1}^n ci Yi B_{i u(t)_i} ≥ (1/n) sum_{i=1}^n ci − ξ\n∀τ ∈ {0,1}^{|E|×(np+nq)} with sum_{k=1}^{np+nq} τjk ≤ 1 for every edge ej ∈ E:\n(1/|E|) wT sum_{j=1}^{|E|} [ sum_{k=1}^{np} τjk (Bpk/sqrt(d(p)) − B_{q u(t)_q}/sqrt(d(q))) + sum_{k=1}^{nq} τ_{j(k+np)} (Bqk/sqrt(d(q)) − B_{p u(t)_p}/sqrt(d(p))) ] ≤ ζ,\n\nwhere ej = (p,q), and τ is a matrix with |E| rows and a varying number of columns: the j-th row of τ has np + nq columns (possible constraints). For each edge, at most one constraint can be activated for each feasible τ.\n\nTheorem 1: Any solution w* of problem (2) is also a solution to problem (3) (and vice versa), with ξ* = (1/n) sum_{i=1}^n ξ*_i and ζ* = (1/|E|) sum_{(p,q)∈E} ζ*_(p,q).\nProof: Please refer to the supplemental materials on the authors' homepage.\n\nThe benefit of making this transformation is that, as we shall see later, at most two constraints are added during each Cutting Plane iteration, and therefore the final solution is extremely sparse, with the number of non-zero dual variables independent of the number of training examples. The problem now turns to how to solve problem (3) efficiently; it is convex, but contains two sets with an exponential number of constraints, due to the large number of feasible c and τ. We present a novel adaptation of the Cutting Plane method that can handle the two sets of constraints simultaneously. 
More specifically, the main motivation of the method proposed here is to find two small subsets of constraints, Ω1 and Ω2, from the constraint sets in Eq. (3). With these two sets of selected constraints, the solution of the corresponding relaxed problem satisfies all the constraints of problem (3) up to two precisions ε1 and ε2, i.e., ∀c ∈ {0,1}^n:\n\n(1/n) wT sum_{i=1}^n ci Yi B_{i u(t)_i} ≥ (1/n) sum_{i=1}^n ci − (ξ + ε1),\n\nand ∀τ ∈ {0,1}^{|E|×(np+nq)} with sum_k τjk ≤ 1 for every edge ej ∈ E:\n\n(1/|E|) wT sum_{j=1}^{|E|} [ sum_{k=1}^{np} τjk (Bpk/sqrt(d(p)) − B_{q u(t)_q}/sqrt(d(q))) + sum_{k=1}^{nq} τ_{j(k+np)} (Bqk/sqrt(d(q)) − B_{p u(t)_p}/sqrt(d(p))) ] ≤ (ζ + ε2).\n\nThis indicates that the two remaining sets of constraints (those not added to Ω1 and Ω2) are not violated beyond the two precisions ε1 and ε2 respectively, and therefore they do not need to be added to Ω1 and Ω2 explicitly. (The subscript * above denotes the optimal value of the corresponding variable.)\n\nThe proposed method constructs Ω^{ts}_1 and Ω^{ts}_2 iteratively4, starting from two empty sets Ω^{t0}_1 and Ω^{t0}_2. During the s-th Cutting Plane iteration, based on w^{ts}, the most violated constraint for Ω^{ts}_1 can be computed as:\n\nc^{ts}_i = 1 if Yi (w^{ts})T B_{i u(t)_i} < 1, and 0 otherwise,   (4)\n\nand the most violated constraint for Ω^{ts}_2 can be computed as:\n\nτ^{ts}_jk = 1 if k = k* and max{ max_{k∈{1,...,np}} (w^{ts})T (Bpk/sqrt(d(p)) − B_{q u(t)_q}/sqrt(d(q))), max_{k∈{np+1,...,np+nq}} (w^{ts})T (B_{q(k−np)}/sqrt(d(q)) − B_{p u(t)_p}/sqrt(d(p))) } > 0, and 0 otherwise,   (5)\n\nwhere k* = argmax_k of the two bracketed maxima above.\n\nAfter calculating these two most violated constraints, the two stopping conditions can be computed:\n\nH^{ts}_1 = ( (1/n) (w^{ts})T sum_{i=1}^n c^{ts}_i Yi B_{i u(t)_i} ≥ (1/n) sum_{i=1}^n c^{ts}_i − (ξ^{ts} + ε1) ),   (6)\n\nH^{ts}_2 = ( (1/|E|) (w^{ts})T sum_{j=1}^{|E|} [ sum_{k=1}^{np} τ^{ts}_jk (Bpk/sqrt(d(p)) − B_{q u(t)_q}/sqrt(d(q))) + sum_{k=np+1}^{np+nq} τ^{ts}_jk (B_{q(k−np)}/sqrt(d(q)) − B_{p u(t)_p}/sqrt(d(p))) ] ≤ (ζ^{ts} + ε2) ).   (7)\n\nThe Cutting Plane iteration terminates if both conditions H^{ts}_1 and H^{ts}_2 are true. Otherwise, c^{ts} is added to Ω^{ts}_1 if H^{ts}_1 is false, and τ^{ts} is added to Ω^{ts}_2 if H^{ts}_2 is false. Then, the new optimization problem turns to:\n\nmin_{w, ξ≥0, ζ≥0} (1/2)||w||^2 + Cξ + µζ   (8)\n\ns.t. ∀c ∈ Ω^{ts}_1: (1/n) wT sum_{i=1}^n ci Yi B_{i u(t)_i} ≥ (1/n) sum_{i=1}^n ci − ξ\n∀τ ∈ Ω^{ts}_2: (1/|E|) wT sum_{j=1}^{|E|} [ sum_{k=1}^{np} τjk (Bpk/sqrt(d(p)) − B_{q u(t)_q}/sqrt(d(q))) + sum_{k=1}^{nq} τ_{j(k+np)} (Bqk/sqrt(d(q)) − B_{p u(t)_p}/sqrt(d(p))) ] ≤ ζ.\n\nThis optimization problem can be solved efficiently through its dual form [2].\n\n4Here, ts denotes the s-th Cutting Plane iteration for solving the problem from the t-th CCCP iteration.\n\n2.4 Analysis and Discussions\n\nThe whole algorithm of B-MILSD is described in Table 1. Here, J^t = (1/2)||w(t)||^2 + Cξ(t) + µζ(t).\n\nTable 1: The description of B-MILSD\nInput: 1. Labeled bags {(Bi, Yi), i = 1, 2, ..., n}; 2. Unlabeled bags {Bi, i = n + 1, ..., n + u}; 3. A graph G which represents the relationship between these bags (the graph can be built solely on the labeled bags, or on a union of both the labeled and the unlabeled bags); 4. Parameters: loss weights C and µ, CCCP solution precision δ, Cutting Plane solution precision ε1 for constraint set 1, and Cutting Plane solution precision ε2 for constraint set 2.\nOutput: The classifier w.\nCCCP Iterations:\n1. Initialize w(0), t = 0, ∆J = 10^3, J^{−1} = 10^3.\n2. while ∆J/J^{t−1} > δ do\n3.   Derive problem (2). Set the constraint sets Ω^{t0}_1 = ∅ and Ω^{t0}_2 = ∅, and s = −1.\n     Cutting Plane Iterations:\n4.   repeat\n5.     s = s + 1.\n6.     Get (w^{ts}, ξ^{ts}, ζ^{ts}) by solving (8).\n7.     Compute the most violated constraints c^{ts} and τ^{ts} by Eq. (4) and Eq. (5).\n8.     Compute the stopping criteria H^{ts}_1 and H^{ts}_2 by Eq. (6) and Eq. (7).\n9.     Update Ω^{t(s+1)}_1 = Ω^{ts}_1 ∪ {c^{ts}} if H^{ts}_1 is false, otherwise Ω^{t(s+1)}_1 = Ω^{ts}_1; update Ω^{t(s+1)}_2 = Ω^{ts}_2 ∪ {τ^{ts}} if H^{ts}_2 is false, otherwise Ω^{t(s+1)}_2 = Ω^{ts}_2.\n10.  until H^{ts}_1 ∧ H^{ts}_2\n11.  t = t + 1.\n12.  w(t) = w^{(t−1)s}, ξ(t) = ξ^{(t−1)s}, and ζ(t) = ζ^{(t−1)s}.\n13.  ∆J = J^{t−1} − J^t.\n14. end while\n\nThe convergence of the proposed method is guaranteed. Given an initial w, the outer CCCP iteration has already been proved to converge to a local optimal solution [14]. The final solution can be improved by running this algorithm several times and picking the solution with the smallest J^{(t)} value. We show that the Cutting Plane iterations with two different sets of constraints converge in a fixed number of steps through the following two theorems.\n\nTheorem 2: For each Cutting Plane iteration described in Table 1, the objective function of (8) will be increased by at least κ = min{ Cε1/2, ε1^2/(8R^2), µε2/2, ε2^2/(16R^2), (ε1 + ε2)^2/((24 + 16√2)R^2) }, where R^2 = max_{i,j} ||Bij||^2.\nSketch of Proof: The detailed proof of this theorem can be found in the supplemental materials; here we only briefly outline it. In each Cutting Plane iteration described in Table 1, there are three possibilities for updating the constraints. In each case, we find a feasible direction for increasing the objective function, and a line search method is then used to find the optimal increment, which serves as the lower bound for each update. (1) H^{ts}_1 is false and H^{ts}_2 is true: c^{ts} is added to Ω^{ts}_1, and the minimal improvement of the objective function of problem (8) after this constraint is added is min{ Cε1/2, ε1^2/(8R^2) }. (2) H^{ts}_1 is true and H^{ts}_2 is false: Ω^{ts}_2 is updated by appending τ^{ts}, and the minimal increment is min{ µε2/2, ε2^2/(16R^2) }. (3) Both H^{ts}_1 and H^{ts}_2 are false: the most violated constraints are added to both Ω^{ts}_1 and Ω^{ts}_2, and we proved that the minimal increment is min{ min{C, µ}(ε1 + ε2)/2, (ε1 + ε2)^2/((24 + 16√2)R^2) }. By integrating these three cases, the minimal increment per iteration is κ as stated, since min{ Cε1/2, µε2/2 } ≤ min{C, µ}(ε1 + ε2)/2.\n\nTheorem 3: The proposed Cutting Plane iteration terminates after at most C/κ steps, where κ is as defined in Theorem 2.\nProof: w = 0, ξ = 1, ζ = 0 is a feasible solution of problem (3). Therefore, the objective function of (3) is upper bounded by C and lower bounded by 0. Given the conclusion of Theorem 2, it is clear that the Cutting Plane iteration will terminate within C/κ steps.\n\nThe Cutting Plane method has already been employed in several previous works. In [6, 7, 17], the authors adapted the Cutting Plane method to accelerate structural SVM related algorithms. 
However, these works do not explicitly consider the case where several different sets of constraints with specified precisions are involved. The novelty of the proposed method lies in its ability to control these optimization precisions separately, while still enjoying the sparseness of the final solution with respect to the number of dual variables, brought by the slack-variable transformation.\nIn [18], the authors solved the problem of structural SVM with latent variables by employing CCCP and the bundle method [9]. The MIL problem itself can be considered as a special case of a latent variable problem. But the major limitation of [18] is that it cannot incorporate the relational information into the formulation, and therefore cannot be used here. Furthermore, [18] does not consider dual sets of constraints in optimization, which is less appropriate than the proposed optimization method.\n\n3 Experiments\n\n3.1 Webpage Classification\n\nIn webpage classification, each webpage can be considered as a bag, and its corresponding passages represent its instances [1]. The hyperlinks between different webpages are treated as the additional relational structure/links between different bags. The WebKB5 dataset is used in the experiments.\n\n5http://www.cs.cmu.edu/~webkb/\n\n[Figure 1: Classification and CPU Time Comparisons, panels (a)-(h).]\n\nThere are in total 8280 webpages in this dataset. The webpages without any incoming or outgoing links are deleted, leaving 6883 webpages. The three most frequent categories, i.e., student, course, and faculty, are used for classification, where each sub-dataset contains all of the webpages/bags from one of the three categories, together with the same number of negative bags randomly sampled from the remaining six categories in WebKB. 
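The bag construction just described can be sketched in a few lines; the fixed passage length and the page/link containers below are illustrative assumptions, not the paper's preprocessing code:

```python
# hypothetical helper: split a page into fixed-length passages (the instances)
def passages(text, size=50):
    words = text.split()
    return [' '.join(words[i:i + size])
            for i in range(0, len(words), size)] or ['']

# each webpage is a bag of passages; hyperlinks give the bag-level edges
pages = {'a.html': 'introductory text about a course ...',
         'b.html': 'faculty homepage text ...'}
links = [('a.html', 'b.html')]
bags = {url: passages(txt) for url, txt in pages.items()}
edges = [(u, v) for u, v in links if u in bags and v in bags]
print(len(bags), len(edges))
```

Each passage would then be mapped to a feature vector before learning.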
The hyperlinks between these webpages are used as the structure/link information. The tf-idf (normalized term frequency and log inverse document frequency) [10] features are extracted for each passage, and stop words are removed. We use the Porter stemmer.\nIn the proposed method, C and µ are set by 5-fold cross validation over the grids 2^[−5:5] and [0, 0.01, 0.1, 1] respectively on the training set. To show the effects of the structure information on the performance of MIL methods, we compare the proposed method with the instance-based multiple instance support vector machine (I-miSVM) as well as the bag-based multiple instance support vector machine (B-miSVM) [1]. The formulations of these two methods can be considered as a special case of the proposed method with µ equal to zero, and they are two different heuristic iterative ways of implementing the same formulation. Their parameters are also set by 5-fold cross validation. Link-Content Matrix Factorization (LC-MF) is a non-MIL matrix factorization method [22] which has been shown to outperform several alternatives, including SVM. We conduct experiments with LC-MF based on the single instances that we extract for the same set of examples, and the corresponding links. As in [22], the number of latent factors is set to 50. After computing the latent factors, a linear SVM is trained on the training set, with the hinge loss parameter C determined by 5-fold cross validation. For each experiment, a fixed ratio of bags is chosen as the training set, while the remaining examples are treated as the testing set. The average results over 20 independent runs are reported for the training ratios [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5].\nThe classification results are reported in Fig. 1(a)(b)(c), and the CPU time comparison results are shown in Fig. 1(e)(f)(g). In Table 2, we further report the performances when the training ratio equals 0.2. 
From these experimental results, it is clear that the proposed method outperforms the comparison methods in accuracy, and its CPU time is comparatively low.\n\n3.2 Market Targeting\n\nMarket targeting is a popular topic for big corporations. Its basic objective is to automatically identify potential partners. One feasible market targeting strategy is to analyze the websites of potential partners. But usually not all of the webpages are useful for partner identification, so it is better to formulate the task as a MIL problem, in which each website is considered as a bag and its associated webpages are considered as instances. Two related companies may be connected through hyperlinks in some of their webpages.\nWe obtained a dataset (ASCOT) from a big international corporation. In ASCOT, the webpages of around 225 companies are crawled. 25 of the companies/bags are labeled as positive, since they are partners of this corporation, while the remaining 200 companies/bags are treated as negative ones. 
For each company, webpages with fewer than 100 unique words are removed, and at most 50 webpages with the largest numbers of unique words6 are selected as instances. The hyperlinks between webpages of different companies are treated as the structure information. For each experiment, we fix the training ratio of positive and negative bags, while the remaining bags are considered as the testing set. Averaged results over 20 independent runs are reported for the training ratios [0.1, 0.2, 0.3, 0.4, 0.5]. The parameters of the different methods are tuned in the same way as on WebKB, except that for the ratios 0.1 and 0.2 we use 3-fold cross validation due to the lack of positive bags. For LC-MF, experiments are conducted on single instances obtained by averaging the instances in each bag. Because of the extremely imbalanced nature of this dataset, the Area Under the ROC Curve (AUC) is used as the evaluation criterion.
The corresponding results are reported in Fig. 1(d) and Fig. 1(h). In Table 2, we report the performance when the training ratio equals 0.2. On this dataset, B-MILSD performs much better than the comparison methods, especially when the ratio of training examples is low. This is because the hyperlink information helps substantially when content information is scarce in MIL, and the MIL setting is useful for eliminating useless instances, especially when the supervised information is scarce.

3.3 Protein Fold Identification

In protein fold identification [15], the low conservation of primary sequence in protein superfamilies such as Thioredoxin-fold (Trx-fold) makes conventional modeling methods, such as Hidden Markov Models, difficult to use. MIL can be used to identify new Trx-fold proteins naturally, in which each protein sequence is considered as a bag and some of its subsequences are considered as instances. Here, we use a benchmark protein dataset7.
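The AUC criterion used for the imbalanced ASCOT data above equals the probability that a randomly drawn positive bag is scored higher than a randomly drawn negative one. A minimal sketch of this rank-based computation (our own illustration, not the paper's evaluation code):

```python
def auc(scores, labels):
    """Wilcoxon-Mann-Whitney estimate of the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    # count positive-over-negative wins, with ties worth half a win
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

Unlike accuracy, this statistic is unaffected by the 25:200 class ratio: a classifier that labels every bag negative scores 0.5, not a misleading 89%.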
In each protein's primary sequence, the primary sequence motif (typically CxxC) is first located. Then, a window of size 214 around it is extracted and aligned. These windows are then mapped to an 8-dimensional feature space. The similarities between different proteins are estimated using ClustalW8; if the score between a pair of proteins exceeds 25, we consider that a link exists between them.
Following the experimental setting in [15], we conduct 5-fold cross validation to test the performance. The averaged classification accuracies and CPU running times are reported in Table 2. These comparisons show that on this dataset the proposed method is both efficient and effective: its CPU running time is almost 10–100 times faster than that of the comparison methods.

Table 2: Performance Comparisons

            Measure          B-MILSD   LC-MF    I-miSVM   B-miSVM
  Course    Accuracy (%)      97.2      95.9     94.5      94.3
            Time (seconds)    49.1     648.5     95.6      23.9
  Faculty   Accuracy (%)      95.2      95.3     93.4      93.3
            Time (seconds)    73.9     360.6    591.6      29.7
  Student   Accuracy (%)      92.7      91.7     89.1      89.5
            Time (seconds)   245.7     526.3    540.4      41.2
  ASCOT     AUC                0.350     0.248    0.264     0.230
            Time (seconds)    76.0      56.4     20.9      20.7
  Protein   Accuracy (%)      96.2      95.2     92.2      82.7
            Time (seconds)     1.7      16.9    160.3      73.8

4 Conclusions

This paper presents a novel machine learning problem, multiple instance learning on structured data (MILSD), for incorporating additional structure information into multiple instance learning. In particular, a general framework of MILSD is proposed for dealing with additional structure information in different scenarios. An effective and efficient optimization method is proposed for MILSD by combining the CCCP method and a new multi-constraint Cutting Plane method.
Some theoretical results are proved to justify the methodology we employ to handle multiple sets of constraints with the Cutting Plane method. The experimental results on three different applications clearly demonstrate the advantages of the proposed method. For future work, we plan to adapt the current framework to multi-view multiple instance learning on structured data.
Acknowledgement: The work of Dan Zhang and Luo Si was partially supported by NSF research grants IIS-0746830, CNS-1012208, and IIS-1017837, and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. The work of Yan Liu was partially sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) under the Anomaly Detection at Multiple Scales (ADAMS) program, Agreement Number W911NF-11-C-0200. The authors would also like to express their sincere thanks to Prof. S.V.N. Vishwanathan and the anonymous reviewers for their constructive suggestions.

6 As before, we use the Porter stemmer and remove stop words.
7 http://cse.unl.edu/~qtao/datasets/mil dataset Trx protein.html
8 http://www.ebi.ac.uk/Tools/msa/clustalw2/

References
[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2003.
[2] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[4] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 1998.
[5] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, 2002.
[6] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[7] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs.
Machine Learning, 2009.
[8] J. E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 1960.
[9] K. C. Kiwiel. Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46:105–122, 1990.
[10] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[11] A. McGovern and D. Jensen. Identifying predictive structures in relational data using multiple instance learning. In ICML, 2003.
[12] G.-J. Qi, X.-S. Hua, Y. Rui, T. Mei, J. Tang, and H.-J. Zhang. Concurrent multiple instance learning for image categorization. In CVPR, 2007.
[13] R. Rahmani and S. A. Goldman. MISSL: Multiple-instance semi-supervised learning. In ICML, 2006.
[14] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In AISTATS, 2005.
[15] Q. Tao, S. D. Scott, N. V. Vinodchandran, and T. T. Osugi. SVM-based generalized multiple-instance learning via approximate box counting. In ICML, 2004.
[16] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2006.
[18] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.
[19] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 2003.
[20] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML, 2005.
[21] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li. Multi-instance learning by treating instances as non-i.i.d. samples. In ICML, 2009.
[22] S.-H. Zhu, K. Yu, Y. Chi, and Y.-H. Gong. Combining content and link classification using matrix factorization.
In SIGIR, 2007.