Hierarchical Classification and Its Application in University Search
Web search engines have been adopted by most universities for searching webpages in their own domains. Basically, a user sends keywords to the search engine, and the search engine returns a flat ranked list of webpages. However, in university search, user queries are usually related to topics, and simple keyword queries are often insufficient to express them. On the other hand, most e-commerce sites allow users to browse and search products in various hierarchies. It would be ideal if hierarchical browsing and keyword search could be seamlessly combined for university search engines. The main difficulty is to automatically classify and rank a massive number of webpages into the topic hierarchies for universities.
In this thesis, we use machine learning and data mining techniques to build a novel hybrid search engine with integrated hierarchies for universities, called SEEU (Search Engine with hiErarchy for Universities).
Firstly, we study the problem of effective hierarchical webpage classification. We develop a parallel webpage classification system based on Support Vector Machines. With extensive experiments on the well-known ODP (Open Directory Project) dataset, we empirically demonstrate that our hierarchical classification system is very effective and significantly outperforms the traditional flat classification approaches.
Secondly, we study the problem of integrating hierarchical classification into the ranking system of keyword-based search engines. We propose a novel ranking framework, called ERIC (Enhanced Ranking by hIerarchical Classification), for search engines with hierarchies. Experimental results on four large-scale TREC (Text REtrieval Conference) web search datasets show that our ranking system with hierarchical classification significantly outperforms the traditional flat keyword-based search methods.
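To make the idea of ranking with hierarchical classification concrete, the following is a minimal sketch and not the ERIC framework itself, whose scoring function is not given here. It assumes that each webpage already has a keyword relevance score in [0, 1] and classifier probabilities for categories in the topic hierarchy, and it combines the two by linear interpolation. The names `Page`, `category_score`, `rank`, and the weight `alpha` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    keyword_score: float   # assumed: relevance from a keyword ranker, in [0, 1]
    category_probs: dict   # assumed: category path -> classifier probability


def category_score(page, selected):
    """Highest probability among the selected category and its descendants."""
    prefix = selected + "/"
    return max(
        (prob for cat, prob in page.category_probs.items()
         if cat == selected or cat.startswith(prefix)),
        default=0.0,
    )


def rank(pages, selected, alpha=0.5):
    """Sort pages by an interpolation of keyword and category scores.

    alpha = 1.0 reduces to flat keyword ranking; smaller values give
    more weight to the category the user is browsing. The linear form
    is an assumption made for this sketch.
    """
    def combined(page):
        return (alpha * page.keyword_score
                + (1 - alpha) * category_score(page, selected))

    return sorted(pages, key=combined, reverse=True)


if __name__ == "__main__":
    pages = [
        Page("a", 0.9, {"Science/Biology": 0.8}),
        Page("b", 0.6, {"Engineering/Computer Science": 0.9}),
        Page("c", 0.7, {"Engineering": 0.3}),
    ]
    for page in rank(pages, "Engineering"):
        print(page.url)
```

In this toy example, page `a` has the highest keyword score but falls to the bottom of the list when the user browses the `Engineering` category, because the classifier assigns it to a different branch of the hierarchy.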
Thirdly, we propose a novel active learning framework to improve the performance of hierarchical classification, which is important for ranking webpages in hierarchies. From our experiments on the benchmark text datasets, we find that our active learning framework can achieve good classification performance while saving a considerable amount of labeling effort compared with the state-of-the-art active learning methods for hierarchical text classification.
Fourthly, based on the proposed classification and ranking methods, we present a novel hierarchical classification framework for mining academic topics from university webpages. We build an academic topic hierarchy based on the commonly accepted Wikipedia academic disciplines. Based on this hierarchy, we train a hierarchical classifier and apply it to mine academic topics. According to our comprehensive analysis, the academic topics mined by our method are reasonable and consistent with the real-world topic distribution in universities.
Finally, we combine all the proposed techniques and implement the SEEU search engine. According to two usability studies conducted in the ECE and the CS departments at our university, SEEU is favored by the majority of participants.
To conclude, the main contribution of this thesis is a novel search engine, called SEEU, for universities. We discuss the challenges toward building SEEU and propose effective machine learning and data mining methods to tackle them. With extensive experiments on well-known benchmark datasets and real-world university webpage datasets, we demonstrate that our system is very effective. In addition, two usability studies of SEEU in our university show that SEEU has great promise for university search.