Mining Uncertain Networked Data

Data from many applications often contain multiple sources of noise. For instance, data could be missing from a survey or online form in a non-random fashion, records may be labeled incorrectly, or it may be necessary to extract information from unstructured data (e.g. images or free form text). In all of these scenarios, it is necessary to utilize methods that can appropriately deal with these various and often unpredictable forms of uncertainties. These problems have prompted a rise in the popularity of probabilistic methods for data mining. Until recently, much of the research effort in statistics, machine learning and data mining has focused on problems in which data is assumed to be independent and identically distributed (iid). The recent explosion of data from the internet, especially social network and interaction data, has similar issues in terms of uncertainty. The main difference in these sources, however, is the complex structure of dependencies that is provided by the network (eg. if all of your friends enjoy comedies, you are more likely to enjoy them as well due to network homophily). Such dependencies add information that has been shown to significantly improve the accuracy of mining methods (eg. collective classification) at the cost of increased computational complexity. Developing methods that optimize the trade-off between these two factors is crucial for improving the accuracy and scalability of mining networked data sources.

The central goal for our current and future work is in developing new methods that can cope with both noisy data and the dependencies that exist within a network. Our approach is threefold: (i) develop methods to mine interesting patterns from noisy networked data, (ii) develop new techniques for learning from networked data that are able to handle the complexity of data uncertainty, and (iii) extend statistical models of networked data to infer additional information about the heterogeneous types of relationships that exist between nodes (eg. Trust). Understanding these aspects of the data introduces many more opportunities for new and different kinds of knowledge discovery in noisy, interdependent data.

This project is supported in the INARC APP in T1.

Participants:

  • Nick Larusso
  • Petko Bogdanov
  • Ambuj Singh
  • Collaboration with Sibel Adali (RPI)

Publications:

  • Bogdanov et al., 2010
  • Larusso and Singh, 2011
  • Bogdanov et al., (2), 2011

 

 

 

 

 

Back to top

Figure 3: TopicNets visualization of entire CTA Initial Program Plan (IPP). First section starts in yellow at top left, last section ends in purple at nearly full circle. Two Sections (== subtasks) have been selected (red labels).

Scalable, Human-Centric Information Network Systems

A central characteristic of information/social networks is that it facilitates rapid dissemination of information between social entities. We examined the problem of determination of information flow representatives, a small group of authoritative representatives to whom the dissemination of a piece of information leads to the maximum spread. Information flow is affected by a number of different structural factors such as the node degree, connectivity, intensity of information flow interaction and the global structural behavior of the underlying network. We proposed a stochastic information flow model, and used it to determine the authoritative representatives in the underlying social network. We designed an accurate RankedReplace algorithm, and then used a Bayes probabilistic model in order to approximate the effectiveness of this algorithm with the use of a fast algorithm. The proposed method was shown extremely effective and efficient in determining intuitively authoritative representatives in the network. (Aggarwal et al. 2010)

The identification of clusters, well-connected components in a network is useful in many applications from biological function prediction to social community detection. However, finding these clusters can be difficult as graph sizes increase. Most of the existing graph clustering algorithms scale poorly in terms of time or memory. An important insight is that many clustering applications need only the subset of most strongly connected clusters, and not all clusters in the entire graph. Marcopol and Singh proposed a new model, Top Graph Clusters (TopGC), which probabilistically searches large, edge-weighted, directed graphs for their most strongly connected clusters in linear time. The algorithm is inherently parallelizable, and is able to find variable size, overlapping clusters. When compared with three other state-of-the art clustering techniques, TopGC achieves running time speedups of up to 70% on large scale real world datasets. In addition, the clusters returned by TopGC are consistently found to be better both in calculated score and when compared on real world benchmarks. (Macropol and Singh 2010)

We have had great success furthering people's understanding of large heterogeneous information networks through development of an interactive tool for analyzing topic-based relations among large text collections. Through an external collaboration with Topic Modeling experts at UC Irvine, we have adapted our graph visualization framework, WiGis, to support visualization of output from an LDA algorithm, by mapping LDA output (Topics, Documents and probabilistic association data) onto various dimensions of an interactive graph. For example, color, node types and similarity based topic clustering. During this period, we have applied our novel approach to exploration of a diverse set of heterogeneous data sources, including a large corpus of awarded NSF grants, a collection of about ten thousand research papers from the California Institute for Telecommunications and Information Technology (CalIT2), the 2009 US Health bill, as well as the initial program plan of this Collaborative Technology Alliance (cf. Figure 3). During this period, we have also researched new ways to use interactive graph manipulation (e.g. selection, filtering, and expansion of result sets; grouping nodes and highlighting relationships by interactive dragging) as a control mechanism for complex data mining algorithms. (Gretarsson et al. 2010)

The above three projects are supported in the INARC APP in I2.

Participants:

  • Kathy Macropol
  • Ambuj Singh
  • Brynjar Gretarsson
  • Xifeng Yan
  • Arijit Khan
  • Collaboration with Charu Aggarwal (IBM)

Publications:

  • Macropol and Singh, 2010
  • Gretarsson et al., 2010
  • Aggarwal et al., 2011

 

 

 

 

 

 

 

 

Back to top

Information Network Visualization

We designed "SmallWorlds", a visual interactive graph-based interface that allows users to specify, refine and build item-preference profiles in information/social networks. The interface facilitates expressions of taste through simple graph interactions and these preferences are used to compute personalized, fully transparent item recommendations for a target user. Predictions are based on a collaborative analysis of preference data from a user's direct peer group on a network. We find that in addition to receiving transparent and accurate item recommendations, users also learn a wealth of information about the preferences of their peers through interaction with our visualization. Such information is not easily discoverable in traditional text based interface

This project is supported in the INARC APP in T2.

Participants:

  • Brynjar Gretarsson
  • John O'Donovan
  • Svetlin Bostandjiev
  • Christopher Hall
  • Tobias Hollerer
  • Collaboration with (PARC)

Publications:

  • Gretarsson et al., 2010

 

 

 

 

 

 

Back to top

Information Propagation Model and Graph Mining

Graph mining in large information networks is critical to a variety of applications. Unfortunately, frequent subgraphs are often ineffective to capture association existing in these applications, due to the complexity of isomorphism testing and the inelastic pattern definition. We introduce proximity pattern which is a significant departure from the traditional concept of frequent subgraphs. Defined as a set of labels that co-occur in neighborhoods, proximity pattern blurs the boundary between itemset and structure. It relaxes the rigid structure constraint of frequent subgraphs, while introducing connectivity to frequent itemsets. Therefore, it can benefit from both: efficient mining in itemsets and structure proximity from graphs. We applied information propagation model to transform a complex graph mining problem to a simplified probabilistic itemset mining problem, which can be solved efficiently by a modified FP-tree algorithm. Empirical results show that it not only finds interesting patterns that are ignored by the existing approaches, but also achieves high performance for finding proximity patterns in largescale graphs.

This project is supported in the INARC APP in I2.

Participants:

  • Xifeng Yan
  • Arijit Khan
  • Collaboration with Kun-Lung Wu (IBM)

Publications:

  • Khan et al., 2010

 

 

 

 

Back to top

Knowledge Discovery in Information Networks

Text analysis in composite information and collaborative networks where multiple expert groups work together to process IT trouble tickets: A unified generative topic/network model, Optimized Network Model (ONM), has been developed to characterize such process. It includes a text/network classification module to combine heterogeneous text and network information together for better expert classification and recommendation. It has potential military applications, when trouble problems are transferred among different intelligence agents for resolution.

This project is supported in the INARC APP in I3.

Participants:

  • Gengxin Miao
  • Xifeng Yan
  • Collaboration with Shu Tao, N. Anerousis (IBM)
  • Collaboration with Yi Chen (ASU)

Publications:

  • Miao et al., 2010

 

 

 

 

 

Back to top

Storage and Processing of Information Network Systems

Information networks are large, heterogeneous, and often contain a huge number of links. The linkage structure encodes rich structural information about the underlying topical behavior of the network. It is often dynamic and evolves rapidly over time. Much of the work in the literature has focused either on the problem of classification with purely text behavior, or on the problem of classification with purely linkage. Furthermore, the work in the literature is mostly designed for static networks. We investigate the problem of node classification in dynamic information networks with both text content and links. We propose using a random walk approach in conjunction with the content of the network in order to facilitate an effective classification process, which is more robust to variations in content and linkage structure. It will also be dynamic, and can be applied to networks that get constantly updated.

This project is supported in the INARC APP in I2.

Participants:

  • Nan Li
  • Collaboration with Charu Aggarwal (IBM)

Publications:

  • Aggarwal and Li, 2011

 

 

 

 

Back to top

Mobility in Composite Networks

This project is aimed at understanding mobility in composite networks. We are given a set of moving nodes in a 2-D plane. From time to time, nodes request for particular information objects with real-time requirements on the recency of information. At the very beginning, all information is stored in a central server (e.g., a satellite) and nodes can communicate with the server to access information. The server-node communication is expensive, though a node always has an option to communicate with the server at any time. Meanwhile, when a pair of nodes gets close to each other, they have an opportunity to share information in a much cheaper or more cost-effective way. In order to reduce the number of the costly server-node communications, it is important to utilize the mobility of nodes and maximize the benefit from the node-node communication. Our work is focusing on designing algorithms that optimize the cost of connection in a mobile network environment satisfying all the information needs of moving nodes.

This project is supported in the INARC APP in E2.

Participants:

  • Bo Zong
  • Xifeng Yan
  • Misael Mongiovi
  • Ambuj Singh
  • Collaboration with Kostas Psounis (USC)

 

 

 

 

 

Back to top

Mining Time Evolving Networks

Time evolving networks arise in multiple domains. They can characterize traffic variations in transportation networks, information flow in communication networks, variation of trade rates in a network of trading agents or phases of pathway switching in gene interaction networks. Regardless of the rich representative power of time-evolving network in modeling processes in all the above domains, they have received limited attention from the data mining and database communities..

Our model of a time-varying network is based on fixed network structure (edges and nodes) and a time-varying characteristic of the edges. This scenario is applicable to modeling traffic in transportation, communication and information networks. We propose a method for mining a time-evolving network for heavy-edge sub-networks. In the transportation network scenario such sub-networks correspond to congested regions, while in a communication network they correspond to links that are overloaded with information traffic.

This project is supported in the INARC APP in E1.

Participants:

  • Petko Bogdanov
  • Misael Mongiovi

 

 

 

 

Back to top

Discovering Influential Groups of Agents in Composite Networks

We consider the problem of discovering effective collaboration cliques of agents in complex organizations in which agents are related based on one or multiple homophily principles, based on geographical, demographic or social commonalities. Independent of the aforementioned relations, agents are grouped in teams to perform a specific task. While there exists a pairwise method of assessing team performance as a whole for a given task, we assume there is no way of directly evaluating the performance of individuals as their success or ineffectiveness is dependent on their teammates. At task completion teams regroup and new tasks are assigned.

The above scenario arises in many real-world situations. For example, it can represent the dynamics of employees in a company working on team projects. The management forms teams for specific projects and evaluate each team's performance at project completion.

We propose a model of individual and group effectiveness, based on historical data and agent social and demographic ties. We take into account the performance of dynamic large groups for specific tasks and extrapolate individual performance, conditioned on group success. We discover core subgroups that drive team success and relate these to history of collaboration and proximity in the social graph.

This project is supported in the INARC APP in E1.

Participants:

  • Petko Bogdanov
  • Collaboration with Benjamin Baumer (CUNY)
  • Collaboration with Prithwish Basu (BBN)

Publications:

  • Bogdanov et al. 2011

 

 

 

 

Back to top

Back to top

Interaction Networks from Collaborative Content Creation

Collaborative systems such as Wikipedia produce high-quality content based on democratic principles of contribution. The high quality is due to a well-balanced self-regulated community of contributors. We propose a method to characterize and study the community structure of contributors of Wiki-based systems based on raw-content generation analysis. We leverage topic modelling in order to capture agreement and opposition of contributors and analyze these multi-modal relations to map communities in the contributor base.

The key steps of our approach include (i) modeling of pairwise variable-strength contributor interactions that can be both positive and negative, (ii) synthesis of a global network incorporating all pairwise interactions, and (iii) detection and analysis of community structure encoded in such networks. Analysis of the discovered community structure reveals coalitions of common-interest editors who back each other in promoting certain topics and collectively oppose other coalitions or single authors.

A different trust within this project addresses the problem of quality of content and its dependence on a timely convergence of opposing viewpoints to an unbiased state. One problem that hinders this process is unnoticed controversial content due to multiple succeeding contributions that move the focus away from it.

We develop a predictive model of editor behavior based on the content dynamics throughout the revision history of an article. A central component of our approach is the detection of the level of local agreement between a pair of contributors based on consecutive revisions. We employ these pairwise interactions (i) to estimate the level of controversy of a new contribution and (ii) to rank contributors based on their sensitivity to a specific new edit. Our ranking is based on the underlying content and the social structure of the editors encoded in a global contributor interaction network.

This project is supported in the INARC APP in E1.

Participants:

  • Petko Bogdanov
  • Nick Larusso

Publications:

  • Bogdanov et al. (2), 2010
  • Bogdanov et al. (1), 2010

 

 

 

 

 

 

 

Back to top

Scalable Proximity Search in Large and Composite Graphs

Pairwise relations among entities and agents in networks arise from different sources and can be based on multiple features. For example, people may have accounts in several online social networks, targeted towards different interaction types. A single individual can be a user of Facebook to keep up with her friends, LinkedIn to organize and maintain her business contacts, and Last.fm to discuss musical trends with fellow listeners. This diverse social ambiance is complemented by an information layer in the form of email communication, generation and discussion of blog entries or shared photographs. Supporting exploratory and analysis tasks in such composite systems has to be flexible and scalable in order to reflect frequent network changes and user-centric prioritization of the different components.

In the context of composite networks, we address the problem of k-Nearest Neighbor search, according to a random walk proximity measure called Effective Importance. Our approach retrieves the exact top neighbors at query time without relying on off-line indexing or summaries of the entire network. This makes it suitable for very large dynamic networks, as well as for composite network overlays mixed at query time.

This project is supported in the INARC APP in I2.

Participants:

  • Petko Bogdanov

Publications:

  • Bogdanov and Singh, 2010

 

 

 

 

 

Back to top

WiGipedia: Visual Editing of Wikipedia

Wikipedia is emerging as the dominant global knowledge repository. Recently, large numbers of Wikipedia users have collaborated to produce more structured information in the online encyclopedia. For example, the information found in tables, categories and infoboxes. Infoboxes contain key-value pairs, manually appended to articles based on the unstructured text therein. The wiki contains some structured information which can be crawled by DBpedia, which attempts to organize wiki data into into a database of subject-predicate-object triples. By leveraging this data we generate an interface, which we call WiGipedia, embedded on every Wikipedia article as an interactive graph visualization where entities represent articles, categories and relational entities, with typed edges between them. This intelligent web interface is designed to simplify the elicitation of semantically structured information in Wikipedia. Our motivation is to both inform the user of interesting contextual information pertaining to the current article, and to provide a simple way to introduce and/or repair semantic relations between wiki articles. User actions result in improved accuracy and consistency of structured data spread across multiple articles. WiGipedia provides users with an intuitive interface that allows single-click Wikipedia edits without knowledge of the Wikipedia markup language, templates, etc.

This project is supported in the INARC APP in T2.

Participants:

  • Svetlin Bostadjiev
  • Tobias Hollerer

 

 

 

 

Back to top


Figure 1


Figure 2

Cross-modal Fusion

Monitoring and fusing large amount of heterogeneous data is critical for advanced situational awareness. Thus there exists a need to efficiently filter and summarize such data. To conduct our research, we have set up a large camera network in the UCSB campus, which consists of more than 40 fixed cameras deployed along the bike paths and pedestrian walkways around campus. Figure 1 shows an aerial view of the camera positions as well as a few snapshots. The main goal is to explore the relationship between the traffics observed from the cameras with activities in the nearby area. In other words, exploiting cross-modal linkages between two different sources of information, e.g. observations from camera network and approximated building occupancy statistics, as shown in Figure 2. By utilizing canonical correlation analysis and probability model estimation, we are able to uncover such virtual linkages. We demonstrate that such information fusion can lead to better scene understanding and the discovery of aggregated human movement pattern. Our work was published in ACM Multimedia 2010 (Xu et al., 2010)

This project is supported in the INARC APP in I1.

Participants:

  • Jiejun Xu
  • B. S. Manjunath

 

 

 

 

 

 

Back to top