Research
My work falls into two streams. The first stream centers on modeling and sustaining open collaboration in Wikipedia. In my postdoctoral research, I have applied machine learning (e.g., time-series clustering, recurrent neural networks) and econometrics (e.g., simultaneous equations models) to: 1) mine temporal patterns in the quality evolution of Wikipedia articles; 2) quantify the interactions among article quality, group size, and article nomination; and 3) predict group participation based on past group interaction and search interest. I am also developing a large-scale, theoretically and empirically grounded agent-based simulation to test design and operational decisions in Wikipedia; we plan to use the system first to examine the impact of anti-vandalism tools. The second stream concerns modeling and promoting the adoption of renewable technologies. In my doctoral research, I developed an agent-based computing system parameterized by machine learning to model and predict solar panel adoption in San Diego, California, and designed algorithms to optimize marketing strategies in the solar market. Below I describe the two research streams in more detail.
Stream 1: Modeling and Sustaining Wikipedia Knowledge Production
Online open collaboration systems like Wikipedia are complex socio-technical systems in which editors and artifacts interact and coevolve. To better understand open production in Wikipedia, the following three papers used longitudinal data to model and predict the dynamics around Wikipedia articles.
Mining and predicting dynamic evolution of Wikipedia articles
This work aimed to answer two questions: 1) What are the common trajectories through which article quality grows over time? 2) What factors determine an article's trajectory? We harvested archival data from three samples of articles in the categories of road infrastructure, films, and battles, and applied time-series clustering with Dynamic Time Warping (DTW) as the distance measure to identify common patterns in the articles' quality trajectories. We found three distinctive clusters of trajectories, namely stalled, plateaued, and sustained, that varied by initial quality, final quality, and the rate at which quality increased over time. Multinomial logistic regression suggested that although multiple factors, including article relevance, contribution inequality, and communication among editors, can affect quality trajectories, different factors matter at different stages of article development.
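To illustrate the approach, the sketch below clusters quality trajectories using DTW as the distance measure. The pure-NumPy DTW and the agglomerative clustering step are illustrative assumptions; the paper's exact clustering pipeline is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D quality trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def cluster_trajectories(trajectories, n_clusters=3):
    """Cluster variable-length trajectories by pairwise DTW distance."""
    k = len(trajectories)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(trajectories[i], trajectories[j])
    # agglomerative clustering on the condensed DTW distance matrix
    return fcluster(linkage(squareform(dist), method="average"),
                    t=n_clusters, criterion="maxclust")
```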
Modeling the interplay among article quality, crowd size and article nomination in Wikipedia
Much work on Wikipedia has focused on factors that affect article quality but has not demonstrated the feedback effect from article quality back to editors. To provide a more complete view of how open collaboration leads to quality artifacts in Wikipedia, this paper proposed a simultaneous equations model to capture the complex interactions among article quality, crowd size, and Good Article (GA) nomination. Using instrumental variables and two-stage least squares (2SLS), we estimated the model on a longitudinal dataset of more than 300,000 monthly observations for a sample of 2,425 GA nominees in Wikipedia. We found that higher quality attracted more participation, but more editors led to only a marginal increase in quality. Notably, goal-setting behavior such as GA nomination (GAN) was very effective in attracting editors and improving article quality. In addition, search interest brought in more editors. The quality of editors' prior work was also an important predictor of article quality, and the number of GANs previously made by the crowd predicted GA nomination. While the number of WikiProjects positively affected quality, it had a negative impact on crowd size and GAN events.
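For concreteness, here is a minimal sketch of the 2SLS estimation step for a single equation of such a system; the variable names are hypothetical, and the full simultaneous system in the paper involves several jointly estimated equations.

```python
import numpy as np

def two_stage_least_squares(y, X_exog, X_endog, Z):
    """Minimal 2SLS: instrument the endogenous regressors, then run OLS.

    y        : (n,)   outcome, e.g., article quality
    X_exog   : (n, k) exogenous regressors (including a constant column)
    X_endog  : (n, p) endogenous regressors, e.g., crowd size
    Z        : (n, q) instruments, q >= p
    """
    # First stage: project endogenous regressors onto instruments + exogenous vars
    W = np.hstack([X_exog, Z])
    beta1, *_ = np.linalg.lstsq(W, X_endog, rcond=None)
    X_endog_hat = W @ beta1
    # Second stage: OLS of the outcome on exogenous vars and fitted endogenous vars
    X2 = np.hstack([X_exog, X_endog_hat])
    beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    return beta2
```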
Predicting group participation in online open collaborations
Online open collaborations, such as Wikipedia and open source software projects, are major sources of quality information, artifacts, and services. However, they need sustained participation by contributors to be successful. Accurate prediction of changes in group participation is therefore valuable to these communities when system managers want to plan and optimize their actions. The goal of this paper was to predict the highly variable group contributions to Wikipedia articles using a deep learning approach. We collected data on month-to-month changes in the contributions to over 2,000 Good Article (GA) nominees, along with a rich set of factors that may influence editors' contributions. To handle both static and temporal features, we developed a deep neural network model combining feedforward layers with recurrent layers (Long Short-Term Memory, LSTM). Our experiments suggested that the model significantly outperformed several baselines, such as regularized linear regression, support vector machines, and random forests, on a holdout sample. We examined the relative importance of the predictors and discussed the implications for both Wikipedia research and its management.
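A minimal sketch of such an architecture is shown below, assuming PyTorch; the layer sizes, the single-layer LSTM, and the way static and temporal features are merged are illustrative choices, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ParticipationPredictor(nn.Module):
    """LSTM over monthly (temporal) features, merged with static article features."""

    def __init__(self, n_temporal, n_static, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_temporal, hidden_size=hidden,
                            batch_first=True)
        self.static_net = nn.Sequential(nn.Linear(n_static, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # next month's contribution change

    def forward(self, temporal, static):
        # temporal: (batch, months, n_temporal); static: (batch, n_static)
        _, (h_n, _) = self.lstm(temporal)     # h_n: (num_layers, batch, hidden)
        merged = torch.cat([h_n[-1], self.static_net(static)], dim=1)
        return self.head(merged).squeeze(-1)
```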
In a complex socio-technical system like Wikipedia, editor behaviors influence and are influenced by other agents and artifacts, e.g., articles, tools, and policies. The design and management of such systems therefore face tremendous challenges: a small change made to one part of the system can cascade to other parts and lead to unintended consequences. To model this inherent complexity, the following studies aim to develop an agent-based computing platform that simulates open knowledge production in Wikipedia, informed by social theories, empirical findings, and data.
A comprehensive review of research on Wikipedia collaboration
To ground our model, we conducted a comprehensive review of empirical research on how Wikipedia editors collaborate to create and improve articles. Although several reviews of Wikipedia research exist, most summarize individual papers with little integration to show the big picture of what has been learned across studies. We retrieved approximately 1,600 papers from the ACM Digital Library in March 2017 and chose 244 articles for review, which examine various aspects and processes of how Wikipedia editors collaborate with one another. Popular themes include article quality (28 articles), conflict (14 articles), and tools (13 articles). We have completed coding and are currently drafting the paper. Preliminary analyses show several common themes around quality, such as the definition and measurement of article quality, antecedents of quality, and methods for article quality prediction. Research on conflict centers on the sources and consequences of conflict, conflict management, and conflict detection.
Agent-based modeling to study the impact of Wikipedia vandal-fighting tools
Wikipedia has experienced a slow but steady decline in editor contributions since 2007. One speculated reason for the decline is an unfriendly environment for newcomers, who are more likely to be reverted, particularly through the use of semi-automated anti-vandalism tools. These tools have substantially improved the efficiency of quality-control work; however, several studies have noted their negative impact on editor retention. The design and operation of these power tools thus deserve careful consideration. For example, who should be given permission to use these tools? What is the appropriate action when an editor notices an unconstructive edit possibly made by a good-faith newcomer? What new features could be added to the existing tools to mitigate the undesirable effects on new editors? To answer these questions, we are developing an agent-based model that simulates editors' article creation and maintenance activities and captures the dual impact of vandal-fighting tools. We have finished designing the conceptual model, implemented it with Repast, and are currently in the middle of model calibration and validation.
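As a toy illustration of the mechanism being modeled (not the Repast implementation itself), the following sketch simulates how an anti-vandalism tool's false positives can drive newcomers away; all rates are hypothetical parameters.

```python
import random

class Editor:
    def __init__(self):
        self.edits = 0
        self.active = True

def step(editors, p_damaging=0.05, p_false_positive=0.02, p_quit_after_revert=0.3):
    """One simulated month: each active editor makes an edit that the
    anti-vandalism tool may revert; reverted newcomers may quit."""
    for e in editors:
        if not e.active:
            continue
        e.edits += 1
        damaging = random.random() < p_damaging
        # the tool reverts all damage it catches, plus some good-faith edits
        reverted = damaging or random.random() < p_false_positive
        is_newcomer = e.edits <= 10
        if reverted and is_newcomer and random.random() < p_quit_after_revert:
            e.active = False  # discouraged newcomer leaves the community

editors = [Editor() for _ in range(1000)]
for month in range(24):
    step(editors)
print(sum(e.active for e in editors), "editors still active after 2 years")
```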
Stream 2: Modeling and Promoting Technology Adoption
In the last decade, agent-based modeling has become a new paradigm for modeling innovation diffusion. However, our careful review of empirically grounded agent-based models (ABMs) of innovation diffusion suggested that few ABMs are calibrated properly, validated rigorously, and developed explicitly for prediction [AI Review 2017]. To fill these gaps, my Ph.D. research introduced machine learning techniques to calibrate and validate agent-based models and applied the approach to forecasting the adoption of solar panels in San Diego County, California. Computational adoption models also make it possible to augment and even automate marketing decisions through efficient algorithms; my dissertation contributed several marketing optimization algorithms for different problem settings.
Agent-based modeling of solar adoption and diffusion [JAAMAS 2016]
In this work, we proposed a novel, data-driven agent-based modeling framework in which assumptions about individual behaviors are rooted in and parameterized by machine learning and validated against a holdout sample of collective adoption decisions. The model was deployed and shown to accurately forecast the trend of residential rooftop solar adoption in San Diego County, California. We then ran virtual experiments with the model to evaluate the efficacy of the solar incentive program. Our experiments suggested that the impact of California's solar rebates may have been overestimated, as there exist potentially more effective incentive structures and seeding strategies (e.g., giving away low-cost solar panels) for promoting solar adoption.
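The sketch below illustrates the framework's core loop on synthetic data: a machine-learned individual adoption model drives a forward simulation whose peer-effect feature updates as adoption spreads. The features, the logistic-regression learner, and all numbers are stand-ins, not the calibrated San Diego model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for household-month training data: features are
# [income, electricity bill, fraction of adopters nearby] (all hypothetical)
X_train = rng.random((5000, 3))
y_train = (rng.random(5000) < 0.02 + 0.1 * X_train[:, 2]).astype(int)
model = LogisticRegression().fit(X_train, y_train)  # individual adoption model

def simulate(n_households=1000, months=36):
    """Forward-simulate: each month, non-adopters adopt with the probability
    the calibrated model assigns to their current features; the neighbor
    feature updates as adoption spreads (the peer-effect feedback loop)."""
    static = rng.random((n_households, 2))          # income, bill
    adopted = np.zeros(n_households, dtype=bool)
    for _ in range(months):
        feats = np.column_stack([static, np.full(n_households, adopted.mean())])
        p = model.predict_proba(feats)[:, 1]
        adopted |= (~adopted) & (rng.random(n_households) < p)
    return adopted.sum()

print(simulate(), "adopters after 3 years (synthetic run)")
```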
In operations research, optimizing marketing strategies often means maximizing influence, awareness, adoption, or revenue. For example, product seeding aims to maximize word-of-mouth effects by giving away free samples to potential influencers in social networks. The operational need to find optimal seeding strategies can be formulated as a maximization problem whose objective function exhibits decreasing returns to scale, i.e., submodularity, for which a simple greedy algorithm provably achieves a (1 - 1/e) approximation of the optimal solution.
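As a reference point, that classic greedy algorithm can be sketched as follows; `influence` stands for an abstract monotone submodular oracle, e.g., the expected number of adoptions triggered by a seed set.

```python
def greedy_seed_selection(candidates, influence, k):
    """Classic greedy for monotone submodular maximization: repeatedly add the
    candidate with the largest marginal gain. Under a cardinality constraint
    this achieves a (1 - 1/e) approximation (Nemhauser et al., 1978).

    influence(S) -> float must be monotone and submodular on sets S.
    """
    seeds = set()
    for _ in range(k):
        best = max((c for c in candidates if c not in seeds),
                   key=lambda c: influence(seeds | {c}) - influence(seeds))
        seeds.add(best)
    return seeds
```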
Dynamic influence maximization under increasing returns to scale [AAMAS 2015]
The influence of new adopters on future adoptions may vary across stages of diffusion. For instance, a new adoption in the early stages of technology diffusion may have a greater marginal impact on others' likelihood of adoption than one in the later stages. We refer to this property as increasing returns to scale. To optimize seeding strategies in this setting, we formulated a dynamic influence maximization problem and proposed a straightforward but optimal strategy: spend the entire budget in a single stage rather than splitting it across multiple stages. We experimentally verified the performance of the algorithm and identified conditions (e.g., when the seeding cost decreases as a function of aggregate adoption rather than time) under which the optimality guarantee does not hold, for which we proposed a more effective algorithm.
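A toy simulation can illustrate why concentrating the budget helps under increasing returns; the convex gain function and growth numbers below are purely illustrative, and the paper establishes the single-stage result analytically in a more general formulation.

```python
def adoption_gain(seeds, installed_base):
    """Toy influence: a seed's marginal impact grows with the installed base,
    i.e., increasing returns to scale (purely illustrative functional form)."""
    return seeds * (1.0 + 0.01 * installed_base)

def run(schedule, organic_growth=50):
    """Simulate T stages; schedule[t] is the budget (seeds) spent at stage t."""
    base = 0.0
    for spend in schedule:
        base += organic_growth + adoption_gain(spend, base)
    return base

budget, T = 100, 10
single_stage = [0] * (T - 1) + [budget]          # spend everything at once
split_evenly = [budget / T] * T                  # split across all stages
print(run(single_stage), ">", run(split_evenly)) # single-stage spending wins here
```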
Submodular optimization with routing constraints [AAAI 2016]
Marketing optimization in the real world often faces complex constraints. For example, in door-to-door marketing, salespeople face routing constraints. Unfortunately, classic submodular maximization (i.e., finding a set that maximizes a set function exhibiting natural diminishing returns) deals only with cardinality or cost constraints. This work investigated the problem of maximizing a submodular objective function subject to routing cost constraints and proposed a Generalized Cost-Benefit (GCB) algorithm with proven approximation guarantees. Experiments on both real and simulated networks from mobile sensing and door-to-door marketing showed that our algorithm achieved significantly higher utility than state-of-the-art methods at lower or comparable running time.
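The following is a simplified sketch of the cost-benefit greedy idea, assuming a nearest-neighbor tour length as a cheap stand-in for the routing cost oracle; the published GCB algorithm and its guarantees involve more than this ratio rule.

```python
import math

def route_cost(nodes, depot=(0.0, 0.0)):
    """Cheap proxy for the cost of visiting `nodes` (coordinate tuples):
    length of a nearest-neighbor tour starting from the depot."""
    unvisited, pos, total = set(nodes), depot, 0.0
    while unvisited:
        nxt = min(unvisited, key=lambda p: math.dist(pos, p))
        total += math.dist(pos, nxt)
        pos = nxt
        unvisited.remove(nxt)
    return total

def cost_benefit_greedy(candidates, utility, budget):
    """Cost-benefit greedy (sketch): repeatedly add the candidate with the
    best marginal utility per marginal routing cost that still fits."""
    chosen = []
    while True:
        def ratio(c):
            du = utility(chosen + [c]) - utility(chosen)
            dc = route_cost(chosen + [c]) - route_cost(chosen)
            return du / max(dc, 1e-9)
        feasible = [c for c in candidates if c not in chosen
                    and route_cost(chosen + [c]) <= budget]
        if not feasible:
            return chosen
        chosen.append(max(feasible, key=ratio))
```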
Multi-channel marketing with budget complementarities [AAMAS 2017]
Data- and computing-intensive models, such as simulations, have become new tools for advertisers to estimate and optimize campaign outcomes (e.g., sales or revenue), especially in multi-channel marketing. These advanced predictive models have reduced the uncertainty and risk in marketing decisions, but estimating marketing outcomes with such sophisticated systems is often time-consuming. In addition, the fact that non-negligible budget increments are required for noticeable marginal impact introduces an intractable combinatorial structure to the optimization problem. This work addressed these two computational challenges by solving a multi-choice knapsack problem in which the values are not known in advance but can be approximated by querying channel-wise outcome estimators. We developed an approximation algorithm with a proven optimality guarantee, along with several query strategies for achieving the approximation ratio in an online fashion.
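To show the problem structure (not the paper's approximation algorithm), here is a small exact dynamic program for the multi-choice knapsack over discrete budget increments; `estimate` stands in for the slow channel-wise outcome estimator, and each distinct call corresponds to one query.

```python
def plan_budget(channels, estimate, total_budget, step):
    """Exact DP for a multi-choice knapsack: choose one spend level per
    channel to maximize the total estimated outcome within the budget.

    estimate(channel, spend) -> float is the channel-wise outcome estimator;
    in practice this call is expensive, which motivates query-efficient
    approximation algorithms instead of this brute-force DP.
    """
    slots = total_budget // step + 1        # budget measured in units of `step`
    best = {0: (0.0, {})}                   # used slots -> (value, allocation)
    for ch in channels:
        nxt = {}
        for used, (val, alloc) in best.items():
            for k in range(slots - used):   # spend k * step on this channel
                v = val + estimate(ch, k * step)
                key = used + k
                if key not in nxt or v > nxt[key][0]:
                    nxt[key] = (v, {**alloc, ch: k * step})
        best = nxt
    return max(best.values(), key=lambda t: t[0])   # (value, allocation)
```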