IEEE Big Data 2018 Tutorials
Tutorial 1: Large-Scale Multi-view Data Analysis
Tutorial 2: Analysis of Complex Rare Categories
Tutorial 3: High-Performance SVD for big data
Tutorial 4: Recent Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Deep Learning
Tutorial 5: Big Data Analytics for Societal Event Forecasting
Tutorial 6: Anomaly Detection in Cyber Physical Systems
Tutorial 7: Creating Reproducible Bioinformatics Workflows Using BioDepot-workflow-Builder (BwB)
Tutorial 8: Managing Big Structured Data for Unsupervised Feature Representation Learning
Tutorial 9: Big Data for everyone: Modeling of Data Processing Pipelines for the Industrial Internet of Things

Tutorial 1: Large-Scale Multi-view Data Analysis
Abstract
Multi-view data are widely available nowadays, arising from various types of features, viewpoints, and sensors. For example, the most popular commercial depth sensor, Kinect, uses both visible-light and near-infrared sensors for depth estimation; autonomous driving uses both visual and radar sensors to produce real-time 3D information on the road; and face analysis algorithms prefer face images from different views for high-fidelity reconstruction and recognition. All of these tend to facilitate better data representation in different application scenarios. Essentially, multiple features attempt to uncover the knowledge within each view to support the final task, since each view preserves both shared and private information. This setting becomes increasingly common in the era of "Big Data", where the data are large-scale, subject to corruption, generated from multiple sources, and complex in structure. While these problems have attracted substantial research attention recently, a systematic overview of multi-view learning for Big Data analysis has never been given. In the face of big data and challenging real-world applications, we summarize and go through the most recent multi-view learning techniques appropriate to different data-driven problems. Specifically, our tutorial covers most multi-view data representation approaches, centered around the two major applications in Big Data: multi-view clustering and multi-view classification. In addition, it discusses current and upcoming challenges. This will benefit the community in both industry and academia, from literature review to future directions.
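As a concrete illustration of the first application, the sketch below fuses two synthetic views by averaging per-view affinity matrices before spectral clustering. This is a minimal baseline, not a method endorsed by the tutorial; the data, kernel choice, and cluster count are placeholders.

```python
# Minimal multi-view clustering sketch: fuse per-view affinities, then cluster.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X_view1 = rng.normal(size=(200, 64))   # e.g., visual features (synthetic)
X_view2 = rng.normal(size=(200, 16))   # e.g., depth/radar features (synthetic)

# One common baseline: average the per-view similarity matrices so that
# information shared across views reinforces, while view-specific noise
# tends to average out.
W = 0.5 * rbf_kernel(X_view1) + 0.5 * rbf_kernel(X_view2)

labels = SpectralClustering(
    n_clusters=5, affinity="precomputed", random_state=0
).fit_predict(W)
print(labels[:10])
```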
Tutorial 2: Analysis of Complex Rare Categories
Abstract
In the era of big data, it is often the rare categories that are of great importance in many high-impact applications, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, and from spam image detection in social media to rare disease diagnosis in medical decision support systems. As a result, the detection, characterization, tracking, and representation of rare categories become fundamental learning tasks that may protect us from malicious behaviors, discover novelty for scientific studies, and even save lives. The unique challenges of rare category analysis include (1) the highly skewed class-membership distribution; (2) the non-separability of the rare categories from the majority classes; and (3) data heterogeneity, e.g., the multi-modal representation of examples and the analysis of similar rare categories across multiple related tasks. In this tutorial, we will provide a concise review of rare category analysis, where the majority classes have a smooth distribution while the minority classes exhibit a compactness property. In particular, we will start with early developments in rare category analysis that focus on discovering or characterizing rare examples from static homogeneous data. Then, we will introduce more recent developments of rare category analysis in complex scenarios, such as rare category detection on time series data sets, rare category tracking in time-evolving graphs, and rare category characterization with data heterogeneity. Finally, we will summarize the existing work and share our thoughts regarding future directions.
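To make the "smooth majority, compact minority" intuition concrete, here is a minimal, illustrative selection heuristic (a sketch, not one of the published algorithms the tutorial surveys): it queries the point whose local density is most sharply peaked, which is where a compact rare cluster embedded in a smooth majority tends to sit. The radii and synthetic data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
majority = rng.normal(0.0, 1.0, size=(1000, 2))   # smooth background class
minority = rng.normal(2.5, 0.05, size=(10, 2))    # tight rare cluster
X = np.vstack([majority, minority])

nn = NearestNeighbors().fit(X)
# Count neighbors within a small and a slightly larger radius; a compact rare
# cluster inside a smooth majority produces an abrupt jump in local counts.
small = np.array([len(i) for i in nn.radius_neighbors(X, radius=0.10, return_distance=False)])
large = np.array([len(i) for i in nn.radius_neighbors(X, radius=0.20, return_distance=False)])
score = small / np.maximum(large, 1)   # near 1 where density is sharply peaked
candidate = int(np.argmax(score))
print("query this example for labeling:", candidate, X[candidate])
```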
Tutorial 3: High-Performance SVD for big data
Abstract
The singular value decomposition (SVD) is one of the core computations of today's scientific applications and data analysis tools. The main goal is to compute a compact representation of a high-dimensional operator, a matrix, or a set of data that best resembles the original in its most important features. Thus, SVD is widely used in scientific computing and machine learning, including low-rank factorizations, graph learning, unsupervised learning, and the compression and analysis of images and text.
The popularity of SVD has resulted in an increasing diversity of methods and implementations that exploit specific features of the input data (e.g., dense/sparse matrix, data distributed among the computing devices, data from queries or batch access, spectral decay) and certain constraints on the computed solutions (e.g., how many singular values and singular vectors are computed, the targeted part of the spectrum, the required accuracy). Using the proper method and customizing its settings can significantly reduce the cost.
In this tutorial we present a classification of the most relevant methods in terms of computing cost and accuracy (direct methods, iterative methods, online methods), including the most recent advances in randomized and online SVD solvers. We identify the parameters with the biggest impact on the computational cost and the quality of the solution, and offer some intuition for tuning them. Finally, we discuss the current state of the software on widely used platforms (MATLAB, Python's numpy/scipy, and R) as well as high-performance solvers with support for multicore, GPU, and distributed memory.
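The following minimal sketch contrasts two of the method families above on a sparse matrix, using SciPy's iterative (Lanczos-type) solver and scikit-learn's randomized solver; the matrix size, sparsity, and rank are arbitrary placeholders.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import randomized_svd

A = sparse_random(10_000, 1_000, density=1e-3, random_state=0, format="csr")

# Iterative (Lanczos-based) solver: a good fit when A is sparse and only a
# few singular triplets are needed.
U, s, Vt = svds(A, k=10)

# Randomized solver: a good fit when the spectrum decays quickly; the cost
# is roughly a few passes over the data.
U2, s2, Vt2 = randomized_svd(A, n_components=10, random_state=0)

print(np.sort(s)[::-1][:3], s2[:3])  # leading singular values should agree
```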
Tutorial 4: Recent Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Deep Learning
Abstract
Zeroth-order (ZO) optimization is increasingly embraced for solving big data and
machine learning problems when explicit expressions of the gradients are difficult or infeasible
to obtain. It achieves gradient-free optimization by approximating the full gradient via efficient
gradient estimators. Some recent important applications include: a) generation of
prediction-evasive, black-box adversarial attacks on deep neural networks, b) online network
management with limited computation capacity, c) parameter inference of black-box/complex
systems, and d) bandit optimization in which a player receives partial feedback in terms of loss
function values revealed by her adversary.
This tutorial aims to provide a comprehensive introduction to recent advances in ZO
optimization methods in both theory and applications. On the theory side, we will cover
convergence rate and iteration complexity analysis of ZO algorithms and make comparisons to
their first-order counterparts. On the application side, we will highlight one appealing application of ZO optimization: studying the robustness of deep neural networks through practical and efficient adversarial attacks that generate adversarial examples from a black-box machine learning model. We will also summarize potential research directions regarding ZO optimization, big data challenges, and some open-ended machine learning problems.
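As a concrete example of the gradient estimators mentioned above, here is a minimal two-point random-direction estimator; the toy objective, smoothing radius, and sample count are placeholders.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_samples=20, seed=0):
    """Estimate grad f(x) using only function evaluations."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape[0])
        # Symmetric finite difference along a random direction; averaging
        # over many directions approximates the full gradient.
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

f = lambda x: np.sum(x ** 2)        # toy objective with known gradient 2x
print(zo_gradient(f, np.ones(5)))   # approximately [2, 2, 2, 2, 2]
```

Plugging such an estimator into a gradient-descent loop yields a basic ZO method; the convergence analyses covered in the tutorial quantify the extra cost this estimation incurs relative to first-order methods.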
Tutorial 5: Big Data Analytics for Societal Event Forecasting
Abstract
Spatio-temporal societal event forecasting, which has traditionally been prohibitively challenging, is now becoming possible and experiencing rapid growth thanks to big data from Open Source Indicators (OSI) such as social media, news sources, blogs, economic indicators, and other metadata sources. Spatio-temporal societal event forecasting benefits society in various respects, such as anticipating political crises, humanitarian crises, mass violence, riots, mass migrations, disease outbreaks, economic instability, resource shortages, and responses to natural disasters. Different from traditional event detection, which identifies ongoing events, event forecasting focuses on predicting future events that have yet to happen. Also different from traditional spatio-temporal prediction on numerical indices, spatio-temporal event forecasting needs to leverage the heterogeneous information in OSI to discover predictive indicators and their mappings to future societal events. The resulting problems typically require predictive modeling techniques that can jointly handle semantic, temporal, and spatial information, and require the design of efficient algorithms that scale to high-dimensional, large real-world datasets.
In this tutorial, we will present a comprehensive review of the state-of-the-art methods for spatio-temporal societal event forecasting. First, we will categorize the OSI inputs and the predicted societal events commonly researched in the literature. Then we will review methods for temporal and spatio-temporal societal event forecasting. We will illustrate the basic theoretical and algorithmic ideas and discuss specific applications in all the above settings.
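To fix ideas, the following toy sketch shows the temporal (non-spatial) skeleton of such a forecasting task: trailing-window keyword counts from OSI-like sources predict next-day event occurrence. All data, window sizes, and features here are synthetic placeholders, not a method from the tutorial.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
days, n_keywords, window = 365, 50, 7
counts = rng.poisson(2.0, size=(days, n_keywords))  # e.g., daily protest-related term counts
events = rng.binomial(1, 0.1, size=days)            # 1 if a societal event occurred that day

# Each training example stacks the previous `window` days of keyword counts.
X = np.stack([counts[t - window:t].ravel() for t in range(window, days)])
y = events[window:]

model = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
print("held-out accuracy:", model.score(X[300:], y[300:]))
```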
Tutorial 6: Anomaly Detection in Cyber Physical Systems
Abstract
In a large distributed network, devices generate data with three V's: high volume, high velocity, and high variety. The data are generally unstructured and correlated. Quickly and accurately detecting anomalies in this massive amount of data is paramount, since detecting anomalies helps identify system faults and enables immediate countermeasures to mitigate faults and stop their propagation through the network. Yet it is very challenging, as it requires effective detection algorithms as well as an adequate understanding of the underlying physical process that generated the data.
In this tutorial, we will cover the elements of anomaly detection in a networked system, basic detection techniques, and their applications in the Internet of Things (IoT) and Cyber Physical Systems (CPS). First, we will introduce the concept and categories of anomalies; then we will focus on the models and algorithms for anomaly detection, grouping the existing detection techniques by the underlying models and approaches used. The statistical properties and algorithmic aspects of the detection methods will be discussed. Subsequently, using communication networks and power grids as examples, we will discuss how these detection techniques are applied in practice. Finally, we will discuss the outlook of this research topic and its relation to other areas of study.
We will focus on two broadly defined anomaly detection problems: 1) outlier detection
from a large dataset, and 2) change point detection from a dynamic process. Both offline
and online algorithms will be discussed.
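The sketch below illustrates both settings on synthetic data: a batch z-score outlier test and an online one-sided CUSUM change-point detector. The thresholds and drift parameter are placeholder values, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])  # mean shifts at t=500

# 1) Offline outlier detection: flag points far from the bulk of the data.
z = np.abs(x - x.mean()) / x.std()
outliers = np.flatnonzero(z > 3.0)

# 2) Online change-point detection: one-sided CUSUM for an upward mean shift.
def cusum(stream, drift=0.5, threshold=8.0):
    s = 0.0
    for t, xt in enumerate(stream):
        s = max(0.0, s + xt - drift)   # accumulate evidence above the drift level
        if s > threshold:
            return t                   # first alarm time
    return None

print("outliers:", outliers[:5], "change detected at t =", cusum(x))
```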
Tutorial 7: Creating Reproducible Bioinformatics Workflows Using BioDepot-workflow-Builder (BwB)
Abstract
Reproducibility is essential for the verification and advancement of scientific research. To reproduce the results of computational analyses, it is often necessary to recreate not just the code but also the software and hardware environment. Software containers like Docker, which distribute the entire computing environment, are rapidly gaining popularity in bioinformatics. Docker not only allows for the reproducible deployment of bioinformatics workflows, but also facilitates mixing and matching components from different workflows that have complex and possibly conflicting software requirements. However, the configuration and deployment of Docker, a command-line tool, can be exceedingly challenging for biomedical researchers with limited training in programming and technical skills.
We developed a drag-and-drop GUI called the Biodepot-Workflow-Builder (Bwb) to allow users to assemble, replicate, modify, and execute Docker workflows. Bwb represents individual software modules as widgets that are dragged onto a canvas and connected together to form a graphical representation of an analytical pipeline. These widgets allow the user interface to interact with software containers, so that software tools written in other languages are compatible and can be used to build modular bioinformatics workflows. We will present a case study using Bwb to create and execute an RNA sequencing data workflow.
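Bwb itself is a GUI, but the idea it builds on can be sketched in a few lines: each workflow step runs inside its own container, so the software environment travels with the step. The snippet below is a generic illustration, not Bwb's actual interface; the image names, commands, and paths are hypothetical.

```python
import subprocess

def run_step(image, command, data_dir):
    """Run one pipeline step in a Docker container, mounting the shared data dir."""
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image] + command,
        check=True,
    )

# A two-step sketch of an RNA-seq-style pipeline: align reads, then count.
# (Hypothetical images and tool arguments, for illustration only.)
run_step("example/aligner:latest", ["align", "--reads", "/data/reads.fq"], "/tmp/rnaseq")
run_step("example/counter:latest", ["count", "--bam", "/data/aligned.bam"], "/tmp/rnaseq")
```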
Tutorial 8: Managing Big Structured Data for Unsupervised Feature Representation Learning
Abstract
In recent years, there has been a surge of interest in learning expressive representations from large-scale structured inputs, ranging from time-series and string data to text and graph data. Transforming such a massive variety of structured inputs into a context-preserving representation in an unsupervised setting is a grand challenge for the research community. In many problem domains, it is easier to specify a reasonable dissimilarity (or similarity) function between instances than to construct a feature representation. Even for complex structured inputs, there are many well-developed dissimilarity measures, such as the Edit Distance (Levenshtein distance) between sequences, the Dynamic Time Warping measure between time series, the Hausdorff distance between sets, and the Wasserstein distance between distributions. However, most standard machine learning models are designed for inputs with a vector feature representation. In this tutorial, we aim to present a simple yet principled learning framework for generating vector representations of structured inputs such as time series, strings, text, and graphs from a well-defined distance function. The resulting representation can then be used for any downstream machine learning task, such as classification and regression problems. We present a comprehensive catalog of best practices for generating such vector representations and demonstrate state-of-the-art performance compared with the best existing methods across various analytic applications. We will also share our experiences with the various challenges in constructing these representations in different applications in the time-series, string, text, and graph domains.
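As a minimal sketch of the distance-based embedding idea (not the tutorial's exact framework), the code below represents variable-length time series by their DTW distances to a handful of randomly chosen "landmark" series, producing fixed-length vectors for any downstream model. The plain O(nm) DTW, the data, and the landmark count are assumptions.

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-programming DTW distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
series = [rng.standard_normal(rng.integers(40, 80)) for _ in range(100)]  # variable lengths
landmarks = [series[i] for i in rng.choice(len(series), size=8, replace=False)]

# Each (possibly variable-length) series becomes a fixed 8-dimensional vector.
features = np.array([[dtw(s, l) for l in landmarks] for s in series])
print(features.shape)   # (100, 8)
```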
Tutorial 9: Big Data for everyone: Modeling of Data Processing Pipelines for the Industrial Internet of Things
Abstract
In many application domains such as manufacturing, the integration and continuous processing of real-time sensor data from the Internet of Things (IoT) provides users with the opportunity to continuously monitor and detect upcoming situations. One example is the optimization of maintenance processes based on the current condition of machines (condition-based maintenance). While the continuous processing of events in scalable architectures is already well supported by the existing Big Data tool landscape (e.g., Apache Kafka, Apache Spark, or Apache Flink), building such applications still requires an enormous effort which, besides programming skills, demands a rather deep technical background in distributed, scalable infrastructures. At the same time, small and medium enterprises in the manufacturing domain, in particular, often do not have the expertise required to build such programs. Therefore, there is a need for more intuitive solutions supporting the development of real-time applications.
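To illustrate the kind of hand-written code this effort entails, here is a minimal PySpark Structured Streaming job that consumes machine events from Kafka and maintains per-machine averages. The broker address, topic name, and message format are assumptions, and the Spark-Kafka connector package must be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("condition-monitoring").getOrCreate()

# Read a stream of raw events from an assumed Kafka topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "machine-events")
       .load())

# Assume each message value is a CSV pair: "machine_id,temperature".
parsed = raw.selectExpr("CAST(value AS STRING) AS csv").select(
    F.split("csv", ",")[0].alias("machine_id"),
    F.split("csv", ",")[1].cast("double").alias("temperature"),
)

# Continuously aggregate and print per-machine average temperature.
query = (parsed.groupBy("machine_id")
         .agg(F.avg("temperature").alias("avg_temp"))
         .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()
```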
In this tutorial, we present methods and tools that enable the flexible modeling of real-time and batch processing pipelines by domain experts. We will present ongoing standardization efforts and novel, lightweight, semantics-based models that make it possible to enhance data streams and stream processing algorithms with background knowledge. Furthermore, we look deeper into the graphical modeling of processing pipelines, i.e., stream processing programs that can be defined with graphical tool support and are automatically deployed to distributed stream processors. The tutorial is accompanied by a hands-on session and includes many real-world examples and motivating scenarios gathered from a number of research and industry projects over the past years.