IEEE Big Data 2018 Tutorials
Tutorial 1: Large-Scale Multi-view Data Analysis
Tutorial 2: Analysis of Complex Rare Categories
Tutorial 3: High-Performance SVD for big data
Tutorial 4: Recent Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Deep Learning
Tutorial 5: Big Data Analytics for Societal Event Forecasting
Tutorial 6: Anomaly Detection in Cyber Physical Systems
Tutorial 7: Creating Reproducible Bioinformatics Workflows Using BioDepot-workflow-Builder (BwB)
Tutorial 8: Managing Big Structured Data for Unsupervised Feature Representation Learning
Tutorial 9: Big Data for everyone: Modeling of Data Processing Pipelines for the Industrial Internet of Things

Tutorial 1: Large-Scale Multi-view Data Analysis
Abstract
Multi-view data are widely available nowadays, arising from various types of features, viewpoints, and sensors. For example, the most popular commercial depth sensor, Kinect, uses both visible-light and near-infrared sensors for depth estimation; autonomous driving uses both visual and radar sensors to produce real-time 3D information on the road; and face analysis algorithms prefer face images from different views for high-fidelity reconstruction and recognition. All of these tend to facilitate better data representation in different application scenarios. Essentially, multiple features attempt to uncover the knowledge within each view to support the final task, since each view preserves both shared and private information. This setting becomes increasingly common in the era of "Big Data", where the data are large-scale, subject to corruption, generated from multiple sources, and complex in structure. While these problems have attracted substantial research attention recently, a systematic overview of multi-view learning for Big Data analysis has never been given. In the face of big data and challenging real-world applications, we summarize and go through the most recent multi-view learning techniques appropriate to different data-driven problems. Specifically, our tutorial covers most multi-view data representation approaches, centered around the two major applications in Big Data: multi-view clustering and multi-view classification. In addition, it discusses current and upcoming challenges. This will benefit the community in both industry and academia, from literature review to future directions.
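As a concrete illustration of the first application, the sketch below fuses two synthetic views by averaging per-view affinity matrices before spectral clustering. This is a minimal baseline, not a method endorsed by the tutorial; the data, kernel choice, and cluster count are placeholders.

```python
# Minimal multi-view clustering sketch: fuse per-view affinities, then cluster.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X_view1 = rng.normal(size=(200, 64))   # e.g., visual features (synthetic)
X_view2 = rng.normal(size=(200, 16))   # e.g., depth/radar features (synthetic)

# One common baseline: average the per-view similarity matrices so that
# information shared across views reinforces, while view-specific noise
# tends to average out.
W = 0.5 * rbf_kernel(X_view1) + 0.5 * rbf_kernel(X_view2)

labels = SpectralClustering(
    n_clusters=5, affinity="precomputed", random_state=0
).fit_predict(W)
print(labels[:10])
```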
Tutorial 2: Analysis of Complex Rare Categories
Abstract
In the era of big data, it is often the rare categories that are of great importance in many high-impact applications, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, and from spam image detection in social media to rare disease diagnosis in medical decision support systems. As a result, the detection, characterization, tracking, and representation of rare categories become fundamental learning tasks that may protect us from malicious behaviors, discover novelty for scientific studies, and even save lives. The unique challenges of rare category analysis include (1) the highly skewed class-membership distribution; (2) the non-separability of the rare categories from the majority classes; and (3) data heterogeneity, e.g., the multi-modal representation of examples and the analysis of similar rare categories across multiple related tasks. In this tutorial, we will provide a concise review of rare category analysis, where the majority classes have a smooth distribution while the minority classes exhibit a compactness property. In particular, we will start with early developments in rare category analysis that focus on discovering or characterizing rare examples from static homogeneous data. Then, we will introduce more recent developments of rare category analysis in complex scenarios, such as rare category detection on time series data sets, rare category tracking in time-evolving graphs, and rare category characterization with data heterogeneity. Finally, we will summarize the existing work and share our thoughts regarding future directions.
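To make the "smooth majority, compact minority" intuition concrete, here is a minimal, illustrative selection heuristic (a sketch, not one of the published algorithms the tutorial surveys): it queries the point whose local density is most sharply peaked, which is where a compact rare cluster embedded in a smooth majority tends to sit. The radii and synthetic data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
majority = rng.normal(0.0, 1.0, size=(1000, 2))   # smooth background class
minority = rng.normal(2.5, 0.05, size=(10, 2))    # tight rare cluster
X = np.vstack([majority, minority])

nn = NearestNeighbors().fit(X)
# Count neighbors within a small and a slightly larger radius; a compact rare
# cluster inside a smooth majority produces an abrupt jump in local counts.
small = np.array([len(i) for i in nn.radius_neighbors(X, radius=0.10, return_distance=False)])
large = np.array([len(i) for i in nn.radius_neighbors(X, radius=0.20, return_distance=False)])
score = small / np.maximum(large, 1)   # near 1 where density is sharply peaked
candidate = int(np.argmax(score))
print("query this example for labeling:", candidate, X[candidate])
```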
Tutorial 3: High-Performance SVD for big data
Abstract
The singular value decomposition (SVD) is one of the core computations of today's scientific applications and data analysis tools. The main goal is to compute a compact representation of a high-dimensional operator, a matrix, or a set of data that best resembles the original in its most important features. Thus, SVD is widely used in scientific computing and machine learning, including low-rank factorizations, graph learning, unsupervised learning, and the compression and analysis of images and text.
The popularity of SVD has resulted in an increasing diversity of methods and implementations that exploit specific features of the input data (e.g., dense/sparse matrix, data distributed among the computing devices, data from queries or batch access, spectral decay) and certain constraints on the computed solutions (e.g., how many singular values and singular vectors are computed, the targeted part of the spectrum, the required accuracy). Using the proper method and customizing its settings can significantly reduce the cost.
In this tutorial we present a classification of the most relevant methods in terms of computing cost and accuracy (direct methods, iterative methods, online methods), including the most recent advances in randomized and online SVD solvers. We identify the parameters with the biggest impact on the computational cost and the quality of the solution, and offer some intuition for tuning them. Finally, we discuss the current state of the software on widely used platforms (MATLAB, Python's numpy/scipy, and R) as well as high-performance solvers with support for multicore, GPU, and distributed memory.
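The following minimal sketch contrasts two of the method families above on a sparse matrix, using SciPy's iterative (Lanczos-type) solver and scikit-learn's randomized solver; the matrix size, sparsity, and rank are arbitrary placeholders.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import randomized_svd

A = sparse_random(10_000, 1_000, density=1e-3, random_state=0, format="csr")

# Iterative (Lanczos-based) solver: a good fit when A is sparse and only a
# few singular triplets are needed.
U, s, Vt = svds(A, k=10)

# Randomized solver: a good fit when the spectrum decays quickly; the cost
# is roughly a few passes over the data.
U2, s2, Vt2 = randomized_svd(A, n_components=10, random_state=0)

print(np.sort(s)[::-1][:3], s2[:3])  # leading singular values should agree
```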
Tutorial 4: Recent Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Deep Learning
Abstract
Zeroth-order (ZO) optimization is increasingly embraced for solving big data and
machine learning problems when explicit expressions of the gradients are difficult or infeasible
to obtain. It achieves gradient-free optimization by approximating the full gradient via efficient
gradient estimators. Some recent important applications include: a) generation of
prediction-evasive, black-box adversarial attacks on deep neural networks, b) online network
management with limited computation capacity, c) parameter inference of black-box/complex
systems, and d) bandit optimization in which a player receives partial feedback in terms of loss
function values revealed by her adversary.
This tutorial aims to provide a comprehensive introduction to recent advances in ZO
optimization methods in both theory and applications. On the theory side, we will cover
convergence rate and iteration complexity analysis of ZO algorithms and make comparisons to
their first-order counterparts. On the application side, we will highlight one appealing application of ZO optimization: studying the robustness of deep neural networks through practical and efficient adversarial attacks that generate adversarial examples from a black-box machine learning model. We will also summarize potential research directions regarding ZO optimization, big data challenges, and some open-ended machine learning problems.
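As a concrete example of the gradient estimators mentioned above, here is a minimal two-point random-direction estimator; the toy objective, smoothing radius, and sample count are placeholders.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_samples=20, seed=0):
    """Estimate grad f(x) using only function evaluations."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape[0])
        # Symmetric finite difference along a random direction; averaging
        # over many directions approximates the full gradient.
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

f = lambda x: np.sum(x ** 2)        # toy objective with known gradient 2x
print(zo_gradient(f, np.ones(5)))   # approximately [2, 2, 2, 2, 2]
```

Plugging such an estimator into a gradient-descent loop yields a basic ZO method; the convergence analyses covered in the tutorial quantify the extra cost this estimation incurs relative to first-order methods.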
Tutorial 5: Big Data Analytics for Societal Event Forecasting
Abstract
Spatio-temporal societal event forecasting, which has traditionally been prohibitively challenging, is now becoming possible and experiencing rapid growth thanks to big data from Open Source Indicators (OSI) such as social media, news sources, blogs, economic indicators, and other metadata sources. Spatio-temporal societal event forecasting benefits society in various respects, such as anticipating political crises, humanitarian crises, mass violence, riots, mass migrations, disease outbreaks, economic instability, resource shortages, and responses to natural disasters. Different from traditional event detection, which identifies ongoing events, event forecasting focuses on predicting future events that have yet to happen. Also different from traditional spatio-temporal prediction on numerical indices, spatio-temporal event forecasting needs to leverage the heterogeneous information in OSI to discover predictive indicators and their mappings to future societal events. The resulting problems typically require predictive modeling techniques that can jointly handle semantic, temporal, and spatial information, and require the design of efficient algorithms that scale to high-dimensional, large real-world datasets.
In this tutorial, we will present a comprehensive review of the state-of-the-art methods for spatio-temporal societal event forecasting. First, we will categorize the OSI inputs and the predicted societal events commonly researched in the literature. Then we will review methods for temporal and spatio-temporal societal event forecasting. We will illustrate the basic theoretical and algorithmic ideas and discuss specific applications in all the above settings.
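To fix ideas, the following toy sketch shows the temporal (non-spatial) skeleton of such a forecasting task: trailing-window keyword counts from OSI-like sources predict next-day event occurrence. All data, window sizes, and features here are synthetic placeholders, not a method from the tutorial.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
days, n_keywords, window = 365, 50, 7
counts = rng.poisson(2.0, size=(days, n_keywords))  # e.g., daily protest-related term counts
events = rng.binomial(1, 0.1, size=days)            # 1 if a societal event occurred that day

# Each training example stacks the previous `window` days of keyword counts.
X = np.stack([counts[t - window:t].ravel() for t in range(window, days)])
y = events[window:]

model = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
print("held-out accuracy:", model.score(X[300:], y[300:]))
```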
Tutorial 6: Anomaly Detection in Cyber Physical Systems
Abstract
In a large distributed network, devices generate data with three V's: high volume, high velocity, and high variety. The data are generally unstructured and correlated. Quickly and accurately detecting anomalies in this massive amount of data is paramount, since detecting anomalies helps identify system faults and enables immediate countermeasures to mitigate faults and stop their propagation through the network. Yet it is very challenging, as it requires effective detection algorithms as well as an adequate understanding of the underlying physical process that generated the data.
In this tutorial, we will cover the elements of anomaly detection in a networked system, basic detection techniques, and their applications in the Internet of Things (IoT) and Cyber Physical Systems (CPS). First, we will introduce the concept and categories of anomalies; then we will focus on the models and algorithms for anomaly detection, grouping the existing detection techniques by the underlying models and approaches used. The statistical properties and algorithmic aspects of the detection methods will be discussed. Subsequently, using communication networks and power grids as examples, we will discuss how these detection techniques are applied in practice. Finally, we will discuss the outlook of this research topic and its relation to other areas of study.
We will focus on two broadly defined anomaly detection problems: 1) outlier detection
from a large dataset, and 2) change point detection from a dynamic process. Both offline
and online algorithms will be discussed.
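The sketch below illustrates both settings on synthetic data: a batch z-score outlier test and an online one-sided CUSUM change-point detector. The thresholds and drift parameter are placeholder values, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])  # mean shifts at t=500

# 1) Offline outlier detection: flag points far from the bulk of the data.
z = np.abs(x - x.mean()) / x.std()
outliers = np.flatnonzero(z > 3.0)

# 2) Online change-point detection: one-sided CUSUM for an upward mean shift.
def cusum(stream, drift=0.5, threshold=8.0):
    s = 0.0
    for t, xt in enumerate(stream):
        s = max(0.0, s + xt - drift)   # accumulate evidence above the drift level
        if s > threshold:
            return t                   # first alarm time
    return None

print("outliers:", outliers[:5], "change detected at t =", cusum(x))
```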
Tutorial 7: Creating Reproducible Bioinformatics Workflows Using BioDepot-workflow-Builder (BwB)
Abstract
Reproducibility is essential for the verification and advancement of scientific research. To reproduce the results of computational analyses, it is often necessary to recreate not just the code but also the software and hardware environment. Software containers like Docker, which distribute the entire computing environment, are rapidly gaining popularity in bioinformatics. Docker not only allows for the reproducible deployment of bioinformatics workflows, but also facilitates mixing and matching components from different workflows that have complex and possibly conflicting software requirements. However, the configuration and deployment of Docker, a command-line tool, can be exceedingly challenging for biomedical researchers with limited training in programming and technical skills.
We developed a drag-and-drop GUI called the Biodepot-Workflow-Builder (Bwb) to allow users to assemble, replicate, modify, and execute Docker workflows. Bwb represents individual software modules as widgets that are dragged onto a canvas and connected together to form a graphical representation of an analytical pipeline. These widgets allow the user interface to interact with software containers, so that software tools written in other languages are compatible and can be used to build modular bioinformatics workflows. We will present a case study using Bwb to create and execute an RNA sequencing data workflow.
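Bwb itself is a GUI, but the idea it builds on can be sketched in a few lines: each workflow step runs inside its own container, so the software environment travels with the step. The snippet below is a generic illustration, not Bwb's actual interface; the image names, commands, and paths are hypothetical.

```python
import subprocess

def run_step(image, command, data_dir):
    """Run one pipeline step in a Docker container, mounting the shared data dir."""
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image] + command,
        check=True,
    )

# A two-step sketch of an RNA-seq-style pipeline: align reads, then count.
# (Hypothetical images and tool arguments, for illustration only.)
run_step("example/aligner:latest", ["align", "--reads", "/data/reads.fq"], "/tmp/rnaseq")
run_step("example/counter:latest", ["count", "--bam", "/data/aligned.bam"], "/tmp/rnaseq")
```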
Tutorial 8: Managing Big Structured Data for Unsupervised Feature Representation Learning
Abstract
In recent years, there has been a surge of interest in learning expressive representations from large-scale structured inputs, ranging from time-series and string data to text and graph data. Transforming such a massive variety of structured inputs into a context-preserving representation in an unsupervised setting is a grand challenge for the research community. In many problem domains, it is easier to specify a reasonable dissimilarity (or similarity) function between instances than to construct a feature representation. Even for complex structured inputs, there are many well-developed dissimilarity measures, such as the Edit Distance (Levenshtein distance) between sequences, the Dynamic Time Warping measure between time series, the Hausdorff distance between sets, and the Wasserstein distance between distributions. However, most standard machine learning models are designed for inputs with a vector feature representation. In this tutorial, we aim to present a simple yet principled learning framework for generating vector representations of structured inputs such as time series, strings, text, and graphs from a well-defined distance function. The resulting representation can then be used for any downstream machine learning task, such as classification and regression problems. We present a comprehensive catalog of best practices for generating such vector representations and demonstrate state-of-the-art performance compared with the best existing methods across various analytic applications. We will also share our experiences with the various challenges in constructing these representations in different applications in the time-series, string, text, and graph domains.
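As a minimal sketch of the distance-based embedding idea (not the tutorial's exact framework), the code below represents variable-length time series by their DTW distances to a handful of randomly chosen "landmark" series, producing fixed-length vectors for any downstream model. The plain O(nm) DTW, the data, and the landmark count are assumptions.

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-programming DTW distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
series = [rng.standard_normal(rng.integers(40, 80)) for _ in range(100)]  # variable lengths
landmarks = [series[i] for i in rng.choice(len(series), size=8, replace=False)]

# Each (possibly variable-length) series becomes a fixed 8-dimensional vector.
features = np.array([[dtw(s, l) for l in landmarks] for s in series])
print(features.shape)   # (100, 8)
```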
Tutorial 9: Big Data for everyone: Modeling of Data Processing Pipelines for the Industrial Internet of Things
Abstract
In many application domains such as manufacturing, the integration and continuous processing of real-time sensor data from the Internet of Things (IoT) provides users with the opportunity to continuously monitor and detect upcoming situations. One example is the optimization of maintenance processes based on the current condition of machines (condition-based maintenance). While the continuous processing of events in scalable architectures is already well supported by the existing Big Data tool landscape (e.g., Apache Kafka, Apache Spark, or Apache Flink), building such applications still requires an enormous effort which, besides programming skills, demands a rather deep technical background in distributed, scalable infrastructures. At the same time, small and medium enterprises in the manufacturing domain, in particular, often do not have the expertise required to build such programs. Therefore, there is a need for more intuitive solutions supporting the development of real-time applications.
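To illustrate the kind of hand-written code this effort entails, here is a minimal PySpark Structured Streaming job that consumes machine events from Kafka and maintains per-machine averages. The broker address, topic name, and message format are assumptions, and the Spark-Kafka connector package must be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("condition-monitoring").getOrCreate()

# Read a stream of raw events from an assumed Kafka topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "machine-events")
       .load())

# Assume each message value is a CSV pair: "machine_id,temperature".
parsed = raw.selectExpr("CAST(value AS STRING) AS csv").select(
    F.split("csv", ",")[0].alias("machine_id"),
    F.split("csv", ",")[1].cast("double").alias("temperature"),
)

# Continuously aggregate and print per-machine average temperature.
query = (parsed.groupBy("machine_id")
         .agg(F.avg("temperature").alias("avg_temp"))
         .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()
```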
In this tutorial, we present methods and tools that enable the flexible modeling of real-time and batch processing pipelines by domain experts. We will present ongoing standardization efforts and novel, lightweight, semantics-based models that make it possible to enhance data streams and stream processing algorithms with background knowledge. Furthermore, we look deeper into the graphical modeling of processing pipelines, i.e., stream processing programs that can be defined with graphical tool support and are automatically deployed to distributed stream processors. The tutorial is accompanied by a hands-on session and includes many real-world examples and motivating scenarios gathered from a number of research and industry projects over the past years.