IEEE Big Data 2020 Tutorials
Tutorial 1: Big Data Applications for High-Impact / Emerging Topics: Case Studies on Molecular Biology and Its Applications to COVID-19
Tutorial 2: Big Data System Benchmarking: State of the Art, Current Practices, and Open Challenges
Tutorial 3: Systems and Algorithms for Massively Parallel Graph Mining
Tutorial 4: Semantic Exploration of Big Data
Tutorial 5: A Tutorial on Topological Data Analysis in Text Mining
Tutorial 6: Data Sources, Tools, and Techniques for Big Data-driven Machine Learning in Heliophysics
Tutorial 7: Big Data Stream Mining
Tutorial 8: Learning from Complex Medical Data
Tutorial 9: Big Sequence Management

Tutorial 1: Big Data Applications for High-Impact / Emerging Topics: Case Studies on Molecular Biology and Its Applications to COVID-19
Abstract
The development of big data foundations, infrastructure, management, and algorithms has proven
fruitful in recent years. To ensure broad impact, the next step is to deploy those
developed assets in real-life applications.
In particular, recent advances in high-throughput sequencing have revolutionized the field of molecular
biology, which now produces high volumes of sequencing data with molecular veracity, functional variety,
and real-life value.
At the beginning (30 minutes), we will teach the basic concepts of molecular biology. Specifically,
we will focus on the central dogma of molecular biology and the corresponding classic
bioinformatics tools for extracting useful information from high-throughput sequencing data at
different stages in genetics.
As the main content (1 hour), we will present big data case studies on high-impact topics in
molecular biology, as published in high-impact journals (i.e., impact factor > 5). Specifically, the fundamental
topics of big data sequence modelling and pattern recognition for molecular biology (e.g., DNA
motifs) will be introduced as the first case study. After that, we will switch to two advanced topics in
molecular biology for personalized medicine: off-target prediction
for CRISPR-Cas9 gene editing, and early cancer detection from blood.
Lastly (30 mins), we will demonstrate how different big data tools can be applied to COVID-19 in the
context of molecular modelling (e.g. vaccine design) and text mining on social media.
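To give a flavor of the pattern-recognition building blocks behind DNA motif discovery, here is a minimal k-mer counting sketch; the toy reads and the motif length k=4 are invented for illustration and stand in for real high-throughput sequencing data:

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count all length-k substrings (k-mers) across a set of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy reads; real motif discovery runs over millions of sequencing reads.
reads = ["ACGTACGT", "TACGTTAC", "GGACGTAA"]
print(kmer_counts(reads, 4).most_common(1)[0])  # ('ACGT', 4)
```

Over-represented k-mers like this are the raw signal that statistical motif-finding tools refine into position weight matrices.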
Tutorial 2: Big Data System Benchmarking: State of the Art, Current Practices, and Open Challenges
Abstract
Big data system benchmarking enables practitioners and
developers to assess a system's functionality and performance, so that they can make informed decisions when choosing
or improving big data systems. As we witness the emergence and evolution of various benchmarks for big data systems, in the form of either macro-benchmarks or micro-benchmarks, it is crucial to thoroughly
study, analyze, and understand the key techniques and applications of those benchmarks. In this tutorial, we offer a comprehensive presentation of a wide range of the
state-of-the-art benchmarks with a focus on big data systems. We classify these benchmarks into five categories:
Map-Reduce based system benchmarking, SQL-based analytical system benchmarking, NoSQL-based database benchmarking, Big graph system benchmarking, and Multi-model
database benchmarking. We discuss the key techniques of
each approach, as well as the current practices. We also
provide insights on the research challenges and directions
for benchmarking different big data systems.
Tutorial 3: Systems and Algorithms for Massively Parallel Graph Mining
Abstract
Big graph processing systems such as Pregel, GraphLab, GraphX and Gemini have become increasingly popular thanks to their emphasis on ease of programming. Unfortunately, these frameworks are predominantly designed for IO-bound execution and are only suitable for problems with low time complexity. Graph mining problems such as finding dense and frequent subgraph structures usually have very high time complexity, and when IO-bound systems are applied, the performance is catastrophic. However, this problem still does not receive enough attention: many graph-parallel algorithm and system researchers continue to use those IO-bound systems to address compute-heavy graph mining problems.
In this tutorial, we explicitly categorize the popular graph mining problems into IO-heavy and CPU-heavy categories, and provide evidence that CPU-heavy graph mining problems should not be addressed using IO-bound systems, which can lead to performance worse than even a serial algorithm. We then introduce two recent compute-intensive solutions for mining dense subgraph structures and frequent subgraph patterns, respectively, that satisfactorily address the IO-bound issue of existing systems. The key design is to expose an explicit task-based divide-and-conquer API to users, in contrast to the existing iterative computation paradigms. We will also show how to develop popular graph mining algorithms in these frameworks, including finding maximum cliques, triangle counting, finding maximal γ-quasi-cliques, and finding k-plexes.
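As a minimal illustration of one compute-heavy kernel named above, here is a serial triangle-counting sketch based on neighbor-set intersection; the toy graph is invented, and the frameworks discussed in the tutorial parallelize exactly this kind of enumeration via task-based divide and conquer:

```python
def count_triangles(edges):
    """Count triangles by intersecting the neighbor sets of each edge's endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is discovered once per edge, hence the division by 3.
    closed = sum(len(adj[u] & adj[v]) for u, v in edges)
    return closed // 3

# Toy graph: a 4-clique contains exactly 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))  # 4
```

The set intersections are pure in-memory computation with no IO at all, which is why such workloads gain nothing from IO-bound frameworks.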
Tutorial 4: Semantic Exploration of Big Data
Abstract
As the volume of available information increases,
more and more datasets meet the criteria of Big Data, creating the need for systems that can support
their exploration. As the characteristics of Big Data pose new
challenges that render existing solutions obsolete, in recent
years there have been many efforts to identify new techniques
and develop systems that enable the semantic exploration
of Big Data by a wider audience. Some of these systems rely
on Semantic Web standards, which allow them to provide
incremental navigation based on semantic ontologies, while
others take advantage of the structure of the RDF model to
provide exploration based on graph topology. Other systems focus
on the datasets offered through SPARQL endpoints, aiming either
to identify their schema or to provide support for query creation.
The technique around which a system is designed is very important,
as it determines the challenges that are addressed, the datasets
and use cases that apply, as well as any limitations. In addition, specific design decisions may differentiate
systems implementing the same technique. This tutorial will present the characteristics of
Big Data, discuss the challenges for their exploration and
understanding, and cover techniques that aim to overcome them.
The tutorial will offer the audience a deep understanding of
the strengths and weaknesses of these techniques, the use cases
and datasets to which they can be applied, as well as an overview of
the systems currently implementing them and their
functionalities.
Tutorial 5: A Tutorial on Topological Data Analysis in Text Mining
Abstract
Topological Data Analysis (TDA) introduces methods that capture the underlying structure
of shapes in data. Despite the long history of computational geometry and computational
topology in applied mathematics, the use of topology in data science is a relatively new
phenomenon. Within the last decade, TDA has mostly been examined in unsupervised
machine learning tasks. As an alternative to conventional algorithms, TDA is often
considered for its capability to deal with high-dimensional data in tasks
including, but not limited to, clustering, dimensionality reduction, and descriptive modeling.
This tutorial will focus on applications of topological data analysis to text, and in particular to text classification. After introducing the fundamentals, we will show three ways
in which topological information can be added to improve classification accuracy.
More specifically, we explain three different methods of extracting topological features from
textual documents, using the two most popular underlying representations of text,
namely term frequency vectors and word embeddings, and also without using any
conventional features. In addition, we show how even the simplest out-of-the-box topological methods can be used to provide similarity judgments, e.g. topological plots of classical
novels.
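As a small taste of how topological features arise, the following sketch computes 0-dimensional persistence (the distance thresholds at which connected components merge) from toy 2-D points via a minimum spanning tree; real text pipelines would use a TDA library such as GUDHI or Ripser on term frequency vectors or embeddings, and the points below are invented:

```python
import math

def zero_dim_persistence(points):
    """Death times of 0-dimensional features: as the distance threshold grows,
    components merge along minimum-spanning-tree edges (Kruskal with union-find)."""
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies (merges) at threshold d
    return deaths

# Toy "document embeddings": two tight clusters far apart.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(zero_dim_persistence(pts))  # [1.0, 1.0, 10.0]
```

The long gap between the 1.0 deaths and the 10.0 death is the topological signal that the data forms two clusters; such lifetimes are one kind of feature that can be fed to a classifier.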
Tutorial 6: Data Sources, Tools, and Techniques for Big Data-driven Machine Learning in Heliophysics
Abstract
During the past decade, Georgia State University's (GSU) Data Mining Lab (DMLab) has
been conducting research on a wide range of topics centered on the understanding, detection,
and forecasting of solar events, which can (directly or indirectly) have significant
economic and collateral impacts on mankind through electromagnetic radiation
and energetic particles. The close collaboration of computer scientists and solar
physicists, solely dedicated to research on solar events using advanced statistical
tools, machine learning (ML), and deep learning (DL), has resulted in a couple of hundred
in-depth studies in this domain. Many of these studies have been published in prestigious
journals such as Nature's Scientific Data and The Astrophysical Journal. In this tutorial,
we present some of the methodologies we engineered, the challenges we faced, and
the products we put together. We believe our solutions and products can stimulate new
data-driven discoveries in heliophysics, as well as serve and inspire communities in other
domains.
Tutorial 7: Big Data Stream Mining
Abstract
The data revolution of the last two decades has changed almost every aspect of contemporary data
analytics. The size of data is constantly growing, and one cannot store all of it. Data
is in fast motion, constantly expanding and changing its properties. The velocity of data gave rise to the notion of data
streams: potentially unbounded collections of data that continuously flood the system. As new data is continuously
arriving, storing the entire stream is not a viable option. One needs to analyze new instances on the fly, incorporate
the useful information into the classifier, and then discard them. Data streams are also subject to a phenomenon known as
concept drift, where the properties of the stream undergo unexpected changes over time. This includes not only the
discriminatory power of features, but also the size of the feature space, the ratios of instances among classes, and the
emergence and disappearance of features and classes.
To accommodate such characteristics, data streams inspired the development of new families of algorithms capable of
continuously integrating new data, while being robust to its evolving nature.
This tutorial aims at introducing the audience to both basic concepts of learning from data streams, as well as to
advanced contemporary research topics in this area. Special properties and requirements of data stream mining will
be highlighted, to clearly emphasize the state of this field, as well as underlying challenges and emerging trends.
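The learn-then-discard loop described above can be sketched with a minimal incremental nearest-class-mean classifier; the class name and the toy stream are invented, and production stream miners would add drift detectors and forgetting mechanisms on top of such a learner:

```python
class StreamingClassMeans:
    """Incremental nearest-class-mean classifier: each instance updates the
    running mean of its class and is then discarded, as stream mining requires."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def learn_one(self, x, y):
        s = self.sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict_one(self, x):
        def dist2(y):  # squared distance to the running mean of class y
            c = self.counts[y]
            return sum((v - s / c) ** 2 for v, s in zip(x, self.sums[y]))
        return min(self.counts, key=dist2)

clf = StreamingClassMeans()
for x, y in [((0.1, 0.2), "a"), ((0.9, 0.8), "b"), ((0.2, 0.1), "a")]:
    clf.learn_one(x, y)  # one pass, constant memory per class
print(clf.predict_one((0.0, 0.0)))  # a
```

Because only per-class sums and counts are kept, memory stays constant no matter how long the stream runs; handling concept drift would additionally require decaying or resetting these statistics.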
Tutorial 8: Learning from Complex Medical Data
Abstract
Big data analytics and machine learning methods are intensively employed in medicine and healthcare.
Electronic Health Records (EHRs) are perceived as big patient data. Using them, scientists strive to
predict patients' progress, to understand and predict response to treatment, to detect
adverse drug effects and risk factors for cardiovascular disease, and to tackle many other learning tasks. Medical
researchers are also interested in learning from cohorts of population-based studies and of
experiments. Learning tasks include the identification of disease predictors that can lead to new
diagnostic tests, and the acquisition of insights on interventions.
In this tutorial, we elaborate on data sources, methods, and case studies in medical mining. In addition to the
aforementioned conventional data sources, we address the potential of data from mobile devices. We
discuss the learning problems that can be solved with those data, present case studies, and
investigate the methods needed to prepare and mine those data and to present the results to a medical
expert. Furthermore, we will emphasize the need for interpretable and explainable models that can
inspire trust and facilitate informed decision making. Towards this goal, we will discuss and elaborate
on actionable models for complex EHR data and their applicability to the interpretation of black-box
models, such as deep learning architectures.
Tutorial 9: Big Sequence Management
Abstract
Data series are a prevalent data type that has
attracted considerable interest in recent years. Specifically, there has
been explosive interest in the analysis of large volumes
of data series in many different domains, both in
business (e.g., in mobile applications) and in science (e.g., in
biology). In this tutorial, we focus on applications that produce
massive collections of data series, and we provide the necessary
background on data series storage, retrieval and analytics. We
look at systems historically used to handle and mine data in the
form of data series, as well as at the state of the art data series
management systems that were recently proposed. Moreover,
we discuss the need for fast similarity search for supporting
data mining applications, and describe efficient similarity search
techniques, indexes and query processing algorithms. Finally,
we look at the gap in modern data series management systems
with regard to support for efficient complex analytics, and we
argue in favor of the integration of summarizations and indexes
in modern data series management systems. We conclude with
the challenges and open research problems in this domain.
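To make the similarity-search discussion concrete, here is a brute-force sketch of best-match search under z-normalized Euclidean distance, a standard measure for data series; the series and query values are invented, and indexes such as iSAX are what make such searches sublinear on massive collections:

```python
import math

def znorm(s):
    """Z-normalize a sequence so similarity ignores offset and amplitude."""
    m = sum(s) / len(s)
    sd = math.sqrt(sum((v - m) ** 2 for v in s) / len(s)) or 1.0
    return [(v - m) / sd for v in s]

def best_match(series, query):
    """Return (offset, distance) of the query's best z-normalized match over
    all sliding windows of the series (brute force, O(n * m) per query)."""
    q = znorm(query)
    best = (None, float("inf"))
    for i in range(len(series) - len(query) + 1):
        w = znorm(series[i:i + len(query)])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w)))
        if d < best[1]:
            best = (i, d)
    return best

series = [0, 0, 1, 2, 3, 2, 1, 0, 0, 0]
off, d = best_match(series, [10, 20, 30, 20, 10])
print(off)  # 2: the query is an offset/scale transform of series[2:7]
```

Z-normalizing each window is what lets a query at one amplitude find the same shape at another, which is why data series indexes are built over normalized subsequences.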