2020 IEEE International Conference on Big Data


Tutorial 1:

Big Data Applications for High-Impact / Emerging Topics: Case Studies on Molecular Biology and Its Applications to COVID-19

Tutorial 2:

Big Data System Benchmarking: State of the Art, Current Practices, and Open Challenges

Tutorial 3:

Systems and Algorithms for Massively Parallel Graph Mining

Tutorial 4:

Semantic Exploration of Big Data

Tutorial 5:

A Tutorial on Topological Data Analysis in Text Mining

Tutorial 6:

Data Sources, Tools, and Techniques for Big Data-driven Machine Learning in Heliophysics

Tutorial 7:

Big data stream mining

Tutorial 8:

Learning from complex medical data

Tutorial 9:

Big Sequence Management

Tutorial 1: Big Data Applications for High-Impact / Emerging Topics: Case Studies on Molecular Biology and Its Applications to COVID-19

Tutorial PPT

Ka-Chun Wong

Department of Computer Science, City University of Hong Kong
Email: kc.w@cityu.edu.hk

Jiecong Lin

Department of Computer Science, City University of Hong Kong
Email: jieconlin3-c@my.cityu.edu.hk

Shixiong Zhang

Department of Computer Science, City University of Hong Kong
Email: sxzhang7-c@my.cityu.edu.hk

Xiangtao Li

Department of Computer Science, City University of Hong Kong
Email: sxzhang7-c@my.cityu.edu.hk

Abstract
The development of big data foundations, infrastructure, management, and algorithms has proven fruitful in recent years. To ensure broad impact, the next step is to deploy those assets in real-life applications. In particular, recent advances in high-throughput sequencing have revolutionized the field of molecular biology, which now produces high volumes of sequencing data with molecular veracity, functional variety, and real-life value.
At the beginning (30 mins), we will teach you the basic concepts in molecular biology. Specifically, we will focus on the central dogma of molecular biology and the corresponding classic bioinformatics tools for extracting useful information from the high-throughput sequencing data at different stages in genetics.
As the main content (1 hr), we will present big data case studies on high-impact topics in molecular biology, as published in high-impact journals (i.e., impact factor > 5). Specifically, the fundamental topics in big data sequence modelling and pattern recognition for molecular biology (e.g., DNA motifs) will be introduced as the first case study. After that, we will switch to two advanced topics in molecular biology for personalized medicine solutions. The first is off-target prediction for CRISPR-Cas9 gene editing, and the second is early cancer detection from blood.
Lastly (30 mins), we will demonstrate how different big data tools can be applied to COVID-19 in the context of molecular modelling (e.g. vaccine design) and text mining on social media.

Tutorial 2: Big Data System Benchmarking: State of the Art, Current Practices, and Open Challenges

Tutorial PPT

Chao Zhang

Department of Computer Science, University of Helsinki, Finland
Email: chao.z.zhang@helsinki.fi

Jiaheng Lu

Department of Computer Science, University of Helsinki, Finland
Email: jiahenglu@gmail.com

Abstract
Big data system benchmarking enables practitioners and developers to assess the functionality and performance of systems so that they can make informed decisions when choosing big data systems, or improve them. As we are witnessing the emergence and evolution of various benchmarks for big data systems, in the form of either macro-benchmarks or micro-benchmarks, it is crucial to thoroughly study, analyze, and understand the key techniques and applications of those benchmarks. In this tutorial, we offer a comprehensive presentation of a wide range of state-of-the-art benchmarks with a focus on big data systems. We classify these benchmarks into five categories: MapReduce-based system benchmarking, SQL-based analytical system benchmarking, NoSQL-based database benchmarking, big graph system benchmarking, and multi-model database benchmarking. We discuss the key techniques of each approach, as well as the current practices. We also provide insights on the research challenges and directions for benchmarking different big data systems.

Tutorial 3: Systems and Algorithms for Massively Parallel Graph Mining

Tutorial PPT

Da Yan

Department of Computer Science, The University of Alabama at Birmingham
Email: yanda@uab.edu

Guimu Guo

Department of Computer Science, The University of Alabama at Birmingham
Email: guimuguo@uab.edu

Abstract
Big graph processing systems such as Pregel, GraphLab, GraphX and Gemini have become increasingly popular thanks to their emphasis on ease of programming. Unfortunately, these frameworks are predominantly designed for IO-bound execution and are only suitable for problems with a low time complexity. Graph mining problems such as finding dense and frequent subgraph structures usually have a very high time complexity, and when IO-bound systems are applied to them, performance is catastrophic. However, this problem is still not getting enough attention among the many graph-parallel algorithm and system researchers who still use those IO-bound systems to address compute-heavy graph mining problems. In this tutorial, we explicitly categorize the popular graph mining problems into IO-heavy and CPU-heavy categories, and provide prior evidence that CPU-heavy graph mining problems should not be addressed using IO-bound systems, which can lead to performance worse than even a serial algorithm. We then introduce two recent compute-intensive solutions to mining dense subgraph structures and frequent subgraph patterns, respectively, that satisfactorily address the IO-bound issue of existing systems. The key design is to expose an explicit task-based divide-and-conquer API to users, in contrast to the existing iterative computation paradigms. We will also show how to develop popular graph mining algorithms in these frameworks, including finding maximum cliques, triangle counting, finding maximal γ-quasi-cliques, and finding k-plexes.
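
To make the task-based divide-and-conquer idea concrete, here is a minimal Python sketch of triangle counting split into independent per-vertex tasks. It is only an illustration under our own assumptions: the function names (build_adjacency, triangle_task, count_triangles) are hypothetical and are not the API of the systems covered in the tutorial. Each task counts only the triangles in which its vertex has the smallest ID, so no triangle is counted twice and the tasks could in principle be scheduled on separate workers.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangle_task(v, adj):
    """One independent task: count triangles in which v is the smallest vertex ID."""
    higher = {u for u in adj[v] if u > v}
    count = 0
    for u in higher:
        # A triangle (v, u, w) with v < u < w exists when w neighbors both v and u.
        count += sum(1 for w in adj[u] if w > u and w in higher)
    return count


def count_triangles(edges):
    """Divide the problem into per-vertex tasks and combine their results."""
    adj = build_adjacency(edges)
    # The tasks share only read-only data; here they simply run one after another.
    return sum(triangle_task(v, adj) for v in adj)
```

For example, count_triangles can be called on the six edges of a complete graph on four vertices; because each task only reads the adjacency map, the per-vertex calls are candidates for parallel execution.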

Tutorial 4: Semantic Exploration of Big Data

Tutorial PPT

Maria Krommyda

School of ECE, NTUA, Athens, Greece
Email: maria.krommyda@iccs.gr

Verena Kantere

School of ECE, NTUA, Athens, Greece
Email: verena.kantere@gmail.com

Abstract
As the volume of available information increases, more and more datasets meet the criteria for being characterized as Big Data, thus creating the need for systems that can support their exploration. As the characteristics of Big Data create new challenges that render existing solutions obsolete, in recent years there have been many efforts to identify new techniques and develop systems that will enable the semantic exploration of Big Data by a wider audience. Some of these systems rely on the Semantic Web standard, which allows them to provide incremental navigation based on semantic ontologies, while others take advantage of the structure of the RDF model to provide exploration based on graph topology. Other systems focus on the datasets offered through SPARQL endpoints, aiming either to identify their schema or to provide support for query creation. The technique on which a system is built is very important, as it determines the challenges that are addressed, the datasets and use cases that can be supported, as well as any limitations. In addition, specific design decisions may create differentiation between systems implementing the same technique. We propose a tutorial that will present the characteristics of Big Data and discuss the challenges for their exploration and understanding, as well as techniques that aim to overcome them. The tutorial will offer the audience a deep understanding of the strengths and weaknesses of the techniques, the use cases and datasets they can be applied to, as well as an overview of the available systems currently implementing them and their functionalities.

Tutorial 5: A Tutorial on Topological Data Analysis in Text Mining

Tutorial PPT

Wlodek Zadrozny

College of Computing, School of Data Science, University of North Carolina at Charlotte
Email: wzadrozn@uncc.edu

Shafie Gholizadeh

College of Computing, University of North Carolina at Charlotte
Email: Sgholiza@uncc.edu

Abstract
Topological Data Analysis (TDA) introduces methods that capture the underlying structure of shapes in data. Despite the long history of computational geometry and computational topology in applied mathematics, the use of topology in data science is a relatively new phenomenon. Within the last decade, TDA has mostly been examined in unsupervised machine learning tasks. TDA has often been considered as an alternative to conventional algorithms because of its capability to deal with high-dimensional data in different tasks, including but not limited to clustering, dimensionality reduction, and descriptive modeling. This tutorial will focus on applications of topological data analysis to text, and in particular to text classification. After introducing the fundamentals, we will show three ways in which topological information can be added to improve the accuracy of classification. More specifically, we explain three different methods of extracting topological features from textual documents, using as the underlying representations of text the two most popular methods, namely term frequency vectors and word embeddings, and also without using any conventional features. In addition, we show how even the simplest out-of-the-box topological methods can be used to provide similarity judgments, e.g. topological plots of classical novels.

Tutorial 6: Data Sources, Tools, and Techniques for Big Data-driven Machine Learning in Heliophysics

Tutorial PPT

Azim Ahmadzadeh

Computer Science Department, Georgia State University, 30303, GA
Email: aahmadzadeh1@cs.gsu.edu

Berkay Aydin

Computer Science Department, Georgia State University, 30303, GA
Email: baydin2@gsu.edu

Dustin J. Kempton

Computer Science Department, Georgia State University, 30303, GA
Email: dkempton1@cs.gsu.edu

Rafal A. Angryk

Computer Science Department, Georgia State University, 30303, GA
Email: angryk@cs.gsu.edu

Abstract
During the past decade, Georgia State University’s (GSU) Data Mining Lab (DMLab) has been conducting research on a wide range of topics centering on the understanding, detection, and forecasting of solar events, which can (directly or indirectly) have significant economic and collateral impacts on mankind through electromagnetic radiation and energetic particles. The close collaboration of computer scientists and solar physicists, dedicated solely to research on solar events using advanced statistical tools, machine learning (ML) and deep learning (DL), has resulted in a couple of hundred in-depth studies in this domain. Many of these studies have been published in prestigious journals such as Nature’s Scientific Data and The Astrophysical Journal. We would like to prepare a tutorial on some of the methodologies we engineered, the challenges we faced, and the products we put together. We believe our solutions and products can stimulate new data-driven discoveries in heliophysics, as well as serve and inspire communities in other domains.

Tutorial 7: Big data stream mining

Tutorial PPT

Bartosz Krawczyk

Department of Computer Science, Virginia Commonwealth University, USA
Email: bkrawczyk@vcu.edu

Alberto Cano

Department of Computer Science, Virginia Commonwealth University, USA
Email: acano@vcu.edu

Abstract
The data revolution over the last two decades has changed almost every aspect of contemporary data analytics. One must consider the fact that the size of data is constantly growing, and one cannot store all of it. Data is in fast motion, constantly expanding and changing its properties. The velocity of data gave rise to the notion of data streams, potentially unbounded collections of data that continuously flood the system. As new data is continuously arriving, storing the entire data stream is not a viable option. One needs to analyze new instances on the fly, incorporate the useful information into the classifier, and discard them. Data streams are also subject to a phenomenon known as concept drift, where the properties of the stream are subject to unexpected changes over time. This includes not only the discriminatory power of features, but also the size of the feature space, the ratios of instances among classes, as well as the emergence and disappearance of features and classes. To accommodate such characteristics, data streams inspired the development of new families of algorithms capable of continuously integrating new data, while being robust to its evolving nature. This tutorial aims at introducing the audience to both the basic concepts of learning from data streams and advanced contemporary research topics in this area. Special properties and requirements of data stream mining will be highlighted, to clearly emphasize the state of this field, as well as underlying challenges and emerging trends.
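
As a rough illustration of the "analyze, incorporate, discard" loop described above, the following Python sketch evaluates a classifier on each arriving instance before learning from it (a test-then-train loop), and never stores the stream. The classifier is a deliberately simple stand-in chosen for this example, not a method from the tutorial: FadingCentroidClassifier and its decay parameter are hypothetical names, and the exponential forgetting is just one basic way to let a model follow gradual concept drift.

```python
class FadingCentroidClassifier:
    """Hypothetical online nearest-centroid classifier with exponential forgetting."""

    def __init__(self, decay=0.1):
        self.decay = decay      # weight given to the newest instance
        self.centroids = {}     # label -> running centroid (list of floats)

    def predict(self, x):
        """Return the label of the closest centroid, or None before any training."""
        if not self.centroids:
            return None
        return min(
            self.centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(x, self.centroids[label])
            ),
        )

    def learn(self, x, y):
        """Fold one labeled instance into the model; the instance itself is not kept."""
        if y not in self.centroids:
            self.centroids[y] = list(x)   # a new class has emerged
            return
        old = self.centroids[y]
        self.centroids[y] = [
            (1 - self.decay) * c + self.decay * v for c, v in zip(old, x)
        ]


def prequential_accuracy(stream, model):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total if total else 0.0
```

Any iterable of (features, label) pairs can serve as the stream, including a generator that never materializes the full data. A larger decay makes the model forget old data faster, which helps after a drift but makes it noisier on a stable stream.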

Tutorial 8: Learning from complex medical data

Tutorial PPT

Myra Spiliopoulou

Faculty of Computer Science, Otto-von-Guericke-University Magdeburg
Email: myra@ovgu.de

Panagiotis Papapetrou

Data Science group, Department of Computer and Systems Sciences, PO Box 7003, 164 07, Stockholm, Sweden
Email: panagiotis@dsv.su.se

Jaakko Hollmen

Data Science group, Department of Computer and Systems Sciences, PO Box 7003, 164 07, Stockholm, Sweden
Email: jaakko.hollmen@dsv.su.se

Abstract
Big data analytics and machine learning methods are intensively employed in medicine and healthcare. Electronic Health Records (EHRs) are perceived as big patient data. Using them, scientists strive to predict patients' progress, to understand and predict response to treatment, to detect adverse drug effects and risk factors for cardiovascular disease, and to address many other learning tasks. Medical researchers are also interested in learning from cohorts of population-based studies and of experiments. Learning tasks include the identification of disease predictors that can lead to new diagnostic tests and the acquisition of insights on interventions. In this tutorial, we elaborate on data sources, methods, and case studies in medical mining. In addition to the aforementioned conventional data sources, we address the potential of data from mobile devices. We discuss the learning problems that can be solved with those data, present case studies, and investigate the methods needed to prepare and mine those data and to present the results to a medical expert. Furthermore, we will emphasize the need for interpretable and explainable models that can inspire trust and facilitate informed decision making. Towards this goal, we will discuss and elaborate on actionable models for complex EHR data and their applicability to the interpretation of black-box models, such as deep learning architectures.

Tutorial 9: Big Sequence Management

Tutorial PPT

Karima Echihabi

Mohammed VI Polytechnic University (UM6P)
Email: karima.echihabi@gmail.com

Kostas Zoumpatianos

Harvard University
Email: kostas@seas.harvard.edu

Themis Palpanas

LIPADE, Universite de Paris & French University Institute (IUF)
Email: themis@mi.parisdescartes.fr

Abstract
Data series are a prevalent data type that has attracted a lot of interest in recent years. Specifically, there has been an explosion of interest in the analysis of large volumes of data series in many different domains, both in business (e.g., in mobile applications) and in the sciences (e.g., in biology). In this tutorial, we focus on applications that produce massive collections of data series, and we provide the necessary background on data series storage, retrieval and analytics. We look at systems historically used to handle and mine data in the form of data series, as well as at the state-of-the-art data series management systems that were recently proposed. Moreover, we discuss the need for fast similarity search for supporting data mining applications, and describe efficient similarity search techniques, indexes and query processing algorithms. Finally, we look at the gap in modern data series management systems with regard to support for efficient complex analytics, and we argue in favor of the integration of summarizations and indexes in modern data series management systems. We conclude with the challenges and open research problems in this domain.
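
To illustrate why summarizations help similarity search, here is a small Python sketch of exact nearest-neighbor search under Euclidean distance that uses Piecewise Aggregate Approximation (PAA) as the summary. PAA is our choice for this example and is not necessarily the summarization the tutorial focuses on; the function names and the segments parameter are likewise hypothetical. The sketch assumes all series have the same length and that this length is divisible by the number of segments. The distance between two PAA summaries, scaled by the square root of the segment length, never exceeds the true Euclidean distance, so a candidate whose bound is already no better than the best match found so far can be skipped without computing its full distance.

```python
import math


def paa(series, segments):
    """Summarize a series by the mean of each of `segments` equal-length pieces."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("series length must be divisible by the number of segments")
    size = n // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_series(query, collection, segments=2):
    """Exact 1-NN search that prunes candidates using the PAA lower bound.

    Returns (index of best match, its distance, number of pruned candidates).
    """
    size = len(query) // segments
    query_summary = paa(query, segments)
    best_index, best_distance, pruned = None, math.inf, 0
    for index, candidate in enumerate(collection):
        # A real system would precompute and index the candidate summaries.
        lower_bound = math.sqrt(size) * euclidean(query_summary, paa(candidate, segments))
        if lower_bound >= best_distance:
            pruned += 1          # cannot beat the current best; skip the full distance
            continue
        distance = euclidean(query, candidate)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance, pruned
```

Because the bound never overestimates the true distance, pruning does not change the answer; it only avoids work. Coarser summaries are cheaper to store and compare but prune fewer candidates.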