Tutorial 1: Process mining: Leveraging event data to understand and improve organizations

  • Henrik Leopold
  • Email: han.van.der.aa@hu-berlin.de

  • Han van der Aa

Abstract
Process mining is a family of data analysis methods that aims to discover, monitor, and improve organizational processes by analyzing data from so-called event logs. These event logs are generated by the various information systems used in an organization and, therefore, capture how organizational processes are actually executed. The main difference from traditional data analysis techniques is that process mining explicitly focuses on the process perspective. That is, it aims to reveal the complex order relations among the activities captured in the event log. In this tutorial, we give an introduction to the field of process mining and focus on its two most common tasks: (1) process discovery and (2) conformance checking. In the discovery part, we show that process mining techniques can be used to learn and visualize how a process is actually running in practice. In the conformance checking part, we explain how process mining can be used to detect differences between intended behavior (captured in a normative specification) and actual behavior (as found in the event log). Besides introducing the required theory and mechanisms behind discovery and conformance checking, we will put an emphasis on demonstrating how process discovery and conformance checking can be conducted using the open-source tool ProM. In this way, participants will learn how the introduced concepts can be applied and how they can successfully use process mining themselves.
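
As a rough illustration of the two tasks named in the abstract, the following Python sketch counts directly-follows relations in a small event log (a very simple form of discovery) and flags steps that a normative specification does not allow (a very simple form of conformance checking). This is not how ProM works internally. The event log, the activity names, and the set of allowed steps are all hypothetical, and real conformance checking typically relies on richer techniques such as token replay or alignments.

```python
from collections import Counter


def directly_follows(log):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def deviations(trace, allowed):
    """Return the steps of a trace that the normative specification does not allow."""
    return [(a, b) for a, b in zip(trace, trace[1:]) if (a, b) not in allowed]


# Hypothetical event log: one list of activity names per case.
event_log = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "approve", "pay"],
]

# Discovery: learn the directly-follows relations actually observed.
dfg = directly_follows(event_log)

# Conformance: compare each trace against the (assumed) intended behavior.
allowed = {
    ("register", "check"),
    ("check", "approve"),
    ("check", "reject"),
    ("approve", "pay"),
}
violations = [deviations(trace, allowed) for trace in event_log]
```

In this toy log, the third case skips the check activity, so its first step is the only one reported as a deviation.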



Tutorial 2: Taming Unstructured Big Data: Automated Information Extraction from Massive Text

  • Xuan Wang
  • Email: xwang174@illinois.edu

  • Yu Zhang
  • Qi Li
  • Jiawei Han

Abstract
Text data is a powerful information source that covers almost every aspect of our life. Automated information extraction has attracted considerable attention, with various approaches developed to mine structured knowledge from unstructured text. In this tutorial, we present an organized picture of automated information extraction from massive text to address the need for a systematic review and comparison of the techniques. We first introduce major tasks of information extraction such as named entity recognition and relation extraction. Then we introduce downstream applications such as heterogeneous information network construction and claim mining that utilize the extracted information. Specifically, we focus on methods that are scalable, effective, minimally supervised, and applicable to various kinds of text (e.g., news and biomedical science). We also demonstrate on a real-world dataset, PubMed, which includes over 29 million biomedical publications, how the heterogeneous information network can be constructed and how scientific claims can be automatically retrieved based on automated information extraction. The covered topics will be interesting to both advanced researchers and beginners in data mining, text mining, natural language processing and machine learning.



Tutorial 3: Secure and Privacy-Preserving Big-Data Processing

  • Anton Burtsev
  • Sharad Mehrotra
  • Shantanu Sharma
  • Email: shantanu.sharma@uci.edu

Abstract
Over the last decade, public and private clouds have emerged as de facto platforms for big-data analytical workloads. Outsourcing one’s data to the cloud, however, comes with multiple security and privacy challenges. In a world where service providers can be located anywhere, fall under varying legal jurisdictions (i.e., be subject to different laws governing the privacy and confidentiality of one’s data), and be the target of well-sponsored (sometimes even government-sponsored) security attacks, protecting data in a cloud is far from trivial. This tutorial focuses on two principal lines of research (cryptographic and hardware-based) aimed at providing secure processing of big data in a modern cloud. First, we focus on cryptographic (encryption- and secret-sharing-based) techniques developed over the last two decades and specifically compare them based on efficiency and information leakage. We demonstrate that despite extensive research on cryptography, secure query processing over outsourced data remains an open challenge. We then survey the landscape of emerging secure hardware, i.e., recent hardware extensions like Intel’s Software Guard Extensions (SGX) aimed at securing third-party computations in the cloud. Unfortunately, despite being designed to provide a secure execution environment, existing SGX implementations suffer from a range of side-channel attacks that require careful software techniques to make them practically secure. Taking SGX as an example, we will discuss representative classes of side-channel attacks and the security challenges involved in the construction of hardware-based data processing systems. We conclude that neither cryptographic techniques nor secure hardware is sufficient on its own. To provide efficient and secure large-scale data processing in the cloud, a new line of work that combines software and hardware mechanisms is required. We discuss an orthogonal approach designed around the concept of data partitioning, i.e., splitting the data processing into cryptographically secure and non-secure parts. Finally, we will discuss some open questions in designing secure cryptographic techniques that can process large-sized data efficiently.
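
To make the secret-sharing-based techniques mentioned in the abstract more concrete, here is a minimal Python sketch of additive secret sharing, one common textbook scheme. It is an illustration only and not necessarily the scheme the tutorial covers. The modulus, the function names, and the example values are assumptions chosen for this sketch.

```python
import secrets

# Assumed modulus for this sketch; all values must be smaller than it.
PRIME = 2**61 - 1


def share(value, n):
    """Split value into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine all shares; any proper subset reveals nothing about the value."""
    return sum(shares) % PRIME


def add_shares(xs, ys):
    """Each server adds its two shares locally, without seeing the inputs."""
    return [(x + y) % PRIME for x, y in zip(xs, ys)]


# Hypothetical usage: two values outsourced to three non-colluding servers.
salary_a = share(20, 3)
salary_b = share(22, 3)
total = reconstruct(add_shares(salary_a, salary_b))
```

Addition works on shares without any communication between servers, which is part of why such schemes are attractive for outsourced aggregation. Richer queries generally cost more, in line with the efficiency and leakage trade-offs the abstract refers to.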



Tutorial 4: NewSQL: principles, systems and current trends

  • Patrick Valduriez
  • Email: Patrick.Valduriez@inria.fr

  • Ricardo Jimenez-Peris

Abstract
NewSQL is the latest technology in the big data management landscape, enjoying fast growth in the DBMS and BI markets. NewSQL combines the scalability and availability of NoSQL with the consistency and usability of SQL. By providing online analytics over operational data, NewSQL opens up new opportunities in many application domains where real-time decision making is critical. Important use cases include Google AdWords, proximity marketing, real-time pricing, risk monitoring, and real-time fraud detection. NewSQL may also simplify data management by removing the traditional separation between operational database and data warehouse / data lake (no more ETLs!). However, a hard problem is scaling out transactions in mixed operational and analytical (HTAP) workloads over big data, possibly coming from different data stores (HDFS, SQL, NoSQL). Today, only a few NewSQL systems have solved this problem. This tutorial provides an in-depth presentation of NewSQL, with its principles, architectures and techniques. We provide a taxonomy of NewSQL systems based on major dimensions including targeted workloads, capabilities and implementation techniques. We illustrate with popular NewSQL systems such as Google F1/Spanner, LeanXcale, CockroachDB, SAP HANA, MemSQL and Splice Machine. In particular, we spotlight some of the more advanced systems. We also compare with major NoSQL and SQL systems, and discuss integration within big data ecosystems and corporate information systems. Finally, we discuss the current trends and research directions.



Tutorial 5: An Overview of the Big Data Approaches for Profitable Social Network Analysis

  • Elio Masciari
  • Email: elio.masciari@unina.it

  • Domenico Saccà

Abstract
The pervasive diffusion of social networks has caused the generation of unprecedented amounts of heterogeneous data. As a result, traditional approaches quickly became impractical for real-life applications. In more detail, the analysis of user-generated data from popular social networks such as Facebook, Twitter, Instagram, and LinkedIn, to cite a few, poses quite intriguing challenges for both the research and industry communities in analyzing user behavior, user interactions, link evolution, opinion spreading and several other important tasks. This tutorial will focus on the requirements for effective analysis of this new kind of data by analyzing some of the most recent approaches in the literature. No specific prerequisites are needed, except the basic notions of graph theory, as we aim at guiding the attendees through a high-level tour of the most recent approaches proposed by both researchers and companies. In particular, we will focus on the peculiar Big Data features of social networks (SN) by analyzing the best solutions according to the state of the art. Moreover, as gathering reliable data for research purposes is crucial, we will explain how to properly obtain huge datasets.



Tutorial 6: Large scale semantic graph data management and analytics

  • Olivier Cure
  • Email: olivier.cure@u-pem.fr

Abstract
After years of research and development, standards and technologies for semantic data are sufficiently mature to be used as the foundation of novel data science projects that employ semantic technologies in various application domains such as bioinformatics, materials science, intelligence, and social science. Typically, such projects are carried out by domain experts who have a conceptual understanding of semantic technologies but lack the expertise to choose and to employ existing data management and analytical solutions for the semantic data in their project. For such experts, including domain-focused data scientists, business analysts, project coordinators, and project engineers, our tutorial will deliver a practitioner's guide to semantic data management and analytics. We will discuss the following important aspects of graph-based semantic data management and demonstrate how to address these aspects in practice by using mature, production-ready tools: storing and querying semantic data; automated reasoning; integrating external data and knowledge; and analytics.



Tutorial 7: Industrial AI: Machine Learning for Maintenance and Repair

  • Chetan Gupta
  • Ahmed Farahat
  • Email: Ahmed.Farahat@hal.hitachi.com

Abstract
Industrial AI is concerned with the application of Artificial Intelligence (AI), Machine Learning (ML) and related technologies towards addressing real-world challenges in industrial and societal domains. These challenges can be categorized into the horizontal areas of maintenance and repair (M&R), quality, operations, safety, etc. and have applications in a large number of verticals. These applications will have a profound and far-reaching impact over the next several years and decades. One of the key horizontals in Industrial AI is M&R. This tutorial presents an overview of the application of machine learning for industrial operations with a focus on the M&R of physical equipment. To set the context, we will begin with an overview of the M&R business, and introduce a taxonomy of M&R problems that can be solved using AI & ML. We will then take a deep dive into recent applications in which new modeling techniques have been introduced to solve unique challenges in M&R, such as using LSTMs and Functional Neural Networks (FNNs) for addressing prognostics problems, using RL for health indicator learning, GANs for generating failure data, etc. Finally, we will present some open problems in Industrial AI and discuss how the research community can shape the future of the next industrial revolution. We hope that by the end of the tutorial, the attendees will not only have a better appreciation for the space of Industrial AI but will also be exposed to new real-world problems and cutting-edge solutions.



Tutorial 8: How to build and run a big data platform in the 21st century

  • Ali Dasdan
  • Email: adasdan@atlassian.com

  • Dhruba Borthakur

Abstract
We want to show that building and running a big data platform for both streaming and bulk data processing, for all kinds of applications involving analytics, data science, reporting, and the like, can in today’s world be as easy as following a checklist. We live in a fortunate time in which many of the components needed are already available as open source or as a service from commercial vendors. We show how to put these components together at multiple sophistication levels to cover the spectrum from a basic reporting need to a full-fledged operation across geographically distributed regions with business continuity measures in place. We plan to provide enough information and checklists to the audience that this tutorial can also serve as a go-to reference in the actual process of building and running such a platform.



Tutorial 9: Deep Learning on Big Data with Multi-Node GPU Jobs

  • Thomas Breuel
  • Email: tbreuel@nvidia.com

  • Alex Aizman

Abstract
Both traditional machine learning (clustering, decision trees, parametric models, cross-validation, function decompositions) and deep learning (DL) are often used for the analysis of big data on hundreds of nodes (clustered servers). However, the systems and I/O considerations for multi-node deep learning are quite different from traditional machine learning. While traditional machine learning is often well served by MapReduce style infrastructure (Hadoop, Spark), distributed deep learning places different demands on hardware, storage software, and networking infrastructure. In this tutorial, we cover:

  • the structure and properties of large-scale GPU-based deep learning systems
  • large-scale distributed stochastic gradient descent and supporting frameworks (PyTorch, TensorFlow, Horovod, NCCL)
  • common storage and compression formats (TFRecord/tf.Example, DataLoader, etc.) and their interconnects (Ethernet, Infiniband, RDMA, NVLINK)
  • common storage architectures for large-scale DL (network file systems, distributed file systems, object storage)
  • batch queueing systems, Kubernetes, and NGC for scheduling and large-scale parallelism
  • ETL techniques including distributed GPU-based augmentation (DALI)

The tutorial will focus on techniques and tools by which deep learning practitioners can take advantage of these technologies and move from single-desktop training to training models on hundreds of GPUs and petascale datasets. It will also help researchers and system engineers to choose and size the systems necessary for such large-scale deep learning. Participants should have some experience in training deep learning models on a single node. The tutorial will cover both TensorFlow and PyTorch frameworks as well as additional open-source tools required to scale deep learning to multi-node storage and multi-node training.
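
The idea behind data-parallel distributed gradient descent, as named in the abstract, can be sketched in a few lines of plain Python. This is a conceptual toy, not code for PyTorch, TensorFlow, Horovod, or NCCL. The workers are simulated in a single process, each worker uses its whole shard as its batch, and the model (a single weight w fitting y = 3x), the data, and the function names are assumptions made for illustration.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    return sum(values) / len(values)


def train(shards, steps=200, lr=0.01):
    """Run synchronous data-parallel gradient descent over simulated workers."""
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard of the data.
        grads = [local_gradient(w, shard) for shard in shards]
        # All workers apply the same averaged gradient, so replicas stay in sync.
        w -= lr * all_reduce_mean(grads)
    return w


# Hypothetical dataset generated from y = 3x, split evenly across two workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]
w = train(shards)
```

With equally sized shards, the averaged gradient equals the gradient over the full dataset, so the simulated workers reach the same weight that single-node training would. In a real multi-node job the averaging step is a network collective, which is one reason interconnects and storage throughput matter at scale.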