Tutorial

1. Optimizing Big Data Analytics on Heterogeneous Processors


Tutorial file for download: Big_Data_Heterogeneous_Proc_Tutorial

  • Mayank Daga, Advanced Micro Devices, Inc.
  • Mauricio Breternitz, Advanced Micro Devices, Inc.
  • Junli Gu, Advanced Micro Devices, Inc.
  • Increasing computational and storage capacities have unleashed the field of big data analytics, which necessitates novel architectures and tools to extract meaning from gargantuan volumes of data. The launch of AMD APUs that comply with the Heterogeneous System Architecture (HSA) addresses some specific architectural challenges for big data analytics. In general, the community has focused on enhancing tools on discrete GPUs (dGPUs). However, dGPUs limit performance because of the high overhead of copying data from CPU memory to GPU memory and the constraints on dGPU memory size. HSA eliminates these performance limitations.
    This tutorial will explore the optimization of tools for big data analytics on the first HSA-enabled APU. We will cover the following topics:
    • Programming APUs with HSA, with a focus on the C++ and OpenCL programming languages
    • Enhancing the programming model with a focus on accelerating Hadoop and Apache Spark
    • Enhancing data operations with a focus on optimizing breadth-first search (BFS) and deep neural networks (DNN)
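
As a point of reference for the BFS topic listed above, the following is a minimal sketch of level-synchronous (frontier-by-frontier) breadth-first search in plain Python. It is not the tutorial's HSA implementation; the adjacency-list dictionary and the function name `bfs_levels` are illustrative assumptions. The per-frontier loop is the part that a parallel processor would typically distribute across threads.

```python
def bfs_levels(adj, source):
    """Return {vertex: hop distance from source} for all reachable vertices.

    adj is assumed to be a dict mapping each vertex to an iterable of
    its neighbors (a hypothetical input format chosen for this sketch).
    """
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # Expand every vertex in the current frontier; each expansion is
        # independent, which is what makes this formulation parallelizable.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:  # first visit fixes the shortest distance
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return dist
```

For example, calling `bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, 0)` visits vertices 1 and 2 in the first frontier and vertex 3 in the second.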

2. The Era of Big Spatial Data

  • Mohamed F. Mokbel, University of Minnesota
  • Ahmed Eldawy, University of Minnesota
  • In this tutorial, we present recent work in the big data community on handling Big Spatial Data. This topic has attracted considerable attention due to the recent explosion in the amount of spatial data generated by smartphones, satellites and medical devices, among others. This tutorial goes beyond the use of existing systems as-is (e.g., Hadoop, Spark or Impala), and digs deep into the core components of big data systems (e.g., indexing and query processing) to describe how they are designed to handle big spatial data. During this 90-minute tutorial, we review the state-of-the-art work in the area of Big Spatial Data while classifying the existing research efforts according to the implementation approach, underlying architecture, and system components. In addition, we provide case studies of full-fledged systems and applications that handle Big Spatial Data, which help the audience better understand the tutorial as a whole.

3. Platforms and Algorithms for Big Data Analytics

http://dmkd.cs.wayne.edu/TUTORIAL/Bigdata/
Tutorial file for download: http://dmkd.cs.wayne.edu/TUTORIAL/Bigdata/Slides.pdf
  • Chandan K. Reddy, Wayne State University

  • This tutorial consists of two parts: (i) big data platforms and their characteristics, and (ii) large-scale classification and clustering algorithms.
    The first part will provide an in-depth analysis of the different platforms available for studying and performing big data analytics. It will survey the hardware platforms available for big data analytics and assess the advantages and drawbacks of each platform based on metrics such as scalability, data I/O rate, fault tolerance, real-time processing, data size supported and iterative task support. In addition to the hardware, the software frameworks used within each of these platforms will be described in detail, along with their strengths and drawbacks. Some of the critical characteristics discussed here can help the audience make an informed decision depending on their computational needs. A rigorous qualitative comparison between the different platforms will also be presented using a ratings table.
    The second part of the tutorial will cover big data classification and clustering algorithms. To provide more insight into the effectiveness of each platform in the context of big data analytics, specific implementation-level details of the widely used k-nearest neighbor and k-means clustering algorithms on various platforms will be described in the form of pseudocode. In addition, recent advances in large-scale linear classification and MapReduce-based classification algorithms will be discussed. In the context of clustering, some of the well-known one-pass clustering algorithms and other parallel and distributed clustering solutions will be briefly mentioned.
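
To illustrate the kind of platform-oriented formulation mentioned above, here is a minimal sketch of k-means written in a MapReduce style, in plain Python. It is not taken from the tutorial slides: the function names (`nearest`, `kmeans_iteration`, `kmeans`) and the in-memory "map" and "reduce" phases are assumptions made for illustration. On a real platform the two phases would run as distributed tasks.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_iteration(points, centroids):
    """One iteration: a map phase (assign) followed by a reduce phase (average)."""
    # Map phase: emit (centroid index, point) pairs, grouped by key.
    groups = defaultdict(list)
    for p in points:
        groups[nearest(p, centroids)].append(p)

    # Reduce phase: each key's points are averaged into a new centroid.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            new_centroids.append(tuple(old))  # keep a centroid with no points
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


def kmeans(points, centroids, max_iters=100):
    """Repeat iterations until the centroids stop changing or max_iters is hit."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Each iteration needs a full pass over the data, which is why the iterative task support of a platform matters for this algorithm.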

4. Optimal Connectivity on Big Graphs: Measures, Algorithms and Applications


Tutorial file for download: PDF File
  • Hanghang Tong, Arizona State University
  • Graph mining has been playing a pivotal role in many disciplines, including computer science, sociology, civil engineering, physics, economics/marketing, biology, life science, management science, and political science. A common and fundamental property of the graphs arising from these domains is connectivity. The goal of this tutorial is to (1) provide a concise review of the recent advances in optimizing graph connectivity and its applications; and (2) identify the open challenges and future trends. We believe this is an emerging, high-impact topic in graph mining that will attract both researchers and practitioners in the big data research community. Our emphasis will be on (1) recently emerging techniques for addressing the graph connectivity optimization problem, especially in the context of big data; and (2) the open challenges and future trends, with a careful balance between theory, algorithms and applications.

5. The World is Big and Linked: Whole Spectrum Industry Solutions towards Big Graphs


Tutorial file for download: Compressed-Tutorial_IEEE-BigData-2015-IBM-SystemG-20151031_v3.compressed

  • Toyotaro Suzumura, IBM Research
  • Ching-Yung Lin, IBM Research
  • Yinglong Xia, IBM Research
  • Lifeng Nai, Georgia Institute of Technology
  • The importance of graphs needs no emphasis, since many big data applications consist of entities with internal links, naturally forming a graph. However, graph computing and storage are notorious for their low efficiency, which creates performance barriers in practice, especially when the graph volume becomes huge. Although many efforts have been made to improve efficiency, there is still a lack of systematic study of the characteristics of big graph computing and storage on commodity platforms. For example, Spark GraphX and Apache Giraph address only the computing framework and largely skip the integration with a fully functional graph storage layer, while existing graph databases, e.g., Titan, take a naive approach to organizing graph data, with little consideration of computational behavior, not to mention visualization and scale-out/scale-up techniques.
    In this tutorial, we provide a full-perspective investigation of big graph technology, addressing the overall architecture of industry solutions for big graphs. The tutorial consists of three mutually related parts, all under the umbrella of IBM System G, a real industrial high-performance solution for big graphs. The three parts are: 1) An overview of Graphen, a sister of IBM System G and a possible candidate for the Apache Incubator, based on open source technologies and compatible with de facto graph standards. 2) An exploration of ScaleGraph, an open source package built with IBM’s PGAS that helped win first place in the latest Graph500, aiming at scaling out graph computation and addressing both software performance and productivity. 3) An introduction to GraphBIG, an open source suite with a unified in-memory graph representation and graph analytics on multicore CPUs/GPUs, aiming at scaling up graph computing. We will present industry solutions and experimental results to shed light on the essential behaviors of big graph processing.

6. Tutorial on Predictive Maintenance

Tutorial file for download: http://media.wix.com/ugd/560696_4fc53c34a10744cfb695df9d185262d4.pdf
  • Zhuang Wang, Skytree Inc.

  • Predictive maintenance strives to anticipate equipment failures to allow for advance scheduling of corrective maintenance, thereby preventing unexpected equipment downtime and improving service quality for customers. There is tremendous interest in industry in leveraging recent advances in machine learning and data mining to tackle this problem. Whereas the key enabling techniques for predictive maintenance (such as failure diagnostics and prediction) have received considerable emphasis in the community, the design of practical predictive maintenance systems has not enjoyed the same attention. This is partly because the lack of access to real-world use cases makes it hard for researchers to consider all the characteristics of the data and the nature of the problem in a practical design.
    In this tutorial, we aim to fill the gap between business needs and technology offerings through a detailed study of the nature and requirements of real-world predictive maintenance problems, as well as a comprehensive survey of the techniques for tackling these problems. We will survey the underlying data sources and feature engineering techniques, the learning scenarios, and the model creation and selection techniques, and will also present several real-world case studies and lessons learned.