Data systems are everywhere. A data system is a collection of data structures and algorithms working together to achieve complex data processing tasks. For example, with data systems that use the correct data structure design for the problem at hand, we can reduce the monthly bill of large-scale data applications on the cloud by hundreds of thousands of dollars. We can accelerate data science tasks by dramatically speeding up the computation of statistics over large amounts of data. We can train drastically more neural networks within a given time budget, improving accuracy. However, knowing the right data system design for any given scenario is a notoriously hard problem; there is a massive space of possible designs, and no single design is perfect across all data, queries, and hardware contexts. In addition, building a new system may take several years for any given (fixed) design. As a result, modern data-driven applications incur massive cloud costs and development delays.
We will discuss our quest for the first principles of data system design. We will show that it is possible to reason about this massive design space. This allows us to automatically create self-designing data systems that can take drastically different shapes to optimize for the workload, hardware, and available cloud budget using a grammar for data systems. These shapes include data structures, algorithms, and overall system designs that are discovered automatically and do not (always) exist in the literature or industry, yet they can be more than 10x faster. We will show performance examples of up to 1000x faster NoSQL processing and up to 10x faster neural network training.
Stratos Idreos is a Gordon McKay Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences. He leads DASlab, the Data Systems Laboratory at Harvard. His research focuses on building a grammar for systems with the goal of dramatically simplifying, or in many cases even automating, the design of workload- and hardware-conscious systems for diverse applications including relational data analytics, NoSQL, machine learning, and blockchain. For his doctoral work on Database Cracking, Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation Award and the 2011 ERCIM Cor Baayen Award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems. In 2020 he received the ACM SIGMOD Contributions Award for his work on reproducible research, and in 2022 he received the ACM SIGMOD Test of Time Award for his work on raw data processing. Stratos was PC Chair of ACM SIGMOD 2021 and IEEE ICDE 2022; he is the founding editor of the ACM/IMS Journal of Data Science and the chair of the ACM SoCC Steering Committee.
Abstract: First-principle-based models are extensively used to study engineering and environmental systems. Such models have well-known limitations, e.g., they are incomplete representations of the underlying physical processes and often have many parameters that need to be calibrated. With the massive amount of data about Earth and its environment that is now continuously being generated by Earth-observing satellites and in-situ sensors, there is a tremendous opportunity to systematically advance modeling in environmental domains by using state-of-the-art machine learning (ML) methods that have already revolutionized computer vision and language translation. However, capturing this opportunity is contingent on a paradigm shift in data-intensive scientific discovery, since the “black box” use of ML often leads to serious false discoveries in scientific applications. Because the hypothesis space of scientific applications is often complex and exponentially large, an uninformed data-driven search can easily select a highly complex model that is neither generalizable nor physically interpretable, resulting in the discovery of spurious relationships, predictors, and patterns. This problem becomes worse when there is a scarcity of labeled samples, which is quite common in science and engineering domains.
This talk makes the case that in real-world systems that are governed by physical processes, there is an opportunity to take advantage of fundamental physical principles to inform the search for a physically meaningful and accurate ML model. While this talk will illustrate the potential of the knowledge-guided machine learning (KGML) paradigm in the context of environmental problems (e.g., freshwater science, hydrology, agronomy), the paradigm has the potential to greatly advance the pace of discovery in a diverse set of disciplines where mechanistic models are used, e.g., climate science, weather forecasting, and pandemic management.
Vipin Kumar is a Regents Professor and holds the William Norris Chair in the Department of Computer Science and Engineering at the University of Minnesota. His research spans data mining, high-performance computing, and their applications in climate/ecosystems and health care. He also served as the Director of the Army High Performance Computing Research Center (AHPCRC) from 1998 to 2005. He has authored over 400 research articles, and co-edited or coauthored 11 books, including two widely used textbooks, “Introduction to Parallel Computing” and “Introduction to Data Mining”, and a recent edited collection, “Knowledge Guided Machine Learning”. Kumar's current major research focus is on knowledge-guided machine learning and its applications to understanding the impact of human-induced changes on the Earth and its environment. Kumar's research on this topic is funded by NSF's AI Institutes, BIGDATA, INFEWS, STC, GCR, and HDR programs, as well as ARPA-E, DARPA, and USGS. He has recently finished serving as the Lead PI of a 5-year, $10 million project, “Understanding Climate Change - A Data Driven Approach”, funded by the NSF's Expeditions in Computing program. Kumar is a Fellow of the AAAI, ACM, IEEE, AAAS, and SIAM. Kumar's foundational research in data mining and high-performance computing has been honored by the ACM SIGKDD 2012 Innovation Award, which is the highest award for technical excellence in the field of Knowledge Discovery and Data Mining (KDD), the 2016 IEEE Computer Society Sidney Fernbach Award, one of the IEEE Computer Society's highest awards in high-performance computing, and the Test-of-Time Award from the 2021 Supercomputing Conference (SC21).
Abstract: Graphics processing units, or GPUs, are widely employed as hardware accelerators in various applications, such as algorithmic trading, computer vision, and large language model training. In particular, NVIDIA's GPUs, together with its Compute Unified Device Architecture (CUDA) interface, provide a massively parallel platform for general-purpose computing. However, it is often challenging to accelerate data processing and analytical tasks on the GPU when they are irregular and do not match the GPU architecture or programming paradigm well. In this talk, I will discuss general methodologies as well as specific design and implementation techniques for using the GPU to accelerate such tasks, and compare them with CPU-based solutions. With the prevalence of GPU-equipped computing resources and big data applications of increasing complexity and scale, more opportunities and challenges will arise in this space.
Qiong Luo is a Professor of Computer Science and Engineering at the Hong Kong University of Science and Technology and a Professor of Data Science and Analytics at the Hong Kong University of Science and Technology (Guangzhou). Her research interests are in big data systems, parallel and distributed systems, and data-intensive applications. Her current focus is on data management on modern hardware, GPU acceleration for data analytics, and data processing support for e-science. Qiong has published over 160 research papers at international conferences and in journals, and her work has received more than 9,500 citations according to Google Scholar. She has served diligently on the program committees and organization committees of major conferences, is the Program Co-Chair of the EDBT 2024 conference, and is an Associate Editor of the Distributed and Parallel Databases Journal and the VLDB Journal. Qiong received her Ph.D. in Computer Sciences from the University of Wisconsin-Madison, and her M.S. and B.S. in Computer Sciences from Beijing (Peking) University, China.
Abstract: As the importance of data protection and privacy legislation is increasingly recognized worldwide, protecting sensitive information and individual privacy presents a major challenge for modern big data analytics systems. The ever-growing list of major data breaches (and associated fines) clearly demonstrates the inadequacy of earlier ad-hoc solutions to the problem, as well as the need to effectively bridge legal and technical/systems interpretations of data privacy.
In this talk, I will present different modern privacy enhancing technologies (including federated learning, secure computing, differential privacy, and synthetic data), and discuss how they can enable formal, cryptographic notions of privacy in large-scale data analytics. The focus will be on our recent efforts to build systems and tools to support querying and machine learning over sensitive medical data. Several open challenges and directions for future research will also be discussed.
Minos Garofalakis is the Director of the Information Management Systems Institute (IMSI) at the ATHENA Research and Innovation Center and a Professor at the School of ECE at the Technical University of Crete. He also works as a senior research consultant for Huawei Research, and is the co-founder of Agora Labs, a startup company bringing state-of-the-art data privacy technologies to the medical domain. Minos received the MSc and PhD degrees from the University of Wisconsin-Madison in 1994 and 1998, respectively, and previously held senior/principal researcher positions at Bell Labs, Lucent Technologies (1998-2005), Intel Research Berkeley (2005-2007), and Yahoo! Research (2007-2008); in parallel, he held an Adjunct Associate Professor position at the EECS Department of UC Berkeley (2006-2008).
Minos’s research interests lie in the broad area of Big Data Analytics. He has published over 170 scientific papers in refereed international conferences and journals, and has delivered several invited keynote talks and tutorials at major international events. His work has resulted in 36 US Patent filings (29 patents issued) for companies such as Lucent, Yahoo!, and AT&T. Google Scholar gives over 16,000 citations to his work, and an h-index value of 69. He is a Fellow of the ACM and IEEE, a Member of the Academia Europaea, and a recipient of the TUC “Excellence in Research” Award (2015) and the Bell Labs President’s Gold Award (2004).
Abstract: We shall consider four problems and their efficient solutions when the dataset is very large. (1) A brief introduction to locality-sensitive hashing; e.g., finding similar records or plagiarized documents. (2) Counting distinct items; e.g., unique visitors to a website. (3) Random sampling of relational queries on databases. (4) Counting triangles; an example of optimal join computation, with an application to finding communities in social networks.
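As a small illustration of problem (4), the sketch below counts triangles in an undirected graph by intersecting neighbor sets. It is a minimal in-memory sketch, not the optimal join algorithm the talk refers to; the function name `count_triangles` and the sample edge list are assumptions made for this example, and node labels are assumed to be mutually comparable.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    # Build adjacency sets; self-loops are skipped and duplicate edges collapse.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    count = 0
    for u in list(adj):
        for v in adj[u]:
            if u < v:
                # Only accept common neighbors w with w > v, so each
                # triangle is counted exactly once in the order u < v < w.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


# Hypothetical sample graph with two triangles: {1, 2, 3} and {3, 4, 5}.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
print(count_triangles(edges))  # prints 2
```

Ordering the three nodes of each triangle avoids counting the same triangle several times, which is the same idea that lets join-based formulations avoid enumerating redundant intermediate results.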
Jeff Ullman is the Stanford W. Ascherman Professor of Engineering (Emeritus) in the Department of Computer Science at Stanford and CEO of Gradiance Corp. He received the B.S. degree from Columbia University in 1963 and the PhD from Princeton in 1966. Prior to his appointment at Stanford in 1979, he was a member of the technical staff of Bell Laboratories from 1966 to 1969, and on the faculty of Princeton University between 1969 and 1979. From 1990 to 1994, he was chair of the Stanford Computer Science Department. Ullman was elected to the National Academy of Engineering in 1989, the American Academy of Arts and Sciences in 2012, and the National Academy of Sciences in 2020, and has held Guggenheim and Einstein Fellowships. He has received the SIGMOD Contributions Award (1996), the ACM Karl V. Karlstrom Outstanding Educator Award (1998), the Knuth Prize (2000), the SIGMOD E. F. Codd Innovations Award (2006), the IEEE John von Neumann Medal (2010), the NEC C&C Foundation Prize (2017), and the ACM A.M. Turing Award (2020). He is the author of 16 books, including books on database systems, data mining, compilers, automata theory, and algorithms.