Abstract: Every field has data. We use data to discover new knowledge, to interpret the world, to make decisions, and even to predict the future. The recent convergence of big data, cloud computing, and novel machine learning algorithms and statistical methods is causing an explosive interest in data science and its applicability to all fields. This convergence has already enabled the automation of some tasks that better human performance. The novel capabilities we derive from data science will drive our cars, treat disease, and keep us safe. At the same time, such capabilities risk leading to biased, inappropriate, or unintended action. The design of data science solutions requires both excellence in the fundamentals of the field and expertise to develop applications which meet human challenges without creating even greater risk.
The Data Science Institute at Columbia University promotes “Data for Good”: using data to address societal challenges and bringing humanistic perspectives as—not after—new science and technology is invented. Started in 2012, the Institute is now a university-level institute representing over 350 affiliated faculty from 16 different schools and institutes across campus. Data science literally touches every corner of the university.
In this talk, I will present the mission of the Institute and highlights of our educational and research activities—all with the aim of ensuring the responsible use of data to benefit society.
Prof. Jeannette M. Wing is Avanessians Director of the Data Science Institute and Professor of Computer Science at Columbia University. From 2013 to 2017, she was a Corporate Vice President of Microsoft Research. She is Adjunct Professor of Computer Science at Carnegie Mellon where she twice served as the Head of the Computer Science Department and had been on the faculty since 1985. From 2007 to 2010, she was the Assistant Director of the Computer and Information Science and Engineering Directorate at the National Science Foundation. She received her S.B., S.M., and Ph.D. degrees in Computer Science, all from the Massachusetts Institute of Technology.
Professor Wing's general research interests are in the areas of trustworthy computing, specification and verification, concurrent and distributed systems, programming languages, and software engineering. Her current interests are in the foundations of security and privacy, with a new focus on trustworthy AI. She was or is on the editorial board of twelve journals, including the Journal of the ACM and Communications of the ACM.
Professor Wing is known for her work on linearizability, behavioral subtyping, attack graphs, and privacy-compliance checkers. Her seminal 2006 essay, titled “Computational Thinking,” is credited with helping to establish the centrality of computer science to problem-solving in fields where previously it had not been embraced.
She is currently a member of: the National Library of Medicine Blue Ribbon Panel; the Science, Engineering, and Technology Advisory Committee for the American Academy of Arts and Sciences; the Board of Trustees for the Institute for Pure and Applied Mathematics; the Advisory Board for the Association for Women in Mathematics; and the Alibaba DAMO Technical Advisory Board. She has been chair and/or a member of many other academic, government, and industry advisory boards. She received the CRA Distinguished Service Award in 2011 and the ACM Distinguished Service Award in 2014. She is a Fellow of the American Academy of Arts and Sciences, the American Association for the Advancement of Science, the Association for Computing Machinery (ACM), and the Institute of Electrical and Electronics Engineers (IEEE).
Abstract: Given a large graph, like who-calls-whom, or who-likes-whom, what behavior is normal and what should be surprising, possibly due to fraudulent activity? How do graphs evolve over time? We focus on these topics: (a) anomaly detection in large static graphs and (b) patterns and anomalies in large time-evolving graphs.
For the first, we present a list of static and temporal laws, including advanced patterns like 'eigenspokes'; we show how to use them to spot suspicious activities in on-line buyer-and-seller settings, in Facebook, and in Twitter-like networks. For the second, we show how to handle time-evolving graphs as tensors, as well as some surprising discoveries in such settings.
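As a rough illustration of what "handling a time-evolving graph as a tensor" can mean, the sketch below stores who-calls-whom events as a sparse three-mode tensor indexed by (source, destination, time slot). It is not the speaker's method; the event data and the function names `build_tensor` and `slice_totals` are hypothetical, chosen only for this example.

```python
from collections import Counter

def build_tensor(events):
    """Count (source, destination, time_slot) triples into a sparse 3-mode tensor."""
    return Counter(events)

def slice_totals(tensor):
    """Sum over the source and destination modes, giving one total per time slot."""
    totals = Counter()
    for (_source, _destination, slot), count in tensor.items():
        totals[slot] += count
    return totals

# Hypothetical who-calls-whom events: (caller, callee, day)
events = [
    ("a", "b", 0), ("a", "b", 0), ("b", "c", 0),
    ("a", "b", 1), ("c", "a", 1),
    ("d", "a", 2), ("d", "b", 2), ("d", "c", 2), ("d", "a", 2),
]

tensor = build_tensor(events)
totals = slice_totals(tensor)
```

Each nonzero entry of `tensor` is an edge weight in one time slice, so a sudden jump in a slice total, or in one node's entries within a slice, is the kind of signal a tensor-based analysis could inspect further.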
Christos Faloutsos is a Professor at Carnegie Mellon University and an Amazon Scholar. He has received the Fredkin Professorship in Artificial Intelligence (2020), the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, the SIGKDD Innovations Award (2010), the PAKDD Distinguished Contributions Award (2018), 28 “best paper” awards (including 7 “test of time” awards), and four teaching awards.
Eight of his advisees or co-advisees have received KDD or SCS dissertation awards. He is an ACM Fellow and has served as a member of the executive committee of SIGKDD. He has published over 400 refereed articles, 17 book chapters, and three monographs. He holds 8 patents (with several more pending) and has given over 50 tutorials and over 25 invited distinguished lectures.
His research interests include large-scale data mining with emphasis on graphs and time sequences; anomaly detection, tensors, and fractals.
Abstract: Data are being generated, collected, and analyzed today at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. As the use of big data has grown, so too have concerns that poor-quality data, prevalent in large data sets, can have serious adverse consequences on data-driven decision making. Responsible data science thus requires a recognition of the importance of veracity, the fourth “V” of big data. In this talk, we first present a vision of high-quality big data and highlight the substantial challenges that the first three V’s, volume, velocity, and variety, bring to dealing with veracity in big data. We then present the FIT Family of adaptive, data-driven statistical tools that we have designed, developed, and deployed at AT&T for continuous data quality monitoring of a large and diverse collection of continuously evolving data. These tools monitor data movement to discover missing, partial, duplicated, and delayed data; identify changes in the content of spatiotemporal streams; and pinpoint anomaly hotspots based on persistence, pervasiveness, and priority. We conclude with lessons from FIT relevant to big data quality that are cause for optimism.
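To make the idea of monitoring data movement concrete, here is a minimal sketch that flags missing, duplicated, and delayed records in a feed. It is not the FIT tools described in the abstract; the function `check_feed`, the record layout, and the fixed `max_delay` threshold are all assumptions made for illustration.

```python
def check_feed(arrivals, expected_ids, max_delay):
    """Classify feed records as missing, duplicated, or delayed.

    arrivals: list of (record_id, due_time, arrival_time) tuples.
    expected_ids: ids that should have arrived.
    max_delay: largest acceptable gap between due time and arrival time.
    """
    seen = {}
    duplicated, delayed = set(), set()
    for record_id, due, arrived in arrivals:
        if record_id in seen:
            # A second copy of an id we already received.
            duplicated.add(record_id)
            continue
        seen[record_id] = arrived
        if arrived - due > max_delay:
            delayed.add(record_id)
    missing = set(expected_ids) - set(seen)
    return {
        "missing": sorted(missing),
        "duplicated": sorted(duplicated),
        "delayed": sorted(delayed),
    }

# Hypothetical feed: f2 arrives late and twice, f3 never arrives.
arrivals = [("f1", 0, 1), ("f2", 10, 25), ("f2", 10, 26), ("f4", 30, 31)]
report = check_feed(arrivals, ["f1", "f2", "f3", "f4"], max_delay=5)
```

A fixed threshold is the simplest possible rule; the adaptive, data-driven tools the abstract describes would presumably learn what counts as late or incomplete from the feed's own history.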
Divesh Srivastava is the Head of Database Research at AT&T. He is a Fellow of the Association for Computing Machinery (ACM), the Vice President of the VLDB Endowment, on the Board of Directors of the Computing Research Association (CRA), on the ACM Publications Board and an associate editor of the ACM Transactions on Data Science (TDS). He has served as the managing editor of the Proceedings of the VLDB Endowment (PVLDB), as associate editor of the ACM Transactions on Database Systems (TODS), and as associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE). He has presented keynote talks at several international conferences, and his research interests and publications span a variety of topics in data management. He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.