October 27-30, 2014, Washington DC, USA
TUTORIAL 1: Big Data Stream Mining
Presenters: Gianmarco De Francisci Morales, Joao Gama, Albert Bifet, and Wei Fan
The challenge of deriving insights from big data has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. This tutorial is a gentle introduction to mining big data streams. The first part introduces data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discusses data stream mining on distributed engines such as Storm, S4, and Samza.
Fundamentals and Stream Mining Algorithms
– Concept drift
– Frequent Pattern mining
Distributed Big Data Stream Mining
– Distributed Stream Processing Engines
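As a rough illustration of the concept drift topic in the outline above, the following Python sketch monitors a stream of 0/1 prediction errors and signals drift when the error rate in the most recent window exceeds the rate in the preceding window by a fixed margin. It is a deliberately simplified scheme written for this page, not one of the detectors covered in the tutorial; the class name, window size, threshold, and synthetic stream are all assumptions.

```python
from collections import deque


class WindowDriftDetector:
    """Signals drift when the recent error rate exceeds the reference
    error rate by more than `threshold` (hypothetical, simplified scheme)."""

    def __init__(self, window_size=50, threshold=0.2):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = deque(maxlen=window_size)  # older outcomes
        self.recent = deque(maxlen=window_size)     # newest outcomes

    def update(self, error):
        """Feed one 0/1 prediction error; return True if drift is signalled."""
        if len(self.recent) == self.window_size:
            # The oldest "recent" outcome moves into the reference window.
            self.reference.append(self.recent[0])
        self.recent.append(error)
        if len(self.reference) < self.window_size:
            return False  # not enough history to compare yet
        ref_rate = sum(self.reference) / self.window_size
        new_rate = sum(self.recent) / self.window_size
        if new_rate - ref_rate > self.threshold:
            # Forget the old concept so the detector adapts to the new one.
            self.reference.clear()
            self.recent.clear()
            return True
        return False


# Synthetic stream (assumed for illustration): 10% errors, then 60% errors.
errors = [1 if i % 10 == 0 else 0 for i in range(200)]
errors += [1 if i % 10 < 6 else 0 for i in range(200)]

detector = WindowDriftDetector()
drifts = [i for i, e in enumerate(errors) if detector.update(e)]
print("drift signalled at positions:", drifts)
```

In a real stream learner, a drift signal would typically trigger retraining or replacement of the current model; here the detector only resets its own windows.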
Gianmarco De Francisci Morales is a Research Scientist at Yahoo Labs Barcelona. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on large scale data mining and big data, with a particular emphasis on web mining and Data Intensive Scalable Computing systems. He is an active member of the open source community of the Apache Software Foundation working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is the co-leader of the SAMOA project, an open-source platform for mining big data streams.
Joao Gama's Profile
Joao Gama is a Researcher at LIAAD, University of Porto, working in the Machine Learning group. His main research interest is learning from data streams. He has published more than 80 articles. He served as Co-chair of ECML 2005, DS09, ADMA09, and a series of Workshops on KDDS and Knowledge Discovery from Sensor Data with ACM SIGKDD. He is serving as Co-Chair of the upcoming ECML-PKDD 2015. He is the author of a recent book on Knowledge Discovery from Data Streams.
Albert Bifet's Profile
Albert Bifet is a Research Scientist at Huawei. He is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. He is one of the leaders of the MOA and SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams.
Wei Fan's Profile
Wei Fan is the associate director of Huawei Noah’s Ark Lab. He received his PhD in Computer Science from Columbia University in 2001. His main research interests and experience are in various areas of data mining and database systems, such as stream computing, high performance computing, extremely skewed distributions, cost-sensitive learning, risk analysis, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, time series analysis, bioinformatics, social network analysis, novel applications, and commercial data mining systems. His co-authored paper received the ICDM 2006 Best Application Paper Award, and he led the team that used his Random Decision Tree method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD). Since joining Huawei in August 2012, he has led his colleagues in developing Huawei StreamSMART, a streaming platform for online and real-time processing, querying, and mining of very fast streaming data. In addition, he led his colleagues in developing a real-time processing and analysis platform for Mobile Broad Band (MBB) data.
IEEE Big Data 2014 Big Data Stream Mining Tutorial website: https://sites.google.com/site/bigdatastreamminingtutorial/
Slides can be downloaded here.
TUTORIAL 2: Big ML Software for Modern ML Algorithms
Presenters: Eric P. Xing and Qirong Ho
Many Big Data practitioners are familiar with classical Machine Learning techniques such as Naive Bayes, Decision Trees, K-means, PCA, and Collaborative Filtering (to name but a few), and their implementations on popular Big Data systems such as Hadoop. Going beyond these classic techniques, a new generation of ML algorithms (for example, topic models, nonparametric Bayesian models, deep neural networks, and sparse regression) has been gaining popularity in both academia and industry, because they improve performance on existing tasks like recommendation and prediction, or even enable completely new ones such as topical visualization and image object detection. Initially, these algorithms were the exclusive privilege of large companies with the engineering resources to build their own cluster implementations from scratch. Today, however, new open-source software platforms, such as GraphLab, Petuum, and Spark, have democratized some or all of these advanced algorithms, putting them within reach of individual researchers and data analysts who do not mind getting their hands a little dirty. In this tutorial, you will learn about these emerging ML algorithms; the software platforms that can run them today; the ML-centric theory, principles, and design of an ideal parallel ML system and how today’s platforms fit that ideal; and the open research opportunities that have sprouted in this space between advanced ML and distributed systems.
Advanced, emerging ML algorithms:
Open source software platforms that can run some or all of these algorithms at scale:
– e.g. GraphLab, Petuum and Spark
Principles, design and theory of an algorithmic and systems interface to Big ML
– Pros and cons of each platform: when should you favor one over the other
Research opportunities in the space between advanced ML and distributed systems
A list of references for the three systems:
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
Exploiting bounded staleness to speed up Big Data analytics
Fugue: Slow-Worker-Agnostic Distributed Learning for Big Models on Big Data
We have one more paper whose camera-ready version is still being prepared. I'd be happy to send a link once it is ready.
Spark: Cluster Computing with Working Sets
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
GraphLab: A New Parallel Framework for Machine Learning
Graph-Parallel Computation on Natural Graphs
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
TUTORIAL 3: Large-scale Heterogeneous Learning in Big Data Analytics
Presenter: Jun Huan
Heterogeneous learning deals with data from complex real-world applications such as social networks, biological networks, and the Internet of Things, among others. The heterogeneity can be found in multi-task learning, multi-view learning, multi-label learning, and multi-instance learning. In this talk we will present our and other groups’ recent progress in designing and implementing large-scale heterogeneous learning algorithms, including multi-task learning, multi-view learning, and transfer learning algorithms. The applications of this work in social network analysis and bioinformatics will be discussed as well.
We cover recent progress on the following aspects:
– Multi-task learning (MTL) aims to train multiple related learning tasks together to reduce generalization error. MTL has been widely used in many application domains, including bioinformatics, social network analysis, and image processing, among others.
– Multi-view learning (MVL) aims to identify a model where data are collected from different sources (a.k.a. views). There is an intense discussion on how and to what extent multiple views may help.
– Multi-label learning (MLL) aims to build a classifier that assigns multiple labels to an instance. It has wide applications in image annotation, recommender systems, and other areas.
We cover the theoretical foundations of MTL/MVL/MLL learning algorithms using penalized maximum likelihood estimation, Bayesian MTL, and Gaussian processes. We also cover related algorithms such as MTL with known task relationships, multi-task and multi-view learning, and learning with structured input and output. We also discuss a very important but less investigated area: scaling those learning algorithms to large-scale data. We plan to cover a few platforms that are suitable for supporting large-scale heterogeneous learning. Applications of heterogeneous learning in bioinformatics, health care informatics, drug discovery, and social network analysis will be reviewed.
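For readers unfamiliar with the penalized maximum likelihood view of multi-task learning mentioned above, one generic way to write the objective is sketched below. The notation is an assumption made here for illustration, not taken from the tutorial: T tasks, n_t training pairs (x_{ti}, y_{ti}) for task t, one parameter vector w_t per task, a negative log-likelihood loss ℓ, a penalty Ω that couples the tasks, and a trade-off weight λ.

```latex
\min_{W = [w_1, \dots, w_T]} \;
  \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t}
    \ell\bigl(y_{ti},\, f(x_{ti}; w_t)\bigr)
  \;+\; \lambda \, \Omega(W)
% One common (assumed) choice of penalty pulls every task toward the task mean:
% \Omega(W) = \sum_{t=1}^{T} \lVert w_t - \bar{w} \rVert_2^2,
% \quad \bar{w} = \frac{1}{T} \sum_{t=1}^{T} w_t
```

With λ = 0 the tasks are fit independently; larger λ shares more information across tasks, which is the mechanism by which joint training can reduce generalization error.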
Dr. Jun (Luke) Huan's Profile
Dr. Jun (Luke) Huan is a Professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Bioinformatics and Computational Life Sciences Laboratory at the KU Information and Telecommunication Technology Center (ITTC) and the Cheminformatics core at the KU Specialized Chemistry Center, funded by NIH. He holds courtesy appointments at the KU Bioinformatics Center and the KU Bioengineering Program, an adjunct professorship in the Department of Internal Medicine in the KU Medical School, and a visiting professorship from GlaxoSmithKline plc. Dr. Huan received his Ph.D. in Computer Science from the University of North Carolina.
Dr. Huan works on data science, machine learning, data mining, big data, and interdisciplinary topics including bioinformatics. He has published more than 80 peer-reviewed papers in leading conferences and journals and has graduated more than ten graduate students, including six PhDs. Dr. Huan serves on the editorial boards of several international journals, including the Springer Journal of Big Data, Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics. He regularly serves on the program committees of top-tier international conferences on machine learning, data mining, big data, and bioinformatics.
Dr. Huan's research is recognized internationally. He was a recipient of the prestigious National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work has appeared in mass media outlets including Science Daily, R&D magazine, and EurekAlert (sponsored by AAAS). Dr. Huan's research has been supported by NSF, NIH, DoD, and the University of Kansas.
TUTORIAL 4: Big Data Benchmarking
Presenters: Chaitan Baru and Tilmann Rabl
This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks and in creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data (e.g., structured, semi-structured, unstructured, graphs, and streams); large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; and workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
Since May 2012, five workshops on Big Data Benchmarking have been held, with participation from industry and academia. One of the outcomes of these meetings has been the creation of the industry’s first big data benchmark, TPCx-HS, the Transaction Processing Performance Council’s benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued. One, called BigBench, is based on extending the TPC-Decision Support (TPC-DS) benchmark with big data application characteristics. The other, called Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.
We conclude with a discussion of a number of future directions for big data benchmarking.
– Introduction to benchmarking: What are TPC and SPEC; what is each organization’s role and approach to benchmarking.
– Characteristics of good industry standard benchmarks: Why has TPC-C lasted so long? Brief overview of TPCx-HS.
– Overview of big data benchmarking approaches: From micro-benchmarks to application-level pipelines.
– Application scenarios and use cases: Big data scenarios and use cases that help define the application-level benchmark.
BigBench: In-depth discussion of an example, proposed big data benchmark
– Data generation: Synthetic data generation for big data.
– The benchmarking process: Steps involved in setting up, executing, and verifying end-to-end benchmarks.
– Benchmark metrics: Existing metrics for industry standards, and appropriate metrics for big data benchmarks.
– Discussion of performance results on a small 6-node cluster at Intel and a large, 540-node cluster at Pivotal.
– Possible extensions to BigBench
Benchmarking challenges and future directions
– Modeling system failures in the benchmark; extrapolating from one scale factor to the next; benchmarking for new application scenarios, e.g. the Internet of Things.
Dr. Chaitan Baru and Dr. Tilmann Rabl's Profile
Dr. Baru and Dr. Rabl have been collaborating since 2012 on the topic of big data benchmarking. They were both instrumental in starting the Workshops on Big Data Benchmarking series, and serve on the Steering Committee for these workshops. Five workshops have been held so far: in May 2012 (San Jose), December 2012 (Pune, India), July 2013 (Xi’an, China), October 2013 (San Jose), and August 2014 (Potsdam, Germany). They are co-editors of the Springer Verlag Lecture Notes in Computer Science series on Specifying Big Data Benchmarks. They have also co-authored three papers:
• Big Data Benchmarking and the BigData Top100 List, C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, T. Rabl, Big Data Journal, Vol. 1, No. 1, Mary Ann Liebert Inc. Publishers, http://online.liebertpub.com/toc/big/1/1.
• Setting the Direction for Big Data Benchmark Standards, C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, T. Rabl, TPC Technical Conference, VLDB 2012, Aug 27-30, Istanbul, Turkey. http://link.springer.com/chapter/10.1007/978-3-642-36727-4_14.
• Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data, Baru, Bhandarkar, Curino, Danisch, Frank, Gowda, Jacobsen, Jie, Kumar, Nambiar, Poess, Raab, Rabl, Ravi, Sachs, Sen, Yi, Youn, Proceedings of the TPC Technical Conference, VLDB 2014, September, Hangzhou, China.
Furthermore, Dr. Rabl is the Chair of the recently formed SPEC Research Group on Big Data Benchmarking, and Dr. Baru is co-Chair. Thus, even though the tutorial instructors are from different institutions, they have worked together closely for several years, and have a continuing working relationship.
The tutorial presentation slides can be downloaded here.