IEEE Big Data 2017 Tutorials

Tutorial 1: Enterprise Knowledge Graphs for Large Scale Analytics

  • Nidhi Rajshree (Contact Author)
  • IBM Watson San Jose, USA

  • Nitish Aggarwal
  • IBM Watson San Jose, USA

  • Sumit Bhatia
  • IBM Research New Delhi, India

  • Anshu Jain
  • IBM Research Almaden San Jose, USA

In recent years, there has been considerable effort to facilitate user-friendly access to vast amounts of heterogeneous text data, ranging from news articles, social media posts, and scientific publications to documents associated with various domains (such as corporate reports, legal acts, patient histories, advertisements, and security). Transforming such a massive variety of unstructured text into actionable knowledge is a grand challenge for the research community. Through this tutorial, we aim to present a comprehensive catalog of best practices for building such large-scale enterprise knowledge graphs and for enabling them to provide user-friendly access to large amounts of unstructured text data through various analytic applications. We will share our experiences with the challenges of constructing the Knowledge Graph in IBM Watson Discovery Services and with its applications in the life sciences and intelligence domains.

Short Bio
Nidhi Rajshree is the architect of the Knowledge Graph technology within IBM Watson Discovery Services. She also manages a group of researchers and engineers working on the Knowledge Graph technology. In her previous role, she worked on various research problems related to data mining and text analytics in the area of Service Science and Education research. She has spearheaded multiple research projects at IBM Research on Intelligent Information Systems that leverage information extraction and retrieval techniques for decision making, recommendation, and knowledge discovery. Her efforts in this space have led to novel contributions in the form of patents and papers published in top conferences.
Nitish Aggarwal is a Research Scientist in the Watson Knowledge Graph Department at IBM Watson, Almaden Research Centre, USA, where he leads the research effort in building intelligent industrial applications using knowledge graphs. He received his PhD from the Insight Centre for Data Analytics, National University of Ireland. He works at the intersection of natural language processing, information retrieval, and semantic web technologies. Nitish has contributed to several European-funded projects in the area of knowledge graph construction, mining, and retrieval. He was the organizing chair of the Proactive Information Retrieval workshop, collocated with ECIR 2016, and has served on the program committees of multiple conferences and journals including ISWC, ESWC, AAAI, ACL, WWW, JASIST, IPM, SWJ, and JWS.
Sumit Bhatia is a Research Scientist in the Knowledge Engineering Department at IBM India Research Laboratory, where he is working on developing a shared knowledge infrastructure for different client engagements. Previously, as a Researcher at IBM Watson, he led the development of cognitive analytic algorithms built on top of Watson's Knowledge Graph. He was a Post-doctoral Researcher at Xerox PARC and was part of the CiteSeerX project at Penn State University. Sumit's primary research interests are in the fields of Knowledge Management, Information Retrieval, and Text Analytics, and he has published 25+ papers in top journals and conferences. He was the organizing chair of the Proactive Information Retrieval workshop, collocated with ECIR 2016, and the Social Multimedia Data Mining Workshop, collocated with ICDM 2014. He has served as a reviewer for multiple conferences and journals including WWW, CIKM, ACL, TKDE, TOIS, WebDB, JASIST, IJCAI, and AAAI.
Anshu Jain is the PI for AI and Deep Learning Applications at IBM Almaden Research Center, responsible for defining the AI strategy at the lab. In his previous role as the Principal Investigator of Knowledge Graph technology at IBM Watson, Anshu led a team of researchers and engineers to deliver multiple products, including the Watson Discovery Advisor on-premises product and the Watson Discovery Service GA release on the IBM Watson Cloud. At IBM Research, he led major research efforts in the area of text analytics for service quality and building consumable analytics tools. He is the author of several US patent applications in the areas of service delivery, text analytics, and user interaction, and of peer-reviewed papers in technical journals.

    -- Motivation for Knowledge Graphs
    -- Knowledge Graphs in general and industry
        -- Big Data Problems
            -- Volume
            -- Velocity (frequent updates, especially, in social media)
            -- Veracity (multiple sources, heterogeneous data)
  Construction of KGs
    -- Overview of state of art
        -- Ontology/Schema oriented
        -- Schema agnostic
    -- Challenges:
        -- Noise and refinement
        -- Veracity aspects
        -- Scale and infrastructure issues (Velocity and Volume)
  Analytic Problems
    -- Entity disambiguation
    -- Entity recommendation
    -- Relationship ranking
    -- Question answering
  Case Study: Knowledge Graph in Watson Discovery for Life Sciences and Intelligence

Tutorial 2: Popularity on the Web: From Estimation to Prediction

  • Charalampos Chelmis, Assistant Professor (Contact Author)
  • University at Albany - SUNY

  • Daphney-Stavroula Zois, Assistant Professor
  • University at Albany - SUNY

The sharing of content on the Web has become an important mechanism by which people promote themselves, as well as discover and consume information, services, and products online. In certain instances, a product, a photo, a news article, or other piece of information may get reshared multiple times (i.e., a user shares content with her set of friends, several of whom share it with their respective sets of friends, and so on, such that the content potentially reaches a large number of people), liked or “pinned” (e.g., on a content sharing service such as Pinterest), highly reviewed (e.g., on Amazon), or cited (e.g., academic publications in Google Scholar). A growing body of research has focused on characterizing such aspects of “popularity”, identifying its characteristics, and estimating and predicting its dynamics in these domains. Popularity estimation and prediction are problems of particular interest with multiple applications, including facilitating better provision of resources, marketing and monetization, and blocking of illegal content. The goal of this tutorial is to (1) perform an in-depth study of the fundamental properties and similarities of popularity estimation and prediction with an emphasis on the algorithmic techniques and key ideas developed to derive efficient solutions; (2) identify the universal challenges associated with approaching the estimation and prediction tasks regardless of domain; and (3) summarize the most promising paths for future research.

Short Bio
Dr. Charalampos Chelmis is an Assistant Professor in Computer Science at the University at Albany, where he leads the Intelligent Big Data Analytics, Applications, and Systems group. Dr. Chelmis received his Ph.D. and M.Sc. in Computer Science from the University of Southern California, and his B.Eng. in Computer Engineering and Informatics from the University of Patras, Greece. He specializes in Big Data analytics and machine learning with applications in online social media and data science. He was Senior Research Associate at the University of Southern California during 2013–2016. He has published over 10 refereed articles on the analysis, modeling, and accurate prediction of process dynamics on large-scale real-world networks. He is Guest Co-Editor of the Encyclopedia of Social Network Analysis and Mining. Dr. Chelmis also serves as Co-Chair of multiple international conferences and workshops, including the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. He is also serving as a program committee member of the WWW, SocialCom, and HighPC conferences, and as a reviewer for top journals such as PLOS ONE, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Computational Social Systems, and IEEE Transactions on Information Forensics and Security.
Dr. Daphney-Stavroula Zois is an Assistant Professor in Electrical and Computer Engineering at the University at Albany, where she leads the Decision Making for Ambient Intelligence Environments group. Dr. Zois received her Ph.D. and M.Sc. in Electrical Engineering from the University of Southern California, and her B.Eng. in Computer Engineering and Informatics from the University of Patras, Greece. She specializes in signal processing, control, and machine learning, focusing on continuous and resource-efficient sensing, estimation, and decision-making that adapts to information learned from unknown and uncertain environments. She was Postdoctoral Research Associate at the University of Illinois, Urbana-Champaign during 2014–2016. She has published various refereed conference and journal articles on the design of estimation and sequential decision-making frameworks and algorithms in a variety of Internet of Things applications (e.g., e-health, intelligent transportation, computer vision). Dr. Zois serves as a reviewer for high-impact journals such as IEEE Transactions on Signal Processing, IEEE Transactions on Automatic Control, and ACM Transactions on Sensor Networks, and for conferences such as NIPS, ICASSP, and ISIT.

1. Introduction, context, and fundamental results
1.1. Motivation and background
1.2. Real-world examples
1.3. Fundamental differences between popularity estimation and prediction
1.4. Universal challenges associated with popularity estimation and prediction
1.5. Datasets and Competitions
1.6. End goals (e.g., monetization, blocking of illegal content, marketing, resource planning, investment)
2. Popularity Estimation
2.1. Systems and measures of popularity estimation
2.2. Data mining and machine learning approaches for popularity estimation
2.3. Case studies, including products on websites such as Amazon and eBay, events on Ticketmaster, posts on Digg, news and buzzwords in social media, photos on Instagram, GitHub repositories, economic trends, pirated content, music artists, videos, hotels, and resources
3. Popularity Prediction
3.1. Systems and measures of popularity prediction
3.2. Data mining and machine learning approaches for popularity prediction
3.3. Case studies, including election results, videos on YouTube and images on Instagram, tweets or hashtags on Twitter, GitHub repositories, adolescent popularity, products on websites such as Etsy, movies on TV and at the box office, and news
4. Challenges and directions for future research
5. Conclusion

Tutorial 3: Security and Automated Platform Development for Big Data Analytics

  • Jun (Luke) Huan, Professor (Contact Author)
  • University of Kansas

  • Sohaib Kiani, Ph.D. Candidate
  • University of Kansas

  • Xiaoli Li, Ph.D. Candidate
  • University of Kansas

Data science is penetrating virtually every aspect of our society. However, data science systems (including data acquisition and processing pipelines and analytical techniques such as deep learning) are becoming increasingly complex. Many data analytics and predictive analytics algorithms and systems are not transparent to the end user. For example, it is often unclear how the underlying models work and when such models may fail. Many approaches, especially those that apply to human subjects, may learn and reinforce pre-existing biases, leading, for example, to unfair treatment of minority sections of a population. Enabling widespread adoption of data science approaches requires assurances that the system will operate safely and securely, in a controlled and transparent manner. However, current research in this area is very limited.
In this tutorial, we plan to cover a set of theories behind secure data analytics. We review recent efforts in developing algorithms to achieve data science safety using different techniques based on various evaluation metrics. We use several real-world applications of safe data science to further illustrate the importance of the topic. We also review efforts to provide an open analytics platform. We conclude the tutorial by pointing out challenges and open issues in current research on safe data science, along with future research directions.

Short Bio
Dr. Jun (Luke) Huan is a Professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Data Science and Computational Life Sciences Laboratory at KU Information and Telecommunication Technology Center (ITTC). He holds courtesy appointments at the KU Bioinformatics Center, the KU Bioengineering Program, and a visiting professorship from GlaxoSmithKline plc. Dr. Huan received his Ph.D. in Computer Science from the University of North Carolina.
Dr. Huan works on data science, machine learning, data mining, big data, and interdisciplinary topics including bioinformatics. He has published more than 120 peer-reviewed papers in leading conferences and journals and has graduated more than ten graduate students, including seven PhDs. Dr. Huan serves on the editorial boards of several international journals, including the Springer Journal of Big Data, the Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics. He regularly serves on the program committees of top-tier international conferences on machine learning, data mining, big data, and bioinformatics.
Dr. Huan's research is recognized internationally. He was a recipient of the prestigious National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work has appeared in mass media outlets including Science Daily, R&D Magazine, and EurekAlert (sponsored by AAAS). Dr. Huan's research has been supported by NSF, NIH, DoD, and the University of Kansas.
Starting January 2016, Dr. Huan serves as a Program Director in NSF/CISE/IIS and is on leave from KU. At NSF, Dr. Huan manages programs such as Big Data, IIS core program, CAREER, and Partnership of International Research and Education.
Xiaoli Li is a Ph.D. student in the Department of Electrical Engineering and Computer Science at the University of Kansas. Her research interests include transparent machine learning, Bayesian non-parametric methods, and multi-task, multi-view, and multi-label learning.
Sohaib Kiani is a PhD student in Computer Science at the University of Kansas. His research interests include security in machine learning applications, and optimization techniques to find hyper-parameters of deep networks. He has served as Teaching Assistant for undergraduate level courses in Object-oriented Programming and Data Structures for 3 years.

Security in Data Analytics:
1. Problem Formulation
2. Adversarial Attacks for Deep Networks
3. Adversarial Detection or Defense Mechanisms
4. Transferability of Adversarial Attacks
Automated Machine Learning Platform (WOLF)
1. Motivation and aim
2. Platform architecture
3. Demo

Tutorial 4: Time Series Data Mining using the Matrix Profile: A Unifying View of Motif Discovery, Anomaly Detection, Segmentation, Classification, Clustering and Similarity Joins

  • Abdullah Mueen, Assistant Professor (Contact Author)
  • University of New Mexico

  • Eamonn Keogh, Professor
  • University of California Riverside

Time series data mining is a perennially popular research topic in ACM SIGKDD, due to the ubiquity of time series in medical, financial, industrial, and scientific domains. There are about a dozen major time series data mining tasks, including:
• Time Series Motif Discovery
• Time Series Joins
• Time Series Classification (shapelet discovery)
• Time Series Density Estimation
• Time Series Semantic Segmentation
• Time Series Visualization
• Time Series Clustering
• Time Series Similarity Search (indexing)
• Time Series Monitoring (complex event processing)
In 2016, an international group of researchers introduced the Matrix Profile, with the following two surprising claims. Firstly, if you have the Matrix Profile computed, then all time series data mining tasks are easy or trivial, and secondly, computing the Matrix Profile is unexpectedly scalable, and is completely free of the curse of dimensionality. Given these two facts, the Matrix Profile is poised to become an incredibly useful and ubiquitous primitive for time series data mining. It is difficult to overstate the scalability of the Matrix Profile computation: it has been used to perform ten quadrillion exact pairwise comparisons of a single time series during a self-join, surely the largest exact self-join ever attempted.
In this tutorial, two of the inventors of the Matrix Profile will explain how to use it efficiently to solve problems in time series analytics. The tutorial will be illustrated with case studies from domains as diverse as entomology, oil-and-gas production, music, bioinformatics, medicine, seismology, and human behavior understanding. All attendees will be given free access to a MATLAB toolbox that will allow them to immediately leverage the power of the Matrix Profile, and start building their own novel applications and extensions.
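To make the primitive concrete, here is a minimal brute-force sketch of a self-join Matrix Profile in plain Python. It is an illustration only, not the tutorial's MATLAB toolbox and not one of the scalable algorithms listed in the outline: the function names, the z-normalized Euclidean distance, and the exclusion-zone width of half the subsequence length are all assumptions made for this example.

```python
import math


def znorm(x):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return [0.0] * n
    std = math.sqrt(var)
    return [(v - mean) / std for v in x]


def matrix_profile_brute_force(ts, m):
    """Return (profile, index) for a self-join with subsequence length m.

    profile[i] is the distance from subsequence i to its nearest
    non-trivial neighbor, and index[i] is that neighbor's offset.
    Cost is quadratic in the number of subsequences.
    """
    n_sub = len(ts) - m + 1
    if m < 2 or n_sub < 2:
        raise ValueError("need m >= 2 and at least two subsequences")
    subs = [znorm(ts[i:i + m]) for i in range(n_sub)]
    excl = m // 2  # assumed exclusion zone that suppresses trivial matches
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) <= excl:
                continue  # skip the subsequence itself and close overlaps
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < profile[i]:
                profile[i] = d
                index[i] = j
    return profile, index


# Toy series: the shape 0, 1, 3, 1 is planted at offsets 0 and 10.
ts = [0, 1, 3, 1, 0, 5, 2, 7, 4, 9, 0, 1, 3, 1, 0]
profile, index = matrix_profile_brute_force(ts, 4)

# Lowest profile value marks a motif; highest marks a discord (anomaly).
motif_at = min(range(len(profile)), key=profile.__getitem__)
discord_at = max(range(len(profile)), key=profile.__getitem__)
print(motif_at, index[motif_at], discord_at)
```

Reading off the minimum and maximum of the profile is what makes motif discovery and discord discovery nearly trivial once the profile exists; the hard part, which this sketch does not address, is computing it at scale.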

Short Bio
Dr. Mueen is an Assistant Professor in Computer Science at the University of New Mexico. He was runner-up in the SIGKDD doctoral dissertation contest in 2012 and won a best paper award at the same conference. His research work has been published in top conferences including SIGMOD, KDD, and ICDM. He has recently given well-received tutorials at the SIGKDD, CIKM, SDM, and ICDM conferences on repeated pattern mining, similarity search, and speeding up DTW.
Dr. Keogh, a full professor at UCR, is a top-ten most prolific author in all three of the top ranked data mining conferences, SIGKDD, ICDM and SDM, and the most prolific author in the Data Mining and Knowledge Discovery Journal. He has won best paper awards at ICDM, SIGKDD and SIGMOD. His H-index of 75 reflects the influence of time series representations and algorithms. He has given well-received tutorials at SIGKDD (four times), ICDM (four times), VLDB, SDM, and CIKM.

1. Introduction
2. What is MP?
3. Applications of MP in Data Mining
3.1. Motif Discovery
3.2. Shapelet Discovery
3.3. Discord Discovery
3.4. Time Series Segmentation
3.5. Time Series Join
4. Algorithms to compute MP
4.1. Brute-Force MP
4.2. Anytime MP
4.3. Optimum MP
4.4. GPU-based MP
4.5. Weighted/filtered MP
4.6. Distributed MP
5. Open Problems (Including: Five great problems for an ambitious grad student to work on)
6. Summary, Conclusions

Tutorial 5: Mathematics of Big Data

  • Jeremy Kepner (Contact Author)
  • MIT Lincoln Laboratory Supercomputing Center

Big Data describes a new era in the digital age in which the volume, velocity, and variety of data created across a wide range of fields (e.g., internet search, healthcare, finance, social media, defense, ...) are increasing at a rate well beyond our ability to analyze the data. Tools such as spreadsheets, databases, matrices, and graphs have been developed to address these challenges. The common theme amongst these tools is the need to store and operate on data as whole sets instead of as individual data elements. This tutorial provides hands-on programming examples that illustrate the common mathematical foundations of these data sets (associative arrays) that apply across many applications and technologies. Associative arrays unify and simplify data, leading to rapid solutions to volume, velocity, and variety problems. Understanding the mathematical underpinnings of big data allows the student to see past the differences that lie on the surface of these tools and to leverage their mathematical similarities to solve the hardest big data challenges. Specifically, understanding associative arrays (1) reduces the effort required to pass data between steps in a data processing system, (2) allows steps to be interchanged with full confidence that the results will be unchanged, and (3) makes it possible to recognize when steps can be simplified or eliminated.
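As a rough illustration of the idea, the sketch below models an associative array as a Python dictionary keyed by (row, column) pairs of strings, with a transpose and a matrix-style multiplication. This is not the tutorial's software or any particular library's API; the function names, the dictionary representation, and the sample log records are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def transpose(a):
    """Swap row and column keys."""
    return {(c, r): v for (r, c), v in a.items()}


def multiply(a, b):
    """Array multiplication: sum a[r, k] * b[k, c] over shared inner keys k."""
    b_rows = defaultdict(list)
    for (k, c), v in b.items():
        b_rows[k].append((c, v))
    out = defaultdict(int)
    for (r, k), v in a.items():
        for c, w in b_rows.get(k, ()):
            out[(r, c)] += v * w
    return dict(out)


# Hypothetical records: rows are log entries, columns are "field|value" keys.
E = {
    ("log1", "ip|1.1.1.1"): 1,
    ("log1", "domain|a.com"): 1,
    ("log2", "ip|1.1.1.1"): 1,
    ("log2", "domain|b.com"): 1,
}

# Correlating columns through shared rows: co[x, y] counts the records
# in which column keys x and y appear together.
co = multiply(transpose(E), E)
print(co[("ip|1.1.1.1", "domain|a.com")])

# Associativity lets the grouping of steps change without changing the result.
left = multiply(multiply(transpose(E), E), transpose(E))
right = multiply(transpose(E), multiply(E, transpose(E)))
print(left == right)
```

The same three functions work whether the data started life as a spreadsheet, a database table, or a graph edge list, which is the sense in which a common representation reduces translation between pipeline steps.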

Short Bio
Dr. Jeremy Kepner is an MIT Lincoln Laboratory Fellow. He founded the Lincoln Laboratory Supercomputing Center and pioneered the establishment of the Massachusetts Green High Performance Computing Center. He has developed novel big data and parallel computing software used by thousands of scientists and engineers worldwide. He has led several embedded computing efforts, which earned him a 2011 R&D 100 Award. Dr. Kepner has chaired SIAM Data Mining, the IEEE Big Data conference, and the IEEE High Performance Extreme Computing conference. Dr. Kepner is the author of two bestselling books, Parallel MATLAB for Multicore and Multinode Computers and Graph Algorithms in the Language of Linear Algebra. His peer-reviewed publications include works on abstract algebra, astronomy, astrophysics, cloud computing, cybersecurity, data mining, databases, graph algorithms, health sciences, plasma physics, signal processing, and 3D visualization. In 2014, he received Lincoln Laboratory's Technical Excellence Award.

1. Motivating examples of big data processing pipelines in various fields
a. Internet search (e.g., PageRank)
b. Healthcare (e.g., DNA sequence analysis)
c. Finance (e.g., mixed auctions)
d. Social Media (e.g., analyzing Twitter)
e. Computer Networks (e.g., correlating IP addresses)
f. Machine Learning (e.g., deep neural networks)
2. Introduction to the underlying mathematics of important big data technologies
a. Spreadsheets as arrays, databases, and time series
b. SQL, NoSQL, and NewSQL databases
c. Transforming and correlating data with matrices
d. Graphs with directed/undirected, weighted/unweighted, multi, and hyper edges
3. Introduction to associative arrays and their useful mathematical properties
a. A common mathematical representation of data across the steps of a data processing pipeline can reduce the amount of translation required between steps
b. The ability to swap operations between and among data processing steps by exploiting the associativity, commutativity, and distributivity properties
c. Elimination of steps by exploiting the linearity of common data transformations
4. Hands-on interactive big data demonstrations of associative arrays using Jupyter
a. Importing CSV, TSV, and other types of data
b. Interacting with databases
c. Summarizing data with histograms and degree distributions
d. Displaying data using adjacency arrays and incidence arrays
e. Correlating and transforming data
f. Understanding deep neural networks

Tutorial 6: Industrial Big Data for Industrial Applications – Systematic Methodology

  • David Siegel (Contact Author)
  • Predictronics Corp

  • Jay Lee, Professor
  • University of Cincinnati

  • Hossein Davari, Post-doctoral Fellow
  • University of Cincinnati

  • Brian Weiss
  • National Institute of Standards and Technology

Industrial big data presents significant opportunities for organizations to improve their operation, reduce maintenance cost, and have higher productivity. These potential benefits can only be properly harnessed if one can extract actionable information and value from these large industrial data sets. This tutorial will first highlight the differences between industrial big data and other big data applications, including the structure of the data, the data quality, the volume of data, and the balance of the data classes. The tutorial will then focus on the data analysis methodology for predictive monitoring, including pre-processing and data quality checks, feature engineering, health index and anomaly detection algorithms, diagnosis and prognostic methods. In addition to the data analysis methodology, test methods, verification and validation approaches will also be included and discussed. Industrial case studies in manufacturing, transportation, and energy domains will be shown to highlight the methodology and the successful use cases in industry. Lastly, some concluding remarks on the future direction for industrial big data and the unmet challenges will be discussed.

Short Bio
Dr. Jay Lee is Ohio Eminent Scholar, L.W. Scott Alter Chair Professor, and Distinguished Univ. Professor at the Univ. of Cincinnati, and is founding director of the National Science Foundation (NSF) Industry/University Cooperative Research Center (I/UCRC) on Intelligent Maintenance Systems, a multi-campus center consisting of the Univ. of Cincinnati (lead institution), the Univ. of Michigan, Missouri Univ. of S&T, and the Univ. of Texas-Austin. Since its inception in 2001, the Center has been supported by over 85 global companies. His current research focuses on Industrial Big Data Analytics, Cyber Physical Systems, and Self-Aware Asset Management Systems. He is one of the pioneers in the field of Intelligent Maintenance Systems, Prognostics and Health Management (PHM), as well as Predictive Analytics of Asset Management. He has mentored students who have won 1st prize in the PHM Data Challenge five times since 2008. He also mentored his students in developing a spin-off company, Predictronics, through an NSF I-Corps Award in 2012. He was selected as one of 30 Visionaries in Smart Manufacturing by SME in May 2016. Currently, he serves as a member of the Technical Advisory Committee (TAC) of Digital Manufacturing and Design Innovation (DMDI), as well as a member of the Leadership Council of MForesight, an NSF/NIST-funded manufacturing think tank (Sept. 2015). He was also invited to be a member of the White House Cyber Physical Systems (CPS) American Challenge Program in Dec. 2013.
Dr. David Siegel is currently the chief technology officer for Predictronics Corp. His current role includes developing the technology road map for the company's predictive monitoring software and service solutions, developing new algorithms and methodologies, and leading a data science team to carry out the customization and deployment of various predictive monitoring solutions. Prior to joining Predictronics, Dr. Siegel was a research assistant at the Center for Intelligent Maintenance Systems at the University of Cincinnati. During his time there, he led numerous research efforts on customized diagnostic and prognostic software development for a variety of industrial customers and applications. A sample of these research efforts includes advanced diagnostic methods for industrial robots, health monitoring systems for railway applications, failure prediction tools for machine tool bearings, and intelligent maintenance systems for military ground vehicles. Dr. Siegel is also a two-time winner of the Prognostics and Health Management Data Challenge and has won several best paper awards at various conferences focused on predictive monitoring and data analytics.
Dr. Brian A. Weiss is a mechanical engineer and the project leader of the Prognostics, Health Management, and Control (PHMC) project within the Engineering Laboratory (EL) at the National Institute of Standards and Technology (NIST). His current research efforts are focused on developing the necessary measurement science to verify and validate emerging monitoring, diagnostic, and prognostic technologies and strategies for smart manufacturing to enable manufacturers to respond to planned and unplanned performance changes. Dr. Weiss is taking a hierarchical approach in this project by leading research efforts at the component, work cell, and system levels. The project is focused on the application domains of machine tools and robot systems. From 2013 to 2016, Dr. Weiss also served as the Associate Program Manager for the Smart Manufacturing Operations Planning and Control (SMOPAC) program, which contains his PHMC project. Prior to his manufacturing research, he spent 15 years conducting performance assessments across numerous military and first response technologies including autonomous unmanned ground vehicles; tactical applications operating on Android™ devices; advanced soldier sensor technologies; free-form, two-way, speech-to-speech translation devices for tactical use; urban search and rescue robots; and bomb disposal robots. He also spent six years developing robotic crane technologies, which included the deployment of a prototype system on a military installation. His efforts have earned him numerous awards including a GCN for IT Excellence (2014), Department of Commerce Gold Medal (2013), Colleague’s Choice Award (2013), Silver Medal (2011), Bronze Medals (2004 and 2008), and the Jacob Rabinow Applied Research Award (2006). He has a B.S. in Mechanical Engineering (2000), Professional Masters in Engineering (2003), and Ph.D. in Mechanical Engineering (2012) from the University of Maryland, College Park, Maryland, USA.
Dr. Hossein Davari is currently the associate director of the NSF I/UCRC for Intelligent Maintenance Systems (IMS), and an adjunct faculty member in the Dept. of Mechanical and Materials Engineering at the University of Cincinnati. His current role in the IMS center includes conducting research in the field of Prognostics and Health Management (PHM) for rotating machinery, manufacturing, health care, and sports analytics applications, along with exploring novel research areas in this field and formulating projects with IMS industry members. During his PhD studies, Dr. Davari worked at the IMS Center as a Research Assistant and contributed to the development of PHM systems for multistage manufacturing processes and machinery including wind turbines, gearboxes, induction motors, and railway transportation systems. Additional prior experience includes an internship at the Goodyear Tire and Rubber Company, in which he developed a quality dependency map for monitoring the manufacturing assets based on the product quality measurements.

Introduction to Industrial Big Data – Trends and Recent Advances
a. Industrial Big Data vs. Big Data
b. Business Case and Past Success Stories
c. Unmet Challenges
Industrial Big Data – Data Analysis Methodology
a. Data Pre-Processing, Segmentation, and Data Quality Analysis
b. Feature Engineering
i. Waveform Based Feature Extraction Methods
ii. Low Frequency Based Feature Analysis Techniques
c. Feature Selection Methods and Approaches
d. Anomaly Detection / Health Index Algorithms
e. Diagnostic Methods
f. Prognostic Algorithms and Approaches
Test Methods, Reference Data Sets, Verification and Validation (V&V)
a. Test Methods – Manufacturing Examples
b. Reference Data Sets
c. Verification and Validation Procedures
d. Discussion on V&V and Future Work
Industrial Case Studies
a. Wind Turbine Predictive Monitoring Example
i. Data Pre-Processing
ii. Vibration-Based Features
iii. Residual Based Health Index
iv. Discussion of Results
b. Manufacturing / Product Quality Example
i. Data Pre-Processing
ii. Peer-to-Peer Analysis Approach
iii. Distribution-Based Health Index
iv. Discussion of Results
c. NIST Case Study (Linear Axis, Robot System Work cell, Spindle, Factory Level)
Concluding Remarks
a. Successes and Lessons Learned
b. Future Direction – Research and Commercial Perspectives

Tutorial 7: Game Theory for Data Science: Eliciting truthful information

  • Boi Faltings, Professor (Contact Author)
  • Swiss Federal Institute of Technology (EPFL)

  • Goran Radanovic, Post-doctoral Researcher
  • Harvard University

As Big Data is increasingly used as a basis for decision making, it becomes important to ensure its quality. Often, data is provided by other agents, for example in sensor networks, user-contributed content, or crowdsourcing. Providing accurate and relevant data requires costly effort that agents may not always be willing to provide. Thus, it becomes important both to verify the correctness of data and to provide incentives so that agents that provide high-quality data are rewarded while those that do not are discouraged by low rewards. We will show how game theory makes such rewards possible.
We will cover different settings and the assumptions they admit, including sensing, human computation, peer grading, reviews and predictions. We will survey different incentive mechanisms, including proper scoring rules, prediction markets and peer prediction, Bayesian Truth Serum, Peer Truth Serum, and the settings where each of them would be suitable. As an alternative, we also consider reputation mechanisms. We complement the game-theoretic analysis with practical examples of applications in prediction platforms, community sensing and peer grading.
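
To illustrate what makes a scoring rule "proper", the following minimal sketch checks numerically that an agent maximizes its expected reward by reporting its true belief. The belief values, the function names and the grid search are assumptions made for this illustration only; they are not taken from the tutorial material.

```python
import math

def log_score(report, outcome):
    """Logarithmic scoring rule: log of the probability assigned to the realized outcome."""
    return math.log(report[outcome])

def quadratic_score(report, outcome):
    """Quadratic (Brier-style) scoring rule: 2 * q[outcome] - sum of squared probabilities."""
    return 2 * report[outcome] - sum(q * q for q in report)

def expected_score(rule, belief, report):
    """Expected reward of `report` when outcomes are drawn from the agent's true `belief`."""
    return sum(p * rule(report, x) for x, p in enumerate(belief))

# Hypothetical true belief of the agent over a binary outcome.
belief = [0.7, 0.3]

# Candidate reports on a grid of step 0.01 (endpoints excluded so that log is defined).
candidates = [[q / 100, 1 - q / 100] for q in range(1, 100)]

best_log = max(candidates, key=lambda r: expected_score(log_score, belief, r))
best_quadratic = max(candidates, key=lambda r: expected_score(quadratic_score, belief, r))

print("best report under log rule:      ", best_log)
print("best report under quadratic rule:", best_quadratic)
```

Under both rules the best report on the grid coincides with the true belief. This only works when the outcome can be verified later; for unverifiable information the score has to be computed against other agents' reports, which is what the peer-based mechanisms listed above address.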

Short Bio
Boi Faltings is a full professor at EPFL and has worked in AI since 1983. He is one of the pioneers of mechanisms for truthful information elicitation, with his first work on the topic dating back to 2003. He has taught AI and multiagent systems to students at the Swiss Federal Institute of Technology for 28 years. He is a fellow of AAAI and ECCAI, was an area chair of IJCAI-16, and served as tutorial chair of IJCAI-13.
Goran Radanovic is a post-doctoral researcher at Harvard University. He has worked on mechanisms for information elicitation since 2011, including during his Ph.D., which he received from EPFL in 2016. His work has been published mainly at AI conferences (see publication list). He participated in teaching a course on multi-agent systems from 2012 to 2016.

Introduction (15 min)
1. Practical scenarios (10 min)
2. Formal models (5 min)
Eliciting verifiable information (30 min)
3. Eliciting distributions using proper scoring rules (10 min)
4. Combining elicitation and aggregation (20 min)
a. Prediction markets: market maker with a logarithmic scoring rule (10 min)
b. Reputation based approach: The influence limiter (10 min)
Eliciting unverifiable information: parametric mechanisms (30 min)
5. Objective information: output agreement (5 min)
6. Subjective information: Peer prediction (25 min):
a. Classical peer prediction and its extensions
b. The shadowing method and the peer truth serum
Eliciting unverifiable information: non-parametric mechanisms (30 min)
7. Bayesian Truth Serum (BTS) (10 min)
8. PTS for Crowdsourcing (10 min)
9. Correlated Agreement (10 min)
Integration with machine learning (15 min)

Tutorial 8: Anti-discrimination Learning: From Association to Causation

  • Lu Zhang, Post-doctoral Fellow (Contact Author)
  • University of Arkansas

  • Yongkai Wu, Ph.D. student
  • University of Arkansas

  • Xintao Wu, Professor
  • University of Arkansas

Anti-discrimination learning is an increasingly important task in the data mining and machine learning fields. Discrimination discovery is the problem of unveiling discriminatory practices by analyzing a dataset of historical decision records, and discrimination prevention aims to remove discrimination by modifying the biased data and/or the predictive algorithms. Discrimination is causal, which means that to prove discrimination one needs to establish a causal relationship rather than a mere association. Although it is well known that association does not imply causation, many researchers pay too little attention to the gap between the two. The aim of this tutorial is to survey existing association-based approaches and point out their limitations, introduce a causal modeling-based framework, cover the very latest research on causal modeling-based fairness-aware learning, and finally suggest potential future research directions.
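
The gap between association and causation can be made concrete with a small sketch. The admission counts below are invented for illustration, and the stratified comparison is a generic device, not the causal framework presented in the tutorial. In this hypothetical data, admission rates differ sharply between two groups in aggregate, yet they are identical within each department.

```python
# Hypothetical counts: (department, group) -> (applicants, admitted).
# All numbers are made up for this illustration.
records = {
    ("A", "m"): (80, 64),
    ("A", "f"): (20, 16),
    ("B", "m"): (20, 4),
    ("B", "f"): (80, 16),
}

def rate(group, dept=None):
    """Admission rate for `group`, optionally restricted to one department."""
    rows = [v for (d, g), v in records.items()
            if g == group and (dept is None or d == dept)]
    applicants = sum(n for n, _ in rows)
    admitted = sum(k for _, k in rows)
    return admitted / applicants

# Association-based measure: difference in aggregate admission rates.
overall_gap = rate("m") - rate("f")

# Stratified measure: the same difference computed within each department.
gap_by_dept = {d: rate("m", d) - rate("f", d) for d in ("A", "B")}

print("aggregate gap:", round(overall_gap, 2))
print("gap within each department:", gap_by_dept)
```

Which of these numbers, if any, measures discrimination depends on the assumed causal graph. For example, it matters whether the choice of department is itself influenced by the protected attribute, and whether that path counts as legitimate. An association-based measure alone cannot settle that question.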

Short Bio
Dr. Lu Zhang is a postdoctoral researcher in the Computer Science and Computer Engineering Department, University of Arkansas. He received the BEng degree in computer science and engineering from the University of Science and Technology of China, in 2008, and the PhD degree in computer science from Nanyang Technological University, Singapore in 2013. His research interests include data mining algorithms, discrimination-aware data mining, and causal inference.
Yongkai Wu is a Ph.D. student in the Department of Computer Science and Computer Engineering at the University of Arkansas.
Dr. Xintao Wu is a Professor in the Department of Computer Science and Computer Engineering at the University of Arkansas. His major research interests include data privacy, bioinformatics and discrimination-aware data mining.

1. Introduction
1.1. Context
1.2. Literature review
2. Causal modeling background
2.1. From statistics to causal inference
2.2. Structural causal model and causal graph
2.3. Causal effect inference
3. Anti-discrimination learning
3.1. Causal model-based anti-discrimination framework
3.2. Direct and indirect discrimination
3.3. Discrimination in prediction
3.4. Discrimination in data release
3.5. Individual discrimination
4. Challenges and directions for future research