SOCR News ISI DSPA Training 2022

SOCR News & Events: 2022 ISI Short Course - Data Science and Predictive Analytics (DSPA)

Instructor

Ivo Dinov, University of Michigan, SOCR, MIDAS.

Dr. Dinov is a professor of Health Behavior and Biological Sciences and Computational Medicine and Bioinformatics at the University of Michigan. He is a member of the Michigan Center for Applied and Interdisciplinary Mathematics (MCAIM) and a core member of the University of Michigan Comprehensive Cancer Center. Dr. Dinov serves as Director of the Statistics Online Computational Resource, Co-Director of the Center for Complexity and Self-management of Chronic Disease (CSCD Center), Co-Director of the multi-institutional Probability Distributome Project, Associate Director of the Michigan Institute for Data Science (MIDAS), and Associate Director of the Michigan Neuroscience Graduate Program (NGP). He is a member of the American Statistical Association (ASA), International Association for Statistical Education (IASE), American Mathematical Society (AMS), American Association for the Advancement of Science (AAAS), and an Elected Member of the International Statistical Institute (ISI).

Session Logistics

Date/Time: March 7, 8, and 21, 2022; 16:00-19:00 Central European Time (10 AM - 1 PM US ET).
Registration: Registration Link (moderate registration fees apply); see also 2022 ISI Short Courses.
- Fee-waiver Application: A few need-based registration fee waivers may be awarded. To be considered for a fee waiver, please complete this web-form application.
GoToMeeting: Individual GoToMeeting links are provided upon registration.
URL: Official ISI DSPA Course website.
Session Format: Three sessions, one per day (3 hours each).
Session URL: https://myumi.ch/e6zyw

Overview

This course will be based on a Data Science and Predictive Analytics (DSPA) course the instructor teaches at the University of Michigan. The training will provide intermediate to advanced learners with a solid data science foundation to address challenges related to collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical datasets using R. Participants will gain skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics, we will discuss a number of driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.

Prerequisites

Assumed prior knowledge includes completed undergraduate study with quantitative STEM exposure, some quantitative training, programming experience, and a high level of energy and motivation to learn. Participants should have R and RStudio preinstalled on their local computers.

Vision

This course is based on active learning and integrates driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference.

Values

The training aims to provide effective, reliable, reproducible, and transformative data-driven discovery supporting open-science.

Strategic priorities

Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.

Outcomes

Upon successful completion of this course, participants are expected to have moderate competency in at least two of the three competencies within each of the three areas: Algorithms and Applications, Data Management, and Analysis Methods. Specifically, participants will get end-to-end R-protocols, gain ML/AI algorithm knowledge, explore data validation, wrangling, and visualization, and experiment with statistical inference and model-free Machine Learning tools.

Areas	Competency	Expectation	Notes
Algorithms and Applications	Tools	Working knowledge of basic software tools (command-line, GUI based, or web-services)	Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL
	Algorithms	Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures	Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching
	Application Domain	Data analysis experience from at least one application area, either through coursework, internship, research project, etc.	Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences
Data Management	Data validation & visualization	Curation, Exploratory Data Analysis (EDA) and visualization	Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (ggplot, Dashboard, D3.js)
	Data wrangling	Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration	Data imperfections include missing values, inconsistent string formatting (‘2016-01-01’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, structured vs. unstructured data
	Data infrastructure	Handling databases, web-services, Hadoop, multi-source data	Data structures, SOAP protocols, ontologies, XML, JSON, streaming
Analysis Methods	Statistical inference	Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling	Biological variability vs. technological noise, parametric (likelihood) vs non-parametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression
	Study design and diagnostics	Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates	Multistage testing, variance normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction
	Machine Learning	Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN	Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multiview learning, instance learning
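
The data-wrangling row above cites inconsistent date strings (‘2016-01-01’ vs. ‘01/01/2016’) as a typical data imperfection. The sketch below is one minimal way such strings could be harmonized. It is illustrative only and is not taken from the course materials: the course itself works in R, while this example uses Python, one of the languages named in the Tools row. The function name `to_iso_date`, the list of accepted formats, and the assumption that slash-separated dates are month-first are all hypothetical choices made for the example.

```python
from datetime import datetime

# Candidate input formats (an assumption for illustration).
# "01/01/2016"-style strings are assumed to be month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None  # unparseable values are flagged as missing


raw = ["2016-01-01", "01/01/2016", " 12/31/2016 ", "not a date", ""]
clean = [to_iso_date(value) for value in raw]
print(clean)
```

Returning `None` for unparseable entries keeps malformed values visible as missing data instead of silently guessing a date, so they can be handled in a later imputation or cleaning step.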

Topics

The Data Science and Predictive Analytics textbook is divided into the following 23 chapters, each progressively building on the previous content.

Motivation
Foundations of R
Managing Data in R
Data Visualization
Linear Algebra & Matrix Computing
Dimensionality Reduction
Lazy Learning: Classification Using Nearest Neighbors
Probabilistic Learning: Classification Using Naive Bayes
Decision Tree Divide and Conquer Classification
Forecasting Numeric Data Using Regression Models
Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
Apriori Association Rules Learning
k-Means Clustering
Model Performance Assessment
Improving Model Performance
Specialized Machine Learning Topics
Variable/Feature Selection
Regularized Linear Modeling and Controlled Variable Selection
Big Longitudinal Data Analysis
Natural Language Processing/Text Mining
Prediction and Internal Statistical Cross Validation
Function Optimization
Deep Learning, Neural Networks

Program Outline

Welcome and introductions
Course logistics (please come prepared with an Internet-connected computer that has local installations of R (statistical computing environment) and RStudio (graphical user interface and integrated development environment))
Data manipulation and visualization
Non-linear dimensionality reduction (UMAP & t-SNE)
Supervised and Unsupervised, model-based and model-free prediction, regression, classification, and clustering
Reticulation (Interoperability between R, Python, C/C++ and other languages)
Role of optimization in AI/ML
Examples of DNN modeling of 2D brain scans - automatic diagnostic prediction and tumor-mask derivation for cancer neuroimaging data
Activities and HTML5 demos.
Attendees' capstone project presentations (Day 3).
Diversity of participants' backgrounds, interests, and capabilities. We will make a significant effort to "level the playing field" and provide opportunities to all attendees to learn and actively participate in this training course.
- If you have advanced knowledge in a specific topic, continue to stay engaged in the class by exploring the extensive DSPA Appendices, reviewing other DSPA topics, and beginning to plan for the open-ended capstone project.
- If you need some extra time to understand a concept, try to keep pace with the course instructor and consider asking questions in the chat. To catch up with the material later on, you can review some of the referenced background materials, coordinate with other participants, or touch base with the instructor offline.

Program Details

Day 1 (Mon, March 7, 2022)	Day 2 (Tue, March 8, 2022)	Day 3 (Mon, March 21, 2022)
Welcome	Review of Day 1	Review of Days 1 and 2
DSPA Spring 2022 Course Overview (ISI, prereqs, vision, objectives, outcomes, Website)	Questions, comments, issues?	Capstone Project Presentations
Introductions (Instructor: Ivo Dinov; Attendees: please post in Chat/Discussion-Forum: Participant's Name, Affiliation, Title, interests, and one fun fact about you)	Supervised AI
Course Coverage	Model-based
Expectations and optional Capstone Project Presentations	Baseball players physique modeling
SOCR Resources: Datasets & Case-studies, Webapps, DSPA, Spacekime/TCIU, GitHub, Prob & Stats EBook, SMHS EBook, Current SOCR Users	k-NN prediction of galaxy spin
Open Science: It’s online, therefore it exists!	Model-free
Download DSPA Textbook (free)	Estimate the square root function using NN
Resource Search & Navigation, Language Translations
	NN Google Trends and the Stock Market
Motivation - and 7D of Big Data	Unsupervised AI
Digitalization of all human experiences	Classification and clustering (k-Means, spectral, hierarchical)
Responsible Data Science/Ethical Predictive Analytics	Hot-dogs example
R vs. Python vs. SAS vs. SPSS vs. other SW	Silhouette plots
Confirm local installations of R & RStudio	Pediatric trauma clustering study
RStudio GUI
Rmarkdown Notebook (IDE); End-to-end Pipeline Workflow from raw data … models … visualization … analytics … reporting/pubs
Example Demo (requires knitr package)
DSPA Chapter 4 (Linear Algebra/Matrix Computing): RMD Source, HTML output, SOCR_Header
Math Foundations
5-min Break	5-min Break
Data types: categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object	Reticulation (interoperability between R, Python, C/C++ and other languages)
Data manipulation import/export, EM imputation, webpage scraping, sample statistics (moments)	Text modeling & NLP (sentiment analysis example)
EDA (visualization)
Compare R EDA vs. HTML/JS: SOCRAT (NI data of AD/MCI/NC), Motion Charts (Housing Prices), BrainViewer (raw MRI, DTI tracks, Brain Atlas)
Probability Distributions: Distributome, TVN Webapp	Longitudinal data analysis (Google trends analytics)
Dimensionality reduction
Linear PCA: 2D --> 1D example, PPMI (Parkinson's disease) example
5-min Break	5-min Break
Non-linear: MNIST data OCR: UMAP OCR, t-SNE OCR	Role of optimization in AI/ML (Healthcare manufacturer product optimization example)
SOCR/Tensorboard/Projector UKBB Brain Study	Deep neural networks (image-classification example)
Capstone project: interactive-learning using monthly US macroeconomic data. Use the RMD source, the example HTML output, and the provided data to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness)	DSPA Appendices: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning
	Demonstrations of interesting Capstone project results
Open discussion	Open discussion

Resources

Capstone Project

Following the first two days of instruction, each participant is expected to work for the next two weeks on designing, implementing, and presenting a capstone project. This hands-on project experience will provide interactive learning to reinforce some of the DSPA concepts. Each participant can choose their own study and dataset, or use some of the provided case-study data. If necessary, two or three people can team up on a project.

Capstone Template: The supporting DSPA Canvas site includes two exemplary datasets (economic and biomedical), among many dozens of other case-studies. The Canvas site also includes two R markdown (Rmd) templates and their corresponding HTML outputs. Each participant will customize the Rmd templates to fit their project design needs, analytical protocol, and study goals.
Expectations: The R markdown electronic notebook will capture the entire study, from conceptualizing the study, designing the experiments, data retrieval, wrangling, harmonization, preprocessing, and visualization, to execution of the workflow protocol, reporting and result presentation on Day 3 of the course. In the spirit of open science, transparent discoveries, scientific rigor, reproducibility, and team-cooperation, all Rmd notebooks must be self-contained. We will discuss in class how the entire study can be automatically compiled into PDF, HTML, DOCX or other formatted reports, by knitting (knitr) the Rmd electronic notebook.
Presentations: On Day 3, each participant will have 5 minutes to present their capstone project, plus 5 minutes for community input, feedback, suggestions, and constructive recommendations. We'll probably proceed in the order of the listed participants. Everyone is expected to be present and actively engaged throughout the course and the capstone presentations.
Use of sensitive data: If you have sensitive data that you want to use in your capstone project (e.g., PHI, IP, or other protected information), please desensitize your protocol, so that your R markdown notebook and the report you present do not compromise individual privacy or organizational property, and do not violate local or international regulations (e.g., GDPR, FERPA, HIPAA). You can use the DataSifter or other protocols to obfuscate and desensitize your dataset.

Video Recordings

to be posted later ...

Participants

ISI-sponsored fellows
- Ella Joyce Paragas, Central Luzon State University (CLSU), Department of Statistics
- Turgut Özaltındiş, Mimar Sinan Fine Arts University, Department of Statistics
- Aqsa Shah, Centre for Advanced Studies in Pure and Applied Mathematics, Bahauddin Zakariya University Multan
- Fekadu Tolessa Gedefa, Eötvös Lorand University (ELTE-TTK), Department of Operations Research
Glory Atilola, Imperial College London, Postdoctoral Research Assistant
Csenge Krisztina Szabó, Hungarian Central Statistical Office
Atilla Gergely Kiss, Hungarian Central Statistical Office
Lilla Bogdan, Hungarian Central Statistical Office
João Neves, Statistics Portugal, Head of Sector
Dinesh Shetty, John Carroll University, Assistant Professor
Michelle Karker, University of Michigan, Graduate Student
Allen Lee, University of Michigan, Assistant Professor
Alfredo Bustos, INEGI, Researcher
Nohemi Delgado, INEGI, Coordination Linkage
Luisa Dumett, Pontificia Universidad Católica del Perú, Coordinador Inteligencia de datos
Abdulhakeem Eideh, Alquds University, Associate Professor of Statistics
Fernando Cantu Bazaldua, UNIDO, Statistician
Mariana Masteling, University of Michigan, PhD Student
Juan Jose Fernandez-Duran, ITAM, Professor
Juergen Pilz, AAU Klagenfurt, Emeritus Professor
Ismo Muhonen, Bank of Finland, IT System Economist
Bing Zeng, The University of Alabama in Huntsville, Graduate Student


