# Difference between revisions of "SOCR News ISI WSC DSPA Training 2021"

(→Program Details) |
(→Program Details) |
||

Line 114: | Line 114: | ||

| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html Supervised AI] | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html Supervised AI] | ||

|- | |- | ||

− | | Course Coverage | + | | [https://wiki.socr.umich.edu/index.php/SOCR_News_ISI_WSC_DSPA_Training_2021#Topics Course Coverage] |

| Model-based | | Model-based | ||

|- | |- | ||

Line 120: | Line 120: | ||

| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/09_RegressionForecasting.html#3_Case_Study_1:_Baseball_Players Baseball players physique modeling] | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/09_RegressionForecasting.html#3_Case_Study_1:_Baseball_Players Baseball players physique modeling] | ||

|- | |- | ||

− | | SOCR Resources: Datasets & Case-studies, Webapps, DSPA, Spacekime/TCIU, GitHub, Prob & Stats EBook, SMHS EBook, Current SOCR Users | + | | [https://www.socr.umich.edu/ SOCR Resources]: [https://wiki.socr.umich.edu/index.php/SOCR_Data Datasets] & [https://umich.instructure.com/courses/38100/files/folder/Case_Studies Case-studies], [https://socr.umich.edu/HTML5/ Webapps], [https://dspa.predictive.space/ DSPA], [https://spacekime.org/ Spacekime/TCIU], [https://github.com/SOCR GitHub[, [https://wiki.socr.umich.edu/index.php/EBook Prob & Stats EBook], [https://wiki.socr.umich.edu/index.php/SMHS SMHS EBook], [https://www.socr.umich.edu/html/SOCR_UserGoogleMap.html Current SOCR Users] |

| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html#4_Case_Study:_Predicting_Galaxy_Spins k-NN prediction of galaxy spin] | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html#4_Case_Study:_Predicting_Galaxy_Spins k-NN prediction of galaxy spin] | ||

|- | |- | ||

Line 127: | Line 127: | ||

|- | |- | ||

| [https://link.springer.com/book/10.1007%2F978-3-319-72347-1 Download DSPA Textbook] (free) | | [https://link.springer.com/book/10.1007%2F978-3-319-72347-1 Download DSPA Textbook] (free) | ||

− | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html#3_Simple_NN_demo_-_learning_to_compute_(sqrt_{___}) Estimate the square root function] | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html#3_Simple_NN_demo_-_learning_to_compute_(sqrt_{___}) Estimate the square root function using NN] |

|- | |- | ||

− | | Resource Search & Navigation, Language Translations | + | | Resource [https://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html Search] & [https://www.socr.umich.edu/html/Navigators.html Navigation], [https://translate.google.com Language Translations] |

− | | | + | | |

|- | |- | ||

| | | | ||

| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html#4_Case_Study_2:_Google_Trends_and_the_Stock_Market_-_Classification NN Google Trends and the Stock Market] | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html#4_Case_Study_2:_Google_Trends_and_the_Stock_Market_-_Classification NN Google Trends and the Stock Market] | ||

|- | |- | ||

− | | Motivation | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html Motivation] - and [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html#7_Common_Characteristics_of_Big_(Biomedical_and_Health)_Data 7D of Big Data] |

− | | Unsupervised AI | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html Unsupervised AI] |

|- | |- | ||

| Digitalization of all human experiences | | Digitalization of all human experiences | ||

| Classification and clustering (k-Means, spectral, hierarchical) | | Classification and clustering (k-Means, spectral, hierarchical) | ||

|- | |- | ||

− | | | + | | R[https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html#12_Responsible_Data_Science_and_Ethical_Predictive_Analytics esponsible Data Science/Ethical Predictive Analytics] |

− | | Hot-dogs example | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#1_Clustering_as_a_machine_learning_task Hot-dogs example] |

|- | |- | ||

− | | R vs. Python vs. SAS vs. SPSS vs. other SW | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#1_Why_use_R R vs. Python vs. SAS vs. SPSS vs. other SW] |

− | | Silhouette plots | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#2_Silhouette_plots Silhouette plots] |

|- | |- | ||

| Confirm local installations of R & RStudio | | Confirm local installations of R & RStudio | ||

− | | Pediatric trauma clustering study | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#6_Case_study_2:_Pediatric_Trauma Pediatric trauma clustering study] |

|- | |- | ||

− | | RStudio GUI | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#23_RStudio_GUI_Layout RStudio GUI] |

| | | | ||

|- | |- | ||

Line 159: | Line 159: | ||

| | | | ||

|- | |- | ||

− | | | + | | Chapter 4 [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/04_LinearAlgebraMatrixComputing.Rmd RMD Source], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/04_LinearAlgebraMatrixComputing.html HTML output], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/SOCR_header.html SOCR_Header] |

| | | | ||

|- | |- | ||

− | | Math Foundations | + | | [https://socr.umich.edu/BPAD/BPAD_notes/Biophysics430_Chap01_MathFoundations.html Math Foundations] |

| | | | ||

|- | |- | ||

Line 168: | Line 168: | ||

| 5-min Break | | 5-min Break | ||

|- | |- | ||

− | | Data types categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#1_Saving_and_Loading_R_Data_Structures Data types]: categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object |

− | | Reticulation (interoperability between R, Python, C/C++ and other languages) | + | | Reticulation ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/15_SpecializedML_FormatsOptimization.html#7_R_Notebook_support_for_other_programming_languages interoperability between R, Python, C/C++ and other languages]) |

|- | |- | ||

− | | Data manipulation import/export, EM imputation, webpage scraping, sample statistics (moments) | + | | Data manipulation import/export, [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#143_Imputation_via_Expectation-Maximization EM imputation], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#15_Parsing_webpages_and_visualizing_tabular_HTML_data webpage scraping], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#6_Measuring_the_Central_Tendency_-_mean,_median,_mode sample statistics (moments)] |

− | | Text modeling & NLP (sentiment analysis example) | + | | Text modeling & NLP ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/19_NLP_TextMining.html#5_Sentiment_analysis sentiment analysis example]) |

|- | |- | ||

− | | EDA (visualization) | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/03_DataVisualization.html EDA (visualization)] |

| | | | ||

|- | |- | ||

− | | Compare R EDA vs. HTML/JS: SOCRAT (NI data of AD/MCI/NC), Motion Charts (Housing Prices), BrainViewer (raw MRI, DTI tracks, Brain Atlas) | + | | Compare R EDA vs. HTML/JS: [https://socr.umich.edu/HTML5/SOCRAT/ SOCRAT (NI data of AD/MCI/NC)], [https://socr.umich.edu/HTML5/MotionChart/ Motion Charts (Housing Prices)], [https://socr.umich.edu/HTML5/BrainViewer/ BrainViewer (raw MRI, DTI tracks, Brain Atlas)] |

| | | | ||

|- | |- | ||

− | | Probability Distributions Distributome, TVN Webapp | + | | Probability Distributions: [http://distributome.org/V3/ Distributome], [https://socr.umich.edu/HTML5/BivariateNormal/TVN/ TVN Webapp] |

| Longitudinal data analysis (Google trends analytics) | | Longitudinal data analysis (Google trends analytics) | ||

|- | |- | ||

Line 186: | Line 186: | ||

| | | | ||

|- | |- | ||

− | | Linear PCA: | + | | Linear PCA: [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#1_Example:_Reducing_2D_to_1D 2D --> 1D example], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#5_Principal_Component_Analysis_(PCA) PPMI (Parkinson's disease) example] |

| | | | ||

|- | |- | ||

Line 192: | Line 192: | ||

| 5-min Break | | 5-min Break | ||

|- | |- | ||

− | | Non-linear: MNIST data OCR: UMAP OCR, t-SNE OCR | + | | Non-linear: MNIST data OCR: [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#103_Hand-Written_Digits_Recognition UMAP OCR], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#92_t-SNE_Example:_Hand-written_Digit_Recognition t-SNE OCR] |

− | | Role of optimization in AI/ML (Healthcare manufacturer product optimization example) | + | | Role of optimization in AI/ML ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/21_FunctionOptimization.html#101_Application:_Healthcare_Manufacturer_Product_Optimization Healthcare manufacturer product optimization example]) |

|- | |- | ||

− | | SOCR/Tensorboard/Projector UKBB Brain Study | + | | [https://socr.umich.edu/HTML5/SOCR_TensorBoard_UKBB/ SOCR/Tensorboard/Projector UKBB Brain Study] |

− | | Deep neural networks (image-classification example) | + | | Deep neural networks ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/22_DeepLearning.html image-classification example]) |

|- | |- | ||

− | | Capstone project interactive-learning using monthly US macro-economic data. Use the RMD source, the example HTML output, and the provided data to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness) | + | | [https://umich.instructure.com/courses/38100/files/folder/Case_Studies/34_US_MacroEconMarketData_CompleteMonthly_1979_2020 Capstone project]: interactive-learning using monthly US macro-economic data. Use the [https://umich.instructure.com/files/20798031/download?download_frd=1 RMD source], the [https://umich.instructure.com/files/20798030/download?download_frd=1 example HTML output], and the [https://umich.instructure.com/files/20026184/download?download_frd=1 provided data] to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness) |

− | | DSPA Appendices: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning | + | | [https://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html#Appendix DSPA Appendices]: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning |

|- | |- | ||

| | | | ||

− | | Demonstrations of interesting Capstone project results | + | | Demonstrations of interesting [https://umich.instructure.com/courses/38100/files/folder/Case_Studies/34_US_MacroEconMarketData_CompleteMonthly_1979_2020 Capstone project] results |

|- | |- | ||

| Open discussion | | Open discussion |

## Revision as of 22:13, 15 June 2021


## SOCR News & Events: 2021 ISI/WSC Training and Education Bootcamp on Data Science and Predictive Analytics (DSPA)

## Instructor

- Dr. Dinov is a professor of Health Behavior and Biological Sciences and Computational Medicine and Bioinformatics at the University of Michigan. He is a member of the Michigan Center for Applied and Interdisciplinary Mathematics (MCAIM) and a core member of the University of Michigan Comprehensive Cancer Center. Dr. Dinov serves as Director of the Statistics Online Computational Resource, Co-Director of the Center for Complexity and Self-management of Chronic Disease (CSCD Center), Co-Director of the multi-institutional Probability Distributome Project, Associate Director of the Michigan Institute for Data Science (MIDAS), and Associate Director of the Michigan Neuroscience Graduate Program (NGP). He is a member of the American Statistical Association (ASA), International Association for Statistical Education (IASE), American Mathematical Society (AMS), American Association for the Advancement of Science (AAAS), and an Elected Member of the International Statistical Institute (ISI).

## Session Logistics

- **Date/Time**: Wednesday & Thursday, June 16-17, 2021, 14:00-17:00 Central European Summer Time, CEST (UTC+2), 8:00-11:00 AM US-EDT.
- **Registration**: Registration Link, moderate registration fees apply.
- **GoToMeeting**: Webinar link.
- **URL**: Official ISI/WSC Course Website.
- **Conference**: 2021 ISI World Statistical Congress and WSC 2021 short courses.
- **Session Format**: Two daily sessions (3 hours each).
- **Session URL**: https://myumi.ch/erXm2.

## Overview

This course will be based on a Data Science and Predictive Analytics (DSPA) course I teach at the University of Michigan. The training will provide intermediate to advanced learners with a solid data science foundation to address challenges related to collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical datasets using R. Participants will gain skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics, we will discuss a number of driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.

### Prerequisites

Assumed prior knowledge includes completed undergraduate study with quantitative STEM exposure, some quantitative training, programming experience, and a high level of energy and motivation to learn. Participants should have R and RStudio preinstalled on their local computer.

### Vision

This course is based on active-learning and integrates driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference.

### Values

The training aims to provide effective, reliable, reproducible, and transformative data-driven discovery supporting open-science.

### Strategic priorities

Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.

### Outcomes

Upon successful completion of this course, participants are expected to have moderate competency in at least two of the three competencies within each of the three areas: Algorithms and Applications, Data Management, and Analysis Methods. Specifically, participants will get end-to-end R-protocols, gain ML/AI algorithm knowledge, explore data validation, wrangling, and visualization, and experiment with statistical inference and model-free Machine Learning tools.

| Areas | Competency | Expectation | Notes |
|---|---|---|---|
| Algorithms and Applications | Tools | Working knowledge of basic software tools (command-line, GUI based, or web-services) | Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL |
| Algorithms and Applications | Algorithms | Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures | Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching |
| Algorithms and Applications | Application Domain | Data analysis experience from at least one application area, either through coursework, internship, research project, etc. | Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences |
| Data Management | Data validation & visualization | Curation, Exploratory Data Analysis (EDA) and visualization | Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (ggplot, Dashboard, D3.js) |
| Data Management | Data wrangling | Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration | Data imperfections include missing values, inconsistent string formatting (‘2016-01-01’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, structured vs. unstructured data |
| Data Management | Data infrastructure | Handling databases, web-services, Hadoop, multi-source data | Data structures, SOAP protocols, ontologies, XML, JSON, streaming |
| Analysis Methods | Statistical inference | Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling | Biological variability vs. technological noise, parametric (likelihood) vs. non-parametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression |
| Analysis Methods | Study design and diagnostics | Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates | Multistage testing, variance normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction |
| Analysis Methods | Machine Learning | Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN | Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multiview learning, instance learning |
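The data-wrangling row above mentions inconsistent date strings such as ‘2016-01-01’ vs. ‘01/01/2016’. The following is a minimal sketch of one way to harmonize such values. It is written in Python using only the standard library, although the course materials themselves use R. The helper name `harmonize_date`, the list `KNOWN_FORMATS`, and the reading of slash-separated dates as month/day/year are assumptions made for illustration and are not taken from the course.

```python
from datetime import datetime

# Candidate input formats, tried in order. Treating slash-separated dates as
# month/day/year is an assumption of this sketch, not something the course states.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def harmonize_date(raw):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), or None if unparseable."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return None  # unparseable values are flagged as missing


# Hypothetical raw column mixing both formats, stray whitespace, and bad entries.
records = ["2016-01-01", "01/01/2016", " 12/31/2015 ", "not a date", ""]
cleaned = [harmonize_date(r) for r in records]
print(cleaned)
```

Values that match neither format come back as `None`, so they can be handled later as missing data (for example by imputation) rather than being silently dropped.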

## Topics

The Data Science and Predictive Analytics textbook is divided into the following 23 chapters, each progressively building on the previous content.

- Motivation
- Foundations of R
- Managing Data in R
- Data Visualization
- Linear Algebra & Matrix Computing
- Dimensionality Reduction
- Lazy Learning: Classification Using Nearest Neighbors
- Probabilistic Learning: Classification Using Naive Bayes
- Decision Tree Divide and Conquer Classification
- Forecasting Numeric Data Using Regression Models
- Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- Apriori Association Rules Learning
- k-Means Clustering
- Model Performance Assessment
- Improving Model Performance
- Specialized Machine Learning Topics
- Variable/Feature Selection
- Regularized Linear Modeling and Controlled Variable Selection
- Big Longitudinal Data Analysis
- Natural Language Processing/Text Mining
- Prediction and Internal Statistical Cross Validation
- Function Optimization
- Deep Learning, Neural Networks

## Program Outline

- Welcome and introductions
- Course logistics (please come prepared with access to Internet-connected computers having local versions of R (statistical computing environment) and RStudio (graphical user interface and integrated development environment))
- Data manipulation and visualization
- Non-linear dimensionality reduction (UMAP & t-SNE)
- Supervised and Unsupervised, model-based and model-free prediction, regression, classification, and clustering
- Reticulation (Interoperability between R, Python, C/C++ and other languages)
- Role of optimization in AI/ML
- Activities and HTML5 demos.

## Program Details

| Wednesday, June 16, 2021, 8:00-11:00 AM US-EDT | Thursday, June 17, 2021, 8:00-11:00 AM US-EDT |
|---|---|
| Welcome | Review of Day 1 |
| DSPA Summer Course Overview (ISI/WSC, prereqs, vision, objectives, outcomes, Website) | Questions, comments, issues? |
| Introductions (Instructor: Ivo Dinov; Attendees: please post in Chat/Discussion-Forum: Participant's Name, Affiliation, Title, interests, and "one fun fact about you") | Supervised AI |
| Course Coverage | Model-based |
| Expectations and optional capstone project (below) | Baseball players physique modeling |
| SOCR Resources: Datasets & Case-studies, Webapps, DSPA, Spacekime/TCIU, GitHub, Prob & Stats EBook, SMHS EBook, Current SOCR Users | k-NN prediction of galaxy spin |
| Open Science: It’s online, therefore it exists! | Model-free |
| Download DSPA Textbook (free) | Estimate the square root function using NN |
| Resource Search & Navigation, Language Translations | |
| | NN Google Trends and the Stock Market |
| Motivation - and 7D of Big Data | Unsupervised AI |
| Digitalization of all human experiences | Classification and clustering (k-Means, spectral, hierarchical) |
| Responsible Data Science/Ethical Predictive Analytics | Hot-dogs example |
| R vs. Python vs. SAS vs. SPSS vs. other SW | Silhouette plots |
| Confirm local installations of R & RStudio | Pediatric trauma clustering study |
| RStudio GUI | |
| Rmarkdown Notebook (IDE): End-to-end Pipeline Workflow from raw data … models … visualization … analytics … reporting/pubs | |
| Example Demo (requires knitr package) | |
| Chapter 4 RMD Source, HTML output, SOCR_Header | |
| Math Foundations | |
| 5-min Break | 5-min Break |
| Data types: categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object | Reticulation (interoperability between R, Python, C/C++ and other languages) |
| Data manipulation: import/export, EM imputation, webpage scraping, sample statistics (moments) | Text modeling & NLP (sentiment analysis example) |
| EDA (visualization) | |
| Compare R EDA vs. HTML/JS: SOCRAT (NI data of AD/MCI/NC), Motion Charts (Housing Prices), BrainViewer (raw MRI, DTI tracks, Brain Atlas) | |
| Probability Distributions: Distributome, TVN Webapp | Longitudinal data analysis (Google trends analytics) |
| Dimensionality reduction | |
| Linear PCA: 2D --> 1D example, PPMI (Parkinson's disease) example | |
| 5-min Break | 5-min Break |
| Non-linear: MNIST data OCR: UMAP OCR, t-SNE OCR | Role of optimization in AI/ML (Healthcare manufacturer product optimization example) |
| SOCR/Tensorboard/Projector UKBB Brain Study | Deep neural networks (image-classification example) |
| Capstone project: interactive-learning using monthly US macro-economic data. Use the RMD source, the example HTML output, and the provided data to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness) | DSPA Appendices: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning |
| | Demonstrations of interesting Capstone project results |
| Open discussion | Open discussion |
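The Day 1 program lists a linear PCA example that reduces 2D data to 1D. The sketch below illustrates the idea: center the points, find the direction of largest variance, and project onto it. It is not the DSPA chapter's R code. It is a standalone Python illustration using the closed-form principal axis of a 2×2 covariance matrix, and the function name `pca_2d_to_1d` and the sample points are hypothetical.

```python
import math


def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component.

    Returns (direction, scores, explained), where direction is a unit vector,
    scores are the 1D coordinates of the centered points, and explained is the
    fraction of total variance captured by the first component.
    Assumes at least two points that are not all identical.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # 1D scores: dot product of each centered point with the principal direction.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]

    # Largest eigenvalue divided by the trace gives the explained-variance ratio.
    top = 0.5 * (sxx + syy) + math.hypot(0.5 * (sxx - syy), sxy)
    explained = top / (sxx + syy)
    return direction, scores, explained


# Hypothetical, roughly linear sample data.
sample = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, scores, explained = pca_2d_to_1d(sample)
print("direction:", direction)
print("explained variance ratio:", round(explained, 4))
```

When the points lie close to a line, the explained-variance ratio is close to 1, which is the situation where replacing two coordinates with a single score loses little information.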

## Resources

- Course Flyer.
- 1-page Course Coverage with dynamic links to content.
- DSPA Wikipedia.
- DSPA Springer Page & SpringerLink (PDF Download).
- dspa.predictive.space & DSPA MOOC Canvas Site.
