Difference between revisions of "SOCR News ISI DSPA Training 2022"

(Session Logistics)
* '''Date/Time''': March 7, 8, and 21, 2022; [https://www.timeanddate.com/worldclock/converter.html?iso=20220307T150000&p1=1314&p2=70&p3=165&p4=419&p5=1183&p6=234&p7=784 16:00-19:00 Central European Time] ([https://time.is/ET 10 AM - 1 PM US ET]).
* '''Registration''': [https://www.isi-web.org/courses/node-1262 Registration Link], [https://www.isi-web.org/events/courses/ moderate registration fees apply], [https://www.isi-web.org/courses/2022 2022 ISI Short Courses].
** '''Fee-waiver Application''': A few need-based registration ''fee waivers'' may be awarded. To be considered for a registration fee waiver, please [https://forms.gle/UV4Qy8aezN1V2xEF9 complete this web-form application].
* '''GoToMeeting''': <individual GoToMeeting links are provided upon registration>.
* '''URL''': [https://www.isi-web.org/courses/node-1262 Official ISI DSPA Course website].

Revision as of 11:54, 24 January 2022

SOCR News & Events: 2022 ISI Short Course - Data Science and Predictive Analytics (DSPA)

2022 ISI Short Courses


Ivo Dinov, University of Michigan, SOCR, MIDAS.
Dr. Dinov is a professor of Health Behavior and Biological Sciences and Computational Medicine and Bioinformatics at the University of Michigan. He is a member of the Michigan Center for Applied and Interdisciplinary Mathematics (MCAIM) and a core member of the University of Michigan Comprehensive Cancer Center. Dr. Dinov serves as Director of the Statistics Online Computational Resource, Co-Director of the Center for Complexity and Self-management of Chronic Disease (CSCD Center), Co-Director of the multi-institutional Probability Distributome Project, Associate Director of the Michigan Institute for Data Science (MIDAS), and Associate Director of the Michigan Neuroscience Graduate Program (NGP). He is a member of the American Statistical Association (ASA), International Association for Statistical Education (IASE), American Mathematical Society (AMS), American Association for the Advancement of Science (AAAS), and an Elected Member of the International Statistical Institute (ISI).

Session Logistics


This course will be based on a Data Science and Predictive Analytics (DSPA) course the instructor teaches at the University of Michigan. The training will provide intermediate to advanced learners with a solid data science foundation to address challenges related to collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical datasets using R. Participants will gain skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics, we will discuss a number of driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.


Assumed prior knowledge includes completed undergraduate study with quantitative STEM exposure, some quantitative training, programming experience, and a high level of energy and motivation to learn. Participants should have R and RStudio preinstalled on their local computers.


This course is based on active learning and integrates driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference.


The training aims to provide effective, reliable, reproducible, and transformative data-driven discovery supporting open-science.

Strategic priorities

Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.


Upon successful completion of this course, participants are expected to have moderate competency in at least two competencies within each of the three competency areas: Algorithms and Applications, Data Management, and Analysis Methods. Specifically, participants will receive end-to-end R-protocols, gain ML/AI algorithm knowledge, explore data validation, wrangling, and visualization, and experiment with statistical inference and model-free Machine Learning tools.

{| class="wikitable"
! Areas !! Competency !! Expectation !! Notes
|-
| rowspan="3" | Algorithms and Applications
| Tools
| Working knowledge of basic software tools (command-line, GUI based, or web-services)
| Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL
|-
| Algorithms
| Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures
| Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching
|-
| Application Domain
| Data analysis experience from at least one application area, either through coursework, internship, research project, etc.
| Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences
|-
| rowspan="3" | Data Management
| Data validation & visualization
| Curation, Exploratory Data Analysis (EDA) and visualization
| Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (ggplot, Dashboard, D3.js)
|-
| Data wrangling
| Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration
| Data imperfections include missing values, inconsistent string formatting (‘2016-01-01’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, structured vs. unstructured data
|-
| Data infrastructure
| Handling databases, web-services, Hadoop, multi-source data
| Data structures, SOAP protocols, ontologies, XML, JSON, streaming
|-
| rowspan="3" | Analysis Methods
| Statistical inference
| Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling
| Biological variability vs. technological noise, parametric (likelihood) vs non-parametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression
|-
| Study design and diagnostics
| Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates
| Multistage testing, variance normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction
|-
| Machine Learning
| Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN
| Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multiview learning, instance learning
|}
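
The data-wrangling notes in the competency table mention inconsistent date strings such as ‘2016-01-01’ vs. ‘01/01/2016’. As a rough illustration of what harmonizing such values can look like, here is a minimal sketch. It is written in Python (one of the languages named in the table) rather than R purely to keep the example self-contained; the function name normalize_date and the list of accepted formats are hypothetical choices, not part of the course materials, and ‘01/01/2016’ is assumed to be month/day/year.

```python
from datetime import datetime

# Hypothetical list of formats expected in the raw data; extend as needed.
# "%m/%d/%Y" assumes US-style month/day/year ordering (an assumption).
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(raw):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), or None if unparseable."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None


mixed = ["2016-01-01", "01/01/2016", " 12/31/2015 ", "not a date"]
cleaned = [normalize_date(s) for s in mixed]
print(cleaned)
```

Returning None for unparseable strings, instead of raising an error, keeps the missing-value handling explicit, so those entries can be counted or imputed in a later step.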


The Data Science and Predictive Analytics textbook is divided into the following 23 chapters, each progressively building on the previous content.

  1. Motivation
  2. Foundations of R
  3. Managing Data in R
  4. Data Visualization
  5. Linear Algebra & Matrix Computing
  6. Dimensionality Reduction
  7. Lazy Learning: Classification Using Nearest Neighbors
  8. Probabilistic Learning: Classification Using Naive Bayes
  9. Decision Tree Divide and Conquer Classification
  10. Forecasting Numeric Data Using Regression Models
  11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
  12. Apriori Association Rules Learning
  13. k-Means Clustering
  14. Model Performance Assessment
  15. Improving Model Performance
  16. Specialized Machine Learning Topics
  17. Variable/Feature Selection
  18. Regularized Linear Modeling and Controlled Variable Selection
  19. Big Longitudinal Data Analysis
  20. Natural Language Processing/Text Mining
  21. Prediction and Internal Statistical Cross Validation
  22. Function Optimization
  23. Deep Learning, Neural Networks

Program Outline

  • Welcome and introductions
  • Course logistics (please come prepared with access to an Internet-connected computer with local installations of R (statistical computing environment) and RStudio (graphical user interface and integrated development environment))
  • Data manipulation and visualization
  • Non-linear dimensionality reduction (UMAP & t-SNE)
  • Supervised and Unsupervised, model-based and model-free prediction, regression, classification, and clustering
  • Reticulation (Interoperability between R, Python, C/C++ and other languages)
  • Role of optimization in AI/ML
  • Examples of DNN modeling of 2D brain scans - automatic diagnostic prediction and tumor-mask derivation for cancer neuroimaging data
  • Activities and HTML5 demos.

Program Details

Day 1 (Mon, March 7, 2022) Day 2 (Tue, March 8, 2022) Day 3 (Mon, March 21, 2022)
Welcome Review of Day 1 Review of Days 1 and 2
DSPA Summer Course Overview (ISI, prereqs, vision, objectives, outcomes, Website) Questions, comments, issues? Capstone Project
Introductions (Instructor: Ivo Dinov; Attendees: please post in Chat/Discussion-Forum: Participant's Name, Affiliation, Title, interests, and one fun fact about you) Supervised AI
Course Coverage Model-based
Expectations and optional capstone project (below) Baseball players physique modeling
SOCR Resources: Datasets & Case-studies, Webapps, DSPA, Spacekime/TCIU, GitHub, Prob & Stats EBook, SMHS EBook, Current SOCR Users k-NN prediction of galaxy spin
Open Science It’s online, therefore it exists! Model-free
Download DSPA Textbook (free) Estimate the square root function using NN
Resource Search & Navigation, Language Translations
NN Google Trends and the Stock Market
Motivation - and 7D of Big Data Unsupervised AI
Digitalization of all human experiences Classification and clustering (k-Means, spectral, hierarchical)
Responsible Data Science/Ethical Predictive Analytics Hot-dogs example
R vs. Python vs. SAS vs. SPSS vs. other SW Silhouette plots
Confirm local installations of R & RStudio Pediatric trauma clustering study
RStudio GUI
Rmarkdown Notebook (IDE) End-to-end Pipeline Workflow from raw data … models … visualization … analytics … reporting/pubs
Example Demo (requires knitr package)
Chapter 4 RMD Source, HTML output, SOCR_Header
Math Foundations
5-min Break 5-min Break
Data types: categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object Reticulation (interoperability between R, Python, C/C++ and other languages)
Data manipulation import/export, EM imputation, webpage scraping, sample statistics (moments) Text modeling & NLP (sentiment analysis example)
EDA (visualization)
Compare R EDA vs. HTML/JS: SOCRAT (NI data of AD/MCI/NC), Motion Charts (Housing Prices), BrainViewer (raw MRI, DTI tracks, Brain Atlas)
Probability Distributions: Distributome, TVN Webapp Longitudinal data analysis (Google trends analytics)
Dimensionality reduction
Linear PCA: 2D --> 1D example, PPMI (Parkinson's disease) example
5-min Break 5-min Break
Non-linear: MNIST data OCR: UMAP OCR, t-SNE OCR Role of optimization in AI/ML (Healthcare manufacturer product optimization example)
SOCR/Tensorboard/Projector UKBB Brain Study Deep neural networks (image-classification example)
Capstone project: interactive-learning using monthly US macro-economic data. Use the RMD source, the example HTML output, and the provided data to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness) DSPA Appendices: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning
Demonstrations of interesting Capstone project results
Open discussion Open discussion
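
The Day 1 schedule lists a "Linear PCA: 2D --> 1D" example. The idea behind such a reduction can be sketched as follows; this is only an illustration under stated assumptions, not the course's own demo. It uses plain Python with a closed-form principal axis for the 2x2 covariance matrix, and the function name pca_2d_to_1d and the toy data are hypothetical.

```python
import math


def pca_2d_to_1d(points):
    """Project 2D points onto their first principal axis.

    Returns (direction, scores): a unit vector along the axis of largest
    variance, and the 1D coordinate of each centered point on that axis.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # 1D scores: dot product of each centered point with the direction.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores


# Toy data lying exactly on the line y = 2x, so one dimension captures everything.
toy = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, scores = pca_2d_to_1d(toy)
print(direction, scores)
```

For real datasets with more than two features, a linear-algebra library routine (eigendecomposition or SVD of the covariance matrix) would normally replace the closed-form angle used here.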


Video Recordings

  • to be posted later ...


  • Partial list of participants (to be posted later)
