Difference between revisions of "SOCR News ISI WSC DSPA Training 2021"

From SOCR
(Created page with "== SOCR News & Events: 2021 ISI/WSC Training and Education Bootcamp on Data Science and Predictive Analytics (DSPA) == Image:DSPA_2021.gif|250px|thumbnail|...")
 
m (Resources)
 
(40 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
== [[SOCR_News | SOCR News & Events]]:  2021 ISI/WSC Training and Education Bootcamp on Data Science and Predictive Analytics (DSPA) ==
 
== [[SOCR_News | SOCR News & Events]]:  2021 ISI/WSC Training and Education Bootcamp on Data Science and Predictive Analytics (DSPA) ==
  
[[Image:DSPA_2021.gif|250px|thumbnail|right| [https://www.isi2021.org/ 2021 ISI World Statistical Congress] ]]
+
[[Image:DSPA_ISI_WSC_anime.gif|right| [https://www.isi2021.org/ 2021 ISI World Statistical Congress] ]]
  
==Overview==
+
== Instructor ==
....
+
: [https://umich.edu/~dinov Ivo Dinov], [https://www.umich.edu University of Michigan], [https://www.socr.umich.edu SOCR], [https://midas.umich.edu MIDAS].
  
 
+
:: Dr. Dinov is a professor of Health Behavior and Biological Sciences and Computational Medicine and Bioinformatics at the University of Michigan. He is a member of the Michigan Center for Applied and Interdisciplinary Mathematics (MCAIM) and a core member of the University of Michigan Comprehensive Cancer Center. Dr. Dinov serves as Director of the Statistics Online Computational Resource, Co-Director of the Center for Complexity and Self-management of Chronic Disease (CSCD Center), Co-Director of the multi-institutional Probability Distributome Project, Associate Director of the Michigan Institute for Data Science (MIDAS), and Associate Director of the Michigan Neuroscience Graduate Program (NGP). He is a member of the American Statistical Association (ASA), International Association for Statistical Education (IASE), American Mathematical Society (AMS), American Association for the Advancement of Science (AAAS), and an Elected Member of the International Statistical Institute (ISI).
== Organizer==
 
* [http://umich.edu/~dinov Ivo Dinov], [https://www.umich.edu University of Michigan], [https://www.socr.umich.edu SOCR], [https://midas.umich.edu MIDAS].
 
  
 
==Session Logistics==
 
==Session Logistics==
 
<!-- [[Image:JMM_2021_SS9_FoundationsOf_DS_Background.png|300px|thumbnail|right| [https://jointmathematicsmeetings.org/meetings/national/jmm2021/2247_program_ss9.html 2021 JMM/AMS Foundations of Data Science Session (SS9A)] ]] -->
 
<!-- [[Image:JMM_2021_SS9_FoundationsOf_DS_Background.png|300px|thumbnail|right| [https://jointmathematicsmeetings.org/meetings/national/jmm2021/2247_program_ss9.html 2021 JMM/AMS Foundations of Data Science Session (SS9A)] ]] -->
* '''Date/Time''': Wednesday & Thursday, June 16-17, 2021, 14.00-17.00, Central European Summer Time, [https://www.timeanddate.com/time/zones/cest CEST (UTC+2)]
+
* '''Date/Time''': Wednesday & Thursday, June 16-17, 2021, 14.00-17.00, Central European Summer Time, [https://www.timeanddate.com/time/zones/cest CEST (UTC+2)], 8:00-11:00 AM [https://www.timeanddate.com/time/zones/et US-EDT].
* '''Registration''': TBD.
+
* '''Registration''': [https://www.isi-web.org/events/courses/short-2021/data-science-and-predictive-analytics-dspa Registration Link], [https://www.isi-web.org/events/courses/short-2021 moderate registration fees apply].
* '''URL''': TBD.
+
* '''GoToMeeting''': [https://global.gotowebinar.com/pjoin/1464080410602344976/5342046927112771343 Webinar link].
* '''Conference''': [https://www.isi2021.org/ 2021 ISI World Statistical Congress].
+
* '''URL''': [https://www.isi-web.org/events/courses/short-2021/data-science-and-predictive-analytics-dspa Official ISI/WSC Course Website].
* '''Session Format''':  Daily 3-hour sessions.
+
* '''Conference''': [https://www.isi2021.org/ 2021 ISI World Statistical Congress] and [https://www.isi-web.org/events/courses/short-2021 WSC 2021 short courses].
* [https://myumi.ch/qgRl1 Session URL]: https://myumi.ch/qgRl1.
+
* '''Session Format''':  Two daily sessions (3 hours each).
 +
* [https://wiki.socr.umich.edu/index.php/SOCR_News_ISI_WSC_DSPA_Training_2021 Session URL]: https://myumi.ch/erXm2.
 +
 
 +
== Overview==
 +
This course will be based on a [https://www.socr.umich.edu/people/dinov/DSPA_Courses.html Data Science and Predictive Analytics (DSPA) course] I teach at the University of Michigan. The training will provide intermediate to advanced learners with a solid data science foundation to address challenges related to collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical datasets using R. Participants will gain skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.
 +
 
 +
Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics, we will discuss a number of driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.
 +
 
 +
===Prerequisites===
 +
Assumed [https://www.socr.umich.edu/people/dinov/courses/DSPA_Prereqs.html prior knowledge includes]: completed undergraduate study with quantitative STEM exposure, some quantitative training, programming experience, and a high level of energy and motivation to learn. Participants should have [http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#21_Install_Basic_Shell-based_R R] and [http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#22_GUI_based_R_Invocation_(RStudio) RStudio] preinstalled on their local computers.
 +
 
 +
===Vision===
 +
This course is based on active learning and integrates driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference.
 +
 
 +
===Values===
 +
The training aims to provide effective, reliable, reproducible, and transformative data-driven discovery that supports open science.
 +
 
 +
===Strategic priorities===
 +
Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.
  
== Program==
+
===Outcomes===
 +
Upon successful completion of this course, participants are expected to have moderate competency in at least two of each of the three competency areas: Algorithms and Applications, Data Management, and Analysis Methods. Specifically, participants will receive end-to-end R protocols, gain ML/AI algorithm knowledge, explore data validation, wrangling, and visualization, and experiment with statistical inference and model-free machine learning tools.
  
<center>
 
 
{| class="wikitable"
 
{| class="wikitable"
 +
! Areas !! Competency !! Expectation !! Notes
 
|-
 
|-
! Time [https://www.timeanddate.com/time/zones/mt US MT timezone (GMT-7)] || Presenter/Affiliation || Title || Abstract ID
+
| rowspan="3"|Algorithms and Applications || Tools || Working knowledge of basic software tools (command-line, GUI based, or web-services) || Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL
 
|-
 
|-
| 8:00AM || [https://www.carolineuhler.com/ Caroline Uhler (MIT)] || ''Multi-Domain Data Integration: From Observations to Mechanistic Insights'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-62-32.pdf Abstract 1163-62-32]
+
| Algorithms || Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures || Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching
 
|-
 
|-
| 8:30AM || [https://luddy.indiana.edu/contact/profile/?profile_id=187 Mehmet (Memo) Dalkilic (Indiana University)] || ''Teaching an Old Dog New Tricks: Making EM work with Big Data using Heaps'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-03-86.pdf Abstract 1163-03-86]
+
| Application Domain || Data analysis experience from at least one application area, either through coursework, internship, research project, etc. || Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences
 
|-
 
|-
| 9:00AM || [https://www.math.fsu.edu/People/faculty.php?id=1783 Tom Needham (Florida State University)] || ''Applications of Gromov-Wasserstein distance to network science'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-52-68.pdf Abstract 1163-52-68]
+
| rowspan="3"|Data Management || Data validation & visualization || Curation, Exploratory Data Analysis (EDA) and visualization || Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (ggplot, Dashboard, D3.js)
 
|-
 
|-
| 9:30AM || [http://www.cs.utah.edu/~jeffp/ Jeff M. Phillips (Utah)] || ''A Primer on the Geometry in Machine Learning'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-52-52.pdf Abstract 1163-52-52]
+
| Data wrangling || Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration || Data imperfections include missing values, inconsistent string formatting (‘2016-01-01’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, structured vs. unstructured data
 
|-
 
|-
| 10:00AM || [https://www.jonathannilesweed.com/ Jonathan Niles-Weed, NYU/Courant/Center for Data Science] || ''Statistical estimation under group actions'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-62-41.pdf Abstract 1163-62-41]
+
| Data infrastructure || Handling databases, web-services, Hadoop, multi-source data || Data structures, SOAP protocols, ontologies, XML, JSON, streaming
 
|-
 
|-
| 10:30 AM || [https://ani.stat.fsu.edu/~abarbu/ Adrian Barbu (Florida State University)] || ''A Novel Framework for Online Supervised Learning with Feature Selection'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-62-50.pdf Abstract 1163-62-50]
+
| rowspan="3"|Analysis Methods || Statistical inference || Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling || Biological variability vs. technological noise, parametric (likelihood) vs non-parametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression
 
|-
 
|-
| 11:00 AM || [https://arsuaga-vazquez-lab.faculty.ucdavis.edu/team-details/maxime-pouokam/ Maxime G Pouokam (UC Davis)] || ''Statistical Topology of Genome Analysis in Three Dimensions'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-62-338.pdf Abstract 1163-62-338]
+
| Study design and diagnostics || Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates || Multistage testing, variance normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction
 
|-
 
|-
| 11:30 AM || [https://www.umich.edu/~dinov/ Ivo D. Dinov (University of Michigan)] || ''Data Science, Time Complexity, and Spacekime Analytics'' || [http://www.ams.org/amsmtgs/2247_abstracts/1163-62-33.pdf Abstract 1163-62-33]
+
| Machine Learning || Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN || Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multiview learning, instance learning
 
|}
 
|}
</center>
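
The data-wrangling row in the table above mentions inconsistent date strings such as ‘2016-01-01’ vs. ‘01/01/2016’. The following is a minimal sketch of one way to harmonize them. It is written in Python for brevity (the course itself works in R). The function name <code>normalize_date</code> and the list of accepted formats are illustrative assumptions, not part of the course materials, and the sketch assumes that the slash form means month/day/year.

```python
from datetime import datetime

# Hypothetical list of formats assumed to occur in the raw data.
# The slash form is assumed to be month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None


raw = ["2016-01-01", "01/01/2016", " 12/31/2015 ", "not a date"]
cleaned = [normalize_date(r) for r in raw]
print(cleaned)
```

Returning <code>None</code> for unparseable strings, rather than raising an error, keeps such records visible as missing values so they can be imputed or dropped in a later step.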
 
  
==Speakers, Titles, and Abstracts==
+
== Topics ==
 +
The [https://socr.umich.edu/people/dinov/DSPA_Courses.html Data Science and Predictive Analytics] textbook is divided into the [https://en.wikipedia.org/wiki/Data_Science_and_Predictive_Analytics following 23 chapters], each progressively building on the previous content.
 +
<div style="column-count:3;-moz-column-count:3;-webkit-column-count:3">
 +
# Motivation
 +
# Foundations of R
 +
# Managing Data in R
 +
# Data Visualization
 +
# Linear Algebra & Matrix Computing
 +
# Dimensionality Reduction
 +
# Lazy Learning: Classification Using Nearest Neighbors
 +
# Probabilistic Learning: Classification Using Naive Bayes
 +
# Decision Tree Divide and Conquer Classification
 +
# Forecasting Numeric Data Using Regression Models
 +
# Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
 +
# Apriori Association Rules Learning
 +
# k-Means Clustering
 +
# Model Performance Assessment
 +
# Improving Model Performance
 +
# Specialized Machine Learning Topics
 +
# Variable/Feature Selection
 +
# Regularized Linear Modeling and Controlled Variable Selection
 +
# Big Longitudinal Data Analysis
 +
# Natural Language Processing/Text Mining
 +
# Prediction and Internal Statistical Cross Validation
 +
# Function Optimization
 +
# Deep Learning, Neural Networks
 +
</div>
  
* [https://www.carolineuhler.com/ Caroline Uhler (MIT)]: ''Multi-Domain Data Integration: From Observations to Mechanistic Insights'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-62-32.pdf Abstract 1163-62-32])
+
== Program Outline==
: Massive data collection holds the promise of a better understanding of complex phenomena and ultimately, of better decisions. An exciting opportunity in this regard stems from the growing availability of perturbation / intervention data (manufacturing, advertisement, education, genomics, etc.). In order to obtain mechanistic insights from such data, a major challenge is the integration of different data modalities (video, audio, interventional, observational, etc.). Using genomics and in particular the problem of identifying drugs for the repurposing against COVID-19 as an example, I will first discuss our recent work on coupling autoencoders in the latent space to integrate and translate between data of very different modalities such as sequencing and imaging. I will then present a framework for integrating observational and interventional data for causal structure discovery and characterize the causal relationships that are identifiable from such data. We end by a theoretical analysis of autoencoders linking overparameterization to memorization. In particular, I will characterize the implicit bias of overparameterized autoencoders and show that such networks trained using standard optimization methods implement associative memory. Collectively, our results have major implications for planning and learning from interventions in various application domains.
 
  
* [https://luddy.indiana.edu/contact/profile/?profile_id=187 Mehmet (Memo) Dalkilic (Indiana University)]: ''Teaching an Old Dog New Tricks: Making EM work with Big Data using Heaps'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-03-86.pdf Abstract 1163-03-86])
+
* Welcome and introductions
: Contemporary data mining algorithms are easily overwhelmed with truly big data. While parallelism, improved initialization, and ad hoc data reduction are commonly used and necessary strategies, we note that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative learning techniques like the expectation-maximization algorithm for clustering (EM-T). To the best of our knowledge, there is no freely available software that specifically focuses on improving the original EM-T algorithm in the context of big data. We demonstrate the utility of CRAN package ''DCEM'' that implements an improved version of EM-T, which we call EM* (EM star). DCEM provides an integrated and minimalistic interface to EM-T and EM* algorithms, and can be used as either (1) a stand-alone program or (2) a pluggable component in existing software. We show that EM* can both effectively and efficiently cluster data as we vary size, dimensions, and separability.
+
* Course logistics (please come prepared with access to Internet connected computers having local versions of [http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#21_Install_Basic_Shell-based_R R (statistical computing environment)] and [http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#22_GUI_based_R_Invocation_(RStudio) RStudio (graphical user interface and integrated development environment)])
 +
* Data manipulation and visualization
 +
* Non-linear dimensionality reduction (UMAP & t-SNE)
 +
* Supervised and Unsupervised, model-based and model-free prediction, regression, classification, and clustering
 +
* Reticulation (Interoperability between R, Python, C/C++ and other languages)
 +
* Role of optimization in AI/ML
 +
* Activities and [https://socr.umich.edu/HTML5/ HTML5 demos].
  
* [https://www.math.fsu.edu/People/faculty.php?id=1783 Tom Needham (Florida State University)]: ''Applications of Gromov-Wasserstein distance to network science'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-52-68.pdf Abstract 1163-52-68])
+
== Program Details ==
: Recent years have seen a surge of research activity in network analysis through the lens of optimal transport. This perspective boils down to the following simple idea: when comparing two networks, instead of considering a traditional registration between their nodes, one instead searches for an optimal ‘soft’ or probabilistic correspondence. This perspective has led to state-of-the-art algorithms for robust large-scale network alignment and network partitioning tasks. A rich mathematical theory underpins this work: optimal node correspondences realize the Gromov-Wasserstein (GW) distance between networks. GW distance was originally introduced, independently by K. T. Sturm and Facundo Mémoli, as a tool for studying abstract convergence properties of sequences of metric measure spaces. In particular, Sturm showed that GW distance can be understood as a geodesic distance with respect to a Riemannian structure on the space of isomorphism classes of metric measure spaces (the ‘Space of Spaces’). In this talk, I will describe joint work with Samir Chowdhury, in which we develop computationally efficient implementations of Sturm’s ideas for network science applications. We also derive theoretical results which link this framework to classical notions from spectral network analysis.
+
{| class="wikitable"
 +
|-
 +
! Wednesday, June 16, 2021, 8:00-11:00 AM US-EDT
 +
! Thursday, June 17, 2021, 8:00-11:00 AM US-EDT
 +
|-
 +
|  Welcome
 +
|  Review of Day 1
 +
|-
 +
|  [https://wiki.socr.umich.edu/index.php/SOCR_News_ISI_WSC_DSPA_Training_2021 DSPA Summer Course] Overview ([https://www.isi-web.org/events/courses/short-2021/data-science-and-predictive-analytics-dspa ISI]/[https://www.isi2021.org/ WSC], [https://www.socr.umich.edu/people/dinov/courses/DSPA_Prereqs.html prereqs], vision, objectives, outcomes, Website)
 +
|  Questions, comments, issues?
 +
|-
 +
|  Introductions ([https://umich.edu/~dinov Instructor: Ivo Dinov]; Attendees: please post in Chat/Discussion-Forum: Participant's Name, Affiliation, Title, interests, and ''one fun fact about you'')
 +
|  [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html Supervised AI]
 +
|-
 +
|  [https://wiki.socr.umich.edu/index.php/SOCR_News_ISI_WSC_DSPA_Training_2021#Topics Course Coverage]
 +
| Model-based
 +
|-
 +
|  [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html#13_DSPA_Expectations Expectations] and optional [https://umich.instructure.com/courses/38100/files/folder/Case_Studies/34_US_MacroEconMarketData_CompleteMonthly_1979_2020 capstone project] (below)
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/09_RegressionForecasting.html#3_Case_Study_1:_Baseball_Players Baseball players physique modeling]
 +
|-
 +
| [https://www.socr.umich.edu/ SOCR Resources]: [https://wiki.socr.umich.edu/index.php/SOCR_Data Datasets] & [https://umich.instructure.com/courses/38100/files/folder/Case_Studies Case-studies], [https://socr.umich.edu/HTML5/ Webapps], [https://dspa.predictive.space/ DSPA], [https://spacekime.org/ Spacekime/TCIU], [https://github.com/SOCR GitHub], [https://wiki.socr.umich.edu/index.php/EBook Prob & Stats EBook], [https://wiki.socr.umich.edu/index.php/SMHS SMHS EBook], [https://www.socr.umich.edu/html/SOCR_UserGoogleMap.html Current SOCR Users]
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html#4_Case_Study:_Predicting_Galaxy_Spins k-NN prediction of galaxy spin]
 +
|-
 +
|  Open Science: it’s online, therefore it exists!
 +
| Model-free
 +
|-
 +
|  [https://link.springer.com/book/10.1007%2F978-3-319-72347-1 Download DSPA Textbook] (free)
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html#3_Simple_NN_demo_-_learning_to_compute_(sqrt_{___}) Estimate the square root function using NN]
 +
|-
 +
| Resource [https://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html Search] & [https://www.socr.umich.edu/html/Navigators.html Navigation], [https://translate.google.com Language Translations]
 +
|
 +
|-
 +
|
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html#4_Case_Study_2:_Google_Trends_and_the_Stock_Market_-_Classification NN Google Trends and the Stock Market]
 +
|-
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html Motivation] - and [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html#7_Common_Characteristics_of_Big_(Biomedical_and_Health)_Data 7D of Big Data]
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/06_LazyLearning_kNN.html Unsupervised AI]
 +
|-
 +
| Digitalization of all human experiences
 +
| Classification and clustering (k-Means, spectral, hierarchical)
 +
|-
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/00_Motivation.html#12_Responsible_Data_Science_and_Ethical_Predictive_Analytics Responsible Data Science/Ethical Predictive Analytics]
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#1_Clustering_as_a_machine_learning_task Hot-dogs example]
 +
|-
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#1_Why_use_R R vs. Python vs. SAS vs. SPSS vs. other SW]
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#2_Silhouette_plots Silhouette plots]
 +
|-
 +
| Confirm local installations of R & RStudio
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/12_kMeans_Clustering.html#6_Case_study_2:_Pediatric_Trauma Pediatric trauma clustering study]
 +
|-
 +
|  [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/01_Foundation.html#23_RStudio_GUI_Layout RStudio GUI]
 +
|
 +
|-
 +
| Rmarkdown Notebook (IDE) End-to-end Pipeline Workflow from raw data … models … visualization … analytics … reporting/pubs
 +
|
 +
|-
 +
| Example Demo (requires knitr package)
 +
|
 +
|-
 +
| Chapter 4 [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/04_LinearAlgebraMatrixComputing.Rmd RMD Source], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/04_LinearAlgebraMatrixComputing.html HTML output], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/SOCR_header.html SOCR_Header]
 +
|
 +
|-
 +
| [https://socr.umich.edu/BPAD/BPAD_notes/Biophysics430_Chap01_MathFoundations.html Math Foundations]
 +
|
 +
|-
 +
| 5-min Break
 +
| 5-min Break
 +
|-
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#1_Saving_and_Loading_R_Data_Structures Data types]: categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object
 +
| Reticulation ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/15_SpecializedML_FormatsOptimization.html#7_R_Notebook_support_for_other_programming_languages interoperability between R, Python, C/C++ and other languages])
 +
|-
 +
| Data manipulation import/export, [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#143_Imputation_via_Expectation-Maximization EM imputation], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#15_Parsing_webpages_and_visualizing_tabular_HTML_data webpage scraping], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/02_ManagingData.html#6_Measuring_the_Central_Tendency_-_mean,_median,_mode sample statistics (moments)]
 +
| Text modeling & NLP ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/19_NLP_TextMining.html#5_Sentiment_analysis sentiment analysis example])
 +
|-
 +
| [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/03_DataVisualization.html EDA (visualization)]
 +
|
 +
|-
 +
| Compare R EDA vs. HTML/JS: [https://socr.umich.edu/HTML5/SOCRAT/ SOCRAT (NI data of AD/MCI/NC)], [https://socr.umich.edu/HTML5/MotionChart/ Motion Charts (Housing Prices)], [https://socr.umich.edu/HTML5/BrainViewer/ BrainViewer (raw MRI, DTI tracks, Brain Atlas)]
 +
|
 +
|-
 +
| Probability Distributions: [http://distributome.org/V3/ Distributome], [https://socr.umich.edu/HTML5/BivariateNormal/TVN/ TVN Webapp]
 +
|  Longitudinal data analysis (Google trends analytics)
 +
|-
 +
|  Dimensionality reduction
 +
|
 +
|-
 +
| Linear PCA: [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#1_Example:_Reducing_2D_to_1D 2D --> 1D example], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#5_Principal_Component_Analysis_(PCA) PPMI (Parkinson's disease) example]
 +
|
 +
|-
 +
| 5-min Break
 +
|  5-min Break
 +
|-
 +
|  Non-linear: MNIST data OCR: [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#103_Hand-Written_Digits_Recognition UMAP OCR], [https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html#92_t-SNE_Example:_Hand-written_Digit_Recognition t-SNE OCR]
 +
| Role of optimization in AI/ML ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/21_FunctionOptimization.html#101_Application:_Healthcare_Manufacturer_Product_Optimization Healthcare manufacturer product optimization example])
 +
|-
 +
|  [https://socr.umich.edu/HTML5/SOCR_TensorBoard_UKBB/ SOCR/Tensorboard/Projector UKBB Brain Study]
 +
|  Deep neural networks ([https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/22_DeepLearning.html image-classification example])
 +
|-
 +
| [https://umich.instructure.com/courses/38100/files/folder/Case_Studies/34_US_MacroEconMarketData_CompleteMonthly_1979_2020 Capstone project]: interactive-learning using monthly US macro-economic data. Use the [https://umich.instructure.com/files/20798411/download?download_frd=1 RMD source], the [https://umich.instructure.com/files/20798410/download?download_frd=1 example HTML output], and the [https://umich.instructure.com/files/20026184/download?download_frd=1 provided data] to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness)
 +
|  [https://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html#Appendix DSPA Appendices]: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning
 +
|-
 +
|
 +
|  Demonstrations of interesting [https://umich.instructure.com/courses/38100/files/folder/Case_Studies/34_US_MacroEconMarketData_CompleteMonthly_1979_2020 Capstone project] results
 +
|-
 +
|  Open discussion
 +
| Open discussion
 +
|}
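
The model-free supervised learning item in the Day 2 column (k-NN prediction) rests on a simple idea: label a new point by majority vote among its nearest labeled neighbors. The following is a minimal sketch of that idea, written in Python for brevity (the course materials use R). The function name <code>knn_predict</code> and the toy data are made up for illustration and are not taken from the DSPA case studies.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy, made-up data: two well-separated groups in the plane.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_predict(train, (1.1, 1.0)))  # prints A
print(knn_predict(train, (3.9, 4.1)))  # prints B
```

The method is called "lazy" because no model is fitted in advance: all of the work happens at prediction time, which is why the choice of <code>k</code> and of the distance function matters more than any training step.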
  
* [http://www.cs.utah.edu/~jeffp/ Jeff M. Phillips (Utah)]: ''A Primer on the Geometry in Machine Learning'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-52-52.pdf Abstract 1163-52-52])
+
==Resources==
: Machine Learning is a discipline filled with many simple geometric algorithms, the central task of which is usually classification. These varied approaches all take as input a set of n points in d dimensions, each with a label. In learning, the goal is to use this input data to build a function which predicts a label accurately on new data drawn from the same unknown distribution as the input data. The main difference in the many algorithms is largely a result of the chosen class of functions considered. This talk will take a quick tour through many approaches from simple to complex and modern, and show the geometry inherent at each step. Pit stops will include connections to geometric data structures, duality, random projections, range spaces, and core sets.
+
* [https://socr.umich.edu/docs/uploads/2021/DSPA_ISI_WSC_Flyer_2021.pdf Course Flyer].
 +
* [https://wiki.socr.umich.edu/images/5/5c/ISI_WSC_2021_DSPA_Course_June_2021_Notes.pdf 1-page Course Coverage with dynamic links to content].
 +
* [https://en.wikipedia.org/wiki/Data_Science_and_Predictive_Analytics DSPA Wikipedia].
 +
* [https://www.springer.com/us/book/9783319723464 DSPA Springer Page] & [http://link.springer.com/978-3-319-72347-1 SpringerLink (PDF Download)].
 +
* [https://dspa.predictive.space/ dspa.predictive.space] & [https://umich.instructure.com/courses/143011/ DSPA MOOC Canvas Site].
 +
* [https://wiki.socr.umich.edu/index.php/SOCR_News_ISI_DSPA_Training_2022 Three-day follow up 2022 ISI DSPA Course].
  
* [https://www.jonathannilesweed.com/ Jonathan Niles-Weed, NYU/Courant/Center for Data Science]: ''Statistical estimation under group actions'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-62-41.pdf Abstract 1163-62-41])
+
==Video Recordings==
: A common challenge in the sciences is the presence of heterogeneity in data. Motivated by problems in signal processing and computational biology, we consider a particular form of heterogeneity where observations are corrupted by random transformations from a group (such as the group of permutations or rotations) before they can be collected and analyzed. We establish the fundamental limits of statistical estimation in such settings and show that the optimal rates of recovery are precisely governed by the invariant theory of the group. As a corollary, we establish rigorously the number of samples necessary to reconstruct the structure of molecules in cryo-electron microscopy. We also give a computationally efficient algorithm for a special case of this problem, and discuss conjectured statistical-computational gaps for the general case.
+
* [https://attendee.gotowebinar.com/recording/5279492413447473676 Day 1 Video podcast].
: Based on joint work with Afonso Bandeira, Ben Blum-Smith, Joe Kileel, Amelia Perry, Philippe Rigollet, Amit Singer, and Alex Wein.
+
* [https://attendee.gotowebinar.com/recording/4162355614300329995 Day 2 Video podcast].
  
* [https://ani.stat.fsu.edu/~abarbu/ Adrian Barbu (Florida State University)]: ''A Novel Framework for Online Supervised Learning with Feature Selection'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-62-50.pdf Abstract 1163-62-50])
+
==Participants==
: Current online learning methods suffer from lower convergence rates and limited capability to recover the support of the true features compared to their offline counterparts. In this work, we present a novel online learning framework based on running averages and introduce online versions of some popular existing offline methods such as Elastic Net, Minimax Concave Penalty and Feature Selection with Annealing. The framework can handle an arbitrarily large number of observations as long as the data dimension is not too large, e.g., p<50,000. We prove the equivalence between our online methods and their offline counterparts and give theoretical true feature recovery and convergence guarantees for some of them. In contrast to the existing online methods, the proposed methods can extract models of any sparsity level at any time. Numerical experiments indicate that our new methods enjoy high accuracy of true feature recovery and a fast convergence rate, compared with standard online and offline algorithms. We also show how the running averages framework can be used for model adaptation in the presence of model drift. Finally, we present applications to large datasets where again the proposed framework shows competitive results compared to popular online and offline algorithms.
+
''Partial list of participants:''
  
* [https://arsuaga-vazquez-lab.faculty.ucdavis.edu/team-details/maxime-pouokam/ Maxime G Pouokam (UC Davis)]: ''Statistical Topology of Genome Analysis in Three Dimension'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-62-338.pdf Abstract 1163-62-338])
+
* Jennifer Daniels: Adjunct Math and Statistics Instructor at Mid Michigan College, Davenport University, and Alma College. Graduate student at Central Michigan University.
: The three-dimensional (3D) configuration of chromosomes within the eukaryote nucleus is an important factor for several cellular functions, including gene expression regulation, and has also been linked with many diseases such as cancer-causing translocation events. Recent adaptations of high-throughput sequencing to chromosome conformation capture (3C) techniques allow genome-wide structural characterization for the first time, with the goal of obtaining a 3D structure of the genome. In this study, we present a novel approach to compute entanglement in open chains in general and apply it to chromosomes. Our metric is termed the linking proportion (Lp). We use the Lp in two different settings. We use the Lp to show that the Rabl configuration, an evolutionarily conserved feature of the 3D nuclear organization, is an essential player in the simplification of the entanglement of chromatin fibers. We show how the Lp incorporates statistical models of inference that can be used to determine the agreement between candidate 3D configuration reconstructions. In the last part of our work, we present Smooth3D, a novel 3D genome reconstruction method via cubic spline approximation.
+
* Jo Edwards: Australian Bureau of Statistics, Project Manager/Data Scientist.
 
+
* Jannik Schaller: Federal Statistical Office of Germany (DESTATIS), Interest: Data Fusion/ Statistical and Machine Learning.
* [https://www.umich.edu/~dinov/ Ivo D. Dinov (University of Michigan)]: ''Data Science, Time Complexity, and Spacekime Analytics'' ([http://www.ams.org/amsmtgs/2247_abstracts/1163-62-33.pdf Abstract 1163-62-33])
+
* Edviges Coelho: Statistics Portugal and Universidade Lusófona.
: Human behavior, communication, and social interactions are profoundly augmented by the rapid immersion of digitalization and virtualization of all life experiences. This process presents important challenges of managing, harmonizing, modeling, analyzing, interpreting, and visualizing complex information. There is a substantial need to develop, validate, productize, and support novel mathematical techniques, advanced statistical computing algorithms, transdisciplinary tools, and effective artificial intelligence applications. ''Spacekime analytics'' is a new technique for modeling high-dimensional longitudinal data. This approach relies on extending the notions of time, events, particles, and wavefunctions to complex-time (''kime''), complex-events (''kevents''), data, and inference-functions. We will illustrate how the kime-magnitude (longitudinal time order) and kime-direction (phase) affect the subsequent predictive analytics and the induced scientific inference. The mathematical foundations of spacekime calculus reveal various statistical implications including inferential uncertainty and a Bayesian formulation of spacekime analytics. Complexifying time allows the lifting of all commonly observed processes from the classical 4D Minkowski spacetime to a 5D spacekime manifold, where a number of interesting mathematical problems arise. Direct data science applications of spacekime analytics will be demonstrated using simulated data and clinical observations (e.g., sMRI, fMRI data).
+
* Kadri Rootalu: Data scientist in Statistics Estonia, but have an education in Sociology
 
+
* Jared Mendoza: University of the Philippines Los Banos, Assistant Professor of Statistics
==Resources==
+
* Lynda Aouar: UNCO, PhD student in Applied Statistics, I am interested about nonparametric statistics
Slides/papers
+
* Ananda Manage: Dept of Math & Stat, Sam Houston State University, Texas, USA
 +
* Michal Ciszewski, PhD student in Statistics at TU Delft, interests: activity recognition and anomaly detection
 +
* Joyce Chang; Data scientist at the U of Pittsburgh School of Medicine; interested in risk prediction modeling and identify heterogeneous treatment effects
 +
* Katherine Zavez: PhD student in the Department of Statistics at the University of Connecticut
 +
* Ewilly Liew: Lecturer in Econometrics and Business Statistics, Monash University Malaysia. Interest: behavioral research in higher education and healthcare.
 +
* Jennifer Daniels: Interested in Applied Statistics. Particularly, Data/Text Mining. I have lived and taught in Japan many years ago.
 +
* Elizabeth Gonzalez: Statistics Department, Colegio de Postgraduados, Mexico, interested in statistical inference in general.
 +
* Delia Ortega: PhD student in Statistics. Universidad Nacional, Colombia.
 +
* Li Zhou: PhD student in stat at Auburn University
 +
* Ilich Lama: Principal Research Scientist - Environmental Data Science (NCASI), Montreal, Canada - Interested among other things in statistical analysis of industrial emissions/releases.
 +
* Brocha Stern, postdoctoral fellow at Northwestern University, orthopedic health services and outcomes research
 +
* Annette Kifley, biostatistician in rehabilitation studies, University of Sydney
 +
* Jo Edwards: I am interested in Coding and Classification techniques as well as Entity extraction
 +
* Nur Aziha Mansor: Statistician in Department of Statistics Malaysia, Interest in data management
 +
* Martina Ozoglu: Statistical Office of the Slovak Republic, tourism analyst. I am interested in new forms of Tourism and its data interpretation.
 +
* Jason Ng, Monash University, Dept of Econometrics and Business Statistics
 +
* Quratulain Khaliq: PhD Statistics candidate from Pakistan
 +
* Malcolm Cai: Working in the public service of Singapore. Keen on data science, and sports.
  
: [https://wiki.socr.umich.edu/images/4/42/AdrianBarbu_Slides-2021-01-09-JMM.pdf (Adrian Barbu) ''A Novel Framework for Online Supervised Learning with Feature Selection''].
+
* Nurhazwani Abdul Halim, an Executive from Data Management and Statistics Department, from Central Bank of Malaysia. I am interested in Data Science and Machine Learning
: [https://wiki.socr.umich.edu/images/f/fb/JMM_MaximePouokam_UCD_2021.pdf (Maxime Pouokam) ''Statistical Topology of Genome Analysis in Three Dimension''].
+
* Zsófia Szente: Hungarian Central Statistical Office, statistician. I am interested in data visualization and data science.
: [https://socr.umich.edu/docs/uploads/2021/Dinov_Spacekime_JMM_AMS_2021.pdf (Ivo Dinov) ''Data Science, Time Complexity, and Spacekime Analytics'' (Presentation Slides)].
+
* Luigi Arzedi, PhD student in Statistics at University of Cagliari (Italy)
 +
* Miguel David Alvarez, PhD student in Economics and I work as a Data Scientist in the National Electoral Institute (Mexico).
 +
* Felibel Zabala: methodologist from Stats NZ. I am interested in data science & machine learning in official statistics
 +
* Quratulain Khaliq: PhD Candidate, Allama Iqbal open University, Statistical process Control, Robustness technique, non parametric statistics. I am interested to link SPC techniques to data science.
  
  
  
 
<hr>
 
<hr>
{{translate|pageName=http://wiki.stat.ucla.edu/socr/index.php?title=SOCR_News_ISI_WSC_DSPA_Training_2021}}
+
{{translate|pageName=https://wiki.socr.umich.edu/index.php/SOCR_News_ISI_WSC_DSPA_Training_2021}}

Latest revision as of 13:27, 3 March 2022

SOCR News & Events: 2021 ISI/WSC Training and Education Bootcamp on Data Science and Predictive Analytics (DSPA)

2021 ISI World Statistical Congress

Instructor

Ivo Dinov, University of Michigan, SOCR, MIDAS.
Dr. Dinov is a professor of Health Behavior and Biological Sciences and Computational Medicine and Bioinformatics at the University of Michigan. He is a member of the Michigan Center for Applied and Interdisciplinary Mathematics (MCAIM) and a core member of the University of Michigan Comprehensive Cancer Center. Dr. Dinov serves as Director of the Statistics Online Computational Resource, Co-Director of the Center for Complexity and Self-management of Chronic Disease (CSCD Center), Co-Director of the multi-institutional Probability Distributome Project, Associate Director of the Michigan Institute for Data Science (MIDAS), and Associate Director of the Michigan Neuroscience Graduate Program (NGP). He is a member of the American Statistical Association (ASA), International Association for Statistical Education (IASE), American Mathematical Society (AMS), American Association for the Advancement of Science (AAAS), and an Elected Member of the International Statistical Institute (ISI).

Session Logistics

Overview

This course will be based on a Data Science and Predictive Analytics (DSPA) course I teach at the University of Michigan. The training will provide intermediate to advanced learners with a solid data science foundation to address challenges related to collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical datasets using R. Participants will gain skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics, we will discuss a number of driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.

Prerequisites

Assumed prior knowledge includes completed undergraduate study with quantitative STEM exposure, some quantitative training, programming experience, and a high level of energy and motivation to learn. Participants should have R and RStudio preinstalled on their local computer.

Vision

This course is based on active-learning and integrates driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference.

Values

The training aims to provide effective, reliable, reproducible, and transformative data-driven discovery supporting open-science.

Strategic priorities

Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.

Outcomes

Upon successful completion of this course, participants are expected to have moderate competency in at least two competencies within each of the three areas: Algorithms and Applications, Data Management, and Analysis Methods. Specifically, participants will get end-to-end R-protocols, gain ML/AI algorithm knowledge, explore data validation, wrangling, and visualization, and experiment with statistical inference and model-free Machine Learning tools.

{| class="wikitable"
! Areas !! Competency !! Expectation !! Notes
|-
| rowspan="3" | Algorithms and Applications || Tools || Working knowledge of basic software tools (command-line, GUI based, or web-services) || Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL
|-
| Algorithms || Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures || Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching
|-
| Application Domain || Data analysis experience from at least one application area, either through coursework, internship, research project, etc. || Applied domain examples include: computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, engineering and physical sciences
|-
| rowspan="3" | Data Management || Data validation & visualization || Curation, Exploratory Data Analysis (EDA) and visualization || Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (ggplot, Dashboard, D3.js)
|-
| Data wrangling || Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration || Data imperfections include missing values, inconsistent string formatting (‘2016-01-01’ vs. ‘01/01/2016’), PC/Mac/Linux time vs. timestamps, structured vs. unstructured data
|-
| Data infrastructure || Handling databases, web-services, Hadoop, multi-source data || Data structures, SOAP protocols, ontologies, XML, JSON, streaming
|-
| rowspan="3" | Analysis Methods || Statistical inference || Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling || Biological variability vs. technological noise, parametric (likelihood) vs non-parametric (rank order statistics) procedures, point vs. interval estimation, hypothesis testing, regression
|-
| Study design and diagnostics || Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates || Multistage testing, variance normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction
|-
| Machine Learning || Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN || Empirical risk minimization. Supervised, semi-supervised, and unsupervised learning. Transfer learning, active learning, reinforcement learning, multiview learning, instance learning
|}
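The Data wrangling notes above mention inconsistent date strings such as ‘2016-01-01’ vs. ‘01/01/2016’. A minimal sketch of one way to harmonize them follows. It is written in Python for illustration only (the course materials themselves use R), and the function name <code>normalize_date</code>, the list of accepted formats, and the month-first reading of slash-separated dates are all assumptions, not part of the course.

```python
from datetime import datetime

# Candidate input formats (an assumption for this sketch):
# ISO year-month-day, and US-style month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def normalize_date(text):
    """Return the date as an ISO string (YYYY-MM-DD), or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None

raw = ["2016-01-01", "01/01/2016", "not a date"]
cleaned = [normalize_date(value) for value in raw]
print(cleaned)
```

Returning <code>None</code> for unparseable values keeps them visible as missing data instead of silently dropping them. Real datasets may also need day-first formats, which cannot be told apart from month-first ones without extra context.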

Topics

The Data Science and Predictive Analytics textbook is divided into the following 23 chapters, each progressively building on the previous content.

  1. Motivation
  2. Foundations of R
  3. Managing Data in R
  4. Data Visualization
  5. Linear Algebra & Matrix Computing
  6. Dimensionality Reduction
  7. Lazy Learning: Classification Using Nearest Neighbors
  8. Probabilistic Learning: Classification Using Naive Bayes
  9. Decision Tree Divide and Conquer Classification
  10. Forecasting Numeric Data Using Regression Models
  11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
  12. Apriori Association Rules Learning
  13. k-Means Clustering
  14. Model Performance Assessment
  15. Improving Model Performance
  16. Specialized Machine Learning Topics
  17. Variable/Feature Selection
  18. Regularized Linear Modeling and Controlled Variable Selection
  19. Big Longitudinal Data Analysis
  20. Natural Language Processing/Text Mining
  21. Prediction and Internal Statistical Cross Validation
  22. Function Optimization
  23. Deep Learning, Neural Networks

Program Outline

Program Details

{| class="wikitable"
! Wednesday, June 16, 2021, 8:00-11:00 AM US-EDT !! Thursday, June 17, 2021, 8:00-11:00 AM US-EDT
|-
| Welcome || Review of Day 1
|-
| DSPA Summer Course Overview (ISI/WSC, prereqs, vision, objectives, outcomes, Website) || Questions, comments, issues?
|-
| Introductions (Instructor: Ivo Dinov; Attendees: please post in Chat/Discussion-Forum: Participant's Name, Affiliation, Title, interests, and one fun fact about you) || Supervised AI
|-
| Course Coverage || Model-based
|-
| Expectations and optional capstone project (below) || Baseball players physique modeling
|-
| SOCR Resources: Datasets & Case-studies, Webapps, DSPA, Spacekime/TCIU, GitHub, Prob & Stats EBook, SMHS EBook, Current SOCR Users || k-NN prediction of galaxy spin
|-
| Open Science It’s online, therefore it exists! || Model-free
|-
| Download DSPA Textbook (free) || Estimate the square root function using NN
|-
| Resource Search & Navigation, Language Translations ||
|-
| || NN Google Trends and the Stock Market
|-
| Motivation - and 7D of Big Data || Unsupervised AI
|-
| Digitalization of all human experiences || Classification and clustering (k-Means, spectral, hierarchical)
|-
| Responsible Data Science/Ethical Predictive Analytics || Hot-dogs example
|-
| R vs. Python vs. SAS vs. SPSS vs. other SW || Silhouette plots
|-
| Confirm local installations of R & RStudio || Pediatric trauma clustering study
|-
| RStudio GUI ||
|-
| Rmarkdown Notebook (IDE) || End-to-end Pipeline Workflow from raw data … models … visualization … analytics … reporting/pubs
|-
| Example Demo (requires knitr package) ||
|-
| Chapter 4 RMD Source, HTML output, SOCR_Header ||
|-
| Math Foundations ||
|-
| 5-min Break || 5-min Break
|-
| Data types: categorical & numeric, structured and unstructured, scalar, vector, matrix, data-frame, tensor, list, object || Reticulation (interoperability between R, Python, C/C++ and other languages)
|-
| Data manipulation import/export, EM imputation, webpage scraping, sample statistics (moments) || Text modeling & NLP (sentiment analysis example)
|-
| EDA (visualization) ||
|-
| Compare R EDA vs. HTML/JS: SOCRAT (NI data of AD/MCI/NC), Motion Charts (Housing Prices), BrainViewer (raw MRI, DTI tracks, Brain Atlas) ||
|-
| Probability Distributions: Distributome, TVN Webapp || Longitudinal data analysis (Google trends analytics)
|-
| Dimensionality reduction ||
|-
| Linear PCA: 2D --> 1D example, PPMI (Parkinson's disease) example ||
|-
| 5-min Break || 5-min Break
|-
| Non-linear: MNIST data OCR: UMAP OCR, t-SNE OCR || Role of optimization in AI/ML (Healthcare manufacturer product optimization example)
|-
| SOCR/Tensorboard/Projector UKBB Brain Study || Deep neural networks (image-classification example)
|-
| Capstone project: interactive-learning using monthly US macro-economic data. Use the RMD source, the example HTML output, and the provided data to experiment with some of the DSPA techniques. Think of ways to augment these data (expand the time range and increase the feature richness) || DSPA Appendices: Bayesian Simulation, Modeling and Inference; Information-Theoretic Foundation of Statistical Learning; Surface, Shape, and Manifold Representation and Visualization; Power Analysis in Experimental Design; Database SQL/NoSQL Queries & Google BigQuery; Image Convolution, Filtering, & Fourier Transform; Causality, Transfer Entropy, & Mechanistic Effects; Agent-based Reinforcement Learning
|-
| || Demonstrations of interesting Capstone project results
|-
| Open discussion || Open discussion
|}
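The Day 1 program lists a linear PCA 2D --> 1D example. The sketch below shows the general idea of that reduction and is not the course's own demo, which uses R. The function name <code>pca_2d_to_1d</code> and the toy data are hypothetical, and the closed-form eigen-decomposition applies only to the 2x2 covariance case.

```python
import math

def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component.

    Returns (direction, scores, explained), where direction is a unit vector,
    scores are the 1D coordinates, and explained is the share of total variance
    captured. Assumes at least two points that are not all identical.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)
    total = sxx + syy
    if total == 0:
        raise ValueError("all points are identical; no direction of variance")

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = total / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)

    # Matching eigenvector; fall back to an axis when the covariance is diagonal.
    if abs(sxy) > 1e-12:
        vx, vy = sxy, lam - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # 1D coordinates: centered points projected onto the principal direction.
    scores = [cx * direction[0] + cy * direction[1] for cx, cy in centered]
    return direction, scores, lam / total

# Toy data lying exactly on the line y = 2x (hypothetical, for illustration).
direction, scores, explained = pca_2d_to_1d([(0, 0), (1, 2), (2, 4), (3, 6)])
print(direction, explained)
```

Because the toy points are perfectly collinear, a single component captures all of the variance. With noisy data the explained share falls below 1, and the discarded second component holds the remainder.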

Resources

Video Recordings

Participants

Partial list of participants:

  • Jennifer Daniels: Adjunct Math and Statistics Instructor at Mid Michigan College, Davenport University, and Alma College. Graduate student at Central Michigan University.
  • Jo Edwards: Australian Bureau of Statistics, Project Manager/Data Scientist.
  • Jannik Schaller: Federal Statistical Office of Germany (DESTATIS), Interest: Data Fusion/ Statistical and Machine Learning.
  • Edviges Coelho: Statistics Portugal and Universidade Lusófona.
  • Kadri Rootalu: Data scientist in Statistics Estonia, but have an education in Sociology
  • Jared Mendoza: University of the Philippines Los Banos, Assistant Professor of Statistics
  • Lynda Aouar: UNCO, PhD student in Applied Statistics, I am interested about nonparametric statistics
  • Ananda Manage: Dept of Math & Stat, Sam Houston State University, Texas, USA
  • Michal Ciszewski, PhD student in Statistics at TU Delft, interests: activity recognition and anomaly detection
  • Joyce Chang; Data scientist at the U of Pittsburgh School of Medicine; interested in risk prediction modeling and identify heterogeneous treatment effects
  • Katherine Zavez: PhD student in the Department of Statistics at the University of Connecticut
  • Ewilly Liew: Lecturer in Econometrics and Business Statistics, Monash University Malaysia. Interest: behavioral research in higher education and healthcare.
  • Jennifer Daniels: Interested in Applied Statistics. Particularly, Data/Text Mining. I have lived and taught in Japan many years ago.
  • Elizabeth Gonzalez: Statistics Department, Colegio de Postgraduados, Mexico, interested in statistical inference in general.
  • Delia Ortega: PhD student in Statistics. Universidad Nacional, Colombia.
  • Li Zhou: PhD student in stat at Auburn University
  • Ilich Lama: Principal Research Scientist - Environmental Data Science (NCASI), Montreal, Canada - Interested among other things in statistical analysis of industrial emissions/releases.
  • Brocha Stern, postdoctoral fellow at Northwestern University, orthopedic health services and outcomes research
  • Annette Kifley, biostatistician in rehabilitation studies, University of Sydney
  • Jo Edwards: I am interested in Coding and Classification techniques as well as Entity extraction
  • Nur Aziha Mansor: Statistician in Department of Statistics Malaysia, Interest in data management
  • Martina Ozoglu: Statistical Office of the Slovak Republic, tourism analyst. I am interested in new forms of Tourism and its data interpretation.
  • Jason Ng, Monash University, Dept of Econometrics and Business Statistics
  • Quratulain Khaliq: PhD Statistics candidate from Pakistan
  • Malcolm Cai: Working in the public service of Singapore. Keen on data science, and sports.
  • Nurhazwani Abdul Halim, an Executive from Data Management and Statistics Department, from Central Bank of Malaysia. I am interested in Data Science and Machine Learning
  • Zsófia Szente: Hungarian Central Statistical Office, statistician. I am interested in data visualization and data science.
  • Luigi Arzedi, PhD student in Statistics at University of Cagliari (Italy)
  • Miguel David Alvarez, PhD student in Economics and I work as a Data Scientist in the National Electoral Institute (Mexico).
  • Felibel Zabala: methodologist from Stats NZ. I am interested in data science & machine learning in official statistics
  • Quratulain Khaliq: PhD Candidate, Allama Iqbal open University, Statistical process Control, Robustness technique, non parametric statistics. I am interested to link SPC techniques to data science.




