SOCR - User contributions [en]

SMHS DataManagement

2014-09-02T18:26:31Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Data Management ==

===Overview===
Data management comprises all the disciplines related to managing data as a valuable resource, and it plays an important role in many fields. It is formally defined as the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise. In this lecture, we introduce the fundamental roles of data management in statistics and illustrate commonly used approaches and steps in data management through examples from different areas, including tables, streams, cloud storage, warehouses, databases (DBs), arrays, binary ASCII, and data handling and mechanics.

===Motivation===
Once data have been collected, the next step is to manage them properly. Data management is a vital step in data analysis and is crucial to the success and reproducibility of a statistical analysis. How do we manage data well, and which approaches are commonly used? To make good use of the data, we need to learn more about data management and the various ways to implement it. Selecting appropriate tools and using them efficiently can save the researcher many hours and allow other researchers to build on the products of that work.

===Theory===

'''Tables:''' one of the most commonly used ways to manage data. A table arranges data in rows and columns, and tables are used throughout research and data analysis.
There are two basic types of tables:
*Simple table: consider the following table, which summarizes data from two groups. It gives a compact comparison of the measurements in the two groups. From the table we learn that: (1) the two groups have the same sample size and the same mean; (2) group 1 has a wider range of data values; (3) group 2 has a smaller standard deviation, indicating less variable data than group 1.

<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| ||Minimum||Maximum||Mean||Standard Deviation||Size
|-
|Group 1||12||45||22||2.6||40
|-
|Group 2||15||30||22||1.5||40
|}
</center>
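As a minimal sketch, the summary table above can be stored in an R data frame and extended with derived columns. The object name groups and its column names are chosen here for illustration only; the numbers are copied from the table.

```r
# Summary statistics for the two groups, copied from the table above
groups <- data.frame(
  group   = c("Group 1", "Group 2"),
  minimum = c(12, 15),
  maximum = c(45, 30),
  mean    = c(22, 22),
  sd      = c(2.6, 1.5),
  size    = c(40, 40)
)

# Range = maximum - minimum; group 1 spans the wider interval
groups$range <- groups$maximum - groups$minimum

# Standard error of the mean = sd / sqrt(size)
groups$se <- groups$sd / sqrt(groups$size)

print(groups)
```

Keeping the summaries in a data frame makes it easy to add further derived columns without retyping the table.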

*Multi-dimensional table: consider the F distribution table (http://socr.umich.edu/Applets/F_Table.html), where the first row (k) and the first column (l) give the degrees of freedom and the interior cells give the 99% quantiles of F(k,l). This is a two-dimensional example: each combination of a row header and a column header identifies a unique value.
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
|l \ k||1||2||3||4||5
|-
|1||4052.||4999.5||5403.||5625.||5764.
|-
|2||98.50||99.00||99.17||99.25||99.30
|-
|3||34.12||30.82||29.46||28.71||28.24
|-
|4||21.20||18.00||16.69||15.98||15.52
|-
|5||16.26||13.27||12.06||11.39||10.97
|}
</center>
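A two-dimensional table of this kind can be generated directly in R. The sketch below uses the base function qf() to compute the 99% quantiles of F(k,l), with k as the numerator and l as the denominator degrees of freedom; the object name f99 is an illustrative choice.

```r
# 99% quantiles of the F distribution.
# Columns index the numerator df (k), rows the denominator df (l).
k <- 1:5
l <- 1:5
f99 <- outer(l, k, function(l, k) qf(0.99, df1 = k, df2 = l))
dimnames(f99) <- list(l = l, k = k)

# Look up one cell by its two coordinates, e.g. l = 2, k = 1
f99["2", "1"]

# Print the whole table rounded to two decimals
round(f99, 2)
```

Each cell is addressed by its pair of headers, which is exactly the "coordinates" idea described above.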

'''Streams:''' another common way of managing data, often presented visually. A stream is a sequence of data elements made available over time; it can be thought of as a conveyor belt that allows items to be processed one at a time rather than in large batches. In communications, a stream is a sequence of digitally encoded coherent signals used to transmit or receive information that is in transit. In electronics and computer architecture, a data stream determines at which time each data item is scheduled to enter or leave each port.

From the chart below, we can make three observations: (1) pre-dialysis kidney function is associated with patient survival; (2) pre-dialysis kidney function differs by type of dialysis treatment (it is associated with type of dialysis); (3) pre-dialysis kidney function is not on the causal pathway between type of dialysis and survival. From these three conditions, we can conclude that pre-dialysis kidney function is a potential confounder.

[[File:DataManagementChart1.png |500px]]

An illustrative example of streaming is the wind map (http://hint.fm/wind/), which gives a vivid display of wind speed over the USA using data streamed live from different sources. Streaming gives a clear, continuously updated picture of wind conditions across the country.

[[File:DataManagementFig1.png|500px]]
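The conveyor-belt view of a stream, where items are processed one at a time rather than in a batch, can be sketched in base R with a running mean that is updated as each value arrives. This is only an illustration: the function name running_mean and the simulated values are assumptions, not part of any streaming library.

```r
# Running mean over a stream, updated one element at a time.
# Only the count and the current mean are kept in memory.
running_mean <- function() {
  n  <- 0
  mu <- 0
  function(x) {
    n  <<- n + 1
    mu <<- mu + (x - mu) / n   # incremental update of the mean
    mu
  }
}

push <- running_mean()

set.seed(1)
stream_values <- rnorm(1000, mean = 5)   # simulated arrivals

# Process the values one by one, as they would arrive from a stream
for (x in stream_values) current <- push(x)

current               # running mean after the last arrival
mean(stream_values)   # batch mean, for comparison
```

The streaming estimate matches the batch mean up to floating-point error, without ever holding the full sequence inside the updater.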

'''Cloud Data Storage:''' Cloud data storage demands high availability, durability, and scalability, from a few bytes to petabytes. One example is Amazon’s S3, which promises 99.9% monthly availability and 99.999999999% durability per year. The availability figure translates into less than an hour of outage per month (0.1% of a 30-day month is about 43 minutes). As an example of durability, a user who stores 10,000 objects would, on average, expect to lose one object every 10,000,000 years. S3 and other cloud service providers achieve this reliability by storing data in multiple facilities, with error checking and self-healing processes that detect and repair errors and device failures. This process is completely transparent to the user and requires no action or knowledge of the underlying complexities. Global data-intensive service providers like Google and Facebook have the expertise and scale to provide enormous storage with high reliability and uptime in an efficient and economical way. Many Big Data research projects benefit from cloud services (e.g., umich email, file sharing, and computational services).
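The availability and durability figures quoted above can be checked with a few lines of arithmetic. The sketch assumes a 30-day month and treats the durability figure as an annual loss probability of 1e-11 per object; the variable names are illustrative.

```r
# Availability: 99.9% per month, assuming a 30-day month
availability      <- 0.999
minutes_per_month <- 30 * 24 * 60
max_downtime_min  <- (1 - availability) * minutes_per_month
max_downtime_min        # about 43.2 minutes, i.e. under one hour

# Durability: 99.999999999% per year, i.e. an assumed annual
# loss probability of 1e-11 for each stored object
annual_loss_prob         <- 1e-11
n_objects                <- 10000
expected_losses_per_year <- n_objects * annual_loss_prob
years_per_lost_object    <- 1 / expected_losses_per_year
years_per_lost_object   # about 1e+07 years
```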

'''Warehouse:''' a system used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). A DW stores current and historical data and is used to create trend reports for senior management, such as annual comparisons.
<center>
[[File:DataManagementFig2.jpg]]
</center>
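The warehouse idea, integrating disparate sources into one repository and then reporting across years, can be sketched on a very small scale in base R. The two source tables below (clinic_a, clinic_b) and their values are hypothetical and serve only to illustrate the pattern.

```r
# Two hypothetical source systems holding yearly visit counts
clinic_a <- data.frame(year = c(2012, 2013), visits = c(120, 150))
clinic_b <- data.frame(year = c(2012, 2013), visits = c(80, 95))

# Integrate into one central table, recording where each row came from
clinic_a$source <- "A"
clinic_b$source <- "B"
warehouse <- rbind(clinic_a, clinic_b)

# Annual comparison across all sources
annual <- aggregate(visits ~ year, data = warehouse, FUN = sum)
annual
```

A production warehouse would add data cleaning, a shared schema and historical snapshots, but the integrate-then-report flow is the same.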

'''DBs:'''

A database is an organized collection of data, typically structured to model relevant aspects of reality in a way that supports processes requiring this information.
<center>
[[File:DataManagementFig3.png]]
</center>

A well-known example is the LONI Pipeline, which uses data from the LONI Image Data Archive (IDA). The Pipeline takes advantage of cluster nodes to download files in parallel from the IDA database.
*[http://pipeline.loni.usc.edu/learn/user-guide/building-a-workflow/#IDA Building a Workflow User-Guide]
*[https://ida.loni.usc.edu/login.jsp This site] introduces the LONI Image Data Archive (IDA), which offers an integrated environment for safely archiving, querying, visualizing and sharing neuroimaging data. It facilitates the de-identification and pooling of data from multiple institutions, protecting data from unauthorized access while providing the ability to share data among collaborating investigators. The archive provides flexibility in establishing project metadata and in accommodating one or more research groups, sites, and others. It is simple and secure, and requires nothing more than a computer with an Internet connection and a web browser.

'''Arrays:''' a data structure consisting of a collection of elements, each identified by at least one array index or key.
Types of arrays:
*One-dimensional array: for example, an array of 8 integer variables with indices 0 through 7 may be stored as 8 words at memory addresses {200, 202, 204, 206, 208, 210, 212, 214}, so that the element with index i is located at address 200+2i.
*Multidimensional arrays: $Data=\begin{bmatrix}
2 & 3 & 0\\
6 & 4 & 5\\
5 & 3 & 1\\
\end{bmatrix}$
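Both kinds of array can be sketched in R. The first part reproduces the address formula 200+2i for the one-dimensional example; the second builds the 3 by 3 matrix shown above and looks up an element by its two indices. Note that R indexes from 1, whereas the address example counts from 0.

```r
# One-dimensional array: 8 elements, base address 200, stride of 2
index   <- 0:7
address <- 200 + 2 * index
address

# R vectors are 1-based, so the element with 0-based index i sits at i + 1
address[3 + 1]   # address of the element with index 3

# Two-dimensional array: the matrix Data shown above, filled row by row
Data <- matrix(c(2, 3, 0,
                 6, 4, 5,
                 5, 3, 1), nrow = 3, byrow = TRUE)
Data[2, 3]   # element in row 2, column 3
dim(Data)    # number of rows and columns
```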

Example: R arrays and data frames

 > DF <- data.frame(a = 1:3, b = letters[10:12],
 +                  c = seq(as.Date("2004-01-01"), by = "week", length.out = 3),
 +                  stringsAsFactors = TRUE)
 > DF
   a b          c
 1 1 j 2004-01-01
 2 2 k 2004-01-08
 3 3 l 2004-01-15

 > data.matrix(DF[1:2])   # convert only the first two columns (output not shown)
 > data.matrix(DF)        # factors and dates become their internal numeric codes
      a b     c
 [1,] 1 1 12418
 [2,] 2 2 12425
 [3,] 3 3 12432

 > sleep   # built-in sleep dataset
    extra group ID
 1    0.7     1  1
 2   -1.6     1  2
 3   -0.2     1  3
 4   -1.2     1  4
 5   -0.1     1  5
 6    3.4     1  6
 7    3.7     1  7
 8    0.8     1  8
 9    0.0     1  9
 10   2.0     1 10
 11   1.9     2  1
 12   0.8     2  2
 13   1.1     2  3
 14   0.1     2  4
 15  -0.1     2  5
 16   4.4     2  6
 17   5.5     2  7
 18   1.6     2  8
 19   4.6     2  9
 20   3.4     2 10

'''Binary ASCII:''' [http://en.wikipedia.org/wiki/ASCII ASCII], short for American Standard Code for Information Interchange, is a character-encoding scheme originally based on the English alphabet. It encodes 128 specified characters (the digits 0 to 9, the letters a to z and A to Z, some basic punctuation symbols, some control codes that originated with teletype machines, and a blank space) as 7-bit binary integers. It represents text in computers, communications equipment, and other devices that use text. The first table below lists the decimal ASCII codes and binary representations of a few letters, written with a leading zero as 8 bits; the second shows the decimal, hexadecimal and binary forms of the numbers 0 to 16:
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
|Letter||ASCII Code||Binary||Letter||ASCII Code||Binary
|-
|a||097||01100001||A||065||01000001
|-
|b||098||01100010||B||066||01000010
|-
|c||099||01100011||C||067||01000011
|-
|d||100||01100100||D||068||01000100
|-
|e||101||01100101||E||069||01000101
|}
</center>

<center>
{|class="wikitable" style="text-align:center; width:33%" border="1"
|-
|Dec||Hex||Binary
|-
|0||00||00000000
|-
|1||01||00000001
|-
|2||02||00000010
|-
|3||03||00000011
|-
|4||04||00000100
|-
|5||05||00000101
|-
|6||06||00000110
|-
|7||07||00000111
|-
|8||08||00001000
|-
|9||09||00001001
|-
|10||0A||00001010
|-
|11||0B||00001011
|-
|12||0C||00001100
|-
|13||0D||00001101
|-
|14||0E||00001110
|-
|15||0F||00001111
|-
|16||10||00010000
|}
</center>
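The conversions in the two tables above can be reproduced in base R. The sketch below uses utf8ToInt() to obtain the decimal code of a character, sprintf() for the hexadecimal form, and intToBits() for the binary digits; the helper name ascii_info is an illustrative choice.

```r
# Convert a single character to its decimal, hexadecimal and 8-bit binary forms
ascii_info <- function(ch) {
  code <- utf8ToInt(ch)                          # decimal code
  # intToBits() returns the least significant bit first,
  # so keep the low 8 bits and reverse them
  bits <- rev(as.integer(intToBits(code))[1:8])
  data.frame(
    char   = ch,
    dec    = code,
    hex    = sprintf("%02X", code),
    binary = paste(bits, collapse = ""),
    stringsAsFactors = FALSE
  )
}

# Rebuild a few rows of the letter table
do.call(rbind, lapply(c("a", "A", "e", "E"), ascii_info))
```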

'''[http://ori.hhs.gov/education/products/n_illinois_u/datamanagement/dhtopic.html Handling:]''' the process of ensuring that research data are stored, archived or disposed of in a safe and secure manner during and after the conclusion of a research project. This includes the development of policies and procedures to manage data handled electronically as well as through non-electronic means.

*Data handling is important for ensuring the integrity of research data, since it addresses concerns related to confidentiality, security, and preservation/retention of research data. Proper planning for data handling can also result in efficient and economical storage, retrieval, and disposal of data. For data handled electronically, data integrity is a primary concern: recorded data must not be altered, erased, lost or accessed by unauthorized users.

*Data handling issues encompass both electronic systems and non-electronic systems, such as paper files, journals, and laboratory notebooks. Electronic systems include computer workstations and laptops, personal digital assistants (PDAs), storage media such as videotape, diskette, CD, DVD, and memory cards, and other electronic instrumentation. These systems may be used for storing, archiving, sharing, and disposing of data, and therefore require adequate planning at the start of a research project so that issues related to data integrity can be analyzed and addressed early on.

Issues to consider in ensuring the integrity of the data handled:
*Type of data handled and its impact on the environment (especially if it is stored on toxic media);
*Type of media containing the data and its storage capacity, handling and storage requirements, reliability, longevity (in the case of degradable media), retrieval effectiveness, and ease of upgrade to newer media;
*Data handling responsibilities/privileges, that is, who can handle which portion of the data, at what point during the project, for what purpose, etc.;
*Data handling procedures that describe how long the data should be kept, and when, how, and by whom the data should be handled for storage, sharing, archival, retrieval and disposal purposes.

===Applications===
*[http://en.wikibooks.org/wiki/OpenClinica_User_Manual This article] presents a comprehensive introduction to EDC (electronic data capture) in clinical research. It contains a series of guides to help users learn how to use OpenClinica for clinical data management. The user manual serves as a good introduction to OpenClinica and equips researchers with background information, specific data management procedures, study construction and system maintenance for data management in clinical studies.
*[http://pipeline.loni.usc.edu/learn/user-guide/building-a-workflow/#XNAT This article] discusses building a workflow with the LONI Pipeline. It introduces dragging in and connecting modules and explains how data are managed with the Pipeline. It also illustrates use of the Pipeline with IDA, NDAR, XNAT and the cloud storage it supports.
*[http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/manual/v2chapter2.pdf This article] presents a general introduction to data management, analysis tools and analysis mechanics. It describes the purpose, steps, considerations and useful tools in data management and introduces the specific steps in data handling. It goes through the major steps with examples and illustrates database handling with consideration of the dataset size. The article offers a comprehensive analysis of data management and is a good starting point for implementing it.
*[http://www.sciencedirect.com/science/article/pii/S0167819102000947 This article] presents two services that the authors believe are fundamental to any data grid: reliable, high-speed transport and replica management. The high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. The replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. The article presents the design of both services along with preliminary performance results; the implementations exploit security and other services provided by the Globus Toolkit.

===Software===
*[http://cran.r-project.org/doc/manuals/r-devel/R-data.html Data Import/Export in R]
*[http://cran.r-project.org/web/packages/rmongodb/ Package ‘rmongodb’ in R (provides an interface to the NoSQL MongoDB database)]

===Problems===
Example 1: Data Streaming:

 > library("stream")
 > dsd <- DSD_Gaussians(k=3, d=3, noise=0.05)
 > dsd
 Static Mixture of Gaussians Data Stream
 With 3 clusters in 3 dimensions

 > p <- get_points(dsd, n=5)
 > p
          V1        V2        V3
 1 0.6930698 0.4082633 0.3857444
 2 0.9086645 0.5545718 0.6942871
 3 0.6031121 0.5288573 0.5795389
 4 0.7590460 0.5485160 0.6484480
 5 0.7249682 0.7327056 0.4891291

 > p <- get_points(dsd, n=100, assignment=TRUE)
 > attr(p, "assignment")
  [1]  3  3  3 NA  2  2  2  3  1  2  1  2  2  3  1  1  1  3  3  2  3  1  1  2
 [25]  3  1  3  3  1  1  3  2  3  3  2 NA  1  1  1  1  1  3  2  2  2  3  1  2
 [49]  2  2  2  3  1 NA  1  3  2  2  3  3  3  2  1  2  2  2  1  3  1  1  1  2
 [73]  1  3  2  2  3  1 NA  1  2  1  1  3  3  1  1  2  1  3  3  3  2  3  3  2
 [97] NA  1  2  1

 > plot(dsd, n=500)

<center> [[File:DataManagementFig4.png]] </center>

Example 2: Import of big data. The zips dataset (also available at http://media.mongodb.org/zips.json) is imported using R/MongoDB:
*Steps: install the ‘rmongodb’ package and import the zips data. The dataset is included in the rmongodb package and can be loaded with the command data(zips):
**install.packages('rmongodb')
**library(rmongodb)
**data(zips)
**head(zips)

''Output:''

 > install.packages('rmongodb')
 trying URL 'http://cran.mtu.edu/bin/macosx/leopard/contrib/2.15/rmongodb_1.6.5.tgz'
 Content type 'application/x-gzip' length 1291831 bytes (1.2 Mb)
 opened URL
 downloaded 1.2 Mb

 The downloaded binary packages are in
 /var/folders/k6/3r5dstw5385_b4fzt679pmsr0000gn/T//RtmpWD2wiV/downloaded_packages

 > head(zips)
      city         loc       pop   state _id
 [1,] "ACMAR"      Numeric,2 6055  "AL"  "35004"
 [2,] "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
 [3,] "ADGER"      Numeric,2 3205  "AL"  "35006"
 [4,] "KEYSTONE"   Numeric,2 14218 "AL"  "35007"
 [5,] "NEW SITE"   Numeric,2 19942 "AL"  "35010"
 [6,] "ALPINE"     Numeric,2 3062  "AL"  "35014"

===References===
*[http://en.wikipedia.org/wiki/Table_(information) Table (Information) Wikipedia]
*[http://en.wikipedia.org/wiki/Stream_(computing) Stream (Computing) Wikipedia]
*[http://en.wikipedia.org/wiki/Cloud_storage Cloud Storage Wikipedia]
*[http://en.wikipedia.org/wiki/Data_warehouse Data Warehouse Wikipedia]
*[http://en.wikipedia.org/wiki/Database Database Wikipedia]
*[http://en.wikipedia.org/wiki/Array_data_structure Array Data Structure Wikipedia]
*[http://en.wikipedia.org/wiki/ASCII ASCII Wikipedia]
*[http://ori.hhs.gov/education/products/n_illinois_u/datamanagement/dhtopic.html Data Handling, Responsible Conduct in Data Management]

SMHS DataManagement

2014-09-02T18:25:52Z

Zhenxunw: /* Theory */


SMHS PowerSensitivitySpecificity

2014-09-02T18:20:40Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Statistical Power, Sample-Size, Sensitivity and Specificity ==

===Overview:===

In statistics, there are many ways to evaluate and choose a test or model. In this lecture, we introduce some commonly used measures that describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. These measures help us design and assess statistical tests and experiments. The lecture presents the background for these concepts and illustrates their use and application through examples.

===Motivation:===

Experiments, models and tests are fundamental to the field of statistics, and we have all faced the question of how to set up the right test and how to choose a better model. We are interested in studying some of the most commonly used measures, including power, effect size, sensitivity and specificity, which will greatly help us in understanding and choosing a model. So, what would be a reasonable sample size to balance the trade-off between cost and efficiency? What is the probability that the test will reject a false null hypothesis? What is the test’s ability to correctly retain a true null hypothesis or to correctly reject a false null hypothesis?

===Theory===

====Type I Error, Type II Error and Power====
*Type I error: the false positive error of rejecting the null hypothesis when it is actually true; e.g., the purses are detected as containing radioactive material when they actually do not.
*Type II error: the false negative error of failing to reject the null hypothesis when the alternative hypothesis is actually true; e.g., the purses are detected as not containing radioactive material when they actually do.
*Statistical power: the probability that the test will reject a false null hypothesis (i.e., that it will not make a Type II error). As power increases, the chance of a Type II error decreases.
*Test specificity: the ability of a test to correctly fail to reject a true null hypothesis, $\frac{d}{b+d}$.
*Test sensitivity: the ability of a test to correctly reject a false null hypothesis, $\frac{a}{a+c}$.
:Here $a$, $b$, $c$ and $d$ denote the true positive (TP), false positive (FP), false negative (FN) and true negative (TN) counts or proportions, respectively.

*The table below gives an example of calculating specificity, sensitivity, the false positive rate $\alpha$, the false negative rate $\beta$ and power from the proportions ''TN'', ''FN'', ''FP'' and ''TP''.

<center>
{| class="wikitable" style="text-align:center;width: 25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative (fail to reject $H_0$)''' || Condition absent + Negative result = True (accurate) Negative ('''TN''', 0.98505) || Condition present + Negative result = False (invalid) Negative ('''FN''', 0.00025) '''Type II error''' (β)
|-
| '''Positive (reject $H_0$)''' || Condition absent + Positive result = False Positive ('''FP''', 0.00995) '''Type I error''' (α) || Condition present + Positive result = True Positive ('''TP''', 0.00475)
|-
|'''Test Interpretation''' || '''Power''': $1-\beta = 1-\frac{FN}{FN+TP} = 1-\frac{0.00025}{0.00475+0.00025} = 0.95$ ||'''Specificity''': TN/(TN+FP) = 0.98505/(0.98505+0.00995) = 0.99 ||'''Sensitivity''': TP/(TP+FN) = 0.00475/(0.00475+0.00025) = 0.95
|-
|}
</center>

Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$
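
The formulas above can be checked numerically. The following is a minimal R sketch that uses only base R and the four proportions from the example table; the variable names are illustrative choices and do not come from the original lecture.

```r
# Joint proportions taken from the example table (they sum to 1).
TN <- 0.98505  # condition absent,  negative result
FN <- 0.00025  # condition present, negative result
FP <- 0.00995  # condition absent,  positive result
TP <- 0.00475  # condition present, positive result

specificity <- TN / (TN + FP)  # P(negative result | condition absent)
sensitivity <- TP / (TP + FN)  # P(positive result | condition present)
alpha <- FP / (FP + TN)        # false positive rate (Type I error)
beta  <- FN / (FN + TP)        # false negative rate (Type II error)
power <- 1 - beta              # algebraically identical to the sensitivity

# Print all five measures side by side.
print(round(c(specificity = specificity, sensitivity = sensitivity,
              alpha = alpha, beta = beta, power = power), 4))
```

Because $\beta=\frac{FN}{FN+TP}$, the power $1-\beta$ computed this way always coincides with the sensitivity, and $\alpha$ always equals one minus the specificity.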

====Sample size====
The number of observations or replicates included in a statistical sample. It is an important feature of any empirical study, which aims to make inference about a population. In complicated studies, there may be several different sample sizes involved in the study: for example, in a survey sampling involving stratified sampling, there may be different sizes of samples for each population.

*Factors influencing sample size: the expense of data collection and the need for sufficient statistical power.

*Ways to choose sample sizes: (1) expedience. Consider a simple experiment where the sample data is readily available or convenient to collect, yet the size of sample is crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing. (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.

*Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations the gain in accuracy from a larger sample is minimal or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or from data that follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.

*Choose the sample size based on our expectation of other measures.
*Consider the simple experiment of flipping a coin, where the estimator of the proportion of heads is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads in $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, about $95\%$ of the distribution’s probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution with the conservative bound $p(1-p)\le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms a 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then $4\sqrt{\frac{0.25}{n}}=W$; solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=W/2$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.10$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.

*A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $\sigma^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. By the [[SMHS_CLT_LLN|CLT]], the 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish the confidence interval to be $W$ units wide, then $\frac{4\sigma}{\sqrt n}=W$, and solving for $n$ gives $n=\frac{16\sigma^2}{W^2}$.

*Sample size for hypothesis tests: Let $X_i, i=1,2,…,n$ be independent observations from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null and alternative hypotheses $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, with $\mu^*>0$. We wish to (1) reject $H_0$ with probability at least $1-\beta$ when $H_a$ is true, and (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true. Since $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha$, where $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution, the decision rule "reject $H_0$ if the sample average exceeds $\frac{z_\alpha\sigma}{\sqrt n}$" satisfies (2). For (1), this rejection must happen with probability at least $1-\beta$ when $H_a$ is true, in which case the sample average comes from a normal distribution with mean $\mu^*$.

: Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta $. Solving for $n$ gives $n \ge ( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}})^{2}$, where $\Phi$ is the [[SMHS_ProbabilityDistributions#Normal_distribution|normal cumulative distribution function]].
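
The three sample-size rules discussed in this section can be sketched in base R as follows. The function names (<code>n_for_proportion</code>, <code>n_for_mean</code>, <code>n_for_power</code>, <code>power_z_test</code>) and the example inputs are hypothetical and chosen only for illustration. The power-based rule assumes a one-sided z-test with $\mu^*>0$ and uses the standard result $n \ge \left(\frac{(z_\alpha+z_\beta)\sigma}{\mu^*}\right)^2$, where $z_\beta=\Phi^{-1}(1-\beta)$.

```r
# Conservative 95% CI for a proportion, at most W units wide:
# 4 * sqrt(0.25 / n) = W  =>  n = 4 / W^2
n_for_proportion <- function(W) ceiling(4 / W^2)

# 95% CI for a mean with known sigma, at most W units wide:
# 4 * sigma / sqrt(n) = W  =>  n = 16 * sigma^2 / W^2
n_for_mean <- function(sigma, W) ceiling(16 * sigma^2 / W^2)

# One-sided z-test of H0: mu = 0 vs Ha: mu = mu_star (mu_star > 0, sigma known).
# Smallest n with power at least `power` at significance level `alpha`.
n_for_power <- function(mu_star, sigma, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha)  # upper alpha point of N(0, 1)
  z_beta  <- qnorm(power)      # Phi^{-1}(1 - beta)
  ceiling(((z_alpha + z_beta) / (mu_star / sigma))^2)
}

# Power actually achieved by the rule "reject H0 if xbar > z_alpha * sigma / sqrt(n)".
power_z_test <- function(n, mu_star, sigma, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - mu_star * sqrt(n) / sigma)
}

# Illustrative inputs (not taken from the lecture's exercises).
n_prop  <- n_for_proportion(W = 0.06)
n_mean  <- n_for_mean(sigma = 1, W = 0.5)
n_power <- n_for_power(mu_star = 0.5, sigma = 1, alpha = 0.05, power = 0.80)
print(c(n_prop = n_prop, n_mean = n_mean, n_power = n_power))
print(power_z_test(n_power, mu_star = 0.5, sigma = 1))
```

Rounding up with <code>ceiling()</code> ensures that the target width or power is met rather than narrowly missed; <code>power_z_test()</code> can be used to confirm that the returned $n$ reaches the requested power while $n-1$ does not.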

====Effect size====
[http://books.google.com/books?id=J8AlAgAAQBAJ&pg=PT176&lpg=PT176&dq=Effect+size+is+a+descriptive+statistic+that+conveys+the+estimated+magnitude+of+a+relationship+without+making+any+statement+about+whether+the+apparent+relationship+in+the+data+reflects+a+true+relationship+in+the+population&source=bl&ots=YcgNM4azVu&sig=ut-4IHx-SrRoHqMZjAmQtXxxYp4&hl=en&sa=X&ei=wQkGVPzhIsrHggT68YDQBA&ved=0CDMQ6AEwAg#v=onepage&q=Effect%20size%20is%20a%20descriptive%20statistic%20that%20conveys%20the%20estimated%20magnitude%20of%20a%20relationship%20without%20making%20any%20statement%20about%20whether%20the%20apparent%20relationship%20in%20the%20data%20reflects%20a%20true%20relationship%20in%20the%20population&f=false Effect size] is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.

====Other common measures====
*Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance if one were studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between the two variables.

*Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1 and is always nonnegative. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.

*Eta-squared, $ \eta^2 $, describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $ .

*Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment}\times MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.

* Cohen’s $ f^2 $: one of several effect size measures to use in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2$ is the squared multiple correlation.
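
As a rough illustration of how these measures relate to one another, the R sketch below computes them from a one-way ANOVA table. It relies on two datasets that ship with base R (<code>PlantGrowth</code> and <code>cars</code>), which are stand-ins chosen for convenience and are not part of the original lecture. For a one-way design, $R^2$ equals $\eta^2$, so $f^2$ is computed from $\eta^2$ here.

```r
# One-way ANOVA on the built-in PlantGrowth data (plant weight by treatment group).
fit <- lm(weight ~ group, data = PlantGrowth)
tab <- anova(fit)  # row 1: treatment (group), row 2: residuals (error)

ss_treat <- tab[["Sum Sq"]][1]
ss_error <- tab[["Sum Sq"]][2]
df_treat <- tab[["Df"]][1]
ms_error <- tab[["Mean Sq"]][2]
ss_total <- ss_treat + ss_error

eta2   <- ss_treat / ss_total                                       # eta-squared
omega2 <- (ss_treat - df_treat * ms_error) / (ss_total + ms_error)  # omega-squared
f2     <- eta2 / (1 - eta2)                                         # Cohen's f^2 with R^2 = eta^2

# Pearson r and r^2 for paired quantitative data (built-in cars data).
r  <- cor(cars$speed, cars$dist)
r2 <- r^2

print(round(c(eta2 = eta2, omega2 = omega2, f2 = f2, r = r, r2 = r2), 3))
```

In this setting $\omega^2$ comes out smaller than $\eta^2$, which reflects the bias correction described above.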

===Applications===
*[http://www.sciencedirect.com/science/article/pii/0197245681900015 This article], titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviews the importance of sample size in clinical trials and presents a general method from which specific equations are derived for sample size determination and power analysis for a wide variety of statistical procedures. The paper discusses the method in detail, with illustrations for the t test, tests for proportions, tests of survival time and tests for correlations that commonly occur in clinical trials.

*[http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/pdf This article] advocates presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrates the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focuses on standardized effect size statistics and extensively discusses two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and because their calculation is essential for meta-analysis. The paper provides potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or heterogeneous variances, and (4) when data are non-independent.

*[http://www.sciencedirect.com/science/article/pii/019724569090005M This article] reviews methods of sample size and power calculation for the most common study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Normal T Chi-Squared F Tables]

===Problems===
Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size.  II. Increasing significance level.  III. Increasing beta, the probability of a Type II error.

:(A) I only
:(B) II only  
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test.  II. The effect size of the hypothesis test.  III. The probability of making a Type II error.

:(A) I only
:(B) II only
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
<center>
{| class="wikitable" style="text-align:center;width: 25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative(fail to reject $H_0$)''' || 0.983 || 0.0025
|-
| '''Positive (reject $H_0$)''' || 0.0085 ||0.0055
|}
</center>

Suppose we are running a test on a simple experiment where the population standard deviation is $0.06$ and the hypotheses are $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a Type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Hypothesis_Basics SOCR]

*[http://en.wikipedia.org/wiki/Sample_size_determination Sample Size Determination Wikipedia]

*[http://en.wikipedia.org/wiki/Effect_size Effect Size Wikipedia]

*[http://books.google.com/books?id=whF18jCxyv0C&pg=PT4&lpg=PT4&dq=e-Study+Guide+for+Statistics+for+the+Behavioral+Sciences,+textbook+by+Susan&source=bl&ots=9vlDcJMtv1&sig=lUFE0l5GeZdyX8iasXUNgSpb6UI&hl=en&sa=X&ei=CQoGVMjCNs_HgwTi1YCICw&ved=0CD8Q6AEwAw#v=onepage&q=e-Study%20Guide%20for%20Statistics%20for%20the%20Behavioral%20Sciences%2C%20textbook%20by%20Susan&f=false e-Study Guide for Statistics for the Behavioral Sciences, textbook by Susan]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PowerSensitivitySpecificity}}

SMHS PowerSensitivitySpecificity

2014-09-02T18:19:41Z

Zhenxunw: /* Effect size */


SMHS PowerSensitivitySpecificity

2014-09-02T18:19:13Z

Zhenxunw: /* Effect size */

==[[SMHS| Scientific Methods for Health Sciences]] - Statistical Power, Sample-Size, Sensitivity and Specificity ==

===Overview:===

In statistics, there are many ways to evaluate and choose a test or model. In this lecture, we introduce some commonly used measures that describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. These measures help us design and assess statistical tests and experiments. The lecture presents the background for these concepts and illustrates their use and application through examples.

===Motivation:===

Experiments, models and tests are fundamental to the field of statistics, and we have all faced the question of how to set up the right test and how to choose a better model. We study some of the most commonly used measures, including power, effect size, sensitivity and specificity, which help in understanding and choosing a model. So, what would be a reasonable sample size to balance the trade-off between cost and efficiency? What is the probability that the test will reject a false null hypothesis? What is the test’s ability to correctly accept a true null hypothesis or to correctly reject a false null hypothesis?

===Theory===

====Type I Error, Type II Error and Power====
*Type I error: the false positive (Type I) error of rejecting the null hypothesis given that it is actually true; e.g., the purses are detected to contain the radioactive material while they actually do not.
*Type II error: the false negative (Type II) error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are detected not to contain the radioactive material while they actually do.
*Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
*Test specificity (ability of a test to correctly accept the null hypothesis) $=\frac{d}{b+d}$, where $a$, $b$, $c$ and $d$ denote the true positive, false positive, false negative and true negative values, respectively.
*Test sensitivity (ability of a test to correctly reject the null hypothesis when the alternative is true) $=\frac{a}{a+c}$.

*The table below gives an example of calculating specificity, sensitivity, false positive rate $\alpha$, false negative rate $\beta$ and power, given the values of ''TN'', ''FN'', ''FP'' and ''TP''.

<center>
{| class="wikitable" style="text-align:center;width: 25%"border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative (fail to reject $H_0$)''' || Condition absent + Negative result = True (accurate) Negative ('''TN''', 0.98505) || Condition present + Negative result = False (invalid) Negative ('''FN''', 0.00025), '''Type II error''' (β)
|-
| '''Positive (reject $H_0$)''' || Condition absent + Positive result = False Positive ('''FP''', 0.00995), '''Type I error''' (α) || Condition present + Positive result = True Positive ('''TP''', 0.00475)
|-
|'''Test Interpretation''' || '''Power''': $1-\beta = 1-\frac{FN}{FN+TP} = 1-\frac{0.00025}{0.00025+0.00475} = 0.95$ ||'''Specificity''': TN/(TN+FP) = 0.98505/(0.98505+ 0.00995) = 0.99 ||'''Sensitivity''': TP/(TP+FN) = 0.00475/(0.00475+ 0.00025)= 0.95
|-
|}
</center>

Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$
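
The formulas above can be checked numerically. The following is a minimal Python sketch, not part of the original lecture; the function name <code>diagnostic_rates</code> is a hypothetical helper, and the inputs are the four cell values from the example table.

```python
def diagnostic_rates(tn, fn, fp, tp):
    """Compute test characteristics from the four cells of the 2x2 table.

    tn, fn, fp, tp are the true negative, false negative, false positive
    and true positive values (counts or joint probabilities).
    """
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    alpha = fp / (fp + tn)   # false positive rate (Type I error rate)
    beta = fn / (fn + tp)    # false negative rate (Type II error rate)
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "alpha": alpha,
        "beta": beta,
        "power": 1 - beta,
    }


# Cell values taken from the example table
rates = diagnostic_rates(tn=0.98505, fn=0.00025, fp=0.00995, tp=0.00475)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Note that, with these definitions, power and sensitivity are the same quantity, $1-\beta$.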

====Sample size====
The number of observations or replicates included in a statistical sample. It is an important feature of any empirical study, which aims to make inference about a population. In complicated studies, there may be several different sample sizes involved in the study: for example, in a survey sampling involving stratified sampling, there may be different sizes of samples for each population.

*Factors influencing sample size: the expense of data collection; the need to have sufficient statistical power.

*Ways to choose sample sizes: (1) expedience. Consider a simple experiment where the sample data is readily available or convenient to collect, yet the size of sample is crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing. (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.

*Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations, the increase in accuracy from a larger sample is minimal, or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or if the data follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.

*Choose the sample size based on our expectation of other measures.
*Consider the simple experiment of flipping a coin, where the estimator of the proportion is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads out of $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, around $95\%$ of the distribution’s probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution with the conservative bound $p(1-p)\le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms a 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then we have $4\sqrt{\frac{0.25}{n}}=W$; solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=W/2$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.

*A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $\sigma^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. With the [[SMHS_CLT_LLN|CLT]], the 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish to have a confidence interval $W$ units in width, then solving for $n$ gives $n=\frac{16\sigma^2}{W^2}$.

*Sample size for hypothesis tests: Let $X_i, i=1,2,\ldots,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null hypothesis vs. the alternative hypothesis: $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, with $\mu^*>0$. We wish to (1) reject $H_0$ with a probability of at least $1-\beta$ when $H_a$ is true, and (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true. Since $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha$, rejecting $H_0$ when the sample average is more than $\frac{z_\alpha\sigma}{\sqrt n}$ is a decision rule that satisfies (2), where $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution. For (1), we need this rejection to happen with a probability of at least $1-\beta$ when $H_a$ is true, in which case the sample average comes from a normal distribution with mean $\mu^*$.

: Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta $. Solving for $n$, we have $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$, where $\Phi$ is the [[SMHS_ProbabilityDistributions#Normal_distribution|normal cumulative distribution function]].
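
The three sample-size rules in this section can be sketched in Python as follows. This is an illustrative sketch rather than part of the original lecture: the function names are hypothetical, the example inputs are made up, and the power-based rule uses the one-sided form $n \ge \left(\frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\mu^{*}/\sigma}\right)^{2}$ with $\mu^*>0$. Only the Python standard library is assumed.

```python
import math
from statistics import NormalDist


def _ceil(value):
    # Round up to a whole number of observations; the tiny tolerance
    # guards against floating-point noise such as 400.00000000000006.
    return math.ceil(value - 1e-9)


def n_for_proportion(error_bound):
    """n = 1 / B^2: conservative 95% Wald interval within +/- B."""
    return _ceil(1 / error_bound ** 2)


def n_for_mean(sigma, width):
    """n = 16 sigma^2 / W^2: 95% CI for a mean that is W units wide."""
    return _ceil(16 * sigma ** 2 / width ** 2)


def n_for_power(mu_star, sigma, alpha, power):
    """One-sided z-test of H0: mu = 0 vs Ha: mu = mu_star (> 0)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper alpha point
    z_power = NormalDist().inv_cdf(power)       # Phi^{-1}(1 - beta)
    return _ceil(((z_alpha + z_power) / (mu_star / sigma)) ** 2)


print(n_for_proportion(0.1))    # error bound of 10%
print(n_for_proportion(0.05))   # error bound of 5%
print(n_for_mean(sigma=2.0, width=1.0))
print(n_for_power(mu_star=0.5, sigma=1.0, alpha=0.05, power=0.90))
```

Rounding up is a design choice: the formulas give a lower bound on $n$, so the smallest whole number at or above the bound is reported.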

====Effect size====
[http://books.google.com/books?id=whF18jCxyv0C&pg=PT4&lpg=PT4&dq=e-Study+Guide+for+Statistics+for+the+Behavioral+Sciences,+textbook+by+Susan&source=bl&ots=9vlDcJMtv1&sig=lUFE0l5GeZdyX8iasXUNgSpb6UI&hl=en&sa=X&ei=CQoGVMjCNs_HgwTi1YCICw&ved=0CD8Q6AEwAw#v=onepage&q=e-Study%20Guide%20for%20Statistics%20for%20the%20Behavioral%20Sciences%2C%20textbook%20by%20Susan&f=false Effect size] is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.

====Other common measures====
*Pearson $r$ (correlation): an effect size when paired quantitative data are available, for instance if one were studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between two variables.

*Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1 and is always nonnegative. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.

*Eta-squared, $ \eta^2 $, describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $ .

*Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment} \times MS_{error}}{SS_{total}+MS_{error}}$. Given it is less biased, $\omega^2$ is preferable to $\eta^2$, however, it can be more inconvenient to calculate for complex analyses.

* Cohen’s $ f^2 $: one of several effect size measures to use in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$,$R^2 $ is the squared multiple correlation.
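
As a rough illustration of the variance-explained measures above, the sketch below computes $\eta^2$, $\omega^2$ and Cohen's $f^2$ in Python. The function names and the ANOVA summary numbers are hypothetical and chosen only for illustration; for a one-way ANOVA the example uses $R^2=\eta^2$.

```python
def eta_squared(ss_treatment, ss_total):
    """eta^2 = SS_treatment / SS_total."""
    return ss_treatment / ss_total


def omega_squared(ss_treatment, ss_total, df_treatment, ms_error):
    """omega^2 = (SS_treatment - df_treatment * MS_error) / (SS_total + MS_error)."""
    return (ss_treatment - df_treatment * ms_error) / (ss_total + ms_error)


def cohens_f_squared(r_squared):
    """f^2 = R^2 / (1 - R^2)."""
    return r_squared / (1 - r_squared)


# Hypothetical one-way ANOVA summary: 3 groups, 30 observations
ss_treatment, ss_total = 40.0, 100.0
df_treatment, df_error = 2, 27
ms_error = (ss_total - ss_treatment) / df_error

eta2 = eta_squared(ss_treatment, ss_total)
omega2 = omega_squared(ss_treatment, ss_total, df_treatment, ms_error)
f2 = cohens_f_squared(eta2)

print(f"eta^2   = {eta2:.4f}")
print(f"omega^2 = {omega2:.4f}")
print(f"f^2     = {f2:.4f}")
```

In this example $\omega^2$ comes out smaller than $\eta^2$, which is consistent with $\eta^2$ overestimating the variance explained in the population.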

===Applications===
*[http://www.sciencedirect.com/science/article/pii/0197245681900015 This article], titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviews the importance of sample size in clinical trials and presents a general method from which specific equations are derived for sample size determination and analysis of power for a wide variety of statistical procedures. The paper discusses the method in detail, with illustrations for the t test, tests for proportions, tests for survival time and tests for correlations that commonly occur in clinical trials.

*[http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/pdf This article] advocates presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrates the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focuses on standardized effect size statistics and extensively discusses two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and also because their calculations are essential for meta-analysis. The paper provides potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or variances, and (4) when data are non-independent.

*[http://www.sciencedirect.com/science/article/pii/019724569090005M This article] reviews methods of sample size and power calculation for the most common study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Normal T Chi-Squared F Tables]

===Problems===
Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size.  II. Increasing significance level.  III. Increasing beta, the probability of a Type II error.

:(A) I only
:(B) II only  
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test.  II. The effect size of the hypothesis test.  III. The probability of making a Type II error.

:(A) I only
:(B) II only
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
<center>
{| class="wikitable" style="text-align:center;width: 25%"border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative(fail to reject $H_0$)''' || 0.983 || 0.0025
|-
| '''Positive (reject $H_0$)''' || 0.0085 ||0.0055
|}
</center>

Suppose we are running a test on a simple experiment where the population standard deviation is $0.06$, with $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a Type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Hypothesis_Basics SOCR]

*[http://en.wikipedia.org/wiki/Sample_size_determination Sample Size Determination Wikipedia]

*[http://en.wikipedia.org/wiki/Effect_size Effect Size Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PowerSensitivitySpecificity}}

SMHS Estimation

2014-09-02T18:13:12Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Parameter Estimation ==

===Overview===
Estimation is an important concept in statistics and is widely applied in various areas. It deals with estimating the values of population parameters based on sample data. The parameters describe an underlying physical setting, and their values affect the distribution of the measured data. Two major approaches are commonly used in estimation: (1) the probabilistic approach assumes that the measured data is random with probability distribution dependent on the parameters; (2) the set-membership approach assumes that the measured data vector belongs to a set which depends on the parameter vector. The purpose of estimation is to find an estimator that is interpretable, accurate and exhibits some form of optimality. Criteria such as the minimum variance unbiased estimator are usually applied to measure estimator optimality, although an optimal estimator does not always exist. Here we present the fundamentals of estimation theory and illustrate how to apply estimation in real studies.

===Motivation===
To obtain a desired estimator, we first need to determine a probability distribution, with parameters of interest, based on the data. After deciding on the probabilistic model, we need to find the theoretically achievable precision available to any estimator based on the model and then develop an estimator based on this model. There are a variety of methods and criteria to develop and choose between estimators based on their performance: maximum likelihood estimators, Bayes estimators, method of moments estimators, minimum mean square error estimators, minimum variance unbiased estimators, best linear unbiased estimators, etc. Experiments or simulations can also be run to test estimators’ performance.

===Theory===
An estimate of a population parameter may be expressed in two ways:
*Point estimate: a single value of estimate. For example, sample mean is a point estimate of the population mean.
*Interval estimate: an interval estimate is defined by two numbers, between which a population parameter is said to lie.

====Confidence Intervals (CIs)====
CIs describe the uncertainty of a sampling method and consist of a confidence level, a statistic and a margin of error. The statistic and the margin of error define an interval estimate, which represents the precision of the method. A confidence interval is expressed as sample statistic ± margin of error.
A confidence interval at the 95% confidence level is interpreted as follows: if the sampling method were repeated many times, about 95% of the resulting intervals would contain the true population parameter.

* Confidence level: the probability part of a confidence interval. It describes the likelihood that a particular sampling method will produce a confidence interval that includes the true population parameter.

* Margin of error: range of the values above and below the sample statistic in confidence interval. ''margin of error=critical value*standard deviation of the statistic''.

* Critical value: The central limit theorem states that the sampling distribution of a statistic will be normal or nearly normal and the critical value can be expressed as a t score or as a z score, if ANY of the following conditions apply:
**The population distribution is normal;
**The sampling distribution is symmetric, unimodal, without outliers, and the sample size is 15 or less;
**The sampling distribution is moderately skewed, unimodal, without outliers, and the sample size is between 16 and 40;
**The sample size is greater than 40, without outliers.

To find the critical value, follow these steps.
*Compute alpha $(\alpha): \alpha = 1 - (confidence\ level / 100)$
*Find the critical probability $(p^*): p^* = 1 -\frac {\alpha} {2}$
*To express the critical value as a $z$ score, find the $z$ score having a cumulative probability equal to the critical probability $(p^*)$.
*To express the critical value as a t score, follow these steps. Find the degree of freedom (DF): when estimating a mean score or a proportion from a single sample, DF is equal to the sample size minus one. For other applications, the degrees of freedom may be calculated differently. We will describe those computations as they come up.

The critical t score $(t^*)$ is the t score having degrees of freedom equal to DF and a cumulative probability equal to the critical probability $(p^*)$.

Should you express the critical value as a t score or as a z score? As a practical matter, when the sample size is large (greater than 40), it doesn't make much difference. Both approaches yield similar results. Strictly speaking, when the population standard deviation is unknown or when the sample size is small, the t score is preferred. Nevertheless, many introductory statistics texts use the z score exclusively.
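
The steps for finding a critical $z$ score can be sketched in Python as below. This is a minimal illustration using only the standard library; <code>critical_z</code> is a hypothetical helper name. The critical $t$ score would additionally require a Student's t quantile function, which the standard library does not provide, so it is not shown here.

```python
from statistics import NormalDist


def critical_z(confidence_level_percent):
    """Critical z score for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence_level_percent / 100   # step 1: alpha
    p_star = 1 - alpha / 2                       # step 2: critical probability
    return NormalDist().inv_cdf(p_star)          # step 3: z with cumulative probability p*


for level in (90, 95, 99):
    print(f"{level}% confidence: z* = {critical_z(level):.3f}")
```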

* Standard error: an estimate of the standard deviation of a statistic. When the values of population parameters are unknown, it is valuable to compute the standard error as an estimate of the standard deviation of a statistic. It is computed from known sample statistics. The table below shows how to compute the standard error for simple random samples, assuming that the population size is at least 10 times larger than the sample size.
<center>
{| class="wikitable" style="text-align:center;width:25%"border="1"
|-
|Statistic || Standard error
|-
|Sample mean, $\bar{x}$ || $SE_{\bar{x}}=s/\sqrt{n}$
|-
|Sample proportion, $p$ || $SE_{p}=\sqrt{\frac{p(1-p)}{n}}$
|-
|Difference between means,$\bar{x}_{1} -\bar{x}_{2}$ || $ SE_{\bar{x}_1 -\bar{x}_2} = \sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}$
|-
|Difference between proportions, $\bar{p}_{1} - \bar{p}_{2}$ || $SE_{\bar{p}_{1} - \bar{p}_{2}} = \sqrt{ \frac{p_1 (1-p_1)}{n_1} +\frac{p_{2}(1-p_{2})}{n_{2}}}$
|}
</center>
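
The four standard-error formulas in the table can be written as small Python functions. This is a sketch for illustration only; the function names and the example inputs are hypothetical.

```python
import math


def se_mean(s, n):
    """Standard error of a sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)


def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference between two sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


print(se_mean(s=10, n=25))
print(se_proportion(p=0.5, n=100))
print(se_diff_means(s1=3, n1=9, s2=4, n2=16))
print(se_diff_proportions(p1=0.5, n1=100, p2=0.5, n2=100))
```

Multiplying any of these standard errors by a critical value gives the corresponding margin of error.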

* Degrees of freedom: the number of independent pieces of information on which the estimate is based

In general, the degrees of freedom for an estimate equal the number of values minus the number of parameters estimated en route to the estimate in question. For example, if we have sampled 20 data points, then our estimate of the variance has 20 – 1 = 19 degrees of freedom.

====Characteristics of Estimators====
* Bias: refers to whether an estimator tends to either overestimate or underestimate the parameter. We say an estimator is biased if the mean of the sampling distribution of the statistic is not equal to the parameter. For example, applying the population variance formula to a sample, $\hat{\sigma}^{2}=\frac{\sum (x_i-\bar{x})^{2}}{N}$, gives a biased estimator of the population variance, whereas the sample variance $s^{2}=\frac{\sum (x_i-\bar{x})^{2}}{N-1}$ is an unbiased estimator of the population variance.

*Sampling variability: refers to how much the estimate varies from sample to sample. It is usually measured by its standard error: the smaller the standard error, the less the sampling variability. For example, the standard error of the mean is $\sigma_M=\sigma/\sqrt{N}$. So the larger the sample size $(N)$, the smaller the standard error of the mean, hence the smaller the sampling variability.

*Unbiased estimate: if $\delta(X_{1},X_{2},\ldots,X_{n})$ is an unbiased estimate for $g(\theta)$ and $T$ is a complete sufficient statistic for the family of densities, then $\eta (X_{1},X_{2},\ldots,X_{n})=E[\delta(X_{1},X_{2},\ldots,X_{n})|T]$ is the minimum-variance unbiased estimate for $g(\theta)$.

*(Uniformly) Minimum-variance unbiased estimator ([http://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator UMVUE], or MVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter. It may not exist. Consider estimation of $g(\theta)$ based on data $X_{1},X_{2},\ldots,X_{n}$ independent and identically distributed from some member of a family with density $p_\theta, \theta \in \Omega $. An unbiased estimator $\delta(X_{1},X_{2},\ldots,X_{n})$ of $g(\theta)$ is UMVUE if $\forall \theta \in \Omega$, $var(\delta(X_{1},X_{2},\ldots,X_{n})) \leq var(\tilde{\delta} (X_{1},X_{2},\ldots,X_{n}))$ for any other unbiased estimator $\tilde{\delta}$.

: $MSE(\delta)=var(\delta)+(bias(\delta))^{2}$. The MVUE minimizes MSE among unbiased estimators. In some cases biased estimators have lower MSE because they have a smaller variance than does any unbiased estimator.
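
Bias and the MSE decomposition can be illustrated by enumerating every possible sample from a tiny population. The Python sketch below is an assumption-laden toy example (a hypothetical population of three values, samples of size 2 drawn with replacement), not part of the original lecture; it compares the divide-by-$N$ and divide-by-$(N-1)$ variance estimators.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3]                       # hypothetical toy population
true_variance = pvariance(population)        # population variance, 2/3

# All 9 equally likely samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))

divide_by_n = [pvariance(s) for s in samples]          # divides by N
divide_by_n_minus_1 = [variance(s) for s in samples]   # divides by N - 1

bias_n = mean(divide_by_n) - true_variance
bias_n_minus_1 = mean(divide_by_n_minus_1) - true_variance

# MSE of the divide-by-N estimator, directly and via var + bias^2
mse_n = mean([(v - true_variance) ** 2 for v in divide_by_n])
var_n = pvariance(divide_by_n)

print(f"bias (divide by N):     {bias_n:.4f}")
print(f"bias (divide by N - 1): {bias_n_minus_1:.4f}")
print(f"MSE = {mse_n:.4f}, var + bias^2 = {var_n + bias_n ** 2:.4f}")
```

Because the sampling distribution is enumerated exactly rather than simulated, the divide-by-$(N-1)$ estimator shows zero bias up to floating-point error, while the divide-by-$N$ estimator underestimates the population variance.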

===Applications===
* [[AP_Statistics_Curriculum_2007_Estim_MOM_MLE|This article]] presents the MOM and MLE methods of estimation. It illustrates the MOM method with detailed examples and includes several exercises for students to practice. MOM, which is short for Method Of Moments, is one of the most commonly used methods to estimate population parameters using observed data from the specific process. The idea is to use the sample data to calculate sample moments and then set these equal to their corresponding population counterparts. Steps: (1) determine the $k$ parameters of interest and the specific distribution for this process; (2) compute the first $k$ (or more) sample moments; (3) set the sample moments equal to the population moments and solve the system of $k$ equations with $k$ unknowns. Let’s look at a simple example of the MOM method.

: Suppose we want to estimate the true probability of a head by flipping a coin (possibly an unfair one). We flip the coin 10 times and observe the following outcome: {H,T,H,H,T,T,T,H,T,T}. With MOM: (1) the parameter of interest is $p=P(H)$, and each flip follows a Bernoulli distribution; (2) $np=E[Y]=4$, so $p=2/5$, where $Y$ is the number of heads in the experiment of $n=10$ flips and follows a Binomial distribution; (3) the estimate of the true probability of flipping a head equals $2/5$. This is a simple example of MOM estimation of a proportion.

* [http://onlinestatbook.com/2/estimation/estimation.html This article] presents a fundamental introduction to estimation theory and illustrates its basic concepts and applications. It offers specific examples and exercises on each concept and application and is a good starting point for learning estimation theory.

* [http://digital-library.theiet.org/content/journals/10.1049/ip-f-2.1993.0015 This article] proposed an algorithm, the bootstrap filter, for implementing recursive Bayesian filters. The required density of the state vector is represented as a set of random samples, which are updated and propagated by the algorithm. The method presented is not restricted by assumptions of linearity or Gaussian noise and it may be applied to any state transition or measurement model. It presents a simulation example of the bearings only tracking problems and includes schemes for improving the efficiency of the basic algorithm.
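
The method-of-moments coin example from the first application above can be reproduced in a few lines of Python. This is a minimal sketch with hypothetical variable names: the first sample moment (the observed number of heads) is set equal to the population moment $np$ and solved for $p$.

```python
# Observed outcomes from the coin-flip example
outcomes = ["H", "T", "H", "H", "T", "T", "T", "H", "T", "T"]

n = len(outcomes)              # number of flips
heads = outcomes.count("H")    # observed value of Y, the number of heads

# Method of moments: set E[Y] = n * p equal to the observed count and solve for p
p_hat = heads / n

print(f"n = {n}, heads = {heads}, MOM estimate of p = {p_hat}")
```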

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distribution]
*[http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Simulations & Experiments]
*[http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Which of the following statements is true?
: a. When the margin of error is small, the confidence level is high.
: b. When the margin of error is small, the confidence level is low.
: c. A confidence interval is a type of point estimate.
: d. A population mean is an example of a point estimate.
: e. None of the above.

* Which of the following statements is true?
: a. The standard error is computed solely from sample attributes.
: b. The standard deviation is computed solely from sample attributes.
: c. The standard error is a measure of central tendency.
: d. All of the above.
: e. None of the above.

* 900 students were randomly selected for a national survey. Among survey participants, the mean grade-point average (GPA) was 2.7, and the standard deviation was 0.4. What is the margin of error, assuming a 95% confidence level?
: a. 0.013
: b. 0.025
: c. 0.500
: d. 1.960

* Suppose we want to estimate the average weight of an adult male in Dekalb County, Georgia. We draw a random sample of 1,000 men from a population of 1,000,000 men and weigh them. We find that the average man in our sample weighs 180 pounds, and the standard deviation of the sample is 30 pounds. What is the 95% confidence interval?
: a. $180 \pm 1.86$
: b. $180 \pm 3.0$
: c. $180 \pm 5.88$
: d. $180 \pm 30$

* Suppose that simple random samples of seniors are selected from two colleges: 15 students from school A and 20 students from school B. On a standardized test, the sample from school A has an average score of 1000 with a standard deviation of 100. The sample from school B has an average score of 950 with a standard deviation of 90. What is the 90% confidence interval for the difference in test scores at the two schools, assuming that test scores came from normal distributions in both schools? (Hint: Since the sample sizes are small, use a t score as the critical value.)
: a. $50 \pm 1.70$
: b. $50 \pm 28.49$
: c. $50 \pm 32.74$
: d. $50 \pm 55.66$

* You know the population mean for a certain test score. You select 10 people from the population to estimate the standard deviation. How many degrees of freedom does your estimation of the standard deviation have?
: a. 8
: b. 9
: c. 10
: d. 11

* In the population, a parameter has a value of 10. Based on the means and standard errors of their sampling distributions, which of these statistics estimates this parameter with the least sampling variability?
: a. Mean = 10, SE = 5
: b. Mean = 9, SE = 4
: c. Mean = 11, SE = 2
: d. Mean = 13, SE = 3

===References===
*[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Method_of_Moments_and_Maximum_Likelihood_Estimation SOCR]
* [http://en.wikipedia.org/wiki/Estimation Estimation Wikipedia]
* [http://onlinestatbook.com/2/estimation/characteristics.html OnlineStatBook: Estimation]
* [http://en.wikipedia.org/wiki/Confidence_interval Confidence Interval Wikipedia]
* [http://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator UMVUE Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Estimation}}

SMHS Estimation

2014-09-02T18:12:38Z

Zhenxunw: /* Characteristics of Estimators */

==[[SMHS| Scientific Methods for Health Sciences]] - Parameter Estimation ==

===Overview===
Estimation is an important concept in statistics and is widely applied in various areas. It deals with estimating the values of population parameters based on sample data. The parameters describe an underlying physical setting, and their values affect the distribution of the measured data. Two major approaches are commonly used in estimation: (1) the probabilistic approach assumes that the measured data is random with probability distribution dependent on the parameters; (2) the set-membership approach assumes that the measured data vector belongs to a set which depends on the parameter vector. The purpose of estimation is to find an estimator that is interpretable, accurate and exhibits some form of optimality. Criteria such as the minimum variance unbiased estimator are usually applied to measure estimator optimality, although an optimal estimator does not always exist. Here we present the fundamentals of estimation theory and illustrate how to apply estimation in real studies.

===Motivation===
To obtain a desired estimator, we first need to determine a probability distribution, with parameters of interest, based on the data. After deciding on the probabilistic model, we need to find the theoretically achievable precision available to any estimator based on the model and then develop an estimator based on this model. There are a variety of methods and criteria to develop and choose between estimators based on their performance: maximum likelihood estimators, Bayes estimators, method of moments estimators, minimum mean square error estimators, minimum variance unbiased estimators, best linear unbiased estimators, etc. Experiments or simulations can also be run to test estimators’ performance.

===Theory===
An estimate of a population parameter may be expressed in two ways:
*Point estimate: a single value of estimate. For example, sample mean is a point estimate of the population mean.
*Interval estimate: an interval estimate is defined by two numbers, between which a population parameter is said to lie.

====Confidence Intervals (CIs)====
CIs describe the uncertainty of a sampling method and consist of a confidence level, a statistic and a margin of error. The statistic and the margin of error define an interval estimate, which represents the precision of the method. A confidence interval is expressed as sample statistic ± margin of error.
A confidence interval at the 95% confidence level is interpreted as follows: if the sampling method were repeated many times, about 95% of the resulting intervals would contain the true population parameter.

* Confidence level: the probability part of a confidence interval. It describes the likelihood that a particular sampling method will produce a confidence interval that includes the true population parameter.

* Margin of error: range of the values above and below the sample statistic in confidence interval. ''margin of error=critical value*standard deviation of the statistic''.

* Critical value: The central limit theorem states that the sampling distribution of a statistic will be normal or nearly normal and the critical value can be expressed as a t score or as a z score, if ANY of the following conditions apply:
**The population distribution is normal;
**The sampling distribution is symmetric, unimodal, without outliers, and the sample size is 15 or less;
**The sampling distribution is moderately skewed, unimodal, without outliers, and the sample size is between 16 and 40;
**The sample size is greater than 40, without outliers.

To find the critical value, follow these steps.
*Compute alpha $(\alpha): \alpha = 1 - (confidence\ level / 100)$
*Find the critical probability $(p^*): p^* = 1 -\frac {\alpha} {2}$
*To express the critical value as a $z$ score, find the $z$ score having a cumulative probability equal to the critical probability $(p^*)$.
*To express the critical value as a t score, first find the degrees of freedom (DF): when estimating a mean score or a proportion from a single sample, DF is equal to the sample size minus one. For other applications, the degrees of freedom may be calculated differently. We will describe those computations as they come up.

The critical t score $(t^*)$ is the t score having degrees of freedom equal to DF and a cumulative probability equal to the critical probability $(p^*)$.

Should you express the critical value as a t score or as a z score? As a practical matter, when the sample size is large (greater than 40), it doesn't make much difference. Both approaches yield similar results. Strictly speaking, when the population standard deviation is unknown or when the sample size is small, the t score is preferred. Nevertheless, many introductory statistics texts use the z score exclusively.
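The steps for finding a z critical value can be sketched in a few lines of Python. This is only an illustration: the function name <code>z_critical_value</code> is hypothetical, and it relies on <code>statistics.NormalDist</code> from the Python standard library (Python 3.8 or later). The standard library has no t distribution, so a t critical value would need a table or an external package.

```python
from statistics import NormalDist

def z_critical_value(confidence_level_percent: float) -> float:
    """Two-sided z critical value for a confidence level given in percent."""
    alpha = 1 - confidence_level_percent / 100   # step 1: compute alpha
    p_star = 1 - alpha / 2                       # step 2: critical probability
    # step 3: z score whose cumulative probability equals p_star
    return NormalDist().inv_cdf(p_star)

z95 = z_critical_value(95)
z90 = z_critical_value(90)
print(round(z95, 3), round(z90, 3))   # prints 1.96 1.645
```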

* Standard error: an estimate of the standard deviation of a statistic. When the values of population parameters are unknown, it is valuable to compute the standard error as an estimate of the standard deviation of a statistic. It is computed from known sample statistics. The table below shows how to compute the standard error for simple random samples, assuming that the population size is at least 10 times larger than the sample size.
<center>
{| class="wikitable" style="text-align:center;width:25%" border="1"
|-
|Statistic || Standard error
|-
|Sample mean, $\bar{x}$ || $SE_{\bar{x}}=s/\sqrt{n}$
|-
|Sample proportion, $p$ || $SE_{p}=\sqrt{\frac{p(1-p)}{n}}$
|-
|Difference between means, $\bar{x}_{1} -\bar{x}_{2}$ || $SE_{\bar{x}_1 -\bar{x}_2} = \sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}$
|-
|Difference between proportions, $p_{1} - p_{2}$ || $SE_{p_{1} - p_{2}} = \sqrt{ \frac{p_1 (1-p_1)}{n_1} +\frac{p_{2}(1-p_{2})}{n_{2}}}$
|}
</center>
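The four standard-error formulas in the table translate directly into code. The sketch below is illustrative only: the function names and the sample values (n = 100, s = 2.0) are hypothetical, and the margin of error uses the 95% z critical value of 1.96.

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference between two sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical sample: n = 100 observations with sample standard deviation 2.0
se = se_mean(2.0, 100)
margin_of_error = 1.96 * se   # critical value times standard error
```

With these assumed inputs the standard error is 0.2, so the 95% margin of error is about 0.39.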

* Degrees of freedom: the number of independent pieces of information on which the estimate is based

In general, the degrees of freedom for an estimate equal the number of values minus the number of parameters estimated en route to the estimate in question. Suppose we have sampled 20 data points; then our estimate of the variance has 20 – 1 = 19 degrees of freedom, because the sample mean must be estimated first.

====Characteristics of Estimators====
* Bias: refers to whether an estimator tends to either overestimate or underestimate the parameter. We say an estimator is biased if the mean of the sampling distribution of the statistic is not equal to the parameter. For example, when computed from a sample, $\hat{\sigma}^{2}=\frac{\sum (x_i-\bar{x})^{2}}{N}$ is a biased estimator of the population variance, whereas the sample variance $s^{2}=\frac{\sum (x_i-\bar{x})^{2}}{N-1}$ is an unbiased estimator of the population variance.

*Sampling variability: refers to how much the estimate varies from sample to sample. It is usually measured by its standard error: the smaller the standard error, the less the sampling variability. For example, the standard error of the mean is $\sigma_M=\sigma/\sqrt{N}$. So the larger the sample size $(N)$, the smaller the standard error of the mean, hence the smaller the sampling variability.

*Unbiased estimate: if $\delta(X_{1},X_{2},\ldots,X_{n})$ is an unbiased estimate for $g(\theta)$ and $T$ is a complete sufficient statistic for the family of densities, then $\eta (X_{1},X_{2},\ldots,X_{n})=E[\delta(X_{1},X_{2},\ldots,X_{n})|T]$ is also unbiased for $g(\theta)$ and its variance is no larger than that of $\delta$.

*(Uniformly) Minimum-variance unbiased estimator ([http://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator UMVUE], or MVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter. It may not exist. Consider estimation of $g(\theta)$ based on data $X_{1},X_{2},\ldots,X_{n}$ independent and identically distributed from some member of a family with density $p_\theta, \theta \in \Omega$. An unbiased estimator $\delta(X_{1},X_{2},\ldots,X_{n})$ of $g(\theta)$ is UMVUE if, $\forall \theta \in \Omega$, $var(\delta(X_{1},X_{2},\ldots,X_{n})) \leq var(\tilde{\delta} (X_{1},X_{2},\ldots,X_{n}))$ for any other unbiased estimator $\tilde{\delta}$.

: $MSE(\delta)=var(\delta)+(bias(\delta))^{2}$. The MVUE minimizes MSE among unbiased estimators. In some cases biased estimators have lower MSE because they have a smaller variance than does any unbiased estimator.
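The bias of the two variance estimators and the MSE decomposition can be checked exactly on a toy case. The sketch below is an illustration under stated assumptions: the four-value population and the sample size of 2 are hypothetical, and sampling is with replacement so that all 16 ordered samples are equally likely and can be enumerated.

```python
from itertools import product

population = [1, 2, 3, 4]                     # hypothetical population
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)   # true variance

n = 2
samples = list(product(population, repeat=n))  # every equally likely sample of size n

def variance_estimate(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor

def summarize(divisor):
    """Exact bias, variance and MSE of the estimator over all samples."""
    estimates = [variance_estimate(s, divisor) for s in samples]
    mean_est = sum(estimates) / len(estimates)
    bias = mean_est - sigma2
    var = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
    mse = sum((e - sigma2) ** 2 for e in estimates) / len(estimates)
    return bias, var, mse

bias_n, var_n, mse_n = summarize(n)           # divide by N: biased
bias_n1, var_n1, mse_n1 = summarize(n - 1)    # divide by N-1: unbiased
```

In this toy case the N−1 estimator has zero bias, the N estimator underestimates the variance, and the biased estimator nevertheless has the smaller MSE, which matches the remark that biased estimators can have lower MSE.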

===Applications===
* [[AP_Statistics_Curriculum_2007_Estim_MOM_MLE|This article]] presents the MOM and MLE methods of estimation. It illustrates the MOM method with detailed examples and includes several exercises for students to practice. MOM, which is short for Method Of Moments, is one of the most commonly used methods to estimate population parameters using observed data from a specific process. The idea is to use the sample data to calculate sample moments and then set these equal to their corresponding population counterparts. Steps: (1) determine the $k$ parameters of interest and the specific distribution for this process; (2) compute the first $k$ (or more) sample moments; (3) set the sample moments equal to the population moments and solve the resulting system of $k$ equations with $k$ unknowns. Let’s look at a simple example of the MOM method.

: Suppose we want to estimate the true probability of a head for a possibly unfair coin. We flip the coin 10 times and observe the following outcome: {H,T,H,H,T,T,T,H,T,T}. With MOM: (1) the parameter of interest is $p=P(H)$ and each flip follows a Bernoulli distribution; (2) the number of heads $Y$ in $n=10$ flips follows a Binomial distribution with $E[Y]=np$, and setting this equal to the observed count gives $10p=4$, so $\hat{p}=2/5$; (3) the estimate of the true probability of flipping a head in one experiment is $2/5$. This is a simple example of MOM estimation of a proportion.
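: The coin example can be reproduced in a few lines. This is a minimal sketch; the variable names are hypothetical and the data are the ten flips listed above.

```python
flips = list("HTHHTTTHTT")    # the ten observed outcomes from the example
n = len(flips)
heads = flips.count("H")      # observed value of Y, the number of heads
p_hat = heads / n             # solve n * p = heads for p (first moment equation)
print(heads, p_hat)           # prints 4 0.4
```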

* [http://onlinestatbook.com/2/estimation/estimation.html This article] presents a fundamental introduction to estimation theory and illustrates its basic concepts and applications. It offers specific examples and exercises on each concept and application and is a good starting point for estimation theory.

* [http://digital-library.theiet.org/content/journals/10.1049/ip-f-2.1993.0015 This article] proposes an algorithm, the bootstrap filter, for implementing recursive Bayesian filters. The required density of the state vector is represented as a set of random samples, which are updated and propagated by the algorithm. The method is not restricted by assumptions of linearity or Gaussian noise and may be applied to any state transition or measurement model. The article presents a simulation example of the bearings-only tracking problem and includes schemes for improving the efficiency of the basic algorithm.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distribution]
*[http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Simulations & Experiments]
*[http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Which of the following statements is true?
: a. When the margin of error is small, the confidence level is high.
: b. When the margin of error is small, the confidence level is low.
: c. A confidence interval is a type of point estimate.
: d. A population mean is an example of a point estimate.
: e. None of the above.

* Which of the following statements is true?
: a. The standard error is computed solely from sample attributes.
: b. The standard deviation is computed solely from sample attributes.
: c. The standard error is a measure of central tendency.
: d. All of the above.
: e. None of the above.

* 900 students were randomly selected for a national survey. Among survey participants, the mean grade-point average (GPA) was 2.7, and the standard deviation was 0.4. What is the margin of error, assuming a 95% confidence level?
: a. 0.013
: b. 0.025
: c. 0.500
: d. 1.960

* Suppose we want to estimate the average weight of an adult male in Dekalb County, Georgia. We draw a random sample of 1,000 men from a population of 1,000,000 men and weigh them. We find that the average man in our sample weighs 180 pounds, and the standard deviation of the sample is 30 pounds. What is the 95% confidence interval?
: a. $180 \pm 1.86$
: b. $180 \pm 3.0$
: c. $180 \pm 5.88$
: d. $180 \pm 30$

* Suppose that simple random samples of seniors are selected from two colleges: 15 students from school A and 20 students from school B. On a standardized test, the sample from school A has an average score of 1000 with a standard deviation of 100. The sample from school B has an average score of 950 with a standard deviation of 90. What is the 90% confidence interval for the difference in test scores at the two schools, assuming that test scores came from normal distributions in both schools? (Hint: Since the sample sizes are small, use a t score as the critical value.)
: a. $50 \pm 1.70$
: b. $50 \pm 28.49$
: c. $50 \pm 32.74$
: d. $50 \pm 55.66$

* You know the population mean for a certain test score. You select 10 people from the population to estimate the standard deviation. How many degrees of freedom does your estimation of the standard deviation have?
: a. 8
: b. 9
: c. 10
: d. 11

* In the population, a parameter has a value of 10. Based on the means and standard errors of their sampling distributions, which of these statistics estimates this parameter with the least sampling variability?
: a. Mean = 10, SE = 5
: b. Mean = 9, SE = 4
: c. Mean = 11, SE = 2
: d. Mean = 13, SE = 3

===References===
*[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Method_of_Moments_and_Maximum_Likelihood_Estimation SOCR]
* [http://en.wikipedia.org/wiki/Estimation Estimation Wikipedia]
* [http://onlinestatbook.com/2/estimation/characteristics.html OnlineStatBook: Estimation]
* [http://en.wikipedia.org/wiki/Confidence_interval Confidence Interval Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Estimation}}

SMHS IntroEpi

2014-09-02T18:02:14Z

Zhenxunw: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Introduction to Epidemiology ==

===Overview===
Epidemiology is the study of the distribution and determinants of disease frequency in human populations. It is an important scientific area: it is the only scientific discipline that is concerned with the occurrence of disease in human populations and how it changes over time. This introduction to Epidemiology aims to introduce the field and the basic concepts and methodologies we are going to apply later. It also aims to help students solve and analyze epidemiological problems and to introduce them to various types of epidemiological studies.

===Motivation===
To get an introduction to Epidemiology, we want to:
*learn the basic language of epidemiology and identify key sources of data for epidemiologic purposes
*be able to calculate and interpret measures of disease frequency
*recognize and evaluate epidemiological study designs and their limitations
*be an informed consumer of epidemiological sources of information (journals, websites, government agencies).

===Theory===
*Five main goals of epidemiology:
**1. To identify the cause of disease and its risk factors
**2. To determine the extent of disease found in the community
**3. To study the natural history and prognosis of disease
**4. To evaluate new preventative and therapeutic measures
**5. To provide a foundation for developing public policy.

*Distinguishing between Endemic, Epidemic, and Pandemic
**Endemic: The habitual presence (or usual occurrence) of a disease within a given geographic area;
**Epidemic: The occurrence of a disease clearly in excess of normal expectancy in a given geographic area;
**Pandemic: A worldwide epidemic affecting an exceptionally high proportion of the global population.

*Modes of Disease Transmission
**Direct contact: transmission occurs when the pathogen is transferred directly from an infected person to a susceptible person, for example through sneezing, touching or sexual intercourse.
**Indirect contact: transmission involves the transfer of the pathogen by contact with a contaminated intermediate inanimate object or a vector.
:(1) Inanimate object (vehicle): examples include toys, food or water;
:(2) Vector-borne (animal or insect): examples include mosquitoes, ticks and mice.

*Attack Rates and Ratios (ARR)
**Attack rates and ratios use statistics to develop and evaluate hypotheses in an outbreak. This involves: starting with the big picture and big risk factors for disease, such as “How many people at the event got ill?”; refining the big picture into smaller questions such as “Did they eat the salad? Chicken? Or ice cream?”; and formulating a hypothesis such as “Among those who ate at the buffet, are the people who ate the Caesar salad at greater risk than those who did not?”
**Attack Rates (AR): $AR=\frac{Number\,of\,people\,at\,risk\,who\,develop\,a\,certain\, illness} {Total\,number\,of\,people\,at\,risk}.$
**Attack Rate Ratio (ARR): $ARR=\frac{Attack\,rate\,in\,those\,exposed} {Attack\,rate\,in\,those\,unexposed}.$
**$H_{0}:ARR=1$, and a 95% confidence interval can be used to see whether the estimated ARR interval includes the null value of 1. If the ARR is much greater than 1, then exposed people are more likely to develop the illness than unexposed people.
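The attack rate, attack rate ratio and a 95% confidence interval can be sketched as follows. This is an illustration only: the outbreak counts are hypothetical, and the interval uses the common large-sample approximation on the log scale, which is an assumption not stated in the text above.

```python
import math

def attack_rate(ill, at_risk):
    """Proportion of people at risk who developed the illness."""
    return ill / at_risk

def attack_rate_ratio_ci(ill_exp, n_exp, ill_unexp, n_unexp, z=1.96):
    """ARR with an approximate confidence interval computed on the log scale."""
    arr = attack_rate(ill_exp, n_exp) / attack_rate(ill_unexp, n_unexp)
    # large-sample standard error of ln(ARR)
    se_log = math.sqrt(1 / ill_exp - 1 / n_exp + 1 / ill_unexp - 1 / n_unexp)
    lower = math.exp(math.log(arr) - z * se_log)
    upper = math.exp(math.log(arr) + z * se_log)
    return arr, lower, upper

# Hypothetical outbreak: 30 of 50 salad eaters became ill, 10 of 50 non-eaters became ill
arr, lower, upper = attack_rate_ratio_ci(30, 50, 10, 50)
excludes_null = lower > 1 or upper < 1   # True when the interval does not contain 1
```

With these assumed counts the ARR is 3 and the interval lies entirely above 1, so the data would be inconsistent with $H_{0}:ARR=1$.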

*Measuring Disease

This part names and calculates two measures of incidence, describes the differences in interpreting them, and explains the difference between a proportion and a true rate.
**Incidence: the number of new cases of a disease occurring in the population during a specified period of time divided by the number of persons at risk of developing the disease during that period. For example, if there are 2000 persons at risk during the year and 20 develop the disease over that period, the (cumulative) incidence would be 20⁄2000=1%.
**Cumulative incidence: $ \frac{Number\,of\,new\,cases}{Total\,population\,at\,risk}. $
**Incidence rate: $\frac{Number\,of\,new\,cases}{Total\,person-time\,contributed\,by\,the\,persons\,followed}.$
Person time is a way to measure the amount of time all individuals in a study spend at risk. For example, if subject A is followed for 3 days, subject B is followed for 5 days and C for 8 days then person-days = 3 + 5 + 8 = 16.
**Prevalence: $\frac{Number\,of\,cases\,of\,a\,disease\,in\,the\,population\,at\,a\,specified\,time}{Number\,of\,persons\,in\,the\,population\,at\,that\,time}.$
**The specified time can be a period or a point, so we can measure the prevalence during a short period in January of 2013 or on January 3$^{rd}$, 2013.
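The measures above can be computed directly. The following sketch reuses the numbers from the text (20 new cases among 2000 persons at risk; follow-up of 3, 5 and 8 days); the function names, the single new case used for the incidence rate and the prevalence counts are hypothetical.

```python
def cumulative_incidence(new_cases, population_at_risk):
    """New cases divided by the population at risk (a proportion)."""
    return new_cases / population_at_risk

def incidence_rate(new_cases, follow_up_times):
    """New cases divided by total person-time (a true rate)."""
    return new_cases / sum(follow_up_times)

def prevalence(existing_cases, population):
    """Existing cases divided by the population at the specified time."""
    return existing_cases / population

ci = cumulative_incidence(20, 2000)     # 0.01, i.e. 1%
person_days = sum([3, 5, 8])            # 16 person-days
rate = incidence_rate(1, [3, 5, 8])     # hypothetical: 1 new case per 16 person-days
prev = prevalence(50, 1000)             # hypothetical: 50 existing cases among 1000 persons
```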

*Measuring Mortality Rates
**To calculate and interpret all-cause mortality rates, group-specific mortality rates and cause-specific mortality rates.
**All cause mortality rates=$\frac{Number\,of\,deaths\,in\,a\,specified\,time\,period}{Number\,in\,population\,in\,the\,middle\,of\,the\,year}$.
**Cause-specific mortality rate=$\frac{Total\,number\,of\,deaths\,in\,1\,year\,from\,lung\,cancer\,in\,US}{Population\,of\,the\,US\,in\,the\,middle\,of\,the\,year}$.
**Group-specific mortality rate=$\frac{Total\,number\,of\,deaths\,in\,1\,year\,among\,women\,in\,US} {Female\,population\,of\,the\,US\,in\,the\,middle\,of\,the\,year}$.
*Additional Measures of Mortality
**Infant mortality: $\frac{Number\,of\,deaths\,in\,children\,under\,1\,year\,of\,age\,in\,2011} {Number\,of\,live\,births\,in\,2011}$.

**Proportionate mortality: measures proportion of all deaths occurring in a given place over a given time that is due to a given cause.
**Case fatality: of all people diagnosed with a given disease, the proportion who die of that disease over a certain period.
**Underlying cause of death.

*Direct and Indirect Adjustment of Rates

Direct and indirect adjustment of rates are used to compare two populations, or one population at different time periods, that have different age distributions. Adjusting for age lets us compare the mortality rates as if both populations had the same age distribution.
**Direct age-adjustment: expected rate (or standardized rate) can be compared to the crude rate or to any other similarly standardized rate.

For each population:
:1. Calculate age-specific rates
:2. Multiply age-specific rates by the # of people in corresponding age range in standard population
:3. Sum expected # of deaths across age groups
:4. Divide total # of expected deaths by total standard population

Result: an age-adjusted mortality rate for each population of interest.

**Indirect age-adjustment: the expected number of deaths can be compared to the number of actual deaths through the standardized mortality ratio (SMR). It is especially useful when the group-specific rates cannot be trusted (e.g. if the population is too small).
:1. Acquire age-specific mortality rates for standard population
:2. Multiply standard population’s age-specific rates by # of people in age range in study population
:3. Sum expected # of deaths across age groups in study population
:4. Divide observed # of deaths by expected # of deaths in study population

Result: SMR (>1 more than expected, =1 as expected, <1 less than expected)
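The two four-step procedures above can be sketched as short functions. All numbers below are hypothetical (three age groups, invented rates and population counts), and the function names are illustrative.

```python
def direct_adjusted_rate(age_specific_rates, standard_population):
    """Direct adjustment: apply the study population's age-specific rates
    to the standard population and divide expected deaths by its total size."""
    expected = [r * n for r, n in zip(age_specific_rates, standard_population)]
    return sum(expected) / sum(standard_population)

def standardized_mortality_ratio(observed_deaths, standard_rates, study_population):
    """Indirect adjustment: apply the standard population's age-specific rates
    to the study population and compare observed with expected deaths."""
    expected = sum(r * n for r, n in zip(standard_rates, study_population))
    return observed_deaths / expected

# Hypothetical data for three age groups
rates_a = [0.001, 0.005, 0.020]           # age-specific death rates in population A
standard_pop = [5000, 3000, 2000]         # standard population by age group
adjusted_a = direct_adjusted_rate(rates_a, standard_pop)   # (5 + 15 + 40) / 10000

standard_rates = [0.001, 0.004, 0.015]    # age-specific rates in the standard population
study_pop = [1000, 1000, 1000]            # study population by age group
smr = standardized_mortality_ratio(30, standard_rates, study_pop)   # 30 observed / 20 expected
```

Here the adjusted rate is 0.006 (6 per 1000) and the SMR is 1.5, meaning more deaths were observed than expected.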

*Screening

Screening is the use of testing to sort out apparently well (asymptomatic) persons who probably have disease from those who probably do not, which allows the disease to be detected early. Examples of screening include fasting blood sugar for diabetes, bone densitometry for osteoporosis and otoacoustic emissions testing for hearing loss in newborns. It is done during the preclinical phase and is a secondary prevention strategy. Screening increases lead time, thereby allowing us to detect disease early, initiate treatment sooner and provide better outcomes. However, screening programs must be warranted, and there must be a critical point in the course of the disease that can be preceded by screening.

*A. Clinical utility predictive value & reliability: clinical utility of positive tests.

If a patient tests positive, the likelihood that they actually have the disease is called the '''Positive Predictive Value (PPV)'''; if a patient tests negative, the likelihood that they actually do not have the disease is called the '''Negative Predictive Value (NPV)'''. PPV and NPV are affected by the prevalence of disease and by the specificity and sensitivity of the test.
<center>
{|class="wikitable" style="text-align:center;width:25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| Disease Status
|-
| Disease|| No Disease
|-
|rowspan=2 |Screening Test ||Positive|| a (True positives)|| b (False positives)
|-
| Negative || c (False negatives)|| d (True negatives)
|}
$PPV=\frac{a}{a+b},NPV=\frac{d}{c+d}$
</center>
'''PPV interpretation:''' Given a positive result on the screening test, the likelihood that the individual actually has the disease is PPV.

'''NPV interpretation:''' Given a negative result on the screening test, the likelihood that the individual actually does not have the disease is NPV.

*B. Factors that influence predictive values:

**Disease prevalence: increasing disease prevalence increases PPV and decreases NPV. Screening programs are most productive and efficient in high-risk populations; screening for infrequent disease may waste resources; PPV needs to be presented in the context of disease prevalence.
**Test specificity (ability of a test to correctly identify those who do not have the disease, $=\frac{d}{b+d}$): higher test specificity increases PPV.
**Test sensitivity (ability of a test to correctly identify those who have the disease, $=\frac{a}{a+c}$).

'''Note:''' the cutpoint of a test influences its sensitivity and specificity: lowering the cutpoint increases true positives, hence increases sensitivity, and decreases true negatives, hence decreases specificity. Similarly, raising the cutpoint decreases true positives, hence decreases sensitivity, and increases true negatives, hence increases specificity.
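The quantities defined from the 2×2 screening table, and the dependence of PPV on prevalence, can be illustrated with a short sketch. The cell counts and the test characteristics below are hypothetical; <code>ppv_from_prevalence</code> applies Bayes' rule, which is equivalent to $a/(a+b)$ when the table reflects the population prevalence.

```python
def screening_measures(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),   # diseased who test positive
        "specificity": d / (b + d),   # non-diseased who test negative
        "ppv": a / (a + b),           # test-positives who are diseased
        "npv": d / (c + d),           # test-negatives who are not diseased
    }

def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV implied by sensitivity, specificity and disease prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical screening results for 1000 people
m = screening_measures(a=90, b=50, c=10, d=850)

# Same hypothetical test (90% sensitive, 90% specific) at two prevalences
ppv_low = ppv_from_prevalence(0.9, 0.9, 0.01)    # rare disease
ppv_high = ppv_from_prevalence(0.9, 0.9, 0.20)   # high-risk population
```

With these assumed inputs the PPV rises from roughly 8% at 1% prevalence to roughly 69% at 20% prevalence, which is why screening is more productive in high-risk populations.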

*C. Validity:

Validity is the ability of a test to distinguish between who has disease and who does not; reliability is the ability to replicate results on the same sample if the test is repeated. The following charts show the three possible outcomes (from left to right): valid but not reliable, reliable but not valid, and both valid and reliable.
<center>
[[Image:SMHS_InNtroEpi_Fig_1_2_3_C.png]]
</center>
*D. Reliability (repeatability) of tests:

Can the results be replicated if the test is redone? The results may be influenced by three factors:
**Intrasubject variation: variation within individual subjects
**Intraobserver variation: variation in reading of results by the same reader
**Interobserver variation: variation between those reading results

*E. How does multiple testing improve screening programs?

Using multiple tests:
:(1) sequential (2-stage) testing uses the less expensive, less invasive, less uncomfortable test first; if the person is positive on the first test, follow up with additional testing.
:(2) simultaneous (parallel) testing conducts multiple screening tests at the same time; to be considered positive, the person can test positive on either test, and to be considered negative, the person must test negative on all tests.

Each test has its own sensitivity and specificity. Multiple testing can improve net sensitivity (simultaneous testing) or net specificity (sequential testing): sequential testing decreases net sensitivity and increases net specificity, while simultaneous testing increases net sensitivity and decreases net specificity.
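The effect of combining two tests can be sketched numerically. The formulas below assume that the two tests err independently of each other given true disease status, which is an assumption added for illustration; the sensitivities and specificities are hypothetical.

```python
def sequential_net(sens1, spec1, sens2, spec2):
    """Two-stage testing: positive overall only if positive on both tests."""
    net_sens = sens1 * sens2
    # negative overall if negative on test 1, or positive on test 1 but negative on test 2
    net_spec = spec1 + (1 - spec1) * spec2
    return net_sens, net_spec

def simultaneous_net(sens1, spec1, sens2, spec2):
    """Parallel testing: positive overall if positive on either test."""
    net_sens = sens1 + (1 - sens1) * sens2
    net_spec = spec1 * spec2          # negative overall only if negative on both
    return net_sens, net_spec

# Hypothetical tests: test 1 (80% sensitive, 90% specific), test 2 (90% sensitive, 80% specific)
seq_sens, seq_spec = sequential_net(0.8, 0.9, 0.9, 0.8)
par_sens, par_spec = simultaneous_net(0.8, 0.9, 0.9, 0.8)
```

Under these assumptions sequential testing gives net sensitivity 0.72 and net specificity 0.98, while simultaneous testing gives net sensitivity 0.98 and net specificity 0.72.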

*Randomized Controlled Trials (RCT):

The investigator assigns exposure at random to study participants and then observes whether there are differences in health outcomes between people who were (treatment group) and were not (comparison group) exposed to the factor. Special care is taken to ensure that follow-up is done in an identical way in both groups. The essence of a good comparison between “treatments” is that the compared groups are the same except for the “treatment”.
**Steps of an RCT: a hypothesis is formed; study participants are recruited based on specific criteria and their informed consent is sought; eligible and willing participants are randomly allocated to a particular study group; study groups are monitored for the outcome under study; rates of the outcome in the various groups are compared.
<center>
[[Image:MSHS_IntroEpi_Fig_3_actually2.png |400px]]
</center>
**External and internal validity:
***External validity: Generalization of the study to a larger source population. Influenced by factors like: demographic differences between eligible and ineligible subgroups; whether the intervention mirrors what will happen in the community or source population.
***Internal validity: Ability to reach correct conclusion in study. Influenced by factors like: ability of subjects to provide valid and reliable data; expected compliance with a regimen; low probability of dropping out.

*Measures of Association and Effect in RCT:

Ratio of two measures of disease incidence (relative measures) - Risk Ratio (Relative Risk), Rate Ratio.
Difference between two measures of disease incidence: Risk difference, efficacy.
<center>
{|class="wikitable" style="text-align:center;width:25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| Disease Status
|-
| Disease|| No Disease
|-
|rowspan=2 |Treatment||Drug A|| a || b
|-
| Placebo || c || d
|-
|}
</center>
$Relative\,Risk=\frac{Cumulative\,Incidence\,in\,exposed} {Cumulative\,Incidence\,in\,unexposed}=ratio\,of\,risks=Risk\,Ratio=\frac{a/(a+b)} {c/(c+d)}=\frac{CI_{drugA}}{CI_{placebo}}$

<center>
$Rate\, Ratio=\frac{Incidence\,rate\,in\,exposed} {Incidence\,rate\,in\,unexposed}$
</center>

Interpretation: if RR>1, the outcome is RR times as likely to occur in group A as in group B; if RR=1, this is the null value (no difference between groups); if RR<1, either calculate the reduction in risk (100%-xx%, where xx% is the RR expressed as a percentage) or invert (1/RR) to interpret the result as a “less likely” risk.
<center> $Efficacy=\frac{C.I.\,rate\,in\, placebo-C.I.\,rate\, in\, the\, treatment}{C.I.\,rate\, in\, placebo\, group}$
</center>
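The relative risk, risk difference and efficacy follow directly from the 2×2 treatment table. The sketch below uses hypothetical trial counts and an illustrative function name.

```python
def rct_measures(a, b, c, d):
    """a, b = diseased / not diseased on drug A; c, d = diseased / not diseased on placebo."""
    ci_drug = a / (a + b)        # cumulative incidence in the treatment group
    ci_placebo = c / (c + d)     # cumulative incidence in the placebo group
    return {
        "relative_risk": ci_drug / ci_placebo,
        "risk_difference": ci_drug - ci_placebo,
        "efficacy": (ci_placebo - ci_drug) / ci_placebo,
    }

# Hypothetical trial: 10 of 100 on drug A and 25 of 100 on placebo develop the disease
result = rct_measures(10, 90, 25, 75)
```

With these assumed counts the relative risk is 0.4, the risk difference is -0.15 and the efficacy is 0.6 (a 60% reduction in risk relative to placebo).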
*Situations that favor the use of RCT:
:(1) Exposure of interest is a modifiable factor over which individuals are willing to relinquish control;
:(2) Legitimate uncertainty exists regarding the effect of interventions on outcome, but reasons exist to believe that the benefits of the intervention in question outweigh the risks;
:(3) Effect of intervention on outcome is of sufficient importance to justify a large study.

*Cohort Study:

Populations of exposed and unexposed individuals at risk of developing outcomes are followed over time to compare the development of disease in each group.
**Steps: establish the study population by identifying one that is reflective of the base population of interest and has a distribution of exposure; identify groups of exposed and unexposed individuals; follow both groups and compare their outcomes.
[[Image:MSHS_IntroEpi_Fig2_C.png |500px|]]
**Types:
Cohort studies are prospective (concurrent) or retrospective (non-concurrent), depending on when the data are collected.
Retrospective studies have the benefits of being more cost effective and good for diseases of long latency.
Prospective studies have the benefit of presumably higher data quality.
Both designs need to be cautious of ascertainment bias if the outcome or exposure is known.

**Measures of Association in Cohort Study:

Ratio of two measures of disease incidence (relative measures): Risk Ratio (Relative Risk), Rate Ratio.
Difference between two measures of disease incidence: Risk Difference, Rate Difference.
**Strengths and weaknesses of Cohort Design:
Strengths:
:(1) Maintain temporal sequence – can estimate incidence of disease; exposure precedes development of disease; also explore time-varying information.
:(2) Excellent for studying known adverse exposures or those that cannot practically be randomized.
:(3) Like RCT, excellent for studying rare exposures.
:(4) Multiple outcomes and sometimes multiple exposures can be studied.
Disadvantages:
:(1) Long-term follow-up required and expensive;
:(2) Not effective at capturing rare outcomes and can be challenging for studying diseases that take a long time to develop;
:(3) Loss to follow-up can be a problem;
:(4) Changes over time in criteria and methods can lead to problems with inferences;
:(5) People self-select exposures so exposed and unexposed may differ with respect to important characteristics.
**Situations that favor a Cohort Study:
:(1) When there is evidence of an association between the exposure and the disease from other studies;
:(2) When the exposure is rare but the incidence of disease among the exposed is high;
:(3) When time between exposure and development of the disease is relatively short or historical data is available;
:(4) When good follow-up can be ensured.

*Case Control Study:
A case control study compares cases (people with the disease) and controls (people without the disease) to see which group has had greater exposure to the suspected risk factor.
**Measures of Association: Odds Ratio.
<center>
{|class="wikitable" style="text-align:center;width:25%" border="1"
|-
| colspan=2| || Case || Control
|-
|rowspan=2 |Exposed || Yes || a || b
|-
| No || c ||d
|-
|}
</center>
$Odds\, Ratio=\frac{odds\, of\, a\, case\, being\, exposed}{odds\, of\, a\, control\, being\, exposed}=\frac{(a/c)} {(b/d)}=\frac {ad}{bc}.$

''Interpretation:'' if OR > 1, the odds of being exposed are OR times higher in the cases than in the controls; if OR < 1, the odds of being exposed are 1/OR times lower in the cases than in the controls; if OR = 1, there is no association (the odds are the same in cases and controls).
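The odds ratio can be computed from the 2×2 table as sketched below. The counts are hypothetical, and the confidence interval uses the standard large-sample approximation on the log scale, which is an assumption added here and not part of the text above.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b = exposed cases / exposed controls; c, d = unexposed cases / unexposed controls."""
    odds_ratio = (a * d) / (b * c)
    # large-sample standard error of ln(OR)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical study: 40 exposed cases, 20 exposed controls, 60 unexposed cases, 80 unexposed controls
odds_ratio, lower, upper = odds_ratio_ci(40, 20, 60, 80)
```

With these assumed counts the OR is about 2.67 and the approximate 95% interval lies above 1, so the odds of exposure are higher among the cases.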

*Strengths and weaknesses of Case Control Study:
**Strengths: Case Control Study Design is efficient and can evaluate many risk factors for the same disease, so is good for diseases about which little is known; it is observational – we don’t ask people to change their behavior, we just collect information on events that happen “naturally”.
**Weakness: Inefficient for rare exposures; can study only one outcome at a time; cannot calculate incidence of disease but can only estimate the odds of being exposed in cases vs. controls; the number of cases and controls in study is artificial and does not represent the natural distribution of disease in the population.

*Avoiding Recall / Reporting Bias:
**Ways to avoid recall and reporting bias include:
:(1) adjusting timing so that the time between the event/illness and the study is as short as possible, and using standardized questionnaires that obtain complete information;
:(2) using existing information if/when possible (e.g. medical records);
:(3) masking participants to the study hypothesis.
**Conditions under which an OR from a Case-Control Study can approximate an RR (OR≈RR):
:(1) when the cases are representative, with respect to their exposure status, of all people with the disease in the population from which the cases were drawn;
:(2) when the controls are representative, with respect to their exposure status, of all people without the disease in the population from which the cases are drawn;
:(3) when the disease being studied does not occur frequently.

*Cross-Sectional Studies:

A cross-sectional study is an observational study in which a subject’s exposure and disease data are measured at the same time. Prevalent cases of the disease are identified, and exposure prevalence is examined in relation to disease prevalence (there are no incident cases, and temporality cannot be determined).
**Strengths and Limitations of Cross-Sectional Studies:
'''Strengths:'''
:(1) good for generating hypotheses;
:(2) easily sets up other analytic designs;
:(3) temporality is not a problem for time invariant exposures (genetic markers);
:(4) relatively low cost.
'''Weaknesses:'''
:(1) temporality: it is unclear whether the exposure or the disease happened first;
:(2) prevalent cases may not be the same as incident cases;
:(3) not useful for rare disease;
:(4) subject to selection bias.

**Measures of Association in Cross Sectional Studies
<center>
{|class="wikitable" style="text-align:center;width:25%" border="1"
|-
| colspan=2| || Disease || No Disease
|-
|rowspan=2 |Exposed || Yes || a || b
|-
| No || c ||d
|-
|}
$Prevalence\,Ratio=\frac{Prevalence\,of\,disease\,in\,exposed}{Prevalence\,of\, disease\,in\,unexposed}=\frac{a/(a+b)}{c/(c+d)}$
</center>

*Ecologic Studies:

An ecological study is an observational study in which group-level data are used for the exposure and/or the outcome. Subjects can be grouped by place (multiple-group study), by time (time-trend study), or by place and time (mixed study). An error can occur when an association identified from group-level (ecologic) characteristics is ascribed to individuals even though no such association exists at the individual level.

'''Strengths and Disadvantages of Ecologic Studies:'''

'''Strengths:'''
:(1) data are relatively easy and/or cheap to obtain;
:(2) good place to start;
:(3) many relevant social, occupational and environmental exposures cannot be ascribed to an individual.
'''Weakness:''' reliance on group-level data may not correctly represent individual-level associations.

*The ecologic fallacy occurs when an association between variables based on group characteristics is used to make inferences about individuals even though that association does not exist at the individual level.

*Ecologic studies are useful for generation of new hypotheses because they are relatively easy and low-cost to conduct.

*Other Risk Estimates:
**Attributable Risk Estimates of Effect – if exposure causes increased risk of disease, then we can estimate how many cases of disease could be eliminated if we completely eliminate the exposure.
**Attributable Risk (AR): $AR=CI_{Exposed} - CI_{Not\,exposed}$. This is just the risk difference. The group of interest is the exposed group, and the aim is to quantify the risk of disease in the “exposed” group that is attributable to the exposure.
**Attributable Risk Percent $(AR\%)$: $AR\% = \frac{CI_{Exposed} - CI_{Not\,exposed}}{CI_{Exposed}}$
**Population Attributable Risk (PAR): $PAR= CI_{Total} - CI_{Not\,exposed}$
**Population Attributable Risk Percent $(PAR\%)$: $PAR\% = \frac{CI_{Total}-CI_{Not\,exposed}}{CI_{Total}}$
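The four attributable-risk measures can be computed together. The cumulative incidences below are hypothetical; they are chosen so that the total incidence (0.08) is consistent with 20% of the population being exposed.

```python
def attributable_risks(ci_exposed, ci_unexposed, ci_total):
    """Attributable risk measures from cumulative incidences (as proportions)."""
    ar = ci_exposed - ci_unexposed    # risk difference among the exposed
    par = ci_total - ci_unexposed     # excess risk in the whole population
    return {
        "AR": ar,
        "AR_percent": ar / ci_exposed,
        "PAR": par,
        "PAR_percent": par / ci_total,
    }

# Hypothetical incidences: 20% in the exposed, 5% in the unexposed, 8% overall
risks = attributable_risks(0.20, 0.05, 0.08)
```

Under these assumptions AR = 0.15 and AR% = 0.75, so three quarters of the risk among the exposed is attributable to the exposure; PAR = 0.03 and PAR% = 0.375.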

*Bias: A barrier to internal validity
**Causes of bias: any systematic error in the design, conduct or analysis of a study that results in a distorted estimate of the relationship between an exposure and an outcome, so that the observed results differ from the true results.
**Impact of bias: it can make it appear as if there is an association when there really is none (bias away from the null), or mask an association when there really is one (bias toward the null).
*Reasons we get the wrong answer:
*:(1) Selection bias: who is selected or retained in a study distorts your estimates of the truth. An example is selection bias due to differential retention in the study.
**Mechanisms to reduce bias:
**Ensure proper selection of study subjects (choose groups from the same source population; try lists of people that are more inclusive; use methods that result in high recruitment rates).
**Minimize loss to follow-up: keep participants happy and in touch with the study team; review non-respondents to understand their characteristics.

*:(2) Information bias: the quality of your information distorts your estimate of the true association. Examples include surveillance bias, non-differential misclassification of hypertension, reporting bias and differential misclassification. Sources of measurement error/misclassification: normal variability or imprecision in the measure, and error due to subconscious or conscious decisions by the participant or investigator.

*:(3) Confounding bias: differences between cases and controls, or between exposed and unexposed, distort your estimates of the truth. A variable is a confounder if (i) it is a known risk factor for the outcome, (ii) it is associated with the exposure, and (iii) it is not a result of the exposure. All three conditions are necessary for a variable to be considered a confounder.

*:(4) Chance: the luck of the draw gets you a study sample that is not representative of the larger population.
**Strategies to handle confounding: (1) in study design – individual matching, group matching, randomization (experimental studies); (2) in data analysis – stratification, adjustment.
Matching in a case-control study:
<center>
{|class="wikitable" style="text-align:center; width:25%" border="1"
|-
| || Control Exposed || Control Unexposed
|-
| Case Exposed || a || b
|-
|Case Unexposed || c ||d
|-
|}
</center>

Concordant pairs: both case and control exposed; neither case nor control exposed.
Discordant pairs: case exposed but control not exposed; control exposed but case not exposed.
*Matched analysis: Odds ratio (only based on discordant pairs) $Odds\, Ratio =\frac {b} {c}.$

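''Interpretation'': If there is an association between exposure and outcome, it is not due to any of the factors that were matched on. Note that the association between a matching variable and the outcome cannot be analyzed, because matching forces cases and controls to have the same distribution of that variable.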
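The matched-pair odds ratio can be sketched in a few lines of Python. This is only an illustration under assumed data: each pair is encoded as a hypothetical tuple (case exposed?, control exposed?), and the pair counts are made up. Following the table above, $b$ counts pairs where only the case is exposed and $c$ counts pairs where only the control is exposed.

```python
def matched_odds_ratio(pairs):
    """Odds ratio for a matched case-control study, using discordant pairs only.

    pairs: sequence of (case_exposed, control_exposed) booleans, one per matched pair.
    """
    pairs = list(pairs)
    b = sum(1 for case, control in pairs if case and not control)  # case exposed only
    c = sum(1 for case, control in pairs if control and not case)  # control exposed only
    if c == 0:
        raise ValueError("odds ratio undefined: no pairs with only the control exposed")
    return b / c


# Hypothetical study with 100 matched pairs.
pairs = (
    [(True, True)] * 10      # a: concordant, both exposed
    + [(True, False)] * 30   # b: discordant, case exposed only
    + [(False, True)] * 15   # c: discordant, control exposed only
    + [(False, False)] * 45  # d: concordant, neither exposed
)
odds_ratio = matched_odds_ratio(pairs)
print(odds_ratio)
```

The concordant pairs (cells a and d) do not enter the calculation; in this made-up example the estimate is 30/15.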
*Randomization: random allocation of the exposure (“treatment”) by the investigator ensures that the two groups (exposed and unexposed) are the same except for the exposure of interest. It controls for both known and unknown confounders, because these “third variables” should be equally distributed between the groups.
*Stratification: Examine the relationship between exposure and outcome within each stratum of a potential confounding variable; holding the confounding variable constant.
*Adjustment: A statistical technique that can be used to examine what the association between exposure and outcome would be IF the confounder was not associated with the exposure.

The following example illustrates age adjustment.

[[Image:MSHS_IntroEpi_Fig4.png]]
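Age adjustment by direct standardization can be sketched as follows. This is an illustrative sketch, not a reproduction of the figure: the function name, the age-specific rates and the standard population are all hypothetical. Each age-specific rate is applied to the corresponding stratum of a standard population to obtain expected events, and the total expected events are divided by the total standard population.

```python
def direct_age_adjusted_rate(age_specific_rates, standard_population, per=100_000):
    """Directly standardized rate.

    age_specific_rates: rates per `per` persons, one per age stratum.
    standard_population: standard population counts for the same strata.
    """
    if len(age_specific_rates) != len(standard_population):
        raise ValueError("rates and population must cover the same strata")
    # Expected number of events if the standard population experienced these rates.
    expected = [rate * pop / per
                for rate, pop in zip(age_specific_rates, standard_population)]
    return sum(expected) / sum(standard_population) * per


# Hypothetical strata: young, middle-aged, old.
rates = [10, 50, 200]                       # events per 100,000
standard = [500_000, 300_000, 200_000]      # standard population
adjusted = direct_age_adjusted_rate(rates, standard)
print(adjusted)
```

In this made-up example the expected events are 50, 150 and 400, so the adjusted rate is 600 events per 1,000,000 persons, that is 60 per 100,000.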

===Applications===
[http://www.sciencedirect.com/science/article/pii/S1631069107001072 This article] reviews, using some important examples, the classical methodological approach for discussing causality in epidemiology. Coronary heart disease (CHD) prevention has benefited greatly from the development of epidemiological research; however, the distinction between association and causation is still debated for observational data. The easy identification of DNA polymorphisms has prompted new CHD etiological research in the past 10 years. Assessing the causality of associations has some special characteristics when genes are involved, such as the necessity of replication and Mendelian randomization, which might prove to be important in future research.

[http://www.sciencedirect.com/science/article/pii/S0020748912004166 This article] retrospectively studies the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance would be moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events would be mediated by surveillance.

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Normal T Chi-Squared F Tables]

===Problems===

How do we learn about existence of outbreaks?
:a. cases call health departments directly
:b. clinicians
:c. laboratories
:d. all of the above

In the case of obesity, neighborhood access to healthy food stores represents which aspect of the epidemiologic triad?
:a. host
:b. agent
:c. vector
:d. environment
:e. all of the above

The Detroit population had 1 million people without lung cancer in 2000, and 700,000 people without lung cancer in 2010. During that time period, 17,000 people were newly diagnosed with lung cancer. What was the incidence rate for lung cancer in Detroit from 2000 to 2010 (expressed per 100,000 person-years)?
:a. 0.002 lung cancer cases per 100,000 person years
:b. 200 lung cancer cases per 100,000 person years
:c. 270 lung cancer cases per 100,000 person years
:d. 243 lung cancer cases per 100,000 person years

In a fixed population, what happens to the prevalence of a disease when the incidence increases slightly, considering the different duration scenarios below?
:a. The prevalence increases if the duration of disease is increasing or stays the same
:b. The prevalence increases if the duration of disease is decreasing rapidly
:c. The prevalence decreases if the duration of disease is increasing
:d. The prevalence decreases if the duration of disease stays the same

Ann Arbor’s Mortality Rates from Diabetes Mellitus among whites, 2002- 2012.
<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
|Age groups (years) ||Age-specific rates (per 100,000)|| Michigan standard population || Expected number of deaths
|-
|<20|| 20 ||2,000,000||
|-
|20-39|| 10 || 3,000,000 ||
|-
|40-59 ||5 ||1,000,000||
|-
|>60|| 30|| 4,000,000||
|-
|Total || || 10,000,000 ||
|}
</center>

What is the age-adjusted mortality rate from diabetes among whites according to the table above?
:a. 40.2 deaths per 100,000
:b. 19.5 deaths per 100,000
:c. 1.9 deaths per 100,000
:d. 20.4 deaths per 100,000

Given the information above, what is the Standardized Mortality Ratio (SMR) if the observed deaths in the white population are 3000?
:a. 1.54
:b. 5.02
:c. 1.69
:d. 0.65

When a serious disease can be treated if it is caught early, it is more important to have a test with high specificity than high sensitivity.
:True
:False

Sequential testing tends to have higher net specificity than specificity of a single test.
:True
:False

A new screening test has been developed for diabetes. The table below represents the results of the new test compared to the current gold standard. Use this table to answer the following questions:
<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
|colspan=2 rowspan=2| || colspan=2|Gold standard
|-
|Condition Positive||Condition negative
|-
|rowspan=2| Result of New Test|| Test Positive ||80||70
|-
|Test Negative ||10 ||240
|-
|}
</center>

What is the sensitivity of the new test?
:a. 77%
:b. 89%
:c. 80%
:d. 53%

What is the specificity of the test?
:a. 77%
:b. 89%
:c. 80%
:d. 53%

What is the positive predictive value of the test?
:a. 77%
:b. 89%
:c. 80%
:d. 53%

Understanding health behaviors that may protect against infection with the flu in population-dense areas is of great interest to epidemiologists. To determine if proper hand washing may prevent flu transmission, investigators recruited 834 students from a university dormitory to participate in a research study. At baseline, 74 individuals were experiencing flu-like symptoms and tested positive for active antibodies against the flu virus (meaning that they did, in fact, have the flu) and thus were not enrolled in the research study. The students who were not ill with the flu at baseline were followed for 12 months with no loss to follow-up. Researchers asked students to contact the study team when they exhibited flu-like symptoms so that they could be tested for the flu virus. During the course of follow-up, 379 students were diagnosed with the flu. Of the students enrolled in this study, 60% reported improper hand-washing behaviors. Of the students who were diagnosed with the flu during follow-up, 280 reported improper hand-washing.

:a. What type of study is this?
:b. Why is this type of study adequate for this particular situation?
:c. Imagine that you are the investigator picking the appropriate study design to answer this question, what might you have worried about in picking this design?
:d. What is the best measure of association to test the relationship between hand washing and incident flu? Why?
:e. Calculate and interpret the above measure of association using a 2X2 table.
:f. If proper hand-washing behavior were to be used by the students who exhibited improper hand-washing techniques, how many cases per 1000 would be prevented? Interpret your findings.

Chikungunya is a relatively rare viral disease transmitted by mosquitoes. This unpleasant disease is characterized by high fevers, nausea, vomiting, and crippling muscle and joint pain that may last for weeks to years as well as retinal damage. Chikungunya was recently detected in the Caribbean, prompting local epidemiologists to conduct a study on the Caribbean Island of Martinique to better understand local risk factors for Chikungunya. Researchers selected 100 individuals who tested positive for Chikungunya infection, as well as 200 individuals that did not have Chikungunya. Though they looked at multiple risk factors, the epidemiologists focused primarily on individuals’ use or non-use of mosquito repellent. Participants were asked about their repellent use (yes/no) in the 12 months preceding enrollment in the study. In their eventual publication, researchers reported that in total, 142 of the participants reported not using repellent. It was also noted that 31% of the participants who did not have Chikungunya reported no repellent use.
:a. What type of study design was used in this example?
:b. Why is this type of study appropriate for this particular situation?
:c. Given that the participants were asked about their use of repellent in the past, what is a potential limitation of this study?
:d. Set up a 2X2 table to assess the relationship between Chikungunya infection and improper mosquito repellent use.
:e. What is the appropriate measure of association for this study? Explain why.
:f. Calculate and interpret your measure of association.

A group of epidemiologists at a prestigious university decided to conduct a survey of public health students to investigate the relationship between cramping of the hands and creating 2x2 tables by hand. This survey was administered just once and there was no follow-up of the participants.
:a. What type of study is this?
:b. What type of measure of association is appropriate for this study? Why?
:c. Our epidemiologists found that 75% of study participants who had hand cramping reported excessive 2x2 table making. Are the epidemiologists justified in claiming that this study provides causal evidence that 2x2 table making leads to hand cramping? Why?

Parents of children who were born with birth defects may be more likely to remember any drug or exposure that occurred during pregnancy than parents of children born without birth defects. This is an example of what type of bias?
:a. interviewer bias
:b. recall bias
:c. loss to follow-up
:d. non-differential misclassification

Using data from the Nurses Health Study, the association between self-reported frequency of sunburns and melanoma was examined. When questioned after the diagnosis of melanoma, some women with melanoma may have exaggerated their frequency of sunburns especially if they were concerned that sun exposure was a reason they got melanoma. This is an example of:
:a. interviewer bias
:b. loss to follow-up
:c. differential misclassification
:d. non-differential misclassification

===References===
*[http://en.wikipedia.org/wiki/Epidemiology Epidemiology Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_IntroEpi}}

SMHS DesignOfExperiments

2014-08-31T22:12:41Z

Zhenxunw: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Design of Experiments ==

===Overview===
Design of experiments (DOE) is a systematic, rigorous approach to problem solving that applies principles and techniques at the data collection stage so as to ensure the generation of valid, supportable and defensible conclusions. DOE can be used at the point of greatest leverage to reduce costs by speeding up the design process, reducing late engineering design changes, and reducing product material and labor complexity. It is also a powerful tool for achieving manufacturing cost savings by minimizing process variation and reducing rework and the need for inspection. This lecture presents a general overview of DOE and introduces some fundamental concepts, objectives, steps and design guidelines to assist in conducting designed experiments.

=== Motivation===
An experiment is the natural way to implement a study and achieve the desired objectives. The next question is how such experiments and studies can be realized; that is, we need a blueprint for planning the study or experiment, including ways to collect data and to control study parameters for accuracy and consistency. What are the key factors in a process? At what settings would the process deliver acceptable performance? What are the main and interaction effects in the process, and what settings would produce less variation in the output?
Design of experiments addresses these questions. Experiments can be designed in many different ways to learn which process inputs have a significant impact on the process output and what the target level of those inputs should be to achieve a desired output. There are [http://www.itl.nist.gov/div898/handbook/pmd/section3/pmd31.htm four general problem areas] in which design of experiments may be applied:
*Comparative: the designer is interested in assessing whether a change in a single factor has in fact resulted in a change/improvement to the process as a whole.
*Screening and characterizing: the designer is interested in understanding the process as a whole in the sense that they can have a ranked list of the importance of factors that can affect the process.
*Modeling: the designer is interested in functionally modeling the process, with the output being a well-fitting mathematical function, and in having good estimates of the coefficients in that function.
*Optimizing: the designer is interested in determining the optimal settings of the process factors, that is to determine for each factor the level of the factor that optimizes the process response.

===Theory===
[http://en.wikipedia.org/wiki/Design_of_experiments '''The most common components in design of experiments (DOE):''']
*Comparison: in some fields of study it’s not possible to have independent measurements to a traceable standard and comparisons between treatments are much more valuable and preferable. To make inference about effects, associations or predictions, one typically has to compare different groups subjected to distinct conditions.
*Randomization: the process of assigning individuals at random to groups in an experiment. It requires that we make allocation of (controlled variables) treatments to units using some random mechanism. Random does not mean haphazard and great care needs to be taken to make sure appropriate random methods are used.
*Experimental vs. observational studies: there are many situations where randomized experiments are impractical. In such situations we cannot deduce causality or the effects of various treatments on the response measurement. Observational studies are retrospective or prospective studies where the investigator does not have control over the randomization of treatments to subjects or units. In these cases, the subjects or units fall naturally within a treatment group.
*Replication: all measurements, observations or data collection are usually subject to variation and uncertainty. Measurements are repeated and full experiments are replicated to help identify the sources of variation, to better estimate the true effects of treatments, to further strengthen the experiment’s reliability and to add to the existing knowledge of the topic.
*Blocking: the arrangement of experimental units into groups consisting of units that are similar to one another. It reduces known but irrelevant sources of variation between units and allows greater precision in the estimation of the source of variation under study.
*Orthogonality: it concerns the forms of contrasts that can be legitimately and efficiently carried out. With independence between contrasts, each orthogonal treatment provides different information from the others. The goal is to completely decompose the variance or the relations of the observed measurements into independent components.
*Factorial experiments: these are more efficient at evaluating the effects and possible interactions of several factors than experiments that vary one factor at a time. DOE is built on the foundation of the analysis of variance, which partitions the observed variance into components according to the factors the experiment must estimate or test.
*Placebo: a sham or simulated medical intervention that has no direct health impact but may result in actual improvement of a medical condition or disorder. If such a sham effect is observed, it is called a placebo effect. Common placebos are inert tablets, sham surgery and other procedures based on false information. An example is giving a patient a pill identical to the actual treatment pill but without the treatment ingredients. Typically all patients are informed that some will be treated using the drug and some will receive the inert pill; however, the patients are blinded as to whether they actually received the drug or the placebo. Such an intervention may cause the patient to believe the treatment will change their condition, which may produce a subjective perception of a therapeutic effect.
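Randomization and blocking, as described in the list above, can be sketched with Python's standard `random` module. This is a minimal illustration under assumed names: the unit labels, treatment labels, block names and the seed are all hypothetical. The first function performs a completely randomized allocation; the second randomizes separately within each block of similar units.

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Assign units to treatments at random with (near-)equal group sizes."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # After shuffling, deal the treatments out in turn.
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}


def randomized_blocks(blocks, treatments, seed=None):
    """Randomize treatments separately within each block of similar units."""
    rng = random.Random(seed)
    assignment = {}
    for block_units in blocks.values():
        shuffled = list(block_units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


units = [f"p{i}" for i in range(1, 9)]                  # eight hypothetical patients
treatments = ["drug", "placebo"]
allocation = completely_randomized(units, treatments, seed=1)

blocks = {"younger": units[:4], "older": units[4:]}     # hypothetical age blocks
blocked_allocation = randomized_blocks(blocks, treatments, seed=1)
print(allocation)
print(blocked_allocation)
```

Fixing the seed makes the allocation reproducible, which helps with auditing; the blocked version guarantees that each block contributes equally to every treatment group, which the completely randomized version does not.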

'''Components of DOE:'''
*Factors (inputs): these include controllable and uncontrollable variables. The former are factors that we can control, such as how big the dose is or how often the patients take the treatment. The latter are factors over which we have no power, such as environmental factors: air conditions, temperature or humidity. People are generally considered a noise factor, that is, an uncontrollable factor that causes variability under normal operating conditions but that we can control during the experiment using blocking and randomization.
*Levels (settings of each factor): examples include particular level of dosage for evaluation.
*Response (output): consider the test of a new drug. The output could be the frequency of patients having a stroke or their need for drugs. Experimenters often want to avoid optimizing the process for one response at the expense of another, so all important outcomes are measured and analyzed to determine the factors and settings that will provide the best overall outcome.
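To illustrate how factors and levels combine into experimental runs, the following sketch enumerates a full factorial design with `itertools.product`. The factor names and levels are hypothetical; a full factorial design simply uses every combination of levels once.

```python
from itertools import product


def full_factorial(factors):
    """List every combination of factor levels (one dict per experimental run).

    factors: dict mapping a factor name to the list of its levels.
    """
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in product(*(factors[name] for name in names))]


# Hypothetical drug study: three dose levels and two dosing frequencies.
factors = {"dose_mg": [10, 20, 40], "frequency_per_day": [1, 2]}
runs = full_factorial(factors)
for run in runs:
    print(run)
```

The number of runs is the product of the numbers of levels (here 3 × 2 = 6), which is why full factorial designs grow quickly as factors are added.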

'''Objectives of DOE:'''
*Comparing alternatives: DOE allows us to make an informed decision that evaluates both the quality and the cost.
*Identifying significant factors that affect the output: separating the vital few from the trivial many.
*Achieving an optimal process output.
*Reducing variability.
*Minimizing, maximizing or targeting an output.
*Improving process or product robustness so that performance holds up under varying conditions.
*Balancing tradeoffs between multiple quality characteristics that require optimization.

'''DOE guidelines:'''
*DOE guidelines address the questions outlined above by stipulating factors to be tested, levels of the factors and structure and layout of experimental conditions. To sum up, DOE aims to come up with an experiment that can obtain the required information in a cost effective and reproducible manner.
*Unexplained variation, in addition to measurement error, can obscure the results. Errors can be unexplained variation that is either within or between experiment runs.
*Noise factors: uncontrollable factors that induce variation under normal operating conditions. For example, multiple shifts, humidity or raw materials can be built into the experiment so that their variation is not lumped into the unexplained error.
*Correlation: two factors that vary together may be highly correlated without one causing the other; they may both be caused by a third factor.
*Combined effects or interactions: consider growing a rose. Sufficient water benefits its growth, but too much water may be harmful. Factors may generate non-linear effects that are not additive, and these can only be studied with more complex experiments involving more than 2 level settings, such as quadratic or cubic designs.

'''DOE process:'''
<center>[[Image:SMHS_Design_of_Experiment_Fig_1_DOE_Process_Gallaway_07232014.jpg|500px]]
</center>

'''Test of means – one factor experiment:'''
One of the most common types of experiments is the comparison of two process methods, or two methods of treatment. One of the most straightforward methods to evaluate a new process method is to plot the results on an SPC chart that also includes historical data from the baseline process, with established control limits. Then apply the standard rules to evaluate out-of-control conditions to see if the process has been shifted. You may need to collect several subgroups worth of data in order to make a determination, although a single subgroup could fall outside of the existing control limits.
An alternative to the control chart approach is to use an F-test to compare the means of the alternative treatments; this is done automatically by ANOVA (analysis of variance). Consider the following example, where three treatments are analyzed with the following data:

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| rowspan=2 | || colspan=3| '''Treatment''' || ||
|-
|'''A (usual route)''' || '''B (alternate)''' || '''C (alternate)''' || '''Variance''' || '''Mean'''
|-
| rowspan=10| '''Time in Minutes'''|| 27.0 || 26.0 || 29.5 || ||
|-
| 31.0 || 33.0 || 25.0 || ||
|-
| 28.5|| 26.5 || 28.5 || ||
|-
| 26.0|| 27.5 || 25.5 || ||
|-
| 27.5|| 29.0 || 24.0 || ||
|-
| 29.0|| 27.5 || 24.0 || ||
|-
| 33.0|| 26.5 || 28.0 || ||
|-
| 35.0|| 27.0 || 26.0 || ||
|-
| 28.0|| 28.0 || 25.5 || ||
|-
| 29.0|| 32.0 || 26.5 || ||
|-
|Mean $(\bar Y)$|| 29.4||28.3||26.6||1.99 ||
|-
|Variance $s^2$||7.9 ||5.7 ||3.0 || || 5.51
|}
</center>

The F-test analysis is the basis for model evaluation of both single-factor and multi-factor experiments. This analysis is commonly output as an ANOVA table by statistical analysis software, as illustrated in the table below:
<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=7| '''ANOVA - Analysis of Variance Table'''
|-
|Source|| Sum of Squares|| DF || Mean Square || F-Ratio ||Probability || Significant
|-
| Between Groups ||39.80 || 2 || 19.90 || 3.61 || 0.0408 || Yes
|-
| Within Groups ||148.90|| 27 || 5.51 || || ||
|-
| Total ||188.70|| 29 || || || ||
|}
</center>

The probability 0.0408 means that, if the three routes had the same mean travel time, there would be only a 4.08% chance of observing an F-ratio this large due to noise (random chance) alone. In other words, at the 5% significance level the three routes differ significantly in the time taken to reach home from work.
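The entries of the ANOVA table can be approximately rebuilt from the summary rows of the data table (group means 29.4, 28.3, 26.6; group variances 7.9, 5.7, 3.0; 10 observations per group). The sketch below assumes a balanced design; the function name is hypothetical. Because the variances are rounded to one decimal, the within-group figures differ slightly from the table (about 149.4 instead of 148.90, and an F-ratio of about 3.60 instead of 3.61).

```python
def anova_from_summary(means, variances, n):
    """Balanced one-way ANOVA from group means and sample variances.

    means, variances: one entry per group; n: observations per group.
    """
    a = len(means)
    grand_mean = sum(means) / a                     # valid because groups are equal-sized
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ss_within = (n - 1) * sum(variances)
    df_between, df_within = a - 1, a * (n - 1)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
    }


table = anova_from_summary([29.4, 28.3, 26.6], [7.9, 5.7, 3.0], n=10)
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```

The p-value is the upper tail probability of the F distribution with (2, 27) degrees of freedom at the computed F-ratio; if SciPy is available, `scipy.stats.f.sf(f_ratio, 2, 27)` is one way to obtain it.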

ANOVA: $H_0: \mu_1=\cdots=\mu_a$ vs. $H_a$: at least one mean is different.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
|Source|| Sum of Squares|| DF || Mean Square || F-Ratio
|-
| Between Groups ||$ SS_{treatment}= \sum_{i=1}^{a} n_i (\bar y_{i.}-\bar y_{..})^2 $ || $a-1$ ||$ MS_{treatment}=\frac {SS_{treatment}}{a-1}$|| $ F =\frac {MS_{treatment}}{MS_{error}}$
|-
| Within Groups ||$SS_{error} = SS_{total} - SS_{treatment}$|| $N-a$ || $MS_{error}=\frac{SS_{error}} {N-a}$ ||
|-
| Total || $SS_{total} = \sum_{i=1}^{a} \sum_{j=1}^{n_{i}} (y_{ij}-\bar y_{..})^2 $ || $N-1$ || ||
|}
</center>

We reject the null hypothesis of equal treatment means if $F_0 > F_{\alpha,\,a-1,\,N-a}$, where $N-a = a(n-1)$ when every group has the same size $n$.

Note: $a$ is the number of treatments, $n_i$ is the sample size of the $i^{th}$ group, $N=\sum_{i=1}^{a} n_i$ is the total number of observations, $\alpha$ is the level of significance, $y_{ij}$ is the $j^{th}$ observation from group $i$, $\bar y_{..}$ is the grand mean of all the observations, and $\bar y_{i.}$ is the mean of the $i^{th}$ treatment group. ANOVA is studied further in the ANOVA section.
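The formulas in the ANOVA table translate directly into code. The following sketch works from raw observations and allows unequal group sizes; the function name and the small data set are hypothetical and serve only to make the arithmetic easy to check by hand.

```python
def one_way_anova(groups):
    """One-way ANOVA from raw data.

    groups: list of lists, one list of observations per treatment group.
    Returns (SS_treatment, SS_error, F).
    """
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: n_i * (group mean - grand mean)^2
    ss_treatment = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Total sum of squares: every observation against the grand mean
    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)
    ss_error = ss_total - ss_treatment

    ms_treatment = ss_treatment / (a - 1)
    ms_error = ss_error / (n_total - a)
    return ss_treatment, ss_error, ms_treatment / ms_error


# Hypothetical data: three groups of three observations each.
ss_treatment, ss_error, f_ratio = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(ss_treatment, ss_error, f_ratio)
```

For this toy data the group means are 2, 3 and 6, giving a treatment sum of squares of 26 on 2 degrees of freedom and an error sum of squares of 6 on 6 degrees of freedom. The resulting F-ratio is then compared with the critical value $F_{\alpha,\,a-1,\,N-a}$.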

===Applications===
*[http://www.cancer.org/healthy/stayawayfromtobacco/index This article] presents an observational study of the effects of smoking on cancer. It describes various harmful effects of tobacco on human health and provides a guide to quitting smoking. The article presents the whole study in ten sections, each fully developed in a clear question-and-answer format. It is well organized and gives readers the reasons for quitting smoking as well as suggestions and programs that help people quit. This is a typical observational study.

*[http://arc.aiaa.org/doi/abs/10.2514/2.3153?journalCode=ja This article] brings together an empirical drag prediction model plus design of experiments, response surface and data-fusion methods with computational fluid dynamics (CFD) to provide a wing optimization system. The system presented allows high-quality designs to be found using a full three-dimensional CFD code without the expense of direct searches. The meta-models built are shown to be more accurate than the initial empirical model or than simple response surfaces based on the CFD data alone. Data fusion is achieved by building a response surface kriging of the differences between the two drag prediction tools, which work at different levels of fidelity. The kriging is then used with the empirical tool to predict the drag values coming from the CFD code, which is much quicker than direct searches of the CFD.

*[http://www.tandfonline.com/doi/abs/10.1080/01621459.1972.10481253#.U6HGyBZRXKw This article] illustrates certain numerical approximations for finding one- and two-stage bioassay designs that produce small posterior variance using a one-parameter logistic distribution. It discusses the use of two prior distributions, one for design and the other for inference, with graphs for designing experiments when the prior distributions are normal. These graphs illustrate the importance of using additional dose levels when the variance of the prior distribution is large.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]

*[http://socr.ucla.edu/htmls/SOCR_Analyses.html SOCR Analyses]

*[http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

*[http://www.socr.ucla.edu/htmls/SOCR_ChoiceOfStatisticalTest.html SOCR Choice Of Statistical Test]

===Problems===
Suppose two researchers wanted to determine if aspirin reduced the chance of a heart attack. Researcher 1 studied the medical records of 500 patients. For each patient, he recorded whether the person took aspirin every day and if the person had ever had a heart attack. Then he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day.
Researcher 2 also studied 500 people. He randomly assigned half of the patients to take aspirin every day and the other half to take a placebo every day. After a certain length of time, he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day. Suppose that both researchers found that there is a statistically significant difference in the heart attack rates for the aspirin users and the non-aspirin users and that aspirin users had a lower rate of heart attacks. Can both researchers conclude that aspirin caused the reduction?
:(a) No, only researcher 2 can conclude this.
:(b) No, only researcher 1 can conclude this.
:(c) Yes, because aspirin is known to reduce heart attacks.
:(d) Yes, because aspirin users had a larger heart attack rate in both studies.

Suppose that you were hired as a statistical consultant to design a study to examine the impact of a new medicine vs. a current medicine on lowering blood pressure. 50 patients volunteer to participate in the study. What design will you recommend?
:(a) Completely randomized design with two factors.
:(b) Completely randomized design with two factors and single blind.
:(c) Completely randomized design.
:(d) Completely randomized design with two factors and double blind.

The next four questions are based on the following:
Hospital floors are usually covered by bare tiles. Carpets would cut down on noise but might be more likely to harbor germs. To study this possibility, investigators randomly assigned 8 of 16 available hospital rooms to have carpet installed. The others were left bare. Later, air from each room was pumped over a dish of agar. The dish was incubated for a fixed period, and the number of bacteria colonies was counted.

*1.Select the appropriate statistical term for the 8 rooms left bare.
*:(a) Treatments
*:(b) Experimental Units
*:(c) Control Group
*:(d) Response

*2.Select the appropriate statistical term for the 16 hospital rooms.
*:(a) Response
*:(b) Treatments
*:(c) Experimental Units
*:(d) Control Group

*3.Select the appropriate statistical term for number of colonies in a dish.
*:(a) Treatments
*:(b) Control Group
*:(c) Response
*:(d) Experimental Units

*4.Select the appropriate statistical term for number of colonies in a dish.
*:(a) Treatments
*:(b) Response
*:(c) Experimental Units
*:(d) Control Group

A psychologist is examining the effect of showing pictures on learning of words by seven-year-olds. The seven-year-olds are randomly assigned to two groups. The experimental group is shown the word along with the picture. The control group is shown only the word. At the end of the experiment, the subjects are given a test on the number of words they get right. This is an example of:
:(a) A blind study
:(b) An experiment with a design flaw
:(c) A double blind study
:(d) A well-designed experiment

Suppose that students A and B are working for the university. The registrar asks student A to calculate the mean and SD of the GPA's for the Fall 2005 freshmen class. He asks student B to design a sampling strategy to evaluate the attitude of the undergraduates at the university toward undergraduate teaching.
:(a) Student A is doing descriptive statistics and student B is doing inferential statistics.
:(b) Student A is doing inferential statistics and student B is doing descriptive statistics.
:(c) Both students are doing descriptive statistics.
:(d) Both students are doing inferential statistics.

At the Department of Statistics, we intend to examine the effect of using computers in Statistics 10 on the attitudes of students toward statistics. We offer ten lectures of Statistics 10 in an academic year. Five of these sections are randomly assigned to the experimental group and the other five are assigned to the control group. The experimental group will go to lecture, section, and computer lab. The control group will only go to lecture and section, but will not do the computer lab. The attitude of the students toward statistics is measured before and after the course. This study is:
:(a) A double blind study
:(b) A well-designed experiment
:(c) A blind study
:(d) Not a randomized experiment

An office manager wonders whether there is any relationship between drinking coffee before 10 am and alertness. He selects at random 3 days of the week, and in those days, he compared the alertness level of 25 employees who usually drink coffee before 10 am and 25 employees who do not usually drink coffee before 10 am. Is this an observational or experimental study?
:(a) We need more information to decide
:(b) This is an experimental study
:(c) This is an observational study
:(d) This is a combination of experimental and observational study

A major car manufacturing company intends to find out if cars get better mileage with premium instead of regular unleaded gasoline. They also would like to know if the size of the car has any effect on fuel economy. 96 volunteers who are similar in age, experience and style of driving participate in the study. The drivers are randomly assigned to the premium and regular groups. The drivers assigned to the premium and regular groups are then randomly assigned to drive a small, medium, or large car. All of the drivers are asked to keep a driving log. What is the design used for this study?
:(a) Randomized block design
:(b) Completely randomized two factor experiment
:(c) Completely randomized experiment with one factor
:(d) Completely randomized experiment with matching

For this research situation, decide what statistical procedure would most likely be used to answer the research question posed. Assume all assumptions have been met for using the procedure.
Is ethnicity related to political party affiliation (Republican, Democrat, Other)?
:(a) Test the difference in means between two paired or dependent samples.
:(b) Use a chi-squared test of association.
:(c) Test one mean against a hypothesized constant.
:(d) Test the difference between two means (independent samples).
:(e) Test for a difference in more than two means (one way ANOVA).

In a large mid-western university with 30 different departments, the university is considering eliminating standardized scores from their admission requirements. The university wants to find out whether the students agree with this plan. They decide to randomly select 100 students from each department, send them a survey, and follow up with a phone call if they do not return the survey within a week. What kind of sampling plan did they use?
:(a) Stratified random sampling
:(b) Simple random sampling
:(c) Cluster sampling
:(d) Multi-stage sampling

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroDesign SOCR]

*[http://en.wikipedia.org/wiki/Design_of_experiments Design of Experiments Wikipedia]

*[https://www.moresteam.com/toolbox/design-of-experiments.cfm Design of Experiment Tutorial]

*[http://www.itl.nist.gov/div898/handbook/pmd/section3/pmd31.htm What is DOE, Engineering Statistics Handbook]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_DesignOfExperiments}}

SMHS DesignOfExperiments

2014-08-31T22:07:32Z

Zhenxunw: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Design of Experiments ==

===Overview===
Design of experiments (DOE) is a systematic, rigorous approach to problem solving that applies principles and techniques at the data collection stage to ensure valid, supportable and defensible conclusions. Applied at the point of greatest leverage, it can reduce costs by speeding up the design process, reducing late engineering design changes and reducing product material and labor complexity. It is also a powerful tool for achieving manufacturing cost savings by minimizing process variation and reducing rework and the need for inspection. This lecture presents a general overview of DOE and introduces some fundamental concepts, objectives, steps and design guidelines to assist in conducting designed experiments.

=== Motivation===
An experiment is the natural way to implement a study and achieve its objectives. The next question is how such experiments and studies can be realized; that is, we need a blueprint for planning the study or experiment, including ways to collect data and to control study parameters for accuracy and consistency. What are the key factors in a process? At what settings would the process deliver acceptable performance? What are the main and interaction effects in the process, and what settings would produce less variation in the output?
Design of experiments answers these questions. Experiments can be designed in many different ways to collect information about which process inputs have a significant impact on the process output and what the target levels of those inputs should be to achieve a desired output. There are [http://www.itl.nist.gov/div898/handbook/pmd/section3/pmd31.htm four general problem areas] in which design of experiments may be applied:
*Comparative: the designer is interested in assessing whether a change in a single factor has in fact resulted in a change/improvement to the process as a whole.
*Screening and characterizing: the designer is interested in understanding the process as a whole in the sense that they can have a ranked list of the importance of factors that can affect the process.
*Modeling: the designer is interested in modeling the process functionally, obtaining a mathematical function that fits the output well and good estimates of the coefficients in that function.
*Optimizing: the designer is interested in determining the optimal settings of the process factors, that is to determine for each factor the level of the factor that optimizes the process response.

===Theory===
The most common components in design of experiments (DOE):
*Comparison: in some fields of study it’s not possible to have independent measurements to a traceable standard and comparisons between treatments are much more valuable and preferable. To make inference about effects, associations or predictions, one typically has to compare different groups subjected to distinct conditions.
*Randomization: the process of assigning individuals at random to groups in an experiment. It requires that we make allocation of (controlled variables) treatments to units using some random mechanism. Random does not mean haphazard and great care needs to be taken to make sure appropriate random methods are used.
*Experimental vs. observational studies: there are many situations where randomized experiments are impractical. In these situations, we cannot deduce causality or the effects of various treatments on the response measurement. Observational studies are retrospective or prospective studies where the investigator doesn’t have control over the randomization of treatments to subjects or units. In these cases, the subjects or units fall naturally within a treatment group.
*Replication: all measurements, observations or data collection are usually subject to variation and uncertainty. They are repeated and full experiments are replicated to help identify the sources of variation to better estimate the true effects of treatments, to further strengthen the experiment’s reliability and to add to the existing knowledge of the topic.
*Blocking: the arrangement of experimental units into groups consisting of units that are similar to one another. It reduces known but irrelevant sources of variation between units and allows greater precision in the estimation of the source of variation under study.
*Orthogonality: it concerns the forms of contrasts that can be legitimately and efficiently carried out. With independence between contrasts, each orthogonal treatment provides different information from the others. The goal is to completely decompose the variance or the relations of the observed measurements into independent components.
*Factorial experiments: are more efficient at evaluating the effects and possible interactions of several factors. DOE is built on the foundation of the analysis of variance, which partitions the observed variance into components according to what factors the experiment must estimate or test.
*Placebo: is a sham or simulated medical intervention that has no direct health impact but may result in actual improvement of a medical condition or disorder. If such a sham effect is observed, it is called a placebo effect. Common placebos are inert tablets, sham surgery and other procedures based on false information. An example could be giving a patient a pill identical to the actual treatment pill but without the treatment ingredients. Typically all patients are informed that some will be treated using the drug and some will receive the inert pill; however, the patients are blinded as to whether they actually received the drug or the placebo. Such an intervention may cause the patient to believe the treatment will change their condition, which may produce a subjective perception of a therapeutic effect.
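
Randomization and blocking, as described in the list above, can be illustrated with a minimal Python sketch. It is not a prescribed procedure: the function names, the room labels and the two "wing" blocks are hypothetical, and a seeded shuffle stands in for whatever random mechanism a real study would document.

```python
import random
from collections import Counter


def randomize(units, treatments, seed=None):
    """Completely randomized design: shuffle the units, then deal treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}


def randomize_within_blocks(blocks, treatments, seed=None):
    """Randomized block design: randomize separately inside each block of similar units."""
    rng = random.Random(seed)
    assignment = {}
    for units in blocks.values():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


# Hypothetical example: 16 rooms, two treatments, two blocks of 8 rooms each.
rooms = [f"room_{k}" for k in range(1, 17)]
treatments = ["carpet", "bare"]
simple = randomize(rooms, treatments, seed=1)
blocks = {"east_wing": rooms[:8], "west_wing": rooms[8:]}
blocked = randomize_within_blocks(blocks, treatments, seed=1)
print(Counter(simple.values()))
print(Counter(blocked[r] for r in blocks["east_wing"]))
```

With blocking, each block receives a balanced share of every treatment, so differences between blocks do not get mixed into the treatment comparison.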

'''Components of DOE:'''
*Factors (inputs): including controllable and uncontrollable variables. The former are factors that we can control, such as the size of the dose or how often patients take the treatment. The latter are factors over which we have no control, such as environmental factors: air conditions, temperature or humidity. People are generally considered a noise factor, that is, an uncontrollable factor that causes variability under normal operating conditions but that we can control during the experiment using blocking and randomization.
*Levels (settings of each factor): examples include the particular dosage levels chosen for evaluation.
*Response (output): consider a test of a new drug. The output could be the frequency with which patients have a stroke or their need for drugs. Experimenters often want to avoid optimizing the process for one response at the expense of another, so all important outcomes are measured and analyzed to determine the factors and settings that will provide the best overall outcome.
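
Factors and their levels together determine the set of experimental conditions. As a rough sketch, assuming a full factorial layout in which every level of each factor is combined with every level of the others, the conditions can be enumerated as follows. The factor names and level values are hypothetical and chosen only to echo the dosage example above.

```python
from itertools import product


def full_factorial(factors):
    """Return one dict per experimental condition, covering every combination of levels."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical factors: dose size (3 levels) and doses per day (2 levels).
factors = {"dose_mg": [10, 20, 40], "doses_per_day": [1, 2]}
runs = full_factorial(factors)
for run in runs:
    print(run)
```

The number of conditions is the product of the numbers of levels (here 3 × 2 = 6), which is why the number of factors and levels has to be chosen with the cost of the experiment in mind.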

'''Objectives of DOE:'''
*Comparing alternatives: DOE allows us to make an informed decision that evaluates both the quality and the cost.
*Identify significant factors that affect the output: separating the vital few from the trivial many.
*Achieving an optimal process output.
*Reducing variability.
*Minimizing, maximizing or targeting an output.
*Improving process or product robustness so that it performs well under varying conditions.
*Balance tradeoffs between multiple quality characteristics that require optimization.

'''DOE guidelines:'''
*DOE guidelines address the questions outlined above by stipulating factors to be tested, levels of the factors and structure and layout of experimental conditions. To sum up, DOE aims to come up with an experiment that can obtain the required information in a cost effective and reproducible manner.
*Unexplained variation, in addition to measurement error, can obscure the results. Errors can be unexplained variation that is either within or between experiment runs.
*Noise factors: uncontrollable factors that induce variation under normal operating conditions. For example, multiple shifts, humidity or raw materials can be built into the experiment so that their variation is not lumped into unexplained error.
*Correlation. Two factors that vary together may be highly correlated without one causing the other; they may both be caused by a third factor.
*The combined effects or interactions. Consider growing a rose: sufficient water benefits its growth, but too much water may harm it. Factors may generate non-linear effects that are not additive, but these can only be studied with more complex experiments involving more than 2 level settings, which allow quadratic or cubic effects to be estimated.

'''DOE process:'''
<center>[[Image:SMHS_Design_of_Experiment_Fig_1_DOE_Process_Gallaway_07232014.jpg|500px]]
</center>

'''Test of means – one factor experiment:'''
One of the most common types of experiments is the comparison of two process methods, or two methods of treatment. One of the most straightforward methods to evaluate a new process method is to plot the results on an SPC chart that also includes historical data from the baseline process, with established control limits. Then apply the standard rules to evaluate out-of-control conditions to see if the process has been shifted. You may need to collect several subgroups worth of data in order to make a determination, although a single subgroup could fall outside of the existing control limits.
An alternative to the control chart approach is to use the F-test to compare the means of alternate treatments; this is done automatically by ANOVA (analysis of variance). Consider the following example, where three treatments (routes home from work) are analyzed with the following data:

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| rowspan=2 | || colspan=3| '''Treatment''' || ||
|-
|'''A (usual route)''' || '''B (alternate)''' || '''C (alternate)''' || '''Variance''' || '''Mean'''
|-
| rowspan=10| '''Time in Minutes'''|| 27.0 || 26.0 || 29.5 || ||
|-
| 31.0 || 33.0 || 25.0 || ||
|-
| 28.5|| 26.5 || 28.5 || ||
|-
| 26.0|| 27.5 || 25.5 || ||
|-
| 27.5|| 29.0 || 24.0 || ||
|-
| 29.0|| 27.5 || 24.0 || ||
|-
| 33.0|| 26.5 || 28.0 || ||
|-
| 35.0|| 27.0 || 26.0 || ||
|-
| 28.0|| 28.0 || 25.5 || ||
|-
| 29.0|| 32.0 || 26.5 || ||
|-
|Mean $(\bar Y)$|| 29.4||28.3||26.6||1.99 ||
|-
|Variance $s^2$||7.9 ||5.7 ||3.0 || || 5.51
|}
</center>

The F-test analysis is the basis for model evaluation of both single factor and multi-factor experiments. This analysis is commonly output as an ANOVA table by statistical analysis software, as illustrated in the table below:
<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=7| '''ANOVA - Analysis of Variance Table'''
|-
|Source|| Sum of Squares|| DF || Mean Square || F-Ratio ||Probability || Significant
|-
| Between Groups ||39.80 || 2 || 19.90 || 3.61 || 0.0408 || Yes
|-
| Within Groups ||148.90|| 27 || 5.51 || || ||
|-
| Total ||188.70|| 29 || || || ||
|}
</center>

The probability 0.0408 is the p-value: if the three treatments had equal means, there would be only a 4.08% probability that an F-ratio this large could occur due to noise (random chance). In other words, at the 5% significance level the three routes differ significantly in the time taken to reach home from work.
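
The remaining entries of the ANOVA table follow from its sums of squares and degrees of freedom. The sketch below recomputes them in plain Python. The helper name is hypothetical, and the closed-form tail probability it uses holds only when the numerator has exactly 2 degrees of freedom, as it does here; a general F distribution routine would be needed otherwise.

```python
def f_tail_probability_df1_2(f_value, df2):
    """P(F > f_value) for an F distribution with 2 and df2 degrees of freedom.

    Closed form valid only for 2 numerator degrees of freedom.
    """
    return (1.0 + 2.0 * f_value / df2) ** (-df2 / 2.0)


# Sums of squares and degrees of freedom taken from the ANOVA table above.
ss_between, df_between = 39.80, 2
ss_within, df_within = 148.90, 27

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_ratio = ms_between / ms_within
p_value = f_tail_probability_df1_2(f_ratio, df_within)

print(f"MS between = {ms_between:.2f}, MS within = {ms_within:.2f}")
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
```

Up to rounding, this reproduces the mean squares, F-ratio and probability reported in the table.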

ANOVA: $H_0: \mu_1=\cdots=\mu_a$ vs. $H_a$: at least one mean is different.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
|Source|| Sum of Squares|| DF || Mean Square || F-Ratio
|-
| Between Groups ||$ SS_{treatment}= \sum_{i=1}^{a} n_i (\bar y_{i.}-\bar y_{..})^2 $ || $a-1$ ||$ MS_{treatment}=\frac {SS_{treatment}}{a-1}$|| $ F =\frac {MS_{treatment}}{MS_{error}}$
|-
| Within Groups ||$SS_{error} = SS_{total} - SS_{treatment}$|| $N-a$ || $MS_{error}=\frac{SS_{error}} {N-a}$ ||
|-
| Total || $SS_{total} = \sum_{i=1}^{a} \sum_{j=1}^{n_{i}} (y_{ij}-\bar y_{..})^2 $ || $N-1$ || ||
|}
</center>

We reject the null hypothesis of equal treatment means if the observed F-ratio $F_0$ satisfies $F_0>F_{\alpha,a-1,N-a}$; with $n$ observations in each group, $N-a=a(n-1)$.
Note: $a$ is the number of treatments, $n_i$ is the sample size of the $i^{th}$ group, $N=\sum_{i=1}^{a} n_i$ is the total number of observations, $\alpha$ is the level of significance, $y_{ij}$ is the measurement for observation $j$ in group $i$, $\bar y_{..}$ is the grand mean of all the observations, and $\bar y_{i.}$ is the mean of the $i^{th}$ treatment group. ANOVA is studied further in the ANOVA section.
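
The formulas in the table above translate directly into code. The following is a minimal sketch of a one-way ANOVA decomposition in plain Python; the function name and the tiny data set are hypothetical and serve only to show each term of the table.

```python
def one_way_anova(groups):
    """Compute the one-way ANOVA table entries for a list of groups of measurements."""
    a = len(groups)                                   # number of treatments
    n_total = sum(len(g) for g in groups)             # N, total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, group_means))
    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)
    ss_error = ss_total - ss_treatment

    ms_treatment = ss_treatment / (a - 1)
    ms_error = ss_error / (n_total - a)
    return {
        "ss_treatment": ss_treatment, "df_treatment": a - 1,
        "ss_error": ss_error, "df_error": n_total - a,
        "ss_total": ss_total, "df_total": n_total - 1,
        "f_ratio": ms_treatment / ms_error,
    }


# Hypothetical data: three treatments with three observations each.
table = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(table)
```

The resulting F-ratio would then be compared with the critical value $F_{\alpha,a-1,N-a}$ from an F table or a statistics package.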

===Applications===
*[http://www.cancer.org/healthy/stayawayfromtobacco/index This article] presents an observational study of the effects of smoking on cancer. It describes various harmful effects of tobacco on human health and provides a guide to quitting smoking. The article covers the whole study in ten sections, each clearly developed in a question-and-answer format. It is well organized and gives readers the reasons for quitting smoking as well as suggestions and programs that help people quit. This is a typical observational study.

*[http://arc.aiaa.org/doi/abs/10.2514/2.3153?journalCode=ja This article] brings together an empirical drag prediction model, design of experiments, response surface and data-fusion methods, and computational fluid dynamics (CFD) to provide a wing optimization system. The system presented allows high-quality designs to be found using a full three-dimensional CFD code without the expense of direct searches. The meta-models built are shown to be more accurate than the initial empirical model or than simple response surfaces based on the CFD data alone. Data fusion is achieved by building a response surface kriging of the differences between the two drag prediction tools, which work at different levels of fidelity. The kriging is then used with the empirical tool to predict the drag values coming from the CFD code, which is much quicker than direct searches of the CFD.

*[http://www.tandfonline.com/doi/abs/10.1080/01621459.1972.10481253#.U6HGyBZRXKw This article] illustrates certain numerical approximations for finding one- and two-stage bioassay designs that produce small posterior variance under a one-parameter logistic distribution. It discusses the use of two prior distributions, one for design and the other for inference, with graphs for designing experiments when the prior distributions are normal. These graphs illustrate the importance of using additional dose levels when the variance of the prior distribution is large.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]

*[http://socr.ucla.edu/htmls/SOCR_Analyses.html SOCR Analyses]

*[http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

*[http://www.socr.ucla.edu/htmls/SOCR_ChoiceOfStatisticalTest.html SOCR Choice Of Statistical Test]

===Problems===
Suppose two researchers wanted to determine if aspirin reduced the chance of a heart attack. Researcher 1 studied the medical records of 500 patients. For each patient, he recorded whether the person took aspirin every day and if the person had ever had a heart attack. Then he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day.
Researcher 2 also studied 500 people. He randomly assigned half of the patients to take aspirin every day and the other half to take a placebo every day. After a certain length of time, he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day. Suppose that both researchers found that there is a statistically significant difference in the heart attack rates for the aspirin users and the non-aspirin users and that aspirin users had a lower rate of heart attacks. Can both researchers conclude that aspirin caused the reduction?
:(a) No, only researcher 2 can conclude this.
:(b) No, only researcher 1 can conclude this.
:(c) Yes, because aspirin is known to reduce heart attacks.
:(d) Yes, because aspirin users had a larger heart attack rate in both studies.

Suppose that you were hired as a statistical consultant to design a study to examine the impact of a new medicine vs. a current medicine on lowering blood pressure. 50 patients volunteer to participate in the study. What design will you recommend?
:(a) Completely randomized design with two factors.
:(b) Completely randomized design with two factors and single blind.
:(c) Completely randomized design.
:(d) Completely randomized design with two factors and double blind.

The next four questions are based on the following:
Hospital floors are usually covered by bare tiles. Carpets would cut down on noise but might be more likely to harbor germs. To study this possibility, investigators randomly assigned 8 of 16 available hospital rooms to have carpet installed. The others were left bare. Later, air from each room was pumped over a dish of agar. The dish was incubated for a fixed period, and the number of bacteria colonies was counted.

*1.Select the appropriate statistical term for the 8 rooms left bare.
*:(a) Treatments
*:(b) Experimental Units
*:(c) Control Group
*:(d) Response

*2.Select the appropriate statistical term for the 16 hospital rooms.
*:(a) Response
*:(b) Treatments
*:(c) Experimental Units
*:(d) Control Group

*3.Select the appropriate statistical term for number of colonies in a dish.
*:(a) Treatments
*:(b) Control Group
*:(c) Response
*:(d) Experimental Units

*4.Select the appropriate statistical term for number of colonies in a dish.
*:(a) Treatments
*:(b) Response
*:(c) Experimental Units
*:(d) Control Group

A psychologist is examining the effect of showing pictures on learning of words by seven-year-olds. The seven-year-olds are randomly assigned to two groups. The experimental group is shown the word along with the picture. The control group is shown only the word. At the end of the experiment, the subjects are given a test on the number of words they get right. This is an example of:
:(a) A blind study
:(b) An experiment with a design flaw
:(c) A double blind study
:(d) A well-designed experiment

Suppose that students A and B are working for the university. The registrar asks student A to calculate the mean and SD of the GPA's for the Fall 2005 freshmen class. He asks student B to design a sampling strategy to evaluate the attitude of the undergraduates at the university toward undergraduate teaching.
:(a) Student A is doing descriptive statistics and student B is doing inferential statistics.
:(b) Student A is doing inferential statistics and student B is doing descriptive statistics.
:(c) Both students are doing descriptive statistics.
:(d) Both students are doing inferential statistics.

At the Department of Statistics, we intend to examine the effect of using computers in Statistics 10 on the attitudes of students toward statistics. We offer ten lectures of Statistics 10 in an academic year. Five of these sections are randomly assigned to the experimental group and the other five are assigned to the control group. The experimental group will go to lecture, section, and computer lab. The control group will only go to lecture and section, but will not do the computer lab. The attitude of the students toward statistics is measured before and after the course. This study is:
:(a) A double blind study
:(b) A well-designed experiment
:(c) A blind study
:(d) Not a randomized experiment

An office manager wonders whether there is any relationship between drinking coffee before 10 am and alertness. He selects at random 3 days of the week, and in those days, he compared the alertness level of 25 employees who usually drink coffee before 10 am and 25 employees who do not usually drink coffee before 10 am. Is this an observational or experimental study?
:(a) We need more information to decide
:(b) This is an experimental study
:(c) This is an observational study
:(d) This is a combination of experimental and observational study

A major car manufacturing company intends to find out if cars get better mileage with premium instead of regular unleaded gasoline. They also would like to know if the size of the car has any effect on fuel economy. 96 volunteers who are similar in age, experience and style of driving participate in the study. The drivers are randomly assigned to the premium and regular groups. The drivers assigned to the premium and regular groups are then randomly assigned to drive a small, medium, or large car. All of the drivers are asked to keep a driving log. What is the design used for this study?
:(a) Randomized block design
:(b) Completely randomized two factor experiment
:(c) Completely randomized experiment with one factor
:(d) Completely randomized experiment with matching

For this research situation, decide what statistical procedure would most likely be used to answer the research question posed. Assume all assumptions have been met for using the procedure.
Is ethnicity related to political party affiliation (Republican, Democrat, Other)?
:(a) Test the difference in means between two paired or dependent samples.
:(b) Use a chi-squared test of association.
:(c) Test one mean against a hypothesized constant.
:(d) Test the difference between two means (independent samples).
:(e) Test for a difference in more than two means (one way ANOVA).

In a large mid-western university with 30 different departments, the university is considering eliminating standardized scores from their admission requirements. The university wants to find out whether the students agree with this plan. They decide to randomly select 100 students from each department, send them a survey, and follow up with a phone call if they do not return the survey within a week. What kind of sampling plan did they use?
:(a) Stratified random sampling
:(b) Simple random sampling
:(c) Cluster sampling
:(d) Multi-stage sampling

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroDesign SOCR]

*[http://en.wikipedia.org/wiki/Design_of_experiments Design of Experiments Wikipedia]

*[https://www.moresteam.com/toolbox/design-of-experiments.cfm Design of Experiment Tutorial]

*[http://www.itl.nist.gov/div898/handbook/pmd/section3/pmd31.htm What is DOE, Engineering Statistics Handbook]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_DesignOfExperiments}}

SMHS ResamplingSimulation

2014-08-31T18:43:49Z

Zhenxunw: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research across many fields. [http://en.wikipedia.org/wiki/Resampling_(statistics) ''Resampling''] refers to any of a variety of methods for doing one of the following: (1) estimating the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (''jackknifing'') or by drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchanging labels on data points when performing significance tests (''permutation tests''); or (3) validating models by using random subsets (bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques, including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real-world processes or systems over time. We usually apply simulation after developing a model that represents the key characteristics of the process. Simulation is widely applied in many contexts, such as simulation of technology for performance optimization, testing and video games. It is often used when the real system is not accessible or is difficult and/or costly to work with, and it provides an easier way to obtain data about the system or to test it. We are going to present an introduction to simulation, including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process is very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that it follows a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected on every draw. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would simply be a reordering of the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it makes it straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
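
The common process listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name <code>bootstrap_statistic</code>, the sample values, the 1000 replicates and the use of a simple percentile interval are all choices made for this example.

```python
import random
import statistics

def bootstrap_statistic(sample, stat, n_boot=1000, seed=0):
    """Return n_boot bootstrap replicates of stat computed on resamples of sample."""
    rng = random.Random(seed)            # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # draw n observations independently and with replacement
        resample = rng.choices(sample, k=n)
        replicates.append(stat(resample))
    return replicates

# hypothetical observed sample (illustrative values only)
observed = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
reps = sorted(bootstrap_statistic(observed, statistics.median))

boot_se = statistics.stdev(reps)              # bootstrap standard error of the median
ci_low = reps[int(0.025 * len(reps))]         # lower bound of a 95% percentile interval
ci_high = reps[int(0.975 * len(reps)) - 1]    # upper bound of a 95% percentile interval
print(boot_se, ci_low, ci_high)
```

The sorted replicates play the role of the estimated sampling distribution in step (5); any other statistic can be substituted for <code>statistics.median</code>.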

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*It is dependent on the independence of the data. Extensions of the jackknife to allow for dependences in the data have been proposed.

*Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.

*Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
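
A minimal Python sketch of the leave-one-out jackknife is given below. It assumes the standard leave-one-out formulas for the bias, $(n-1)(\bar{\theta}_{(\cdot)}-\hat{\theta})$, and the variance, $\frac{n-1}{n}\sum_i(\hat{\theta}_{(i)}-\bar{\theta}_{(\cdot)})^2$; the function name <code>jackknife</code> and the data are hypothetical.

```python
import math
import statistics

def jackknife(sample, stat):
    """Leave-one-out jackknife estimates of the bias and standard error of stat."""
    n = len(sample)
    theta_hat = stat(sample)
    # recompute the statistic n times, leaving out one observation each time
    leave_one_out = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    variance = (n - 1) / n * sum((t - theta_bar) ** 2 for t in leave_one_out)
    return bias, math.sqrt(variance), leave_one_out

# hypothetical observed sample (illustrative values only)
observed = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
bias, se, loo = jackknife(observed, statistics.mean)
print(bias, se)
```

For the sample mean, the jackknife standard error coincides with the usual $s/\sqrt{n}$ and the estimated bias is zero, which makes the mean a convenient sanity check. The leave-one-out replicate that differs most from the full-sample estimate points to the most influential observation.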

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing a statistical model using a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as ''validating sets''; a model is fit to the remaining data (i.e., ''training set'') and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) Randomly partition the data into a training set and a validating set, (2) fit the model to the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the validating set, (4) repeat several times and average to reduce variability.

*'''Types of CV''':
**''Leave-one-out CV'': This is an iterative method in which the number of iterations equals the sample size, so each observation becomes the validating set exactly once. Steps: 1) Delete observation #1 from the data, 2) fit the model to observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model to observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**''K-fold cross-validation'' splits the data into K subsets and each is held out in turn as the validation set. This avoids self-influence. Self-influence is the way in which, in regression methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when n is small. It is more computationally demanding than calculating in-sample measures. It is subject to the researcher’s selection of an appropriate fit statistic.
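
The following Python sketch illustrates K-fold and leave-one-out cross-validation for a simple linear regression, using mean squared prediction error as the fit measure. The helper names <code>fit_line</code> and <code>k_fold_mse</code>, the simulated data and the choice of fit measure are assumptions made for illustration only.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def k_fold_mse(xs, ys, k, seed=0):
    """Average squared prediction error over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # random partition of the data
    folds = [idx[i::k] for i in range(k)]     # k roughly equal validating sets
    squared_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # predict each held-out point with a model that never saw it
        squared_errors += [(ys[i] - (a + b * xs[i])) ** 2 for i in fold]
    return sum(squared_errors) / len(squared_errors)

# hypothetical data: roughly y = 1 + 2x plus noise
rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

a, b = fit_line(xs, ys)
in_sample_mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
cv_mse = k_fold_mse(xs, ys, k=5)
loo_mse = k_fold_mse(xs, ys, k=len(xs))       # leave-one-out is the case k = n
print(in_sample_mse, cv_mse, loo_mse)
```

Comparing <code>in_sample_mse</code> with the cross-validated values shows the self-influence effect: for least-squares regression the leave-one-out error is never smaller than the in-sample error.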

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but one that is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. $n_A$ and $n_B$ are the sample sizes of the two groups. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in the means between groups A and B, $T(obs)=\bar{x}_A-\bar{x}_B$, is calculated; (2) the observations from both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of sizes $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than or equal to $T(obs)$; the two-sided p-value is calculated as the proportion of permutations where the absolute difference was greater than or equal to $ABS(T(obs))$; (4) equivalently, sort the recorded differences and observe whether $T(obs)$ is contained within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
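
The procedure can be sketched in Python for two small groups, where it is feasible to enumerate every relabelling of the pooled data. The function name <code>permutation_test</code> and the measurements are hypothetical; for larger samples one would typically draw a random subset of relabellings (say 1000) instead of enumerating all of them.

```python
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b):
    """Exact two-sample permutation test for a difference in means."""
    pooled = group_a + group_b
    n_a, n = len(group_a), len(pooled)
    total = sum(pooled)
    t_obs = mean(group_a) - mean(group_b)

    diffs = []
    # every way of relabelling n_a of the pooled values as "group A"
    for idx in combinations(range(n), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diffs.append(sum_a / n_a - (total - sum_a) / (n - n_a))

    tol = 1e-12   # guards against floating-point ties
    p_one = sum(d >= t_obs - tol for d in diffs) / len(diffs)
    p_two = sum(abs(d) >= abs(t_obs) - tol for d in diffs) / len(diffs)
    return t_obs, p_one, p_two, len(diffs)

# hypothetical measurements for two small groups
group_a = [12.1, 14.3, 13.8, 15.2, 14.9]
group_b = [10.4, 11.9, 12.5, 11.1, 10.8]
t_obs, p_one, p_two, n_perm = permutation_test(group_a, group_b)
print(t_obs, p_one, p_two, n_perm)
```

With five observations per group there are ${10 \choose 5}=252$ relabellings, so the exact null distribution is cheap to compute.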

====Simulation====
A common assumption is that the coefficients we are trying to estimate come from a probability distribution. With large enough sample sizes, according to the central limit theorem (CLT), this distribution is multivariate normal.
The goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients.
*Steps: (1) Choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc.; (2) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (3) calculate the QI with each set of simulated coefficients; (4) set the variable to a new value; (5) calculate the QI with each set of the simulated coefficients; (6) repeat as appropriate; (7) efficiently summarize the distribution of the computed QI at each value of the variable of interest.

*Advantages: Simulation provides more information than a table of regression outputs. It accounts for uncertainty in the QI and is flexible to many different types of models, QIs and variable specifications. After performing it once, it is easy to use and can be much easier than working with analytic solutions.

*Limitations: It relies on the CLT to justify asymptotic normality. In contrast, a fully Bayesian model using MCMC could produce an exact finite-sample distribution and bootstrapping would require no distributional assumptions. It can be computationally intense and large models can produce great uncertainty regarding quantities of interest.
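
The steps above can be sketched in Python for a logistic regression with one predictor, taking the predicted probability as the QI. Everything numeric here is hypothetical: the coefficient estimates <code>beta_hat</code>, their covariance matrix <code>cov_hat</code>, the two values of the predictor, and the helper names. The bivariate normal draws use a hand-written 2x2 Cholesky factor so that the example needs only the standard library.

```python
import math
import random
import statistics

def draw_coefficients(mean, cov, n_draws, seed=0):
    """Draw coefficient pairs from a bivariate normal via a 2x2 Cholesky factor."""
    rng = random.Random(seed)
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    draws = []
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        draws.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return draws

def predicted_probability(b0, b1, x):
    """QI for a logistic model: P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# hypothetical estimates and covariance matrix from a fitted logistic regression
beta_hat = (-1.0, 0.8)
cov_hat = [[0.04, -0.01], [-0.01, 0.01]]
draws = draw_coefficients(beta_hat, cov_hat, n_draws=1000)

summary = {}
for x in (0.0, 2.0):                       # theoretically interesting values of x
    qi = sorted(predicted_probability(b0, b1, x) for b0, b1 in draws)
    # mean of the simulated QI and an approximate central 95% interval
    summary[x] = (statistics.mean(qi), qi[25], qi[974])
print(summary)
```

Each entry of <code>summary</code> reports the simulated QI together with its uncertainty, which is the extra information a table of regression coefficients does not show directly.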

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem using the SOCR applet for a demonstration activity. It describes an innovative effort at using technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments to demonstrate the assumptions, meaning and implications of the CLT, as well as a hands-on simulation and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], entitled "Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models," provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model via examples. The paper presents an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and discusses software applications of these methods using SAS, SPSS macros, etc.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents a resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities and video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] presents an experiment on sampling distributions using the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT via an experiment. The sampling distribution CLT experiment provides a simulation accessible to the public that demonstrates characteristics of various sample statistics and the CLT and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities by explaining concepts such as the native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> # without replacement: a random reordering, each name appears exactly once
> sample(names,N,replace=FALSE)
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> # with replacement: some names repeat, others are left out
> sample(names,N,replace=TRUE)
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate stock closing prices, $S_t$, on 252 trading days where $S_{t}$ satisfies: $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, $S_0=36$, $\sigma=2$ and $v=0.01$.
# Now suppose you bought a call on this stock with strike price 40. Based on your simulated data, what percentage of days would you profit from exercising the call option? (This is the percentage of days your simulated $S_t$ is greater than 40).

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]


{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS Probability

2014-08-31T18:29:14Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Probability Theory ==

===Overview===
Probability theory plays an important role in statistics and its application to many other disciplines because it provides the theoretical groundwork for statistical inference. Probability theory is concerned with probability, which is the analysis of random phenomena. The central objects are random variables, stochastic processes, and events. Consider an individual coin toss, which can be considered to be a random event; if it is repeated many times then the sequence of random events will exhibit certain patterns. Probability theory helps us study and predict those patterns. Often, probability theory is further divided into two categories: discrete probability distributions and continuous probability distributions. We will study these later in the Distribution section. Here, we aim to study fundamental concepts of probability theory and define the rules of probability theory that we will apply in following studies.

===Motivation===
Imagine that you are performing an experiment in which a number of outcomes are produced. This set of outcomes is called the sample space and the power set of the sample space includes all the different collections of possible results from the experiment. Suppose we are rolling a fair die, which has 6 possible outcomes. The sample space is {1, 2, 3, 4, 5, 6}. An event is any collection of the possible results. For example, the event of rolling an even number involves the subset {2, 4, 6}, which is an element of the power set of the sample space in this experiment. What if we want to estimate the chance of rolling three 2’s in a row or the chance of rolling an odd number in an experiment? Probability is a way of assigning every event a value between 0 and 1; this value represents the chance that the event will occur.

===Theory===
'''Random Sampling''': A simple random sample of n items is a sample in which every member of the population has an equal chance of being selected and the members of the sample are chosen independently. For example, consider a survey for which 100 students are selected to take the questionnaire from a total of 5000 students, and the chance of being selected is the same for each student. This is a simple example of random sampling. A common application is in random number generators.

'''Types of probabilities''': Probability models have two components: sample space and probabilities.
*The '''sample space (S)''' for a random experiment is the set of all possible outcomes of the experiment.
**Event: An event is a collection of outcomes.
**An event is said to occur if any outcome making up that event occurs.
*'''Probabilities''' for each event in the sample space.
**Probabilities may come from models, i.e., mathematical and/or physical descriptions of a sample space and the chance of each event. An example of this would be a game of tossing a fair die.
**Probabilities may be derived from data. Data observations can determine probability distribution. An example would be tossing a coin 50 times and observing the heads count.
*Subjective probabilities: combining data and psychological factors to design a reasonable probability table. An example may be the stock market.

'''Axioms of probability'''
*First axiom: The probability of an event is a non-negative real number.
*Second axiom: The probability that some elementary event in the entire sample space will occur is 1. More specifically, there are no elementary events outside the sample space: $P(S)=1$.
*Third axiom: A countable sequence of pair-wise disjoint events $E_1,E_2, E_3, … $ satisfies $P(E_1 \cup E_2 \cup E_3 \cup … ) = \sum_i {P(E_i)} $.

'''Event manipulations'''
*Complement: The complement of event $A$ is denoted as $A^c$ or $A'$. It occurs if and only if $A$ does not occur. The union of $A$ and $A^c$ makes up the entire sample space ($S$), so $P(A^c)=1-P(A)$.
* Union: $A\cup B$ contains all outcomes in $A$ or $B$ (or both). $ P(A\cup B)=P(A)+P(B)-P(A\cap B). $
* Intersection: $A\cap B$ contains all outcomes which are in both $A$ and $B$.
* Mutually exclusive events are events that cannot occur at the same time, $A\cap B =\emptyset$.
* Conditional Probability: The conditional probability of event $A$ occurring given that event $B$ occurs is $ P(A|B)=\frac{P(A\cap B)}{P(B)} $, provided that $P(B)>0$. If $A$ and $B$ are independent then knowing $B$, or $B^c$, gives no information on the probability of $A$, i.e., $ P(A|B)=P(A) $.
* Multiplication rule: For any two events, $A$ and $B$, $ P(A\cap B)=P(A|B)P(B) $. In general, for $n$ events $A_1, ..., A_n$: $ P(A_1 \cap A_2 \cap \cdots \cap A_n ) = P(A_1 )P(A_2|A_1 )P(A_3|A_1\cap A_2 ) \cdots P(A_n|A_1\cap A_2\cap \cdots \cap A_{n-1} ) $.
* Law of total probability: $P(B)=P(B|A_1 )P(A_1 )+P(B|A_2 )P(A_2 )+\cdots +P(B|A_n )P(A_n) $, where the events $ \{A_1,…,A_n\} $ partition the sample space $S$.
* Inverting the order of conditioning: $ P(A \cap B) = P(A | B) \times P(B) = P(B | A) \times P(A) $.
* Bayesian Rule: If $ \{A_1,…,A_n\} $ partition the sample space $S$, and $A$ and $B$ are any events (i.e., subsets of $S$) then we have:
$$ P(A | B) = {P(B | A) P(A) \over P(B)} = {P(B | A) P(A) \over P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}. $$
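
As a worked illustration of the law of total probability and the Bayesian rule, the Python sketch below uses exact fractions. The three-event partition and all probabilities in it are hypothetical numbers chosen for the example.

```python
from fractions import Fraction as F

# hypothetical partition A_1, A_2, A_3 of the sample space
prior = [F(1, 2), F(1, 3), F(1, 6)]          # P(A_i); these sum to 1
likelihood = [F(1, 10), F(1, 5), F(1, 2)]    # P(B | A_i)

# law of total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(lik * p for lik, p in zip(likelihood, prior))

# Bayesian rule: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = [lik * p / p_b for lik, p in zip(likelihood, prior)]
print(p_b, posterior)
```

Here $P(B)=1/20+1/15+1/12=1/5$, and the posterior probabilities $1/4$, $1/3$ and $5/12$ again sum to 1, as they must for a partition.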

====Counting====
Counting principles are very useful in probability theory. Consider picking 3 students from a total of 26 students named A to Z.
*'''Permutation''': A permutation is an arrangement of objects in a distinguishable sequence. Each unique ordering is called a permutation. For example, the ordering $(A, B, D)$ is different from $(D, A, B)$. There are $3!=6$ permutations of students A, B and D.
**Permutation with repetitions (replacement): If the ordering of objects matters and an object can be chosen more than once then the number of permutations is $ n^r $, where $n$ is the number of objects from which you can choose and $r$ is the number of objects you choose. In our example above, we have $ 26^3=17576 $ permutations with repetitions.
**Permutation without repetitions (replacement): If the order matters and each object can be chosen only once, then the number of permutations is $ n(n-1)…(n-r+1)=\frac{n!}{(n-r)!} $, where $n$ is the number of objects you can choose from and $r$ is the number of objects you choose. In our example above, we have $26 \times 25 \times 24=15600$ permutations without repetitions.
*'''Combinations''': A combination is an un-ordered collection of unique objects. In our example above, {A, B, D} is the same as {D, B, A}.
**Combinations with repetitions (replacement): This is the case in which order does not matter, and an object can be chosen more than once. The number of combinations is $ {n+r-1 \choose r}= \frac{(n+r-1)!}{r!(n-1)!} $. In the example above, we have $ \frac{(26+3-1)!}{3!(26-1)!}=3276 $ combinations with repetitions.
**Combinations without repetitions (replacement): This is the case in which the order does not matter, and an object can be chosen only once. The number of combinations is $ {n \choose r}=\frac{n!}{r!(n-r)!} $, where $n$ is the number of objects you can choose from and $r$ is the number of objects you choose. In our example, we have $ {26 \choose 3}=2600 $ combinations without repetitions.
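
The four counting formulas can be checked numerically for the example of picking 3 students out of 26. The Python sketch below evaluates each formula and, as a cross-check, counts the same selections by brute-force enumeration; the dictionary labels are just descriptive names chosen for this example.

```python
import math
import string
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

students = string.ascii_uppercase      # 26 students named A to Z
n, r = len(students), 3

# closed-form counts
formulas = {
    "ordered, with repetition": n ** r,
    "ordered, without repetition": math.perm(n, r),
    "unordered, with repetition": math.comb(n + r - 1, r),
    "unordered, without repetition": math.comb(n, r),
}

# the same counts obtained by listing every selection
enumerated = {
    "ordered, with repetition": sum(1 for _ in product(students, repeat=r)),
    "ordered, without repetition": sum(1 for _ in permutations(students, r)),
    "unordered, with repetition": sum(
        1 for _ in combinations_with_replacement(students, r)),
    "unordered, without repetition": sum(1 for _ in combinations(students, r)),
}
print(formulas)
```

Note that <code>math.perm</code> and <code>math.comb</code> require Python 3.8 or later.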

'''Independence vs. disjointness/mutual exclusiveness'''
The events $A$ and $B$ are independent if $ P(A|B)=P(A)$, that is $ P(A\cap B)=P(A)P(B) $.
The events $C$ and $D$ are disjoint or mutually exclusive if $ P(C\cap D)=0 $, that is $ P(C\cup D)=P(C)+P(D) $.

These two concepts are different and should not be conflated. If two events are mutually exclusive, they cannot happen together (i.e., $P(A|B)=0$). The occurrence of one therefore provides information about the probability of the other, so mutually exclusive events with non-zero probabilities cannot be independent.

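Consider the [[SOCR_EduMaterials_Activities_PokerExperiment|SOCR poker game]]. If we know that a randomly picked card is a queen, then the event that it is a red queen and the event that it is a black queen are mutually exclusive, and hence not independent: knowing that the queen is red rules out that it is black. In contrast, when drawing a card from the full deck, the event that the card is a queen and the event that it is red are independent, since $P(queen \cap red)=2/52=(4/52)(1/2)=P(queen)P(red)$. The event that the card is black is not mutually exclusive from the event that it is a spade.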
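
The difference between independence and mutual exclusiveness can be checked by enumerating a standard 52-card deck, as in the Python sketch below. The helper <code>prob</code> and the event definitions are written for this illustration; a single card is assumed to be drawn uniformly at random.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))           # 52 equally likely cards

def prob(event):
    """Probability of an event, given as a predicate on a single card."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_queen(card):
    return card[0] == "Q"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_black(card):
    return not is_red(card)

def is_spade(card):
    return card[1] == "spades"

# independent: P(queen and red) equals P(queen) * P(red)
independent = (prob(lambda c: is_queen(c) and is_red(c))
               == prob(is_queen) * prob(is_red))
# mutually exclusive: no card is both red and black
exclusive = prob(lambda c: is_red(c) and is_black(c)) == 0
# not mutually exclusive: every spade is a black card
overlap = prob(lambda c: is_black(c) and is_spade(c))
print(independent, exclusive, overlap)
```

The events "red" and "black" are mutually exclusive and clearly dependent, while "queen" and "red" share outcomes yet are independent.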

===Applications===
* [http://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_Prob_Simul This website] introduces an application of probability theory through simulation. Many practical examples require probability computations of complex events. Such calculations may be carried out exactly, using the rules of probability, or approximately using estimation and/or simulations. [http://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_Prob_Simul SOCR simulations] may be used to compute approximate probabilities for various processes and to compare these empirical probabilities to their exact counterparts. The article includes examples of a ''Ball and Urn Experiment, Binomial Coin Toss Experiment, Card Experiment, Roulette Experiment, and Chuck A Luck Experiment''. It is a valuable source for practicing simulations using probability theory.

* [http://www.probabilitytheory.info This website] offers a list of interesting articles on the topic of probability theory. It includes a general introduction to the history of probability theory and links to a wide range of articles on the application of probability in different areas including business, medicine, economics, and biology. These short articles are a good starting place to learn about applications of probability theory in various fields.

===Software ===
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Simulations & Experiments]
*[http://www.calculatorsoup.com/calculators/discretemathematics/combinations.php Combinations Calculator]
*[http://www.calculatorsoup.com/calculators/discretemathematics/permutations.php Permutations Calculator]
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/proport.html R Tutorials Counts & Proportions]

===Problems===
* A box contains 6 balls; 2 are red, 2 are white, and 2 are blue. Four balls are picked at random, one at a time. Each time a ball is picked, the color is recorded, and the ball is put back in the box. If the first 3 balls are red, what color is the fourth ball most likely to be?
: (a) Red
: (b) White
: (c) Blue
: (d) Blue and white are equally likely and more likely than red.
: (e) Red, blue, and white are all equally likely.

* A coin is tossed 400 times and 170 heads are observed. This coin is __ ?
: (a) fair, because the probability of seeing that amount of heads or less is approximately 0.0013
: (b) neither fair nor unfair. There is not enough information to determine that.
: (c) fair, because the probability of seeing that amount of heads or less is approximately 0.5
: (d) not fair, because the probability of seeing that amount of heads or less is close to 0.

* If two events are independent, then they are automatically mutually exclusive.
: (a) True
: (b) False

* If two events are mutually exclusive, then the sum of their probabilities is 1.
: (a) True
: (b) False

* A professor who teaches 500 students in an introductory psychology course reports that 250 of the students have taken at least one introductory statistics course, and the other 250 have not taken any statistics courses. 200 of the students were freshmen, and the other 300 students were not freshmen. Exactly 50 of the students were freshmen who had taken at least one introductory statistics course. If you select one of these psychology students at random, what is the probability that the student is not a freshman and has never taken a statistics course?
: (a) 30%
: (b) 40%
: (c) 50%
: (d) 60%
: (e) 20%

* A professor who teaches 300 students in an introductory psychology course reports that 135 of the students have taken exactly one introductory statistics course, 60 have taken two or more introductory statistics courses, and the other 105 have not taken any statistics courses. If you select one of these psychology students at random, what is the probability that the student has taken at least one statistics class?
: (a) 0.20
: (b) 0.45
: (c) 0.65
: (d) 0.35

* In a carnival game, a person can win a prize by guessing which one of 5 identical boxes contains the prize. After each guess, if the prize has been won, a new prize is randomly placed in one of the 5 boxes. If a person makes 4 guesses, what is the probability that the person wins a prize exactly twice?
: (a) $(0.2)^2/(0.8)^2$
: (b) $2(0.2)^2*(0.8)^2$
: (c) $6(0.2)^2*(0.8)^2$
: (d) $(0.2)^2*(0.8)^2$
: (e) $2!/5!$

* In a university with 20,000 students, 20% are engineering students, 40% are in the sciences, 30% are in the social sciences, and the rest are in other majors. The counselors in the registrar's office want to survey the opinions of students on the issue of posting grades on-line, and they seek opinions from students in various majors. They conduct a survey by randomly selecting students. Among the first three students selected, what is the probability that two of the three major in social sciences and one has a major other than social science?
: (a) 0.600
: (b) 0.189
: (c) 0.090
: (d) 0.063

* Every five years, the Conference Board of Mathematical Sciences surveys college math departments. In a recent report, 51% of all undergraduates taking calculus were in classes using graphing calculators, and 31% were in classes using computer assignments. Suppose that 16% of these students use both calculators and computers. What proportion of undergraduates taking calculus uses no technology?
: (a) 0.44
: (b) 0.82
: (c) 0.66
: (d) 0.34
: (e) 0.16

* Two cards are dealt to you (without replacement) from an ordinary well-shuffled deck. Let X = the probability that you have a pair. Let Y = the probability that both of your cards are diamonds. Compare X and Y.
: (a) X < Y
: (b) X = Y
: (c) X > Y

* [[SOCR_EduMaterials_Activities_PokerExperiment|Poker game]]: How many hands would contain a full house with an AAABB-type pattern, where A and B have distinct values? How many hands are there with two pairs (i.e., an AABBC-type pattern), where A, B and C have distinct values? What is the total number of 5-card hands?

===References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Prob_Basics SOCR]
* [http://en.wikipedia.org/wiki/Probability Probability Wikipedia]
* [[Probability_and_statistics_EBook#Chapter_III:_Probability|SOCR EBook: Probability Chapter]]
* [[AP_Statistics_Curriculum_2007_Prob_Count|SOCR EBook: Counting Examples]]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Probability}}

SMHS ParamInference

2014-08-31T17:58:28Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Parametric Inference ==

===Overview===
In statistical inference, we aim to draw inferences about an underlying population based on a sample drawn from it. For example, we sometimes achieve this by estimating the parameters of a probability density function based on observations. In an idealized case, we would have a perfect model with unknown parameters; based on this, we would make inferences about the population by estimating the parameters with the data we have. In this section, we are going to introduce the concepts of random variables, parametric models and inference based on these models.

===Motivation===
Consider the well-known example of flipping a coin 10 times. Experience tells us that the expected number of heads in one experiment with 10 flips equals 10 times the probability of observing a head on each flip. For an evenly weighted coin, we would expect approximately 5 heads. If we repeated the experiment many times, we would expect the results to follow a binomial distribution with the parameter $ p=P(head)$ for each flip. In other words, we believe the underlying model to be Binomial$ (n,p) $, where $ n=10 $.

The next step would be to determine the value of $ p $. An obvious way of doing this would be to flip the coin many times, let's say 100, and record the number of heads. The estimate of $ p $ would just be the number of heads in the 100 flips divided by 100. For example, if we got 63 heads, we would estimate the probability, $p$, of getting a head on any given flip to be $ 63/100 $. Based on this information, we believe the number of heads in our experiment follows a binomial distribution with parameters $ (n=10,p=0.63) $. That is, we can infer that we will flip an average of 6.3 heads in 10 flips if we repeat the experiment enough times.
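
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the helper <code>binomial_pmf</code> is written for this illustration, and the counts (63 heads in 100 flips) are the ones used in the text.

```python
import math

flips, heads = 100, 63
p_hat = heads / flips                  # estimate of p from the 100 recorded flips

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 10
pmf = [binomial_pmf(k, n, p_hat) for k in range(n + 1)]
expected_heads = sum(k * pk for k, pk in enumerate(pmf))   # equals n * p_hat
print(p_hat, expected_heads)
```

Summing $k \cdot P(K=k)$ over $k=0,…,10$ recovers the mean $np=6.3$ of the fitted Binomial(10, 0.63) model.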

Next we will explore the following questions. What is a random variable? How do we build a parametric model based on data? What kind of inference can we make based on a parametric model?

===Theory===
* [http://en.wikipedia.org/wiki/Random_variable Random variable]: A random variable is a variable whose value is subject to variations due to chance (i.e., randomness). It can take on a set of values, each with an associated probability for discrete variables or a probability density for continuous variables. The value of a random variable represents the possible outcomes of a yet-to-be-performed experiment or the possible outcomes of a past experiment whose pre-existing value is uncertain. The possible values of a random variable and their associated probabilities (known as a probability distribution) can be further described with mathematical functions.
: There are two types of random variables:
:: ''Discrete random variables'' take on a specified finite or countable list of values, and are endowed with a probability mass function, which is characteristic of a particular probability distribution;
:: ''Continuous random variables'' take on any numerical value in an interval or collection of intervals via a probability density function that is characteristic of a probability distribution.

* [http://en.wikipedia.org/wiki/Parameter Parameters]: A parameter is a characteristic or measurable factor that can help in defining a particular system. It is an important element to consider when evaluating or trying to understand an event. μ is often used to represent the mean and σ the standard deviation in statistics. The following table provides a list of commonly used parameters with descriptions:

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
! Parameter !! Description !! Parameter !! Description
|-
| $\bar{x}$ || Sample mean || α,β,γ || Various Greek letters
|-
| μ || Population mean || θ || Lower case theta
|-
|σ || Population standard deviation || φ || Lower case phi
|-
| $σ^2$ || Population variance || ω || Lower case omega
|-
| s || Sample standard deviation || ∆ || Increment
|-
| $s^2$ || Sample variance || ν || Nu
|-
| λ || Poisson mean, Lambda || τ || Tau
|-
| χ || χ distribution, Chi || η || Eta
|-
| ρ || The density, Rho || τ || Sometimes used in tau function
|-
| ϕ || Normal density function, Phi || Θ || Parameter space
|-
| Γ || Gamma || Ω || Sample space, omega
|-
| ∂ || Per/ divided || δ || Lower case delta
|-
| S || Sample space|| Κ,k || kappa
|}
</center>

* [http://en.wikipedia.org/wiki/Parametric_model Parametric model]: A parametric model is a collection of probability distributions that can be described using a finite number of parameters. These parameters are usually written together to form a single k-dimensional parameter vector $\theta=(\theta_1,\theta_2,…,\theta_k)$. The main characteristic of a parametric model is that all the parameters are from finite-dimensional parameter spaces.

: Each member of the collection, $ p_θ $, is described by a finite-dimensional parameter $ θ $. The set of all allowable values for the parameter is denoted $ Θ⊆R^k $, and the model itself is written as $ P=\{p_θ |θ∈Θ\} $. If the model consists of absolutely continuous distributions, it is often specified in terms of the corresponding probability density functions, $ P=\{f_θ |θ∈Θ\}$. The model is considered identifiable if the mapping $ θ→p_θ $ is invertible, that is, there are no two different parameter values $ θ_1 $ and $ θ_2 $ such that $ p_{θ_1} =p_{θ_2} $.

: Consider one of the most popular distributions, the normal distribution, in which the parameter vector is $ θ=(μ,σ) $. Here, $ μ∈R $ is a location parameter, and $ σ>0 $ is a scale parameter. This parametrized family can be expressed as:
$$ P=\{f_θ (x)=\frac{1}{σ\sqrt{2π}} e^{-\frac{(x-μ)^2}{2σ^2}} |μ∈R,σ>0\}.$$
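
As a quick numerical check, the following R sketch evaluates the normal density $\frac{1}{σ\sqrt{2π}} e^{-\frac{(x-μ)^2}{2σ^2}}$ directly and compares it with R's built-in <code>dnorm</code>. The helper name <code>normal_density</code> and the values chosen for $μ$ and $σ$ are illustrative assumptions, not part of the original text.

```r
# Normal density written out from its formula (illustrative helper name)
normal_density <- function(x, mu, sigma) {
  stopifnot(sigma > 0)  # sigma is a scale parameter and must be positive
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)                 # arbitrary evaluation points
manual  <- normal_density(x, mu = 1, sigma = 2)
builtin <- dnorm(x, mean = 1, sd = 2)     # R parametrizes by the standard deviation

# The two vectors should agree up to floating-point error
all.equal(manual, builtin)
```

Note that <code>dnorm</code> takes the standard deviation $σ$, not the variance $σ^2$, as its scale argument.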

* Parametric inference: Often, we are interested in estimating $ \theta $, or more generally, a function of $ \theta $, say $ g(\theta) $. Let’s consider a few examples that will enable us to understand this.
** Let $ X_1,X_2,…,X_n $ be the outcomes of $n$ independent flips of the same coin. Here, we code $ X_i=1 $ if the $i^{th}$ toss produces a head and $ X_i=0 $ if it produces a tail. The parameter $ \theta $, the probability of flipping a head in a single toss, can be any number between 0 and 1. The $ X_i$’s are independent and identically distributed (i.i.d.). The distribution $ p_{\theta} $ commonly used to describe this type of experiment is the Bernoulli distribution with parameter $ \theta $. It has the probability mass function $ f(x,\theta)=\theta^x (1-\theta)^{1-x}, x \in \{0,1\} $. In $n$ flips of the same coin, we expect to observe $n \theta$ heads on average.
** Let $ X_1,X_2,…,X_n $ be the numbers of customers that arrive at $n$ different identical counters in a unit of time. The $ X_i$'s can be thought of as i.i.d. random variables with a Poisson distribution with mean $ \theta $. The parameter $ \theta $ ranges over the set $ (0,\infty) $, which is the parameter space $ \Theta $. The probability mass function is $ f(x,\theta)=e^{-\theta} \frac{\theta^{x}}{x!}$, for each $x=0, 1, 2, ...$.

: After determining the parameters of the model, we will be able to apply the characteristics of the distribution and the model to the data. The characteristics of various distributions will be discussed further in the [[SMHS_ProbabilityDistributions|Distribution section]]. We will also discuss hypothesis testing and estimation later.
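
To make the estimation step concrete, here is a minimal R sketch for the coin-flip and customer-arrival examples. It simulates data with assumed "true" parameter values and estimates $ \theta $ by the sample mean, which is one common estimator (and the maximum likelihood estimate for both models). The seed, the sample size and the parameter values are arbitrary assumptions chosen for illustration.

```r
set.seed(1)          # arbitrary seed, only to make the sketch reproducible

theta_coin <- 0.3    # assumed "true" head probability for the simulation
theta_pois <- 2      # assumed "true" Poisson mean for the simulation
n <- 1000            # assumed sample size

flips  <- rbinom(n, size = 1, prob = theta_coin)  # Bernoulli outcomes coded 0/1
counts <- rpois(n, lambda = theta_pois)           # customer counts per counter

# Sample means serve as the estimates of theta in each model
theta_hat_coin <- mean(flips)
theta_hat_pois <- mean(counts)

theta_hat_coin
theta_hat_pois
```

With a sample of this size, both estimates should fall close to the values used in the simulation, although the exact numbers depend on the seed.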

====Random number generation====
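* R examples: A random variable follows a standard normal distribution, $ N(\mu=0,\sigma=1) $; samples from it are drawn with rnorm, for example rnorm(10,0,1).
* The call below uses runif, which draws 10 samples from the continuous uniform distribution on $[0,1]$ rather than from a normal distribution (note that all values fall between 0 and 1):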
> runif(10,0,1)
[1] 0.64900447 0.82074379 0.56889471 0.95659206 0.69771341 0.19772881 0.07656862
[8] 0.29823980 0.31825198 0.45029058

* We generate 5 random variables following a Poisson distribution with $\lambda = 2$
> rpois(5,2)
[1] 3 2 1 4 1

* We generate 5 random variables following a binomial distribution with $ p = 0.3, n = 10 $
> rbinom(5,10,0.3)
[1] 2 3 3 2 3

* [[SOCR_EduMaterials_Activities_RNG|SOCR Random Number Generation Activity]]

===Applications===
* The article entitled [http://link.springer.com/article/10.1007/BF00341287 Parametric Inference For Imperfectly Observed Gibbsian Fields] presents a maximum likelihood estimation method for imperfectly observed Gibbsian fields on a finite lattice. This method is an adaptation of the algorithm given in Younes. A presentation of the new algorithm is followed by a theorem about the limit of the second derivative of the likelihood when the lattice increases, which is related to the convergence of the method. The paper offers some practical remarks about the implementation of the procedure.

* [http://www.pnas.org/content/101/46/16138.short This article] uses graphical models that have been applied to problems including hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities with Binomial Distributions]

===Problems===
* Suppose we roll a fair die. What is the probability of rolling three sixes in a row? What kind of model are we making inferences about?

* Consider an unfair coin-flipping game, where the probability of flipping a head is unknown. Construct an experiment to estimate the probability of flipping a head in a single flip. What is the probability of flipping 5 heads in 8 flips?

* Random number generators are commonly used in scientific studies. Explain how they work.

* The realtor Tom sells an average of 3 houses per day. What is the probability that exactly 4 houses will be sold tomorrow?

* Suppose that the average number of patients with cancer seen per day is 5. What is the probability that fewer than 4 patients with cancer will be seen on the next day?

===References===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm NIST EDA]
* [http://en.wikipedia.org/wiki/Random_variable Random variable Wikipedia]
* [http://en.wikipedia.org/wiki/Parameter Parameter Wikipedia]
* [http://en.wikipedia.org/wiki/Parametric_model Parametric model Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ParamInference}}

SMHS ParamInference

2014-08-31T17:58:18Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Parametric Inference ==

===Overview===
In statistical inference, we aim to draw inferences about an underlying population based on a sample drawn from it. For example, we sometimes achieve this by estimating the parameters of a probability density function based on observations. In an idealized case, we would have a perfect model with unknown parameters; based on this, we would make inferences about the population by estimating the parameters with the data we have. In this section, we introduce the concepts of random variables, parametric models, and inference based on these models.

===Motivation===
Consider the well-known example of flipping a coin 10 times. Experience tells us that the expected number of heads in one experiment with 10 flips equals 10 times the probability of observing a head on each flip. For an evenly weighted coin, we would therefore expect approximately 5 heads. If we repeated the experiment many times, we would expect the number of heads to follow a binomial distribution with the parameter $ p=P(head)$ for each flip. In other words, we believe the underlying model to be Binomial$ (n,p) $, where $ n=10 $.

The next step would be to determine the value of $ p $. An obvious way of doing this would be to flip the coin many times, let's say 100, and record the number of heads. The estimate of $ p $ would just be the number of heads in the 100 flips divided by 100. For example, if we got 63 heads, we would estimate the probability $p$ of getting a head on any given flip to be $ 63/100 $. Based on this information, we believe the number of heads in our experiment follows a binomial distribution with parameters $ (n=10,p=0.63) $. That is, we can infer that we will flip an average of 6.3 heads per 10 flips if we repeat the experiment enough times.
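
The arithmetic in this example can be reproduced with a short R sketch. The variable names are illustrative; the numbers (63 heads in 100 flips, $ n=10 $) are those used in the text.

```r
heads <- 63
flips <- 100
p_hat <- heads / flips           # estimated probability of a head: 0.63

n <- 10
expected_heads <- n * p_hat      # mean of Binomial(10, 0.63): 6.3

# Probability of observing 0, 1, ..., 10 heads under the fitted model
pmf <- dbinom(0:n, size = n, prob = p_hat)

# The mean computed from the pmf equals n * p_hat up to rounding
sum((0:n) * pmf)
```

The last line confirms that the average number of heads implied by the fitted Binomial$ (10, 0.63) $ model is $ n p = 6.3 $.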

Next we will explore the following questions. What is a random variable? How do we build a parametric model based on data? What kind of inference can we make based on a parametric model?

===Theory===
* [http://en.wikipedia.org/wiki/Random_variable Random variable]: A random variable is a variable whose value is subject to variations due to chance (i.e., randomness). It can take on a set of values, each with an associated probability for discrete variables or a probability density for continuous variables. The value of a random variable represents the possible outcomes of a yet-to-be-performed experiment or the possible outcomes of a past experiment whose pre-existing value is uncertain. The possible values of a random variable and their associated probabilities (known as a probability distribution) can be further described with mathematical functions.
: There are two types of random variables:
:: ''Discrete random variables'' take on a specified finite or countable list of values, and are endowed with a probability mass function, which is characteristic of a particular probability distribution;
:: ''Continuous random variables'' take on any numerical value in an interval or collection of intervals via a probability density function that is characteristic of a probability distribution.

* [http://en.wikipedia.org/wiki/Parameter Parameters]: A parameter is a characteristic or measurable factor that can help in defining a particular system. It is an important element to consider when evaluating or trying to understand an event. μ is often used to represent the mean and σ the standard deviation in statistics. The following table provides a list of commonly used parameters with descriptions:

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|Parameter || Description || Parameter || Description
|-
| $\bar{x}$ || Sample mean || α,β,γ || Various Greek letters
|-
| μ || Population mean || θ || Lower case theta
|-
|σ || Population standard deviation || φ || Lower case phi
|-
| $σ^2$ || Population variance || ω || Lower case omega
|-
| s || Sample standard deviation || ∆ || Increment
|-
| $s^2$ || Sample variance || ν || Nu
|-
| λ || Poisson mean, Lambda || τ || Tau
|-
| χ || χ distribution, Chi || η || Eta
|-
| ρ || The density, Rho || τ || Sometimes used in tau function
|-
| ϕ || Normal density function, Phi || Θ || Parameter space
|-
| Γ || Gamma || Ω || Sample space, omega
|-
| ∂ || Per/ divided || δ || Lower case delta
|-
| S || Sample space|| Κ,k || kappa
|}
</center>

* [http://en.wikipedia.org/wiki/Parametric_model Parametric model]: A parametric model is a collection of probability distributions that can be described using a finite number of parameters. These parameters are usually written together to form a single k-dimensional parameter vector $\theta=(\theta_1,\theta_2,…,\theta_k)$. The main characteristic of a parametric model is that all the parameters are from finite-dimensional parameter spaces.

: Each member of the collection, $ p_θ $, is described by a finite-dimensional parameter $ θ $. The set of all allowable values for the parameter is denoted $ Θ⊆R^k $, and the model itself is written as $ P=\{p_θ |θ∈Θ\} $. If the model consists of absolutely continuous distributions, it is often specified in terms of the corresponding probability density functions, $ P=\{f_θ |θ∈Θ\}$. The model is considered identifiable if the mapping $ θ→p_θ $ is invertible, that is, there are no two different parameter values $ θ_1 $ and $ θ_2 $ such that $ p_{θ_1} =p_{θ_2} $.

: Consider one of the most popular distributions, the normal distribution, in which the parameter vector is $ θ=(μ,σ) $. Here, $ μ∈R $ is a location parameter, and $ σ>0 $ is a scale parameter. This parametrized family can be expressed as:
$$ P=\{f_θ (x)=\frac{1}{σ\sqrt{2π}} e^{-\frac{(x-μ)^2}{2σ^2}} |μ∈R,σ>0\}.$$

* Parametric inference: Often, we are interested in estimating $ \theta $, or more generally, a function of $ \theta $, say $ g(\theta) $. Let’s consider a few examples that will enable us to understand this.
** Let $ X_1,X_2,…,X_n $ be the outcomes of $n$ independent flips of the same coin. Here, we code $ X_i=1 $ if the $i^{th}$ toss produces a head and $ X_i=0 $ if it produces a tail. The parameter $ \theta $, the probability of flipping a head in a single toss, can be any number between 0 and 1. The $ X_i$’s are independent and identically distributed (i.i.d.). The distribution $ p_{\theta} $ commonly used to describe this type of experiment is the Bernoulli distribution with parameter $ \theta $. It has the probability mass function $ f(x,\theta)=\theta^x (1-\theta)^{1-x}, x \in \{0,1\} $. In $n$ flips of the same coin, we expect to observe $n \theta$ heads on average.
** Let $ X_1,X_2,…,X_n $ be the numbers of customers that arrive at $n$ different identical counters in a unit of time. The $ X_i$'s can be thought of as i.i.d. random variables with a Poisson distribution with mean $ \theta $. The parameter $ \theta $ ranges over the set $ (0,\infty) $, which is the parameter space $ \Theta $. The probability mass function is $ f(x,\theta)=e^{-\theta} \frac{\theta^{x}}{x!}$, for each $x=0, 1, 2, ...$.

: After determining the parameters of the model, we will be able to apply the characteristics of the distribution and the model to the data. The characteristics of various distributions will be discussed further in the [[SMHS_ProbabilityDistributions|Distribution section]]. We will also discuss hypothesis testing and estimation later.

====Random number generation====
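* R examples: A random variable follows a standard normal distribution, $ N(\mu=0,\sigma=1) $; samples from it are drawn with rnorm, for example rnorm(10,0,1).
* The call below uses runif, which draws 10 samples from the continuous uniform distribution on $[0,1]$ rather than from a normal distribution (note that all values fall between 0 and 1):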
> runif(10,0,1)
[1] 0.64900447 0.82074379 0.56889471 0.95659206 0.69771341 0.19772881 0.07656862
[8] 0.29823980 0.31825198 0.45029058

* We generate 5 random variables following a Poisson distribution with $\lambda = 2$
> rpois(5,2)
[1] 3 2 1 4 1

* We generate 5 random variables following a binomial distribution with $ p = 0.3, n = 10 $
> rbinom(5,10,0.3)
[1] 2 3 3 2 3

* [[SOCR_EduMaterials_Activities_RNG|SOCR Random Number Generation Activity]]
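
Because the generators above are pseudo-random, the printed values change on every run unless the generator is seeded. The following sketch shows how <code>set.seed</code> makes draws reproducible, here for 10 standard normal samples; the seed value 2014 is an arbitrary choice.

```r
set.seed(2014)                          # arbitrary seed value
first  <- rnorm(10, mean = 0, sd = 1)   # 10 draws from N(0, 1)

set.seed(2014)                          # reset the generator to the same state
second <- rnorm(10, mean = 0, sd = 1)

identical(first, second)                # TRUE: same seed, same draws
```

Recording the seed alongside an analysis lets other researchers regenerate exactly the same simulated data.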

===Applications===
* The article entitled [http://link.springer.com/article/10.1007/BF00341287 Parametric Inference For Imperfectly Observed Gibbsian Fields] presents a maximum likelihood estimation method for imperfectly observed Gibbsian fields on a finite lattice. This method is an adaptation of the algorithm given in Younes. A presentation of the new algorithm is followed by a theorem about the limit of the second derivative of the likelihood when the lattice increases, which is related to the convergence of the method. The paper offers some practical remarks about the implementation of the procedure.

* [http://www.pnas.org/content/101/46/16138.short This article] uses graphical models that have been applied to problems including hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities with Binomial Distributions]

===Problems===
* Suppose we roll a fair die. What is the probability of rolling three sixes in a row? What kind of model are we making inferences about?

* Consider an unfair coin-flipping game, where the probability of flipping a head is unknown. Construct an experiment to estimate the probability of flipping a head in a single flip. What is the probability of flipping 5 heads in 8 flips?

* Random number generators are commonly used in scientific studies. Explain how they work.

* The realtor Tom sells an average of 3 houses per day. What is the probability that exactly 4 houses will be sold tomorrow?

* Suppose that the average number of patients with cancer seen per day is 5. What is the probability that fewer than 4 patients with cancer will be seen on the next day?

===References===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm NIST EDA]
* [http://en.wikipedia.org/wiki/Random_variable Random variable Wikipedia]
* [http://en.wikipedia.org/wiki/Parameter Parameter Wikipedia]
* [http://en.wikipedia.org/wiki/Parametric_model Parametric model Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ParamInference}}

SMHS ParamInference

2014-08-31T17:57:47Z

Zhenxunw: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Parametric Inference ==

===Overview===
In statistical inference, we aim to draw inferences about an underlying population based on a sample drawn from it. For example, we sometimes achieve this by estimating the parameters of a probability density function based on observations. In an idealized case, we would have a perfect model with unknown parameters; based on this, we would make inferences about the population by estimating the parameters with the data we have. In this section, we introduce the concepts of random variables, parametric models, and inference based on these models.

===Motivation===
Consider the well-known example of flipping a coin 10 times. Experience tells us that the expected number of heads in one experiment with 10 flips equals 10 times the probability of observing a head on each flip. For an evenly weighted coin, we would therefore expect approximately 5 heads. If we repeated the experiment many times, we would expect the number of heads to follow a binomial distribution with the parameter $ p=P(head)$ for each flip. In other words, we believe the underlying model to be Binomial$ (n,p) $, where $ n=10 $.

The next step would be to determine the value of $ p $. An obvious way of doing this would be to flip the coin many times, let's say 100, and record the number of heads. The estimate of $ p $ would just be the number of heads in the 100 flips divided by 100. For example, if we got 63 heads, we would estimate the probability $p$ of getting a head on any given flip to be $ 63/100 $. Based on this information, we believe the number of heads in our experiment follows a binomial distribution with parameters $ (n=10,p=0.63) $. That is, we can infer that we will flip an average of 6.3 heads per 10 flips if we repeat the experiment enough times.

Next we will explore the following questions. What is a random variable? How do we build a parametric model based on data? What kind of inference can we make based on a parametric model?

===Theory===
* [http://en.wikipedia.org/wiki/Random_variable Random variable]: A random variable is a variable whose value is subject to variations due to chance (i.e., randomness). It can take on a set of values, each with an associated probability for discrete variables or a probability density for continuous variables. The value of a random variable represents the possible outcomes of a yet-to-be-performed experiment or the possible outcomes of a past experiment whose pre-existing value is uncertain. The possible values of a random variable and their associated probabilities (known as a probability distribution) can be further described with mathematical functions.
: There are two types of random variables:
:: ''Discrete random variables'' take on a specified finite or countable list of values, and are endowed with a probability mass function, which is characteristic of a particular probability distribution;
:: ''Continuous random variables'' take on any numerical value in an interval or collection of intervals via a probability density function that is characteristic of a probability distribution.
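
The distinction between a probability mass function and a probability density function can be illustrated with a short R sketch. The Binomial$ (10, 0.3) $ and standard normal distributions used here are illustrative choices: the masses of a discrete variable sum to 1, while the density of a continuous variable integrates to 1 and yields probabilities only as areas.

```r
# Discrete: Binomial(10, 0.3) probability mass function over its support 0..10
pmf <- dbinom(0:10, size = 10, prob = 0.3)
total_mass <- sum(pmf)                  # masses add up to 1

# Continuous: standard normal density integrated numerically over the real line
area <- integrate(dnorm, lower = -Inf, upper = Inf)
total_area <- area$value                # approximately 1

# For a continuous variable, P(-1 < X < 1) is an area under the density,
# obtained here from the cumulative distribution function
prob_within_one_sd <- pnorm(1) - pnorm(-1)

total_mass
total_area
prob_within_one_sd
```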

* [http://en.wikipedia.org/wiki/Parameter Parameters]: A parameter is a characteristic or measurable factor that can help in defining a particular system. It is an important element to consider when evaluating or trying to understand an event. μ is often used to represent the mean and σ the standard deviation in statistics. The following table provides a list of commonly used parameters with descriptions:

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|Parameter || Description || Parameter || Description
|-
| $\bar{x}$ || Sample mean || α,β,γ || Various Greek letters
|-
| μ || Population mean || θ || Lower case theta
|-
|σ || Population standard deviation || φ || Lower case phi
|-
| $σ^2$ || Population variance || ω || Lower case omega
|-
| s || Sample standard deviation || ∆ || Increment
|-
| $s^2$ || Sample variance || ν || Nu
|-
| λ || Poisson mean, Lambda || τ || Tau
|-
| χ || χ distribution, Chi || η || Eta
|-
| ρ || The density, Rho || τ || Sometimes used in tau function
|-
| ϕ || Normal density function, Phi || Θ || Parameter space
|-
| Γ || Gamma || Ω || Sample space, omega
|-
| ∂ || Per/ divided || δ || Lower case delta
|-
| S || Sample space|| Κ,k || kappa
|}
</center>

* [http://en.wikipedia.org/wiki/Parametric_model Parametric model]: A parametric model is a collection of probability distributions that can be described using a finite number of parameters. These parameters are usually written together to form a single k-dimensional parameter vector $\theta=(\theta_1,\theta_2,…,\theta_k)$. The main characteristic of a parametric model is that all the parameters are from finite-dimensional parameter spaces.

: Each member of the collection, $ p_θ $, is described by a finite-dimensional parameter $ θ $. The set of all allowable values for the parameter is denoted $ Θ⊆R^k $, and the model itself is written as $ P=\{p_θ |θ∈Θ\} $. If the model consists of absolutely continuous distributions, it is often specified in terms of the corresponding probability density functions, $ P=\{f_θ |θ∈Θ\}$. The model is considered identifiable if the mapping $ θ→p_θ $ is invertible, that is, there are no two different parameter values $ θ_1 $ and $ θ_2 $ such that $ p_{θ_1} =p_{θ_2} $.

: Consider one of the most popular distributions, the normal distribution, in which the parameter vector is $ θ=(μ,σ) $. Here, $ μ∈R $ is a location parameter, and $ σ>0 $ is a scale parameter. This parametrized family can be expressed as:
$$ P=\{f_θ (x)=\frac{1}{σ\sqrt{2π}} e^{-\frac{(x-μ)^2}{2σ^2}} |μ∈R,σ>0\}.$$

* Parametric inference: Often, we are interested in estimating $ \theta $, or more generally, a function of $ \theta $, say $ g(\theta) $. Let’s consider a few examples that will enable us to understand this.
** Let $ X_1,X_2,…,X_n $ be the outcomes of $n$ independent flips of the same coin. Here, we code $ X_i=1 $ if the $i^{th}$ toss produces a head and $ X_i=0 $ if it produces a tail. The parameter $ \theta $, the probability of flipping a head in a single toss, can be any number between 0 and 1. The $ X_i$’s are independent and identically distributed (i.i.d.). The distribution $ p_{\theta} $ commonly used to describe this type of experiment is the Bernoulli distribution with parameter $ \theta $. It has the probability mass function $ f(x,\theta)=\theta^x (1-\theta)^{1-x}, x \in \{0,1\} $. In $n$ flips of the same coin, we expect to observe $n \theta$ heads on average.
** Let $ X_1,X_2,…,X_n $ be the numbers of customers that arrive at $n$ different identical counters in a unit of time. The $ X_i$'s can be thought of as i.i.d. random variables with a Poisson distribution with mean $ \theta $. The parameter $ \theta $ ranges over the set $ (0,\infty) $, which is the parameter space $ \Theta $. The probability mass function is $ f(x,\theta)=e^{-\theta} \frac{\theta^{x}}{x!}$, for each $x=0, 1, 2, ...$.

: After determining the parameters of the model, we will be able to apply the characteristics of the distribution and the model to the data. The characteristics of various distributions will be discussed further in the [[SMHS_ProbabilityDistributions|Distribution section]]. We will also discuss hypothesis testing and estimation later.
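
The Bernoulli and Poisson probability mass functions from the two examples can be written out and checked against R's built-in functions. This is a minimal sketch: the helper names <code>bernoulli_pmf</code> and <code>poisson_pmf</code> and the parameter values 0.3 and 2 are assumptions made for illustration.

```r
# Bernoulli pmf: theta^x * (1 - theta)^(1 - x) for x in {0, 1}
bernoulli_pmf <- function(x, theta) theta^x * (1 - theta)^(1 - x)

# Poisson pmf: exp(-theta) * theta^x / x! for x = 0, 1, 2, ...
poisson_pmf <- function(x, theta) exp(-theta) * theta^x / factorial(x)

# A Bernoulli variable is a binomial variable with a single trial (size = 1)
bern_ok <- all.equal(bernoulli_pmf(c(0, 1), 0.3),
                     dbinom(c(0, 1), size = 1, prob = 0.3))

# Compare the hand-written Poisson pmf with dpois on the values 0..10
pois_ok <- all.equal(poisson_pmf(0:10, 2), dpois(0:10, lambda = 2))

bern_ok
pois_ok
```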

====Random number generation====
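* R examples: A random variable follows a standard normal distribution, $ N(\mu=0,\sigma=1) $; samples from it are drawn with rnorm, for example rnorm(10,0,1).
* The call below uses runif, which draws 10 samples from the continuous uniform distribution on $[0,1]$ rather than from a normal distribution (note that all values fall between 0 and 1):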
> runif(10,0,1)
[1] 0.64900447 0.82074379 0.56889471 0.95659206 0.69771341 0.19772881 0.07656862
[8] 0.29823980 0.31825198 0.45029058

* We generate 5 random variables following a Poisson distribution with $\lambda = 2$
> rpois(5,2)
[1] 3 2 1 4 1

* We generate 5 random variables following a binomial distribution with $ p = 0.3, n = 10 $
> rbinom(5,10,0.3)
[1] 2 3 3 2 3

* [[SOCR_EduMaterials_Activities_RNG|SOCR Random Number Generation Activity]]

===Applications===
* The article entitled [http://link.springer.com/article/10.1007/BF00341287 Parametric Inference For Imperfectly Observed Gibbsian Fields] presents a maximum likelihood estimation method for imperfectly observed Gibbsian fields on a finite lattice. This method is an adaptation of the algorithm given in Younes. A presentation of the new algorithm is followed by a theorem about the limit of the second derivative of the likelihood when the lattice increases, which is related to the convergence of the method. The paper offers some practical remarks about the implementation of the procedure.

* [http://www.pnas.org/content/101/46/16138.short This article] uses graphical models that have been applied to problems including hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities with Binomial Distributions]

===Problems===
* Suppose we roll a fair die. What is the probability of rolling three sixes in a row? What kind of model are we making inferences about?

* Consider an unfair coin-flipping game, where the probability of flipping a head is unknown. Construct an experiment to estimate the probability of flipping a head in a single flip. What is the probability of flipping 5 heads in 8 flips?

* Random number generators are commonly used in scientific studies. Explain how they work.

* The realtor Tom sells an average of 3 houses per day. What is the probability that exactly 4 houses will be sold tomorrow?

* Suppose that the average number of patients with cancer seen per day is 5. What is the probability that fewer than 4 patients with cancer will be seen on the next day?

===References===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm NIST EDA]
* [http://en.wikipedia.org/wiki/Random_variable Random variable Wikipedia]
* [http://en.wikipedia.org/wiki/Parameter Parameter Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ParamInference}}

SMHS ParamInference

2014-08-31T17:53:45Z

Zhenxunw: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Parametric Inference ==

===Overview===
In statistical inference, we aim to draw inferences about an underlying population based on a sample drawn from it. For example, we sometimes achieve this by estimating the parameters of a probability density function based on observations. In an idealized case, we would have a perfect model with unknown parameters; based on this, we would make inferences about the population by estimating the parameters with the data we have. In this section, we introduce the concepts of random variables, parametric models, and inference based on these models.

===Motivation===
Consider the well-known example of flipping a coin 10 times. Experience tells us that the expected number of heads in one experiment with 10 flips equals 10 times the probability of observing a head on each flip. For an evenly weighted coin, we would therefore expect approximately 5 heads. If we repeated the experiment many times, we would expect the number of heads to follow a binomial distribution with the parameter $ p=P(head)$ for each flip. In other words, we believe the underlying model to be Binomial$ (n,p) $, where $ n=10 $.

The next step would be to determine the value of $ p $. An obvious way of doing this would be to flip the coin many times, let's say 100, and record the number of heads. The estimate of $ p $ would just be the number of heads in the 100 flips divided by 100. For example, if we got 63 heads, we would estimate the probability $p$ of getting a head on any given flip to be $ 63/100 $. Based on this information, we believe the number of heads in our experiment follows a binomial distribution with parameters $ (n=10,p=0.63) $. That is, we can infer that we will flip an average of 6.3 heads per 10 flips if we repeat the experiment enough times.

Next we will explore the following questions. What is a random variable? How do we build a parametric model based on data? What kind of inference can we make based on a parametric model?

===Theory===
* [http://en.wikipedia.org/wiki/Random_variable Random variable]: A random variable is a variable whose value is subject to variations due to chance (i.e., randomness). It can take on a set of values, each with an associated probability for discrete variables or a probability density for continuous variables. The value of a random variable represents the possible outcomes of a yet-to-be-performed experiment or the possible outcomes of a past experiment whose pre-existing value is uncertain. The possible values of a random variable and their associated probabilities (known as a probability distribution) can be further described with mathematical functions.
: There are two types of random variables:
:: ''Discrete random variables'' take on a specified finite or countable list of values, and are endowed with a probability mass function, which is characteristic of a particular probability distribution;
:: ''Continuous random variables'' can take on any numerical value in an interval or collection of intervals, and are described by a probability density function that is characteristic of a particular probability distribution.

* Parameters: A parameter is a characteristic or measurable factor that can help in defining a particular system. It is an important element to consider when evaluating or trying to understand an event. μ is often used to represent the mean and σ the standard deviation in statistics. The following table provides a list of commonly used parameters with descriptions:

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|Parameter || Description || Parameter || Description
|-
| $\bar{x}$ || Sample mean || α,β,γ || Various Greek letters
|-
| μ || Population mean || θ || Lower case theta
|-
|σ || Population standard deviation || φ || Lower case phi
|-
| $σ^2$ || Population variance || ω || Lower case omega
|-
| s || Sample standard deviation || ∆ || Increment
|-
| $s^2$ || Sample variance || ν || Nu
|-
| λ || Poisson mean, Lambda || τ || Tau
|-
| χ || χ distribution, Chi || η || Eta
|-
| ρ || The density, Rho || τ || Sometimes used in tau function
|-
| ϕ || Normal density function, Phi || Θ || Parameter space
|-
| Γ || Gamma || Ω || Sample space, omega
|-
| ∂ || Partial derivative || δ || Lower case delta
|-
| S || Sample space|| Κ,k || kappa
|}
</center>

* Parametric model: A parametric model is a collection of probability distributions that can be described using a finite number of parameters. These parameters are usually written together to form a single k-dimensional parameter vector $\theta=(\theta_1,\theta_2,…,\theta_k)$. The main characteristic of a parametric model is that all the parameters are from finite-dimensional parameter spaces.

: Each member of the collection of distributions, $ p_\theta $, is described by a finite-dimensional parameter $ \theta $. The set of all allowable values for the parameter is denoted $ \Theta \subseteq \mathbb{R}^k $, and the model itself is written as $ P=\{p_\theta \mid \theta\in\Theta\} $. If the model consists of absolutely continuous distributions, it is often specified in terms of the corresponding probability density functions, $ P=\{f_\theta \mid \theta\in\Theta\}$. The model is considered identifiable if the mapping $ \theta \to p_\theta $ is one-to-one, that is, there are no two different parameter values $ \theta_1 $ and $ \theta_2 $ such that $ p_{\theta_1} =p_{\theta_2} $.

: Consider one of the most popular distributions, the normal distribution, in which the parameter vector is $ \theta=(\mu,\sigma) $. Here, $ \mu\in\mathbb{R} $ is a location parameter, and $ \sigma>0 $ is a scale parameter. This parametrized family can be expressed as:
$$ P=\left\{f_\theta (x)=\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \;\Big|\; \mu\in\mathbb{R},\ \sigma>0\right\}.$$
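
As a quick check of the normal density, with normalizing constant $ 1/(\sigma\sqrt{2\pi}) $ and exponent $ -(x-\mu)^2/(2\sigma^2) $, the formula can be written out by hand in R and compared with the built-in <code>dnorm</code>. The function name <code>normal_density</code> and the chosen values of $ \mu $ and $ \sigma $ are illustrative assumptions.

```r
# Normal density written out from the formula:
# f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
normal_density <- function(x, mu, sigma) {
  stopifnot(sigma > 0)
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Evaluate on a small grid for the member of the family with mu = 1, sigma = 2
x <- seq(-3, 3, by = 0.5)
by_formula <- normal_density(x, mu = 1, sigma = 2)
built_in   <- dnorm(x, mean = 1, sd = 2)

# The hand-written formula and dnorm() should agree
print(all.equal(by_formula, built_in))
```

Each pair $ (\mu,\sigma) $ picks out one member of the family; changing the arguments moves to a different distribution in the same parametric model.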

* Parametric inference: Often, we are interested in estimating $ \theta $, or more generally, a function of $ \theta $, say $ g(\theta) $. Let’s consider a few examples that will enable us to understand this.
** Let $ X_1,X_2,\ldots,X_n $ be the outcomes of $n$ independent flips of the same coin. Here, we code $ X_i=1 $ if the $i^{th}$ toss produces a head and $ X_i=0 $ if the $i^{th}$ toss produces a tail. The parameter $ \theta $, which is the probability of flipping a head in a single toss, could be any number between 0 and 1. The $ X_i$’s are independent and identically distributed (i.i.d.). The distribution $ p_{\theta} $ commonly used to describe this type of experiment is the Bernoulli distribution with parameter $ \theta $. It has the probability mass function $ f(x,\theta)=\theta^x (1-\theta)^{1-x}, x \in \{0,1\} $. If we repeat the experiment with the same coin enough times, we would expect to observe $n \theta$ heads on average.
**Let $ X_1,X_2,\ldots,X_n $ be the numbers of customers that arrive at $n$ different identical counters in a unit of time. The $ X_i$'s can be thought of as i.i.d. random variables following a Poisson distribution with mean $ \theta $. The parameter $ \theta $ varies in the set $ (0,\infty) $, which is the parameter space $ \Theta $. The probability mass function is $ f(x,\theta)=e^{-\theta} \frac{\theta^{x}}{x!}$, for each $x=0, 1, 2, \ldots$.

: After determining the parameters of the model, we will be able to apply the characteristics of the distribution and the model to the data. The characteristics of various distributions will be discussed further in the [[SMHS_ProbabilityDistributions|Distribution section]]. We will also discuss hypothesis testing and estimation later.
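
The two examples above can be sketched in R. The data vectors below are made up for illustration, and using the sample mean as the estimate of $ \theta $ is a standard choice for both models rather than something prescribed in this section.

```r
# Probability mass functions written out from the formulas in the text
bernoulli_pmf <- function(x, theta) theta^x * (1 - theta)^(1 - x)
poisson_pmf   <- function(x, theta) exp(-theta) * theta^x / factorial(x)

# Hypothetical data: 10 coin flips (1 = head) and arrivals at 6 counters
flips    <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 1)
arrivals <- c(2, 0, 3, 1, 4, 2)

# The sample mean is the usual estimate of theta in both models
theta_coin    <- mean(flips)     # 7 heads out of 10 flips
theta_arrival <- mean(arrivals)  # 12 arrivals over 6 counters

print(theta_coin)
print(theta_arrival)

# Compare the hand-written formulas with R's built-in mass functions
print(all.equal(bernoulli_pmf(c(0, 1), theta_coin),
                dbinom(c(0, 1), size = 1, prob = theta_coin)))
print(all.equal(poisson_pmf(0:5, theta_arrival),
                dpois(0:5, lambda = theta_arrival)))
```

Once $ \theta $ is estimated, any quantity $ g(\theta) $, such as the expected number of heads $ n\theta $, can be computed by plugging in the estimate.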

====Random number generation====
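* R examples: R provides random number generators for many distributions, for example the uniform distribution on $ (0,1) $ and the normal distribution $ N(\mu=0,\sigma=1) $.
* We use a random number generator to obtain 10 samples from a uniform distribution on the interval from 0 to 1. (The call rnorm(10,0,1) would instead draw 10 samples from a normal distribution with mean 0 and variance 1.)
> runif(10,0,1)
[1] 0.64900447 0.82074379 0.56889471 0.95659206 0.69771341 0.19772881 0.07656862
[8] 0.29823980 0.31825198 0.45029058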

* We generate 5 random values from a Poisson distribution with $\lambda = 2$:
> rpois(5,2)
[1] 3 2 1 4 1

* We generate 5 random values from a binomial distribution with $ p = 0.3, n = 10 $:
> rbinom(5,10,0.3)
[1] 2 3 3 2 3

* [[SOCR_EduMaterials_Activities_RNG|SOCR Random Number Generation Activity]]
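
Random draws change from run to run unless the seed of the generator is fixed. The following minimal sketch shows reproducible sampling from $ N(\mu=0,\sigma=1) $; the seed value 2014 is an arbitrary choice for illustration.

```r
# Fixing the seed makes the pseudo-random draws reproducible
set.seed(2014)
first_draw <- rnorm(10, mean = 0, sd = 1)

set.seed(2014)
second_draw <- rnorm(10, mean = 0, sd = 1)

# Same seed, same call: the two samples are identical
print(identical(first_draw, second_draw))

# With a large sample, the sample mean and standard deviation
# should be close to the parameters mu = 0 and sigma = 1
big_sample <- rnorm(100000, mean = 0, sd = 1)
print(mean(big_sample))
print(sd(big_sample))
```

Reporting the seed together with the code lets other researchers reproduce a simulation exactly.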

===Applications===
* The article entitled [http://link.springer.com/article/10.1007/BF00341287 Parametric Inference For Imperfectly Observed Gibbsian Fields] presents a maximum likelihood estimation method for imperfectly observed Gibbsian fields on a finite lattice. This method is an adaptation of the algorithm given in Younes. A presentation of the new algorithm is followed by a theorem about the limit of the second derivative of the likelihood when the lattice increases, which is related to the convergence of the method. The paper offers some practical remarks about the implementation of the procedure.

* [http://www.pnas.org/content/101/46/16138.short This article] uses graphical models that have been applied to problems including hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities with Binomial Distributions]

===Problems===
* Suppose we roll a fair die. What is the probability that we roll three sixes in a row? What kind of model are we using for this inference?

* Consider an unfair coin-flipping game, where the probability of flipping a head is unknown. Construct an experiment to estimate the probability of flipping a head in a single flip. What is the probability that we flip 5 heads out of 8 flips?

* Random number generators are commonly used in scientific studies. Explain how they work.

* The average number of homes sold by realty Tom is 3 houses per day. What is the probability that exactly 4 houses will be sold tomorrow?

* Suppose that the average number of patients with cancer seen per day is 5. What is the probability that fewer than 4 patients with cancer will be seen on the next day?

=== References===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm NIST EDA]
* [http://en.wikipedia.org/wiki/Random_variable Random variable Wikipedia]
* [http://en.wikipedia.org/wiki/Parameter Parameter Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ParamInference}}

SMHS EDA

2014-08-31T17:43:45Z

Zhenxunw: /* References */

==[[SMHS| Scientific Methods for Health Sciences]] - Exploratory Data Analysis (EDA), Charts and Plots ==

===Overview===
* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just description of things (meta-data). Data types can be divided into two big categories of quantitative (numerical information) and qualitative data (descriptive information).

*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test or weight of girls in the fourth grade are both quantitative data. Quantitative data (discrete or continuous) is often referred to as the measurable data and this type of data allows scientists to perform various arithmetic operations, such as addition, multiplication, functional-evaluation, or to find parameters of a population. There are two major types of quantitative data: discrete and continuous.
**Discrete data take values from a finite, or infinite but countable, set of possible options; the values of this data type form a sequence of isolated or separated points on the real number line.
**Continuous quantitative data results from infinite and dense possible values that the observations can take on.

*''Qualitative'' data cannot be expressed as numbers. Examples of qualitative data elements include gender and religious preference. Categorical data (qualitative or nominal) result from placing individuals into groups or categories. Ordinal and nominal categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyze data sets to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].

===Motivation===
A feel for the data comes from applying various graphical techniques, which serve as a window onto human perception and intuition. The primary goal of EDA is to maximize the analyst’s insight into a data set and its underlying structure. To get a feel for the data, it is not enough for the analyst to know what is in the data; he or she must also know what is not in the data. The only way to do that is to draw on our own pattern recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The [http://en.wikipedia.org/wiki/Exploratory_data_analysis#EDA_development main objectives of EDA] are to:
* Suggest hypotheses about the causes of observed phenomena;
* Assess (parametric) assumptions on which statistical inference will be based;
* Support the selection of appropriate statistical tools and techniques;
* Provide a basis for further data collection through surveys and experiments.

===Theory===
Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

====[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]====
[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]] is an efficient way of presenting data, especially for comparing multiple groups of data. In the box plot, we can mark off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the middle 50% of the data. The upper edge of the box represents the 75th percentile, while the lower edge is the 25th percentile. The median is represented by a line drawn inside the box. If the median is not in the middle of the box, then the data are skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers, in which case the whiskers end at the most extreme observations that are not outliers. Outliers are observations below $ Q_1-1.5(IQR) $ or above $ Q_3+1.5(IQR) $, where $ Q_1 $ is the 25th percentile, $ Q_3 $ is the 75th percentile, and $ IQR=Q_3-Q_1 $ (the interquartile range). The advantages of a box plot are that it graphically displays the location and the spread of the data set, gives an idea of the skewness of the data set, and allows comparison between variables by constructing side-by-side box plots.
<center>[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]</center>
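
The quantities behind a box plot can be computed directly. The sketch below uses a small made-up sample and R's default <code>quantile</code> definition of the percentiles, which is one of several conventions.

```r
# Hypothetical sample with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Five-number summary: minimum, Q1, median, Q3, maximum
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

q1  <- unname(five_num[2])
q3  <- unname(five_num[4])
iqr <- q3 - q1

# Outlier fences from the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

print(five_num)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)

# boxplot(x) would draw the corresponding box-and-whisker plot
```

Note that <code>boxplot</code> itself computes the box edges from Tukey's hinges (<code>fivenum</code>), which can differ slightly from the <code>quantile</code> values for small samples.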

====[[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]====
Histograms provide a graphical visualization of tabulated frequencies or counts of data within equally spaced partitions of the range of the data. A histogram shows what proportion of measurements fall into each of the categories defined by the partition of the data range.
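
The tabulation behind a histogram can be sketched in R as follows. The data values and the bin width of 2 are assumptions chosen for illustration.

```r
# Hypothetical measurements
x <- c(1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.0, 4.4, 5.6, 7.9)

# Equally spaced bin edges covering the range of the data
breaks <- seq(0, 8, by = 2)

# plot = FALSE returns the tabulated counts without drawing the chart
h <- hist(x, breaks = breaks, plot = FALSE)

counts      <- h$counts
proportions <- counts / sum(counts)

print(counts)
print(proportions)
```

By default <code>hist</code> uses right-closed bins, so the value 4.0 is counted in the bin from 2 to 4.
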
[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]

* Comment: Comparing the two series in the chart above, we can easily tell that the pattern of series 2 is more apparent than that of series 1. Our intuition may come from the fact that series 1 has more extreme values across the five days: for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st), while the value for Jan 4th is almost -12. In contrast, the values for series 2 are all above 0 and fluctuate between 5 and 20.

[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]

* Comment: The dot chart above gives a clear picture of the values of all the data points and makes the fundamental measurements easily readable. We can tell that most of the values fluctuate between 1 and 7, with mean 3.9 and median 4. There are two obvious outliers with values -2 and 10.

====[[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]====
[http://en.wikipedia.org/wiki/Scatter_plot Scatter plots] use Cartesian coordinates to display values of two variables for a set of data, which is displayed as a collection of points. For each point, the value of one variable determines the position on the horizontal axis and the value of the other variable determines the position on the vertical axis.

[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]

* Comment: The x and y axes display values for the two variables, and each data point drawn in the chart represents a pair of values, one for each variable.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most data points are located along the line, except for two outliers, (4,8) and (1,5). For most data points, as x increases, the paired y value decreases no faster than x increases. We may infer a negative linear association between X and Y.

For the third series, we can’t identify a linear association between X and Y; instead, a quadratic pattern would work better here.

====[[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]====
In Quantile-Quantile (QQ) plots, the observed values are plotted against theoretical quantiles. A line of good fit is drawn to show the behavior of the data values against the theoretical distribution. If $F$ is a cumulative distribution function, then a quantile $q$, also known as a percentile, is defined as a solution to the equation $F(q)=p$, that is, $q=F^{-1}(p)$.
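
The relation $q=F^{-1}(p)$ and the coordinates of a normal QQ plot can be sketched in R. The sample values are made up, and using <code>ppoints</code> for the plotting positions follows what <code>qqnorm</code> does by default.

```r
# For the standard normal distribution, pnorm is F and qnorm is F^{-1}
p <- c(0.025, 0.25, 0.5, 0.75, 0.975)
q <- qnorm(p)

# Applying F to the quantiles recovers the probabilities
print(q)
print(all.equal(pnorm(q), p))

# Coordinates of a normal QQ plot for a hypothetical sample:
# sorted observations against theoretical normal quantiles
x <- c(4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4)
n <- length(x)
theoretical <- qnorm(ppoints(n))
observed    <- sort(x)

print(cbind(theoretical, observed))

# qqnorm(x); qqline(x) would draw the plot with a reference line
```

If the sample comes from a normal distribution, the points (theoretical, observed) should fall close to a straight line.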

[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]

*Comment: From the chart above, we can see that the data roughly follow a normal distribution, since most of the data points (noted in red) are located alongside the line. However, the data do not follow a normal distribution tightly, because some data points are located fairly far from the line. We can also infer that the sampled data may not be representative enough of the population because of the limited size of the sample.

====Median polish====
[http://en.wikipedia.org/wiki/Median_polish Median polish] is an EDA procedure proposed by John Tukey. It fits an additive model to data in a two-way layout table, of the form row effect + column effect + overall median. It is an iterative algorithm that removes trends by repeatedly computing and subtracting medians along the coordinates (rows and columns) of the spatial domain D.
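
A minimal sketch of median polish in R, using the <code>medpolish</code> function from the stats package on a made-up two-way table:

```r
# Hypothetical two-way table: 3 rows (e.g. groups) by 4 columns (e.g. times)
tab <- matrix(c(10, 12, 15, 11,
                14, 17, 19, 16,
                 9, 11, 13, 30),
              nrow = 3, byrow = TRUE)

# medpolish() alternately sweeps out row and column medians
fit <- medpolish(tab, trace.iter = FALSE)

print(fit$overall)    # overall effect
print(fit$row)        # row effects
print(fit$col)        # column effects
print(fit$residuals)  # what the additive model does not explain

# The pieces add back up to the original table
rebuilt <- fit$overall + outer(fit$row, fit$col, "+") + fit$residuals
print(all.equal(rebuilt, tab))
```

Large residuals point to cells that the additive row-plus-column description does not capture well.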

==== Trimean====
[http://en.wikipedia.org/wiki/Trimean Trimean] is a measure of a probability distribution’s location defined as a weighted average of the distribution’s median and its two quartiles. It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is a remarkably efficient estimator of the population mean, especially for large data sets (say, more than 100 points) from a symmetric population.
<center> $ \frac{Q_1+2Q_2+Q_3}{4} $ </center>
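
The trimean formula translates directly into R. The helper name <code>trimean</code> and the sample are illustrative assumptions, and the quartiles follow R's default <code>quantile</code> definition.

```r
# Trimean: weighted average of the quartiles, (Q1 + 2 * Q2 + Q3) / 4
trimean <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  (q[1] + 2 * q[2] + q[3]) / 4
}

# Hypothetical sample with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

print(trimean(x))  # 5.5
print(mean(x))     # 7.5
```

In this sample the single large value pulls the mean up to 7.5, while the trimean stays at 5.5, which illustrates its resistance to extreme observations.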

===Applications===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] provides a thorough introduction to EDA. It discusses the basic concepts, objectives, and techniques associated with EDA. It also includes case studies in which EDA is applied. The case studies include eight types of charts for univariate analyses and introduce the concepts of reliability and multi-factor studies. The article gives specific examples with background, output and interpretations of results, and is a useful resource for learning EDA.
* [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] begins with a general introduction to data analysis and explains EDA via examples that employ various graphical analyses. This article serves as a basic and general introduction to the concepts associated with EDA and is a good starting place for studying these concepts.
* The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project] enables complex data visualization; see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relationship between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how does the UR in Alabama change from 2000 to 2006?
[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]

From this chart, we can see that the UR in Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. You may wonder how the UR changed in states from other parts of the country over the same period.
[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]

All the states appear to follow similar patterns. Now, let’s study relationships between UR and HPI in a single state, say Alabama, across this time span.
[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]

The chart above suggests that HPI increases through time in Alabama, while UR increases at first and then exhibits a sharp drop between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to the three states from different regions, we generate the following chart:
[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]

We can now address the question: is there any association between UR and HPI among all the states based on the chart?
The motion chart, however, makes the study much more interesting by exhibiting a moving chart with UR vs. HPI of 51 states from different areas during the period from 2000 to 2006. This allows us to get an idea of the changing values over the years among all states. You’re welcome to play with the data to see how the chart changes using the link listed above.

===Software ===
* [http://www.socr.umich.edu/html/cha SOCR Charts]
* [http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/ R EDA functions]

===Problems===
* Work on problems in [http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm Uniform Random Numbers and Random Walk from this Case Study].

* Two random samples were taken to determine backpack load difference between seniors and freshmen, in pounds. The following are the summaries:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Year|| Mean || SD || Median || Min ||Max || Range|| Count
|-
| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
|-
| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
|-
|}
</center>

* Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose One Answer:
: (a) Histograms
: (b) Dot Plots
: (c) Scatter Plots
: (d) Box Plots

* School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
: (a) box plot
: (b) scatter plot
: (c) line plot
: (d) dot plot

* What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
: (a) bar charts
: (b) side by side boxplot
: (c) histograms
: (d) pie charts

* There is a company in which a very small minority of males (3%) receives three times the median salary of all males, and a very small minority of females (3%) receives one-third of the median salary of all females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
: (a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
: (b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
: (c) Need to have the actual data to compare the shape of the boxplots.
: (d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

* A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds, heart beats per minute, smoker or non-smoker, and single or married. He wants to examine the relationship between: 1) heart beat per minute and weight, and 2) smoking and marital status. Choose one answer.
: (a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
: (b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
: (c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
: (d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

* As part of an experiment in perception, 160 University of Michigan psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
: (a) The histogram of times could be symmetrical and not normal with major outliers.
: (b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
: (c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
: (d) The histogram of times could be normal with no major outliers.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_EDA_Plots SOCR]
* [http://en.wikipedia.org/wiki/Exploratory_data_analysis Exploratory data analysis Wikipedia]
* [http://en.wikipedia.org/wiki/Scatter_plot Scatter plots Wikipedia]
* [http://en.wikipedia.org/wiki/Median_polish Median polish Wikipedia]
* [http://en.wikipedia.org/wiki/Trimean Trimean Wikipedia]
<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}

SMHS EDA

2014-08-31T17:37:59Z

Zhenxunw: /* Trimean */

==[[SMHS| Scientific Methods for Health Sciences]] - Exploratory Data Analysis (EDA), Charts and Plots ==

===Overview===
* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just description of things (meta-data). Data types can be divided into two big categories of quantitative (numerical information) and qualitative data (descriptive information).

*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test or weight of girls in the fourth grade are both quantitative data. Quantitative data (discrete or continuous) is often referred to as the measurable data and this type of data allows scientists to perform various arithmetic operations, such as addition, multiplication, functional-evaluation, or to find parameters of a population. There are two major types of quantitative data: discrete and continuous.
**Discrete data results from either a finite, or infinite but countable, possible options for the values present in a given discrete data set and the values of this data type can constitute a sequence of isolated or separated points on the real number line.
**Continuous quantitative data results from infinite and dense possible values that the observations can take on.

*''Qualitative'' data cannot be expressed as numbers. Examples of qualitative data elements include gender, religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and qualitative categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyze data sets to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].

===Motivation===
The feel of data comes clearly from the application of various graphical techniques, which serves as a perfect window to human perspective and sense. The primary goal of EDA is to maximize the analyst’s insight into a data set and into the underlying structure of the data set. To get a feel for the data, it is not enough for the analyst to know what is in the data, he or she must also know what is not in the data, and the only way to do that is to draw on our own pattern recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The [http://en.wikipedia.org/wiki/Exploratory_data_analysis#EDA_development main objectives of EDA] are to:
* Suggest hypotheses about the causes of observed phenomena;
* Assess (parametric) assumptions on which statistical inference will be based;
* Support the selection of appropriate statistical tools and techniques;
* Provide a basis for further data collection through surveys and experiments.

===Theory===
Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

====[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]====
[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]] is an efficient way for presenting data, especially for comparing multiple groups of data. In the box plot, we can mark-off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the 50% of the data. The upper edge of the box represents the 75th percentile, while the lower edge is the 25th percentile. The median is represented by a line drawn in the middle of the box. If the median is not in the middle of the box then the data are skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers. Outliers are observations below $ Q_1-1.5(IQR) $ or above $ Q_3+1.5(IQR) $, where $ Q_1 $ is the 25th percentile, $ Q_3 $ is the 75th percentile, and $ IQR=Q_3-Q_1 $ (the interquartile range). The advantage of a box plot is that it provides graphically the location and the spread of the data set, it provides an idea about the skewness of the data set, and can provide a comparison between variables by constructing a side-by-side box plots.
<center>[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]</center>

====[[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]====
Histograms represent a graphical visualization of tabulated frequencies or counts of data within equal spaced partition of the range of the data. It shows what proportion of measurements that fall into each of the categories defined by the partition of the data range space.
[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]

* Comment: Compare the two series from the histogram above, we can easily tell that the pattern of series 2 if more obvious compared to series 1. Our intuition may come from: series 1 has more extreme values across five days, for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st) while that for Jan 4th is almost -12. However values for series 2 are all above 0 and fluctuated between 5 and 20.

[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]

* Comment: The Dot chart above gives a clear picture of the values of all the data points and makes the fundamental measurements easily readable. We can tell that most of the values of the data fluctuate between 1 and 7 with mean 3.9 and median 4. There are two obvious outliers valued -2 and 10.

====[[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]====
[http://en.wikipedia.org/wiki/Scatter_plot Scatter plots] use Cartesian coordinates to display values for two variables for a set of data, which is displayed as a collection of points. The value of variable is determined by the position on the horizontal and vertical axis.

[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]

* Comment: The x and y axes display the values of the two variables, and each point drawn in the chart represents a pair of values, one for each variable.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most data points lie along the line, except for two outliers at (4,8) and (1,5). For most data points, as x increases, the paired y value decreases at the same rate as x or more slowly. We may infer a negative linear association between X and Y.

For the third series, a straight line does not describe the association between X and Y; a quadratic pattern would work better here.

====[[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]====
In a Quantile-Quantile (QQ) plot, the observed values are plotted against the theoretical quantiles of a reference distribution. A line of good fit is drawn to show how the data values behave relative to the theoretical distribution. If F() is a cumulative distribution function, then a quantile (q) of order p, also known as a percentile when p is expressed as a percentage, is defined as a solution to the equation $F(q)=p$, that is, $q=F^{-1}(p)$.

[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]

*Comment: From the chart above, we can see that the data roughly follow a normal distribution, since most of the data points (shown in red) lie alongside the line. However, the fit is not tight, because some data points lie fairly far from the line. We may also suspect that the sample is not fully representative of the population because of its limited size.

====Median polish====
[http://en.wikipedia.org/wiki/Median_polish Median polish] is an EDA procedure proposed by John Tukey. It fits an additive model to data in a two-way layout table, of the form row effect + column effect + overall median. It is an iterative algorithm that removes row and column trends by repeatedly computing and subtracting medians along the rows and columns of the table.
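
A minimal Python sketch of the median polish iteration is given below, assuming a small, complete two-way table stored as a list of rows. The function name <code>median_polish</code>, the stopping rule (<code>max_iter</code>, <code>tol</code>) and the example table are illustrative assumptions; established implementations may differ in detail.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Decompose table[i][j] into overall + row effect + column effect + residual."""
    n_rows, n_cols = len(table), len(table[0])
    residuals = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median from the residuals to the row effect.
        row_meds = [median(row) for row in residuals]
        for i in range(n_rows):
            row_eff[i] += row_meds[i]
            residuals[i] = [v - row_meds[i] for v in residuals[i]]
        # Re-centre the column effects and absorb their median into the overall value.
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Column sweep: move each column's median to the column effect.
        col_meds = [median(residuals[i][j] for i in range(n_rows))
                    for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_meds[j]
            for i in range(n_rows):
                residuals[i][j] -= col_meds[j]
        # Re-centre the row effects in the same way.
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer changes anything.
        if all(abs(m) <= tol for m in row_meds + col_meds):
            break
    return overall, row_eff, col_eff, residuals


# Hypothetical table built as 10 + row effect (-1, 0, 1) + column effect (-2, 0, 2).
table = [[7, 9, 11],
         [8, 10, 12],
         [9, 11, 13]]
overall, row_eff, col_eff, residuals = median_polish(table)
print(overall, row_eff, col_eff)  # 10.0 [-1.0, 0.0, 1.0] [-2.0, 0.0, 2.0]
```

Because the example table is exactly additive, all residuals are zero after polishing. For real data, the residuals hold whatever the additive row-plus-column model does not explain.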

==== Trimean====
[http://en.wikipedia.org/wiki/Trimean Trimean] is a measure of a probability distribution’s location, defined as a weighted average of the distribution’s median ($ Q_2 $) and its two quartiles ($ Q_1 $ and $ Q_3 $). It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is a remarkably efficient estimator of the population mean, especially for large data sets (say, more than 100 points) from a symmetric population.
<center> $ TM=\frac{Q_1+2Q_2+Q_3}{4} $ </center>
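
As a small worked example of the formula above, the Python sketch below computes the trimean of a made-up sample. The function name <code>trimean</code> and the data are hypothetical, and the quartiles again assume the "inclusive" convention of <code>statistics.quantiles</code>.

```python
from statistics import mean, quantiles


def trimean(values):
    """Weighted average of the quartiles: (Q1 + 2*Q2 + Q3) / 4."""
    # Quartile convention is an assumption; other conventions shift the result slightly.
    q1, q2, q3 = quantiles(sorted(values), n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


# Made-up sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(trimean(sample))         # 5.0
print(round(mean(sample), 2))  # 15.11
```

Here $ Q_1=3 $, $ Q_2=5 $ and $ Q_3=7 $, so the trimean is (3 + 10 + 7)/4 = 5. It is unaffected by the extreme value 100, which pulls the arithmetic mean up to about 15.1.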

===Applications===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] provides a thorough introduction to EDA. It discusses the basic concepts, objectives, and techniques associated with EDA, and includes case studies in which EDA is applied. The case studies include eight types of charts for univariate analyses and introduce the concepts of reliability and multi-factor studies. With specific examples that give background, output and interpretation of results, the article is a useful resource for learning EDA.
* [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] begins with a general introduction to data analysis and explains EDA via examples that employ various graphical analyses. This article serves as a basic and general introduction to the concepts associated with EDA and is a good starting place for studying these concepts.
* The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project] enables complex data visualization; see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relationship between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how does the UR in Alabama change from 2000 to 2006?
[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]

From this chart, we can see that the UR in Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. What about the UR in states from other parts of the country over the same period?
[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]

All the states appear to follow similar patterns. Now, let’s study relationships between UR and HPI in a single state, say Alabama, across this time span.
[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]

The chart above suggests that HPI increases through time in Alabama, while UR increases at first and then exhibits a sharp drop between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to the three states from different regions, we generate the following chart:
[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]

We can now address the question: is there any association between UR and HPI among all the states based on the chart?
The motion chart, however, makes the study much more interesting by exhibiting a moving chart with UR vs. HPI of 51 states from different areas during the period from 2000 to 2006. This allows us to get an idea of the changing values over the years among all states. You’re welcome to play with the data to see how the chart changes using the link listed above.

===Software ===
* [http://www.socr.umich.edu/html/cha SOCR Charts]
* [http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/ R EDA functions]

===Problems===
* Work on problems in [http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm Uniform Random Numbers and Random Walk from this Case Study].

* Two random samples were taken to determine backpack load difference between seniors and freshmen, in pounds. The following are the summaries:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! Year !! Mean !! SD !! Median !! Min !! Max !! Range !! Count
|-
| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
|-
| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
|-
|}
</center>

* Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose One Answer:
: (a) Histograms
: (b) Dot Plots
: (c) Scatter Plots
: (d) Box Plots

* School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
: (a) box plot
: (b) scatter plot
: (c) line plot
: (d) dot plot

* What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
: (a) bar charts
: (b) side by side boxplot
: (c) histograms
: (d) pie charts

* There is a company in which a very small minority of males (3%) receives three times the median salary of all males, and a very small minority of females (3%) receives one-third of the median salary of all females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
: (a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
: (b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
: (c) Need to have the actual data to compare the shape of the boxplots.
: (d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

* A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heart beats per minute; smoker or non-smoker; single or married. He wants to examine the relationship between: 1) heart beats per minute and weight, and 2) smoking and marital status. Choose one answer.
: (a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
: (b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
: (c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
: (d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

* As part of an experiment in perception, 160 University of Michigan psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes, with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
: (a) The histogram of times could be symmetrical and not normal with major outliers.
: (b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
: (c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
: (d) The histogram of times could be normal with no major outliers.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_EDA_Plots SOCR]
* [http://en.wikipedia.org/wiki/Exploratory_data_analysis Exploratory data analysis Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}

SMHS EDA

2014-08-31T17:36:21Z

Zhenxunw: /* Median polish */

==[[SMHS| Scientific Methods for Health Sciences]] - Exploratory Data Analysis (EDA), Charts and Plots ==

===Overview===
* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just descriptions of things (meta-data). Data types fall into two broad categories: quantitative (numerical information) and qualitative (descriptive information).

*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test and the weights of girls in the fourth grade are both quantitative data. Quantitative data is often referred to as measurable data; it allows scientists to perform various arithmetic operations, such as addition, multiplication and functional evaluation, or to estimate parameters of a population. There are two major types of quantitative data: discrete and continuous.
**Discrete data take values from a finite, or infinite but countable, set of possible options; the values form a sequence of isolated (separated) points on the real number line.
**Continuous quantitative data take values from an infinite and dense set of possible values.

*''Qualitative'' data cannot be expressed as numbers. Examples of qualitative data elements include gender and religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and qualitative categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].

===Motivation===
A feel for the data comes from applying various graphical techniques, which serve as a window onto human perception. The primary goal of EDA is to maximize the analyst’s insight into a data set and its underlying structure. To get a feel for the data, it is not enough for the analyst to know what is in the data; he or she must also know what is not in the data. The only way to do that is to draw on our own pattern-recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The [http://en.wikipedia.org/wiki/Exploratory_data_analysis#EDA_development main objectives of EDA] are to:
* Suggest hypotheses about the causes of observed phenomena;
* Assess (parametric) assumptions on which statistical inference will be based;
* Support the selection of appropriate statistical tools and techniques;
* Provide a basis for further data collection through surveys and experiments.

===Theory===
Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

====[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]====
A [[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]] is an efficient way of presenting data, especially for comparing multiple groups. The box plot marks off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the middle 50% of the data: its upper edge is the 75th percentile and its lower edge is the 25th percentile. The median is drawn as a line inside the box; if this line is not near the middle of the box, the data are likely skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers, in which case the whiskers stop at the most extreme observations that are not outliers. Outliers are observations below $ Q_1-1.5(IQR) $ or above $ Q_3+1.5(IQR) $, where $ Q_1 $ is the 25th percentile, $ Q_3 $ is the 75th percentile, and $ IQR=Q_3-Q_1 $ (the interquartile range). The advantages of a box plot are that it shows graphically the location and spread of the data set, gives an idea of its skewness, and allows comparison between variables through side-by-side box plots.
<center>[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]</center>

====[[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]====
A histogram is a graphical visualization of tabulated frequencies or counts of data within an equally spaced partition of the data range. It shows the proportion of measurements that fall into each of the bins defined by this partition.
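
The tabulation behind a histogram can be sketched in a few lines of Python. The sketch assumes equal-width bins spanning the observed range, with the maximum value placed in the last bin; the function name <code>histogram_counts</code> and the sample values are hypothetical.

```python
def histogram_counts(values, num_bins):
    """Count observations in num_bins equal-width bins over [min, max]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; the range cannot be partitioned")
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    proportions = [c / len(values) for c in counts]
    return counts, proportions


# Made-up sample: three bins of width 3 covering the range 1 to 10.
counts, proportions = histogram_counts([1, 2, 2, 3, 5, 6, 8, 9, 9, 10], num_bins=3)
print(counts)       # [4, 2, 4]
print(proportions)  # [0.4, 0.2, 0.4]
```

The bar heights of the histogram are either the counts or the proportions; the two versions have the same shape and differ only in the scale of the vertical axis.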
[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]

* Comment: Comparing the two series in the histogram above, we can easily tell that the pattern of series 2 is clearer than that of series 1. Series 1 has more extreme values across the five days: the values for Jan 1st and Jan 3rd are very high (almost 55 for Jan 1st), while the value for Jan 4th is almost -12. The values for series 2, in contrast, are all above 0 and fluctuate between 5 and 20.

[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]

* Comment: The dot chart above gives a clear picture of the values of all the data points and makes the basic summary measurements easy to read. Most of the values lie between 1 and 7, with mean 3.9 and median 4. There are two obvious outliers, at -2 and 10.

====[[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]====
[http://en.wikipedia.org/wiki/Scatter_plot Scatter plots] use Cartesian coordinates to display the values of two variables for a set of data, shown as a collection of points. The value of one variable determines each point's position on the horizontal axis, and the value of the other determines its position on the vertical axis.

[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]

* Comment: The x and y axes display the values of the two variables, and each point drawn in the chart represents a pair of values, one for each variable.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most data points lie along the line, except for two outliers at (4,8) and (1,5). For most data points, as x increases, the paired y value decreases at the same rate as x or more slowly. We may infer a negative linear association between X and Y.

For the third series, a straight line does not describe the association between X and Y; a quadratic pattern would work better here.
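
The three kinds of pattern described above can also be checked numerically. The Python sketch below uses the Pearson correlation coefficient, a measure not introduced in the text above that captures linear association only. The data are made up to mimic an increasing, a decreasing and a quadratic series; they are not the values shown in the figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
increasing = [2 * v for v in x]        # y rises with x
decreasing = [10 - v for v in x]       # y falls as x rises
quadratic = [(v - 3) ** 2 for v in x]  # U-shaped pattern

print(pearson_r(x, increasing))  # 1.0
print(pearson_r(x, decreasing))  # -1.0
print(pearson_r(x, quadratic))   # 0.0
```

A correlation of zero for the quadratic series does not mean that X and Y are unrelated. It means only that a straight line does not capture the relationship, which is why the scatter plot itself should always be inspected.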

====[[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]====
In a Quantile-Quantile (QQ) plot, the observed values are plotted against the theoretical quantiles of a reference distribution. A line of good fit is drawn to show how the data values behave relative to the theoretical distribution. If F() is a cumulative distribution function, then a quantile (q) of order p, also known as a percentile when p is expressed as a percentage, is defined as a solution to the equation $F(q)=p$, that is, $q=F^{-1}(p)$.
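
The relation $q=F^{-1}(p)$ is what generates the theoretical axis of a normal QQ plot. The Python sketch below pairs each sorted observation with a standard normal quantile. The plotting positions $p_i=(i-0.5)/n$ are an assumption (one common convention among several), and the function name <code>normal_qq_pairs</code> and the sample values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_pairs(values):
    """Return (theoretical quantile, observed value) pairs for a normal QQ plot."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n  # assumed plotting position for the i-th smallest value
        pairs.append((std_normal.inv_cdf(p), observed))  # q = F^{-1}(p)
    return pairs


# Made-up sample of five observations.
for theoretical, observed in normal_qq_pairs([4.1, 2.0, 3.3, 5.2, 2.9]):
    print(round(theoretical, 3), observed)
```

Plotting these pairs gives the QQ chart: if the sample is roughly normal, the points fall close to a straight line.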

[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]

*Comment: From the chart above, we can see that the data roughly follow a normal distribution, since most of the data points (shown in red) lie alongside the line. However, the fit is not tight, because some data points lie fairly far from the line. We may also suspect that the sample is not fully representative of the population because of its limited size.

====Median polish====
[http://en.wikipedia.org/wiki/Median_polish Median polish] is an EDA procedure proposed by John Tukey. It fits an additive model to data in a two-way layout table, of the form row effect + column effect + overall median. It is an iterative algorithm that removes row and column trends by repeatedly computing and subtracting medians along the rows and columns of the table.

==== Trimean====
Trimean is a measure of a probability distribution’s location, defined as a weighted average of the distribution’s median ($ Q_2 $) and its two quartiles ($ Q_1 $ and $ Q_3 $). It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is a remarkably efficient estimator of the population mean, especially for large data sets (say, more than 100 points) from a symmetric population.
<center> $ TM=\frac{Q_1+2Q_2+Q_3}{4} $ </center>

===Applications===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] provides a thorough introduction to EDA. It discusses the basic concepts, objectives, and techniques associated with EDA, and includes case studies in which EDA is applied. The case studies include eight types of charts for univariate analyses and introduce the concepts of reliability and multi-factor studies. With specific examples that give background, output and interpretation of results, the article is a useful resource for learning EDA.
* [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] begins with a general introduction to data analysis and explains EDA via examples that employ various graphical analyses. This article serves as a basic and general introduction to the concepts associated with EDA and is a good starting place for studying these concepts.
* The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project] enables complex data visualization; see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relationship between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how does the UR in Alabama change from 2000 to 2006?
[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]

From this chart, we can see that the UR in Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. What about the UR in states from other parts of the country over the same period?
[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]

All the states appear to follow similar patterns. Now, let’s study relationships between UR and HPI in a single state, say Alabama, across this time span.
[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]

The chart above suggests that HPI increases through time in Alabama, while UR increases at first and then exhibits a sharp drop between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to the three states from different regions, we generate the following chart:
[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]

We can now address the question: is there any association between UR and HPI among all the states based on the chart?
The motion chart, however, makes the study much more interesting by exhibiting a moving chart with UR vs. HPI of 51 states from different areas during the period from 2000 to 2006. This allows us to get an idea of the changing values over the years among all states. You’re welcome to play with the data to see how the chart changes using the link listed above.

===Software ===
* [http://www.socr.umich.edu/html/cha SOCR Charts]
* [http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/ R EDA functions]

===Problems===
* Work on problems in [http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm Uniform Random Numbers and Random Walk from this Case Study].

* Two random samples were taken to determine backpack load difference between seniors and freshmen, in pounds. The following are the summaries:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! Year !! Mean !! SD !! Median !! Min !! Max !! Range !! Count
|-
| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
|-
| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
|-
|}
</center>

* Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose One Answer:
: (a) Histograms
: (b) Dot Plots
: (c) Scatter Plots
: (d) Box Plots

* School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
: (a) box plot
: (b) scatter plot
: (c) line plot
: (d) dot plot

* What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
: (a) bar charts
: (b) side by side boxplot
: (c) histograms
: (d) pie charts

* There is a company in which a very small minority of males (3%) receives three times the median salary of all males, and a very small minority of females (3%) receives one-third of the median salary of all females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
: (a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
: (b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
: (c) Need to have the actual data to compare the shape of the boxplots.
: (d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

* A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heart beats per minute; smoker or non-smoker; single or married. He wants to examine the relationship between: 1) heart beats per minute and weight, and 2) smoking and marital status. Choose one answer.
: (a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
: (b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
: (c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
: (d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

* As part of an experiment in perception, 160 University of Michigan psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes, with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
: (a) The histogram of times could be symmetrical and not normal with major outliers.
: (b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
: (c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
: (d) The histogram of times could be normal with no major outliers.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_EDA_Plots SOCR]
* [http://en.wikipedia.org/wiki/Exploratory_data_analysis Exploratory data analysis Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}

SMHS EDA

2014-08-31T17:28:39Z

Zhenxunw: /* Scatter plot */

==[[SMHS| Scientific Methods for Health Sciences]] - Exploratory Data Analysis (EDA), Charts and Plots ==

===Overview===
* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just descriptions of things (meta-data). Data types fall into two broad categories: quantitative (numerical information) and qualitative (descriptive information).

*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test and the weights of girls in the fourth grade are both quantitative data. Quantitative data is often referred to as measurable data; it allows scientists to perform various arithmetic operations, such as addition, multiplication and functional evaluation, or to estimate parameters of a population. There are two major types of quantitative data: discrete and continuous.
**Discrete data take values from a finite, or infinite but countable, set of possible options; the values form a sequence of isolated (separated) points on the real number line.
**Continuous quantitative data take values from an infinite and dense set of possible values.

*''Qualitative'' data cannot be expressed as numbers. Examples of qualitative data elements include gender and religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and qualitative categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].

===Motivation===
A feel for the data comes from applying various graphical techniques, which serve as a window onto human perception. The primary goal of EDA is to maximize the analyst’s insight into a data set and its underlying structure. To get a feel for the data, it is not enough for the analyst to know what is in the data; he or she must also know what is not in the data. The only way to do that is to draw on our own pattern-recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The [http://en.wikipedia.org/wiki/Exploratory_data_analysis#EDA_development main objectives of EDA] are to:
* Suggest hypotheses about the causes of observed phenomena;
* Assess (parametric) assumptions on which statistical inference will be based;
* Support the selection of appropriate statistical tools and techniques;
* Provide a basis for further data collection through surveys and experiments.

===Theory===
Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

====[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]====
A [[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]] is an efficient way of presenting data, especially for comparing multiple groups. The box plot marks off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the middle 50% of the data: its upper edge is the 75th percentile and its lower edge is the 25th percentile. The median is drawn as a line inside the box; if this line is not near the middle of the box, the data are likely skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers, in which case the whiskers stop at the most extreme observations that are not outliers. Outliers are observations below $ Q_1-1.5(IQR) $ or above $ Q_3+1.5(IQR) $, where $ Q_1 $ is the 25th percentile, $ Q_3 $ is the 75th percentile, and $ IQR=Q_3-Q_1 $ (the interquartile range). The advantages of a box plot are that it shows graphically the location and spread of the data set, gives an idea of its skewness, and allows comparison between variables through side-by-side box plots.
<center>[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]</center>

====[[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]====
Histograms give a graphical visualization of tabulated frequencies or counts of data within an equally spaced partition of the range of the data. A histogram shows what proportion of the measurements falls into each of the categories (bins) defined by that partition.
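
The tabulation behind a histogram can be sketched in a few lines of Python. This is an illustrative sketch only: the function name <code>histogram_counts</code>, the number of bins and the data are assumptions chosen for the example, and the largest value is placed in the last bin by convention.

```python
def histogram_counts(values, num_bins):
    """Count observations in equal-width bins spanning the data range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range cannot be partitioned")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value falls in the last bin rather than opening a new one.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    proportions = [c / len(values) for c in counts]
    return counts, proportions


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical measurements
counts, proportions = histogram_counts(data, num_bins=4)
print(counts)
print(proportions)
```

The bar heights of the histogram are either the counts or the proportions, depending on the scale chosen for the vertical axis.
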
[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]

* Comment: Comparing the two series in the histogram above, we can easily tell that the pattern of series 2 is more obvious than that of series 1. Our intuition may come from the fact that series 1 has more extreme values across the five days: for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st), while the value for Jan 4th is almost -12. The values for series 2, however, are all above 0 and fluctuate between 5 and 20.

[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]

* Comment: The Dot chart above gives a clear picture of the values of all the data points and makes the fundamental measurements easily readable. We can tell that most of the values of the data fluctuate between 1 and 7 with mean 3.9 and median 4. There are two obvious outliers valued -2 and 10.

====[[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]====
[http://en.wikipedia.org/wiki/Scatter_plot Scatter plots] use Cartesian coordinates to display the values of two variables for a set of data, shown as a collection of points. The value of each variable is given by the point's position along the horizontal or vertical axis.

[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]

* Comment: The x and y axes display the values of the two variables, and each point drawn in the chart is a coordinate pair giving the values of both variables.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most data points lie along the line, except for two outliers, (4,8) and (1,5). For most data points, as x increases, the paired y value decreases no faster than x increases. We may infer a negative linear association between X and Y.

For the third series, a straight line does not describe the association between X and Y; a quadratic pattern would work better here.

====[[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]====
In Quantile-Quantile (QQ) plots, the observed values are plotted against the quantiles of a theoretical distribution. A line of good fit is drawn to show how the data values behave relative to the theoretical distribution. If $F$ is a cumulative distribution function, then a quantile $q$, also known as a percentile, is defined as a solution to the equation $F(q)=p$, that is, $q=F^{-1}(p)$.
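
As a rough illustration of $q=F^{-1}(p)$, the sketch below pairs each sorted observation with a quantile of a normal reference distribution. It makes several assumptions not stated in the text: the function name <code>normal_qq_points</code> is hypothetical, the plotting positions are $p_i=(i-0.5)/n$ (one common convention among several), and the reference distribution uses the sample mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted observation with a theoretical normal quantile."""
    data = sorted(values)
    n = len(data)
    # Assumption: the reference normal distribution is fitted to the sample.
    reference = NormalDist(mean(data), stdev(data))
    points = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n                  # plotting position in (0, 1)
        theoretical = reference.inv_cdf(p)  # q = F^{-1}(p)
        points.append((theoretical, observed))
    return points


sample = [2.1, 3.4, 3.9, 4.0, 4.6, 5.2, 6.3]  # hypothetical data
points = normal_qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

If the data are close to normal, the printed pairs lie near the line y = x; systematic departures from that line suggest skewness or heavy tails.
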

[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]

*Comment: From the chart above, we can see that the data broadly follow a normal distribution, since most of the data points (shown in red) lie alongside the line. However, the fit is not tight, because some data points lie quite far from the line. The sampled data may also not be fully representative of the population because of the limited sample size.

====Median polish====
Median polish is an EDA procedure proposed by [http://en.wikipedia.org/wiki/John_Tukey John Tukey]. It fits an additive model to data in a two-way layout table, of the form row effect + column effect + overall median. It is an iterative algorithm that removes trends by computing medians along the coordinates (rows and columns) of the spatial domain D.
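
The iteration can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name <code>median_polish</code>, the stopping rule (<code>max_iterations</code>, <code>tolerance</code>) and the example table are hypothetical. Each sweep subtracts row medians and then column medians from the residuals, moving them into the row effects, column effects and overall term.

```python
from statistics import median


def median_polish(table, max_iterations=10, tolerance=1e-9):
    """Decompose a two-way table into overall + row + column + residual."""
    residuals = [list(row) for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    overall = 0.0
    row_effects = [0.0] * n_rows
    col_effects = [0.0] * n_cols
    for _ in range(max_iterations):
        # Row sweep: move each row's median out of the residuals.
        row_medians = [median(row) for row in residuals]
        for i in range(n_rows):
            row_effects[i] += row_medians[i]
            residuals[i] = [r - row_medians[i] for r in residuals[i]]
        shift = median(col_effects)
        overall += shift
        col_effects = [c - shift for c in col_effects]
        # Column sweep: move each column's median out of the residuals.
        col_medians = [median(residuals[i][j] for i in range(n_rows))
                       for j in range(n_cols)]
        for j in range(n_cols):
            col_effects[j] += col_medians[j]
            for i in range(n_rows):
                residuals[i][j] -= col_medians[j]
        shift = median(row_effects)
        overall += shift
        row_effects = [r - shift for r in row_effects]
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_medians + col_medians) < tolerance:
            break
    return overall, row_effects, col_effects, residuals


# Hypothetical table that is exactly additive: 10 + row effect + column effect.
table = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
overall, rows, cols, resid = median_polish(table)
print(overall, rows, cols, resid)
```

For this exactly additive table the residuals are all zero; with real data, large residuals point to cells that the additive model does not explain.
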

==== Trimean====
The trimean is a measure of a probability distribution’s location, defined as a weighted average of the distribution’s median ($Q_2$) and its two quartiles ($Q_1$ and $Q_3$). It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is also a remarkably efficient estimator of the population mean, especially for a large data set (say, more than 100 points) from a symmetric population.
<center> $ \frac{Q_1+2Q_2+Q_3}{4} $ </center>
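
As a small worked example of the formula, the sketch below computes the trimean of a hypothetical sample and compares it with the mean. The function name <code>trimean</code> and the data are illustrative, and the quartiles again use the "inclusive" rule of Python's <code>statistics</code> module, which is one of several conventions.

```python
import statistics


def trimean(values):
    """Weighted average (Q1 + 2*Q2 + Q3) / 4 of the quartiles and the median."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one extreme value
tm = trimean(sample)
print(tm, statistics.mean(sample))
```

Here the quartiles are 3.25, 5.5 and 7.75, so the trimean is 5.5, while the mean is pulled up to 7.5 by the single extreme value.
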

===Applications===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] provides a thorough introduction to EDA. It discusses the basic concepts, objectives, and techniques associated with EDA, and includes case studies in which EDA is applied. The case studies include eight types of charts for univariate analyses and introduce the concepts of reliability and multi-factor studies. Because it gives specific examples with background, output and interpretation of results, the article is a useful resource for learning EDA.
* [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] begins with a general introduction to data analysis and explains EDA via examples that employ various graphical analyses. This article serves as a basic and general introduction to the concepts associated with EDA and is a good starting place for studying these concepts.
* The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project] enables complex data visualization; see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relationship between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how does the UR in Alabama change from 2000 to 2006?
[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]

From this chart, we can see that the UR in Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. You may wonder how the UR behaves in states from other parts of the country over the same period.
[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]

All the states appear to follow similar patterns. Now, let’s study relationships between UR and HPI in a single state, say Alabama, across this time span.
[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]

The chart above suggests that HPI increases through time in Alabama, while UR increases at first and then exhibits a sharp drop between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to the three states from different regions, we generate the following chart:
[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]

We can now address the question: is there any association between UR and HPI among all the states based on the chart?
The motion chart, however, makes the study much more interesting by exhibiting a moving chart with UR vs. HPI of 51 states from different areas during the period from 2000 to 2006. This allows us to get an idea of the changing values over the years among all states. You’re welcome to play with the data to see how the chart changes using the link listed above.

===Software ===
* [http://www.socr.umich.edu/html/cha SOCR Charts]
* [http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/ R EDA functions]

===Problems===
* Work on problems in [http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm Uniform Random Numbers and Random Walk from this Case Study].

* Two random samples were taken to determine backpack load difference between seniors and freshmen, in pounds. The following are the summaries:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Year|| Mean || SD || Median || Min ||Max || Range|| Count
|-
| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
|-
| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
|-
|}
</center>

* Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose One Answer:
: (a) Histograms
: (b) Dot Plots
: (c) Scatter Plots
: (d) Box Plots

* School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
: (a) box plot
: (b) scatter plot
: (c) line plot
: (d) dot plot

* What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
: (a) bar charts
: (b) side by side boxplot
: (c) histograms
: (d) pie charts

* There is a company in which a very small minority of males (3%) receives three times the median salary of all males, and a very small minority of females (3%) receives one-third of the median salary of all females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
: (a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
: (b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
: (c) Need to have the actual data to compare the shape of the boxplots.
: (d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

* A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heart beats per minute; smoker or non-smoker; single or married. He wants to examine the relationship between: 1) heart beats per minute and weight, and 2) smoking and marital status. Choose one answer.
: (a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
: (b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
: (c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
: (d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

* As part of an experiment in perception, 160 University of Michigan psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
: (a) The histogram of times could be symmetrical and not normal with major outliers.
: (b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
: (c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
: (d) The histogram of times could be normal with no major outliers.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_EDA_Plots SOCR]
* [http://en.wikipedia.org/wiki/Exploratory_data_analysis Exploratory data analysis Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}

SMHS EDA

2014-08-31T17:21:40Z

Zhenxunw: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Exploratory Data Analysis (EDA), Charts and Plots ==

===Overview===
* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just description of things (meta-data). Data types can be divided into two big categories of quantitative (numerical information) and qualitative data (descriptive information).

*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test or the weights of girls in the fourth grade are both quantitative data. Quantitative data is often referred to as measurable data, and it allows scientists to perform various arithmetic operations, such as addition, multiplication and functional evaluation, or to estimate parameters of a population. There are two major types of quantitative data: discrete and continuous.
**Discrete data takes values from a finite, or infinite but countable, set of possible options; the values form a sequence of isolated (separated) points on the real number line.
**Continuous quantitative data takes values from an infinite and dense set of possible values.

*''Qualitative'' data cannot be expressed as numbers. Examples of qualitative data elements include gender and religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and nominal categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyze data sets to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].

===Motivation===
A feel for the data comes from applying a range of graphical techniques, which serve as a window onto the data that suits human perception. The primary goal of EDA is to maximize the analyst’s insight into a data set and its underlying structure. To get a feel for the data, it is not enough for the analyst to know what is in the data; he or she must also know what is not in the data, and the only way to do that is to draw on our own pattern recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The [http://en.wikipedia.org/wiki/Exploratory_data_analysis#EDA_development main objectives of EDA] are to:
* Suggest hypotheses about the causes of observed phenomena;
* Assess (parametric) assumptions on which statistical inference will be based;
* Support the selection of appropriate statistical tools and techniques;
* Provide a basis for further data collection through surveys and experiments.

===Theory===
Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

====[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]====
A [[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]] is an efficient way of presenting data, especially for comparing multiple groups of data. The box plot marks off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the middle 50% of the data: its upper edge is the 75th percentile and its lower edge is the 25th percentile. The median is represented by a line drawn inside the box; if that line is not near the middle of the box, the data are skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers, in which case the whiskers end at the most extreme observations that are not outliers. Outliers are observations below $ Q_1-1.5(IQR) $ or above $ Q_3+1.5(IQR) $, where $ Q_1 $ is the 25th percentile, $ Q_3 $ is the 75th percentile, and $ IQR=Q_3-Q_1 $ (the interquartile range). The advantage of a box plot is that it shows graphically the location and the spread of the data set, gives an idea of its skewness, and allows comparison between variables through side-by-side box plots.
<center>[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]</center>

====[[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]====
Histograms give a graphical visualization of tabulated frequencies or counts of data within an equally spaced partition of the range of the data. A histogram shows what proportion of the measurements falls into each of the categories (bins) defined by that partition.
[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]

* Comment: Comparing the two series in the histogram above, we can easily tell that the pattern of series 2 is more obvious than that of series 1. Our intuition may come from the fact that series 1 has more extreme values across the five days: for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st), while the value for Jan 4th is almost -12. The values for series 2, however, are all above 0 and fluctuate between 5 and 20.

[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]

* Comment: The Dot chart above gives a clear picture of the values of all the data points and makes the fundamental measurements easily readable. We can tell that most of the values of the data fluctuate between 1 and 7 with mean 3.9 and median 4. There are two obvious outliers valued -2 and 10.

====[[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]====
Scatter plots use Cartesian coordinates to display the values of two variables for a set of data, shown as a collection of points. The value of each variable is given by the point's position along the horizontal or vertical axis.

[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]

* Comment: The x and y axes display the values of the two variables, and each point drawn in the chart is a coordinate pair giving the values of both variables.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most data points lie along the line, except for two outliers, (4,8) and (1,5). For most data points, as x increases, the paired y value decreases no faster than x increases. We may infer a negative linear association between X and Y.

For the third series, a straight line does not describe the association between X and Y; a quadratic pattern would work better here.

====[[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]====
In Quantile-Quantile (QQ) plots, the observed values are plotted against the quantiles of a theoretical distribution. A line of good fit is drawn to show how the data values behave relative to the theoretical distribution. If $F$ is a cumulative distribution function, then a quantile $q$, also known as a percentile, is defined as a solution to the equation $F(q)=p$, that is, $q=F^{-1}(p)$.

[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]

*Comment: From the chart above, we can see that the data broadly follow a normal distribution, since most of the data points (shown in red) lie alongside the line. However, the fit is not tight, because some data points lie quite far from the line. The sampled data may also not be fully representative of the population because of the limited sample size.

====Median polish====
Median polish is an EDA procedure proposed by [http://en.wikipedia.org/wiki/John_Tukey John Tukey]. It fits an additive model to data in a two-way layout table, of the form row effect + column effect + overall median. It is an iterative algorithm that removes trends by computing medians along the coordinates (rows and columns) of the spatial domain D.

==== Trimean====
The trimean is a measure of a probability distribution’s location, defined as a weighted average of the distribution’s median ($Q_2$) and its two quartiles ($Q_1$ and $Q_3$). It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is also a remarkably efficient estimator of the population mean, especially for a large data set (say, more than 100 points) from a symmetric population.
<center> $ \frac{Q_1+2Q_2+Q_3}{4} $ </center>

===Applications===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] provides a thorough introduction to EDA. It discusses the basic concepts, objectives, and techniques associated with EDA, and includes case studies in which EDA is applied. The case studies include eight types of charts for univariate analyses and introduce the concepts of reliability and multi-factor studies. Because it gives specific examples with background, output and interpretation of results, the article is a useful resource for learning EDA.
* [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] begins with a general introduction to data analysis and explains EDA via examples that employ various graphical analyses. This article serves as a basic and general introduction to the concepts associated with EDA and is a good starting place for studying these concepts.
* The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project] enables complex data visualization; see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relationship between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how does the UR in Alabama change from 2000 to 2006?
[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]

From this chart, we can see that the UR in Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. You may wonder how the UR behaves in states from other parts of the country over the same period.
[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]

All the states appear to follow similar patterns. Now, let’s study relationships between UR and HPI in a single state, say Alabama, across this time span.
[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]

The chart above suggests that HPI increases through time in Alabama, while UR increases at first and then exhibits a sharp drop between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to the three states from different regions, we generate the following chart:
[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]

We can now address the question: is there any association between UR and HPI among all the states based on the chart?
The motion chart, however, makes the study much more interesting by exhibiting a moving chart with UR vs. HPI of 51 states from different areas during the period from 2000 to 2006. This allows us to get an idea of the changing values over the years among all states. You’re welcome to play with the data to see how the chart changes using the link listed above.

===Software ===
* [http://www.socr.umich.edu/html/cha SOCR Charts]
* [http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/ R EDA functions]

===Problems===
* Work on problems in [http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm Uniform Random Numbers and Random Walk from this Case Study].

* Two random samples were taken to determine backpack load difference between seniors and freshmen, in pounds. The following are the summaries:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Year|| Mean || SD || Median || Min ||Max || Range|| Count
|-
| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
|-
| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
|-
|}
</center>

* Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose One Answer:
: (a) Histograms
: (b) Dot Plots
: (c) Scatter Plots
: (d) Box Plots

* School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
: (a) box plot
: (b) scatter plot
: (c) line plot
: (d) dot plot

* What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
: (a) bar charts
: (b) side by side boxplot
: (c) histograms
: (d) pie charts

* There is a company in which a very small minority of males (3%) receives three times the median salary of all males, and a very small minority of females (3%) receives one-third of the median salary of all females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
: (a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
: (b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
: (c) Need to have the actual data to compare the shape of the boxplots.
: (d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

* A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heart beats per minute; smoker or non-smoker; single or married. He wants to examine the relationship between: 1) heart beats per minute and weight, and 2) smoking and marital status. Choose one answer.
: (a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
: (b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
: (c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
: (d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

* As part of an experiment in perception, 160 University of Michigan psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
: (a) The histogram of times could be symmetrical and not normal with major outliers.
: (b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
: (c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
: (d) The histogram of times could be normal with no major outliers.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_EDA_Plots SOCR]
* [http://en.wikipedia.org/wiki/Exploratory_data_analysis Exploratory data analysis Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}

SMHS EDA

2014-08-31T17:20:17Z

Zhenxunw: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Exploratory Data Analysis (EDA), Charts and Plots ==

===Overview===
* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just description of things (meta-data). Data types can be divided into two big categories of quantitative (numerical information) and qualitative data (descriptive information).

*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test or the weights of girls in the fourth grade are both quantitative data. Quantitative data is often referred to as measurable data, and it allows scientists to perform various arithmetic operations, such as addition, multiplication and functional evaluation, or to estimate parameters of a population. There are two major types of quantitative data: discrete and continuous.
**Discrete data takes values from a finite, or infinite but countable, set of possible options; the values form a sequence of isolated (separated) points on the real number line.
**Continuous quantitative data takes values from an infinite and dense set of possible values.

*''Qualitative'' data cannot be expressed as numbers. Examples of qualitative data elements include gender and religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and nominal categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyze data sets to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].

===Motivation===
A feel for the data comes from applying a range of graphical techniques, which serve as a window onto the data that suits human perception. The primary goal of EDA is to maximize the analyst’s insight into a data set and its underlying structure. To get a feel for the data, it is not enough for the analyst to know what is in the data; he or she must also know what is not in the data, and the only way to do that is to draw on our own pattern recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The [http://en.wikipedia.org/wiki/Exploratory_data_analysis main objectives of EDA] are to:
* Suggest hypotheses about the causes of observed phenomena;
* Assess (parametric) assumptions on which statistical inference will be based;
* Support the selection of appropriate statistical tools and techniques;
* Provide a basis for further data collection through surveys and experiments.

===Theory===
Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

====[[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]====
The [[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]] is an efficient way of presenting data, especially for comparing multiple groups of data. The box plot marks off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the middle 50% of the data: its upper edge represents the 75th percentile and its lower edge the 25th percentile. The median is represented by a line drawn inside the box. If the median is not in the middle of the box, this suggests that the data are skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers, in which case the whiskers end at the most extreme observations that are not outliers and the outliers are plotted as individual points. Outliers are observations below $ Q_1-1.5(IQR) $ or above $ Q_3+1.5(IQR) $, where $ Q_1 $ is the 25th percentile, $ Q_3 $ is the 75th percentile, and $ IQR=Q_3-Q_1 $ (the interquartile range). The advantages of a box plot are that it shows graphically the location and the spread of the data set, gives an idea of its skewness, and allows comparison between variables through side-by-side box plots.
<center>[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]</center>
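
The quantities drawn in a box plot can be computed directly. The following is a minimal sketch in Python, not part of the original SOCR materials: the function name <code>box_plot_summary</code> and the sample values are hypothetical, and the choice of the "inclusive" quartile method is an assumption (several quartile conventions exist and give slightly different values).

```python
import statistics


def box_plot_summary(values):
    """Return the five-number summary, IQR, fences, whisker ends and outliers."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
    # method="inclusive" is an assumed convention, not the only valid one.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Observations beyond the fences are flagged as outliers.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    # Whiskers end at the most extreme observations that are not outliers.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "iqr": iqr, "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical sample with one unusually large value.
sample = [2, 3, 4, 4, 5, 5, 6, 7, 8, 30]
summary = box_plot_summary(sample)
print(summary)
```

In this made-up sample the value 30 lies above $ Q_3+1.5(IQR) $, so it would be drawn as a separate point and the upper whisker would stop at 8.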

====[[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]====
A histogram is a graphical visualization of tabulated frequencies or counts of data within equally spaced partitions (bins) of the range of the data. It shows what proportion of the measurements falls into each of the categories defined by the partition of the data range.
[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]

* Comment: Comparing the two series in the histogram above, we can easily tell that the pattern of series 2 is more obvious than that of series 1. Our intuition may come from the following: series 1 has more extreme values across the five days; for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st), while the value for Jan 4th is almost -12. The values for series 2, however, are all above 0 and fluctuate between 5 and 20.

[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]

* Comment: The Dot chart above gives a clear picture of the values of all the data points and makes the fundamental measurements easily readable. We can tell that most of the values of the data fluctuate between 1 and 7 with mean 3.9 and median 4. There are two obvious outliers valued -2 and 10.
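
The equal-width binning behind a histogram can be sketched in a few lines of Python. This is an illustrative sketch only: the function name <code>histogram_counts</code> and the data are hypothetical, and placing the maximum value in the last bin is an assumed convention.

```python
def histogram_counts(values, num_bins):
    """Tabulate counts and proportions over equally spaced bins of the data range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range cannot be partitioned")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # Assumed convention: the maximum value is counted in the last bin.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    proportions = [c / len(values) for c in counts]
    return edges, counts, proportions


# Hypothetical measurements, partitioned into 4 bins of equal width.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts, proportions = histogram_counts(data, num_bins=4)
print(edges)        # bin boundaries
print(counts)       # number of observations per bin
print(proportions)  # share of observations per bin
```

The bar heights of the histogram are either the counts or the proportions; the shape is the same in both cases.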

====[[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]====
Scatter plots use Cartesian coordinates to display the values of two variables for a set of data, which is shown as a collection of points. For each point, the value of one variable determines its position on the horizontal axis and the value of the other variable determines its position on the vertical axis.

[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]

* Comment: The x and y axes display values for two variables and all the data points drawn in the chart are coordinates indicating a pair of values for both variables.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most of the data lie along the line, except for two outliers, (4,8) and (1,5). For most data points, as x increases, the paired y value decreases no faster than x increases. We may infer a negative linear association between X and Y.

For the third series, a straight line does not describe the association between X and Y; a quadratic pattern would work better here.
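
A visual impression of linear association can be complemented by a number. The sketch below computes the Pearson correlation coefficient, a technique introduced here for illustration and not taken from the chart above. The function name <code>pearson_r</code> and all three series are hypothetical; they only mimic an increasing, a decreasing and a quadratic pattern.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical series (not the data shown in the figure).
xs = [-3, -2, -1, 0, 1, 2, 3]
increasing = [2 * x + 1 for x in xs]   # exact positive linear relation
decreasing = [-x + 5 for x in xs]      # exact negative linear relation
quadratic = [x * x for x in xs]        # symmetric quadratic relation

r_increasing = pearson_r(xs, increasing)
r_decreasing = pearson_r(xs, decreasing)
r_quadratic = pearson_r(xs, quadratic)
print(r_increasing, r_decreasing, r_quadratic)
```

A correlation near zero for the quadratic series does not mean that X and Y are unrelated, only that the relation is not linear. This is why the scatter plot should be inspected before relying on a single summary number.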

====[[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]====
In Quantile-Quantile (QQ) plots, the observed values are plotted against the theoretical quantiles of a reference distribution. A line of good fit is drawn to show the behavior of the data values against the theoretical distribution. If $F$ is a cumulative distribution function, then for a probability $p$ the quantile $q$ (also called the $100p$-th percentile) is defined as a solution to the equation $F(q)=p$, that is, $q=F^{-1}(p)$.

[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]

*Comment: From the chart above, we can see that the data roughly follow a normal distribution, since most of the data points (shown in red) lie alongside the line. However, the fit is not tight, because some data points lie fairly far from the line. We can also infer that the sampled data may not be representative enough of the population because of the limited size of the sample.
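
The coordinates of a normal QQ plot follow from the definition $q=F^{-1}(p)$. The sketch below is one possible way to compute them in Python and is not the procedure used by SOCR Charts. The function name <code>normal_qq_points</code>, the sample, and the plotting positions $p_i=(i-0.5)/n$ are assumptions; other plotting-position conventions are also in use.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each ordered observation with the matching normal quantile."""
    data = sorted(values)
    n = len(data)
    # Reference distribution: a normal with the sample mean and standard deviation.
    reference = NormalDist(mean(data), stdev(data))
    points = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n                   # assumed plotting position
        theoretical = reference.inv_cdf(p)  # q = F^{-1}(p)
        points.append((theoretical, observed))
    return points


# Hypothetical sample of 7 measurements.
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 7.0, 7.4]
qq_points = normal_qq_points(sample)
for theoretical, observed in qq_points:
    print(round(theoretical, 3), observed)
```

If the sample comes from a normal distribution, the printed pairs lie close to the line y = x; systematic departures from that line indicate skewness or heavy tails.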

====Median polish====
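Median polish is an EDA procedure proposed by [http://en.wikipedia.org/wiki/John_Tukey John Tukey]. It fits an additive model to data in a two-way layout table, of the form overall median + row effect + column effect, with whatever remains left as residuals. The algorithm is iterative: it alternately subtracts the row medians and the column medians from the table and accumulates them into the row and column effects until the adjustments become negligible. In spatial applications, the same idea is used to remove trends by computing medians along the coordinates of the spatial domain.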
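
The iteration can be sketched as follows. This is a minimal illustration of the row and column sweeps, assuming a complete table with no missing cells; the function name <code>median_polish</code>, the stopping rule and the example table are hypothetical.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Fit overall + row effect + column effect to a two-way table by median sweeps."""
    residuals = [list(row) for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    overall = 0.0
    row_effects = [0.0] * n_rows
    col_effects = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median from the residuals into the row effect.
        row_meds = [median(row) for row in residuals]
        for i in range(n_rows):
            row_effects[i] += row_meds[i]
            for j in range(n_cols):
                residuals[i][j] -= row_meds[i]
        # Keep the column effects centered; their median goes to the overall term.
        shift = median(col_effects)
        overall += shift
        col_effects = [c - shift for c in col_effects]
        # Column sweep: move each column's median into the column effect.
        col_meds = [median(residuals[i][j] for i in range(n_rows))
                    for j in range(n_cols)]
        for j in range(n_cols):
            col_effects[j] += col_meds[j]
            for i in range(n_rows):
                residuals[i][j] -= col_meds[j]
        # Keep the row effects centered in the same way.
        shift = median(row_effects)
        overall += shift
        row_effects = [r - shift for r in row_effects]
        # Stop once a full sweep no longer changes anything noticeably.
        if max(abs(m) for m in row_meds + col_meds) < tol:
            break
    return overall, row_effects, col_effects, residuals


# Hypothetical table that is exactly additive: 10 + row effect + column effect.
table = [[10 + r + c for c in (-2, 0, 2)] for r in (-1, 0, 1)]
overall, row_effects, col_effects, residuals = median_polish(table)
print(overall, row_effects, col_effects)
```

For real data the residuals will not vanish; large residuals point to cells that the additive model does not explain.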

==== Trimean====
The trimean is a measure of a probability distribution’s location, defined as a weighted average of the distribution’s median and its two quartiles. It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is a remarkably efficient estimator of the population mean, especially for large data sets (say, more than 100 points) from a symmetric population. With $ Q_1 $, $ Q_2 $ and $ Q_3 $ denoting the first quartile, the median and the third quartile, the trimean is
<center> $ TM=\frac{Q_1+2Q_2+Q_3}{4} $ </center>
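
As a small worked example, the trimean can be computed from the three quartiles. The sketch below uses hypothetical data, and the "inclusive" quartile method is an assumption; other quartile conventions may give a slightly different result.

```python
import statistics


def trimean(values):
    """Weighted average of the quartiles: (Q1 + 2*Q2 + Q3) / 4."""
    # method="inclusive" is an assumed quartile convention.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
tm = trimean(sample)
sample_mean = statistics.mean(sample)
print(tm, sample_mean)
```

Here the quartiles are 3, 5 and 7, so the trimean is 5, while the extreme value 100 pulls the sample mean up to about 15.1. This illustrates why the trimean is regarded as a robust measure of location.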

===Applications===
* [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] provides a thorough introduction to EDA. It discusses the basic concepts, objectives, and techniques associated with EDA. It also includes case studies in which EDA is applied. The case studies include eight types of charts for univariate analyses and introduce the concepts of reliability and multi-factor studies. The article gives specific examples with background, output and interpretation of results, and is a useful resource for learning EDA.
* [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] begins with a general introduction to data analysis and explains EDA via examples that employ various graphical analyses. This article serves as a basic and general introduction to the concepts associated with EDA and is a good starting place for studying these concepts.
* The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project] enables complex data visualization; see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relationship between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how does the UR in Alabama change from 2000 to 2006?
[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]

From this chart, we can see that the UR in Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. You may wonder how the UR behaves in states from other parts of the country over the same period.
[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]

All the states appear to follow similar patterns. Now, let’s study relationships between UR and HPI in a single state, say Alabama, across this time span.
[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]

The chart above suggests that HPI increases through time in Alabama, while UR increases at first and then exhibits a sharp drop between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to the three states from different regions, we generate the following chart:
[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]

We can now address the question: is there any association between UR and HPI among all the states based on the chart?
The motion chart, however, makes the study much more interesting by exhibiting a moving chart with UR vs. HPI of 51 states from different areas during the period from 2000 to 2006. This allows us to get an idea of the changing values over the years among all states. You’re welcome to play with the data to see how the chart changes using the link listed above.

===Software ===
* [http://www.socr.umich.edu/html/cha SOCR Charts]
* [http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/ R EDA functions]

===Problems===
* Work on problems in [http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm Uniform Random Numbers and Random Walk from this Case Study].

* Two random samples were taken to compare the backpack loads (in pounds) of seniors and freshmen. The summaries are as follows:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! Year !! Mean !! SD !! Median !! Min !! Max !! Range !! Count
|-
| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
|-
| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
|-
|}
</center>

* Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose One Answer:
: (a) Histograms
: (b) Dot Plots
: (c) Scatter Plots
: (d) Box Plots

* School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
: (a) box plot
: (b) scatter plot
: (c) line plot
: (d) dot plot

* What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
: (a) bar charts
: (b) side by side boxplot
: (c) histograms
: (d) pie charts

* There is a company in which a very small minority of males (3%) receives three times the median salary of all males, and a very small minority of females (3%) receives one-third of the median salary of all females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
: (a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
: (b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
: (c) Need to have the actual data to compare the shape of the boxplots.
: (d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

* A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heartbeats per minute; smoker or non-smoker; single or married. He wants to examine the relationship between: 1) heartbeats per minute and weight, and 2) smoking and marital status. Choose one answer.
: (a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
: (b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
: (c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
: (d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

* As part of an experiment in perception, 160 University of Michigan psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
: (a) The histogram of times could be symmetrical and not normal with major outliers.
: (b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
: (c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
: (d) The histogram of times could be normal with no major outliers.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_EDA_Plots SOCR]
* [http://en.wikipedia.org/wiki/Exploratory_data_analysis Exploratory data analysis Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}