Latest revision as of 08:15, 21 October 2016
SOCR Data - Latin Letters Frequency Distributions in Different Languages
Data Description
The data table below presents the average frequencies of the 26 most common Latin letters in different languages. Letter frequencies in text are studied in cryptography. The exact letter frequency distribution underlying a given language is unknown and varies with time, since writers all write slightly differently and are affected by their culture. Modern International Morse code encodes the most frequent letters with the shortest symbols: if the Morse alphabet is arranged into groups of letters that require equal amounts of time to transmit, and these groups are sorted in increasing order of transmission time, the most frequent letters fall in the earliest groups. Similar ideas are used in modern data-compression techniques such as Huffman coding.
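To illustrate the link between letter frequencies and compression, the following is a minimal sketch of Huffman coding applied to the English column of the data table. The function name `huffman_code` and the dictionary `ENGLISH` are hypothetical names chosen for this example, and the frequencies are the rounded values from the table, so several rare letters carry a weight of 0.00.

```python
import heapq
from itertools import count

# English column of the data table (rounded to two decimals).
ENGLISH = {
    "a": 0.08, "b": 0.01, "c": 0.03, "d": 0.04, "e": 0.13, "f": 0.02,
    "g": 0.02, "h": 0.06, "i": 0.07, "j": 0.00, "k": 0.01, "l": 0.04,
    "m": 0.02, "n": 0.07, "o": 0.08, "p": 0.02, "q": 0.00, "r": 0.06,
    "s": 0.06, "t": 0.09, "u": 0.03, "v": 0.01, "w": 0.02, "x": 0.00,
    "y": 0.02, "z": 0.00,
}


def huffman_code(freqs):
    """Return a dict mapping each symbol to its binary Huffman codeword."""
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(w, next(tie), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


codes = huffman_code(ENGLISH)
for letter in sorted(codes, key=lambda s: (len(codes[s]), s)):
    print(letter, codes[letter])
```

As with Morse code, the most frequent letter (e) receives one of the shortest codewords, while letters with near-zero frequency receive the longest ones. The exact bit patterns depend on how ties are broken and should not be read as canonical.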
Letter frequencies, like word frequencies, tend to vary by writer, subject and language. Accurate average letter frequencies are obtained by analyzing large amounts of representative text.
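The estimation step can be sketched as follows. This is an illustrative example only, not the procedure used to build the table: the function name `letter_frequencies` is hypothetical, and grouping every non a-z alphabetic character under "Others" is an assumption made to mirror the last row of the data table.

```python
from collections import Counter
from string import ascii_lowercase


def letter_frequencies(text):
    """Relative frequencies of a-z, plus an 'Others' bucket for other letters."""
    counts = Counter()
    for ch in text.lower():
        if ch in ascii_lowercase:
            counts[ch] += 1
        elif ch.isalpha():
            # Assumption: accented and other non-basic letters go to "Others".
            counts["Others"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters")
    return {key: counts[key] / total for key in (*ascii_lowercase, "Others")}


sample = "Letter frequencies tend to vary by writer, subject and language."
freqs = letter_frequencies(sample)
top = sorted(freqs, key=freqs.get, reverse=True)[:5]
print(top)
```

A single sentence like the sample above gives a noisy estimate; stable values require a large and representative corpus.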
Sources
Data Table
Letter | English | French | German | Spanish | Portuguese | Esperanto | Italian | Turkish | Swedish | Polish | Toki_Pona | Dutch | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a | 0.08 | 0.08 | 0.07 | 0.13 | 0.15 | 0.12 | 0.12 | 0.12 | 0.09 | 0.08 | 0.17 | 0.07 | 0.11 |
b | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.03 | 0.01 | 0.01 | 0.00 | 0.02 | 0.01 |
c | 0.03 | 0.03 | 0.03 | 0.05 | 0.04 | 0.01 | 0.05 | 0.01 | 0.01 | 0.04 | 0.00 | 0.01 | 0.03 |
d | 0.04 | 0.04 | 0.05 | 0.06 | 0.05 | 0.03 | 0.04 | 0.05 | 0.05 | 0.03 | 0.00 | 0.06 | 0.04 |
e | 0.13 | 0.15 | 0.17 | 0.14 | 0.13 | 0.09 | 0.12 | 0.09 | 0.10 | 0.07 | 0.07 | 0.19 | 0.12 |
f | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.00 | 0.02 | 0.00 | 0.00 | 0.01 | 0.01 |
g | 0.02 | 0.01 | 0.03 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.03 | 0.01 | 0.00 | 0.03 | 0.02 |
h | 0.06 | 0.01 | 0.05 | 0.01 | 0.01 | 0.00 | 0.02 | 0.01 | 0.02 | 0.01 | 0.00 | 0.02 | 0.02 |
i | 0.07 | 0.08 | 0.08 | 0.06 | 0.06 | 0.10 | 0.11 | 0.08 | 0.05 | 0.07 | 0.15 | 0.07 | 0.08 |
j | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.01 | 0.02 | 0.03 | 0.01 | 0.01 |
k | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.04 | 0.00 | 0.05 | 0.03 | 0.03 | 0.05 | 0.02 | 0.02 |
l | 0.04 | 0.05 | 0.03 | 0.05 | 0.03 | 0.06 | 0.07 | 0.06 | 0.05 | 0.03 | 0.10 | 0.04 | 0.05 |
m | 0.02 | 0.03 | 0.03 | 0.03 | 0.05 | 0.03 | 0.03 | 0.04 | 0.04 | 0.02 | 0.04 | 0.02 | 0.03 |
n | 0.07 | 0.07 | 0.10 | 0.07 | 0.05 | 0.08 | 0.07 | 0.07 | 0.09 | 0.05 | 0.12 | 0.10 | 0.08 |
o | 0.08 | 0.05 | 0.03 | 0.09 | 0.11 | 0.09 | 0.10 | 0.02 | 0.04 | 0.07 | 0.08 | 0.06 | 0.07 |
p | 0.02 | 0.03 | 0.01 | 0.03 | 0.03 | 0.03 | 0.03 | 0.01 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
q | 0.00 | 0.01 | 0.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
r | 0.06 | 0.07 | 0.07 | 0.07 | 0.07 | 0.06 | 0.06 | 0.07 | 0.08 | 0.04 | 0.00 | 0.06 | 0.06 |
s | 0.06 | 0.08 | 0.07 | 0.08 | 0.08 | 0.06 | 0.05 | 0.03 | 0.06 | 0.04 | 0.04 | 0.04 | 0.06 |
t | 0.09 | 0.07 | 0.06 | 0.05 | 0.05 | 0.05 | 0.06 | 0.03 | 0.09 | 0.02 | 0.05 | 0.07 | 0.06 |
u | 0.03 | 0.06 | 0.04 | 0.04 | 0.05 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 |
v | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01 | 0.02 | 0.00 | 0.00 | 0.03 | 0.01 |
w | 0.02 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.03 | 0.02 | 0.01 |
x | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
y | 0.02 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.03 | 0.01 | 0.03 | 0.00 | 0.00 | 0.01 |
z | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | 0.02 | 0.00 | 0.05 | 0.00 | 0.01 | 0.01 |
Others | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.12 | 0.06 | 0.20 | 0.00 | 0.00 | 0.04 |
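The last column appears to be the unweighted mean of the twelve language columns, rounded to two decimals. That reading is an assumption; the short check below confirms it for four rows copied from the table (a, e, t and Others) and does not verify the remaining rows. The names `ROWS` and `row_mean` are hypothetical.

```python
# Each entry: (values for the 12 languages in table order, reported average).
ROWS = {
    "a": ([0.08, 0.08, 0.07, 0.13, 0.15, 0.12,
           0.12, 0.12, 0.09, 0.08, 0.17, 0.07], 0.11),
    "e": ([0.13, 0.15, 0.17, 0.14, 0.13, 0.09,
           0.12, 0.09, 0.10, 0.07, 0.07, 0.19], 0.12),
    "t": ([0.09, 0.07, 0.06, 0.05, 0.05, 0.05,
           0.06, 0.03, 0.09, 0.02, 0.05, 0.07], 0.06),
    "Others": ([0.00, 0.03, 0.00, 0.00, 0.00, 0.02,
                0.00, 0.12, 0.06, 0.20, 0.00, 0.00], 0.04),
}


def row_mean(values):
    """Unweighted mean across languages, rounded like the table."""
    return round(sum(values) / len(values), 2)


for letter, (values, reported) in ROWS.items():
    print(letter, row_mean(values), reported)
```

Because the mean is unweighted, a language such as Toki Pona counts as much as English, regardless of the size of the underlying corpus.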
Graphs
- Histogram (HistogramChartDemo7) of the English letters
- Stacked Bar-Chart (StackedBarChartDemo3, under BarCharts --> CategoryPlots) of all letters across each language
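See also
- A machine learning based case-study for optical character recognition (OCR) from handwritten notes: https://umich.instructure.com/courses/38100/files/folder/Case_Studies/16_HandwrittenLetters
- SOCR Home page: http://www.socr.ucla.edu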