Difference between revisions of "SOCR LetterFrequencyData"

Revision as of 14:49, 31 May 2010

SOCR Data - Latin Letters Frequency Distributions in Different Languages

Data Description

The data table below presents the average frequencies of the 26 most common Latin letters for different languages. Letter frequencies in text are studied in cryptography. The exact letter frequency distribution underlying a given language is unknown and varies with time, since writers tend to write slightly differently and are affected by their culture. Modern International Morse code encodes the most frequent letters with the shortest symbols: when the Morse alphabet is arranged into groups of letters that require equal amounts of time to transmit and these groups are sorted in increasing order of transmission time, frequent letters such as e and t fall into the first groups. Similar ideas are used in modern data-compression techniques such as Huffman coding.
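The link between letter frequency and code length can be illustrated with a minimal Huffman-coding sketch in Python. It is only an illustration under stated assumptions, not part of the SOCR resources: the function name huffman_code_lengths is hypothetical, and the weights are the English column of the data table on this page (letters a-z only, ignoring the "Others" row).

```python
import heapq
from itertools import count

# English column of the data table on this page (letters a-z only).
ENGLISH = {
    "a": 0.08, "b": 0.01, "c": 0.03, "d": 0.04, "e": 0.13, "f": 0.02,
    "g": 0.02, "h": 0.06, "i": 0.07, "j": 0.00, "k": 0.01, "l": 0.04,
    "m": 0.02, "n": 0.07, "o": 0.08, "p": 0.02, "q": 0.00, "r": 0.06,
    "s": 0.06, "t": 0.09, "u": 0.03, "v": 0.01, "w": 0.02, "x": 0.00,
    "y": 0.02, "z": 0.00,
}


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs.

    Hypothetical helper for illustration; it tracks only code lengths,
    not the actual bit patterns.
    """
    if len(freqs) < 2:
        # A single symbol still needs one bit to be transmitted.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [sym]) for sym, w in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper, so its code grows by one bit.
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        merged = syms1 + syms2
        for sym in merged:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return lengths


lengths = huffman_code_lengths(ENGLISH)
for letter in sorted(ENGLISH, key=lambda s: (lengths[s], s)):
    print(letter, ENGLISH[letter], lengths[letter])
```

Frequent letters such as e receive the shortest codes and rare letters such as q or z the longest, which mirrors the design principle behind Morse code described above. The exact lengths depend on how ties between equal weights are broken.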

Letter frequencies, like word frequencies, tend to vary by writer, subject and language. Accurate average letter frequencies are obtained by analyzing large amounts of representative text.
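As a rough sketch of how such frequencies might be estimated from a text sample, the following Python snippet counts the letters a-z and normalizes the counts. The function name letter_frequencies is hypothetical, and the sketch assumes that accented and other non-basic letters (the "Others" row of the data table) are simply ignored; a careful analysis would handle them explicitly and would use a large, representative corpus rather than a short string.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: relative frequency} for a-z in text.

    Hypothetical helper: case is folded, and every character outside
    the basic Latin letters a-z is ignored.
    """
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No basic Latin letters in the sample.
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


sample = "Letter frequencies, like word frequencies, tend to vary by writer."
freqs = letter_frequencies(sample)
# Show the five most frequent letters in this (far too small) sample.
for letter in sorted(freqs, key=freqs.get, reverse=True)[:5]:
    print(letter, round(freqs[letter], 3))
```

A sample this short will not reproduce the table values; the estimates stabilize only as the amount of text grows.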

Sources

See these references.

Data Table

Letter	English	French	German	Spanish	Portuguese	Esperanto	Italian	Turkish	Swedish	Polish	Toki_Pona	Dutch	Average
a	0.08	0.08	0.07	0.13	0.15	0.12	0.12	0.12	0.09	0.08	0.17	0.07	0.11
b	0.01	0.01	0.02	0.01	0.01	0.01	0.01	0.03	0.01	0.01	0.00	0.02	0.01
c	0.03	0.03	0.03	0.05	0.04	0.01	0.05	0.01	0.01	0.04	0.00	0.01	0.03
d	0.04	0.04	0.05	0.06	0.05	0.03	0.04	0.05	0.05	0.03	0.00	0.06	0.04
e	0.13	0.15	0.17	0.14	0.13	0.09	0.12	0.09	0.10	0.07	0.07	0.19	0.12
f	0.02	0.01	0.02	0.01	0.01	0.01	0.01	0.00	0.02	0.00	0.00	0.01	0.01
g	0.02	0.01	0.03	0.01	0.01	0.01	0.02	0.01	0.03	0.01	0.00	0.03	0.02
h	0.06	0.01	0.05	0.01	0.01	0.00	0.02	0.01	0.02	0.01	0.00	0.02	0.02
i	0.07	0.08	0.08	0.06	0.06	0.10	0.11	0.08	0.05	0.07	0.15	0.07	0.08
j	0.00	0.01	0.00	0.00	0.00	0.04	0.00	0.00	0.01	0.02	0.03	0.01	0.01
k	0.01	0.00	0.01	0.00	0.00	0.04	0.00	0.05	0.03	0.03	0.05	0.02	0.02
l	0.04	0.05	0.03	0.05	0.03	0.06	0.07	0.06	0.05	0.03	0.10	0.04	0.05
m	0.02	0.03	0.03	0.03	0.05	0.03	0.03	0.04	0.04	0.02	0.04	0.02	0.03
n	0.07	0.07	0.10	0.07	0.05	0.08	0.07	0.07	0.09	0.05	0.12	0.10	0.08
o	0.08	0.05	0.03	0.09	0.11	0.09	0.10	0.02	0.04	0.07	0.08	0.06	0.07
p	0.02	0.03	0.01	0.03	0.03	0.03	0.03	0.01	0.02	0.02	0.04	0.02	0.02
q	0.00	0.01	0.00	0.01	0.01	0.00	0.01	0.00	0.00	0.00	0.00	0.00	0.00
r	0.06	0.07	0.07	0.07	0.07	0.06	0.06	0.07	0.08	0.04	0.00	0.06	0.06
s	0.06	0.08	0.07	0.08	0.08	0.06	0.05	0.03	0.06	0.04	0.04	0.04	0.06
t	0.09	0.07	0.06	0.05	0.05	0.05	0.06	0.03	0.09	0.02	0.05	0.07	0.06
u	0.03	0.06	0.04	0.04	0.05	0.03	0.03	0.03	0.02	0.02	0.03	0.02	0.03
v	0.01	0.02	0.01	0.01	0.02	0.02	0.02	0.01	0.02	0.00	0.00	0.03	0.01
w	0.02	0.00	0.02	0.00	0.00	0.00	0.00	0.00	0.00	0.04	0.03	0.02	0.01
x	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
y	0.02	0.00	0.00	0.01	0.00	0.00	0.00	0.03	0.01	0.03	0.00	0.00	0.01
z	0.00	0.00	0.01	0.01	0.00	0.01	0.00	0.02	0.00	0.05	0.00	0.01	0.01
Others	0	0.03	0	0	0	0.02	0	0.12	0.06	0.2	0	0	0.04
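The last column of the table can be cross-checked against the twelve language columns. The Python sketch below parses a two-row excerpt of the table (rows a and e, copied as tab-separated text) and recomputes the row means. The names EXCERPT and row_means are hypothetical, and the sketch assumes the first column holds the letter and the last column holds the reported average.

```python
import csv
import io

# Header and two rows copied from the data table (tab-separated).
EXCERPT = (
    "Letter\tEnglish\tFrench\tGerman\tSpanish\tPortuguese\tEsperanto\t"
    "Italian\tTurkish\tSwedish\tPolish\tToki_Pona\tDutch\tAverage\n"
    "a\t0.08\t0.08\t0.07\t0.13\t0.15\t0.12\t0.12\t0.12\t0.09\t0.08\t0.17\t0.07\t0.11\n"
    "e\t0.13\t0.15\t0.17\t0.14\t0.13\t0.09\t0.12\t0.09\t0.10\t0.07\t0.07\t0.19\t0.12\n"
)


def row_means(tsv_text):
    """Return {letter: (mean of language columns, reported average)}."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    result = {}
    for row in rows[1:]:  # skip the header row
        values = [float(v) for v in row[1:-1]]  # language columns only
        result[row[0]] = (sum(values) / len(values), float(row[-1]))
    return result


for letter, (mean, reported) in row_means(EXCERPT).items():
    print(letter, round(mean, 4), reported)
```

Because the table entries are themselves rounded to two decimals, a recomputed mean may differ slightly from the reported average for some rows.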

Graphs

Histogram (HistogramChartDemo7) of the English letters

Stacked Bar-Chart (StackedBarChartDemo3, under BarCharts --> CategoryPlots) of all letters across each language

SOCR Home page: http://www.socr.ucla.edu

@@ Line 4: / Line 4: @@
 [[Image:SOCR_Data_Dinov_EnglishLetterFrequency.png|150px|thumbnail|right| [http://en.wikipedia.org/wiki/Letter_frequency English Letter Frequencies] ]]
-The data table below present the average frequencies of the 26 most common Latin letters for different languages. Letter frequencies in text are studied in cryptography. There is no ''exact'' letter frequency distribution underlies a given language, since all writers write slightly differently. Modern International [http://en.wikipedia.org/wiki/Morse_code Morse code] encodes the most frequent letters with the shortest symbols; arranging the Morse alphabet into groups of letters that require equal amounts of time to transmit, and then sorting these groups in increasing order. Similar ideas are used in modern data-compression techniques such as [http://en.wikipedia.org/wiki/Huffman_coding Huffman coding].
+The data table below present the average frequencies of the 26 most common Latin letters for different languages. Letter frequencies in text are studied in cryptography. The exact letter frequency distribution underling a given language is unknown and varies with time, since all writers tend to write slightly differently and are affected by their culture. Modern International [http://en.wikipedia.org/wiki/Morse_code Morse code] encodes the most frequent letters with the shortest symbols; arranging the Morse alphabet into groups of letters that require equal amounts of time to transmit, and then sorting these groups in increasing order. Similar ideas are used in modern data-compression techniques such as [http://en.wikipedia.org/wiki/Huffman_coding Huffman coding].
 Letter frequencies, like word frequencies, tend to vary by writer, subject and language. Accurate average letter frequencies are obtained by analyzing large amounts of ''representative'' text.
