Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling

Kłosowski, Piotr

doi:10.1186/s13636-017-0102-8

Research
Open access
Published: 28 February 2017

EURASIP Journal on Audio, Speech, and Music Processing volume 2017, Article number: 5 (2017)

Abstract

This article presents original results of a statistical analysis of the Polish language, based on orthographic and phonemic language corpora. The phonemic language corpus for Polish was developed by automatic grapheme-to-phoneme conversion of the source orthographic language corpus, obtained from the National Corpus of Polish (NCP). The corpus contains the most frequently used Polish words, written in phonemic notation. The statistical analysis includes calculation of the frequency of occurrence of orthographic and phonemic language components, as well as of their sequences. The statistical language data obtained from this analysis make it possible to develop statistical word-based and phoneme-based language models for Polish. Applying these language models can contribute to improving the efficiency of automatic speech recognition for Polish.

1 Introduction

The main goal of automatic speech recognition (ASR) is the translation of spoken words into text [1]. Modern speech recognition systems require both acoustic and language modelling [2], and both are important parts of the modern statistical approach to speech recognition [3, 4]. Statistical language modelling makes it possible to develop effective large-vocabulary speech recognition systems [5]. Language modelling can be used not only in speech recognition, but also in other areas of speech and language processing, e.g., language recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, and information retrieval.

The main motivation of the research is to improve automatic speech recognition, especially for the Polish language [6, 7]. Additionally, research studies have been conducted on the properties of Polish phonemes [8, 9], speech recognition based on them [10], speaker recognition [11, 12], speaker verification [13–15], and new applications of speech recognition, e.g., automatic speech translation [16].

In particular, good automatic speech recognition performance is achieved with statistical methods [17]. Therefore, the main objective of the research presented in this paper was to perform a statistical analysis of the Polish language based on orthographic and phonemic language corpora, in order to develop statistical word-based and phoneme-based language models and apply them to improve speech recognition for Polish. Statistical language models help to predict the sequence of recognized spoken words and phonemes, so their use can improve the effectiveness of automatic speech recognition based on statistical methods. Developing word-based and phoneme-based language models from statistical language data requires access to large orthographic and phonemic language corpora [18, 19].

2 Orthographic language corpus

One of the largest orthographic corpora of Polish is the National Corpus of Polish (NCP) [20]. The NCP corpus is available to the scientific community, offers great flexibility, and is of considerable scientific value. It provides reference material that reflects the state of the contemporary Polish language and meets the requirements of modern science [21]. It can be used particularly by linguists, but also by computer scientists interested in natural language processing.

The NCP corpus contains over 1500 million words. The corpus is searchable by means of advanced tools, developed by the Institute of Computer Science at the Polish Academy of Sciences, which analyse Polish inflection and Polish sentence structure. The list of sources for the NCP corpus, presented in Table 1, contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts [22].

Table 1 Structure of the NCP corpus [20]

The results of the statistical analyses presented in this paper can be considered representative of the Polish language as a whole, which is justified to a certain extent by the corpus size. However, it is worth remembering that the NCP corpus is still based primarily on written texts. Transcripts of spoken language constitute a smaller percentage of the corpus contents, which may still matter for certain specialized continuous or conversational speech recognition tasks. Table 2 presents the details of the orthographic language corpus content, obtained from the NCP corpus resources.

Table 2 Details of the orthographic language corpus content

3 Phonemic language corpus

3.1 Grapheme-to-phoneme conversion

The phonemic Polish language corpus contains words written in phonemic notation, obtained by automatic grapheme-to-phoneme conversion of an orthographic text. Automatic processing of a natural language very often requires automatic grapheme-to-phoneme conversion, which determines phonemic transcriptions directly from orthographic representations [23].

Phonemes are usually written with specially designed alphabets. The most commonly used alphabet for this purpose is the International Phonetic Alphabet (IPA) [24]. It was created on the basis of the phonetics and phonology of West-European languages, and it is not well adapted to Polish. For Polish, as for other Slavic languages, a special transcription system called the Slavistic Phonetic Alphabet (SPA) is most frequently used [25]. Another frequently used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [26]. SAMPA is a machine-readable phonetic alphabet based on the IPA that uses 7-bit printable ASCII characters. Table 3 presents a set of Polish phonemes and examples of their occurrence in Polish, written with the SPA, IPA, and SAMPA phonetic alphabets.

Table 3 A set of Polish phonemes and examples of their occurrence

Knowledge-based grapheme-to-phoneme (G2P) approaches, unlike data-driven G2P approaches, use rules created by humans or derived from linguistic studies to convert the sequence of graphemes in a word into a sequence of phonemes [27]. Rule-based approaches are typically formulated in the framework of finite state automata and require the formulation of grapheme-to-phoneme conversion rules [28]. The largest contribution to solving the problem of automatic grapheme-to-phoneme conversion for Polish was made by the publications of Maria Steffen-Batóg [29, 30].

The automatic grapheme-to-phoneme conversion process can be described as a function F, defined by the following formula:

$$ F(\alpha) = \beta $$

(1)

where:

$$ \alpha = \alpha_{1}\ldots\alpha_{k}\ldots\alpha_{a} ~~\wedge~~ \alpha_{k}\in X ~~ \forall ~~ (1 \leq k \leq a) $$

(2)

$$ \beta = \beta_{1}\ldots\beta_{k}\ldots\beta_{b} ~~\wedge~~ \beta_{k}\in Y ~~ \forall ~~ (1 \leq k \leq b) $$

(3)

and where a is the length of the orthographic character sequence, b is the length of the phonemic character sequence, X is the set of characters of the Polish orthographic alphabet, extended with special characters, and Y is the set of characters of the Polish phonemic alphabet, described by the Slavistic Phonetic Alphabet:

(4)

(5)

Grapheme-to-phoneme conversion of correctly written orthographic texts in Polish is the transformation of words written in the orthographic alphabet X into a form written in the phonemic alphabet Y. The conversion function F can be defined by a set of formal grapheme-to-phoneme conversion rules specifying how each word α, constructed from the orthographic alphabet X, is transformed into a new word β, constructed from the phonemic alphabet Y. The rules are usually numerous and vary in complexity. The number and complexity of the rules depend on the number of letters in the orthographic alphabet and on the fact that each letter can be pronounced differently in various contexts.

A set of grapheme-to-phoneme conversion rules for Polish was developed by Maria Steffen-Batóg and presented in monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish [29, 30]. The knowledge contained in these monographs was essential for implementing the automatic grapheme-to-phoneme conversion algorithm for Polish. According to Maria Steffen-Batóg, all grapheme-to-phoneme conversion rules relating to one orthographic letter can be stored in one table, called the grapheme-to-phoneme conversion rules table for that letter.

Following the grapheme-to-phoneme conversion rules for Polish described in the literature [29–32], the conversion has been implemented in the Python programming language as an automatic grapheme-to-phoneme conversion application named TransFon [33]. The implementation includes 975 grapheme-to-phoneme conversion rules for 35 orthographic letters in Polish, additional conversion rules for special characters, and the automatic grapheme-to-phoneme conversion algorithm [33]. A block diagram of the conversion algorithm for a single orthographic word is presented in Fig. 1. Because many words have multiple correct pronunciation variants, the implementation includes only the most common, basic variant of the pronunciation. Implementation of additional pronunciation variants is planned for the future. The problem of phonemic transcription of foreign words and acronyms has been solved by using a dictionary in which their phonemic transcriptions are defined.
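The general idea of per-letter rule tables combined with an exception dictionary can be illustrated with a minimal sketch. It is not the TransFon code and does not reproduce the rules from the cited monographs: the names `RULES`, `EXCEPTIONS` and `g2p`, the handful of right-context rules, and the dictionary entry are all assumptions chosen for illustration, and phenomena such as voicing assimilation or final devoicing are ignored.

```python
# Toy rule table: for each orthographic letter, an ordered list of
# (right_context, phonemes, letters_consumed). All entries are illustrative
# assumptions, not the rules from the cited monographs or from TransFon.
RULES = {
    "s": [("z", ["S"], 2), ("", ["s"], 1)],
    "c": [("z", ["tS"], 2), ("h", ["x"], 2), ("", ["ts"], 1)],
    "r": [("z", ["Z"], 2), ("", ["r"], 1)],
    "w": [("", ["v"], 1)],
    "ł": [("", ["w"], 1)],
}
# Letters whose SAMPA-like label equals the letter itself in this toy example.
for letter in "aeoukmntp":
    RULES[letter] = [("", [letter], 1)]

# Hypothetical exception dictionary for foreign words and acronyms.
EXCEPTIONS = {"pc": ["p", "e", "ts", "e"]}


def g2p(word):
    """Convert one orthographic word into a list of SAMPA-like phoneme labels."""
    word = word.lower()
    if word in EXCEPTIONS:
        # Dictionary-based conversion takes precedence over the rules.
        return list(EXCEPTIONS[word])
    phonemes = []
    pos = 0
    while pos < len(word):
        letter = word[pos]
        if letter not in RULES:
            # Stop on a missing rule table so that the gap can be registered.
            raise KeyError(f"no rule table for letter {letter!r} in {word!r}")
        for right_context, output, consumed in RULES[letter]:
            # The first rule whose right context matches is applied.
            if word.startswith(right_context, pos + 1):
                phonemes.extend(output)
                pos += consumed
                break
        else:
            raise KeyError(f"no matching rule for {letter!r} in {word!r}")
    return phonemes


print(g2p("szkoła"))
print(g2p("czas"))
```

Raising an error on a missing rule, rather than guessing a phoneme, keeps unhandled contexts visible while the rule tables are still being completed.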

The TransFon application was developed from scratch, without adapting any existing similar tools. It is not the only grapheme-to-phoneme conversion implementation for the Polish language [34–38], but only one of them is available for free use [38]. The implementation can be applied to any task (e.g., phonemic language corpus development for Polish).

Table 4 presents the phonemic transcription examples in Polish, written with the use of the SPA, IPA, and SAMPA phonetic alphabets [25, 26].

Table 4 Phonemic transcription examples in Polish

The TransFon application makes it possible to create the phonemic language corpus solely on the basis of the orthographic source corpus. The phonemic language corpus for Polish was obtained by automatic grapheme-to-phoneme conversion of the orthographic corpus with the TransFon application, in order to perform statistical analysis of the Polish language.

3.2 Evaluation of grapheme-to-phoneme conversion implementation

Evaluation of the automatic grapheme-to-phoneme conversion implementation is crucial. During the implementation for Polish, it was necessary to check and demonstrate that it works properly.

The test procedure for automatic grapheme-to-phoneme conversion implementation consisted of:

Performing a test automatic grapheme-to-phoneme conversion of the orthographic text corpus file containing the 1,943,462 most frequently used unique words in Polish, obtained from the National Corpus of Polish resources [20].
In case of doubt, validating and verifying the conversion results against a Polish language dictionary available online that specifies the correct pronunciation of words in Polish [39].
Registering cases of incorrect automatic grapheme-to-phoneme conversion, conversion errors, and other encountered problems.

The automatic phonemic transcription application was implemented in such a way that the conversion algorithm stopped whenever a grapheme-to-phoneme conversion problem occurred (e.g., when there was no rule allowing for a correct phonemic transcription). This makes it easier to improve and develop the application. In addition, any doubts about the correct pronunciation were resolved with the help of the wiktionary.org service [39]. This solution obviously has some serious limitations: the wiktionary.org dictionary contains only 61,141 Polish words, and only in their basic form. The verification was further complicated by other problems, such as different variants of the correct pronunciation of words and the pronunciation of foreign words in the corpus.

The causes of problems and errors in automatic grapheme-to-phoneme conversion operation were as follows:

errors in the implementation of the grapheme-to-phoneme conversion algorithm and conversion rules,
missing grapheme-to-phoneme conversion rules in the tables (i.e., rules not included in the tables) for some orthographic letter contexts,
problems with grapheme-to-phoneme conversion of foreign words, acronyms, and words that are not present in the Polish language dictionary.

The above problems were solved in the following way:

The errors in the implementation of the grapheme-to-phoneme conversion algorithm and in the conversion rules tables were corrected by modifying the application source code in the Python programming language.
The problem of missing grapheme-to-phoneme conversion rules was solved by adding new conversion rules to the existing tables. New rules were added for the following orthographic letters in some contexts: “i”, “n”, “d”, “z”, “z”, “c”, “f”, “s”.
The problems with foreign words and acronyms were solved by using a dictionary in which the phonemic transcriptions of foreign words and acronyms are defined. As a result, rule-based automatic grapheme-to-phoneme conversion was complemented by a dictionary-based conversion method.

A number of improvements made it possible to increase effectiveness of the grapheme-to-phoneme conversion implementation. Tables 5 and 6 present the word error rate (WER) values of grapheme-to-phoneme conversion implementation, before and after improvements.

Table 5 WER values of the developed G2P conversion implementation, before improvements

Table 6 WER values of the developed G2P conversion implementation, after improvements

The WER value for the 1,943,462 checked unique words was equal to 0.387%. The WER value for the corpus containing 230,301,313 words was equal to 0.030%. The change in WER values before and after the improvements shows that the implemented modifications improved the effectiveness of G2P conversion.
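The gap between the two WER values is what one would expect if conversion errors are concentrated in rare words. The sketch below shows the two ways of counting on invented toy data. It assumes that the first value is computed over unique words and the second over all word occurrences; the function name `word_error_rates` and the data layout are hypothetical.

```python
def word_error_rates(entries):
    """Return (WER over unique words, WER over all word occurrences).

    entries: iterable of (occurrences, is_correct) pairs, one per unique word.
    """
    entries = list(entries)
    unique_total = len(entries)
    unique_errors = sum(1 for _, ok in entries if not ok)
    token_total = sum(count for count, _ in entries)
    token_errors = sum(count for count, ok in entries if not ok)
    return unique_errors / unique_total, token_errors / token_total


# Toy data (not from the paper): one rare word is transcribed incorrectly.
toy = [(900, True), (90, True), (9, True), (1, False)]
unique_wer, corpus_wer = word_error_rates(toy)
print(f"unique-word WER: {unique_wer:.1%}, corpus WER: {corpus_wer:.1%}")
```

In this toy case a quarter of the unique words are wrong, yet only one occurrence in a thousand is affected, because the erroneous word is rare.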

3.3 The developed phonemic language corpus for Polish

The phonemic language corpus for Polish was developed by automatic grapheme-to-phoneme conversion of the source orthographic language corpus file obtained from the NCP corpus resources.

Table 7 presents the details of the phonemic language corpus content.

Table 7 Details of the phonemic language corpus content

The phonemic language corpus contains a list of 1,943,462 Polish words written orthographically, their phonemic transcriptions written in the SAMPA phonemic alphabet, and the number of occurrences of each word in the NCP balanced corpus. The size of the NCP balanced corpus is measured as the sum of all word occurrence counts, which is equal to 230,301,313 words.

A sample section of the developed phonemic language corpus for Polish is presented in Table 8. It should also be noted that the standard SAMPA for Polish includes several sequences of phonemic transcription labels that may cause ambiguity unless separated by spaces or other characters. To avoid this problem, each phoneme is enclosed in square brackets.
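A small sketch shows why delimiting the labels matters. It assumes a transcription format in which every phoneme label is wrapped in its own pair of square brackets, e.g. `[tS][I]`; the exact file layout of the corpus is not specified here, and the helper names are hypothetical. Without delimiters, the SAMPA string `tSI` could be read either as the affricate `tS` followed by `I` or as `t`, `S`, `I`.

```python
import re

# One phoneme label per pair of square brackets (assumed format).
PHONEME = re.compile(r"\[([^\[\]]+)\]")


def split_phonemes(transcription):
    """Split a bracket-delimited transcription into a list of phoneme labels."""
    return PHONEME.findall(transcription)


def join_phonemes(phonemes):
    """Inverse operation: wrap every phoneme label in square brackets."""
    return "".join(f"[{p}]" for p in phonemes)


print(split_phonemes("[tS][I]"))     # affricate followed by a vowel
print(split_phonemes("[t][S][I]"))   # stop, fricative, vowel
```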

Table 8 A sample section of the developed phonemic language corpus for Polish

4 Analysis of the obtained results and discussion

4.1 Statistical analysis of the orthographic and phonemic language corpora

With the use of the orthographic and phonemic language corpora, it was possible to perform statistical analysis of Polish language which includes calculation of the following distributions:

the frequency of the single orthographic word occurrence,
the frequency of the n-word sequence occurrence for n=2,…,5,
the frequency of the phoneme occurrence,
the frequency of the n-phoneme sequence occurrence for n=2,…,5.
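The distributions listed above all reduce to counting n-element sequences and normalising the counts. A minimal sketch with `collections.Counter` follows. The function names and the toy data are assumptions, and the sketch counts phoneme sequences within single words only, weighted by the word's occurrence count; whether the original analysis also counts sequences across word boundaries is not stated here.

```python
from collections import Counter


def ngram_counts(sequences, n):
    """Count n-element sequences.

    sequences: iterable of (tokens, weight) pairs, where tokens is a list of
    words or phonemes and weight is how often that sequence occurs.
    """
    counts = Counter()
    for tokens, weight in sequences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += weight
    return counts


def relative_frequencies(counts):
    """Convert absolute counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


# Toy data (not from the corpus): two transcribed words with occurrence counts.
toy = [(["S", "k", "o", "w", "a"], 3), (["tS", "a", "s"], 1)]
unigrams = ngram_counts(toy, 1)
bigrams = ngram_counts(toy, 2)
print(unigrams.most_common(3))
print(relative_frequencies(bigrams))
```

For word sequences the same function can be applied to tokenised sentences, each with weight 1.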

The frequency distribution of words in the orthographic language corpus is presented in Fig. 2.

A sample of the calculated word occurrence frequencies is presented in Table 9, where 1% corresponds to about 2,303,013 occurrences.

Table 9 Frequency of the word occurrence in the orthographic corpus file

Samples of the calculated occurrence frequencies for two-word and three-word sequences are presented in Tables 10 and 11. The results for four-word and five-word sequences are not presented in this paper, but they can also be helpful in developing advanced word-based language models.

Table 10 Frequency of the two-word sequence occurrence in the orthographic corpus file

Table 11 Frequency of the three-word sequence occurrence in the orthographic corpus file

The frequency distribution of the phonemes in the phonemic language corpus is presented in Fig. 3.

The frequency distributions of the n-phoneme sequences for n=2,…,5 are presented in Fig. 4.

4.2 Evaluation of the obtained results

The results of the statistical analysis of the Polish language, performed with the phonemic language corpus, were compared with other results published in the literature [40–46]. Summary comparisons of the obtained statistical language data with those results are presented in the following tables:

Table 12 presents the occurrence frequency of Polish phonemes, compared with the results published in the literature [40, 42, 44, 45],
Table 13 presents the occurrence frequency of two-phoneme sequences (diphones) in Polish, compared with the results published in the literature [45],
Table 14 presents the occurrence frequency of three-phoneme sequences (triphones) in Polish, compared with the results published in the literature [45].

Table 12 Frequency of Polish phoneme occurrence: comparison with the results published in the literature [40, 42, 44, 45]

Table 13 Frequency of the two-phoneme sequence occurrence in Polish: comparison with the results published in the literature [45]

Table 14 Frequency of the three-phoneme sequence occurrence in Polish: comparison with the results published in the literature [45]

The differences between the obtained results and the statistical analyses performed by other researchers may be caused by differences in the corpora used (e.g., in size, quality, and linguistic structure) and by the development of the language over time. Language is constantly changing, evolving, and adapting to the needs of its speakers. All languages change continually, and do so in many and varied ways (e.g., lexical changes, phonetic and phonological changes, spelling changes, semantic and syntactic changes) [47]. Therefore, the results of research performed using different corpora may differ considerably [48, 49]. The results most similar to those obtained here are the phoneme occurrence statistics from [44, 45], presented in Table 12. The least accurate results were obtained a few decades ago with much smaller language corpora [40–42]. Taking into account the results available in the literature, it can be concluded that the performed statistical analysis of the Polish language was extensive. No results of a statistical analysis of n-phoneme sequence occurrence in Polish for n>3 were found in the literature. On the basis of the comparison, the following conclusion can be drawn: the developed phonemic language corpus for Polish, which was used to perform the statistical analysis, was very large, containing 1,263,248,497 phonemes, although it is not the largest one developed for the Polish language [44]. The statistical analysis results obtained from it make it possible to develop statistical models of the Polish language.

4.3 Frequency of the word occurrence

The frequency of word occurrence in a language is well described by Zipf’s law [50, 51]:

$$ Z_{r} = \frac{a}{r^{b}} $$

(6)

where Z_r is the frequency of the word ranked r, r is the rank of the word when frequencies are ranked from the most frequent (r=1) to the least frequent (r=n), and a and b are parameters to be estimated from the obtained statistical data. The usual finding is that b is close to 1 [50]. The fit of Zipf’s equation to the ranked frequency distribution of Polish words is presented in Fig. 5.

The ranked frequency distribution of Polish words was estimated by Zipf’s equation in the following form:

$$ Z_{r} = \frac{0.041566}{r^{0.9}} $$

(7)

The quality of the fit of Zipf’s equation to the ranked frequency distribution of Polish words was measured by the coefficient of determination R². For the fit of Zipf’s equation given in Eq. (7), the coefficient of determination is equal to:

$$ R^{2} = 0.90729 $$

(8)

Additionally, the root-mean-square error (RMSE) was calculated for this case, and it is equal to:

$$ RMSE = 7.6475\cdot10^{-6} $$

(9)

The R² value indicates how well statistical data fit a statistical model. The value R²=0.90729 indicates that Zipf’s equation fits the obtained statistical data on word occurrence frequency in the Polish language well.
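One common way to estimate a and b and to compute R² and RMSE can be sketched as follows. The fitting procedure actually used for Eq. (7) is not described here, so the least-squares fit in log-log space below is an assumption, as are the function names. The input is synthetic data generated from Eq. (7) itself, which only checks that the sketch recovers known parameters; real ranked frequencies would give a non-trivial R².

```python
import math


def fit_zipf(freqs):
    """Estimate (a, b) in Z_r = a / r**b by least squares on log-log data.

    freqs: relative frequencies sorted from rank 1 (most frequent) downwards.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # ln Z_r = ln a - b ln r, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


def r_squared(observed, predicted):
    """Coefficient of determination of predicted values against observed ones."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Synthetic ranked frequencies generated from Eq. (7); not corpus data.
observed = [0.041566 / r ** 0.9 for r in range(1, 1001)]
a, b = fit_zipf(observed)
predicted = [a / r ** b for r in range(1, len(observed) + 1)]
print(a, b, r_squared(observed, predicted), rmse(observed, predicted))
```

Note that a equals the predicted frequency at rank 1, so Eq. (7) implies that the most frequent word accounts for roughly 4.2% of all word occurrences.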

On this basis and on the basis of the results available in the literature [51–53], it can be concluded that the statistical data, obtained as the result of performed statistical analysis of Polish language, based on the orthographic language corpus, are correct.

4.4 Frequency of the phoneme and n-phoneme sequence occurrence

The frequency of word occurrence in a language is well described by Zipf’s law [50]. However, Zipf’s law does not describe well the distribution of the phonemes and phoneme sequences out of which words are composed. The examination of occurrence frequency in 95 languages, presented in the literature [51], shows that phoneme frequencies are best described by an equation first developed by Yule, that also describes the distribution of DNA codons [54]. The frequency of the phoneme occurrence in a language is described well by Yule’s equation formula [51]:

$$ Y_{r} = \frac{a}{r^{b}} \cdot c^{r} $$

(10)

where Y_r is the frequency of the phoneme ranked r, r is the rank of the phoneme when frequencies are ranked from the most frequent (r=1) to the least frequent (r=n), and a, b and c are parameters to be estimated from the obtained statistical data.

The fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes are presented in Fig. 6.

The evaluation results of the fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes are presented in Table 15.

Table 15 Evaluation results of the fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes

Note that Zipf’s equation is a special case of Yule’s equation in which the c^r term is neglected. It is not always possible to neglect this term. As shown in Fig. 6 and Table 15, Yule’s equation fits the distribution of phoneme frequencies in Polish much better than Zipf’s equation. This is not an isolated case: a similar regularity can be observed in other languages [51].
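The relation between the two equations is easiest to see after taking logarithms, assuming all frequencies and parameters are positive. Yule’s equation then becomes

$$ \ln Y_{r} = \ln a - b\ln r + r\ln c $$

while Zipf’s equation gives

$$ \ln Z_{r} = \ln a - b\ln r $$

so the two coincide exactly when c=1. For 0<c<1 the extra term r ln c is negative and grows linearly with the rank, which bends the curve downwards on a log-log plot at higher ranks. Both forms are linear in the unknowns ln a, b and ln c, so they can in principle be fitted by ordinary least squares on the logarithms of the frequencies; this is one possible estimation approach and not necessarily the one used to obtain the results reported here.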

The same regularity was observed for the frequency distributions of n-phoneme sequence occurrence in Polish, for n=2,…,5. Figures 7 and 8 present the fit of Yule’s equation to the ranked frequency distribution of Polish n-phoneme sequences for n=2 and n=3.

A summary of the evaluation results of the fits of Yule’s equation to the ranked frequency distribution of Polish phonemes and n-phoneme sequences for n=2,…,5 is presented in Table 16.

Table 16 Evaluation results of the fit of Yule’s equation to the ranked frequency distribution of Polish phonemes (n=1) and n-phoneme sequences for n=2,…,5

The values of R² presented in Table 16 indicate that Yule’s equation fits the obtained statistical data on the occurrence frequency of Polish phonemes and n-phoneme sequences for n=2,…,5 very well. Similar properties are observed for other languages. On the basis of the obtained results and the results available in the literature [40, 41, 43–46, 51], it can be concluded that the statistical data obtained from the statistical analysis of the Polish language, based on the orthographic and phonemic language corpora, are correct.

5 Example of practical application of the obtained results for language modelling

This article provides general statistics of the Polish language that can be useful for a variety of language and speech processing applications, including automatic speech recognition with language models [55].

The goal of a word-based language model is to model the sequence of words in the context of the task being performed by the speech recognition system. In continuous speech recognition, incorporating the language model is crucial for reducing the search space of the recognized word sequence W. The probability P(W) of a sequence W of n words w_i can be decomposed as [17]:

$$ P(W) = P(w_{1})\prod\limits_{i=2}^{n} P(w_{i}|w_{1},\ldots,w_{i-1}) $$

(11)

where P(w_i|w_1,…,w_{i−1}) is the conditional probability that w_i will occur, given the previous word sequence w_1,…,w_{i−1}. Unfortunately, it is impossible to compute the conditional word probabilities P(w_i|w_1,…,w_{i−1}) for all words and all sequence lengths in a given language. Even if the sequences are limited to moderate values of i, there would not be enough data to estimate all of the conditional probabilities reliably. The conditional probability can therefore be approximated by conditioning only on the preceding N−1 words, as defined by the following formula:

$$ P(W) = P(w_{1})\prod\limits_{i=2}^{n} P(w_{i}|w_{i-N+1},\ldots,w_{i-1}) $$

(12)

This approximation is commonly referred to as the N-gram model [17]. The most popular solutions published in the literature apply N-gram language models to word-based speech recognition tasks [56–59].
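For N=2 the approximation reduces to a product of bigram probabilities, which can be estimated from counts. The sketch below uses plain maximum-likelihood estimates without smoothing, so any unseen bigram yields probability zero. The function names and the three-sentence toy corpus are assumptions and do not describe the language models developed in this work.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect the counts needed for maximum-likelihood bigram estimates."""
    unigrams = Counter()   # occurrences of each word
    histories = Counter()  # occurrences of each word as a bigram history
    bigrams = Counter()    # occurrences of each adjacent word pair
    for words in sentences:
        unigrams.update(words)
        histories.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return unigrams, histories, bigrams


def sequence_probability(words, unigrams, histories, bigrams):
    """P(W) = P(w_1) * product of P(w_i | w_{i-1}) for i = 2..n."""
    prob = unigrams[words[0]] / sum(unigrams.values())
    for prev, cur in zip(words, words[1:]):
        if histories[prev] == 0:
            return 0.0  # history never observed: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / histories[prev]
    return prob


# Toy corpus (not from the NCP).
corpus = [
    ["ala", "ma", "kota"],
    ["ala", "ma", "psa"],
    ["kot", "ma", "mleko"],
]
model = train_bigram(corpus)
p = sequence_probability(["ala", "ma", "kota"], *model)
print(p)
```

The same code applies unchanged to phoneme sequences if each "sentence" is a list of phoneme labels.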

Language modelling may be based on modelling of words, as well as of sub-word units (e.g., phonemes). Statistical analysis of the phonemic corpus makes it possible to develop statistical language models based on phonemes.

For a sequence of phonemes Q=q_1…q_m, containing m phonemes q_i, the probability P(Q) is given by a phoneme-based language model and the following formula:

$$ P(Q) = P(q_{1})\prod\limits_{i=2}^{m} P(q_{i}|q_{1},\ldots,q_{i-1}) $$

(13)

where P(q_i|q_1,…,q_{i−1}) is the conditional probability that q_i will occur, given the previous phoneme sequence q_1,…,q_{i−1}. The approximation of the probability P(Q) for an N-gram phoneme-based language model is defined by the analogous formula:

$$ P(Q) = P(q_{1})\prod\limits_{i=2}^{m} P(q_{i}|q_{i-N+1},\ldots,q_{i-1}) $$

(14)

On the basis of the statistical analysis of the orthographic language corpus, N-gram word-based language models for Polish were developed for N=1,…,3. In a similar way, on the basis of the statistical analysis of the phonemic language corpus, N-gram phoneme-based language models for Polish were developed for N=1,…,3. The details of the development process of the word-based and phoneme-based language models are presented in a separate publication. This article presents only an example of applying the statistical language analysis to develop selected language models.

An approach to evaluate a language model is word recognition error rate [60].

However, this approach requires a working speech recognition system. Alternatively, we can measure the average number of possible words that follow any given word sequence in a language. This is the derivative measure of entropy, known as perplexity (PP) [17]. Given a language model P(W), where W is the n-word sequence, the entropy of the language model can be defined as [61]:

$$ H(W) = -\frac{1}{n}\log_{2}(P(W)) $$

(15)

For N-gram language model, H(W) entropy can be calculated with the following formula:

$$ H(W) = -\frac{1}{n}\sum\limits_{i=1}^{n} \log_{2}(P(w_{i}|w_{i-N+1},\ldots,w_{i-1})) $$

(16)

Note that as n approaches infinity, the entropy approaches the asymptotic entropy of the source defined by the measure P(W). This means that the typical length of the sequence must approach infinity, which is of course impossible. Thus, the entropy H(W) should be estimated for a sufficiently large value of n. The perplexity PP(W) of the word-based language model is then defined as [17]:

$$ PP(W) = 2^{H(W)} $$

(17)
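The step from conditional probabilities to entropy and perplexity can be sketched in a few lines. The function names are hypothetical, and the input is assumed to be the list of conditional probabilities that a language model assigns to each element of a test sequence.

```python
import math


def entropy(cond_probs):
    """H = -(1/n) * sum of log2 P(w_i | history) over a sequence of length n."""
    return -sum(math.log2(p) for p in cond_probs) / len(cond_probs)


def perplexity(cond_probs):
    """PP = 2 ** H."""
    return 2 ** entropy(cond_probs)


# A model that always spreads its probability evenly over 8 candidates
# has entropy 3 bits and perplexity 8.
uniform = [1 / 8] * 5
print(entropy(uniform), perplexity(uniform))
```

This illustrates the interpretation of perplexity as the average number of equally likely candidates the model has to choose from at each step.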

The perplexity PP_N(W) values for the developed word-based N-gram language models for N=1,…,3 are compared in Table 17. The perplexity PP_N(Q) values for the developed phoneme-based N-gram language models for N=1,…,3 are compared in Table 18.

Table 17 Comparison of perplexity PP_N(W) values for the developed word-based N-gram language model for N=1,…,3

Table 18 Comparison of perplexity PP_N(Q) values for the developed phoneme-based N-gram language model for N=1,…,3

The PP values presented in Tables 17 and 18 show that the developed phoneme-based 3-gram language model has the lowest PP value, equal to 7.77. A lower perplexity value indicates a greater ability of the language model to predict the sequence of speech components, so a language model is rated as better when its perplexity is lower. Language models with low perplexity indicate a more predictable language. However, since perplexity is not related to the difficulty of recognizing particular acoustic patterns, reducing the language model perplexity does not guarantee an improvement in automatic speech recognition performance.

5.1 Potential application of other statistical analysis results

The statistical analysis results for four-word and five-word sequence occurrence are not presented in this paper, but they can be helpful in developing advanced (4-gram and 5-gram) word-based language models for Polish. As noted above, language modelling may be based on modelling of words, as well as of sub-word units (e.g., phonemes). Therefore, the statistics of sequences longer than three phonemes can be used for developing advanced (higher than 3-gram) phoneme-based language models for Polish. Advanced word-based and phoneme-based language modelling makes it possible to develop hybrid language models for out-of-vocabulary (OOV) word detection in large vocabulary conversational speech recognition (LVCSR) systems for the language [62, 63]. The language model in most state-of-the-art LVCSR systems is still the N-gram, which assigns probability to the next word based on only the N−1 preceding words [64]. However, the use of additional phoneme-based language models improves the efficiency of LVCSR systems [65]. Another improvement in LVCSR system development is the use of higher than 4-gram language models, with particular emphasis on N-gram phoneme-based language models.

6 Conclusions

This paper presents original results of a statistical analysis of the Polish language, performed on the orthographic language corpus obtained from the NCP and on the phonemic language corpus developed through automatic grapheme-to-phoneme conversion of the orthographic corpus. The results make it possible to develop statistical word-based and phoneme-based language models for use in automatic speech recognition.

The results of the statistical analysis of the Polish language were compared with, and are consistent with, other results available in the literature [40–46, 66, 67]. In light of those results, it can be concluded that the performed statistical analysis of the language was extensive. No results of statistical analysis of n-phoneme sequence occurrence in Polish for n>3 were found in the literature. On the basis of the comparison, the following conclusion can be drawn: the phonemic language corpus used for the statistical analysis was very large (1,263,248,497 phonemes), and the statistical results obtained from it allow statistical models of the Polish language to be developed.

Additionally, the obtained statistical data were validated and evaluated. The frequency of word occurrence in a language is well described by Zipf’s law. The statistical data for words were validated by fitting Zipf’s equation to the ranked frequency distribution of Polish words. A similar regularity was observed for the frequency distribution of phoneme occurrence in Polish. An examination of phoneme frequencies in 95 languages, presented in the literature [51], shows that phoneme frequencies are best described by Yule’s equation [54]. The statistical data for phonemes were validated by fitting Yule’s equation to the ranked frequency distributions of Polish phonemes and n-phoneme sequences. In agreement with the results available in the literature [51], it can be concluded that the statistical data obtained from the statistical analysis of the Polish language, based on the orthographic and phonemic language corpora, are correct.
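A fit of Zipf’s equation to ranked frequencies can be sketched as follows. This is only an illustration: it assumes the common form f(r) = a / r^b, where r is the rank, and estimates a and b by ordinary least squares in log-log space. The function name `fit_zipf` is hypothetical, the input is synthetic, and the fitting procedure actually used in this paper may differ.

```python
import math

def fit_zipf(frequencies):
    """Fit f(r) = a / r**b to ranked frequencies by least squares on
    log(f) versus log(r). Needs at least two positive values.
    Returns the pair (a, b)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log f = log a - b * log r, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope

# Synthetic frequencies that follow f(r) = 1000 / r exactly.
a, b = fit_zipf([1000 / r for r in range(1, 51)])
```

On real corpus counts the points do not lie exactly on a line, so the quality of the fit (for example, the residuals in log-log space) would also need to be inspected before drawing conclusions.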

The regularity presented in this paper is not an isolated case; a similar regularity can be observed in other languages, and therefore in other language corpora that reflect the state of the contemporary language [51]. It would be valuable to provide similar fits for existing Polish text corpora, so that the reader could assess the quality of the created phonemic language corpus. Similarly, it would be very valuable to compare the word error rate and the perplexity of language models created from the existing Polish corpora on a common test set. However, this is difficult because of the lack of access to other existing Polish text corpora of appropriate size and quality, apart from the NCP. Likewise, the author did not find any available phonemic language corpus for Polish and therefore created his own by G2P conversion of the available orthographic language corpus for Polish (NCP). Since this problem seems to be very important, the author plans to address it in future publications.

The developed word-based and phoneme-based language models were also presented in this paper as an example of a practical application of the obtained statistical data on the Polish language. The obtained statistical data open up further opportunities to continue research on improving automatic speech recognition in Polish. The plan for future research includes the development of statistical word-based and subword-based language models for Polish. Word-based and subword-based language modelling makes it possible to develop hybrid language models for out-of-vocabulary word detection in large vocabulary conversational speech recognition [64, 68–70].

References

[1] L Rabiner, B Juang, Fundamentals of Speech Recognition. Prentice Hall signal processing series (PTR Prentice Hall, USA, 1993).
[2] JR Bellegarda, C Monz, State of the art in statistical methods for language and speech processing. Comput. Speech Lang. 35, 163–184 (2016).
[3] L Rabiner, B Juang, Encyclopedia of Language and Linguistics, Statistical methods for the recognition and understanding of speech (Elsevier, Amsterdam, 2005).
[4] S Sakti, K Markov, S Nakamura, W Minker, in Incorporating Knowledge Sources into Statistical Speech Recognition, vol 42 of Lecture Notes in Electrical Engineering. Statistical Speech Recognition (Springer US, USA, 2009), pp. 19–53.
[5] J Bellegarda, Large vocabulary speech recognition with multispan statistical language models. IEEE Trans. Speech Audio Process. 8, 76–84 (2000).
[6] P Kłosowski, in Computer Networks, vol 79 of Communications in Computer and Information Science, ed. by A Kwiecien, P Gaj, and P Stera. Speech processing application based on phonetics and phonology of the Polish language. 17th International Conference Computer Networks, Ustron, Poland, Jun 15-19 (Springer-Verlag, Berlin, 2010), pp. 236–244.
[7] P Kłosowski, Improving speech processing based on phonetics and phonology of Polish language. Przegląd Elektrotechniczny. 89, 303–307 (2013).
[8] J Izydorczyk, P Kłosowski, Acoustic properties of Polish vowels. Bull. Pol. Acad. Sci. Tech. Sci. 47(1), 29–37 (1999).
[9] J Izydorczyk, P Kłosowski, in International Conference Programmable Devices and Systems PDS2001 IFAC Workshop, Gliwice November 22nd - 23rd. Base acoustic properties of Polish speech (IFAC, Gliwice, 2001), pp. 61–66.
[10] P Kłosowski, A Dustor, J Izydorczyk, J Kotas, J Slimok, in Computer Networks, CN 2014, vol 431 of Communications in Computer and Information Science, ed. by A Kwiecien, P Gaj, and P Stera. Speech recognition based on open source speech processing software. 21st International Science Conference on Computer Networks (CN), Brunow, Poland, Jun 23-27 (Springer-Verlag, Berlin, 2014), pp. 308–317.
[11] A Dustor, P Kłosowski, in Computer Networks, CN 2013, vol 370 of Communications in Computer and Information Science, ed. by A Kwiecien, P Gaj, and P Stera. Biometric voice identification based on Fuzzy Kernel Classifier. 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, Jun 17-21 (Springer-Verlag, Berlin, 2013), pp. 456–465.
[12] A Dustor, P Kłosowski, J Izydorczyk, in 2014 International Conference on Multimedia Computing and Systems (ICMCS). Speaker recognition system with good generalization properties. International Conference on Multimedia Computing and Systems (ICMCS), Marrakech, Morocco, Apr 14-16 (IEEE, USA, 2014), pp. 206–210.
[13] A Dustor, P Kłosowski, J Izydorczyk, in Computer Networks, CN 2014, vol 431 of Communications in Computer and Information Science, ed. by A Kwiecien, P Gaj, and P Stera. Influence of Feature Dimensionality and Model Complexity on Speaker Verification Performance. 21st International Science Conference on Computer Networks (CN), Brunow, Poland, Jun 23-27 (Springer-Verlag, Berlin, 2014), pp. 177–186.
[14] P Kłosowski, A Dustor, J Izydorczyk, in Computer Networks, CN 2015, vol 522 of Communications in Computer and Information Science, ed. by P Gaj, A Kwiecien, and P Stera. Speaker verification performance evaluation based on open source speech processing software and TIMIT speech corpus. 22nd International Conference on Computer Networks (CN), Brunow, Poland, Jun 16-19 (Springer-Verlag, Berlin, 2015), pp. 400–409.
[15] A Dustor, P Kłosowski, J Izydorczyk, R Kopanski, in Computer Networks, CN 2015, vol 522 of Communications in Computer and Information Science, ed. by P Gaj, A Kwiecien, and P Stera. Influence of Corpus Size on Speaker Verification. 22nd International Conference on Computer Networks (CN), Brunow, Poland (Springer-Verlag, Berlin, 2015), pp. 242–249.
[16] P Kłosowski, A Dustor, in Computer Networks, CN 2013, vol 370 of Communications in Computer and Information Science, ed. by A Kwiecien, P Gaj, and P Stera. Automatic Speech Segmentation for Automatic Speech Translation. 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, Jun 17-21 (Springer-Verlag, Berlin, 2013), pp. 466–475.
[17] F Jelinek, Statistical Methods for Speech Recognition. Language, Speech, & Communication: A Bradford Book (MIT Press, USA, 1997).
[18] S Furui, Recent progress in corpus-based spontaneous speech recognition. IEICE Trans. Inf. Syst. E88D, 366–375 (2005).
[19] M Adda-Decker, Corpus for automatic speech recognition. Revue Francaise De Linguistique Appliquee. 12, 71–84 (2007).
[20] A Przepiórkowski, M Bańko, RL Górski, B Lewandowska-Tomaszczyk, The National Corpus of Polish (in Polish: Narodowy Korpus Języka Polskiego) (Wydawnictwo Naukowe PWN, Warszawa, 2012).
[21] A Przepiórkowski, RL Górski, B Lewandowska-Tomaszczyk, M Łaziński, in Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008. Towards the national corpus of Polish (ELRA, Marrakech, 2008).
[22] RL Górski, B Lewandowska-Tomaszczyk, M Bańko, P Pęzik, M Łaziński, A Przepiórkowski, Practical applications of the National Corpus of Polish. Prace Filologiczne. 63, 231–240 (2012).
[23] J Hirschberg, CD Manning, Advances in natural language processing. Science. 349, 261–266 (2015).
[24] International Phonetic Association, Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. A Regents publication (Cambridge University Press, UK, 1999).
[25] R Sussex, P Cubberley, The Slavic Languages. Cambridge Language Surveys (Cambridge University Press, UK, 2006).
[26] J Wells, in Handbook of Standards and Resources for Spoken Language Systems, vol Part IV, section B, ed. by D Gibbon, R Moore, and R Winski. SAMPA computer readable phonetic alphabet (Mouton de Gruyter, Berlin and New York, 1997).
[27] M Razavi, R Rasipuram, MM Doss, Acoustic data-driven grapheme-to-phoneme conversion in the probabilistic lexical modeling framework. Speech Commun. 80, 1–21 (2016).
[28] RM Kaplan, M Kay, Regular models of phonological rule systems. Comput. Linguist. 20, 331–378 (1994).
[29] M Steffen-Batóg, The problem of automatic phonemic transcription of written Polish. Biuletyn Fonograficzny. 14, 75–86 (1973).
[30] M Steffen-Batóg, in Polish: Automatyzacja transkrypcji fonematycznej tekstów polskich. Automatic phonemic transcription of Polish texts (Wydawnictwo Naukowe PWN, Warszawa, 1975).
[31] M Steffen-Batóg, P Nowakowski, in Studia Phonetica Posnaniensia, vol. 3, ed. by M Steffen-Batóg, W Awedyk. An algorithm for phonetic transcription of orthographic texts in Polish (Wydawnictwo Naukowe UAM, Poznań, 1993).
[32] W Jassem, A phonemic transcription and syllable division rule engine (Onomastica-Copernicus Research Colloquium, Edinburgh, 1996).
[33] P Kłosowski, in Proceedings of 20th IEEE International Conference Signal Processing Algorithms, Architectures, Arrangements, and Applications, September 21-23. Algorithm and implementation of automatic phonemic transcription for Polish (Poznan University of Technology, Poznań, 2016), pp. 298–303.
[34] M Wypych, in Speech and Language Technology, vol. 3. Implementation of phonemic transcription algorithm (in Polish: Implementacja algorytmu transkrypcji fonematycznej) (Polskie Towarzystwo Fonetyczne, Poznań, 1999).
[35] G Demenko, M Wypych, E Baranowska, Implementation of grapheme-to-phoneme rules and extended SAMPA alphabet in Polish text-to-speech synthesis. Speech Lang. Technol. 7(17) (2003).
[36] P Przybysz, W Kasprzak, in 2013 6th International Conference on Human Systems Interactions (HSI), ed. by WA Paja, BM Wilamowski. The generation of letter-to-sound rules for grapheme-to-phoneme conversion. Conference on Human System Interaction. Gdansk Univ Technol; Univ Informat Technol & Management; IEEE Ind Elect Soc (Gdansk University of Technology, Gdansk, 2013), pp. 292–297.
[37] D Skurzok, B Ziółko, M Ziółko, in 7th Language & Technology Conference, Poznań. Ortfon2 - tool for orthographic to phonetic transcription (Adam Mickiewicz University in Poznan, Poznan, 2015).
[38] D Koržinek, Ł Brocki, K Marasek, Polish grapheme-to-phoneme tool and service, CLARIN-PL digital repository (2016). http://hdl.handle.net/11321/295, (Online: 2016.08.01).
[39] Wiktionary, Polish Language Dictionary (2015). https://pl.wiktionary.org/. Accessed 17 Feb 2017.
[40] W Jassem, Podstawy fonetyki akustycznej (Eng. Rudiments of acoustic phonetics) (PWN, Warszawa, 1973).
[41] P Łobacz, W Jassem, Fonotaktyczna analiza mówionego tekstu polskiego (Eng. Phonotactic analysis of spoken Polish texts). Biuletyn Polskiego Towarzystwa Ję. 32, 179–195 (1974).
[42] C Basztura, Rozmawiac z komputerem (Eng. To speak with computers), (1992).
[43] B Ziółko, J Gałka, S Manandhar, RC Wilson, M Ziółko, in Human Language Technology: Challenges of the Information Society, vol 5603 of Lecture Notes in Artificial Intelligence, ed. by Z Vetulani, H Uszkoreit. Triphone Statistics for Polish Language. 3rd Language and Technology Conference 2007, Poznan, Poland, Oct 05-07, (2009), pp. 63–73.
[44] B Ziółko, J Gałka, M Ziółko, Polish phoneme statistics obtained on large set of written texts. Comput. Sci. (AGH). 10, 97–106 (2009).
[45] B Ziółko, J Gałka, in Computer Science and Information Technology (IMCSIT), Proceedings of the 2010 International Multiconference on. Polish phones statistics (AGH University of Science and Technology, Krakow, 2010), pp. 561–565.
[46] B Ziółko, P Zelasko, D Skurzok, in 2014 XXII Annual Pacific Voice Conference (PVC). Statistics of diphones and triphones presence on the word boundaries in the Polish language. Applications to ASR. Annual Pacific Voice Conference, AGH; Pacific Voice Speech Fdn, 2014. 22nd Annual Pacific Voice Conference (PVC) (AGH University of Science and Technology, Krakow, 2014).
[47] D Lightfoot, The development of language: Acquisition, change, and evolution (Wiley-Blackwell, Hoboken, 1999).
[48] D Biber, S Conrad, R Reppen, Corpus linguistics: Investigating language structure and use (Cambridge University Press, Cambridge, 1998).
[49] R Facchinetti, M Rissanen, Corpus-based studies of diachronic English, vol. 31 (Peter Lang, 2006).
[50] GK Zipf, Human behavior and the principle of least effort. J. Clin. Psychol. 6(3), 306–306 (1950).
[51] Y Tambovtsev, C Martindale, Phoneme frequencies follow a Yule distribution. SKASE J. Theor. Linguist. 4(2) (2008).
[52] ST Piantadosi, Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bull. Rev. 21, 1112–1130 (2014).
[53] A Corral, G Boleda, R Ferrer-i Cancho, Zipf’s law for word frequencies: word forms versus lemmas in long texts. Plos ONE. 10(7), e0129031 (2015). doi:10.1371/journal.pone.0129031.
[54] GU Yule, A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Phil. Trans. R. Soc. London B Biol Sci. 213(402-410), 21–87 (1925).
[55] S Dziadzio, A Nabożny, A Smywiński-Pohl, B Ziółko, in Computer Science and Information Systems (FedCSIS) 2015 Federated Conference on. Comparison of language models trained on written texts and speech transcripts in the context of automatic speech recognition (Lodz University of Technology, Lodz, 2015), pp. 193–197.
[56] S Takahashi, T Morimoto, in 2012 International Conference on Asian Language Processing (IALP 2012), ed. by D Xiong, E Castelli, M Dong, and PTN Yen. N-gram Language Model Based on Multi-Word Expressions in Web Documents for Speech Recognition and Closed-Captioning (Soochow University, China, 2012), pp. 225–228.
[57] A Hatami, A Akbari, B Nasersharif, in 2013 21st Iranian Conference on Electrical Engineering (ICEE). N-gram Adaptation Using Dirichlet Class Language Model Based on Part-of-Speech for Speech Recognition (Ferdowsi University of Mashhad, Mashhad, 2013).
[58] M Bahrani, H Sameti, N Hafezi, S Momtazi, in New Frontiers in Applied Artificial Intelligence, vol 5027 of Lecture Notes in Artificial Intelligence, ed. by NT Nguyen, L Borzemski, A Grzech, and M Ali. New word clustering method for building n-gram language models in continuous speech recognition systems (Springer, Berlin, 2008), pp. 286–293.
[59] B Rapp, in 2008 International Multiconference on Computer Science and Information Technology (IMCSIT), Vols 1 and 2, ed. by M Ganzha, M Paprzycki, and T PelechPilichowski. N-gram language models for Polish language. Basic concepts and applications in automatic speech recognition systems (IEEE Computer Society Press, Los Alamitos, 2008), pp. 295–298.
[60] D Klakow, P Jochen, Testing the correlation of word error rate and perplexity. Speech Commun. 38(1–2), 19–28 (2002).
[61] T Cover, J Thomas, Wiley series in telecommunications: Elements of information theory (John Wiley and Sons, USA, 1991).
[62] P Yu, FTB Seide, in Interspeech. A hybrid word/phoneme-based approach for improved vocabulary-independent search in spontaneous speech (Citeseer, Jeju Island, 2004).
[63] V Chunwijitra, A Chotimongkol, C Wutiwiwatchai, A hybrid input-type recurrent neural network for LVCSR language modeling. EURASIP J. Audio Speech Music Process. 2016(1), 15 (2016).
[64] A Yazgan, M Saraclar, in Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP’04). IEEE International Conference on. Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition, vol 1 (IEEE, 2004), pp. I–745.
[65] M Larson, Sub-word-based language models for speech recognition: implications for spoken document retrieval. Workshop on Language Modeling and Information Retrieval (2001).
[66] A Czardybon, O Hellwig, W Petersen, in Advances in Natural Language Processing, vol 8686 of Lecture Notes in Artificial Intelligence, ed. by A Przepiorkowski, M Ogrodniczuk. Statistical Analysis of the Interaction between Word Order and Definiteness in Polish. Polish Acad Sci, Inst Comp Sci, 2014. 9th International Conference on Natural Language Processing (NLP), Warsaw, Poland, Sep 17-19 (Polish Academy of Science, Institute of Computer Science, Warsaw, 2014), pp. 144–150.
[67] P Mandera, E Keuleers, Z Wodniecka, M Brysbaert, Subtlex-pl: subtitle-based word frequency estimates for Polish. Behav. Res. Methods. 47, 471–483 (2015).
[68] JR Bellegarda, Large vocabulary speech recognition with multispan statistical language models. IEEE Trans. Speech Audio Process. 8, 76–84 (2000).
[69] H Schwenk, Continuous space language models. Comput. Speech Lang. 21(3), 492–518 (2007).
[70] MAB Shaik, E-D AMousa, R Schlüter, H Ney, in INTERSPEECH. Hybrid language models using mixed types of sub-lexical units for open vocabulary German LVCSR (International Speech Communication Association (ISCA), Baixas, 2011), pp. 1441–1444.

Acknowledgements

This work was supported by the Polish Ministry of Science and Higher Education funding for statutory activities.

Competing interests

The author declares that he has no competing interests.

Author information

Authors and Affiliations

Department of Electronics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland
Piotr Kłosowski

Corresponding author

Correspondence to Piotr Kłosowski.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Kłosowski, P. Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling. J AUDIO SPEECH MUSIC PROC. 2017, 5 (2017). https://doi.org/10.1186/s13636-017-0102-8

Received: 15 April 2016
Accepted: 08 February 2017
Published: 28 February 2017
DOI: https://doi.org/10.1186/s13636-017-0102-8
