This paper introduces a nonlinear function into the frequency spectrum that improves the detection of vowels, diphthongs, and semivowels within the speech signal. The lower efficiency of consonant detection was addressed by implementing hangover and hangbefore criteria. The paper also presents a procedure for faster determination of the optimal constants used by these criteria. The nonlinearly modified frequency spectrum is used in the proposed GMM (Gaussian Mixture Model) based VAD (Voice Activity Detection) algorithm. Comparative tests between the proposed VAD algorithm and seven other VAD algorithms were made on the Aurora 2 database. The experiments were based on frame error detection and on speech recognition performance for two types of acoustic training modes (multi-condition and clean only). The proposed VAD algorithm obtained the lowest average percentage of frame errors and also improved speech recognition performance for both types of acoustic training modes.
COBISS.SI-ID: 16323862
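The hangover and hangbefore criteria mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the paper's actual criteria or constants, so this example assumes the common interpretation: every frame classified as speech also marks a fixed number of preceding frames (hangbefore) and following frames (hangover) as speech, so that weak consonants at word boundaries are not cut off. The function name `apply_hang` and the constants used are hypothetical.

```python
from typing import List


def apply_hang(decisions: List[bool], hangbefore: int, hangover: int) -> List[bool]:
    """Extend frame-level speech decisions backwards and forwards.

    Assumption (not taken from the paper): each speech frame also marks
    `hangbefore` frames before it and `hangover` frames after it as speech.
    """
    n = len(decisions)
    smoothed = [False] * n
    for i, is_speech in enumerate(decisions):
        if is_speech:
            start = max(0, i - hangbefore)      # clip at the first frame
            end = min(n, i + hangover + 1)      # clip at the last frame
            for j in range(start, end):
                smoothed[j] = True
    return smoothed


# Hypothetical raw VAD output: only frames 3 and 4 were detected as speech.
raw = [False, False, False, True, True, False, False, False, False]
print(apply_hang(raw, hangbefore=1, hangover=2))
```

With `hangbefore=1` and `hangover=2`, the detected segment grows from frames 3 to 4 to frames 2 to 6. In practice the two constants would be tuned on development data, which is the step the paper's procedure is said to speed up.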
A new method for calculating acoustic confusability between words for automatic speech recognition is proposed. Acoustic confusability is one of the key elements influencing speech recognition accuracy. The proposed method is based on Levenshtein distance, calculated on phonetic transcriptions from the speech recognizer’s vocabulary. The method was evaluated in an indirect way. The experiments were carried out on four different sets of context-dependent acoustic models. The proposed method successfully predicted the acoustic confusability between words from the speech recognizer’s vocabulary.
COBISS.SI-ID: 16500502
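As a rough illustration of the idea in the abstract above, the sketch below computes the Levenshtein distance between phonetic transcriptions and ranks word pairs by it, treating a smaller distance as higher assumed confusability. The abstract does not specify how the paper weights or normalizes the distance, so this uses plain unit costs. The tiny lexicon, its phoneme symbols, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_pairs(lexicon: Dict[str, List[str]]) -> List[Tuple[int, str, str]]:
    """Return all word pairs sorted by phonetic edit distance (closest first)."""
    pairs = [(levenshtein(lexicon[w1], lexicon[w2]), w1, w2)
             for w1, w2 in combinations(lexicon, 2)]
    return sorted(pairs)


# Hypothetical vocabulary with illustrative phonetic transcriptions.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "dog": ["d", "ao", "g"],
}
for distance, w1, w2 in rank_pairs(lexicon):
    print(w1, w2, distance)
```

Here "cat" and "cap" differ by a single phoneme and are ranked first, while both pairs involving "dog" are three edits apart.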
In this paper we investigate a group of markers in spoken interaction in Slovene, known by the term "general extenders". Expressions such as in tako naprej 'and so on', pa to 'and such', pa tako 'and like that', pa vse 'and all', ali pa kaj takega 'or something like that', and ali nekaj takega 'or something like that' fall into this category. They are commonly divided into two groups: those beginning with in or colloquial pa, called "adjunctive", and those beginning with ali, called "disjunctive". We use the Slovene reference speech corpus GOS for the analysis. The GOS corpus includes about 1 million transcribed words of speech in everyday situations (private and official) as well as school lessons, lectures, and public media (TV and radio) discourse. We find a total of 50 different expressions that can function as general extenders in speech; however, only 12 of them are used more than 12 times, and 10 of them are used only once. The five most common are: in tako naprej (225 uses), pa to (318 uses), pa vse (44 uses), pa tako (79 uses), and ali pa kaj takega (38 uses). First, we compare their usage in different discourse settings within the GOS corpus. The results show that there is a greater variety and frequency of general extenders in non-public and informal settings than in public and formal settings; that disjunctive general extenders are similarly frequent in all non-public settings, whilst adjunctive general extenders are more frequent in private than in official non-public settings; and that there are important differences in the usage of particular general extenders. Then we examine the most frequent Slovene general extender, in tako naprej, using the sociocognitive approach, based on the theory of mental models by van Dijk and Kintsch (1983). We specify in tako naprej as a strategic device that a speaker uses when s/he wants to leave the previous element of the proposition underspecified, exemplified, or approximated.
The elements of control structure that guide its usage are, for example: a speaker does not want to explain the details, a speaker thinks the details are not important or interesting, a speaker wants to indicate that the thing/process is repeating, etc. General extenders refer to the previous element of the proposition, which can be of various lengths: a single previous word, a phrase, several words or phrases, a single previous sentence, or several sentences. At the interactional level, we find that general extenders address the listener and stimulate him/her to use his/her own knowledge and actively cooperate, to put more effort into interpretation, and to guess what the speaker has in mind. However, some more specific interactional goals are also identified, such as: a speaker lets the listener know that s/he does not want to explain the details, a speaker lets the listener know that s/he is unsure whether s/he has exhaustively represented the propositional element, a speaker lets the listener know that s/he has some problems in production, etc.
COBISS.SI-ID: 48889442
Several features of human-human conversation have to be accounted for in order to recreate conversational behavior on a synthetic model as naturally as possible. Spontaneous conversations combine multiple modalities (e.g. gestures, postures, gazes, expressions) in order to effectively convey information between participants. This paper presents a novel process for capturing the forms of motion performed during spontaneous conversations. Furthermore, it also addresses the process of transforming the captured motions' descriptions into high-resolution, expressively transformable behavioral scripts. The aim of the research was to design a process that will allow building a high-resolution motion dictionary. The dictionary is to be presented as a set of expressively transformable behavioral scripts, each capturing the expressive details from a spontaneous conversation (e.g. spatial, repetitive, structural, and temporal features).
COBISS.SI-ID: 16541462
Non-verbal behavior performed by embodied conversational agents still appears “wooden” and sometimes even “unnatural”. Annotated corpora and high-resolution annotations capturing the expressive details of movement may improve the gradualness of synthetic behavior. This paper presents a non-functional, form-oriented annotation scheme based on informal corpora involving multi-speaker dialogues. This annotation scheme allows annotators to capture the expressive details of movement in high resolution. The expressive domains it captures are: spatial domain (movement-pose configuration on the level of articulators), fluidity (translations between movement-phases and phrases), temporal domain (movement variation in the form of movement phases), repetitivity (repetitive features of movement), and power (level of exposure). The presented annotation scheme can transform the encoded data into movement templates that can be directly reproduced by an embodied conversational agent.
COBISS.SI-ID: 16541974