Automatic Extraction of Sesotho Tone using a Convolutional Neural Network

Enoch Essien

Published: 2024

This research investigates the feasibility of extracting tone in Sesotho using Convolutional Neural Networks (CNNs). A baseline model and four variations were implemented to compare their effectiveness in tone detection. Training and validation accuracy curves across 30 epochs showed that all models overfit to some degree, which points to a need for better generalization techniques. The baseline model achieved a validation accuracy of 67.40%, providing a reference point for further experimentation.

Spectrogram representation of Sesotho speech tones

Variation 1, which increased the depth of the neural network, yielded the highest validation accuracy of 71.42%, about four percentage points above the baseline. This result suggests that deeper CNN architectures are better suited to extracting subtle tonal features from spectrograms. Variations incorporating batch normalization and global average pooling achieved moderate improvements by reducing overfitting, but they did not fully capture the complexity of tonal variation in Sesotho. The model using the Leaky ReLU activation function produced the lowest validation accuracy, suggesting that this activation function is not well suited to tonal extraction in this context.
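For readers unfamiliar with the building blocks named above, the following is a minimal sketch of Leaky ReLU, global average pooling, and batch normalization in plain Python. It is not the study's implementation: the function names, the negative slope of 0.01, and the epsilon value are assumptions chosen for illustration, and the learned scale and shift parameters of batch normalization are omitted.

```python
import math


def leaky_relu(x, negative_slope=0.01):
    """Pass positive inputs through; scale negative inputs by a small slope.

    The slope of 0.01 is an assumed, commonly used default.
    """
    return x if x > 0 else negative_slope * x


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def batch_normalize(batch, eps=1e-5):
    """Normalize a batch of scalar activations to zero mean and unit variance.

    Real batch-normalization layers also apply a learned scale and shift,
    which this sketch leaves out.
    """
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(var + eps) for v in batch]


# Two hypothetical 2x2 feature maps, each collapsed to a single number.
print(global_average_pool([[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 4.0]]]))
```

Global average pooling replaces a large flattened feature vector with one value per feature map, which removes many trainable parameters from the classifier head. That is one plausible reason it reduces overfitting on a small dataset.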

Training and validation accuracy comparison across CNN variants

To mitigate overfitting, training was limited to 20 epochs, giving the models less opportunity to fit noise in the training data. Future research should prioritize expanding the dataset, incorporating a wider range of speakers, and applying advanced data augmentation techniques. The integration of transfer learning and Large Language Models (LLMs) also presents promising opportunities to enhance feature extraction and improve tonal classification accuracy.
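The study used a fixed epoch cap. As a rough sketch of how such a cap might be chosen from accuracy curves, the snippet below computes the gap between training and validation accuracy and finds the epoch where validation accuracy peaks. The helper names and the accuracy values are invented for illustration and are not results from this research.

```python
def overfitting_gap(train_accuracy, val_accuracy):
    """Per-epoch difference between training and validation accuracy."""
    return [t - v for t, v in zip(train_accuracy, val_accuracy)]


def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy.

    On ties, the earliest such epoch is returned.
    """
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1


# Hypothetical curves (NOT the study's data): training accuracy keeps rising
# while validation accuracy peaks and then declines, the usual overfitting sign.
train_acc = [0.55, 0.63, 0.70, 0.78, 0.85, 0.91]
val_acc = [0.52, 0.60, 0.65, 0.66, 0.64, 0.63]

gaps = overfitting_gap(train_acc, val_acc)
print(best_epoch(val_acc))  # -> 4
```

A gap that widens after the best validation epoch suggests that further training mostly fits the training set, which is the kind of evidence that would motivate a lower epoch cap.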

This study contributes to the field of automated tone extraction for low-resource languages such as Sesotho and highlights key directions for future work. Extending this research to related languages like Setswana and refining model architectures will support the development of scalable and accurate tonal language processing systems.
