The Johns Hopkins University + Amazon Initiative for Artificial Intelligence (AI2AI) has selected 8 JHU WSE Faculty research projects for its inaugural round of AI2AI Faculty Research Awards. Research areas covered in these projects include Language and Speech Processing, Green AI, Machine Learning, and Computer Vision.
2022-2023 Faculty Research Awards
Evaluating the Multilinguality of Multilingual Machine Translation
The proliferation of Deep Neural Networks into Artificial Intelligence has allowed researchers and engineers to build systems that can automatically translate between large groups of languages without having to build separate models. However, the limitations of having one large, general model are not well understood. We aim to investigate the cutting-edge frontiers of this class of AI models.
Recently, multilingual neural machine translation methods have demonstrated the ability to generalize beyond a bilingual setting to handle multiple different translation directions within a single model. However, these multilingual machine translation models have suffered from the “Curse of Multilinguality,” where high-resource language pairs suffer a performance drop compared to bilingual models. The goal of multilingual models is two-fold: to reduce the number of models and the engineering complexity of maintaining them, and to mitigate the lack of data in many translation directions. Yet, while this is helpful in the general setting, when dealing with specific language pairs, such as German/French/Spanish-English, sacrificing performance to reduce the number of models developed may not be an optimal choice.
In general, the assumption in the literature is that for high-resource languages, model capacity is the limiting factor. Yet, though stated as such, this is not necessarily demonstrated empirically. In our recent work, we show that hypotheses about model capacity, such as the Lottery Ticket Hypothesis, are not necessarily true and can be violated with different training regimens.
The first proposed contribution of this proposal is to demonstrate that model capacity is the limiting factor for multilingual machine translation. We intend to experiment using various model sizes with training objectives that have been shown to maximize model capacity such as SAGE, Intra-Distillation, Self-Distillation, and R-Drop [3, 1, 4, 5] in order to definitively prove that model capacity is the limiting factor for scaling NMT to more languages.
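Of the training objectives named above, R-Drop is compact enough to sketch. The same input is passed through the model twice with different dropout masks, and a symmetric KL term between the two output distributions is added to the usual negative log-likelihood. The following is a minimal illustration in plain Python; the probability vectors and the weight `alpha` are hypothetical values chosen for the example, not settings from this proposal.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def r_drop_loss(p1, p2, target, alpha=1.0):
    """R-Drop style loss for one example.

    p1, p2: output distributions from two forward passes with different dropout masks.
    target: index of the correct class (or token).
    alpha:  weight of the consistency term (hypothetical default).
    """
    # Negative log-likelihood summed over both passes.
    nll = -(math.log(p1[target]) + math.log(p2[target]))
    # Symmetric KL divergence between the two passes.
    sym_kl = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return nll + alpha * sym_kl

# Two hypothetical passes that disagree slightly on a 3-way prediction.
pass_a = [0.7, 0.2, 0.1]
pass_b = [0.6, 0.3, 0.1]
loss = r_drop_loss(pass_a, pass_b, target=0, alpha=1.0)
```

When the two passes agree exactly, the consistency term vanishes and only the likelihood term remains.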
The second proposed contribution is to determine the threshold for the curse of multilinguality. The majority of multilingual machine translation models such as mBART (25 or 50 languages depending on the version) or m2m100 (100 languages) look at dozens of languages [6, 7]. Yet, when the curse of multilinguality begins to take hold is not well understood. Furthermore, it is not known whether or not this happens for related languages or similar scripts. In other words, a multilingual system based upon the Romance languages may perform better than a bilingual French-English system. Our proposed experiments will focus on scaling up the number of languages in our multilingual models, while ablating on language family and linguistic typology.
Generalist Speech Processing Models
This project will investigate how to efficiently extract the information contained in speech using large scale AI models. The outcome will be a generalist model able to transcribe speech into text, and determine the speaker’s identity, language, and emotional state, among others.
Speech processing has made great progress in the last decade thanks to the advent of deep learning. Recently, Self-Supervised Learning (SSL) models like Wav2Vec2 are enabling another revolution. These models learn generic representations from large-scale data without supervision. We can then fine-tune them to a wide range of tasks, achieving state-of-the-art performance, for which they are referred to as Foundational Models. However, the common practice consists of fine-tuning a model per task, ending up with as many models as tasks. In this proposal, the question we want to answer is whether we can design a Generalist Model able to perform multiple speech processing tasks with a single model evaluation. Such a model would allow us to extract the information contained in the speech signal more efficiently compared to evaluating N different models.
We propose to investigate a multi-task model composed of a backbone encoder (e.g., Wav2Vec2, WavLM [1, 4, 5]) shared across tasks, and multiple task-dependent decoders. First, we will establish some baselines by training single-task models and comparable multi-task models. Second, we will investigate novel decoder architectures that can extract the information from the encoder by selectively attending to different layers and neurons depending on the task. Third, we will work on modeling the dependencies between different tasks. In this manner, the knowledge obtained from one task can help to improve other tasks, e.g., speaker and accent identification could help to improve ASR. For this, we will use chain-rule-based decoders or models inspired by graph neural networks.
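One simple way to let each task draw on different encoder layers is a softmax weighting over layer outputs, with one set of weights per task. The sketch below illustrates only that idea in plain Python. It is an assumption made for illustration, not the decoder architecture this project will develop, and the layer outputs, task names, and logits are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def task_view(layer_outputs, layer_logits):
    """Combine encoder layer outputs with task-specific layer weights.

    layer_outputs: one feature vector per encoder layer (already pooled over time).
    layer_logits:  one unnormalized weight per layer, learned per task in practice.
    """
    weights = softmax(layer_logits)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
        for d in range(dim)
    ]

# Hypothetical shared-encoder outputs: 3 layers, 2 features each.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical per-task layer preferences.
task_logits = {
    "speaker_id": [2.0, 0.0, 0.0],  # leans on the first layer
    "asr": [0.0, 0.0, 2.0],         # leans on the last layer
}

# The encoder is evaluated once; each task only re-weights its layers.
views = {task: task_view(layers, logits) for task, logits in task_logits.items()}
```

Each task-dependent decoder would then consume its own view, so adding a task adds a decoder and a weight vector but no extra encoder pass.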
During the first year, we will mainly focus on segment-level tasks, i.e., speaker, language, accent, emotion, age, gender, and spoofing detection; but we will also include ASR. In the following years, we will include more tasks with sequence-level outputs like speaker/language diarization. To account for all these tasks, we will pool multiple datasets: CommonVoice, VoxCeleb, CN-Celeb, VoxLingua, ASVSpoof2019-21, MSP-Podcast. We will also investigate the optimal strategy to sample data from the pool to perform well in all tasks. For this, we will need to consider the imbalance between datasets in terms of hours and speakers.
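A common baseline for sampling from an imbalanced pool is temperature-based sampling, where each dataset is drawn with probability proportional to its size raised to the power 1/T. The sketch below shows that baseline only; it is not the strategy this project will arrive at. The dataset names and hour counts are hypothetical placeholders, not statistics of the corpora listed above.

```python
def sampling_probs(hours, temperature=1.0):
    """Probability of drawing from each dataset, proportional to hours ** (1 / T).

    temperature = 1 samples in proportion to size; larger values flatten the
    distribution so that small datasets are seen more often.
    """
    scaled = {name: h ** (1.0 / temperature) for name, h in hours.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical dataset sizes in hours (placeholders, not real statistics).
hours = {"large_set": 1000.0, "medium_set": 100.0, "small_set": 10.0}

proportional = sampling_probs(hours, temperature=1.0)
smoothed = sampling_probs(hours, temperature=5.0)
```

Raising the temperature trades fidelity to the natural data distribution for more frequent exposure to the smaller datasets, which is one of the trade-offs a sampling study would need to measure per task.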
Green AI: Powerful and Lightweight Machine Learning via Exploiting Symmetries
In this project we investigate the use of symmetries and low dimensional structures in the design of machine learning models. Enforcing these mathematical structures will allow us to reduce the energy consumption, time, and amounts of data required for training and evaluating machine learning models while preserving (or even improving) their performance.
This project proposes to use symmetries and low dimensional structures such as manifolds to constrain the representation learning of dynamical data. The goal is to reduce the size of data representations, and computational complexity of machine learning models, by enforcing certain mathematical structure. This structure (low dimensional manifolds, possibly with symmetries) could either be known prior to the model design, which is the case of equivariances in problems like protein folding [15, 1], or learned from data in a self-supervised way. Enforcing this structure will give the correct inductive bias to the machine learning models. It will allow us to reduce the energy consumption, time, and amounts of data required for training and evaluating machine learning models while preserving (or even improving) their performance.
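To make "enforcing a symmetry" concrete, one standard construction is group averaging: any function can be made invariant to a finite group of transformations by averaging it over all transformed copies of the input. The sketch below uses cyclic shifts of a sequence as the symmetry group. This is an illustrative assumption, since the project may rely on other constructions such as equivariant layers, and the function `f` is a hypothetical stand-in for a model.

```python
def cyclic_shifts(x):
    """All cyclic shifts of a list: the orbit of x under the shift group."""
    return [x[k:] + x[:k] for k in range(len(x))]

def symmetrize(f):
    """Return a shift-invariant version of f by averaging over the group orbit."""
    def invariant_f(x):
        shifts = cyclic_shifts(x)
        return sum(f(s) for s in shifts) / len(shifts)
    return invariant_f

def f(x):
    """Hypothetical model: position-weighted sum, which is not shift invariant."""
    return sum(i * v for i, v in enumerate(x))

g = symmetrize(f)

x = [1.0, 2.0, 3.0, 4.0]
x_shifted = x[1:] + x[:1]
# f gives different values for x and x_shifted; g gives the same value for both.
```

Because the invariance is built into `g`, it does not have to be learned from shifted copies of the training data, which is the source of the data and compute savings the paragraph above refers to.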
Improving Spoken Language Understanding for People with Atypical Speech
In this project we will create a new dataset and develop new speech technologies meant to improve the lives of persons with atypical speech and speech impairment.
Speech impairment is usually characterized by atypical articulation, phonation, respiration, prosody, or a combination thereof. The ability to communicate effectively, such as via speech, is essential to safety, independence, social engagement, and quality of life [1–6]. Unfortunately, this communication method is lost to persons with speech impairment (PwSI) because family, caregivers, peers, and assistive devices, such as smart speakers or voice dialing, often fail to comprehend them. Spoken language understanding (SLU) and automatic speech recognition (ASR) systems are not always trained to understand atypical speech, because training commonly requires hours of exposure to a PwSI before learning their speech patterns, and this is not commonly available. Using new corpora including atypical speech and intent labels, it will be possible to train and adapt new SLU systems for PwSI, which can be crucial for their interaction with speech assistants. Thus, we propose two innovative specific aims that will lead to technologies and methods meant to improve the lives of PwSI through development of machine conversation partners for PwSI.
Specific Aim 1: Create an atypical speech corpus with speech materials optimized for the training of SLU and potentially for ASR, and make it publicly available. Approach: We shall record speech samples from at least 100 PwSI, including participants with severe hearing impairment, Cerebral Palsy (CP), Parkinsonian and related movement disorders (PD), Stroke, Multiple Sclerosis (MS), and Bulbar-Onset Amyotrophic Lateral Sclerosis (ALS), as well as participants without speech impairment (as control participants), under a consent form that permits distribution. Participants shall record fluent speech commands used for controlling a speech assistant. These utterances will be elicited several times and will have intent annotations, in a similar way as the Fluent Speech Commands Dataset from Fluent.ai. Additionally, they will read selections from novels. Impact: The creation and distribution of this corpus will permit researchers worldwide to advance augmentative and alternative communication and listener-focused interventions for individuals with atypical speech.
Specific Aim 2: Use the collected corpus to develop new end-to-end SLU models for PwSI. Approach: We will use the atypical speech corpus along with other publicly available corpora, such as the Fluent Speech Commands Dataset from Fluent.ai, to train end-to-end SLU approaches for PwSI. Impact: The use of speech assistants by PwSI has been demonstrated to enhance the patient’s quality of life and independence [1–6].
Integrating Knowledge Representation of LLMs with Information Extraction Systems
In the past few years, new types of AI models that capture patterns in language have become very good at learning information from language.
This project explores how we can use information learned by these models to inform practical applications on language data, such as identifying important features or characteristics of products in product reviews.
The past few years have seen tremendous advancements in the training of large language models (LLMs). While masked language models (MLMs) (e.g., BERT) are the backbone of modern information extraction (IE) systems, autoregressive language models (ALMs) capture knowledge from text that can inform IE decisions. We explore how to combine the ability of ALMs to extract facts and knowledge from a large corpus with the task-specific performance of MLM-based models for information extraction tasks. We develop methods for data augmentation that enable the ALM to generate data based on the needs of the task, as defined by both the available task-specific training data and uncertainty under the MLM. The resulting systems will better adapt to new domains and tasks with minimal training data.
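One plausible reading of "uncertainty under the MLM" is the entropy of the MLM-based model's predicted label distribution, with the most uncertain inputs used to decide what the ALM should generate more of. The sketch below shows only that selection step in plain Python. The entropy criterion, the example ids, and the probabilities are all assumptions for illustration, not the project's method or data.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, k):
    """Return the k example ids whose predicted distribution has the highest entropy."""
    ranked = sorted(
        predictions,
        key=lambda ex_id: entropy(predictions[ex_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical label distributions from an MLM-based extraction model.
predictions = {
    "review_1": [0.98, 0.01, 0.01],  # confident
    "review_2": [0.40, 0.35, 0.25],  # very uncertain
    "review_3": [0.70, 0.20, 0.10],  # somewhat uncertain
}

# Examples that would be used to steer generation by the autoregressive model.
augmentation_targets = most_uncertain(predictions, k=2)
```

In a full pipeline the selected examples would seed prompts for the ALM, and the generated data would be added to the MLM's training set.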
Online Domain Adaptation via Distributionally Robust Learning
This project aims to enable fast and robust adaptation for AI algorithms via modeling uncertainty.
This project aims to enable domain adaptation when the target domain data arrive in a few-shot and online fashion. In many domain adaptation applications, the algorithm must be pretrained on an offline source data set and adapt to different target domains when the target data arrive continuously during deployment. This is also called the lifelong learning or continual learning setting in the literature. For example, assistive AI agents may be pretrained on data from young user groups but need to adapt to elderly users once deployed. Data from the elderly group is also much scarcer than data from the younger group. Conventional domain adaptation methods, like those in unsupervised domain adaptation, focus on only one-step adaptation and require large amounts of unlabeled target data, and are thus not suitable for continuous and few-shot online domain adaptation.
In this project, we aim to utilize the distributionally robust learning framework for source learning, uncertainty estimation, and online adaptation. When target data arrives, we extrapolate the model according to the uncertainty model. The project consists of three tasks: 1) investigating offline uncertainty estimation methods under domain shift with no or only few-shot target data; 2) designing model update rules to achieve extrapolation for online adaptation in the target domain; 3) analyzing the whole learning process and establishing a theoretical understanding of the sample efficiency. We plan to use classification problems in language understanding or vision recognition as the exemplar applications.
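As background on distributionally robust learning: instead of minimizing the average loss, the learner minimizes the loss under the worst-case reweighting of the data within some uncertainty set. The proposal does not state which uncertainty set it uses. The sketch below assumes a KL-penalized set around the uniform distribution, for which the worst-case weights are an exponential tilting of the per-example losses. The losses and the parameter `tau` are hypothetical.

```python
import math

def worst_case_weights(losses, tau=1.0):
    """Adversarial example weights q_i proportional to exp(loss_i / tau).

    This is the maximizer of sum_i q_i * loss_i - tau * KL(q || uniform).
    Small tau concentrates weight on the hardest examples; large tau
    approaches uniform weights (ordinary average loss).
    """
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((l - m) / tau) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_loss(losses, tau=1.0):
    """Expected loss under the worst-case weights."""
    weights = worst_case_weights(losses, tau)
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical per-example losses; the last example resembles a shifted domain.
losses = [0.1, 0.2, 2.0]
average = sum(losses) / len(losses)
robust = reweighted_loss(losses, tau=0.5)
```

The gap between the robust and the average loss is one crude indicator of how unevenly the model performs across the data, which is the kind of signal an uncertainty model for adaptation could build on.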
Rapid Multilingual Dataset Creation with Automatic Projection and Human Supervision
Artificial intelligence (AI) in general, and natural language processing (NLP) in particular, require a massive scale of data to learn strong models. Such data might not be available in languages other than high-resource ones such as English. In this project, we study the rapid creation of multilingual data sets by automatically translating and aligning an available data set in one language into multiple other languages. We will also study the impact of human supervision in improving data quality. Once we have created these resources, we intend to use them to co-train single multilingual models for cross-lingual NLP tasks.
Multilingual data projection creates cross-lingual datasets for a variety of downstream Natural Language Processing (NLP) tasks. Frequently, this is done through automatic methods such as machine translation and data augmentation. However, our recent work has shown that human-labelled data has an outsized impact on performance over large amounts of pretraining data. Yet leveraging cheaper automatic data is still integral to creating cross-lingual datasets and models. In practice, a mixture of both human annotations and corrections, as well as automatic methods, can yield the highest quality models. However, best practices for extracting the maximum value from the minimum amount of human intervention remain significantly understudied in the broader NLP literature.
In this proposal, we intend to systematically study the impact of human supervision in a variety of NLP scenarios, such as word alignment, translation, and cross-lingual semantics. We intend to conduct error analysis of human corrections as well as creating multilingual datasets. Overall, the goal will be to create pipelines for rapid dataset creation. Once we have created these resources, we intend to use them to learn multilingual models. Previous work has suggested that training on combined gold and projected datasets can improve performance. We intend to study this further by incorporating the fact that the gold and projected data are parallel instances, thus a model can be co-trained on them. Overall, we broadly divide the proposed work into two parts: rapid multilingual dataset creation, and learning multilingual models trained on such datasets.
Weakly-Supervised Multi-Modal Transformers for Few-Shot Learning with Generalization to Novel Domains and Fine-Grained Tasks
The goal of this project is to learn models of vision which are only weakly supervised by language.
Self-supervised and weakly supervised transformers have been shown to be highly effective for a variety of vision, language, and vision-language tasks. This proposal targets three challenges. First, to improve performance on standard tasks, particularly on fine-grained tasks (e.g., object attributes and parts), which have received little study. Second, to develop tokenizer approaches to enable few-shot, and ideally zero-shot, learning. Third, to adapt these approaches so that they are able to generalize to novel domains and to out-of-distribution situations. We propose five strategies to achieve these goals which include extending the tokenizer-based approaches, modifying the transformer structure, increasing the text-annotations to help these difficult tasks, and techniques for enabling the algorithms to generalize out-of-domain and out-of-distribution.