The Johns Hopkins University + Amazon Initiative for Artificial Intelligence (AI2AI) has selected 7 JHU WSE Faculty research projects for its inaugural round of AI2AI Faculty Research Awards. Research areas covered in these projects include Speech and Language, Green AI, Computer Vision, and Responsible AI Approaches for Data.
2023-2024 Faculty Research Awards
Self-supervision for Skeleton-based Learning of Actions
Existing AI systems need a lot of “hand holding” when learning new skills. The research undertaken for this AI2AI project will build AI systems that learn on their own to a large extent. This will lead to better AI systems for domains where supervision, or hand-holding, is difficult and does not scale.
Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to the data augmentations that are used to craft the positives. Augmenting pose sequences is a difficult task, however, as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose to build on a contrastive learning approach developed with the support of an Amazon Fellowship given to Mr. Anshul Shah, a student of Rama Chellappa, during the first year of the AI2AI effort. The goal of that work was to train models for skeleton-based action recognition without labels by hallucinating latent positives for contrastive learning. Specifically, we explored the latent space of poses in suitable directions to generate new positives. This requires an optimization formulation to solve for the synthetic positives with explicit control over their hardness. We proposed approximations to the objective that make it solvable in closed form with minimal overhead. Preliminary experiments show that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II. We will collaborate with Amazon researchers on further hardening the proposed approach and testing it on real-life sequences from Amazon to validate its effectiveness and robustness.
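To make the role of positives and their hardness concrete, the following is a minimal sketch of a standard InfoNCE-style contrastive loss for a single anchor. It is not the project's method: the closed-form generation of hallucinated positives is not described in enough detail here to reproduce, so the "easy" and "hard" positives below are hand-picked toy vectors, and the function names and temperature value are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor, one positive, and a list of negatives.

    The temperature of 0.1 is an illustrative assumption.
    """
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive.
    return log_denominator - logits[0]


# Toy latent vectors (hypothetical, not real pose embeddings).
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
easy_positive = [0.9, 0.1]   # close to the anchor
hard_positive = [0.5, 0.5]   # farther from the anchor, so harder

loss_easy = info_nce(anchor, easy_positive, negatives)
loss_hard = info_nce(anchor, hard_positive, negatives)
```

In this toy setting the harder positive produces the larger loss, which is the sense in which controlling the hardness of a positive controls how strong a training signal it provides.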
Fair and Private NLP for High-Risk Data
This project aims to improve the fairness and privacy of AI in high-stakes public-serving domains like social services and medicine. The key approach is developing ways to create high-quality synthetic data, which can be used to improve models without compromising privacy.
Widespread availability of public digitized text has greatly facilitated the advancement of natural language processing (NLP), leading in turn to public-interfacing models like ChatGPT. Text processing could also be extremely valuable for processing high-stakes private data, like social workers’ notes, healthcare records, or customer data. Information extraction could help a social worker quickly search hundreds of notes for a caretaker’s contact information (Gandhi et al., 2022), and initiatives like Amazon Comprehend Medical demonstrate the demand for NLP in healthcare settings. However, safety risks and the need for responsible data practices hinder the development and deployment of models in these domains. Models are prone to absorbing and amplifying data biases, and potential unfairness makes models unusable in high-stakes domains. Furthermore, text data is extremely difficult to fully anonymize. Poorly anonymized high-stakes data cannot be shared with annotators or external researchers, nor can it be used to train deployed models, as generative models are prone to outputting sensitive information from training data (Carlini et al., 2023). This proposal aims to address these challenges by developing text generation tools to create realistic synthetic data that can facilitate research and model development while improving model fairness and minimizing privacy violations. The core technical contribution includes the development of constraints for controllable text generation that balance the trade-off between removing private or biased information and ensuring synthetic text is realistic enough to be useful. To facilitate developing models for real private data, we will draw on the PI’s ongoing partnership with the Allegheny County Department of Human Services to examine notes about child welfare cases (Gandhi et al., 2022; Field et al., 2023) and will also draw from existing medical NLP research (Johnson et al., 2016; Uzuner et al., 2011; Gandhi et al., 2021).
This work is critical for building text processing systems in high-risk settings with private data. The ability to synthesize and anonymize data would enable products designed over data internal to Amazon, such as public-facing dialog agents for interacting with customers, as well as products designed for external private data, such as expansions of Amazon Comprehend Medical or new tools for processing social workers’ notes.
On-device Compressed Models for Speaker Diarization
This project aims to explore on-device speech processing with an emphasis on efficient diarization, which involves identifying speakers in a recording. The primary objective is to discover compression and optimization methods that can reduce the significant computational requirements.
On-device speech processing is attracting more attention as machine learning algorithms continue to improve, but how big a model needs to be is still an open question. Diarization is no exception to this trend: current state-of-the-art systems achieve low diarization error rates using large-scale self-supervised learning (SSL) models. However, these systems are rarely deployable because they are inefficient, computationally expensive, and have special technical requirements. Model compression and subsampling techniques have been shown to address these problems to some extent. Although the same methods can be used for various downstream tasks (including automatic speech recognition, speaker identification, emotion recognition, etc.), we aim to focus on diarization, as it is usually one of the first stages of any speech-processing application. In this proposal, we will study how to build efficient diarization models based on self-supervised models that can be deployed on-device. The core pipeline uses a state-of-the-art end-to-end neural diarization system (EEND-EDA), in which an SSL model replaces the encoder. As we want to focus on the impact of the self-supervised models, this proposal considers two avenues: compression and subsampling.
On the one hand, we propose investigating successful compression methods such as weight pruning, head pruning, low-rank approximation and knowledge distillation. We will jointly optimize the self-supervised and diarization losses to achieve our purpose. Our starting point is a WavLM self-supervised model combined with EEND-EDA. The WavLM will act as the encoder of the EEND-EDA model, whereas the rest of the EEND-EDA modules will become a back-end. For the pruning options and the low-rank approximation, our approach will trim a set of weights and optimize the whole system until convergence. For the distillation, inspired by DistilHuBERT, we will select specific layers from the Transformer and compute prediction heads optimized for diarization. On the other hand, we will study the effect of shortening the sequences, i.e., subsampling along the time axis in the self-supervised learning models. We follow this strategy because subsampling has been shown to improve performance and speed up inference. Two main approaches will be explored: variable-length subsampling (for example, pooling variable-length frame representations into segment representations) and fixed-length subsampling (for example, removing odd frames). Our starting point will be our distilled version of WavLM, used as the encoder of the EEND-EDA model. Using a student-teacher approach, the student will minimize the distance between the sequence of teacher representations and a projection (prediction heads) of its own representations. The prediction heads are connected to the EEND-EDA and optimized accordingly.
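The two subsampling strategies mentioned above can be illustrated with a minimal sketch. It assumes that frame representations are plain lists of floats and that segment boundaries are already known; how boundaries would be obtained in practice is not specified here. The function names `fixed_subsample` and `pool_segments` are hypothetical and do not come from the proposal or from any library.

```python
def fixed_subsample(frames, factor=2):
    """Fixed-length subsampling: keep every `factor`-th frame.

    With factor=2 this keeps frames 0, 2, 4, ... and so removes the
    odd-indexed frames.
    """
    return frames[::factor]


def pool_segments(frames, boundaries):
    """Variable-length subsampling: mean-pool the frames of each segment.

    `boundaries` is a list of (start, end) index pairs, end exclusive.
    Each segment of frames is replaced by one averaged representation.
    """
    pooled = []
    for start, end in boundaries:
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return pooled


# Six toy 2-dimensional frame representations (made-up values).
frames = [[float(i), float(i)] for i in range(6)]

halved = fixed_subsample(frames)                    # 6 frames -> 3 frames
segments = pool_segments(frames, [(0, 2), (2, 6)])  # 6 frames -> 2 segments
```

Both functions shorten the sequence that downstream layers must process; the fixed-length variant reduces it by a constant factor, while the variable-length variant reduces it to the number of segments.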
The combination of the compression methods with the subsampling techniques will provide insights into how aggressive the subsampling and the compression can be and where the boundaries are. Finally, to document the performance, this study will use diarization error rate (DER), real-time factor, multiply-accumulate operations, the number of parameters, and wall clock time as metrics. We plan to test our algorithms on the Callhome, AMI, and DIHARD III datasets.
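As a rough illustration of two of these metrics, the sketch below uses the commonly used definitions: DER as the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time, and real-time factor as processing time divided by audio duration. The function names and the example durations are assumptions made for illustration, not results from the project, and scoring details such as forgiveness collars or overlap handling are left out.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Made-up durations for a hypothetical recording.
der = diarization_error_rate(
    missed=3.0, false_alarm=2.0, confusion=5.0, total_speech=100.0
)
rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
```

Note that DER can exceed 1.0 when a system produces a large amount of false-alarm speech, since the denominator only counts reference speech.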
Convergence of Speech, Language and Translation Models
Large language models have taken over much of NLP, but not yet machine translation, which is trained on relatively large data resources. This project explores how to merge the success of large language models and machine translation, with special attention to spoken language.
Large language models and neural machine translation have made great progress in recent years. While large language models have shown some success at zero-shot or in-context learning of the translation task, they have not yet reached the state of the art. We propose to combine the strengths of both: the ability of large language models to model wider multi-sentence context and to draw on larger amounts of training data, and the translation models’ supervised focus on the actual task.
Language-guided Universal Domain Adaptation
Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. As a result, these approaches suffer from severe degradation of performance when evaluated on images sampled from a distribution different from that of the training images. In this project, we will develop a vision-language model guided domain adaptation method to address this challenge.
Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. As a result, these approaches suffer from severe degradation of performance when evaluated on images sampled from a distribution different from that of the training images. Such scenarios are encountered frequently in the real world. For example, consider the case of self-driving cars, where the classifiers and detectors are typically trained on datasets obtained from one particular city or environmental condition (the source domain) and are expected to be deployed in a different city or environment (the target domain). It is therefore important to develop approaches that enable better generalization of classifiers and detectors. Domain Adaptation (DA) methods aim to transfer knowledge from a labeled source domain to an unlabeled target domain.
Existing DA methods rely on prior knowledge about the relationship between source and target domain label sets. Specifically, closed-set DA or open-set DA methods are employed depending on how the source and target domain label sets relate. In real-world scenarios, selecting a suitable domain adaptation method is not practical, as no prior knowledge regarding the target domain label set is given. To overcome these issues, we propose to develop a vision-language model guided universal domain adaptation method, which aims to handle both domain shift and label shift between domains in the wild.
Comparing Large Language Models Using Data Kernels
In 2019, a neural language model known as BERT was introduced that produces general-purpose representations of language called embeddings. The embeddings from these models can be used to dramatically reduce the amount of data and computation required to train a model on a downstream task. More recently, advances in language model scaling, prompt design, and modality fusion have led to the rapid and near-ubiquitous adoption of large language models (LLMs) across industry and academia. However, principled statistical evaluation methodologies for LLMs have not kept pace with this mass adoption. In particular, the most extensive attempts to characterize and evaluate LLMs involve benchmarking their performance on a wide variety of datasets using performance metrics. We believe that comparing language models using their performance on benchmark datasets alone is inadequate for fully characterizing their similarities and differences. A more comprehensive statistical comparison framework is necessary.
We propose a framework for comparing and contrasting the representation spaces of deep neural networks – specifically, large language models (LLMs), such as ChatGPT before & after the introduction of Reinforcement Learning from Human Feedback (RLHF) – that is simultaneously computationally practical, statistically principled, mathematically tractable, and visually interpretable. Our methodology is based on joint spectral embedding of the models’ data kernels into low-dimensional Euclidean space, which provides the foundation for both principled statistical inference and powerful visualization tools. Our proposal for using intrinsic (graph embedding) geometry to evaluate different LLMs fits nicely into topics of Speech and Language (“knowledge extraction, representation, and injection into speech recognition and language understanding systems”), as well as Responsible AI Approaches for Data (“quantifying and mitigating bias in training data to attain equitable model performance”).
Developing an Evaluation Protocol for Contextualized ASR
This project aims to create a corpus and tests that help scientists across the world improve automatic speech recognition (ASR) methods in cases where some additional external information is available. One example is improving the captioning of a scientific talk by providing a list of words from the scientist’s previous publications.
Knowing the particular context (in a broad sense) associated with the speaker can help improve the performance of an automatic speech recognition (ASR) system. For example, if we are provided during inference with a list of in-context words or phrases, such as the speaker’s location, their contact list, or previously played playlists, we can bias the recognition process towards this list. Generally speaking, contextualized ASR is a route to personalized yet highly scalable ASR, since there is no need to keep adapted language or acoustic models around. Contextualized ASR is already an active research field; however, there is no standardized test bed for evaluating and benchmarking alternative approaches. This arguably slows research progress and keeps potential industrial users from adopting and benefiting from these developments. We propose to develop an evaluation protocol that covers a wide variety of scenarios in which context information can be incorporated into the recognition process.
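One simple way to picture biasing recognition towards a context list is to rescore an n-best list of hypotheses, adding a bonus for each word that appears in the list. The sketch below is only an illustration under that assumption; it is not the evaluation protocol proposed here, nor the mechanism of any particular ASR system. The function name `rescore_with_context`, the bonus value, and the example hypotheses and scores are all hypothetical.

```python
def rescore_with_context(nbest, bias_words, bonus=1.0):
    """Pick the best hypothesis after rewarding words from a context list.

    `nbest` is a list of (hypothesis_text, log_score) pairs, higher is better.
    Each token found in `bias_words` (case-insensitive) adds `bonus` to the score.
    """
    bias = {word.lower() for word in bias_words}

    def biased_score(item):
        text, score = item
        hits = sum(1 for token in text.lower().split() if token in bias)
        return score + bonus * hits

    return max(nbest, key=biased_score)[0]


# Made-up n-best list: without context, the first hypothesis scores higher.
nbest = [
    ("call john smith", -3.5),
    ("call jon smyth", -4.0),
]

without_context = rescore_with_context(nbest, bias_words=[])
with_context = rescore_with_context(nbest, bias_words=["Jon", "Smyth"])
```

Here the contact-list entries shift the decision to the second hypothesis, and no adapted language or acoustic model is needed. A standardized test bed would let such approaches be compared on the same contexts and recordings.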