Working With What You’ve Got: Leveraging Mislabeled Datasets And Improving Imperfect Pretrained Models

Abstract

Resources such as OpenML and HuggingFace have made large datasets and powerful pretrained models more accessible than ever for deep learning practitioners and researchers. However, the large-scale datasets typically used to train deep learning systems are often plagued by noisy labels, where the label associated with a datapoint may be incorrect. Likewise, many pretrained models exhibit biased outputs and lack the full range of functionality desired by end users. In this dissertation, I study four topics related to data and model quality issues.

Extending the Capabilities of Learning in Positive-Unlabeled Noisy Label Settings: In the first two tasks, I focus on the understudied Positive Unlabeled (PU) setting for noisy labels. In a PU dataset, some of the positive instances are labeled, while the remaining positives and the negative instances are not distinguished from each other. This can be thought of as one-sided label noise, where some positive instances have their label flipped to the negative class. This label quality issue is common across datasets for many tasks and domains, from computer vision to biomedical data. For instance, computer vision datasets for object detection often provide annotations in the form of a list of objects present in a given image; any object absent from that list is effectively an unlabeled positive. Despite its importance, current work in the Positive Unlabeled setting has limitations. Specifically, existing methods typically assume a binary classification setting and that there is no sample selection bias in determining which instances are labeled. In my dissertation, I set out to address these shortcomings.

Task 1: Extending Positive Unlabeled Learning to Multi-Label Data. In this task, I extend methods for learning from Positive Unlabeled data, which are typically limited to binary classification, to work with multi-label data and multi-label classifiers. To do this, I formalize a novel unbiased risk that allows models to be trained on noisy PU data alone while remaining unbiased with respect to the distribution of clean data. Experimental results on common multi-label datasets show that our method is significantly more accurate in predicting the correct label set than alternative approaches, especially as the level of PU label noise increases.

Task 2: Modeling Biased PU Sample Selection. Here, I study PU learning under the more realistic scenario of a biased sampling strategy that leads to unrepresentative labels, whereas previous PU work almost exclusively assumed unbiased labeling mechanisms. In this task, I analyze when identifiable PU learning is theoretically possible, and I then propose two strategies for performing it under a set of reasonable assumptions. The results indicate that our approaches nearly always approximate the true labeling likelihood and significantly outperform existing methods on a suite of common benchmark datasets.
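The dissertation's multi-label risk (Task 1) is not reproduced in this abstract; as background, the following minimal sketch shows the standard binary unbiased PU risk estimator, with the common non-negative correction, that such work builds on. The class prior `prior` is assumed to be known or estimated separately, and all names are illustrative.

```python
import torch

def nnpu_risk(scores_pos, scores_unl, prior):
    """Non-negative unbiased PU risk with a sigmoid surrogate loss.

    scores_pos: raw model scores on the labeled-positive examples
    scores_unl: raw model scores on the unlabeled examples
    prior:      class prior pi = P(y = +1), assumed known or estimated
    """
    loss = lambda z, t: torch.sigmoid(-t * z)      # surrogate loss in [0, 1]
    risk_pos = loss(scores_pos, 1.0).mean()        # positives scored as +1
    risk_pos_neg = loss(scores_pos, -1.0).mean()   # positives scored as -1
    risk_unl_neg = loss(scores_unl, -1.0).mean()   # unlabeled scored as -1
    # Unbiased decomposition of the negative-class risk: R_n = R_u^- - pi * R_p^-
    risk_neg = risk_unl_neg - prior * risk_pos_neg
    # Clamping (the non-negative correction) prevents the estimated
    # negative risk from going below zero and driving overfitting.
    return prior * risk_pos + torch.clamp(risk_neg, min=0.0)

# Illustrative usage with random scores:
risk = nnpu_risk(torch.randn(64), torch.randn(256), prior=0.4)
```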
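Task 2's contribution is estimating the labeling likelihood under biased selection; those estimation strategies are likewise not spelled out here. The sketch below only shows how an already-estimated propensity score e(x) = P(labeled | x, y = +1) can be plugged into a propensity-weighted PU risk (in the style of prior SAR-PU work), where each labeled positive is counted as 1/e(x) positives and (1 - 1/e(x)) negatives.

```python
import torch

def propensity_weighted_risk(scores_lab, propensity, scores_unl):
    """PU risk reweighted by a per-example labeling propensity.

    scores_lab: model scores on the labeled (positive) examples
    propensity: e(x) in (0, 1] for those same examples
    scores_unl: model scores on the unlabeled examples
    """
    loss = lambda z, t: torch.sigmoid(-t * z)
    w = 1.0 / propensity                          # inverse-propensity weights
    # Each labeled positive stands in for w positives minus (w - 1) negatives.
    risk_lab = (w * loss(scores_lab, 1.0) + (1.0 - w) * loss(scores_lab, -1.0)).sum()
    risk_unl = loss(scores_unl, -1.0).sum()       # unlabeled treated as negative
    n = scores_lab.numel() + scores_unl.numel()
    return (risk_lab + risk_unl) / n

# Illustrative usage; propensities drawn uniformly from [0.5, 1.0]:
risk = propensity_weighted_risk(torch.randn(32), 0.5 + 0.5 * torch.rand(32), torch.randn(128))
```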
Extending the Applicability of Pretrained Generative Models: The last two tasks focus on extending the usability of pretrained generative models. Specifically, I study the task of debiasing generative models, as well as adding conditional generation capabilities to unconditional pretrained models. Because finetuning or training generative models is prohibitively expensive, and access to clean, unbiased data is often limited, I focus on finetuning-free, unsupervised solution strategies.

Task 3: Debiasing Pretrained Generative Models Without Retraining. In this task, I focus on debiasing pretrained generative models. To do this, I formalize the concept of a semantically uniform distribution: a data distribution that places an equal amount of mass on each possible value of a semantic attribute, such as race or gender. I propose a principled approach for re-sampling from the generator's latent space in order to yield a semantically uniform output distribution, which allows us to debias the generator without retraining the model. Experimental analysis on multiple types of generative models (GANs, VAEs, and DDIMs) shows that our approach reduces the bias of the generative model's output significantly more than existing approaches on a variety of common image datasets.

Task 4: Converting Unconditional Pretrained Generators into Conditional Models Without Retraining or Supervision. Lastly, I propose an approach for converting unconditional generative models, which model a data distribution but do not allow the user to choose which class to sample from, into conditional models that can be made to sample from specific classes. I achieve this by identifying and removing regions of the latent space that correspond to low-density regions in the output space, and then clustering the remaining regions. Each cluster in the new latent space can be shown to correspond to a semantically meaningful sub-manifold in the output space; e.g., each sub-manifold corresponds to a particular class. By fitting a Gaussian mixture model to each of these clusters, we can then selectively sample from a given sub-manifold in the output space in order to generate a sample from a desired mode. Experimental results indicate that the clusters found using our approach are significantly correlated with unique classes in the data space.
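The abstract does not give Task 3's principled re-sampling scheme, so the following is only a minimal sketch of the general rejection-sampling idea behind semantic uniformity. It assumes a hypothetical generator `generate` and attribute classifier `predict_attr` (neither is from the dissertation): attribute values the generator over-produces are accepted with proportionally lower probability, so all K values receive equal mass.

```python
import numpy as np

def sample_semantically_uniform(generate, predict_attr, attr_probs, n, dim, seed=0):
    """Rejection-sample generator outputs so a K-valued attribute is uniform.

    attr_probs[k]: the generator's (estimated) marginal P(attribute = k).
    """
    rng = np.random.default_rng(seed)
    accept = attr_probs.min() / attr_probs   # rarest value is always accepted
    samples = []
    while len(samples) < n:
        z = rng.standard_normal(dim)         # draw a latent from the prior
        x = generate(z)
        k = predict_attr(x)
        if rng.random() < accept[k]:         # thin over-represented values
            samples.append(x)
    return samples
```

Among accepted samples the attribute marginal is proportional to attr_probs[k] * accept[k], which is constant in k, i.e. uniform; the price paid is discarded draws rather than any retraining.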
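Similarly for Task 4, the abstract gives only the shape of the pipeline (prune low-density latent regions, cluster what remains, fit a Gaussian mixture per cluster). The sketch below strings those steps together with scikit-learn, using a hypothetical `density_scores` array and k-means for the clustering step; both are illustrative stand-ins, not the dissertation's actual choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def fit_conditional_sampler(latents, density_scores, n_modes, threshold, seed=0):
    """Cluster pruned latents and fit one Gaussian mixture per cluster.

    latents:        (n, d) latent codes drawn from the generator's prior
    density_scores: hypothetical per-latent output-space density estimates
    """
    kept = latents[density_scores > threshold]   # drop low-density regions
    labels = KMeans(n_clusters=n_modes, random_state=seed).fit_predict(kept)
    # One (single-component, for simplicity) Gaussian mixture per cluster.
    return [GaussianMixture(n_components=1, random_state=seed).fit(kept[labels == k])
            for k in range(n_modes)]

# Conditional sampling from mode k, then decoding with the generator:
#   z_new, _ = gmms[k].sample(16)
#   images = generate(z_new)    # `generate` is the pretrained generator
```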

Identifier
  • etd-113891
Year
  • 2023
Date created
  • 2023-10-06
Source
  • etd-113891
Last modified
  • 2024-01-25

Permanent link to this page: https://digital.wpi.edu/show/xg94ht679