Human biases can influence multiple stages of data management, from collection to interpretation and use. This thesis develops and investigates computational techniques for mitigating such biases. We begin by describing approaches that use tools from natural language processing to mitigate the effects of bias introduced in the data collection phase. We then develop tools for assessing the robustness of data-driven claims.
In many cases, the data itself encodes human biases, which in turn propagate into the machine learning models trained on it. A prominent example arises in clinical trials, where women have historically been underrepresented. This underrepresentation harms the performance of language models and of the downstream predictions built on them. Rather than erasing gender information, we propose methods that adjust embeddings to account for underrepresentation while preserving medically meaningful distinctions, thereby increasing prediction accuracy for female patients without degrading accuracy for male patients.
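To make the general idea concrete, the minimal sketch below (Python with NumPy) illustrates the principle of adjusting, rather than erasing, a gender-related component of an embedding: only the projection onto an estimated gender direction is rescaled, while the orthogonal components, which carry the medically meaningful signal, are left intact. The function name, the projection-based adjustment, and the scaling parameter are illustrative assumptions, not the specific method developed in this thesis.

```python
import numpy as np

def adjust_gender_component(vec, gender_dir, alpha=0.5):
    """Rescale only the component of `vec` along the (unit-normalized) gender
    direction; components orthogonal to it are left unchanged. alpha=1 would
    erase the gender component entirely; 0 < alpha < 1 merely attenuates it."""
    gender_dir = gender_dir / np.linalg.norm(gender_dir)
    projection = np.dot(vec, gender_dir) * gender_dir
    return vec - alpha * projection

# Hypothetical usage: `gender_dir` would be estimated, e.g., from differences
# between embeddings of paired terms such as ("female", "male"); here both
# vectors are random stand-ins for clinical-term embeddings.
rng = np.random.default_rng(0)
gender_dir = rng.normal(size=50)
clinical_term_vec = rng.normal(size=50)
adjusted_vec = adjust_gender_component(clinical_term_vec, gender_dir)
```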
Even when the underlying data is not inherently biased, it can still be selectively interpreted to support predetermined conclusions, a phenomenon known as cherry-picking. To assess the robustness of claims based on a database, we propose two complementary methods. The first is a schema-based approach: given a claim and a database, we identify natural views, or subpopulations, of the database in which the claim holds. The naturalness and the number of views supporting the claim and its opposite help the user assess the correctness of the claim. The second method is data-driven: we quantify and explain the deviation of a dataset from an expected monotonic trend (such as salaries increasing with education) by finding a minimal repair, the smallest possible set of tuples whose removal restores the expected trend.
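In the simplest one-dimensional setting, where tuples are ordered by the attribute expected to drive the trend (say, years of education) and the salary values are expected to be non-decreasing, a minimal repair corresponds to the complement of a longest non-decreasing subsequence. The sketch below, with hypothetical data, illustrates this reduction; it is a simplified illustration under that assumption, not the repair algorithm developed in the thesis.

```python
from bisect import bisect_right

def minimal_repair(values):
    """Return indices of a smallest set of tuples whose removal leaves the
    remaining values non-decreasing, computed as the complement of a longest
    non-decreasing subsequence (patience sorting, O(n log n))."""
    if not values:
        return []
    tails = []                   # smallest tail value for each subsequence length
    tail_idx = []                # index in `values` of that tail
    prev = [-1] * len(values)    # back-pointers for reconstruction
    for i, v in enumerate(values):
        pos = bisect_right(tails, v)   # ties may extend the non-decreasing run
        if pos == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[pos] = v
            tail_idx[pos] = i
        if pos > 0:
            prev[i] = tail_idx[pos - 1]
    keep = set()                 # indices of one longest non-decreasing subsequence
    i = tail_idx[-1]
    while i != -1:
        keep.add(i)
        i = prev[i]
    return [i for i in range(len(values)) if i not in keep]

# Hypothetical example: salaries listed in order of increasing education.
salaries = [30, 35, 33, 40, 38, 45]
print(minimal_repair(salaries))  # -> [1, 3]; dropping 35 and 40 restores the trend
```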