Data labeling is a critical bottleneck in machine learning pipelines. While crowdsourcing platforms provide access to numerous annotators, their labels often contain noise and inconsistencies. This work presents a Bayesian approach for aggregating noisy labels from multiple annotators, inferring the true label of each example through probabilistic inference.
The Problem
Consider a multi-class classification task where we need to label a dataset, but instead of having access to ground truth labels, we have:
- Multiple annotators with varying levels of expertise
- Estimated precision for each annotator (from historical data)
- Noisy labels from each annotator on each data point
- Uncertainty about the true class distribution
The challenge is to combine these noisy labels in a principled way that accounts for annotator reliability and produces high-confidence predictions.
Theoretical Foundation
Annotator Model
Each annotator $k$ is modeled by a confusion matrix where $p_{ij}^{(k)}$ represents the probability that annotator $k$ proposes class $i$ given that the true class is $j$:
$$P(\text{annotator } k \text{ says } i \mid \text{true class is } j) = p_{ij}^{(k)}$$

The diagonal elements $p_{ii}^{(k)}$ represent the annotator's precision for class $i$. Higher values indicate more reliable annotators for that specific class.
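For intuition, here is a hypothetical confusion matrix for a single annotator on a three-class task; the numbers are illustrative only. Column $j$ holds $P(\text{says } i \mid \text{true class } j)$, so each column sums to 1 and the diagonal holds the per-class precisions.

import numpy as np

# Hypothetical annotator: rows = proposed class, columns = true class
p_k = np.array([
    [0.90, 0.05, 0.10],
    [0.06, 0.85, 0.10],
    [0.04, 0.10, 0.80],
])
assert np.allclose(p_k.sum(axis=0), 1.0)   # each column is a conditional distribution
print(np.diag(p_k))                        # per-class precisions: [0.9, 0.85, 0.8]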
Bayesian Inference
Given a set of votes from different annotators, we use Bayes' theorem to compute the posterior probability that the true label is class $E$:
$$P(\text{class } E \mid \text{votes}) = \frac{P(\text{votes} \mid \text{class } E) \cdot P(\text{class } E)}{P(\text{votes})}$$

The key insight is that votes from different annotators are conditionally independent given the true class, so we can write:
$$P(\text{votes} \mid \text{class } E) = \prod_{k \in \text{annotators}} P(\text{vote}_k \mid \text{class } E)$$
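To make the update concrete, here is a minimal sketch with made-up numbers: two hypothetical annotators on a three-class task, both of whom vote for class 0. None of these values come from the experiments; they only illustrate the two formulas above.

import numpy as np

prior = np.array([1/3, 1/3, 1/3])   # uniform prior over the three classes

# conf[a][i, j] = P(annotator a says i | true class j); columns sum to 1
conf = [
    np.array([[0.90, 0.05, 0.10],
              [0.06, 0.85, 0.10],
              [0.04, 0.10, 0.80]]),
    np.array([[0.70, 0.20, 0.15],
              [0.20, 0.60, 0.15],
              [0.10, 0.20, 0.70]]),
]
votes = [0, 0]   # both annotators propose class 0

# Likelihood of the votes under each candidate true class (conditional independence)
likelihood = np.ones(3)
for a, vote in enumerate(votes):
    likelihood *= conf[a][vote, :]

posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)   # class 0 dominates: roughly [0.96, 0.02, 0.02]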
Conservative Precision Estimation
To ensure robustness, we use a conservative estimate of each annotator's precision: with probability approximately $1-\beta$, the estimate falls below the true precision. Using a normal approximation to the binomial confidence interval,

$$\hat{p}_{ii}^{(k)} = p_{ii}^{(k)} - z_\beta \sqrt{\frac{p_{ii}^{(k)}(1-p_{ii}^{(k)})}{n_{ii}^{(k)}}}$$

where $n_{ii}^{(k)}$ is the number of samples used to estimate the precision and $z_\beta = \Phi^{-1}(1-\beta)$ is the upper $\beta$-quantile of the standard normal distribution.
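As a rough sense of scale (illustrative numbers, mirroring the `scipy.stats.norm.ppf` call used in the implementation below), an annotator observed to be 90% precise over 50 samples receives a noticeably lower conservative estimate:

import numpy as np
from scipy import stats

p_obs = 0.90   # observed per-class precision
n = 50         # samples behind that estimate
beta = 0.05    # probability that the conservative estimate still overshoots the truth

z_beta = stats.norm.ppf(1 - beta)                         # upper beta-quantile, about 1.64
p_cons = p_obs - z_beta * np.sqrt(p_obs * (1 - p_obs) / n)
print(round(p_cons, 3))                                    # about 0.830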
Algorithm Overview
Multi-Class Consensus Labeling Algorithm
- Initialize: Set prior class probabilities and confidence threshold $\alpha$
- For each data point:
  - Initialize an empty vote dictionary for each class
  - While confidence is below $\alpha$:
    - Select a random annotator (or use a recommendation algorithm)
    - Generate a label from the annotator's distribution
    - Add the vote to the corresponding class
    - Compute posterior probabilities using Bayes' rule
    - Check whether the maximum posterior exceeds $\alpha$
  - Assign the label with the highest posterior probability
- Return: Consensus labels with confidence guarantees
Implementation
User Generation
First, we simulate a pool of annotators with varying precision levels. Each annotator is characterized by their confusion matrix:
import random

import numpy as np
from scipy import stats

# num_clases (number of classes) and props (prior class probabilities) are assumed
# to be defined at module level before these functions are called.

def generar_usuarios(n, a, m):
    '''Create n users with simulated precision matrices.
    Args:
        n: Number of users
        a: Minimum samples for precision estimation
        m: Maximum samples for precision estimation
    Returns:
        List of tuples (confusion_matrix, sample_counts)
    '''
    usuarios = []
    matrs = []
    for i in range(n):
        # Generate diagonal precisions from a Pareto distribution
        ress = 1 - np.random.pareto(15 + random.randint(0, i), num_clases)
        # Flip any negative draws so the precisions stay positive
        ress = np.abs(ress)
        # Build confusion matrix: start from the diagonal precisions
        matr = np.asarray(ress) * np.eye(num_clases)
        valores = np.zeros(matr.shape)
        # Fill off-diagonal elements (error rates) with random mass
        for fila in range(matr.shape[0]):
            for col in range(matr.shape[1]):
                if fila != col:
                    valores[fila, col] = random.random()
        # Normalize the error mass, scale it by each class's remaining probability
        # (1 - precision), then renormalize so each column (true class) sums to 1
        valores = np.eye(num_clases) * np.sum(valores, axis=0) + valores
        valores = valores / np.diag(valores)
        valores = valores - np.eye(num_clases)
        matr = matr + valores * (np.diag(np.eye(num_clases) - matr).reshape(num_clases, 1))
        matr = matr.T
        matr = matr / np.sum(matr, axis=0)
        matrs.append(matr)
    # Generate sample counts for each precision estimate
    for i in range(len(matrs)):
        ns = np.random.randint(a, 1000, num_clases * num_clases).reshape((num_clases, num_clases))
        ns = ns / ns.sum()
        ns = np.floor(ns * m)
        usuarios.append((matrs[i], ns))
    return usuarios
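A hypothetical call (the argument values and the module-level `num_clases` are illustrative, not taken from the original experiments):

num_clases = 3                                  # assumed module-level class count
usuarios = generar_usuarios(n=20, a=10, m=500)  # 20 simulated annotators
matriz, muestras = usuarios[0]
print(np.diag(matriz))                          # per-class precisions of the first annotator
print(matriz.sum(axis=0))                       # each column (true class) sums to 1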
Conservative Precision Estimates
We underestimate each annotator's precision to provide confidence guarantees:
def lista_subestimada(lista, prob_sub):
    '''Underestimate user precisions with confidence prob_sub.'''
    nueva_lista = []
    for i in range(len(lista)):
        nueva_lista.append(np.asarray(list(lista[i][0])))
        for j in range(num_clases):
            # Compute conservative estimate using normal approximation
            nueva_lista[i][j, j] = (
                lista[i][0][j, j] +
                (np.sqrt(lista[i][0][j, j] * (1 - lista[i][0][j, j]) /
                         np.sum(lista[i][1], axis=0)[j])) *
                stats.norm.ppf(prob_sub)
            )
            # Renormalize error rates
            r = 1 - nueva_lista[i][j, j]
            z = np.sum(nueva_lista[i][:, j]) - nueva_lista[i][j, j]
            temp = np.ones(num_clases) * r / z
            temp[j] = 1
            nueva_lista[i][:, j] = nueva_lista[i][:, j] * temp
    return nueva_lista
Bayesian Label Aggregation
The core algorithm aggregates votes using Bayes' theorem:
def prob_correcta(E, dic_votantes, lista):
    '''Compute posterior probability that true label is E.
    Args:
        E: Candidate class label
        dic_votantes: Dictionary mapping classes to list of voters
        lista: List of user confusion matrices
    Returns:
        Posterior probability for class E
    '''
    probs = list(props)  # Prior probabilities
    for k in range(num_clases):
        for i, vot in dic_votantes.items():
            if len(vot) >= 1:
                for le in range(len(vot)):
                    if i == k:
                        # Voter said k, true class is k
                        probs[k] = probs[k] * lista[vot[le]][k, k]
                    else:
                        # Voter said i, true class is k
                        probs[k] = probs[k] * lista[vot[le]][i, k]
    # Normalize to get posterior probabilities
    probs = probs / np.sum(probs)
    return probs[E]
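For clarity, `dic_votantes` maps each class to the list of user indices that proposed it. A hypothetical call, assuming `num_clases = 3`, a uniform module-level `props`, and conservative matrices `usuariossub` produced by `lista_subestimada` above:

# Hypothetical vote state: users 0 and 1 both proposed class 0, no other votes
dic_votantes = {0: [0, 1], 1: [], 2: []}
posterior_0 = prob_correcta(0, dic_votantes, usuariossub)   # P(true class = 0 | votes)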
Complete Labeling Process
def etiquetar(usuarios, usuariossub, data, alpha):
    '''Label dataset using consensus algorithm.
    Args:
        usuarios: True user confusion matrices
        usuariossub: Conservative user estimates
        data: Ground truth (for simulation)
        alpha: Confidence threshold
    Returns:
        (vote_collections, final_labels)
    '''
    coleccion = []
    new_col = []
    for i in range(len(data)):
        coleccion.append({k: [] for k in range(num_clases)})
        p = 0
        while p == 0:
            # Select random annotator
            etiquetador = random.sample(range(len(usuarios)), 1)
            # generar_guess (not shown here) draws a label for item i from the
            # chosen annotator's distribution
            etiqueta, mumu = generar_guess(etiquetador[0], usuarios, i)
            # Record vote
            coleccion[i][etiqueta].append(etiquetador[0])
            # Check if any class exceeds confidence threshold
            for l in range(num_clases):
                if prob_correcta(l, coleccion[i], usuariossub) > alpha:
                    new_col.append(l)
                    p = 1
                    break
    return coleccion, new_col
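Putting the pieces together, a hypothetical end-to-end run on synthetic data might look like the sketch below. The dataset, priors, and the `generar_guess` helper are assumptions: the original code for `generar_guess` is not shown, so here it simply samples a label from the chosen annotator's confusion-matrix column for the item's true class.

import random
import numpy as np

# Illustrative globals (assumptions, not values from the original experiments)
num_clases = 3
props = [1 / num_clases] * num_clases                       # uniform prior
data = [random.randrange(num_clases) for _ in range(200)]   # synthetic ground truth

def generar_guess(usuario, usuarios, i):
    '''Assumed helper: sample a label for item i from annotator usuario's
    confusion-matrix column for the true class data[i].'''
    matr = usuarios[usuario][0]
    etiqueta = int(np.random.choice(num_clases, p=matr[:, data[i]]))
    return etiqueta, matr[etiqueta, data[i]]

usuarios = generar_usuarios(20, 10, 500)
usuariossub = lista_subestimada(usuarios, 0.05)
coleccion, etiquetas = etiquetar(usuarios, usuariossub, data, alpha=0.95)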
Experimental Results
The algorithm was tested on synthetic datasets with varying numbers of classes and annotators. Key findings include:
Performance Metrics
- Precision: Consistently achieves >95% accuracy when $\alpha > 0.9$
- Efficiency: Requires fewer labels per data point compared to majority voting
- Robustness: Gracefully handles annotators with poor precision
- Confidence: Provides calibrated uncertainty estimates
The validation function computes the final precision by comparing consensus labels against ground truth:
def get_precs(c):
    '''Compute precision for each class from a confusion matrix.'''
    precs = np.zeros(num_clases)
    for n in range(num_clases):
        precs[n] = c[n, n] / np.sum(c, axis=0)[n]
    return precs

def get_conf_final(coleccion, data):
    '''Build confusion matrix comparing consensus labels (coleccion) with ground truth (data).'''
    tabla = np.zeros((num_clases, num_clases))
    for i in range(len(data)):
        tabla[coleccion[i], data[i]] = tabla[coleccion[i], data[i]] + 1
    return tabla
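Continuing the hypothetical run sketched above, the consensus labels can then be scored against the simulated ground truth:

tabla = get_conf_final(etiquetas, data)   # rows: consensus label, columns: true class
print(get_precs(tabla))                    # per-class fraction of correctly recovered labels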
Applications and Future Work
Real-World Applications
- Medical Diagnosis: Aggregating opinions from multiple doctors
- Content Moderation: Combining judgments from content reviewers
- Image Annotation: Crowdsourced labeling for computer vision datasets
- Sentiment Analysis: Consensus on subjective text classifications
Potential Improvements
We are currently exploring an alternative approach that does not rely on underestimation and instead offers more precise control over annotator uncertainty. This could lead to even more efficient label collection.
Additional research directions include:
- Adaptive Sampling: Intelligent annotator selection based on their strengths
- Active Learning: Prioritizing uncertain examples for additional annotation
- Online Learning: Updating annotator models as more data becomes available
- Class Imbalance: Incorporating class-specific costs into the decision rule
Conclusion
This Bayesian consensus algorithm provides a principled approach to multi-class data labeling with noisy annotators. By modeling annotator reliability through confusion matrices and applying Bayesian inference with conservative estimates, the algorithm achieves high accuracy while providing confidence guarantees.
The approach is particularly valuable in scenarios where obtaining ground truth is expensive or impossible, and where combining multiple imperfect opinions is the only viable strategy. The probabilistic framework naturally handles varying annotator quality and provides transparent uncertainty quantification.