Bayesian Consensus Algorithm for Multi-Class Data Labeling

Data labeling is a critical bottleneck in machine learning pipelines. While crowdsourcing platforms provide access to numerous annotators, their labels often contain noise and inconsistencies. This work presents a Bayesian approach to aggregate noisy labels from multiple annotators, determining the true label through probabilistic inference.

The Problem

Consider a multi-class classification task where we need to label a dataset, but instead of having access to ground truth labels, we have:

  • noisy labels from multiple annotators, each of whom makes mistakes
  • only estimates of each annotator's reliability, obtained from a limited number of previously labeled samples

The challenge is to combine these noisy labels in a principled way that accounts for annotator reliability and produces high-confidence predictions.

Theoretical Foundation

Annotator Model

Each annotator $k$ is modeled by a confusion matrix where $p_{ij}^{(k)}$ represents the probability that annotator $k$ proposes class $i$ given that the true class is $j$:

$$P(\text{annotator } k \text{ says } i \mid \text{true class is } j) = p_{ij}^{(k)}$$

The diagonal elements $p_{ii}^{(k)}$ represent the annotator's precision for class $i$. Higher values indicate more reliable annotators for that specific class.
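
For example, with three classes, an annotator $k$ might have the following confusion matrix (the numbers are illustrative; column $j$ holds the probabilities of each reported label when the true class is $j$, so each column sums to 1):

$$p^{(k)} = \begin{pmatrix} 0.90 & 0.08 & 0.05 \\ 0.06 & 0.85 & 0.10 \\ 0.04 & 0.07 & 0.85 \end{pmatrix}$$

Here $p_{11}^{(k)} = 0.90$ is the annotator's precision for class 1, while $p_{21}^{(k)} = 0.06$ is the probability of reporting class 2 when the true class is 1.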

Bayesian Inference

Given a set of votes from different annotators, we use Bayes' theorem to compute the posterior probability that the true label is class $E$:

$$P(\text{class } E \mid \text{votes}) = \frac{P(\text{votes} \mid \text{class } E) \cdot P(\text{class } E)}{P(\text{votes})}$$

The key insight is that votes from different annotators are conditionally independent given the true class, so we can write:

$$P(\text{votes} \mid \text{class } E) = \prod_{k \in \text{annotators}} P(\text{vote}_k \mid \text{class } E)$$
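
As a quick sanity check of this factorization, the following minimal NumPy sketch (the priors and confusion matrices are made-up and not part of the implementation below) computes the posterior for a two-class problem in which two annotators both vote for class 0:

import numpy as np

# Columns are indexed by the true class and sum to 1
prior = np.array([0.5, 0.5])
p1 = np.array([[0.9, 0.2],
               [0.1, 0.8]])
p2 = np.array([[0.7, 0.4],
               [0.3, 0.6]])

# Both annotators report class 0: multiply their likelihoods entrywise
likelihood = p1[0, :] * p2[0, :]
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)  # ~[0.887, 0.113], so class 0 is strongly favored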

Conservative Precision Estimation

To ensure robustness, we use a conservative estimate of annotator precision. For each annotator, we replace the point estimate with a one-sided lower confidence bound, chosen so that with confidence $\beta$ the estimate does not exceed the true precision:

$$\hat{p}_{ii}^{(k)} = p_{ii}^{(k)} - z_\beta \sqrt{\frac{p_{ii}^{(k)}(1-p_{ii}^{(k)})}{n_{ii}^{(k)}}}$$

where $n_{ii}^{(k)}$ is the number of samples used to estimate the precision, and $z_\beta$ is the $\beta$-quantile of the standard normal distribution.
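
A hypothetical helper illustrating the bound (the function name and example values are ours, not part of the implementation below):

import numpy as np
from scipy import stats

def precision_inferior(p_hat, n, beta=0.95):
    '''Lower confidence bound for a precision estimate p_hat based on n samples.'''
    z_beta = stats.norm.ppf(beta)  # beta-quantile of the standard normal
    return p_hat - z_beta * np.sqrt(p_hat * (1 - p_hat) / n)

print(precision_inferior(0.9, 100))  # ~0.85 instead of the point estimate 0.9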

Algorithm Overview

Multi-Class Consensus Labeling Algorithm

  1. Initialize: Set prior class probabilities and confidence threshold $\alpha$
  2. For each data point:
    • Initialize a vote dictionary mapping each class to an empty list of voters
    • While confidence below $\alpha$:
      • Select random annotator (or use recommendation algorithm)
      • Generate label from annotator's distribution
      • Add vote to corresponding class
      • Compute posterior probabilities using Bayes' rule
      • Check if max posterior exceeds $\alpha$
    • Assign label with highest posterior probability
  3. Return: Consensus labels with confidence guarantees

Implementation

User Generation

First, we simulate a pool of annotators with varying precision levels. Each annotator is characterized by their confusion matrix:

import random

import numpy as np
from scipy import stats

# `num_clases` (number of classes) and `props` (prior class probabilities)
# are module-level globals used by the functions below.

def generar_usuarios(n, a, m):
    '''Create n users with simulated precision matrices.
    
    Args:
        n: Number of users
        a: Minimum samples per matrix cell before rescaling
        m: Approximate total number of samples per user, spread across the matrix
    
    Returns:
        List of tuples (confusion_matrix, sample_counts)
    '''
    usuarios = []
    matrs = []
    
    for u in range(n):
        # Generate diagonal precisions from a Pareto distribution,
        # folding the rare negative draws back to positive values
        ress = np.abs(1 - np.random.pareto(15 + random.randint(0, u), num_clases))
        
        # Build confusion matrix with the precisions on the diagonal
        matr = np.asarray(ress) * np.eye(num_clases)
        valores = np.zeros(matr.shape)
        
        # Fill off-diagonal elements (error rates) with random weights
        for i in range(matr.shape[0]):
            for j in range(matr.shape[1]):
                if i != j:
                    valores[i, j] = random.random()
        
        # Normalize the error weights so the off-diagonal mass of each column
        # sums to 1, then scale it by the remaining probability (1 - precision)
        valores = np.eye(num_clases) * np.sum(valores, axis=0) + valores
        valores = valores / np.diag(valores)
        valores = valores - np.eye(num_clases)
        matr = matr + valores * (np.diag(np.eye(num_clases) - matr).reshape(num_clases, 1))
        matr = matr.T
        matr = matr / np.sum(matr, axis=0)  # columns (true classes) sum to 1
        matrs.append(matr)
    
    # Generate sample counts for each precision estimate (roughly m samples
    # per user in total)
    for i in range(len(matrs)):
        ns = np.random.randint(a, 1000, num_clases * num_clases).reshape((num_clases, num_clases))
        ns = ns / ns.sum()
        ns = np.floor(ns * m)
        usuarios.append((matrs[i], ns))
    
    return usuarios
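
A possible way to build a pool (the parameter values are illustrative), assuming num_clases is set before the call:

num_clases = 4                                  # module-level global used throughout
usuarios = generar_usuarios(n=20, a=10, m=500)
matr, ns = usuarios[0]                          # confusion matrix and sample counts of the first user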

Conservative Precision Estimates

We underestimate each annotator's precision to provide confidence guarantees:

def lista_subestimada(lista, prob_sub):
    '''Underestimate user precisions with confidence prob_sub.
    
    Note: stats.norm.ppf(prob_sub) is added to the diagonal, so prob_sub should
    be small (e.g. 0.05) to obtain a negative quantile and hence an underestimate,
    matching the formula above with beta = 1 - prob_sub.
    '''
    nueva_lista = []
    
    for i in range(len(lista)):
        # Work on a copy of the user's confusion matrix
        nueva_lista.append(np.array(lista[i][0], copy=True))
        
        for j in range(num_clases):
            # Conservative estimate of the diagonal (precision) using the
            # normal approximation to the binomial confidence interval
            nueva_lista[i][j, j] = (
                lista[i][0][j, j] + 
                (np.sqrt(lista[i][0][j, j] * (1 - lista[i][0][j, j]) / 
                        np.sum(lista[i][1], axis=0)[j])) * 
                stats.norm.ppf(prob_sub)
            )
            
            # Rescale the error rates so column j sums to 1 again
            r = 1 - nueva_lista[i][j, j]
            z = np.sum(nueva_lista[i][:, j]) - nueva_lista[i][j, j]
            temp = np.ones(num_clases) * r / z
            temp[j] = 1
            nueva_lista[i][:, j] = nueva_lista[i][:, j] * temp
    
    return nueva_lista
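
A possible call (the value 0.05 is illustrative and corresponds to $\beta = 0.95$ in the formula above):

usuariossub = lista_subestimada(usuarios, prob_sub=0.05)   # conservative copies of the pool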

Bayesian Label Aggregation

The core algorithm aggregates votes using Bayes' theorem:

def prob_correcta(E, dic_votantes, lista):
    '''Compute posterior probability that the true label is E.
    
    Args:
        E: Candidate class label
        dic_votantes: Dictionary mapping each class to the list of users who voted for it
        lista: List of user confusion matrices
    
    Returns:
        Posterior probability for class E
    '''
    probs = np.asarray(props, dtype=float).copy()  # Prior class probabilities (module-level global)
    
    for k in range(num_clases):
        for i, vot in dic_votantes.items():
            # Each voter who reported class i contributes P(says i | true class k),
            # i.e. entry [i, k] of that voter's confusion matrix
            for votante in vot:
                probs[k] = probs[k] * lista[votante][i, k]
    
    # Normalize to get posterior probabilities
    probs = probs / np.sum(probs)
    return probs[E]
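
An illustrative call (props is given a uniform value here for concreteness; users 0 and 3 vote for class 1 and user 7 for class 2):

props = np.ones(num_clases) / num_clases      # uniform prior over classes
votos = {k: [] for k in range(num_clases)}
votos[1] = [0, 3]
votos[2] = [7]
print(prob_correcta(1, votos, usuariossub))   # posterior mass on class 1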

Complete Labeling Process

def etiquetar(usuarios, usuariossub, data, alpha):
    '''Label the dataset using the consensus algorithm.
    
    Args:
        usuarios: True user confusion matrices (used only to simulate votes)
        usuariossub: Conservative user estimates (used for inference)
        data: Ground truth labels (for simulation)
        alpha: Confidence threshold
    
    Returns:
        (vote_collections, final_labels)
    '''
    coleccion = []
    new_col = []
    
    for i in range(len(data)):
        # One voter list per class for this data point
        coleccion.append({k: [] for k in range(num_clases)})
        
        p = 0
        while p == 0:
            # Select a random annotator and simulate their vote
            etiquetador = random.randrange(len(usuarios))
            etiqueta, _ = generar_guess(etiquetador, usuarios, i)
            
            # Record the vote under the class the annotator reported
            coleccion[i][etiqueta].append(etiquetador)
            
            # Check whether any class now exceeds the confidence threshold
            for l in range(num_clases):
                if prob_correcta(l, coleccion[i], usuariossub) > alpha:
                    new_col.append(l)
                    p = 1
                    break
    
    return coleccion, new_col
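
The loop above relies on generar_guess, which is not listed in this write-up. The sketch below is our assumption of what it does, consistent with how it is called: it samples a label for data point i from the chosen annotator's confusion-matrix column for the true class, and returns that distribution as a second value (which etiquetar ignores):

def generar_guess(usuario, usuarios, i):
    '''Sketch of the simulation helper (not the original implementation).
    Assumes the ground-truth labels `data` are accessible at module level.'''
    columna = usuarios[usuario][0][:, data[i]]        # P(reported label | true class of point i)
    etiqueta = np.random.choice(num_clases, p=columna)
    return etiqueta, columna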

Experimental Results

The algorithm was tested on synthetic datasets with varying numbers of classes and annotators. Key findings include:

Performance Metrics

  • Precision: Consistently achieves >95% accuracy when $\alpha > 0.9$
  • Efficiency: Requires fewer labels per data point compared to majority voting
  • Robustness: Gracefully handles annotators with poor precision
  • Confidence: Provides calibrated uncertainty estimates

The validation function computes the final precision by comparing consensus labels against ground truth:

def get_precs(c):
    '''Compute per-class precision from a confusion matrix (diagonal over column sum).'''
    precs = np.zeros(num_clases)
    for n in range(num_clases):
        precs[n] = c[n, n] / np.sum(c, axis=0)[n]
    return precs

def get_conf_final(etiquetas, data):
    '''Build the confusion matrix (rows = consensus label, columns = ground truth)
    from the final labels returned by etiquetar.'''
    tabla = np.zeros((num_clases, num_clases))
    for i in range(len(data)):
        tabla[etiquetas[i], data[i]] += 1
    return tabla
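
Putting the pieces together, a small end-to-end simulation might look like the following (all parameter values are illustrative, and generar_guess is the sketch above); note that get_conf_final expects the final labels returned by etiquetar, not the vote collections:

num_clases = 4
props = np.ones(num_clases) / num_clases              # uniform prior
data = np.random.randint(0, num_clases, 200)          # simulated ground truth

usuarios = generar_usuarios(n=20, a=10, m=500)
usuariossub = lista_subestimada(usuarios, prob_sub=0.05)
coleccion, etiquetas = etiquetar(usuarios, usuariossub, data, alpha=0.9)

conf = get_conf_final(etiquetas, data)
print(get_precs(conf))                                # per-class precision of the consensus labels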

Applications and Future Work

Real-World Applications

Potential Improvements

We are currently exploring an alternative approach that does not rely on underestimation and gives more precise control over annotator uncertainty. This could lead to even more efficient label collection.

Additional research directions include:

Conclusion

This Bayesian consensus algorithm provides a principled approach to multi-class data labeling with noisy annotators. By modeling annotator reliability through confusion matrices and applying Bayesian inference with conservative estimates, the algorithm achieves high accuracy while providing confidence guarantees.

The approach is particularly valuable in scenarios where obtaining ground truth is expensive or impossible, and where combining multiple imperfect opinions is the only viable strategy. The probabilistic framework naturally handles varying annotator quality and provides transparent uncertainty quantification.