BBQ
BBQ, or the Bias Benchmark for QA, evaluates an LLM's ability to generate unbiased responses across a range of attested social biases. It consists of 58K unique three-choice questions spanning bias categories such as age, race, gender, religion, and more. You can read more about the BBQ benchmark and its construction in this paper.
BBQ evaluates model responses at two levels for bias:
- How the responses reflect social biases given insufficient context.
- Whether the model's bias overrides the correct choice given sufficient context.
Arguments
There are two optional arguments when using the BBQ benchmark:
- [Optional] tasks: a list of tasks (BBQTask enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of BBQTask enums can be found here.
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
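For instance, instantiating the benchmark with no arguments uses these defaults (a minimal sketch; the Example section below shows a fuller setup):
from deepeval.benchmarks import BBQ
# Uses the defaults: all BBQ tasks and n_shots=5
benchmark = BBQ()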
Example
The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on age and gender-related biases using 3-shot prompting.
from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask
# Define benchmark with specific tasks and shots
benchmark = BBQ(
tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
n_shots=3
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is computed by exact matching: it is the proportion of questions for which the model produces exactly the correct multiple-choice answer (e.g. 'A' or 'C') out of the total number of questions.
As a result, using more few-shot examples (n_shots) can greatly improve the model's ability to produce answers in the exact expected format and boost the overall score.
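To make the scoring concrete, here is a minimal sketch of how an exact-match score like this can be computed; the predictions and expected_answers lists below are hypothetical and not part of deepeval's API:
# Hypothetical model outputs and gold answers, for illustration only
predictions = ["A", "C", "B", "C"]
expected_answers = ["A", "B", "B", "C"]
# Exact matching: a prediction counts only if it equals the gold letter exactly
correct = sum(pred == gold for pred, gold in zip(predictions, expected_answers))
overall_score = correct / len(expected_answers)  # 0.75 in this example
print(overall_score)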
BBQ Tasks
The BBQTask enum classifies the range of bias categories covered in the BBQ benchmark.
from deepeval.benchmarks.tasks import BBQTask
bbq_tasks = [BBQTask.AGE]
Below is the comprehensive list of available tasks:
- AGE
- DISABILITY_STATUS
- GENDER_IDENTITY
- NATIONALITY
- PHYSICAL_APPEARANCE
- RACE_ETHNICITY
- RACE_X_SES
- RACE_X_GENDER
- RELIGION
- SES
- SEXUAL_ORIENTATION
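If you prefer to pass every category explicitly rather than relying on the default, you can enumerate the full enum. This is a brief sketch, assuming BBQTask is a standard Python Enum so that list(BBQTask) yields every member:
from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask
# Assumption: BBQTask is a standard Python Enum, so this collects every category
all_tasks = list(BBQTask)
benchmark = BBQ(tasks=all_tasks)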