IOB’s 17 evaluation quality criteria serve as a comprehensive framework for assessing and monitoring the quality of project and programme-level evaluations commissioned by the Ministry of Foreign Affairs or by its partner organizations. In addition, IOB uses the evaluation quality criteria in its own research. The criteria have been structured chronologically across three different phases in the evaluation process: the Terms of Reference, the elaborated methodology, and the evaluation report.
The document IOB evaluation criteria 2024 (Zip) presents the 17 criteria, accompanied by an explanation of why IOB deems an evaluation practice ‘good’, ‘adequate’, or ‘inadequate’, along with illustrative examples.
This page presents the 17 criteria, grouped by phase.
We recognise that the art and craft of policy evaluation is constantly evolving. We welcome feedback and suggestions via iob@minbuza.nl.
Evaluation criteria in phase 1: Terms of Reference

When formulating the Terms of Reference (ToR), the first subset of criteria (1 – 8) helps shape aspects such as the project description, evaluation objectives, scope, evaluation questions, and quality control.
The ToR should broadly present and assess the intervention theory and assumptions of the project under evaluation as this will provide useful input for formulating the evaluation questions.
The reference group plays a crucial role in guaranteeing the quality and independence of the evaluation process. It serves as an advisory body to the evaluation commissioner, offering guidance on the Terms of Reference (ToR), the selection of evaluators, the detailed methodology section (in the inception report), and the draft evaluation report.
The reference group typically comprises the evaluation commissioner, a representative from an implementing partner organization, and individuals with experience in relevant themes and evaluation methodologies. It should include at least one but preferably two or more independent members. To ensure downward accountability and sufficiently take the context into account, IOB recommends including at least one reference group member from a target country.
The evaluators and affiliated organisations should not have been involved in the design or implementation of the project under evaluation – either with the organisation responsible for implementation or at the MFA. Additionally, they must maintain complete impartiality with no vested interest in the ultimate outcome of the evaluation.
The ToR should present an adequate description of the national, sectoral, and political context in which the project or programme has been implemented, and also mention the rationale and problem analysis of the project. If available, the ToR should also present baseline data on project indicators and other relevant result indicators. Note that benchmarks for assessing progress, such as baseline data, do not have to be exclusively quantitative and can also encompass qualitative data.
The ToR should present a specification of a project’s intervention logic, ‘policy theory’, or Theory of Change (ToC). When there is no intervention logic or ‘policy theory’, the evaluator may need to reconstruct one by using project documentation.
The ToR should describe and illustrate how and why a desired change is expected to happen in a specific context: the description of the ToC should articulate the implicit and explicit assumptions underpinning the relationships in the results chain (e.g. between output and outcome) and may assess the existing evidence base for these assumptions and links. Herein, it is essential to describe ‘the missing middle’ between a project and the desired outcome and impact.
In addition, the (reconstructed) ToC should consider contextual external factors that may, as ‘rival explanations’, influence results at various levels in the results chain (input, activity, output, (intermediate) outcome, impact).
The ToR should clearly delineate the evaluation period, geographical focus, and the specific policy domain (programmes, projects) and expenditure that will be part of the evaluation. It should also specify which particular components of the intervention theory or ToC are to be included in the evaluation.
The ToR should distinctly articulate the evaluation’s objectives and the intended use of its results. It should explain who the intended users of the evaluation results are and for what purpose the results will be used. Typically an evaluation informs policymakers, funders, and (parliament’s) policy and financial oversight organizations, but it could also inform other stakeholders, including implementing agencies and final beneficiaries.
Ensuring clarity in the ToR about the specific objectives can shape the research questions and the type of recommendations required. Evaluations often serve multiple objectives, which can be categorized into:
- Knowledge objectives: These involve understanding whether the stated objectives have been achieved, contributing to accountability. They also seek to uncover lessons about what works, why, and how, thereby contributing to learning.
- Action objectives: These aim at providing recommendations. An action objective can determine when evaluation results are needed, for example for subsequent project phases or formulation of new policy.
The ToR should carefully consider the selection of relevant OECD/DAC Evaluation Criteria and other cross-cutting topics.
Given the specific evaluation objectives (criterion 6) and scope (criterion 5), not all OECD/DAC evaluation criteria (relevance, coherence, effectiveness, efficiency, sustainability, impact) may be applicable. The ToR should include an assessment of which OECD/DAC evaluation criteria will be integrated into the evaluation questions.
The ToR may also incorporate other relevant cross-cutting issues (e.g. gender, localisation, poverty reduction, or climate), accompanied by a rationale.
In order to optimize the applicability of the OECD/DAC criteria ‘relevance’ and ‘coherence’ in evaluation research, the criteria are put in a logical order:
- relevance
- coherence
- effectiveness
- efficiency
- sustainability
- impact
For the ‘relevance’ criterion, two dimensions are central:
- the extent to which policy objectives, design and implementation of policy interventions are in line with the needs of the target group and other stakeholders
- the extent to which the policy intervention is based on prior available, context-specific knowledge about possible effectiveness and the necessary preconditions (evidence-based)
If the ToR identifies ‘impact’ as an evaluation criterion, the evaluation should also assess unintended effects. This includes effects on the non-target population and the broader effects of the intervention on society, rather than only assessing the stated project objectives.
The ToR should include a main research question and a number of evaluation questions in a precise and testable manner, aligning with the evaluation objective (criterion 6), the scope of the evaluation (criterion 5), and the relevant OECD/DAC evaluation criteria and cross-cutting topics (criterion 7).
Formulating overly broad or vague evaluation questions or mechanically translating the OECD/DAC criteria into questions should be avoided. At the same time, one should also steer clear of crafting too many detailed evaluation questions, which could dilute the focus of the evaluation. The evaluation questions should strike a balance between being ‘realistically ambitious’ and respecting the practical limitations of the evaluation, including constraints related to time, travel possibilities, budget, and the availability of information.
Evaluation criteria in phase 2: Elaborated methodology

The second subset of criteria (9 – 14) helps assess the elaborated methodology, as outlined by the evaluator, e.g. in an ‘inception report’ or ‘technical proposal’. Based on the assessment of these criteria, a commissioner may ask evaluators to adjust the methodology.
This phase also includes a re-assessment of the first subset of criteria, to verify that the elaborated methodology is in line with the ToR.
The evaluation’s research design structures the overall approach and methods. The approach and methods must be appropriate to answer the evaluation questions and achieve the evaluation’s objective in a valid and reliable manner. As the purpose of and possibilities for evaluation projects will differ, the research design may encompass multiple evaluation methods: how you evaluate depends on what you evaluate.
Evaluation robustness may be increased using multiple methods or data sources – a strategy known as ‘triangulation’.
Triangulation is a strategy to enhance the validity and reliability of the findings by cross-verifying information from different perspectives, in order to provide a more comprehensive and accurate understanding of what is being evaluated. Triangulation can refer to the use of multiple research methods, multiple data collection methods (e.g. interviews, surveys, observations), or multiple data sources (e.g. persons, documents, project sites).
The ‘inception report’ - or ‘technical proposal’ - should elaborate on the research design, highlighting why the method or combination of methods has been selected and how it is expected to validly and reliably contribute to answering the research questions.
The selected evaluation method(s) should be appropriate to assess the contribution or attribution of the projects or interventions to observed results at the outcome or impact levels. Evaluators may use qualitative evaluation method(s) to evaluate the degree to which causal claims of projects or interventions about results, effects, and outcomes are plausible.
In their paper, White and Phillips (2012) distinguish four qualitative evaluation methods that are suitable for substantiating claims about effectiveness (realist evaluation, contribution analysis, process tracing, and general elimination methodology) and four methods that are less suitable for substantiating claims about effectiveness (most significant change, success case method, outcome mapping, and method for impact). In recent years, outcome harvesting, which shares characteristics with outcome mapping, has also gained popularity amongst evaluators, especially for evaluating lobby and advocacy related activities.
Qualitative evaluation methods that are able to substantiate causal claims about effectiveness should generally follow these five steps:
- Formulate the cause-effect contribution question.
- Construct or reconstruct an intervention theory, including the assumptions.
- Formulate alternative theories and explanations for the observed changes.
- Collect data along results areas in the intervention theory, and for the alternative theories, including data from stakeholders that have not been directly involved in the project.
- Verify in a step-by-step manner the causal chains of the intervention theory for the full range of possible outcomes (including achieved results, intended results that have not been achieved, and unintended effects), and the alternative theories.
Research designs may include participatory evaluation methods, which can deepen the understanding of specific mechanisms, improve the evaluator’s contextual understanding, and facilitate downward accountability.
In order to answer evaluation questions about effectiveness it is important that, combined, the methods align with the five steps outlined above to validly and reliably answer the evaluation questions. Therefore, (single) participatory evaluation methods should be complemented with other evaluation methods. Furthermore, the methods should assess the full range of possible outcomes, including achieved results, intended results that have not been achieved, and unintended effects.
Evaluators can use quantitative evaluation method(s) to robustly substantiate causal claims about effects of projects or interventions. The Maryland scientific methods scale delineates five increasing levels of rigour:
- Level 1: a single observation moment, after the project: a comparison with and without the project
- Level 2: two observation moments: a comparison before and after the project, without a control group
- Level 3: two observation moments, double difference: comparing before and after, and with and without the project
- Level 4: two observation moments, double difference, semi-experimental design: comparing before and after, and with and without the project, while accounting for other external influences
- Level 5: two observation moments, double difference, randomized control group, experimental design: comparing before and after, and with and without the project, with participants randomly assigned to the project
Although level 5 is best suited for attributing results to a specific project or intervention, that level of rigour is not always feasible for evaluations commissioned by the Netherlands Ministry of Foreign Affairs or its partner organisations. Alternatively, quantitative evaluations at level 4 rigour are generally accepted as sufficiently robust to support causal claims.
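To make the ‘double difference’ logic concrete, the sketch below works through a minimal, hypothetical example in Python; all figures and variable names are invented for illustration and do not come from an actual evaluation.

```python
# Hypothetical example: average monthly income (in euros) of young people in a
# youth employment project, measured before and after the project period.
treatment_before, treatment_after = 210.0, 290.0  # project participants
control_before, control_after = 205.0, 240.0      # comparable non-participants

# Single difference (level 2): change within the project group only;
# it ignores what would have happened without the project.
single_difference = treatment_after - treatment_before  # 80.0

# Double difference (levels 3-5): change in the project group minus the change
# in the control group, filtering out the common trend.
double_difference = (treatment_after - treatment_before) - (
    control_after - control_before
)  # 45.0

print(f"Single difference: {single_difference:.1f} euros")
print(f"Double difference: {double_difference:.1f} euros")
```

Levels 4 and 5 strengthen this same comparison by, respectively, accounting for other external influences and assigning participants to the project at random.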
Evaluations below level 4 can be assessed as ‘adequate’ if the following three assumptions are valid (see the assessment grid in the publication; a minimal check of the first assumption is sketched after this list):
- Participants in the intervention group had similar values on the dependent variable X as those in the control group at baseline.
- Participants in the control group exhibit similar relevant characteristics as those in the intervention group.
- In the absence of the project, the dependent variable X would not change: there are no other factors that influence X.
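For example, the first assumption can be examined by comparing baseline values of the dependent variable between the two groups. The sketch below is a minimal illustration using a standard two-sample t-test; the data are hypothetical.

```python
from scipy.stats import ttest_ind

# Hypothetical baseline values of the dependent variable X (e.g. monthly income)
intervention_baseline = [212, 198, 225, 190, 205, 231, 188, 219]
control_baseline = [208, 201, 215, 196, 210, 222, 192, 207]

# Two-sample t-test on the baseline values. A large p-value gives no evidence of
# a baseline difference between the groups (it does not prove equivalence).
t_stat, p_value = ttest_ind(intervention_baseline, control_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
```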
Formulating indicators or describing result areas at small intervals in the results chain can make an effect claim from activities to results more plausible. The ‘technical proposal’ or ‘inception report’ should propose valid quantitative indicators, or qualitatively describe the result areas as concretely as possible, so that the evaluation accurately measures what it claims to capture.
For quantitative evaluation methods, the indicators should be appropriate for assessing the intended results, and should adhere to the SMART criteria:
- Specific: Indicators should be clear and specific, leaving no room for ambiguity. They should be linked to different levels in the ToC or results chain.
- Measurable: Indicators should be quantifiable, allowing for measuring progress.
- Attainable: Objectives and corresponding indicators should be attainable, making it possible to capture changes during the evaluation.
- Relevant: The indicators should align with the objectives of the evaluated project.
- Time-bound: There should be a defined timeframe, making it clear when the measurement takes place.
Evaluations can include multiple sampling strategies and/or case selection strategies to serve different evaluation objectives or answer different evaluation questions.
Sampling is the process of selecting a subset of individuals or units from a larger population. The distinction between probability sampling and non-probability sampling is important.
Probability sampling, e.g. random sampling, stratified sampling, or systematic sampling, ensures that each member of the population has a known chance of being selected. The objective is to create a representative subsample which allows the evaluators to generalise findings from the sample to the broader population and prevent selection bias.
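As an illustration of probability sampling, the sketch below draws a proportional stratified random sample using only the Python standard library; the strata, population sizes, and sample size are hypothetical.

```python
import random

# Hypothetical sampling frame: project participants grouped by region (the strata).
population = {
    "region_north": [f"north_{i}" for i in range(400)],
    "region_south": [f"south_{i}" for i in range(100)],
}
total = sum(len(members) for members in population.values())
sample_size = 50

random.seed(42)  # fixed seed so the sample can be reproduced and scrutinised
sample = []
for stratum, members in population.items():
    # Each stratum contributes in proportion to its share of the population,
    # so every member has a known, non-zero chance of being selected.
    n_stratum = round(sample_size * len(members) / total)
    sample.extend(random.sample(members, n_stratum))

print(f"Sampled {len(sample)} of {total} participants across {len(population)} strata")
```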
When the chosen sample in an evaluation does not accurately represent the entire target population, this is referred to as selection bias. It occurs, for example, when participants in an evaluation of a youth employment project self-select. More successful or more entrepreneurial young people may be more likely to voluntarily participate in an evaluation and, as a result, the findings might not be representative of all young people.
Non-probability sampling, e.g. convenience sampling, purposive sampling, or snowball sampling, does not ensure that every member of the population has a known chance of being selected. Especially in qualitative research, evaluators apply non-probability sampling methods to study specific phenomena, without the objective of generalising results to the broader population.
In quantitative research, the sample size should be calculated prior to data collection, ideally on the basis of a power calculation.
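As an illustration, the sketch below performs such a power calculation for a two-sample comparison of means, using the standard normal-approximation formula; the effect size, significance level, and desired power are hypothetical choices, not IOB requirements.

```python
from math import ceil
from scipy.stats import norm

effect_size = 0.3  # hypothetical standardised difference (Cohen's d) to be detected
alpha = 0.05       # two-sided significance level
power = 0.80       # desired probability of detecting the effect if it exists

# Normal-approximation sample size per group for a two-sample comparison of means:
# n = 2 * ((z_(1 - alpha/2) + z_power) / effect_size) ** 2
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_group = ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(f"Required sample size: about {n_per_group} per group")  # roughly 175 per group
```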
Case selection is choosing specific cases for in-depth study based on their relevance to the objective of the evaluation, and the evaluation questions. The main goal of case selection is to gain a deep, contextual understanding of a particular case or particular cases, which may provide valuable insights. Depending on the objectives, evaluation questions and analysis strategy, evaluators may select typical cases, extreme cases, critical cases, or comparative cases.
When sampling and selecting cases, evaluators should:
- Formulate sample or case selection criteria - the set of characteristics that must be present - independently of the actively involved stakeholders.
- Be transparent about the sample selection criteria applied, e.g. by presenting a list of all potential cases, interventions or countries and their scores on the selection criteria.
The ‘technical proposal’ or ‘inception report’ should present which information sources will be included in a way that minimizes source bias, by using sufficient and independent information sources.
Source bias occurs when the chosen information sources in an evaluation are not well balanced, with certain sources emphasized over others. This can lead to an unrepresentative selection of sources and skewed findings, affecting an evaluation’s validity.
Collecting information from stakeholders that were not directly involved with the project can provide important information about external factors, assumptions, or what happens in the absence of an intervention. In addition to information sources obtained from direct stakeholders, such as project staff or the targeted population, the evaluator should also include other information sources, such as:
- the non-targeted population
- informed but not directly engaged stakeholders
- the evaluator’s own direct observations
- validated secondary data sources
Because it may be difficult to predict the amount of data needed to reach the point of saturation, it is important that the evaluators have sufficient flexibility - time and resources - to add more information sources if needed.
The 'inception report' or 'technical proposal' should clearly describe the limitations in the reliability and validity (both internal and external) of the proposed methodology, as well as any potential biases, including selection bias. Note that this criterion does not assess the limitations or bias itself, but rather the acknowledgement thereof. Criterion 17 assesses how the implications of bias are taken into account in formulating conclusions.
Reliability is the degree to which an evaluation would consistently produce the same results under similar conditions. It involves the extent to which the evaluation, including data collection and analysis, can be replicated or repeated to yield similar results.
Internal validity refers to the trustworthiness and credibility of the findings within the specific context. It involves the extent to which the evaluation accurately describes relationships between the results and the intervention, minimizing the effects of other relevant factors. Various forms of respondent bias, for example, can affect the internal validity of an evaluation.
External validity, also known as generalizability, is the extent to which the results of a study can be generalized or applied to settings, populations, and conditions beyond the specific evaluation. Selection bias, for example, may affect the external validity of an evaluation – see criterion 12.
Evaluation criteria in phase 3: Draft and final report

The third subset of criteria (15 – 17) is designed to assess the quality of the draft and final report. These criteria focus on transparent reporting and on the conclusions and recommendations formulated.
During this stage of assessing evaluation quality, it is no longer possible to adjust the data collection as applied, but it remains possible to improve descriptions, enhance the analysis, and to reformulate conclusions.
This phase includes a re-assessment of the second subset and of five criteria from the first subset (1 – 5).
Describing the analysis in a transparent manner allows for scrutiny, replicability, and a good understanding of the entire research process. The evaluation report should therefore present all research methods, data collection methods, data sources, and data analysis techniques employed in a systematic, transparent and complete yet accessible manner, either in the main report or in appendices to the report.
In case different research methods, data collection methods, or data sources provide contradictory findings, the evaluator should be transparent about the differences and the assessment thereof. Transparently presenting the range of evaluation findings prevents cherry-picking of certain results or the selective presentation of findings to fit a specific narrative. The report should explain how the evaluator has weighted and combined findings from sources and methods to come to an overall judgement and conclusion.
The draft and final report should provide an answer to all evaluation questions. Although the main and sub-conclusions may not be structured according to the evaluation questions, in principle they must all be answered in the text in a recognizable way. If evaluators faced unforeseen limitations during the evaluation process that prevented them from answering all evaluation questions, the report must at a minimum highlight which evaluation questions it did not answer and explain the reasons for this.
There should be a clear and rational connection in the evaluation report from findings to conclusions, and from the conclusions to the recommendations.
Each conclusion should be supported by specific findings. It is important that the final report takes possible limitations and bias (criterion 14) sufficiently into account. For example, if the authors state that the evaluation will have limited external validity, the report’s conclusions should not be generalized beyond the selected cases.
Recommendations, in turn, must logically follow from the conclusions presented in the evaluation report. If the ToR identified an action objective (criterion 6), the recommendations should be in line with the stated evaluation objective.