We investigate the composition of representative benchmark sets for evaluating and improving the performance of robust Boolean constraint solvers in the context of satisfiability testing and answer set programming. Starting from an analysis of current practice, we isolate a set of desiderata to guide the development of a parametrized benchmark selection algorithm. Our algorithm samples a benchmark set from a larger base set (or distribution) comprising a wide variety of instances. This is done fully automatically, in a way that carefully calibrates instance hardness and instance similarity. We demonstrate the usefulness of this approach through empirical results showing that optimizing solvers on the benchmark sets produced by our method yields better configurations than those obtained from the much larger original sets.
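
To make the idea of hardness- and similarity-calibrated sampling concrete, the following is a minimal sketch, not the paper's actual parametrized algorithm: it keeps only instances whose hardness (approximated here by a reference solver's runtime) lies in a target window, and then greedily skips instances that are too close to already-selected ones in an assumed numeric feature space. All names, thresholds, and the feature representation are illustrative assumptions.

```python
# Illustrative sketch of benchmark subset selection (hypothetical, not the paper's method).
from dataclasses import dataclass
from math import dist
from typing import List


@dataclass
class Instance:
    name: str
    runtime: float          # reference-solver runtime in seconds (proxy for hardness)
    features: List[float]   # numeric instance features (e.g., #vars, #clauses, ratios)


def select_benchmarks(base: List[Instance],
                      min_rt: float = 10.0,
                      max_rt: float = 600.0,
                      min_feature_dist: float = 1.0,
                      target_size: int = 100) -> List[Instance]:
    """Sample a hardness-calibrated, mutually dissimilar subset from `base`."""
    # 1) Calibrate hardness: drop instances that are trivially easy or prohibitively hard.
    candidates = [i for i in base if min_rt <= i.runtime <= max_rt]

    # 2) Calibrate similarity: greedily admit an instance only if it is sufficiently
    #    far (in feature space) from every instance selected so far.
    selected: List[Instance] = []
    for inst in sorted(candidates, key=lambda i: i.runtime):
        if all(dist(inst.features, s.features) >= min_feature_dist for s in selected):
            selected.append(inst)
        if len(selected) == target_size:
            break
    return selected


if __name__ == "__main__":
    base_set = [Instance(f"inst{k}", runtime=5.0 + 7.0 * k,
                         features=[float(k), float(k % 10)])
                for k in range(200)]
    subset = select_benchmarks(base_set)
    print(f"selected {len(subset)} of {len(base_set)} instances")
```

The resulting smaller set would then serve as input to a solver configuration procedure in place of the full base set, which is the use case evaluated empirically above.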