Every problem is generated on the fly from parameterized templates, so no specific instance can appear verbatim in any model's training data. Gold answers are computed as each problem is built and are then independently re-verified.
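As a rough illustration of this generate-then-verify pattern, here is a minimal sketch. The template (an arithmetic-series sum), the `Problem` class, and the functions `generate_problem` and `reverify` are all hypothetical names chosen for this example; they are not the project's actual templates or API. The idea shown is that the gold answer comes from one code path at build time and is checked by a second, separate code path.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    gold: int
    params: dict


def generate_problem(rng: random.Random) -> Problem:
    """Hypothetical template: sum of an arithmetic sequence with random parameters."""
    first = rng.randint(1, 50)
    step = rng.randint(1, 9)
    count = rng.randint(5, 20)

    # Gold answer is computed while the problem is built, using the closed form
    # n * (2a + (n - 1)d) / 2, which is always an integer.
    gold = count * (2 * first + (count - 1) * step) // 2

    prompt = (
        f"A sequence starts at {first} and increases by {step} each term. "
        f"What is the sum of its first {count} terms?"
    )
    params = {"first": first, "step": step, "count": count}
    return Problem(prompt=prompt, gold=gold, params=params)


def reverify(problem: Problem) -> bool:
    """Independent check: brute-force summation that does not reuse the closed form."""
    p = problem.params
    total = sum(p["first"] + i * p["step"] for i in range(p["count"]))
    return total == problem.gold


if __name__ == "__main__":
    rng = random.Random()  # unseeded: a fresh problem on every run
    problem = generate_problem(rng)
    assert reverify(problem), "gold answer failed independent verification"
    print(problem.prompt)
```

Passing a seeded `random.Random` instead would make a given problem reproducible for debugging, while an unseeded generator yields a fresh instance on each run.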