A Benchmark for Verifying Chain-of-Thought