A well-defined benchmark is a crucial instrument for measuring and accelerating research progress in machine learning. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset broadly covers undergraduate- and research-level theorems in mathematics and computer science. In the task we define, a model must fill in a missing intermediate proposition given the surrounding proof. Our experiments reveal that while the task is challenging, neural models can achieve surprisingly good performance. We further design a hierarchical transformer model that outperforms the transformer baseline. We will make the dataset and models publicly available.