Recently developed deep learning models can learn to decompose scenes into objects and their respective representations without supervision. This opens up opportunities to use object representations in agents. In this paper, we learn a slot-wise, object-based transition model that first decomposes the scene into objects, then aligns them into a consistent order with respect to a slot-wise object memory, and finally predicts how those objects evolve over time. Our model is trained end-to-end without supervision and without any privileged information. This work draws together and extends previous work on MONet, AlignNet and Transformers to lay the foundations for future object-centric, model-based and imaginative agents.
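
The align-then-predict data flow described above can be illustrated with a minimal Python sketch. It is not the model itself; every component is a hand-written stand-in, chosen only to show how unordered slots are put into a consistent order before a per-slot transition is applied. The sketch assumes that the decomposition stage (MONet in this work) has already produced one feature vector per object, replaces the learned alignment with a brute-force minimum-cost matching against the slot memory, and replaces the learned transition with a constant-velocity extrapolation. The function names `align_to_memory`, `predict_next` and `rollout_step` are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two slot feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def align_to_memory(slots, memory):
    """Reorder `slots` so that slot i is the best match for memory[i].

    Assumption: a brute-force minimum-cost permutation stands in for the
    learned alignment; it is only practical for a handful of slots.
    """
    n = len(memory)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(sq_dist(slots[p[i]], memory[i]) for i in range(n)),
    )
    return [slots[j] for j in best]


def predict_next(prev_aligned, curr_aligned):
    """Per-slot prediction of the next step.

    Assumption: constant-velocity extrapolation stands in for the learned
    transition model.
    """
    return [
        [2 * c - p for p, c in zip(prev_slot, curr_slot)]
        for prev_slot, curr_slot in zip(prev_aligned, curr_aligned)
    ]


def rollout_step(memory, prev_aligned, unordered_slots):
    """Align the new slots to the memory, then predict the following step."""
    aligned = align_to_memory(unordered_slots, memory)
    predicted = predict_next(prev_aligned, aligned)
    return aligned, predicted


# Two objects; the second frame arrives with its slots in swapped order.
frame0 = [[0.0, 0.0], [5.0, 5.0]]           # already in memory order
frame1_unordered = [[5.0, 6.0], [1.0, 0.0]]
aligned, predicted = rollout_step(frame0, frame0, frame1_unordered)
print(aligned)    # [[1.0, 0.0], [5.0, 6.0]]
print(predicted)  # [[2.0, 0.0], [5.0, 7.0]]
```

In this toy example the alignment step restores the slot order of the first frame, so the per-slot prediction compares each object with its own earlier state rather than with a different object.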