Learning the structure of a Bayesian network from observations is a common goal in several areas of science and technology. Here we show that the prequential minimum description length (MDL) principle can be used to derive a scoring function for Bayesian networks. We model the conditional probability distributions between observed variables with flexible, highly overparametrized neural networks that can represent non-linear relationships between these variables. Nevertheless, we obtain plausible and sparse directed acyclic graphs (DAGs) without relying on a sparsity-inducing prior over graph structures or other regularizers that must be tuned to achieve the desired sparsity at the graph level. Furthermore, the prequential MDL score reveals an interesting connection to recent work that infers causal structure from the speed of adaptation when observations are obtained from a source undergoing distributional shift.
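To make the idea concrete, here is a minimal sketch (not the paper's implementation) of a prequential plug-in score: each observation is encoded with a model fitted only on the preceding observations, and the accumulated negative log-likelihood is the code length. The toy Gaussian predictors, function names, and the two-variable example below are illustrative assumptions; the paper uses neural networks for the conditional distributions.

```python
import math
import random

def gauss_nll(y, mu, sigma=1.0):
    # Code length (in nats) of y under a plug-in Gaussian N(mu, sigma^2).
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

def preq_score_marginal(ys):
    # Graph without the edge X -> Y: predict y_t from the running mean of y_1..y_{t-1}.
    total, s, n = 0.0, 0.0, 0
    for y in ys:
        mu = s / n if n else 0.0   # plug-in prediction from past data only
        total += gauss_nll(y, mu)
        s += y
        n += 1
    return total

def preq_score_with_parent(xs, ys):
    # Graph with the edge X -> Y: online least-squares slope through the origin,
    # again fitted only on observations seen so far.
    total, sxx, sxy = 0.0, 0.0, 0.0
    for x, y in zip(xs, ys):
        b = sxy / sxx if sxx > 0 else 0.0
        total += gauss_nll(y, b * x)
        sxx += x * x
        sxy += x * y
    return total

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [x + random.gauss(0, 0.3) for x in xs]   # Y genuinely depends on X

s_indep = preq_score_marginal(ys)
s_edge = preq_score_with_parent(xs, ys)
print(s_edge < s_indep)
```

Since the code length of X itself is identical under both graphs, only the score for Y is compared: the graph containing the true edge yields the shorter prequential code length, so no separate sparsity penalty is needed for this comparison.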