DeepMind Publications
https://www.deepmind.com
We work on some of the most complex and interesting challenges in AI. Our world-class research has resulted in hundreds of peer-reviewed papers, including in Nature and Science.
Thu, 11 May 2023 17:32:50 GMT

Search-Improved Game-Theoretic Multiagent Reinforcement Learning in General and Negotiation Games
https://www.deepmind.com/publications/search-improved-game-theoretic-multiagent-reinforcement-learning-in-general-and-negotiation-games
Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. Many applications have focused on two-player zero-sum games, employing standard reinforcement learning to compute "oracle" or "exploiter" response policies via approximate best response. In this paper, we introduce Monte Carlo tree search with generative world state sampling to augment the best response steps. We show empirical convergence to Nash equilibria and the effects of various choices of meta-solvers across a suite of general-sum and n-player sequential games. We then present case studies on negotiation games, including Colored Trails and the multi-issue bargaining game "Deal or no Deal". We propose two new forms of meta-solvers based on the Nash Bargaining Solution (NBS), and simple gradient ascent algorithms to solve them. The NBS meta-solvers produce agents that achieve higher social welfare than purely Nash-inspired ones, and come closest to the Pareto frontier in Colored Trails. Finally, we report on the generalization capabilities of agents trained via this regime by evaluating them against human participants.
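As a toy illustration of the Nash Bargaining Solution behind the NBS meta-solvers just mentioned (a sketch with invented payoffs, not the paper's algorithm): the NBS picks the outcome that maximizes the product of the players' gains over their disagreement payoffs, and for a smooth bargaining set it can be found by simple gradient ascent.

```python
# Toy sketch of the Nash Bargaining Solution (NBS) via gradient ascent.
# Two players split one unit of surplus: player 1 receives x, player 2 receives
# 1 - x. The disagreement payoffs d1, d2 are invented for this example; the
# NBS maximizes the product of gains (x - d1) * (1 - x - d2).
d1, d2 = 0.2, 0.1

def grad_log_nash_product(x):
    # d/dx [log(x - d1) + log(1 - x - d2)]
    return 1.0 / (x - d1) - 1.0 / (1.0 - x - d2)

x, lr = 0.5, 0.01
for _ in range(5000):
    x += lr * grad_log_nash_product(x)

print(round(x, 3))  # closed form for this game: (1 + d1 - d2) / 2 = 0.55
```

For this surplus-splitting game the ascent recovers the closed-form split (1 + d1 - d2) / 2, i.e. the player with the better outside option gets the larger share.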
Overall, we find that search and generative modeling help find stronger policies during training, enable online Bayesian co-player prediction, and produce fair agents whose social welfare when negotiating with humans is comparable to that of humans trading among themselves.
Mon, 29 May 2023 00:00:00 GMT

Is forgetting less a good inductive bias for forward transfer?
https://www.deepmind.com/publications/is-forgetting-less-a-good-inductive-bias-for-forward-transfer
One of the main motivations for studying continual learning is that the problem setting allows a model to accrue knowledge from past tasks in order to learn new tasks more efficiently. However, recent studies suggest that the key metric continual learning algorithms optimize, reduction in catastrophic forgetting, does not correlate well with forward transfer of knowledge. We believe the conclusion previous works reached is due to the way they measure forward transfer. We argue that the measure of forward transfer to a task should not be affected by the restrictions placed on the continual learner in order to preserve knowledge of previous tasks. Instead, forward transfer should be measured by how easy it is to learn a new task given a set of representations produced by continual learning on previous tasks. Under this notion of forward transfer, we evaluate different continual learning algorithms on a variety of image classification benchmarks. Our results indicate that less forgetful representations lead to better forward transfer, suggesting a strong correlation between retaining past information and learning efficiency on new tasks. Further, we found less forgetful representations to be more diverse and discriminative than their forgetful counterparts.
Mon, 01 May 2023 00:00:00 GMT

Meta-Learning Black-Box Optimization via Black-Box Optimization
https://www.deepmind.com/publications/meta-learning-black-box-optimization-via-black-box-optimization
Evolution strategies constitute a set of domain-general optimization algorithms which do not require the ability to compute well-behaved gradients. They suffer from an inability to scale to large search spaces, a lack of hyperparameter intuition, and an inflexible, often heuristic design. To address these limitations, we take inspiration from recent advances in learned optimization and automatically discover new evolution strategies via meta-learning. The proposed search strategy is parametrized by a self-attention-based architecture, which enables flexible interpolation between different search heuristics. The induced search update rule is equivariant to the ordering of the candidate solutions. We show that meta-evolving this system on a set of representative low-dimensional optimization problems discovers new evolution strategies capable of generalizing to unseen optimization problems, population sizes, and optimization horizons. Our experiments on vision and continuous control tasks demonstrate that the learned evolution strategy is more sample-efficient than established baseline strategies and can effectively scale to large population sizes. As a bonus, we show that it is possible to self-referentially train the evolution strategy from a random initialization using a simple selection heuristic.
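For context, the kind of hand-designed, heuristic evolution strategy that this line of work meta-learns over can be sketched in a few lines (an illustrative (1+λ) strategy on a toy quadratic; all names and constants here are invented, not the paper's learned optimizer):

```python
import random

# Illustrative sketch: a minimal hand-designed (1+lambda) evolution strategy on
# a toy quadratic -- the kind of heuristic update rule the paper replaces with
# a meta-learned, self-attention-parametrized one.
def es_minimize(f, x0, sigma=0.5, pop=20, steps=300, decay=0.99, seed=0):
    rng = random.Random(seed)
    best, best_f = list(x0), f(x0)
    for _ in range(steps):
        for _ in range(pop):
            cand = [xi + rng.gauss(0, sigma) for xi in best]
            fc = f(cand)
            if fc < best_f:          # elitism: keep the best solution seen so far
                best, best_f = cand, fc
        sigma *= decay               # simple heuristic step-size schedule
    return best

sphere = lambda v: sum(t * t for t in v)
x = es_minimize(sphere, [2.0, -1.5])
print(sphere(x) < 0.01)
```

The fixed decay schedule and elitist selection are exactly the sort of inflexible design choices the abstract criticizes; the meta-learned strategy replaces them with an update rule discovered by outer-loop optimization.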
Finally, we study the contributions of the individual neural network components and reverse engineer the learned strategy into a competitive …
Mon, 01 May 2023 00:00:00 GMT

Equilibrium-Invariant Embedding, Metric Space, and Fundamental Set of 2×2 Normal-Form Games
https://www.deepmind.com/publications/equilibrium-invariant-embedding-metric-space-and-fundamental-set-of-22-normal-form-games
Equilibrium solution concepts of normal-form games, such as Nash equilibria, correlated equilibria, and coarse correlated equilibria, describe the joint strategy profiles from which no player has an incentive to unilaterally deviate. They are widely studied in game theory, economics, and multiagent systems. Equilibrium concepts are invariant under certain transforms of the payoffs. We define an equilibrium-inspired distance metric for the space of all normal-form games and uncover a distance-preserving equilibrium-invariant embedding. Furthermore, we propose an additional transform which defines a better-response-invariant distance metric and embedding. To demonstrate these metric spaces, we study 2×2 games. The equilibrium-invariant embedding of 2×2 games has an efficient two-variable parameterization (a reduction from eight), where each variable geometrically describes an angle on a unit circle. Interesting properties can be spatially inferred from the embedding, including: equilibrium support, cycles, competition, coordination, distances, best responses, and symmetries. The best-response-invariant embedding of 2×2 games, after considering symmetries, rediscovers a set of 15 games and their respective equivalence classes. We propose that this set of game classes is fundamental and captures all possible interesting strategic interactions in 2×2 games. We introduce a directed graph representation and name for each class.
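The equilibrium invariance mentioned above can be made concrete with a small sketch: pure Nash equilibria of a 2×2 game are unchanged when a constant is added to one player's payoffs per opponent action, which is the kind of transform such an embedding quotients out (payoff values below are invented):

```python
# Toy sketch: pure-strategy Nash equilibria of a 2x2 normal-form game, plus a
# check that the equilibria are invariant when a column-dependent constant is
# added to the row player's payoffs. Payoff values are invented.
def pure_nash(A, B):
    # A[i][j]: row player's payoff; B[i][j]: column player's payoff.
    return [
        (i, j)
        for i in range(2)
        for j in range(2)
        if A[i][j] >= A[1 - i][j] and B[i][j] >= B[i][1 - j]
    ]

# Prisoner's Dilemma payoffs: mutual defection (1, 1) is the unique equilibrium.
A = [[3, 0], [5, 1]]
B = [[3, 5], [0, 1]]
print(pure_nash(A, B))  # [(1, 1)]

# Shift the row player's payoffs by a constant per column: equilibria unchanged.
A2 = [[A[i][j] + 10 * j for j in range(2)] for i in range(2)]
print(pure_nash(A2, B) == pure_nash(A, B))  # True
```

Because such offsets never change which action is a best response, games differing only by them collapse to the same point in an equilibrium-invariant embedding.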
Finally, we leverage the tools de…
Fri, 21 Apr 2023 00:00:00 GMT

Fitting Autoregressive Graph Generative Models through Maximum Likelihood Estimation
https://www.deepmind.com/publications/fitting-autoregressive-graph-generative-models-through-maximum-likelihood-estimation
We consider the problem of fitting autoregressive graph generative models via maximum likelihood estimation (MLE). MLE is intractable for graph autoregressive models because the nodes in a graph can be arbitrarily reordered; thus the exact likelihood involves a sum over all possible node orders leading to the same graph. In this work, we fit the graph models by maximizing a variational bound, which is built by first deriving the joint probability over the graph and the node order of the autoregressive process. This approach avoids the need to specify ad hoc node orders, since an inference network learns the most likely node sequences that have generated a given graph. We improve the approach by developing a graph generative model based on attention mechanisms and an inference network based on routing search. We demonstrate empirically that fitting autoregressive graph models via variational inference improves their qualitative and quantitative performance, and that the improved model and inference network boost performance further.
Thu, 06 Apr 2023 00:00:00 GMT

Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
https://www.deepmind.com/publications/rethinking-evaluation-practices-in-visual-question-answering-a-case-study-on-out-of-distribution-generalization
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (OOD) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively investigate the performance of two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution than discriminative ones, and that multimodal pretraining is mostly helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
Sat, 01 Apr 2023 00:00:00 GMT

Three ways to improve feature alignment for open vocabulary detection
https://www.deepmind.com/publications/three-ways-to-improve-feature-alignment-for-open-vocabulary-detection
The core problem in zero-shot open vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. First, a simple scheme augments the text embeddings, which prevents overfitting to the small number of classes seen during training while simultaneously saving memory and computation. Second, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state of the art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
Thu, 23 Mar 2023 00:00:00 GMT

Fast exploration and learning of latent graphs with aliased observations
https://www.deepmind.com/publications/fast-exploration-and-learning-of-latent-graphs-with-aliased-observations
An agent navigates a latent graph by performing actions that take it from one node to another. The chosen action determines the probability distribution over the next visited node. At each node, the agent receives an observation, but this observation is not unique, so it does not identify the node, making the problem aliased. The purpose of this work is three-fold: we want to (a) provide a mechanism to recover the latent graph from a sequence of observation-action pairs in the presence of aliasing; (b) provide a policy that approximately maximizes exploration efficiency (i.e., how well the graph is recovered for a given exploration budget); and (c) introduce measures that are adequate for quantifying performance in this type of problem. In the unaliased case, we show improved performance w.r.t. state-of-the-art reinforcement learning baselines. For the aliased case we are not aware of suitable baselines, and instead show faster recovery w.r.t. a random policy for a wide variety of topologies, and exponentially faster recovery than a random policy for challenging topologies. We dub the algorithm eFeX (from eFficient eXploration).
Mon, 13 Mar 2023 00:00:00 GMT

Evaluating Number Discrimination in Deep Neural Networks for Vision
https://www.deepmind.com/publications/evaluating-number-discrimination-in-deep-neural-networks-for-vision
The ability to discriminate large and small quantities, number discrimination, is a core aspect of basic numerical competence in both humans and animals. In this work, we examine the extent to which state-of-the-art neural networks designed for vision exhibit this basic ability. Motivated by studies in animal and infant numerical cognition, we use the numerical bisection procedure to test number discrimination in three families of neural architectures. We find that models with vision-specific inductive biases are more successful in discriminating numbers than those with no or weaker implicit biases. Interestingly, the model with both hierarchical and locality biases best matches the empirical data. We also observe that even the strongest model does not exhibit the expected number discrimination behavior if the test situation differs from the training one. In some cases, the model has learned a correct and ordered clustering of numbers, but cannot use this knowledge in new situations.
Mon, 13 Mar 2023 00:00:00 GMT

Denoising diffusion samplers
https://www.deepmind.com/publications/denoising-diffusion-samplers
Denoising diffusion models are a very popular class of generative models which provide state-of-the-art results in a variety of domains such as image and speech synthesis. Noise is gradually added to the data via a diffusion, transforming the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion, initialized from Gaussian samples.
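The forward noising process just described can be sketched in one dimension (a DDPM-style discretization with assumed noise level beta and step count T; illustrative only):

```python
import math
import random

# Toy sketch of the forward (noising) half of a denoising diffusion model in 1D.
# Each step applies x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps, which
# drives any data distribution toward N(0, 1). beta and T are assumed values.
rng = random.Random(0)
beta, T = 0.02, 600

def diffuse(x0):
    x = x0
    for _ in range(T):
        x = math.sqrt(1 - beta) * x + math.sqrt(beta) * rng.gauss(0, 1)
    return x

# Start all mass far from the origin; after T steps it is close to N(0, 1).
samples = [diffuse(5.0) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(abs(mean) < 0.1, abs(var - 1.0) < 0.15)
```

Generative modeling then simulates an approximate time-reversal of this process; the sampler variant below applies the same idea with the data distribution replaced by an unnormalized target density.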
Practically, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. Here we explore a similar idea for sampling approximately from unnormalized probability density functions and estimating their normalizing constants. We consider a process in which the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. Score matching is not applicable in this context, so an alternative variational inference approach is used. However, we can leverage some of the ideas introduced in generative modeling for this Monte Carlo sampling task, and can similarly adapt existing theoretical results from denoising diffusion models to provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control, and Schrödinger bridges, and finally demonstrate DDS experimentally on a variety of sampling tasks.
Tue, 28 Feb 2023 00:00:00 GMT

Leveraging Jumpy Models for Planning and Fast Learning in Robotic Domains
https://www.deepmind.com/publications/leveraging-jumpy-models-for-planning-and-fast-learning-in-robotic-domains
In this paper we study the problem of learning multi-step dynamics prediction models (jumpy models) from unlabeled experience, and their utility for fast inference of (high-level) plans in downstream tasks. In particular, we propose to learn a jumpy model alongside a skill embedding space offline, from previously collected experience for which no labels or reward annotations are required. We then investigate several options for harnessing those learned components in combination with model-based planning or model-free reinforcement learning (RL) to speed up learning on downstream tasks. We conduct a set of experiments in the RGB-stacking environment, showing that planning with the learned skills and the associated model can enable zero-shot generalization to new tasks, and can further speed up training of policies via reinforcement learning. We specifically investigate the effect of temporal abstraction in the learned skills and jumpy model predictions, and show that this combination can facilitate planning in long-horizon tasks.
Fri, 24 Feb 2023 00:00:00 GMT

Graph schemas as abstractions for transfer learning, inference, and planning
https://www.deepmind.com/publications/graph-schemas-as-abstractions-for-transfer-learning-inference-and-planning
We propose schemas as a model for abstractions that can be used for rapid transfer learning, inference, and planning. Common structured representations of concepts and behaviors, schemas, have been proposed as a powerful way to encode abstractions. Latent graph learning is emerging as a new computational model of the hippocampus to explain map learning and transitive inference. We build on this work to show that learned latent graphs in these models have a slot structure, i.e. schemas, that allows for quick knowledge transfer across environments. In a new environment, an agent can rapidly learn new bindings from the sensory stream to multiple latent schemas and select the best-fitting one to guide behavior. To evaluate these graph schemas, we use two previously published challenging tasks, the memory & planning game and one-shot StreetLearn, which are designed to test rapid task solving in novel environments. Graph schemas can be learned in far fewer episodes than previous baselines, and can model and plan in a few steps in novel variations of these tasks. We further demonstrate learning, matching, and reusing graph schemas in navigation tasks in more challenging environments with aliased observations and size variations, and show how different schemas can be composed to model larger 2D and 3D environments.
Thu, 16 Feb 2023 00:00:00 GMT

Universal Agent Mixtures and the Geometry of Intelligence
https://www.deepmind.com/publications/universal-agent-mixtures-and-the-geometry-of-intelligence
Inspired by recent progress in multi-agent reinforcement learning (RL), in this work we examine the collective intelligent behaviour of theoretical universal agents by introducing a weighted mixture operation. Given a weighted set of agents, their weighted mixture is a new agent whose expected total reward in any environment is the corresponding weighted average of the original agents' expected total rewards in that environment. Thus, if RL agent intelligence is quantified in terms of performance across environments, the weighted mixture's intelligence is the weighted average of the original agents' intelligences. This operation enables various interesting new theorems that shed light on the geometry of RL agent intelligence, namely results about symmetries, convex agent-sets, and local extrema.
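The weighted mixture operation can be sketched concretely (toy one-step "environments" and invented member agents below, not the paper's theoretical universal agents): sampling a member according to the weights and delegating to it makes the mixture's expected return the weighted average of the members' expected returns.

```python
import random

# Toy sketch of a weighted mixture of agents: sample a member according to the
# weights, then delegate the decision to it. Agents and payoffs are invented
# one-step examples.
rng = random.Random(0)

def mixture(agents, weights):
    def act(obs):
        r, acc = rng.random(), 0.0
        for agent, w in zip(agents, weights):
            acc += w
            if r <= acc:
                return agent(obs)
        return agents[-1](obs)
    return act

# One-step environments: the reward is just the acting agent's chosen value.
always_0 = lambda obs: 0.0
always_1 = lambda obs: 1.0
mix = mixture([always_0, always_1], [0.25, 0.75])

avg = sum(mix(None) for _ in range(20000)) / 20000
print(abs(avg - (0.25 * 0.0 + 0.75 * 1.0)) < 0.02)  # matches weighted average
```

By linearity of expectation this holds in any environment, which is what lets the paper average the members' intelligence scores.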
We also show that any RL agent intelligence measure based on average performance across environments, subject to certain weak technical conditions, is identical (up to a constant factor) to performance within a single environment dependent on said intelligence measure.
Mon, 13 Feb 2023 00:00:00 GMT

Scaling Goal-based Exploration via Pruning Proto-goals
https://www.deepmind.com/publications/scaling-goal-based-exploration-via-pruning-proto-goals
One of the gnarliest challenges in reinforcement learning is exploration that scales to vast domains, where novelty- or coverage-seeking behaviour falls short. Goal-directed, purposeful behaviours are able to overcome this, but rely on a good goal space. The core challenge in *goal discovery* is finding the right balance between generality (not hand-crafted) and tractability (useful, not too many). Our approach explicitly seeks the middle ground, enabling the human designer to specify a vast but meaningful proto-goal space, and an autonomous discovery process to prune it down to a small space of controllable, reachable, novel, and relevant goals. The effectiveness of goal-conditioned exploration with the latter is then demonstrated in three challenging environments.
Thu, 09 Feb 2023 00:00:00 GMT

Equivariant MuZero
https://www.deepmind.com/publications/equivariant-muzero
Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.
Thu, 09 Feb 2023 00:00:00 GMT

3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics
https://www.deepmind.com/publications/3d-neural-embedding-likelihood-for-robust-sim-to-real-transfer-in-inverse-graphics
A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between the generative graphics model and real-world data. We propose a novel 3D neural embedding likelihood (3DNEL) that jointly models RGB and depth images, and empirically demonstrate that it enables robust 6D object pose estimation via Bayesian inverse graphics on real-world RGB-D images. 3DNEL uses neural embeddings, learned entirely from synthetic data, to predict dense 2D-3D correspondence scores from RGB, combines this with depth information in a principled manner, and uses a mixture model formulation to jointly model multiple objects in a scene. 3DNEL achieves new state-of-the-art (SOTA) performance in sim-to-real pose estimation on the YCB-Video dataset, and demonstrates improved robustness compared with the previous SOTA, with significantly fewer large-error pose predictions. Formulated as a structured probabilistic generative model, 3DNEL can be easily adapted for object tracking in dynamic videos, further improving the accuracy of 6D pose estimation.
Tue, 07 Feb 2023 00:00:00 GMT

Exploration via Epistemic Value Estimation
https://www.deepmind.com/publications/exploration-via-epistemic-value-estimation
How to explore efficiently in reinforcement learning is an open problem. Many exploration algorithms employ the epistemic uncertainty of their own value predictions, for instance to compute an exploration bonus or an upper confidence bound. Unfortunately, the required uncertainty is difficult to estimate in general with function approximation. We propose epistemic value estimation (EVE): a recipe that is compatible with sequential decision making and with neural network function approximators. It equips agents with a tractable posterior over all their parameters, from which epistemic value uncertainty can be computed efficiently. We use the recipe to derive an epistemic Q-learning agent and observe competitive performance on a series of benchmarks.
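As a generic illustration of how epistemic value uncertainty drives exploration (EVE itself derives a tractable posterior over network parameters; the ensemble disagreement and numbers below are an invented stand-in, not the paper's recipe):

```python
import statistics

# Illustration of epistemic-uncertainty-driven exploration: disagreement across
# an ensemble of value estimates serves as an uncertainty proxy. All numbers
# are invented.
ensemble = [
    {"a": 1.0, "b": 0.9},  # each member's value estimates per action
    {"a": 1.1, "b": 0.1},
    {"a": 0.9, "b": 1.7},
]

def ucb_value(action, beta=1.0):
    vals = [member[action] for member in ensemble]
    bonus = statistics.stdev(vals)  # epistemic proxy: member disagreement
    return statistics.mean(vals) + beta * bonus

# Action "a" has the higher mean (1.0 vs 0.9), but "b" is far more uncertain,
# so an optimistic explorer prefers "b".
print(ucb_value("a") < ucb_value("b"))  # True
```

A posterior over parameters, as in EVE, plays the same role as the ensemble here but without the cost of training multiple networks.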
Experiments confirm that the EVE recipe facilitates efficient exploration in hard exploration tasks.
Tue, 07 Feb 2023 00:00:00 GMT

Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition
https://www.deepmind.com/publications/diversity-through-exclusion-dte-niche-identification-for-reinforcement-learning-through-value-decomposition
Many environments contain numerous available niches of variable value, each associated with a different local optimum in the space of behaviors (policy space). In such situations, it is often difficult to design a learning process capable of evading distraction by poor local optima long enough to stumble upon the best available niche. In this work we propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches. The algorithm we propose consists of two parts: an agent architecture and a learning rule. The agent architecture contains multiple sub-policies. The learning rule, inspired by the ecological principle of competitive exclusion, can be understood as adding an extra loss term whereby one policy's experience is also used to update all the other policies, in a manner that decreases their value estimates for the visited states. Thus, when a sub-policy visits a particular state frequently, it discourages other sub-policies from learning to visit that state if they have alternatives. Further, we introduce an artificial-chemistry-inspired platform for defining tasks based on reaction graphs, where it is easy to create tasks with multiple rewarding strategies utilizing different resources (i.e. multiple niches).
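The competitive-exclusion learning rule can be sketched in tabular form (illustrative only: the paper applies it to deep Q-learning, and the states, targets, and constants here are invented). One sub-policy's experience lowers the other sub-policies' value estimates for the states it visits:

```python
from collections import defaultdict

# Tabular sketch of a competitive-exclusion learning rule: when one sub-policy
# visits a state, every other sub-policy's value estimate for that state is
# pushed down, discouraging them from competing for the same niche.
N_POLICIES, EXCLUSION, LR = 3, 0.5, 0.1

values = [defaultdict(float) for _ in range(N_POLICIES)]

def update(policy_idx, state, td_target):
    v = values[policy_idx]
    v[state] += LR * (td_target - v[state])   # ordinary value update
    for k in range(N_POLICIES):               # exclusion term for the others
        if k != policy_idx:
            values[k][state] -= LR * EXCLUSION

for _ in range(50):
    update(0, "niche_A", td_target=1.0)  # sub-policy 0 repeatedly claims niche A

# Sub-policy 0 values the niche; the others are pushed away from it.
print(values[0]["niche_A"] > 0.9, values[1]["niche_A"] < 0)
```

With the niche devalued for the other sub-policies, their greedy behaviour seeks out different states, which is exactly the diversity mechanism the abstract describes.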
We show that agents trained this way can escape poor-but-attractive local optima and instead converge to harder-to-discover, higher val…
Fri, 03 Feb 2023 00:00:00 GMT

Reinforcement Learning for Minimizing Age of Information over Wireless Links
https://www.deepmind.com/publications/reinforcement-learning-for-minimizing-age-of-information-over-wireless-links
In this chapter, we study the Age of Information (AoI) when status updates of the underlying process of interest can be sampled at any time by the source node and are transmitted over an error-prone wireless channel. We assume the availability of perfect feedback that informs the transmitter about the success or failure of transmitted status updates, and consider various retransmission strategies. More specifically, we study the scheduling of sampling and transmission of status updates in order to minimize the long-term average AoI at the destination under resource constraints. We assume that the underlying statistics of the system are not known and hence propose average-cost reinforcement learning algorithms for practical applications.
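The AoI quantity being minimized can be sketched with a toy slotted simulation (assumed probabilities; not the chapter's learning algorithms): the age grows by one each slot and resets to one when a fresh update is delivered.

```python
import random

# Toy slotted simulation of Age of Information (AoI) over an unreliable link.
# p_transmit and p_success are assumed parameters of this baseline policy.
def average_aoi(p_success, p_transmit, slots=200000, seed=0):
    rng = random.Random(seed)
    age, total = 1, 0
    for _ in range(slots):
        delivered = rng.random() < p_transmit and rng.random() < p_success
        age = 1 if delivered else age + 1   # age resets on successful delivery
        total += age
    return total / slots

# Transmitting more often (if the resource budget allows) lowers the average age.
print(average_aoi(0.8, 0.9) < average_aoi(0.8, 0.3))  # True
```

For a per-slot delivery probability q = p_transmit * p_success, the long-run average AoI of this memoryless baseline is 1/q; the resource constraints and unknown channel statistics are what make the scheduling problem studied in the chapter nontrivial.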
Extensions of the results to a multiuser setting with multiple receivers and to an energy-harvesting source node are also presented, different reinforcement learning methods including deep Q Network (DQN) are exploited and their performances are demonstrated.Thu, 02 Feb 2023 00:00:00 GMTPGMax: Factor Graphs for Discrete Probabilistic Graphical Models and Loopy Belief Propagation in JAXhttps://www.deepmind.com/publications/pgmax-factor-graphs-for-discrete-probabilistic-graphical-models-and-loopy-belief-propagation-in-jaxhttps://www.deepmind.com/publications/pgmax-factor-graphs-for-discrete-probabilistic-graphical-models-and-loopy-belief-propagation-in-jaxPGMax is an open-source Python package for easy specification of discrete Probabilistic Graphical Models (PGMs) as factor graphs, and automatic derivation of efficient and scalable loopy belief propagation (LBP) implementation in JAX. It supports general factor graphs, and can effectively leverage modern accelerators like GPUs for inference. Compared with existing alternatives, PGMax obtains higher-quality inference results with orders-of-magnitude inference speedups. PGMax additionally interacts seamlessly with the rapidly growing JAX ecosystem, opening up exciting new possibilities. Our source code, examples and documentation are available at https://github.com/deepmind/PGMax.Thu, 02 Feb 2023 00:00:00 GMTLearning Noisy OR Bayesian Networks with Max-Product Belief Propagationhttps://www.deepmind.com/publications/learning-noisy-or-bayesian-networks-with-max-product-belief-propagationhttps://www.deepmind.com/publications/learning-noisy-or-bayesian-networks-with-max-product-belief-propagationNoisy-OR Bayesian Networks (BNs) are a family of probabilistic graphical models which express rich statistical dependencies in binary data. Variational inference (VI) has been the main method proposed to learn noisy-OR BNs with complex latent structures (Jaakkola & Jordan, 1999; Ji et al., 2020; Buhai et al., 2020). 
However, the proposed VI approaches either (a) use a recognition network with standard amortized inference that cannot induce ``explaining-away''; or (b) assume a simple mean-field (MF) posterior which is vulnerable to bad local optima. Existing MF VI methods also update the MF parameters sequentially which makes them inherently slow. In this paper, we propose parallel max-product as an alternative algorithm for learning noisy-OR BNs with complex latent structures and we derive a fast stochastic training scheme that scales to large datasets. We evaluate both approaches on several benchmarks where VI is the state-of-the-art and show that our method (a) achieves better test performance than Ji et al. (2020) for learning noisy-OR BNs with hierarchical latent structures on large sparse real datasets; (b) recovers a higher number of ground truth parameters than Buhai et al. (2020) from cluttered synthetic scenes; and (c) solves the 2D blind deconvolution problem from Lazaro-Gredilla et al. (2021) and variants—including binary matrix factorization—while VI catastrophically fails and is up to two orders of magnitude slower.Thu, 02 Feb 2023 00:00:00 GMTDual Algorithmic Reasoninghttps://www.deepmind.com/publications/dual-algorithmic-reasoninghttps://www.deepmind.com/publications/dual-algorithmic-reasoningNeural Algorithmic Reasoning is an emerging area of machine learning which seeks to infuse algorithmic computation in neural networks, typically by training neural models to approximate steps of classical algorithms. In this context, much of the current work has focused on learning reachability and shortest path graph algorithms, showing that joint learning on similar algorithms is beneficial for generalisation. However, when targeting more complex problems, such "similar" algorithms become more difficult to find. Here, we propose to learn algorithms by exploiting duality of the underlying algorithmic problem. Many algorithms solve optimisation problems. 
We demonstrate that simultaneously learning the dual definition of these optimisation problems in algorithmic learning allows for better learning and qualitatively better solutions. Specifically, we exploit the max-flow min-cut theorem to simultaneously learn these two algorithms over synthetically generated graphs, demonstrating the effectiveness of the proposed approach. We then validate the real-world utility of our dual algorithmic reasoner by deploying it on a challenging brain vessel classification task, which likely depends on the vessels’ flow properties. We demonstrate a clear performance gain when using our model within such a context, and empirically show that learning the max-flow and min-cut algorithms together is critical for achieving such a result.Thu, 02 Feb 2023 00:00:00 GMTDistilling Internet-Scale Vision-Language Models into Embodied Agentshttps://www.deepmind.com/publications/distilling-internet-scale-vision-language-models-into-embodied-agentshttps://www.deepmind.com/publications/distilling-internet-scale-vision-language-models-into-embodied-agentsInstruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Fewshot prompting lets us teach abstract category membership, including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary preferences over objects). 
Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.Sun, 29 Jan 2023 00:00:00 GMTPragmatic Fairness: Developing Policies with Outcome Disparity Controlhttps://www.deepmind.com/publications/pragmatic-fairness-developing-policies-with-outcome-disparity-controlhttps://www.deepmind.com/publications/pragmatic-fairness-developing-policies-with-outcome-disparity-controlWe introduce a causal framework for designing optimal policies that satisfy fairness constraints. We take a pragmatic approach, asking what we can do with the action space available to us and with access only to historical data. We propose two different fairness constraints: a moderation-breaking constraint, which aims at blocking moderation paths from the action and sensitive attribute to the outcome, thereby reducing disparity in outcome levels as much as the provided action space permits; and an equal-benefit constraint, which aims at distributing the gain from the new and maximized policy equally across sensitive-attribute levels, thus keeping pre-existing preferential treatment in place or avoiding the introduction of new disparity. We introduce practical methods for implementing the constraints and illustrate their use in experiments with semi-synthetic models.Sat, 28 Jan 2023 00:00:00 GMTOn a continuous time model of gradient descent dynamics and instability in deep learninghttps://www.deepmind.com/publications/on-a-continuous-time-model-of-gradient-descent-dynamics-and-instability-in-deep-learninghttps://www.deepmind.com/publications/on-a-continuous-time-model-of-gradient-descent-dynamics-and-instability-in-deep-learningThe recipe for success behind the deep learning phenomenon has been the combination of neural networks and gradient-based optimisation.
Understanding the fundamental behaviour of optimisation in this context has lagged behind the empirical success of deep learning. We aim to add to a growing set of results trying to understand the behaviour of gradient descent. We find a continuous time flow, called \textit{the principal flow}, to help describe gradient descent dynamics. Unlike existing flows, the principal flow better captures the dynamics of gradient descent and can be used to explain its unstable or oscillatory behaviour. The principal flow depends on the eigendecomposition of the Hessian, which allows us to shed light on recently observed behaviours in deep learning, such as the edge of stability results. Using our new understanding of instability, we propose an adaptive learning rate which leads to stable training and can find the right balance in the stability-performance trade-off.Wed, 25 Jan 2023 00:00:00 GMTDiscovering Quantum Phase Transitions with Fermionic Neural Networkshttps://www.deepmind.com/publications/discovering-quantum-phase-transitions-with-fermionic-neural-networkshttps://www.deepmind.com/publications/discovering-quantum-phase-transitions-with-fermionic-neural-networksDeep neural networks have been extremely successful as highly accurate wave function ansatze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state.
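The instability the principal flow is built to explain is visible already in one dimension: gradient descent on a quadratic flips from convergent to divergent oscillation at lr = 2/curvature, a regime the naive gradient flow never exhibits. An illustrative sketch (not the paper's principal flow itself):

```python
import numpy as np

def gd_trajectory(x0, curvature, lr, steps):
    """Gradient descent on f(x) = 0.5 * curvature * x**2.
    Each step multiplies the iterate by (1 - lr * curvature), so the
    dynamics are stable iff lr < 2 / curvature; above that threshold
    the iterates oscillate with growing amplitude, while the gradient
    flow dx/dt = -curvature * x always decays monotonically."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * curvature * xs[-1])
    return np.array(xs)

stable = gd_trajectory(1.0, curvature=10.0, lr=0.19, steps=100)    # lr < 2/10
unstable = gd_trajectory(1.0, curvature=10.0, lr=0.21, steps=100)  # lr > 2/10
```

Both trajectories flip sign every step, but only the first decays toward the minimum.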
The network is given no a priori knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.Fri, 20 Jan 2023 00:00:00 GMTTransformer Grammars: Augmenting Transformer Language Models with a Syntactic Inductive Biashttps://www.deepmind.com/publications/transformer-grammars-augmenting-transformer-language-models-with-a-syntactic-inductive-biashttps://www.deepmind.com/publications/transformer-grammars-augmenting-transformer-language-models-with-a-syntactic-inductive-biasWe introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are implemented through a special attention mask and deterministic transformation of the linearized tree. We find that TGs outperform various strong baselines on sentence-level language modeling perplexity, as well as on multiple syntax-sensitive language modeling evaluation metrics. Additionally, we find that the recursive syntactic composition bottleneck which represents each sentence as a single vector harms perplexity on document-level language modeling, providing evidence that a different kind of memory mechanism—one that is independent of composed syntactic representations—plays an important role in current successful models of long text.Thu, 22 Dec 2022 00:00:00 GMTAn empirical study of implicit regularization in deep offline RLhttps://www.deepmind.com/publications/an-empirical-study-of-implicit-regularization-in-deep-offline-rlhttps://www.deepmind.com/publications/an-empirical-study-of-implicit-regularization-in-deep-offline-rlDeep neural networks have become the de facto choice of function approximator for offline RL. 
However, prior work argued that their learning dynamics could lead to an implicit under-parameterization of the model mainly due to bootstrapping. Specifically, during training, the model's effective capacity, measured by the rank of the penultimate feature layer, can drastically collapse. In turn, this is argued to reduce the model's ability to further adapt in later stages of learning, leading to poor final performance. We show that it is challenging to establish a causal link between reduced rank and poor performance. Additionally, bootstrapping does not seem sufficient to explain the previously reported rank collapse in the representation. Our empirical evaluation targets offline RL, where this observation was previously made, covering three distinct domains: BSuite, Atari, and DeepMind Lab. We provide an in-depth analysis of the behaviour of the learning system, providing clues as to why and when rank and performance might become uncorrelated.Tue, 20 Dec 2022 00:00:00 GMTConfident Approximate Policy Iteration for Efficient Local Planning in q^π-realizable MDPshttps://www.deepmind.com/publications/confident-approximate-policy-iteration-for-efficient-local-planning-in-q-realizable-mdpshttps://www.deepmind.com/publications/confident-approximate-policy-iteration-for-efficient-local-planning-in-q-realizable-mdpsWe consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost.
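The capacity measure referred to above can be computed directly. A sketch of one common proxy, the fraction-of-spectrum effective rank (the paper's exact measure and threshold may differ):

```python
import numpy as np

def effective_rank(features, tol=0.01):
    """Number of singular values needed to capture (1 - tol) of the
    spectral mass of a (batch x dim) feature matrix -- a standard
    proxy for the effective capacity of the penultimate layer."""
    s = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, 1.0 - tol) + 1)

rng = np.random.default_rng(0)
full = rng.normal(size=(256, 64))                                 # generic features
collapsed = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 64))  # rank-2 features
```

A collapsed representation shows up as a sharp drop in this number even though the ambient width (64 here) is unchanged.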
Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde{O}(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve onlFri, 16 Dec 2022 00:00:00 GMTParticle-Based Score Estimation for Jointly Learning Transition and Observation Models in Autonomous Driving https://www.deepmind.com/publications/particle-based-score-estimation-for-jointly-learning-transition-and-observation-models-in-autonomous-drivinghttps://www.deepmind.com/publications/particle-based-score-estimation-for-jointly-learning-transition-and-observation-models-in-autonomous-drivingLearning driving behaviour from observed trajectories of real-world road users is subject to observation noise. A lack of proper treatment of such noise confounds the learned behaviour, and can lead to unrealistic simulation of agents for testing Autonomous Vehicles (AVs). We use a state-space model as the underlying data generating process of our dataset of observed trajectories, and jointly learn its transition and observation models using a particle approximation of the score function. Our method yields a consistent estimate of the score without having to differentiate through the particle filter. 
We also demonstrate the efficacy of our learned transition and observation models as generative models for sampling driving behaviour and observation noise in simulation, respectively.Wed, 14 Dec 2022 00:00:00 GMTipie: A Python-based Auxiliary-Field Quantum Monte Carlo Program with Flexibility and Efficiency on CPUs and GPUshttps://www.deepmind.com/publications/ipie-a-python-based-auxiliary-field-quantum-monte-carlo-program-with-flexibility-and-efficiency-on-cpus-and-gpushttps://www.deepmind.com/publications/ipie-a-python-based-auxiliary-field-quantum-monte-carlo-program-with-flexibility-and-efficiency-on-cpus-and-gpusWe report the development of a Python-based auxiliary-field quantum Monte Carlo (AFQMC) program, ipie, with preliminary timing benchmarks and new AFQMC results on the isomerization of [Cu2O2]2+. We demonstrated how implementations for both central and graphical processing units (CPUs and GPUs) are achieved in ipie. We showed an interface of ipie with PySCF as well as a straightforward template for adding new estimators to ipie. Our timing benchmarks against C++ codes, QMCPACK and Dice, suggest that ipie is faster or similarly performing for all chemical systems considered on both CPUs and GPUs. Our results on [Cu2O2]2+ using selected configuration interaction trials show that it is possible to converge the ph-AFQMC isomerization energy between bis(μ-oxo) and μ-η2:η2 peroxo configurations to the exact known result with 10^5 to 10^6 determinants for small basis sets. We also report the isomerization energy with a quadruple-zeta basis set, which involved 32 electrons and 280 orbitals with 10^6 determinants in the trial wavefunction.
These results highlight the utility of ph-AFQMC and ipie for systems with modest strong correlation and large-scale dynamic correlation.Mon, 12 Dec 2022 00:00:00 GMTScore-based generative models learn manifold-like structures with constrained mixinghttps://www.deepmind.com/publications/score-based-generative-models-learn-manifold-like-structures-with-constrained-mixinghttps://www.deepmind.com/publications/score-based-generative-models-learn-manifold-like-structures-with-constrained-mixingHow do score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold? We investigate the score model of a trained SBM through its linear approximations and subspaces spanned by local feature vectors. During diffusion, as the noise decreases, the local dimensionality increases and becomes more varied between different sample sequences. Importantly, we find that the learned vector field mixes samples by a non-conservative field within the manifold, although it denoises with normal projections as if there is an energy function in off-manifold directions. At each noise level, the subspace spanned by the local features overlaps with an effective density function. These observations suggest that SBMs can flexibly mix samples with the learned score field while carefully maintaining a manifold-like structure of the data distribution.Sat, 10 Dec 2022 00:00:00 GMTContinuous Neural Algorithmic Plannershttps://www.deepmind.com/publications/continuous-neural-algorithmic-plannershttps://www.deepmind.com/publications/continuous-neural-algorithmic-plannersPlanning is an important aspect of successful agency, especially as tasks get more combinatorially challenging. This intuition is applied in reinforcement learning, bringing about algorithms, such as value iteration, that allow us to plan and obtain optimal policies, if given the necessary information about the environment.
Implicit planning eliminates the need for this privileged information, by combining learned world models with model-free reinforcement learning. A recent implicit planner, XLVIN, allows reaping the benefits of modern representation learning while still maintaining alignment to the value iteration algorithm; however, it only supports discrete action spaces, and is hence not trivially applicable to most tasks of real-world interest. We expand XLVIN to continuous action spaces by discretising the action space, and evaluating several selective expansion policies. Our proposal, CNAP, demonstrates how neural algorithmic reasoning can make a measurable impact in higher-dimensional continuous control settings, such as MuJoCo, bringing gains in low-data settings and outperforming model-free baselines.Fri, 09 Dec 2022 00:00:00 GMTUnbiased and Efficient Sampling of Dependency Treeshttps://www.deepmind.com/publications/unbiased-and-efficient-sampling-of-dependency-treeshttps://www.deepmind.com/publications/unbiased-and-efficient-sampling-of-dependency-treesMost computational models of dependency syntax consist of distributions over spanning trees. However, the majority of dependency treebanks require that every valid dependency tree has a single edge coming out of the ROOT node, a constraint that is not part of the definition of spanning trees. For this reason, all standard inference algorithms for spanning trees are suboptimal for inference over dependency trees. Zmigrod et al. (2021b) proposed algorithms for sampling with and without replacement from the dependency tree distribution that incorporate the single-root constraint. In this paper we show that their fastest algorithm for sampling with replacement, Wilson-RC, is in fact producing biased samples, and we provide two alternatives that are unbiased. Additionally, we propose two algorithms (one incremental, one parallel) that reduce the asymptotic runtime of the algorithm for sampling k trees without replacement to O(kn^3).
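These samplers build on random-walk spanning-tree sampling; the classical unconstrained building block, Wilson's algorithm for uniform spanning trees (without the single-root dependency constraint), can be sketched as:

```python
import random

def wilson_spanning_tree(nodes, neighbors, root, seed=0):
    """Wilson's algorithm: grow a uniformly random spanning tree via
    loop-erased random walks. Overwriting parent[u] on each revisit
    erases loops implicitly: a frozen node keeps only its last exit."""
    rng = random.Random(seed)
    in_tree = {root}
    parent = {}
    for start in nodes:
        u = start
        while u not in in_tree:          # random walk until the tree is hit
            parent[u] = rng.choice(neighbors[u])
            u = parent[u]
        u = start
        while u not in in_tree:          # freeze the loop-erased path
            in_tree.add(u)
            u = parent[u]
    return parent                        # child -> parent edges

nodes = [0, 1, 2, 3]
neighbors = {u: [v for v in nodes if v != u] for u in nodes}
tree = wilson_spanning_tree(nodes, neighbors, root=0)
```

On the complete graph over four nodes this yields three child-to-parent edges forming a tree rooted at node 0.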
These algorithms are both asymptotically and practically more efficient.Fri, 09 Dec 2022 00:00:00 GMTReasoning-Modulated Representationshttps://www.deepmind.com/publications/reasoning-modulated-representationshttps://www.deepmind.com/publications/reasoning-modulated-representationsNeural networks leverage robust internal representations in order to generalise. Learning them is difficult, and often requires a large training set that covers the data distribution densely. We study a common setting where our task is not purely opaque. Indeed, very often we may have access to information about the underlying system (e.g. that observations must obey certain laws of physics) that any "tabula rasa" neural network would need to re-learn from scratch, penalising performance. We incorporate this information into a pre-trained reasoning module, and investigate its role in shaping the discovered representations in diverse self-supervised learning settings from pixels. Our approach paves the way for a new class of representation learning, grounded in algorithmic priors.Fri, 09 Dec 2022 00:00:00 GMTBLaDE: Robust Exploration via Diffusion Modelshttps://www.deepmind.com/publications/blade-robust-exploration-via-diffusion-modelshttps://www.deepmind.com/publications/blade-robust-exploration-via-diffusion-modelsWe present Bootstrap your own Latents with Diffusion models for Exploration (BLaDE), a general approach for curiosity-driven exploration in complex, partially-observable and stochastic environments. BLaDE is a natural extension of Bootstrap Your Own Latents for Exploration (BYOL-Explore) which is a multi-step prediction-error method at the latent level that learns a world representation, the world dynamics, and provides an intrinsic-reward all-together by optimizing a single prediction loss with no additional auxiliary objective. 
Contrary to BYOL-Explore, which predicts future latents from past latents and future open-loop actions, BLaDE predicts, via a diffusion model, future latents from past observations, future open-loop actions and a noisy version of future latents. This simple modification allows us to obtain an intrinsic reward that does not depend on the variance of the distribution of future latents, which makes the method agnostic to stochastic traps. Our experiments on different noisy versions of Montezuma's Revenge show that BLaDE handles stochasticity better than Random Network Distillation, Intrinsic Curiosity Module and BYOL-Explore without degrading the performance of BYOL-Explore in the non-noisy and fairly deterministic Montezuma's Revenge.Fri, 09 Dec 2022 00:00:00 GMTLearning Graph Search Heuristicshttps://www.deepmind.com/publications/learning-graph-search-heuristicshttps://www.deepmind.com/publications/learning-graph-search-heuristicsSearching for a path between two nodes in a graph is one of the most well-studied and fundamental problems in computer science. In numerous domains such as robotics, AI, or biology, practitioners develop search heuristics to accelerate their pathfinding algorithms. However, it is a laborious and complex process to hand-design heuristics based on the problem and the structure of a given use case. Here we present PHIL (Path Heuristic with Imitation Learning), a novel neural architecture and a training algorithm for discovering graph search and navigation heuristics from data by leveraging recent advances in imitation learning and graph representation learning. At training time, we aggregate datasets of search trajectories and ground-truth shortest path distances, which we use to train a specialized graph neural network-based heuristic function using backpropagation through steps of the pathfinding process.
Our heuristic function learns graph embeddings useful for inferring node distances, runs in constant time independent of graph sizes, and can be easily incorporated in an algorithm such as A* at test time. Experiments show that PHIL reduces the number of explored nodes compared to state-of-the-art methods on benchmark datasets by 58.5% on average, can be directly applied in diverse graphs ranging from biological networks to road networks, and allows for fast planning in time-critical robotics domains.Fri, 09 Dec 2022 00:00:00 GMTLearnable Commutative Monoids for Graph Neural Networkshttps://www.deepmind.com/publications/learnable-commutative-monoids-for-graph-neural-networkshttps://www.deepmind.com/publications/learnable-commutative-monoids-for-graph-neural-networksGraph neural networks (GNNs) have been shown to be highly sensitive to the choice of aggregation function. While summing over a node's neighbours can approximate any permutation-invariant function over discrete inputs, recent work has proved that there are set-aggregation problems for which summing cannot generalise to unbounded inputs, proposing recurrent neural networks regularised towards permutation-invariance as a more expressive aggregator. We show that these results carry over to the graph domain: GNNs equipped with recurrent aggregators are competitive with state-of-the-art invariant aggregators, on both synthetic benchmarks and real-world problems. However, despite the benefits of recurrent aggregators, their $O(V)$ time complexity makes them both difficult to parallelise and harder to train on large graphs. Inspired by the observation that a well-behaved aggregator for a GNN is a commutative semigroup over its latent space, we propose various simple architectures for learnable binary operators, and regularise them towards commutativity and associativity to obtain learnable commutative monoids. 
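Dropping a learned heuristic into A* is mechanically simple, since A* only needs a callable estimate of distance-to-goal. A sketch with a pluggable heuristic (a trained network like PHIL's would slot in here; the zero heuristic below reduces A* to Dijkstra):

```python
import heapq

def a_star(neighbors, start, goal, heuristic):
    """A* with a pluggable heuristic: node -> estimated cost-to-goal.
    neighbors: {node: [(nbr, edge_cost), ...]}. Returns (path, cost)."""
    frontier = [(heuristic(start), start)]
    came_from, g, done = {start: None}, {start: 0.0}, set()
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in done:                       # skip stale heap entries
            continue
        done.add(node)
        if node == goal:
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1], g[node]
        for nbr, cost in neighbors.get(node, []):
            new_g = g[node] + cost
            if new_g < g.get(nbr, float('inf')):
                g[nbr], came_from[nbr] = new_g, node
                heapq.heappush(frontier, (new_g + heuristic(nbr), nbr))
    return None, float('inf')
```

A tighter (admissible) heuristic prunes more of the frontier, which is exactly where the reported reduction in explored nodes comes from.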
Using these, we construct an aggregator of $O(\log V)$ depth, yielding exponential improvements for both parallelism and dependency length while achieving performance competitive with recurrent aggregators. Based on our empirical observations, our proposed dense binary aggregator represents the ``best of both worlds'' between efficient and expressive aggregFri, 09 Dec 2022 00:00:00 GMTA Generalist Neural Algorithmic Learnerhttps://www.deepmind.com/publications/a-generalist-neural-algorithmic-learnerhttps://www.deepmind.com/publications/a-generalist-neural-algorithmic-learnerThe cornerstone of neural algorithmic reasoning is the ability to solve algorithmic tasks, especially in a way that generalises out-of-distribution. While recent years have seen a surge in methodological improvements in this area, they mostly focussed on building *specialist* models. Specialist models are capable of learning to neurally execute either only one algorithm or a collection of algorithms with identical control-flow backbone. Here, instead, we focus on constructing a *generalist* neural algorithmic learner---a single graph neural network processor capable of learning to execute a wide range of algorithms, such as sorting, searching, dynamic programming, path-finding and geometry. We leverage the CLRS benchmark to empirically show that, much like recent successes in the domain of perception, generalist algorithmic learners can be built by ``containing'' knowledge. That is, it is possible to effectively learn algorithms in a multi-task manner, so long as we can learn to execute them well in a single-task regime. Motivated by this, we present a series of improvements to the input representation, training regime and processor architecture over CLRS, improving average single-task performance by over 20% from previously published work. We then conduct a thorough ablation of multi-task learners leveraging these improvements. 
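The depth reduction comes from the algebra alone: any associative, commutative operator with an identity can be applied over a balanced tree rather than a length-n chain. A minimal sketch with a plain Python operator standing in for the learned one:

```python
def balanced_reduce(items, op, identity):
    """Aggregate n items with a binary operator in O(log n) depth by
    pairing neighbours at each level. For a commutative monoid
    (associative, commutative, with identity) the result matches the
    sequential left-to-right fold exactly."""
    if not items:
        return identity
    level = list(items)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # odd element carries to next level
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

A learned binary operator regularised towards commutativity and associativity would replace `op`, with a learned identity element replacing `identity`.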
Our results demonstrate a generalist learner that effectively contains knowledge captured by specialist models.Fri, 09 Dec 2022 00:00:00 GMTExpander Graph Propagationhttps://www.deepmind.com/publications/expander-graph-propagationhttps://www.deepmind.com/publications/expander-graph-propagationDeploying graph neural networks (GNNs) on whole-graph classification or regression tasks is known to be challenging: it often requires computing node features that are mindful of both local interactions in their neighbourhood and the global context of the graph structure. GNN architectures that navigate this space need to avoid pathological behaviours, such as bottlenecks and oversquashing, while ideally having linear time and space complexity requirements. In this work, we propose an elegant approach based on propagating information over expander graphs. We leverage an efficient method for constructing expander graphs of a given size, and use this insight to propose the EGP model. We show that EGP is able to address all of the above concerns, while requiring minimal effort to set up, and provide evidence of its empirical utility on relevant graph classification datasets and baselines in the Open Graph Benchmark. Importantly, using expander graphs as a template for message passing necessarily gives rise to negative curvature. While this appears to be counterintuitive in light of recent related work on oversquashing, we theoretically demonstrate that negatively curved edges are likely to be required to obtain scalable message passing without bottlenecks. 
To the best of our knowledge, this is a previously unstudied result in the context of graph representation learning, and we believe our analysis paves the way to a novel class of scalable methods to counterFri, 09 Dec 2022 00:00:00 GMTScaffolding cooperation in human groups with deep reinforcement learninghttps://www.deepmind.com/publications/scaffolding-cooperation-in-human-groups-with-deep-reinforcement-learninghttps://www.deepmind.com/publications/scaffolding-cooperation-in-human-groups-with-deep-reinforcement-learningAltruism and selfishness are highly transmissible. Either can easily cascade through human communities. Effective approaches to encouraging group cooperation—while also mitigating the risk of spreading defection—are still an open challenge. Here, we apply recent advances in deep reinforcement learning to structure networks of human participants playing a group cooperation game. We leverage deep reinforcement learning and simulation methods to train a "social planner" capable of making recommendations to create or break connections between group members. This social planner learns a strategy from scratch, through repeated trial and error. The strategy that it develops succeeds at encouraging prosociality in networks of human participants (_N_ = 208 participants in 13 sessions) playing the group cooperation game for real monetary stakes. Under the social planner, groups finished the game with an average cooperation rate of 77.7%, compared to 42.8% in static networks (_N_ = 176 participants in 11 sessions). In contrast to prior strategies that separate defectors from cooperators (tested here with *N* = 384 participants in 24 sessions), the social planner learns to take a conciliatory approach to defectors, encouraging them to act prosocially by moving them to small, highly-cooperative neighborhoods. 
A subsequent validation study (_N_ = 224 participants in 14 sessions) confirms that this encouraging approach can scaffold group cooperation absent the "blackThu, 08 Dec 2022 00:00:00 GMTRe-assessing Commonsense Knowledge in Large Language Modelshttps://www.deepmind.com/publications/re-assessing-commonsense-knowledge-in-large-language-modelshttps://www.deepmind.com/publications/re-assessing-commonsense-knowledge-in-large-language-modelsLarge language models such as GPT2, GPT3, and MT-NLG have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting. However, it remains unclear whether these models exhibit commonsense understanding-- a critical component of NLP tasks. We aim to answer this question by performing a systematic study on four commonsense benchmarks including social, physical, and temporal commonsense knowledge. Our investigation demonstrates that the impressive zero-shot performance of large language models is mostly due to existence of dataset bias in our benchmarks. We also show that the zero-shot performance is sensitive to the choice of hyper-parameters and similarity of the benchmark to the pre-training datasets. Our work highlights the need for including strong baselines when studying pre-trained language models.Wed, 07 Dec 2022 00:00:00 GMTLearning rigid dynamics with face interaction graph networkshttps://www.deepmind.com/publications/learning-rigid-dynamics-with-face-interaction-graph-networks-3https://www.deepmind.com/publications/learning-rigid-dynamics-with-face-interaction-graph-networks-3Simulating rigid collisions among arbitrary shapes is notoriously difficult due to complex geometry and the strong non-linearity of the interactions. While graph neural network (GNN)-based models are effective at learning to simulate complex physical dynamics, such as fluids, cloth and articulated bodies, they have been less effective and efficient on rigid-body physics, except with very simple shapes. 
Existing methods that model collisions through the meshes' nodes are often inaccurate because they struggle when collisions occur on faces far from nodes. Alternative approaches that represent the geometry densely with many particles are prohibitively expensive for complex shapes. Here we introduce the ``Face Interaction Graph Network'' (FIGNet) which extends beyond GNN-based methods, and computes interactions between mesh faces, rather than nodes. Compared to learned node- and particle-based methods, FIGNet is around 4x more accurate in simulating complex shape interactions, while also 8x more computationally efficient on sparse, rigid meshes. Moreover, FIGNet can learn frictional dynamics directly from real-world data, and can be more accurate than analytical solvers given modest amounts of training data. FIGNet represents a key step forward in one of the few remaining physical domains which have seen little competition from learned simulators, and offers allied fields such as robotics, graphics and mechanical design a new tool for simulation and model-basWed, 07 Dec 2022 00:00:00 GMTNegotiation and honesty in artificial intelligence methods for the board game of Diplomacyhttps://www.deepmind.com/publications/negotiation-and-honesty-in-artificial-intelligence-methods-for-the-board-game-of-diplomacyhttps://www.deepmind.com/publications/negotiation-and-honesty-in-artificial-intelligence-methods-for-the-board-game-of-diplomacyThe success of human civilization is rooted in our ability to cooperate by communicating and making joint plans. We study how artificial agents may use communication to better cooperate in Diplomacy, a long-standing AI challenge. We propose negotiation algorithms allowing agents to agree on contracts regarding joint plans, and show they outperform agents lacking this ability. For humans, misleading others about our intentions forms a barrier to cooperation. Diplomacy requires reasoning about our opponents’ future plans, enabling us to study broken commitments between agents and the conditions for honest cooperation.
We find that artificial agents face a problem similar to the one humans face: communities of communicating agents are susceptible to peers who deviate from agreements. To defend against this, we show that the inclination to sanction peers who break contracts dramatically reduces the advantage of such deviators. Hence, sanctioning helps foster mostly truthful communication, despite conditions that initially favor deviations from agreements.Tue, 06 Dec 2022 00:00:00 GMTLarge-Scale Retrieval for Reinforcement Learninghttps://www.deepmind.com/publications/large-scale-retrieval-for-reinforcement-learninghttps://www.deepmind.com/publications/large-scale-retrieval-for-reinforcement-learningEffective decision making involves flexibly relating past experiences and relevant contextual information to a novel situation. In deep reinforcement learning (RL), the dominant paradigm is for an agent to amortise information that helps decision-making into its network weights via gradient descent on training losses. Here, we pursue an alternative approach in which agents can utilise large-scale context-sensitive database lookups to support their parametric computations. This allows agents to directly learn in an end-to-end manner to utilise relevant information to inform their outputs. In addition, new information can be attended to by the agent, without retraining, by simply augmenting the retrieval dataset. We study this approach for offline RL in 9x9 Go, a challenging game for which the vast combinatorial state space privileges generalisation over direct matching to past experiences. We leverage fast, approximate nearest neighbor techniques to retrieve relevant data from a set of tens of millions of expert demonstration states.
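The retrieval step described in the abstract above can be illustrated with a toy sketch: embed a query state, look up its nearest neighbours in a database of expert states, and use the attached payloads to inform a prediction. This is a minimal brute-force stand-in; the paper uses fast approximate nearest-neighbour search at a scale of tens of millions of states, and the 2-D "embeddings" and move labels below are invented for illustration only.

```python
import math

def retrieve_neighbors(query, database, k=3):
    """Brute-force k-nearest-neighbour lookup by Euclidean distance.

    `database` is a list of (embedding, payload) pairs. At the scale in the
    abstract, this linear scan would be replaced by an approximate
    nearest-neighbour index, but the interface stays the same.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(database, key=lambda item: dist(query, item[0]))
    return [payload for _, payload in ranked[:k]]

# Toy database: 2-D "state embeddings" tagged with expert move labels.
db = [((0.0, 0.0), "pass"), ((1.0, 0.0), "A1"),
      ((0.9, 0.1), "A1"), ((5.0, 5.0), "B7")]
print(retrieve_neighbors((1.0, 0.1), db, k=2))  # → ['A1', 'A1']
```

The retrieved payloads would then be attended to by the agent's network; augmenting `db` changes the agent's behaviour without retraining.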
Attending to this information provides a significant boost to prediction accuracy and game-play performance over simply using these demonstrations as training trajectories, providing a compelling demonstration of the value of large-scale retrieval in offline RL agents.Mon, 05 Dec 2022 00:00:00 GMTA Fourier Approach to Mixture Learninghttps://www.deepmind.com/publications/a-fourier-approach-to-mixture-learninghttps://www.deepmind.com/publications/a-fourier-approach-to-mixture-learningWe revisit the problem of learning mixtures of spherical Gaussians. Given samples from a mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the \emph{separation} $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\epsilon$. 
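The mixture-learning setup above can be made concrete: given samples from $\frac{1}{k}\sum_{j}\mathcal{N}(\mu_j, I_d)$, the mixture's Fourier transform can be estimated by the empirical characteristic function. The sketch below shows only this basic estimator, which Fourier-based approaches build on; it is not the paper's full algorithm, and the sample size and test frequency are arbitrary.

```python
import cmath, math, random

def sample_mixture(means, n, rng):
    """Draw n points from the uniform mixture (1/k) * sum_j N(mu_j, I_d)."""
    return [[rng.gauss(mu_c, 1.0) for mu_c in rng.choice(means)]
            for _ in range(n)]

def empirical_cf(samples, xi):
    """Empirical characteristic function (1/n) * sum_x exp(i <xi, x>),
    an unbiased estimate of the mixture's Fourier transform at frequency xi."""
    n = len(samples)
    return sum(cmath.exp(1j * sum(a * b for a, b in zip(xi, x)))
               for x in samples) / n

rng = random.Random(0)
means = [[-3.0, 0.0], [3.0, 0.0]]          # k = 2 well-separated components
samples = sample_mixture(means, 20000, rng)

# Closed form for comparison: CF of N(mu, I) is exp(i<xi,mu> - |xi|^2/2).
xi = [0.5, 0.0]
truth = math.exp(-0.25 / 2) * (cmath.exp(-1.5j) + cmath.exp(1.5j)) / 2
print(abs(empirical_cf(samples, xi) - truth) < 0.05)
```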
Our approach is based on estimating the Fourier transform of the mixture at carefully chosen…Thu, 01 Dec 2022 00:00:00 GMTCharacteristics of Harmful Text: Towards Rigorous Benchmarking of Language Modelshttps://www.deepmind.com/publications/characteristics-of-harmful-text-towards-rigorous-benchmarking-of-language-modelshttps://www.deepmind.com/publications/characteristics-of-harmful-text-towards-rigorous-benchmarking-of-language-modelsLarge language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.Tue, 29 Nov 2022 00:00:00 GMTLearning to Navigate Wikipedia by Taking Random Walkshttps://www.deepmind.com/publications/learning-to-navigate-wikipedia-by-taking-random-walkshttps://www.deepmind.com/publications/learning-to-navigate-wikipedia-by-taking-random-walksA fundamental ability of an intelligent web-based agent is seeking out and acquiring new information. Internet search engines reliably find the correct vicinity, but the top results may be a few links away from the desired target.
A complementary approach is navigation via hyperlinks, employing a policy that comprehends local content and selects a link that moves it closer to the target. In this paper we show that using behavioral cloning of randomly sampled trajectories is sufficient to learn an effective link selection policy. We demonstrate the approach on a graph version of Wikipedia with 38M nodes and 387M edges. The model is able to efficiently navigate between nodes 5 and 20 steps apart 96% and 92% of the time, respectively. We then use the resulting embeddings and policy in a downstream fact verification task where, in combination with basic TF-IDF search and ranking methods, they are able to obtain results competitive with state-of-the-art methods.Mon, 28 Nov 2022 00:00:00 GMTTurbocharging Solution Concepts: Solving NEs, CEs and CCEs with Deep Equilibrium Networkshttps://www.deepmind.com/publications/turbocharging-solution-concepts-solving-nes-ces-and-cces-with-deep-equilibrium-networkshttps://www.deepmind.com/publications/turbocharging-solution-concepts-solving-nes-ces-and-cces-with-deep-equilibrium-networksSolution concepts such as Nash Equilibria, Correlated Equilibria, and Coarse Correlated Equilibria are useful components for many multiagent machine learning algorithms. Unfortunately, solving a normal-form game could take prohibitive or non-deterministic time to converge, and could fail. We introduce the Neural Equilibrium Solver, which utilizes a special equivariant neural network architecture to approximately solve the space of all games of fixed shape, buying speed and determinism. We define a flexible equilibrium selection framework that is capable of uniquely selecting an equilibrium that minimizes relative entropy, or maximizes welfare. The network is trained without needing to generate any supervised training data. We show remarkable zero-shot generalization to larger games.
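A standard way to assess the output of an approximate equilibrium solver like the one described above is exploitability: the total gain available to the players from unilaterally best-responding, which is zero exactly at a Nash equilibrium. A minimal sketch for two-player normal-form games (illustrative only; the Neural Equilibrium Solver itself is a neural network, not this computation):

```python
def exploitability(payoffs_a, payoffs_b, strat_a, strat_b):
    """NashConv-style gap for a two-player normal-form game: how much each
    player could gain by best-responding to the other's mixed strategy.
    A gap of 0 means (strat_a, strat_b) is a Nash equilibrium."""
    def expected(payoffs, pa, pb):
        return sum(pa[i] * pb[j] * payoffs[i][j]
                   for i in range(len(pa)) for j in range(len(pb)))
    # Best-response values: player A picks a row, player B picks a column.
    best_a = max(sum(strat_b[j] * payoffs_a[i][j] for j in range(len(strat_b)))
                 for i in range(len(payoffs_a)))
    best_b = max(sum(strat_a[i] * payoffs_b[i][j] for i in range(len(strat_a)))
                 for j in range(len(payoffs_b[0])))
    gap_a = best_a - expected(payoffs_a, strat_a, strat_b)
    gap_b = best_b - expected(payoffs_b, strat_a, strat_b)
    return gap_a + gap_b

# Matching Pennies: uniform play is the unique Nash equilibrium.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
print(exploitability(A, B, [0.5, 0.5], [0.5, 0.5]))      # → 0.0 at equilibrium
print(exploitability(A, B, [1.0, 0.0], [0.5, 0.5]) > 0)  # pure play is exploitable
```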
We argue that such a network is a powerful component for many possible multiagent algorithms.Mon, 28 Nov 2022 00:00:00 GMTSemantic Exploration from Language Abstractions and Pretrained Representationshttps://www.deepmind.com/publications/semantic-exploration-from-language-abstractions-and-pretrained-representationshttps://www.deepmind.com/publications/semantic-exploration-from-language-abstractions-and-pretrained-representationsEffective exploration is a challenge in reinforcement learning (RL). Novelty-based exploration methods can suffer in high-dimensional state spaces, such as continuous partially-observable 3D environments. We address this challenge by defining novelty using semantically meaningful state abstractions, which can be found in learned representations shaped by natural language. In particular, we evaluate vision-language representations, pretrained on natural image captioning datasets. We show that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments. We also characterize why and how language provides useful abstractions for exploration by considering the impacts of using representations from a pretrained model, a language oracle, and several ablations. We demonstrate the benefits of our approach with on- and off-policy RL algorithms and in two very different task domains---one that stresses the identification and manipulation of everyday objects, and one that requires navigational exploration in an expansive world. 
Our results suggest that using language-shaped representations could improve exploration for various algorithms and agents in challenging environments.Mon, 28 Nov 2022 00:00:00 GMTContinuous diffusion for categorical datahttps://www.deepmind.com/publications/continuous-diffusion-for-categorical-datahttps://www.deepmind.com/publications/continuous-diffusion-for-categorical-dataDiffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.Mon, 28 Nov 2022 00:00:00 GMTScore-Based Diffusion meets Annealed Importance Samplinghttps://www.deepmind.com/publications/score-based-diffusion-meets-annealed-importance-samplinghttps://www.deepmind.com/publications/score-based-diffusion-meets-annealed-importance-samplingMore than twenty years after its introduction, Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation. It relies on a sequence of distributions interpolating between a tractable initial distribution and the posterior of interest which we simulate from approximately using a non-homogeneous Markov chain. To obtain an importance sampling estimate of the marginal likelihood, AIS introduces an extended target distribution to reweight the Markov chain proposal. 
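The AIS mechanism summarized above (annealing from a tractable initial distribution to the target while accumulating importance weights) can be sketched minimally in one dimension. This is vanilla AIS with geometric annealing and Metropolis transitions, not the score-based extension the paper proposes; step counts, proposal scale, and the test target are illustrative.

```python
import math, random

def ais_log_z(log_target, n_steps=200, n_chains=300, seed=0):
    """Minimal 1-D Annealed Importance Sampling.

    Anneals from a standard normal (normalizer known) to the unnormalized
    log_target via geometric interpolation, with one Metropolis step per
    temperature, and averages the weights to estimate log Z of the target.
    """
    rng = random.Random(seed)
    log_prior = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
    betas = [t / n_steps for t in range(n_steps + 1)]
    log_weights = []
    for _ in range(n_chains):
        x = rng.gauss(0.0, 1.0)
        log_w = 0.0
        for b_prev, b in zip(betas, betas[1:]):
            # Accumulate the incremental weight, then move under the new level.
            log_w += (b - b_prev) * (log_target(x) - log_prior(x))
            anneal = lambda y: (1 - b) * log_prior(y) + b * log_target(y)
            prop = x + rng.gauss(0.0, 0.5)
            if math.log(rng.random()) < anneal(prop) - anneal(x):
                x = prop
        log_weights.append(log_w)
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights) / n_chains)

# Unnormalized N(2, 0.5^2): true log Z = log(sqrt(2*pi) * 0.5).
log_target = lambda x: -0.5 * ((x - 2.0) / 0.5) ** 2
true_log_z = math.log(math.sqrt(2 * math.pi) * 0.5)
est = ais_log_z(log_target)
print(est)  # should land near true_log_z ≈ 0.226
```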
While much effort has been devoted to improving the AIS proposal distribution by changing the intermediate distributions and corresponding Markov kernels, an underappreciated issue is that AIS uses a convenient but suboptimal extended target distribution which can hinder its performance. Here we leverage recent progress in score-based generative modeling (SGM) to approximate the optimal extended target distribution for AIS proposals corresponding to the discretization of Langevin and Hamiltonian dynamics using score matching ideas. We demonstrate these novel, differentiable, AIS procedures on a number of synthetic benchmark distributions and amortized inference tasks.Mon, 28 Nov 2022 00:00:00 GMTFine-tuning language models to find agreement among humans with diverse preferenceshttps://www.deepmind.com/publications/fine-tuning-language-models-to-find-agreement-among-humans-with-diverse-preferenceshttps://www.deepmind.com/publications/fine-tuning-language-models-to-find-agreement-among-humans-with-diverse-preferencesRecent work in large language models (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality.
A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a …Mon, 28 Nov 2022 00:00:00 GMTNeural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Membershttps://www.deepmind.com/publications/neural-payoff-machines-predicting-fair-and-stable-payoff-allocations-among-team-membershttps://www.deepmind.com/publications/neural-payoff-machines-predicting-fair-and-stable-payoff-allocations-among-team-membersIn multi-agent systems, agents can form teams allowing for collective outcomes that may far surpass the capabilities of an individual. Although a greater reward is generated through this process, fair distribution of the reward amongst the collective agents requires careful consideration. How to measure the relative contribution of an agent and allocate a share of the reward that reflects their effort, while ensuring future collaboration, is a difficult equation to balance. Cooperative game theory offers solution concepts identifying payment schemes such as the Shapley value, which fairly reflects the contribution of individuals to the performance of the team, or the Core, which reduces the incentive of agents to abandon their team. Applications of such methods include identifying influential features for explainable AI, sharing the costs of joint ventures or team formation.
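The Shapley value mentioned above can be computed exactly for small games by averaging each player's marginal contribution over all join orders, which also makes the computational barrier concrete: the number of orders grows factorially. A minimal sketch on a toy "glove game" (the game and player labels are invented for illustration, not taken from the paper):

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all
    join orders. Factorial cost, which is why learned approximations like
    those in the abstract are attractive for larger games."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Glove game: player 1 holds a left glove, players 2 and 3 hold right gloves;
# a coalition earns 1 per matched pair. Classic answer: (2/3, 1/6, 1/6).
def v(coalition):
    return float(min(len(coalition & {1}), len(coalition & {2, 3})))

sv = shapley_values([1, 2, 3], v)
print(sv)  # player 1's scarce left glove earns the largest fair share
```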
Unfortunately, using these solutions requires tackling a computational barrier as they are hard to calculate even in restricted settings. We train neural networks to propose fair and stable payoff allocations, showing that cooperative game theoretic solutions can be distilled into a learned model that generalizes to previously unobserved games. We show that these techniques generalize even to games that are very far from the training distribution or that have more players than those in the training set.Sat, 26 Nov 2022 00:00:00 GMTInverse Design for Fluid-Structure Interactions using Graph Network Simulatorshttps://www.deepmind.com/publications/inverse-design-for-fluid-structure-interactions-using-graph-network-simulatorshttps://www.deepmind.com/publications/inverse-design-for-fluid-structure-interactions-using-graph-network-simulatorsDesigning physical artifacts that serve a purpose---such as tools and other functional structures---is central to engineering as well as everyday human behavior. Though automating design has tremendous promise, general-purpose methods do not yet exist. Here we explore a simple, fast, and robust approach to inverse design which combines learned forward simulators based on graph neural networks with gradient-based design optimization. Our approach solves high-dimensional problems with complex physical dynamics, including designing surfaces and tools to manipulate fluid flows and optimizing the shape of an airfoil to minimize drag. This framework produces high-quality designs by propagating gradients through trajectories of hundreds of steps, even when using models that were pre-trained for single-step predictions on data substantially different from the design tasks. In our fluid manipulation tasks, the resulting designs outperformed those found by sampling-based optimization techniques. In airfoil design, they matched the quality of those obtained with a specialized solver.
Our results suggest that despite some remaining challenges, machine learning-based simulators are maturing to the point where they can support general-purpose design optimization across a variety of domains.Sat, 26 Nov 2022 00:00:00 GMTTowards combinatorial invariance for Kazhdan-Lusztig polynomialshttps://www.deepmind.com/publications/on-kazhdan-lusztig-polynomials-for-symmetric-groupshttps://www.deepmind.com/publications/on-kazhdan-lusztig-polynomials-for-symmetric-groupsKazhdan-Lusztig polynomials are important and mysterious objects in representation theory. Here we present a new formula for their computation for symmetric groups based on the Bruhat graph. Our approach suggests a solution to the combinatorial invariance conjecture for symmetric groups, a well-known conjecture formulated by Lusztig and Dyer in the 1980s.Wed, 23 Nov 2022 00:00:00 GMTCuriosity in hindsighthttps://www.deepmind.com/publications/curiosity-in-hindsighthttps://www.deepmind.com/publications/curiosity-in-hindsightConsider the problem of exploration in sparse-reward or reward-free environments, such as Montezuma's Revenge. The *curiosity-driven* paradigm dictates an intuitive technique: At each step, the agent is rewarded for how much the realized outcome differs from its predicted outcome. However, using predictive error as intrinsic motivation is prone to fail in *stochastic environments*, as the agent may become hopelessly drawn to high-entropy areas of the state-action space, such as a noisy TV. Therefore it is important to distinguish between aspects of world dynamics that are inherently *predictable* (for which errors reflect epistemic uncertainty) and aspects that are inherently *unpredictable* (for which errors reflect aleatoric uncertainty): The former should constitute a source of intrinsic reward, whereas the latter should not.
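The noisy-TV failure mode described above is easy to reproduce in miniature: with plain prediction-error curiosity, intrinsic reward decays to zero under predictable dynamics but stays high under irreducible noise. A minimal sketch, with a running-average predictor standing in for a learned model (learning rate and horizon are arbitrary choices, not from the paper):

```python
import random

def curiosity_rewards(outcomes, lr=0.1):
    """Plain prediction-error curiosity: intrinsic reward is the squared
    error of a running prediction updated toward each observed outcome.
    Error (and hence reward) vanishes for predictable dynamics but not
    for aleatoric noise -- the 'noisy TV' problem."""
    pred, rewards = 0.0, []
    for y in outcomes:
        rewards.append((y - pred) ** 2)
        pred += lr * (y - pred)
    return rewards

rng = random.Random(0)
deterministic = [1.0] * 500                              # fully predictable
noisy_tv = [rng.choice([0.0, 2.0]) for _ in range(500)]  # same mean, pure noise

print(curiosity_rewards(deterministic)[-1] < 1e-6)            # → True: curiosity fades
print(sum(curiosity_rewards(noisy_tv)[-100:]) / 100 > 0.5)    # → True: stays "curious"
```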
In this work, we study a natural solution derived from structural causal models of the world: Our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome---no more, no less---which we use as additional input for predictions, such that intrinsic rewards do vanish in the limit. First, we propose incorporating such hindsight representations into the agent's model to disentangle "noise" from "novelty", yielding *Curiosity in Hindsight*: a simple and scalable generalization of curiosity that is robust to all types of stochasticity. Second, we implement this framework…Fri, 18 Nov 2022 00:00:00 GMTSpace is a sequence: Structured sequence learning as a unified theory of spatial representation in the hippocampushttps://www.deepmind.com/publications/space-is-a-sequence-structured-sequence-learning-as-a-unified-theory-of-spatial-representation-in-the-hippocampushttps://www.deepmind.com/publications/space-is-a-sequence-structured-sequence-learning-as-a-unified-theory-of-spatial-representation-in-the-hippocampusFascinating and puzzling phenomena, such as landmark vector cells, splitter cells, and event-specific representations to name a few, are regularly discovered in the hippocampus. Without a unifying principle that can explain these divergent observations, each experiment seemingly discovers a new anomaly or coding type. Here, we provide the singular key insight that the hippocampus is a sequence learner, and that a mental representation of space is an emergent property of latent higher-order sequence learning. Treating space as a sequence resolves myriad phenomena, and suggests that the place-field mapping methodology where sequential neuron responses are interpreted in spatial and Euclidean terms might itself be the source of anomalies.
Our model, called Clone-structured Cognitive Graph (CSCG), uses a specific higher-order graph scaffolding to learn latent representations by mapping sensory inputs to unique contexts. Learning to compress sequential and episodic experiences using CSCGs results in the emergence of cognitive maps: mental representations of spatial and conceptual relationships in an environment that are suited for planning, introspection, consolidation, and abstraction. We demonstrate that over a dozen different hippocampal phenomena, ranging from those reported in classic experiments to the most recent ones, are succinctly and mechanistically explained by our model.Fri, 18 Nov 2022 00:00:00 GMTCommunicative Capital: A key resource for human-machine shared agency and collaborative capacityhttps://www.deepmind.com/publications/communicative-capital-a-key-resource-for-human-machine-shared-agency-and-collaborative-capacityhttps://www.deepmind.com/publications/communicative-capital-a-key-resource-for-human-machine-shared-agency-and-collaborative-capacityWe present a perspective on the role machine intelligence can play in enhancing human abilities. We argue that viewing learning machines such as prosthetic devices as systems that share agency with us allows us to improve their collaborative capability. Moreover, increasing shared agency will continue to enable more complex interactions, as sensorimotor technology evolves. To facilitate an agent-based view of such devices, we propose a framework for interpreting the capacity of a human-machine collaboration as a function of both the human's and machine's degrees of agency. We introduce communicative capital as a measure of the communication resources developed by a human and a machine sharing agency in ongoing interactions. We examine the benefits and challenges of increasing the agency of prostheses by surveying literature that builds communicative resources to enable more complex task-directed interactions.
The novel agent-based viewpoint developed in this article significantly extends current thinking on how best to support the functional use of increasingly complex prosthetic enhancements, and ushers in more powerful interactions between humans and assistive or augmentative technologies.Mon, 14 Nov 2022 00:00:00 GMTControlling Commercial Cooling Systems Using Reinforcement Learninghttps://www.deepmind.com/publications/controlling-commercial-cooling-systems-using-reinforcement-learninghttps://www.deepmind.com/publications/controlling-commercial-cooling-systems-using-reinforcement-learningThis paper is a technical overview of DeepMind and Google's recent work on reinforcement learning for controlling commercial cooling systems. Building on expertise that began with cooling Google's data centers more efficiently, we recently conducted live experiments on two real-world facilities in partnership with Trane Technologies, a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites.Fri, 11 Nov 2022 00:00:00 GMTActive Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Taskhttps://www.deepmind.com/publications/active-acquisition-for-multimodal-temporal-data-a-challenging-decision-making-taskhttps://www.deepmind.com/publications/active-acquisition-for-multimodal-temporal-data-a-challenging-decision-making-taskWe introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. 
With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. Further, we propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness.Thu, 10 Nov 2022 00:00:00 GMTWhat is the simplest model that can account for high-fidelity imitation?https://www.deepmind.com/publications/what-is-the-simplest-model-that-can-account-for-high-fidelity-imitationhttps://www.deepmind.com/publications/what-is-the-simplest-model-that-can-account-for-high-fidelity-imitationWhat inductive biases must be incorporated into multi-agent artificial intelligence models to get them to capture high-fidelity imitation? We think very little is needed. In the right environments, both instrumental- and ritual-stance imitation can emerge from generic learning mechanisms operating on non-deliberative decision architectures.
In this view, imitation emerges from trial-and-error learning and does not require explicit deliberation.Thu, 10 Nov 2022 00:00:00 GMTA Generalist Agenthttps://www.deepmind.com/publications/a-generalist-agenthttps://www.deepmind.com/publications/a-generalist-agentInspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.Thu, 10 Nov 2022 00:00:00 GMTOver-communicate no more: Situated RL agents learn concise communication protocolshttps://www.deepmind.com/publications/over-communicate-no-more-situated-rl-agents-learn-concise-communication-protocolshttps://www.deepmind.com/publications/over-communicate-no-more-situated-rl-agents-learn-concise-communication-protocolsMost research on communication emergence using reinforcement learning (RL) explores unsituated communication in one-step referential tasks. The tasks are not temporally interactive and lack time pressures typically present in natural communication. In these settings, agents may successfully learn to communicate, but they do not learn to exchange information concisely—they tend towards over-communication and an anti-efficient encoding. Here, we explore situated communication in a multi-step task, where the acting agent has to forgo an environmental action to communicate. Thus, we impose an opportunity cost on communication and mimic the real-world pressure of passing time.
We compare communication emergence under this pressure against learning to communicate with a cost on articulation effort, implemented as a per-message penalty (fixed and progressively increasing). We find that while both pressures can disincentivise over-communication, situated communication does it more effectively and, unlike the cost on effort, does not negatively impact emergence. Implementing an opportunity cost on communication in a temporally extended environment is a step towards embodiment, and might be a pre-condition for incentivising efficient, human-like communication.Wed, 02 Nov 2022 00:00:00 GMTLearning to Configure Computer Networks with Neural Algorithmic Reasoninghttps://www.deepmind.com/publications/learning-to-configure-computer-networks-with-neural-algorithmic-reasoninghttps://www.deepmind.com/publications/learning-to-configure-computer-networks-with-neural-algorithmic-reasoningWe present a new method for scaling automatic configuration of computer networks. The key idea is to relax the computationally hard search problem of finding a configuration that satisfies a given specification into an approximate objective amenable to learning-based techniques. Based on this idea, we train a neural algorithmic model which learns to generate configurations likely to (fully or partially) satisfy a given specification under existing routing protocols. By relaxing the rigid satisfaction guarantees, our approach (i) enables greater flexibility: it is protocol-agnostic, enables cross-protocol reasoning, and does not depend on hardcoded rules; and (ii) finds configurations for much larger computer networks than previously possible. 
Our learned synthesizer is up to 490x faster than state-of-the-art SMT-based methods, while producing configurations which on average satisfy more than 93% of the provided requirements.Mon, 31 Oct 2022 00:00:00 GMTDiagnosing failures of fairness transfer across distribution shift in real-world medical settingshttps://www.deepmind.com/publications/diagnosing-failures-of-fairness-transfer-across-distribution-shift-in-real-world-medical-settingshttps://www.deepmind.com/publications/diagnosing-failures-of-fairness-transfer-across-distribution-shift-in-real-world-medical-settingsDiagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.Mon, 31 Oct 2022 00:00:00 GMTCan language models handle recursively nested grammatical structures? A case study on comparing models and humanshttps://www.deepmind.com/publications/can-language-models-handle-recursively-nested-grammatical-structures-a-case-study-on-comparing-models-and-humanshttps://www.deepmind.com/publications/can-language-models-handle-recursively-nested-grammatical-structures-a-case-study-on-comparing-models-and-humansHow should we compare the capabilities of language models and humans? 
Here, I consider a case study: processing of recursively nested grammatical structures. Prior work has suggested that language models cannot handle these structures as reliably as humans can. I revisit this question while attempting to more closely match the experimental paradigms used to evaluate the humans and models. In prior work, the humans were provided with instructions and training before being evaluated, while the language models were evaluated zero-shot. I therefore explore the performance of language models provided with few-shot prompts. A simple prompt, which contains substantially less content than the human training, allows large language models to consistently outperform the human results. The same prompt even allows extrapolation to more-deeply-nested conditions than have been tested in humans. Further, a re-analysis of the prior human experiments suggests that the humans may not perform above chance at the difficult structures initially. These results suggest that large language models can in fact process recursively nested grammatical structures at least as well as humans. This case study highlights how discrepancies in the level of experiment-specific training provided to humans and large language models can confound comparisons. I use this case study to reflect on the broader challenge of comparing human and model capabilities, and to highlight an important distinction.Thu, 27 Oct 2022 00:00:00 GMTCategorical SDEs with Simplex Diffusionhttps://www.deepmind.com/publications/categorical-sdes-with-simplex-diffusionhttps://www.deepmind.com/publications/categorical-sdes-with-simplex-diffusionDiffusion models typically operate in the standard framework of generative modelling by producing continuously-valued datapoints. To this end, they rely on a progressive Gaussian smoothing of the original data distribution, which admits an SDE interpretation involving increments of a standard Brownian motion.
However, some applications such as text generation or reinforcement learning might naturally be better served by diffusing categorical-valued data, i.e., lifting the diffusion to a space of probability distributions. To this end, this short theoretical note proposes simplex diffusion, a means to directly diffuse datapoints located on an n-dimensional probability simplex. We show how this relates to the Dirichlet distribution on the simplex and how the analogous SDE is realized thanks to a multi-dimensional Cox–Ingersoll–Ross process (abbreviated as CIR), previously used in economics and mathematical finance. Finally, we make remarks as to the numerical implementation of trajectories of the CIR process, and discuss some limitations of our approach.Wed, 26 Oct 2022 00:00:00 GMTLatent Space Smoothing for Individually Fair Representationshttps://www.deepmind.com/publications/latent-space-smoothing-for-individually-fair-representationshttps://www.deepmind.com/publications/latent-space-smoothing-for-individually-fair-representationsFair representation learning transforms user data into a representation that ensures fairness and utility regardless of the downstream application. However, learning individually fair representations, i.e., guaranteeing that similar individuals are treated similarly, remains challenging in high-dimensional settings such as computer vision. In this work, we introduce LASSI, the first representation learning method for certifying individual fairness of high-dimensional data. Our key insight is to leverage recent advances in generative modeling to capture the set of similar individuals in the generative latent space. This enables us to learn individually fair representations that map similar individuals close together by using adversarial training to minimize the distance between their representations. 
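As a rough illustration of the Cox–Ingersoll–Ross dynamics mentioned in the simplex-diffusion note above, here is a minimal Euler–Maruyama sketch; the parameter names `kappa`, `theta`, `sigma` and the full-truncation clipping scheme are illustrative choices, not the note's actual implementation.

```python
import numpy as np

def simulate_cir(x0, kappa, theta, sigma, dt=1e-3, n_steps=1000, rng=None):
    """Euler-Maruyama simulation of independent CIR coordinates:
    dX_i = kappa * (theta_i - X_i) dt + sigma * sqrt(X_i) dW_i.
    Clipping X at zero inside the sqrt (full truncation) keeps the
    discretised path non-negative."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + kappa * (theta - x) * dt + sigma * np.sqrt(np.maximum(x, 0.0)) * dw
        x = np.maximum(x, 0.0)
    return x

# Normalising the coordinates projects the CIR state onto the probability
# simplex, which is the connection to Dirichlet-distributed points.
x = simulate_cir(x0=np.ones(3), kappa=2.0, theta=np.array([0.5, 0.3, 0.2]), sigma=0.4)
p = x / x.sum()
```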
Finally, we employ randomized smoothing to provably map similar individuals close together, in turn ensuring that local robustness verification of the downstream application results in end-to-end fairness certification. Our experimental evaluation on challenging real-world image data demonstrates that our method increases certified individual fairness by up to 90% without significantly affecting task utility.Mon, 24 Oct 2022 00:00:00 GMTTesting Independence of Exchangeable Random Variableshttps://www.deepmind.com/publications/testing-independence-of-exchangeable-random-variableshttps://www.deepmind.com/publications/testing-independence-of-exchangeable-random-variablesGiven well-shuffled data, can we determine whether the data items are statistically (in)dependent? Formally, we consider the problem of testing whether a set of exchangeable random variables are independent. We will show that this is possible and develop tests that can confidently reject the null hypothesis that data is independent and identically distributed and have high power for (some) exchangeable distributions. We will make no structural assumptions on the underlying sample space. One potential application is in Deep Learning, where data is often scraped from the whole internet, with duplications abounding. Duplications can render data non-iid and test-set evaluation prone to giving wrong answers.Sat, 22 Oct 2022 00:00:00 GMTCollaborating with language models for embodied reasoninghttps://www.deepmind.com/publications/collaborating-with-language-models-for-embodied-reasoninghttps://www.deepmind.com/publications/collaborating-with-language-models-for-embodied-reasoningReasoning in a complex and ambiguous embodied environment is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new unseen environments and new tasks.
On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot and investigate failure cases, and demonstrate how components of this system can be trained with reinforcement learning to improve performance.Thu, 20 Oct 2022 00:00:00 GMTWhy neural networks find simple solutions: the many regularizers of geometric complexityhttps://www.deepmind.com/publications/why-neural-networks-find-simple-solutions-the-many-regularizers-of-geometric-complexityhttps://www.deepmind.com/publications/why-neural-networks-find-simple-solutions-the-many-regularizers-of-geometric-complexityIn many contexts, simpler models are preferable to more complex models and the control of this model complexity is the goal for many methods in machine learning such as regularization, hyperparameter tuning and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suitable for deep neural networks. Here we develop the notion of geometric complexity, which is a measure of the variability of the model function, computed using a discrete Dirichlet energy.
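The discrete Dirichlet energy behind geometric complexity can be illustrated with a small finite-difference sketch: average the squared input-gradient norm of the model function over a sample of data points. The function name and sampling scheme below are hypothetical, not the paper's code.

```python
import numpy as np

def geometric_complexity(f, xs, eps=1e-4):
    """Discrete Dirichlet-energy estimate of a model's variability:
    the mean squared norm of the input gradient of f over a sample of
    data points, with gradients taken by central finite differences."""
    total = 0.0
    for x in xs:
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        total += float(g @ g)
    return total / len(xs)

# A flatter function has lower geometric complexity than a steeper one.
xs = [np.array([0.1, 0.2]), np.array([0.5, -0.3])]
flat = geometric_complexity(lambda x: 0.1 * x.sum(), xs)
steep = geometric_complexity(lambda x: 10.0 * x.sum(), xs)
```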
Using a combination of theoretical arguments and empirical results, we show that many common training heuristics such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization and the choice of parameter initialization all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.Tue, 18 Oct 2022 00:00:00 GMTSelf-supervised video pretraining yields strong image representationshttps://www.deepmind.com/publications/self-supervised-video-pretraining-yields-strong-image-representationshttps://www.deepmind.com/publications/self-supervised-video-pretraining-yields-strong-image-representationsVideos contain infinitely more information than still images. Yet pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information, and previous attempts at video pretraining have fallen short on image understanding benchmarks. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. We find that a suitably, but minimally curated video dataset coupled with a contrastive objective that encourages learning combinations of spatial and temporal invariances is sufficient to produce frame-based models that perform surprisingly well on a variety of downstream image-based scene understanding tasks. Additionally, we find video pretraining to scale considerably better with model capacity than image pretraining, closing the gap on semantic segmentation on PASCAL and ADE20k, and object detection on COCO. 
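The contrastive objective referred to in the video-pretraining abstract above can be sketched generically as an InfoNCE-style loss over paired frame embeddings. This is the standard formulation, not necessarily the paper's exact objective.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE loss between two batches of embeddings, where
    z1[i] and z2[i] are views of the same clip (e.g. two frames) and
    all other pairs serve as negatives. Embeddings are L2-normalised."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # positives on the diagonal

# Aligned pairs should incur a much smaller loss than mismatched pairs.
z = np.eye(4)
aligned = info_nce(z, z)
shuffled = info_nce(z, np.roll(z, 1, axis=0))
```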
Together, these results present video pretraining as a general solution for learning visual representations.Fri, 14 Oct 2022 00:00:00 GMTBeyond Bayes-optimality: meta-learning what you know you don’t knowhttps://www.deepmind.com/publications/beyond-bayes-optimality-meta-learning-what-you-know-you-dont-knowhttps://www.deepmind.com/publications/beyond-bayes-optimality-meta-learning-what-you-know-you-dont-knowMeta-training agents with memory has been shown to culminate in Bayes-optimal agents, which casts Bayes-optimality as the solution to an optimization problem rather than an a priori modeling assumption. Bayes-optimal agents are risk-neutral, since they solely attune to the expected return, and ambiguity-neutral, since they act in new situations as if uncertainty were known. This is in contrast to risk-sensitive agents that in addition exploit the higher-order moments of the return, and ambiguity-sensitive agents that act differently recognizing when they don't know. How can we extend the meta-learning protocol to generate risk and ambiguity-sensitive agents? The goal of this work is to fill this gap in the literature by showing that risk- and ambiguity-sensitivity also emerge as the result of an optimization problem---instead of being an a priori modeling assumption---using modified meta-training mechanisms. 
We empirically test our proposed meta-training mechanisms on agents exposed to foundational classes of decision-making experiments and demonstrate that they become sensitive to risk and ambiguity.Wed, 12 Oct 2022 00:00:00 GMTTransformers generalize differently from in-context and in-weights informationhttps://www.deepmind.com/publications/transformers-generalize-differently-from-in-context-and-in-weights-informationhttps://www.deepmind.com/publications/transformers-generalize-differently-from-in-context-and-in-weights-informationTransformer models have a powerful dual ability to utilize two kinds of information: information stored in weights during training, and information provided only via the inputs presented at inference time (known as "in-context learning"). However, it is unknown whether generalization from in-weights vs. in-context information exhibits similar inductive biases. In this work, we show that transformers exhibit different inductive biases in these two modes. When transformers are meta-trained for few-shot learning from context, they are biased towards exemplar-based generalization from in-context information. In contrast, transformers are biased towards sparse rule-based extrapolation when generalizing from in-weights information. However, large pretrained transformer language models exhibit partially rule-based generalization even from novel in-context information. Finally, we show that in-context learning can be pushed towards rule-based generalization by changing the training data, providing a potential explanation for language model behavior.
In-context learning is now ubiquitously used for efficiently imparting task specifications to large pre-trained models; understanding their generalization behaviors (and how to shape them through the training data) is of significant practical consequence.Tue, 11 Oct 2022 00:00:00 GMTPalm up: Playing in the Latent Manifold for Unsupervised Pretraininghttps://www.deepmind.com/publications/palm-up-playing-in-the-latent-manifold-for-unsupervised-pretraininghttps://www.deepmind.com/publications/palm-up-playing-in-the-latent-manifold-for-unsupervised-pretrainingLarge and diverse datasets have been the cornerstones of many impressive advancements in artificial intelligence. Intelligent creatures, however, learn by interacting with the environment, which changes the input sensory signals and the state of the environment. In this work, we aim to bring the best of both worlds and propose an algorithm that exhibits exploratory behavior whilst utilizing large, diverse datasets. Our key idea is to leverage deep generative models that are pretrained on static datasets and introduce a dynamic model in the latent space. The transition dynamics simply mixes an action with a randomly sampled latent. It then applies an exponential moving average for temporal persistency, and the resulting latent is decoded to an image using the pretrained generator. We then employ an unsupervised reinforcement learning algorithm to explore in this environment and perform unsupervised representation learning on the collected data. We further leverage the temporal information of this data to pair data points as a natural supervision for representation learning.
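The latent transition dynamics described above (mixing the action with a sampled latent, then smoothing with an exponential moving average) can be sketched in a few lines; the mixing coefficients `alpha` and `beta` are hypothetical, not values from the paper.

```python
import numpy as np

def latent_step(z, action, alpha=0.9, beta=0.5, rng=None):
    """One transition in the latent environment: mix the agent's action
    with a freshly sampled random latent, then apply an exponential
    moving average so the latent trajectory is temporally persistent.
    The resulting latent would be decoded to an image by a pretrained
    generator (not shown here)."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(z.shape)
    proposal = beta * action + (1.0 - beta) * noise  # action/noise mixture
    return alpha * z + (1.0 - alpha) * proposal      # EMA for persistency

z = np.zeros(4)
action = np.ones(4)
z_next = latent_step(z, action)
```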
Our experiments suggest that the learned representations can be successfully transferred to downstream tasks in both vision and reinforcement learning domains.Mon, 10 Oct 2022 00:00:00 GMTGame Theoretic Rating in n-player general-sum games with Equilibriahttps://www.deepmind.com/publications/game-theoretic-rating-in-n-player-general-sum-games-with-equilibriahttps://www.deepmind.com/publications/game-theoretic-rating-in-n-player-general-sum-games-with-equilibriaRating strategies in a game is an important area of research in game theory and artificial intelligence, and can be applied to any real-world competitive or cooperative setting. Traditionally, only transitive dependencies between strategies have been used to rate strategies; however, recent work has expanded ratings to utilize game theoretic solutions to better rate strategies in non-transitive games. This work generalizes these ideas and proposes novel algorithms suitable for n-player, general-sum rating of strategies in normal-form games. This enables well-established solution concepts, such as equilibria, to be leveraged to efficiently rate strategies in games with complex strategic interactions, which arise in multi-agent training and real-world interactions between many agents.
Based on match-up data from the 2018/2019 Premier League, we identify real-life non-transitivities and demonstrate how a club's ratings are affected by their success against members in the cycle, and the extent to which teams capitalize on home advantage.Wed, 05 Oct 2022 00:00:00 GMT Optimistic posterior sampling for reinforcement learning with few samples and tight guaranteeshttps://www.deepmind.com/publications/optimistic-posterior-sampling-for-reinforcement-learning-with-few-samples-and-tight-guaranteeshttps://www.deepmind.com/publications/optimistic-posterior-sampling-for-reinforcement-learning-with-few-samples-and-tight-guaranteesWe consider reinforcement learning in an environment modeled by an episodic, tabular, step-dependent Markov decision process of horizon $H$ with $S$ states and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in $H$, $S$, $A$, and $T$ per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most $O(\sqrt{H^3SAT})$ ignoring $\text{poly}\log(HSAT)$ terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms of a Dirichlet random vector which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges (1984) to Dirichlet distributions.
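The Dirichlet machinery above connects to posterior sampling in a simple way: a tabular agent maintains Dirichlet posteriors over transition probabilities and plans with sampled models. The following is a minimal sketch of that sampling step, not the OPSRL algorithm itself, whose optimistic sampling rule is more involved.

```python
import numpy as np

def sample_transition_model(counts, prior=1.0, rng=None):
    """Posterior sampling for a tabular model: given visit counts
    counts[s, a, s'], draw one transition kernel from the Dirichlet
    posterior Dir(prior + counts). Optimistic variants such as OPSRL
    take several such samples per state-action pair and plan with the
    most favourable one."""
    rng = rng or np.random.default_rng(0)
    S, A, _ = counts.shape
    model = np.zeros_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            model[s, a] = rng.dirichlet(prior + counts[s, a])
    return model

counts = np.zeros((2, 2, 2))
counts[0, 0] = [10.0, 0.0]   # (s=0, a=0) observed landing in s'=0 ten times
model = sample_transition_model(counts)
```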
Our bound matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia (2017) for the episodic setting.Mon, 03 Oct 2022 00:00:00 GMTCo-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionalshttps://www.deepmind.com/publications/co-writing-screenplays-and-theatre-scripts-with-language-models-an-evaluation-by-industry-professionalshttps://www.deepmind.com/publications/co-writing-screenplays-and-theatre-scripts-with-language-models-an-evaluation-by-industry-professionalsLanguage models are increasingly attracting interest from writers. However, they have limited usefulness for long-form creative writing because they lack long-range semantic coherence. We address this limitation by applying language models hierarchically, in a system we call Dramatron. By building structural context via prompt chaining, Dramatron can generate coherent scripts and screenplays complete with a title, characters, story beats, location descriptions, and dialogue. We illustrate Dramatron’s usefulness as an interactive co-creative system with a user study of 15 theatre and film industry professionals, who co-wrote theatre scripts and screenplays with Dramatron and engaged in open-ended interviews. We report reflections from our interviewees, and from independent reviewers who watched recent stagings of these works, to illustrate how both Dramatron and hierarchical text generation are useful for human-machine co-creativity. Finally, we discuss the suitability of Dramatron for co-creativity, ethical considerations, including plagiarism and bias, and participation models for the design and deployment of such tools.Fri, 30 Sep 2022 00:00:00 GMTWhere Should I Spend My FLOPS?
Efficiency Evaluations of Visual Pre-training Methodshttps://www.deepmind.com/publications/where-should-i-spend-my-flops-efficiency-evaluations-of-visual-pre-training-methodshttps://www.deepmind.com/publications/where-should-i-spend-my-flops-efficiency-evaluations-of-visual-pre-training-methodsThe past few years have witnessed an explosion of strong self-supervised methods that have achieved remarkable success, reaching parity with supervised methods for pre-training. Much prior work has added large numbers of contrastive views, trained for very long schedules, and scaled to larger models in order to drive up absolute accuracy on downstream tasks. In this work, we are interested in examining a related, but slightly orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods to achieve high downstream performance on representative visual tasks? This setting is often more realistic for both academic and industry labs. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and Supervised), and characterize their FLOP and CO2 footprints, relative to their absolute performance on a common image segmentation task.
From this, we advocate that closer attention be paid to (1) dataset quality and curation and (2) accuracy gains in the context of FLOP usage, and question the commonly held hypothesis of the inherent scalability of current self-supervised methods.Fri, 30 Sep 2022 00:00:00 GMTCOptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimationhttps://www.deepmind.com/publications/constrained-offline-rl-via-stationary-distribution-correction-estimationhttps://www.deepmind.com/publications/constrained-offline-rl-via-stationary-distribution-correction-estimationWe consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
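As a toy illustration of how stationary-distribution corrections enable offline estimates: given ratio weights between the target policy's state-action distribution and the dataset's, weighted averages of reward and cost estimate the policy's return and constraint cost from data alone. The function and weights below are hypothetical; COptiDICE estimates such corrections by optimization rather than taking them as given.

```python
import numpy as np

def corrected_estimates(rewards, costs, w):
    """Off-policy estimates via stationary-distribution correction:
    w[i] approximates d_pi(s_i, a_i) / d_D(s_i, a_i), the ratio between
    the target policy's stationary distribution and the dataset's.
    Weighted averages then estimate the policy's expected reward and
    cost from the fixed dataset alone."""
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.mean()  # corrections average to 1 under the data distribution
    return float(np.mean(w * rewards)), float(np.mean(w * costs))

rewards = np.array([1.0, 2.0, 3.0])
costs = np.array([0.0, 1.0, 0.0])
r_hat, c_hat = corrected_estimates(rewards, costs, [1.0, 1.0, 1.0])
```

With uniform weights the estimates reduce to plain dataset averages; non-uniform weights re-weight the data toward what the target policy would visit.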
Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.Thu, 29 Sep 2022 00:00:00 GMTLearned Force Fields Are Ready For Ground State Catalyst Discoveryhttps://www.deepmind.com/publications/learned-force-fields-are-ready-for-ground-state-catalyst-discoveryhttps://www.deepmind.com/publications/learned-force-fields-are-ready-for-ground-state-catalyst-discoveryWe present evidence that learned density functional theory (``DFT'') force fields are ready for ground state catalyst discovery. Our key finding is that relaxation using forces from a learned potential yields structures with similar or lower energy to those relaxed using the RPBE functional in over 50\% of evaluated systems, despite the fact that the predicted forces differ significantly from the ground truth. This has the surprising implication that learned potentials may be ready for replacing DFT in challenging catalytic systems such as those found in the Open Catalyst 2020 dataset. Furthermore, we show that a force field trained on a locally harmonic energy surface with the same minima as a target DFT energy is also able to find lower or similar energy structures in over 50\% of cases. This ``Easy Potential'' converges in fewer steps than a standard model trained on true energies and forces, which further accelerates calculations. Its success illustrates a key point: learned potentials can locate energy minima even when the model has high force errors. The main requirement for structure optimisation is simply that the learned potential has the correct minima.
Since learned potentials are fast and scale linearly with system size, our results open the possibility of quickly finding ground states for large systems.Mon, 26 Sep 2022 00:00:00 GMTStructure of the PAPP-A-IGFBP5 complex reveals mechanism of substrate recognitionhttps://www.deepmind.com/publications/structure-of-the-papp-a-igfbp5-complex-reveals-mechanism-of-substrate-recognitionhttps://www.deepmind.com/publications/structure-of-the-papp-a-igfbp5-complex-reveals-mechanism-of-substrate-recognitionInsulin-like growth factor (IGF) signaling is highly conserved and tightly regulated by proteases including Pregnancy-Associated Plasma Protein A (PAPP-A). PAPP-A and its paralog PAPP-A2 are metalloproteases that mediate IGF bioavailability through cleavage of IGF binding proteins (IGFBPs). Here, we present single-particle cryo-EM structures of the catalytically inactive mutant PAPP-A (E483A) in complex with a peptide from its substrate IGFBP5 (PAPP-A–BP5) and also in its substrate-free form, by leveraging the power of AlphaFold to generate a high-quality predicted model as a starting template. We show that PAPP-A is a flexible trans-dimer that binds IGFBP5 via a 25-amino acid anchor peptide which extends into the metalloprotease active site. This unique IGFBP5 anchor peptide that mediates the specific PAPP-A–IGFBP5 interaction is not found in other PAPP-A substrates. Additionally, we illustrate the critical role of the PAPP-A central domain as it mediates both IGFBP5 recognition and trans-dimerization. We further demonstrate that PAPP-A trans-dimer formation and distal inter-domain interactions are both required for efficient proteolysis of IGFBP4, but dispensable for IGFBP5 cleavage.
Together the structural and biochemical studies reveal the mechanism of PAPP-A substrate binding and selectivity.Tue, 20 Sep 2022 00:00:00 GMTDiscovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimalityhttps://www.deepmind.com/publications/discovering-policies-with-domino-diversity-optimization-maintaining-near-optimalityhttps://www.deepmind.com/publications/discovering-policies-with-domino-diversity-optimization-maintaining-near-optimalityFinding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviours in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, and demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters. Finally, we demonstrate that the discovered set is robust to perturbations of the environment.Tue, 20 Sep 2022 00:00:00 GMTOn Reward Binarisation and Bayesian Agentshttps://www.deepmind.com/publications/on-reward-binarisation-and-bayesian-agentshttps://www.deepmind.com/publications/on-reward-binarisation-and-bayesian-agentsReward binarisation is a commonly applied heuristic technique which can potentially simplify a given reinforcement learning problem.
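A minimal example of one natural form of reward binarisation is thresholding at zero; the threshold here is an illustrative choice, not necessarily one of the forms the paper studies.

```python
def binarise_reward(r, threshold=0.0):
    """A simple form of reward binarisation: keep only the sign
    information relative to a threshold. This can simplify a problem
    but may discard information the agent needs."""
    return 1 if r > threshold else 0

trajectory = [0.3, -1.2, 2.5, 0.0]
binary = [binarise_reward(r) for r in trajectory]  # [1, 0, 1, 0]
```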
However, applied without care, this procedure can modify the original problem or throw away essential information. In this paper we study a number of natural forms of reward binarisation, and characterise their effects in terms of problem expressivity. We show positive results for MDPs, POMDPs, and k-order MDPs and a negative result for general history-based reinforcement learning agents. Furthermore, we show that binary Bayesian reinforcement learning agents enjoy convergence properties similar to their non-binarised counterparts.Tue, 20 Sep 2022 00:00:00 GMTOptimizing Industrial Cooling Systems with Hierarchical Reinforcement Learninghttps://www.deepmind.com/publications/optimizing-industrial-cooling-systems-with-hierarchical-reinforcement-learninghttps://www.deepmind.com/publications/optimizing-industrial-cooling-systems-with-hierarchical-reinforcement-learningReinforcement learning (RL) techniques have been deployed in optimizing industrial cooling systems, offering substantial energy reductions compared to traditional heuristic policies. A major challenge in controlling these systems involves learning behaviors that are feasible in the real world due to machinery constraints. For example, certain actions can only be executed every few hours while other actions can be taken more frequently. Without extensive reward engineering and experimentation, an RL agent may not learn realistic operation of machinery. To address this, we use hierarchical reinforcement learning with multiple agents that control subsets of actions according to their operation time scales.
Our hierarchical approach achieves energy savings over existing baselines while maintaining constraints such as operating chillers within safe bounds in a simulated HVAC control environment.Mon, 19 Sep 2022 00:00:00 GMTLeveraging Natural Language and Program Abstractions to Instill Human Inductive Biases in Machineshttps://www.deepmind.com/publications/leveraging-natural-language-and-program-abstractions-to-instill-human-inductive-biases-in-machineshttps://www.deepmind.com/publications/leveraging-natural-language-and-program-abstractions-to-instill-human-inductive-biases-in-machinesStrong inductive biases are a key component of human intelligence, allowing people to quickly learn a variety of tasks. Although meta-learning has emerged as an approach for endowing neural networks with useful inductive biases, agents trained via meta-learning may learn very different strategies from humans. We show that co-training these agents on predicting representations from natural language task descriptions and from programs induced to generate such tasks guides agents toward human-like inductive biases. Human-generated language descriptions and program induction with library learning both result in more human-like inductive biases in downstream meta-reinforcement learning agents than less abstract controls (synthetic language descriptions, program induction without library learning), suggesting that the abstraction supported by these representations is key. 
This work shows that natural language and programs can be used as repositories of human-like inductive bias, and demonstrates a general and flexible approach to inducing these biases in artificial agents.Wed, 14 Sep 2022 00:00:00 GMTDeveloping, evaluating and scaling learning agents in multi-agent environmentshttps://www.deepmind.com/publications/developing-evaluating-and-scaling-learning-agents-in-multi-agent-environmentshttps://www.deepmind.com/publications/developing-evaluating-and-scaling-learning-agents-in-multi-agent-environmentsThe Game Theory & Multi-Agent team at DeepMind studies several aspects of multi-agent learning ranging from computing approximations to fundamental concepts in game theory to simulating social dilemmas in rich spatial environments and training 3-d humanoids in difficult team coordination tasks. A signature aim of our group is to use the resources and expertise made available to us at DeepMind in deep reinforcement learning to explore multi-agent systems in complex environments and use these benchmarks to advance our understanding. Here, we summarise the recent work of our team and present a taxonomy that we feel highlights many important open challenges in multi-agent research.Tue, 06 Sep 2022 00:00:00 GMTFrom Motor Control to Team Play in Simulated Humanoid Footballhttps://www.deepmind.com/publications/from-motor-control-to-team-play-in-simulated-humanoid-footballhttps://www.deepmind.com/publications/from-motor-control-to-team-play-in-simulated-humanoid-footballLearning to combine control at the level of joint torques with longer term goal-directed behavior is a long-standing challenge for physically embodied artificial agents. 
Intelligent behavior in the physical world unfolds across multiple spatial and temporal scales: although movements are ultimately executed at the level of instantaneous muscle tensions or joint torques, they must be selected to serve goals which are defined on much longer timescales and often involve complex interactions with the environment and other agents. Recent research has demonstrated the potential of learning-based approaches applied to the respective problems of complex movement, long-term planning, and multi-agent coordination. However, their integration remains challenging and traditionally required the design and optimization of independent sub-systems. In this work, we tackle the integration of motor control and long-horizon decision making, in the context of physically simulated humanoid football which requires agile, human-like motor control and multi-agent coordination. We optimize teams of agents to play simulated football via reinforcement learning, constraining the solution space to that of plausible human-like movements using motion capture data. They are trained to maximize several environment rewards, and imitate pre-trained football-specific skills if doing so leads to improved performance. 
The result is a team of coordinated humanoid football players that exhibit co… Wed, 31 Aug 2022 00:00:00 GMTOf Correlated Equilibria in Mean-Field Gameshttps://www.deepmind.com/publications/of-correlated-equilibria-in-mean-field-gameshttps://www.deepmind.com/publications/of-correlated-equilibria-in-mean-field-gamesWe build a new notion of correlated equilibria in mean-field games, motivate its use, and provide examples and properties.Mon, 22 Aug 2022 00:00:00 GMTProbing Transfer in Deep Reinforcement Learning without Task Engineeringhttps://www.deepmind.com/publications/probing-transfer-in-deep-reinforcement-learning-without-task-engineeringhttps://www.deepmind.com/publications/probing-transfer-in-deep-reinforcement-learning-without-task-engineeringWe evaluate the use of original game curricula supported by the Atari 2600 console as a heterogeneous transfer benchmark for deep reinforcement learning agents. Game designers created curricula using combinations of several discrete modifications to the basic versions of games such as Space Invaders, Breakout and Freeway, making them progressively more challenging for human players. By formally organising these modifications into several factors of variation, we are able to show that Analyses of Variance (ANOVA) are a potent tool for studying the effects of human-relevant domain changes on the learning and transfer performance of a deep reinforcement learning agent. Since no manual task engineering is needed on our part, leveraging the original multi-factorial design avoids the pitfalls of unintentionally biasing the experimental setup. We find that game design factors have a large and statistically significant impact on an agent's ability to learn, and so do their combinatorial interactions.
Furthermore, we show that zero-shot transfer from the basic games to their respective variations is possible, but the variance in performance is also largely explained by interactions between factors. As such, we argue that Atari game curricula offer a challenging benchmark for transfer learning in RL, one that can help the community better understand the generalisation capabilities of RL agents along dimensions which meaningfully impact human generalisation performance. Mon, 22 Aug 2022 00:00:00 GMTReinforcement Learning with Information Theoretic Actuationhttps://www.deepmind.com/publications/reinforcement-learning-with-information-theoretic-actuationhttps://www.deepmind.com/publications/reinforcement-learning-with-information-theoretic-actuationReinforcement Learning formalises an embodied agent's interaction with the environment through observations, rewards and actions. But where do the actions come from? Actions are often considered to represent something external, such as the movement of a limb, a chess piece, or, more generally, the output of an actuator. In this work we explore and formalise a contrasting view, namely that actions are best thought of as the output of a sequence of internal choices with respect to an action model. This view is particularly well suited for leveraging the recent advances in large sequence models as prior knowledge for multi-task reinforcement learning problems.
Our main contribution in this work is to show how to augment the standard MDP formalism with a sequential notion of internal action using information-theoretic techniques, and that this leads to self-consistent definitions of both internal and external value.Fri, 19 Aug 2022 00:00:00 GMTUsing the Veil of Ignorance to align AI systems with principles of justicehttps://www.deepmind.com/publications/using-the-veil-of-ignorance-to-align-ai-systems-with-principles-of-justicehttps://www.deepmind.com/publications/using-the-veil-of-ignorance-to-align-ai-systems-with-principles-of-justiceThe philosopher John Rawls proposed the Veil of Ignorance (VoI) as a thought experiment to identify fair principles for governing a society. Here, we apply the VoI to an important governance domain: artificial intelligence (AI). In five incentive-compatible studies (N = 2,508), including two preregistered protocols, participants choose principles to govern an AI assistant from behind the veil: that is, without knowledge of their own relative position in the group. Compared to participants who have this information, we find a consistent preference for a principle that instructs the AI assistant to prioritize the worst-off. Neither risk attitudes nor political preferences adequately explain these choices. Instead, they appear to be driven by elevated concerns about fairness: Without prompting, participants who reason behind the VoI more frequently explain their choice in terms of fairness, compared to those in the Control condition. Moreover, we find initial support for the ability of the VoI to elicit more robust preferences: In the studies presented here, the VoI increases the likelihood of participants continuing to endorse their initial choice in a subsequent round where they know how they will be affected by the AI intervention and have a self-interested motivation to change their mind. These results emerge in both a descriptive and an immersive game.
Our findings suggest that the VoI may be a suitable mechanism for selec… Tue, 16 Aug 2022 00:00:00 GMTMeta-Learning Sparse Compression Networkshttps://www.deepmind.com/publications/meta-learning-sparse-compression-networkshttps://www.deepmind.com/publications/meta-learning-sparse-compression-networksRecent work in Deep Learning has re-imagined the representation of data as functions mapping from a coordinate space to an underlying continuous signal. When such functions are approximated by neural networks, this introduces a compelling alternative to the more common multi-dimensional array representation. Recent work on such Implicit Neural Representations (INRs) has shown that, following careful architecture search, INRs can outperform established compression methods such as JPEG (e.g. Dupont et al., 2021). In this paper, we propose crucial steps towards making such ideas scalable: firstly, we employ state-of-the-art network sparsification techniques to drastically improve compression; secondly, we introduce the first technique allowing sparsification to be employed in the inner loop of commonly used meta-learning algorithms, thus drastically improving compression and reducing the computational cost of learning INRs. The generality of this formalism allows us to present results on diverse data modalities, such as scenes, manifolds, images, signed distance functions and 3D scenes, several of which constitute new state-of-the-art results.Mon, 08 Aug 2022 00:00:00 GMTHow fair is your graph? Addressing fairness concerns in neuroimaging studieshttps://www.deepmind.com/publications/how-fair-is-your-graph-addressing-fairness-concerns-in-neuroimaging-studieshttps://www.deepmind.com/publications/how-fair-is-your-graph-addressing-fairness-concerns-in-neuroimaging-studiesRecent work on neuroimaging has demonstrated significant benefits of using population graphs to capture non-imaging information in the prediction of neurodegenerative and neurodevelopmental disorders.
These non-imaging attributes may contain demographic information about the individuals, e.g., age or sex, but also the acquisition site, as imaging protocols and hardware might significantly differ across sites in large-scale studies. The effect of the latter is particularly prevalent in functional connectomics studies, where it remains unclear how to sufficiently homogenise fMRI signals across the different sites. Recent studies have highlighted the need to investigate potential biases in the classifiers devised using large-scale datasets, which might be imbalanced in terms of one or more sensitive attributes. This can be exacerbated when employing these attributes in a population graph to explicitly introduce inductive biases to the machine learning model, and can lead to disparate predictive performance across sub-populations. This study scrutinises such a system and aims to uncover potential biases of a semi-supervised classifier that relies on a population graph. We further explore the effect of the graph structure and stratification strategies, as well as methods to mitigate such biases and produce fairer predictions across the population.
Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while strongly surpassing supervised pre-training for video segmentation on DAVIS.Thu, 28 Jul 2022 00:00:00 GMTLearning composable world models for physical predictionhttps://www.deepmind.com/publications/learning-composable-world-models-for-physical-predictionhttps://www.deepmind.com/publications/learning-composable-world-models-for-physical-predictionPeople readily infer the hidden structures and parameters of the physical world around them. While previous work has shown that people can infer various hidden properties of physical objects from their motion, here we ask whether people can infer multiple physical variables simultaneously and compose them to generalize in a novel scenario. In a ball-catching task, participants must simultaneously learn the masses of different balls, as well as the existence of a constant wind force that appears when a context cue is present. People can learn these variables simultaneously over training, but can also compose novel combinations of ball masses and wind conditions at test.
A variety of heuristic models fail to capture the same pattern of results seen in people, suggesting that people are using compositional model-based generalization to solve the task.Wed, 27 Jul 2022 00:00:00 GMTSemi-analytical Industrial Cooling System Model for Reinforcement Learninghttps://www.deepmind.com/publications/semi-analytical-industrial-cooling-system-model-for-reinforcement-learninghttps://www.deepmind.com/publications/semi-analytical-industrial-cooling-system-model-for-reinforcement-learningWe present a hybrid industrial cooling system model that embeds analytical solutions within a multi-physics simulation. This model is designed for reinforcement learning (RL) applications and balances simplicity with simulation fidelity and interpretability. The model’s fidelity is evaluated against real-world data from a large-scale cooling system. This is followed by a case study illustrating how the model can be used for RL research. For this, we develop an industrial task suite that allows specifying different problem settings and levels of complexity, and use it to evaluate the performance of different RL algorithms.Tue, 26 Jul 2022 00:00:00 GMTStochastic Parallelizable Eigengap Dilation for Large Graph Clusteringhttps://www.deepmind.com/publications/stochastic-parallelizable-eigengap-dilation-for-large-graph-clusteringhttps://www.deepmind.com/publications/stochastic-parallelizable-eigengap-dilation-for-large-graph-clusteringLarge graphs commonly appear in social networks, knowledge graphs, recommender systems, life sciences, and decision-making problems. Summarizing large graphs by their high-level properties is helpful in solving problems in these settings. In spectral clustering, we aim to identify clusters of nodes where most edges fall within clusters and only a few edges fall between clusters. This task is important for many downstream applications and exploratory analysis.
A core step of spectral clustering is performing an eigendecomposition of the corresponding graph Laplacian matrix (or equivalently, a singular value decomposition, SVD, of the incidence matrix). The convergence of iterative singular value decomposition approaches depends on the eigengaps of the spectrum of the given matrix, i.e., the difference between consecutive eigenvalues. For a graph Laplacian corresponding to a well-clustered graph, the eigenvalues will be non-negative but very small (much less than $1$), slowing convergence. This paper introduces a parallelizable approach to dilating the spectrum in order to accelerate SVD solvers and, in turn, spectral clustering. This is accomplished via polynomial approximations to matrix operations that favorably transform the spectrum of a matrix without changing its eigenvectors. Experiments demonstrate that this approach significantly accelerates convergence, and we explain how this transformation can be parallelized and stochastically approximated to scal… Fri, 22 Jul 2022 00:00:00 GMT
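The mechanism behind such eigengap dilation, remapping the Laplacian's spectrum with a matrix polynomial while leaving its eigenvectors untouched, can be illustrated with a small NumPy sketch. The toy graph, the polynomial p(x) = (1 - x/2)^k, and the exponent k = 8 below are illustrative assumptions, not the paper's tuned polynomial construction:

```python
import numpy as np

# Toy well-clustered graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}; spectrum in [0, 2].
d_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L = np.eye(6) - d_inv_sqrt @ A @ d_inv_sqrt

# Any matrix polynomial p(L) keeps L's eigenvectors and only remaps its
# eigenvalues.  The illustrative choice p(x) = (1 - x/2)^k maps the small
# "cluster" eigenvalues near 0 to values near 1 and pushes the rest toward 0,
# widening the relative gap that governs power/subspace iteration.
k = 8
M = np.linalg.matrix_power(np.eye(6) - L / 2.0, k)

w, V = np.linalg.eigh(L)       # eigenvalues of L, ascending
mu = (1.0 - w / 2.0) ** k      # eigenvalues of M (descending), same eigenvectors
# Contraction factor per iteration when extracting the 2-dim cluster subspace:
rate_before = (1.0 - w[2] / 2.0) / (1.0 - w[1] / 2.0)  # single application (k = 1)
rate_after = mu[2] / mu[1]                             # equals rate_before ** k
print(f"per-iteration contraction: {rate_before:.3f} -> {rate_after:.2e}")
```

On this toy graph the contraction factor improves by roughly two orders of magnitude, so subspace iteration on p(L) locates the same cluster eigenvectors in far fewer iterations; the paper's contribution is doing this transform scalably via parallelizable, stochastically approximated matrix operations.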