{"title": "Gradient descent GAN optimization is locally stable", "book": "Advances in Neural Information Processing Systems", "page_first": 5585, "page_last": 5595, "abstract": "Despite the growing prominence of generative adversarial networks (GANs), optimization in GANs is still a poorly understood topic. In this paper, we analyze the ``gradient descent'' form of GAN optimization, i.e., the natural setting where we simultaneously take small gradient steps in both generator and discriminator parameters. We show that even though GAN optimization does \\emph{not} correspond to a convex-concave game (even for simple parameterizations), under proper conditions, equilibrium points of this optimization procedure are still \\emph{locally asymptotically stable} for the traditional GAN formulation. On the other hand, we show that the recently proposed Wasserstein GAN can have non-convergent limit cycles near equilibrium. Motivated by this stability analysis, we propose an additional regularization term for gradient descent GAN updates, which \\emph{is} able to guarantee local stability for both the WGAN and the traditional GAN, and also shows practical promise in speeding up convergence and addressing mode collapse.", "full_text": "Gradient descent GAN optimization is locally stable\n\nVaishnavh Nagarajan\n\nComputer Science Department\nCarnegie-Mellon University\n\nPittsburgh, PA 15213\n\nvaishnavh@cs.cmu.edu\n\nJ. Zico Kolter\n\nComputer Science Department\nCarnegie-Mellon University\n\nPittsburgh, PA 15213\nzkolter@cs.cmu.edu\n\nAbstract\n\nDespite the growing prominence of generative adversarial networks (GANs), op-\ntimization in GANs is still a poorly understood topic. In this paper, we analyze\nthe \u201cgradient descent\u201d form of GAN optimization, i.e., the natural setting where\nwe simultaneously take small gradient steps in both generator and discriminator\nparameters. 
We show that even though GAN optimization does not correspond to a\nconvex-concave game (even for simple parameterizations), under proper conditions,\nequilibrium points of this optimization procedure are still locally asymptotically\nstable for the traditional GAN formulation. On the other hand, we show that the\nrecently proposed Wasserstein GAN can have non-convergent limit cycles near\nequilibrium. Motivated by this stability analysis, we propose an additional regular-\nization term for gradient descent GAN updates, which is able to guarantee local\nstability for both the WGAN and the traditional GAN, and also shows practical\npromise in speeding up convergence and addressing mode collapse.\n\n1\n\nIntroduction\n\nSince their introduction a few years ago, Generative Adversarial Networks (GANs) [Goodfellow et al.,\n2014] have gained prominence as one of the most widely used methods for training deep generative\nmodels. GANs have been successfully deployed for tasks such as photo super-resolution, object\ngeneration, video prediction, language modeling, vocal synthesis, and semi-supervised learning,\namongst many others [Ledig et al., 2017, Wu et al., 2016, Mathieu et al., 2016, Nguyen et al., 2017,\nDenton et al., 2015, Im et al., 2016].\nAt the core of the GAN methodology is the idea of jointly training two networks: a generator network,\nmeant to produce samples from some distribution (that ideally will mimic examples from the data\ndistribution), and a discriminator network, which attempts to differentiate between samples from\nthe data distribution and the ones produced by the generator. 
This problem is typically written as a\nmin-max optimization problem of the following form:\n\nmin_G max_D ( E_{x∼p_data}[log D(x)] + E_{z∼p_latent}[log(1 − D(G(z)))] ).    (1)\n\nFor the purposes of this paper, we will shortly consider a more general form of the optimization problem, which also includes the recent Wasserstein GAN (WGAN) [Arjovsky et al., 2017] formulation.\nDespite their prominence, the actual task of optimizing GANs remains a challenging problem, both\nfrom a theoretical and a practical standpoint. Although the original GAN paper included some\nanalysis on the convergence properties of the approach [Goodfellow et al., 2014], it assumed that\nupdates occurred in pure function space, allowed arbitrarily powerful generator and discriminator\nnetworks, and modeled the resulting optimization objective as a convex-concave game, therefore\nyielding well-defined global convergence properties. Furthermore, this analysis assumed that the\ndiscriminator network is fully optimized between generator updates, an assumption that does not\nmirror the practice of GAN optimization. Indeed, in practice, there exist a number of well-documented\nfailure modes for GANs such as mode collapse or vanishing gradient problems.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur contributions. In this paper, we consider the “gradient descent” formulation of GAN opti-\nmization, the setting where both the generator and the discriminator are updated simultaneously via\nsimple (stochastic) gradient updates; that is, there are no inner and outer optimization loops, and\nneither the generator nor the discriminator are assumed to be optimized to convergence. 
Despite the\nfact that, as we show, this does not correspond to a convex-concave optimization problem (even for\nsimple linear generator and discriminator representations), we show that:\n\nUnder suitable conditions on the representational powers of the discriminator and the generator,\nthe resulting GAN dynamical system is locally exponentially stable.\n\nThat is, for some region around an equilibrium point of the updates, the gradient updates will converge\nto this equilibrium point at an exponential rate. Interestingly, our conditions can be satis\ufb01ed by the\ntraditional GAN but not by the WGAN, and we indeed show that WGANs can have non-convergent\nlimit cycles in the gradient descent case.\nOur theoretical analysis also suggests a natural method for regularizing GAN updates by adding\nan additional regularization term on the norm of the discriminator gradient. We show that the\naddition of this term leads to locally exponentially stable equilibria for all classes of GANs, including\nWGANs. The additional penalty is highly related to (but also notably different from) recent proposals\nfor practical GAN optimization, such as the unrolled GAN [Metz et al., 2017] and the improved\nWasserstein GAN training [Gulrajani et al., 2017]. In practice, the approach is simple to implement,\nand preliminary experiments show that it helps avert mode collapse and leads to faster convergence.\n\n2 Background and related work\n\nGAN optimization and theory. Although the theoretical analysis of GANs has been far outpaced\nby their practical application, there have been some notable results in recent years, in addition\nto the aforementioned work in the original GAN paper. For the most part, this work is entirely\ncomplementary to our own, and studies a very different set of questions. Arjovsky and Bottou [2017]\nprovide important insights into instability that arises when the supports of the generated distribution\nand the true distribution are disjoint. 
In contrast, in this paper we delve into an equally important\nquestion of whether the updates are stable even when the generator is in fact very close to the true\ndistribution (and we answer in the af\ufb01rmative). Arora et al. [2017], on the other hand, explore\nquestions relating to the sample complexity and expressivity of the GAN architecture, and their\nrelation to the existence of an equilibrium point. However, it is still unknown as to whether, given\nthat an equilibrium exists, the GAN update procedure will converge locally.\nFrom a more practical standpoint, there have been a number of papers that address the topic of\noptimization in GANs. Several methods have been proposed that introduce new objectives or\narchitectures for improving the (practical and theoretical) stability of GAN optimization [Arjovsky\net al., 2017, Poole et al., 2016]. A wide variety of optimization heuristics and architectures have\nalso been proposed to address challenges such as mode collapse [Salimans et al., 2016, Metz et al.,\n2017, Che et al., 2017, Radford et al., 2016]. Our own proposed regularization term falls under\nthis same category, and hopefully provides some context for understanding some of these methods.\nSpeci\ufb01cally, our regularization term (motivated by stability analysis) captures a degree of \u201cforesight\u201d\nof the generator in the optimization procedure, similar to the unrolled GANs procedure [Metz et al.,\n2017]. Indeed, we show that our gradient penalty is closely related to 1-unrolled GANs, but also\nprovides more \ufb02exibility in leveraging this foresight. Finally, gradient-based regularization has been\nexplored for GANs, with one of the most recent works being that of Gulrajani et al. [2017], though\ntheir penalty is on the discriminator rather than the generator as in our case.\nFinally, there are several works that have concurrently addressed similar issues as this paper. 
Of\nparticular similarity to the methodology we propose here are the works by Roth et al. [2017] and\nMescheder et al. [2017]. The first of these two presents a stabilizing regularizer that is based on a\ngradient norm, where the gradient is calculated with respect to the datapoints. Our regularizer on the\nother hand is based on the norm of a gradient calculated with respect to the parameters. Our approach\nhas some strong similarities with that of the second work noted above; however, the authors there\ndo not establish or disprove stability, and instead note the presence of zero eigenvalues (which we\nwill treat in some depth) as a motivation for their alternative optimization method. Thus, we feel the\nworks as a whole are quite complementary, and signify the growing interest in GAN optimization\nissues.\n\nStochastic approximation algorithms and analysis of nonlinear systems. The technical tools\nwe use to analyze the GAN optimization dynamics in this paper come from the fields of stochastic\napproximation algorithms and the analysis of nonlinear differential equations – notably the “ODE\nmethod” for analyzing convergence properties of dynamical systems [Borkar and Meyn, 2000,\nKushner and Yin, 2003]. Consider a general stochastic process driven by the updates θ_{t+1} =\nθ_t + α_t(h(θ_t) + ε_t) for vector θ_t ∈ R^n, step size α_t > 0, function h : R^n → R^n and a martingale\ndifference sequence ε_t.¹ Under fairly general conditions, namely: 1) bounded second moments of ε_t,\n2) Lipschitz continuity of h, and 3) square-summable but not summable step sizes, the stochastic\napproximation algorithm converges to an equilibrium point of the (deterministic) ordinary differential\nequation θ̇(t) = h(θ(t)).\nThus, to understand stability of the stochastic approximation algorithm, it suffices to understand\nthe stability and convergence of the deterministic differential equation. 
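As a concrete, purely illustrative instance of this setup (the scalar h, the step-size schedule, and the Gaussian noise are choices of ours, not from the text above), the following sketch runs the stochastic updates for h(θ) = −θ, whose ODE θ̇ = −θ has a stable equilibrium at 0:

```python
import random

# Hypothetical scalar example: h(theta) = -theta, so the limiting ODE
# d(theta)/dt = -theta converges to the equilibrium theta* = 0.
def h(theta):
    return -theta

random.seed(0)
theta = 5.0
for t in range(1, 20001):
    alpha = 1.0 / t                 # sum alpha_t = inf, sum alpha_t^2 < inf
    noise = random.gauss(0.0, 1.0)  # martingale difference term eps_t
    theta = theta + alpha * (h(theta) + noise)

print(abs(theta) < 0.5)  # True: the noisy iterates track the ODE toward 0
```

With this particular h and schedule, the iterate reduces exactly to a running average of the noise, so the final value concentrates tightly around the ODE equilibrium.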
Though such analysis is\ntypically used to show global asymptotic convergence of the stochastic approximation algorithm to\nan equilibrium point (assuming the related ODE also is globally asymptotically stable), it can also be\nused to analyze the local asymptotic stability properties of the stochastic approximation algorithm\naround equilibrium points.² This is the technique we follow throughout this entire work, though for\nbrevity we will focus entirely on the analysis of the continuous time ordinary differential equation,\nand appeal to these standard results to imply similar properties regarding the discrete updates.\nGiven the above considerations, our focus will be on proving stability of the dynamical system around\nequilibrium points, i.e., points θ* for which h(θ*) = 0.³ Specifically, we appeal to the well known\nlinearization theorem [Khalil, 1996, Sec 4.3], which states that if the Jacobian of the dynamical\nsystem J = ∂h(θ)/∂θ |_{θ=θ*} evaluated at an equilibrium point is Hurwitz (has all strictly negative\neigenvalues, Re(λ_i(J)) < 0, ∀ i = 1, . . . , n), then the ODE will converge to θ* for some non-empty\nregion around θ*, at an exponential rate. This means that the system is locally asymptotically stable,\nor more precisely, locally exponentially stable (see Definition A.1 in Appendix A).\nThus, an important contribution of this paper is a proof of this seemingly simple fact: under some\nconditions, the Jacobian of the dynamical system given by the GAN update is a Hurwitz matrix at\nan equilibrium (or, if there are zero eigenvalues, if they correspond to a subspace of equilibria, the\nsystem is still asymptotically stable). While this is a trivial property to show for convex-concave\ngames, the fact that the GAN is not convex-concave leads to a substantially more challenging analysis.\nIn addition to this, we provide an analysis that is based on Lyapunov’s stability theorem (described\nin Appendix A). 
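As an illustration of the linearization test (on a toy system of our own, not the GAN dynamics), one can compute the Jacobian at an equilibrium and check the Hurwitz condition directly:

```python
# Toy system h(theta) = (-theta1 + theta2, -theta2), with equilibrium theta* = 0.
import numpy as np

J = np.array([[-1.0, 1.0],
              [0.0, -1.0]])  # Jacobian dh/dtheta evaluated at theta* = 0

eigvals = np.linalg.eigvals(J)
is_hurwitz = bool(np.all(eigvals.real < 0))
print(is_hurwitz)  # True: the linearization theorem gives local exponential stability
```

Here both eigenvalues equal −1, so the condition Re(λ_i(J)) < 0 holds for all i.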
The crux of the idea is that to prove convergence it is sufficient to identify a non-negative “energy” function for the linearized system which always decreases with time (specifically,\nthe energy function will be a distance from the equilibrium, or from the subspace of equilibria). Most\nimportantly, this analysis provides insights into the dynamics that lead to GAN convergence.\n\n3 GAN optimization dynamics\n\nThis section comprises the main results of this paper, showing that under proper conditions the\ngradient descent updates for GANs (that is, updating both the generator and discriminator locally and\nsimultaneously) are locally exponentially stable around “good” equilibrium points (where “good” will\nbe defined shortly). This requires that the GAN loss be strictly concave, which is not the case for\nWGANs, and we indeed show that the updates for WGANs can cycle indefinitely. This leads us to\npropose a simple regularization term that is able to guarantee exponential stability for any concave\nGAN loss, including the WGAN, rather than requiring strict concavity.\n\n¹Stochastic gradient descent on an objective f(θ) can be expressed in this framework as h(θ) = −∇_θ f(θ).\n²Note that the local analysis does not show that the stochastic approximation algorithm will necessarily\nconverge to an equilibrium point, but still provides a valuable characterization of how the algorithm will behave\naround these points.\n³Note that this is a slightly different usage of the term equilibrium as typically used in the GAN literature,\nwhere it refers to a Nash equilibrium of the min-max optimization problem. 
These two definitions (assuming we\nmean just a local Nash equilibrium) are equivalent for the ODE corresponding to the min-max game, but we use\nthe dynamical systems meaning throughout this paper, that is, any point where the gradient update is zero.\n\n3.1 The generalized GAN setting\n\nFor the remainder of the paper, we consider a slightly more general formulation of the GAN\noptimization problem than the one presented earlier, given by the following min/max problem:\n\nmin_G max_D V(G, D) = E_{x∼p_data}[f(D(x))] + E_{z∼p_latent}[f(−D(G(z)))]    (2)\n\nwhere G : Z → X is the generator network, which maps from the latent space Z to the input space\nX; D : X → R is the discriminator network, which maps from the input space to a classification\nof the example as real or synthetic; and f : R → R is a concave function. We can recover the\ntraditional GAN formulation [Goodfellow et al., 2014] by taking f to be the (negated) logistic loss\nf(x) = −log(1 + exp(−x)); note that this convention slightly differs from the standard formulation\nin that in this case the discriminator outputs the real-valued “logits” and the loss function would\nimplicitly scale this to a probability. We can recover the Wasserstein GAN by simply taking f(x) = x.\nAssuming the generator and discriminator networks to be parameterized by some set of parameters,\nθ_D and θ_G respectively, we analyze the simple stochastic gradient descent approach to solving this\noptimization problem. That is, we take simultaneous gradient steps in both θ_D and θ_G, which in our\n“ODE method” analysis leads to the following differential equations:\n\nθ̇_D = ∇_{θ_D} V(θ_G, θ_D),    θ̇_G = −∇_{θ_G} V(θ_G, θ_D).    (3)\n\nA note on alternative updates. Rather than updating both the generator and discriminator according to the min-max problem above, Goodfellow et al. [2014] also proposed a modified update for\njust the generator that minimizes a different objective, V′(G, D) = E_{z∼p_latent}[f(D(G(z)))] (the\nnegative sign is pulled out from inside f). In fact, all the analyses we consider in this paper apply\nequally to this case (or any convex combination of both updates), as the ODEs of the update equations\nhave the same Jacobians at equilibrium.\n\n3.2 Why is proving stability hard for GANs?\n\nBefore presenting our main results, we first highlight why understanding the local stability of GANs\nis non-trivial, even when the generator and discriminator have simple forms. As stated above, GAN\noptimization consists of a min-max game, and gradient descent algorithms will converge if the game\nis convex-concave – the objective must be convex in the term being minimized and concave in the\nterm being maximized. Indeed, this was a crucial assumption in the convergence proof in the original\nGAN paper. However, for virtually any parameterization of the real GAN generator and discriminator,\neven if both representations are linear, the GAN objective will not be a convex-concave game:\n\nProposition 3.1. The GAN objective in Equation 2 can be a concave-concave objective, i.e., concave\nwith respect to both the discriminator and generator parameters, for a large part of the discriminator\nspace, including regions arbitrarily close to the equilibrium.\n\nTo see why, consider a simple GAN over 1-dimensional data and latent space with linear generator\nand discriminator, i.e., D(x) = θ_D x + θ_D^0 and G(z) = θ_G z + θ_G^0. Then the GAN objective is:\n\nV(G, D) = E_{x∼p_data}[f(θ_D x + θ_D^0)] + E_{z∼p_latent}[f(−θ_D(θ_G z + θ_G^0) − θ_D^0)].\n\nBecause f is concave, by inspection we can see that V is concave in θ_D and θ_D^0; but it is also\nconcave (not convex) in θ_G and θ_G^0, for the same reason. 
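This concave-concave behavior can be checked numerically for the 1-dimensional example; the sketch below assumes standard Gaussian data and latent distributions (a choice of ours, not stated above) and estimates second derivatives of V by finite differences near the equilibrium:

```python
# Numeric illustration: with f the negated logistic loss, the linear 1-D GAN
# objective is concave in BOTH theta_D and theta_G (a concave-concave game).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100000)          # samples from p_data (assumed N(0, 1))
z = rng.normal(size=100000)          # samples from p_latent (assumed N(0, 1))
f = lambda u: -np.log1p(np.exp(-u))  # negated logistic loss, f'' < 0 everywhere

def V(tD, tD0, tG, tG0):
    return np.mean(f(tD * x + tD0)) + np.mean(f(-tD * (tG * z + tG0) - tD0))

def second_diff(g, t, eps=1e-3):     # central finite-difference second derivative
    return (g(t + eps) - 2 * g(t) + g(t - eps)) / eps**2

# Curvature near the equilibrium theta_G = 1, with the discriminator slightly
# perturbed (theta_D = 0.5) so the generator direction is not flat.
d2_dD = second_diff(lambda t: V(t, 0.0, 1.0, 0.0), 0.5)
d2_dG = second_diff(lambda t: V(0.5, 0.0, t, 0.0), 1.0)
print(d2_dD < 0, d2_dG < 0)  # True True: concave in each parameter
```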
Thus, the optimization involves concave\nminimization, which in general is a difficult problem. To prove that this is not a peculiarity of the\nabove linear discriminator system, in Appendix B, we show similar observations for a more general\nparametrization, and also for the case where f′′(x) = 0 (which happens in the case of WGANs).\nThus, a major question remains as to whether or not GAN optimization is stable at all (most concave\nmaximization is not). Indeed, there are several well-known properties of GAN optimization that may\nmake it seem as though gradient descent optimization may not work in theory. For instance, it is\nwell-known that at the optimal location p_g = p_data, the optimal discriminator will output zero on all\nexamples, which in turn means that any generator distribution will be optimal for this discriminator. This\nwould seem to imply that the system can not be stable around such an equilibrium.\nHowever, as we will show, gradient descent GAN optimization is locally asymptotically stable, even\nfor natural parameterizations of generator-discriminator pairs (which still make up concave-concave\noptimization problems). Furthermore, at equilibrium, although the zero-discriminator property\nmeans that the generator is not stable “independently”, the joint dynamical system of generator and\ndiscriminator is locally asymptotically stable around certain equilibrium points.\n\n3.3 Local stability of general GAN systems\n\nThis section contains our first technical result, establishing that GANs are locally stable under proper\nlocal conditions. Although the proofs are deferred to the appendix, the elements that we do emphasize\nhere are the conditions that we identified for local stability to hold. Indeed, because the proof rests on\nthese conditions (some of which are fairly strong), we want to highlight them as much as possible, as\nthey themselves also convey valuable intuition as to what is required for GAN convergence.\nTo formalize our conditions, we denote the support of a distribution with probability density function\n(p.d.f.) p by supp(p) and the p.d.f. of the generator θ_G by p_{θ_G}. Let B_ε(·) denote the Euclidean L2-ball\nof radius ε. Let λ_max(·) and λ_min^(+)(·) denote the largest and the smallest non-zero eigenvalues of a\nnon-zero positive semidefinite matrix. Let Col(·) and Null(·) denote the column space and null space\nof a matrix respectively. Finally, we define two key matrices that will be integral to our analyses:\n\nK_DD ≜ E_{p_data}[ ∇_{θ_D} D_{θ_D}(x) ∇_{θ_D}^T D_{θ_D}(x) ] |_{θ_D^*},    K_DG ≜ ∫_X ∇_{θ_D} D_{θ_D}(x) ∇_{θ_G}^T p_{θ_G}(x) dx |_{(θ_D^*, θ_G^*)}.\n\nHere, the matrices are evaluated at an equilibrium point (θ_D^*, θ_G^*) which we will characterize shortly.\nThe significance of these terms is that, as we will see, K_DD is proportional to the Hessian of the\nGAN objective with respect to the discriminator parameters at equilibrium, and K_DG is proportional\nto the off-diagonal term in this Hessian, corresponding to the discriminator and generator parameters.\nThese matrices also occur in similar positions in the Jacobian of the system at equilibrium.\nWe now discuss conditions under which we can guarantee exponential stability. All our conditions\nare imposed on both (θ_D^*, θ_G^*) and all equilibria in a small neighborhood around it, though we do\nnot state this explicitly in every assumption. First, we define the “good” equilibria we care about as\nthose that correspond to a generator which matches the true distribution and a discriminator that is\nidentically zero on the support of this distribution. As described next, implicitly, this also assumes\nthat the discriminator and generator representations are powerful enough to guarantee that there are\nno “bad” equilibria in a local neighborhood of this equilibrium.\n\nAssumption I. p_{θ_G^*} = p_data and D_{θ_D^*}(x) = 0, ∀ x ∈ supp(p_data).\n\nThe assumption that the generator matches the true distribution is a rather strong assumption, as\nit limits us to the “realizable” case, where the generator is capable of creating the underlying data\ndistribution. Furthermore, this means the discriminator is (locally) powerful enough that for any other\ngenerator distribution it is not at equilibrium (i.e., discriminator updates are non-zero). Since we\ndo not typically expect this to be the case, we also provide an alternative non-realizable assumption\nbelow that is also sufficient for our results, i.e., the system is still stable. In both the realizable and\nnon-realizable cases the requirement of an all-zero discriminator remains. This implicitly requires\nthat even the generator representation be (locally) rich enough so that when the discriminator is not\nidentically zero, the generator is not at equilibrium (i.e., generator updates are non-zero). Finally,\nnote that these conditions do not disallow bad equilibria outside of this neighborhood, which may\npotentially even be unstable.\n\nAssumption I. (Non-realizable) The discriminator is linear in its parameters θ_D and furthermore,\nfor any equilibrium point (θ_D^*, θ_G^*), D_{θ_D^*}(x) = 0, ∀ x ∈ supp(p_data) ∪ supp(p_{θ_G^*}).\n\nThis alternative assumption is largely a weakening of Assumption I, as the condition on the dis-\ncriminator remains, but there is no requirement that the generator give rise to the true distribution.\nHowever, the requirement that the discriminator be linear in the parameters (not in its input) is an\nadditional restriction that seems unavoidable in this case for technical reasons. Further, note that\nthe fact that D_{θ_D^*}(x) = 0 and that the generator/discriminator are both at equilibrium, still means\nthat although it may be that p_{θ_G^*} ≠ p_data, these distributions are (locally) indistinguishable as far as\nthe discriminator is concerned. Indeed, this is a nice characterization of “good” equilibria, in that the\ndiscriminator cannot differentiate between the real and generated samples.\nOur goal next is to identify strong curvature conditions that can be imposed on the objective V (or\na function related to the objective), though only locally at equilibrium. First, we will require that\nthe objective is strongly concave in the discriminator parameter space at equilibrium (note that it is\nconcave by default). However, on the other hand, we cannot require the objective to be strongly convex\nin the generator parameter space, as we saw that the objective is not convex-concave even in the nicest\nscenario, even arbitrarily close to equilibrium. Instead, we identify another convex function, namely\nthe magnitude of the update on the equilibrium discriminator, i.e., ‖∇_{θ_D} V(θ_D, θ_G) |_{θ_D=θ_D^*}‖², and\nrequire that to be strongly convex in the generator space at equilibrium. Since these strong curvature\nassumptions will allow only systems with a locally unique equilibrium, we will state them in a relaxed\nform that accommodates a local subspace of equilibria. Furthermore, we will state these assumptions\nin two parts, first as a condition on f and second as a condition on the parameter space.\nFirst, the condition on f is straightforward, making it necessary that the loss f be concave at 0; as we\nwill show, when this condition is not met, there need not be local asymptotic convergence.\n\nAssumption II. The function f satisfies f′′(0) < 0, and f′(0) ≠ 0.\n\nNext, to state conditions on the parameter space while also allowing systems with multiple equilibria\nlocally, we first define the following property for a function, say g, at a specific point in its domain:\nalong any direction, either the second derivative of g must be non-zero or all derivatives must be zero.\nFor example, at the origin, g(x, y) = x² + x²y² is flat along y, and along any other direction at an\nangle α ≠ 0 with the y axis, the second derivative is 2 sin²α. For the GAN system, we will require\nthis property, formalized in Property I, for two convex functions whose Hessians are proportional to\nK_DD and K_DG^T K_DG. We provide more intuition for these functions below.\n\nProperty I. g : Θ → R satisfies Property I at θ* ∈ Θ if for any θ ∈ Null(∇²_θ g(θ) |_{θ*}), the function g\nis locally constant along θ at θ*, i.e., ∃ ε > 0 such that for all ε′ ∈ (−ε, ε), g(θ*) = g(θ* + ε′θ).\n\nAssumption III. At an equilibrium (θ_D^*, θ_G^*), the functions E_{p_data}[D²_{θ_D}(x)] and\n‖ E_{p_data}[∇_{θ_D} D_{θ_D}(x)] − E_{p_{θ_G}}[∇_{θ_D} D_{θ_D}(x)] ‖² |_{θ_D=θ_D^*}\nmust satisfy Property I in the discriminator and generator space respectively.\n\nHere is an intuitive explanation of what these two non-negative functions represent and how they relate\nto the objective. The first function is a function of θ_D which measures how far θ_D is from an all-zero\nstate, and the second is a function of θ_G which measures how far θ_G is from the true distribution; at\nequilibrium these functions are zero. 
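The proportionality between these Hessians and K_DD (and K_DG^T K_DG) can be sanity-checked numerically in the simplest case. The sketch below uses a hypothetical discriminator that is linear in its parameters, D_θ(x) = θᵀφ(x) with features φ of our choosing (so that ∇_θ D = φ(x) and K_DD = E[φφᵀ]), for which E_{p_data}[D²_θ(x)] = θᵀ E[φφᵀ] θ has Hessian exactly 2 K_DD:

```python
# Check that the Hessian of g(theta) = E_pdata[D_theta(x)^2] equals 2 * K_DD
# for a linear-in-parameters discriminator D_theta(x) = theta^T phi(x).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
phi = np.stack([x, x**2], axis=1)   # features = gradient of D wrt theta
K_DD = phi.T @ phi / len(x)         # empirical E[phi phi^T]

theta = rng.normal(size=2)
g = lambda t: np.mean((phi @ t) ** 2)

# Finite-difference Hessian of g (exact up to rounding since g is quadratic).
eps = 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i, e_j = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H[i, j] = (g(theta + e_i + e_j) - g(theta + e_i)
                   - g(theta + e_j) + g(theta)) / eps**2

print(np.allclose(H, 2 * K_DD, atol=1e-4))  # True
```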
We will see later that given f′′(0) < 0, the curvature of the first function at θ_D^* is representative of the\ncurvature of V(θ_D, θ_G^*) in the discriminator space; similarly, given f′(0) ≠ 0, the curvature of the\nsecond function at θ_G^* is representative of the curvature of the magnitude of the discriminator update\non θ_D^* in the generator space. The intuition behind why this particular relation holds is that, when θ_G\nmoves away from the true distribution, while the second function in Assumption III increases, θ_D^*\nalso becomes more suboptimal for that generator; as a result, the magnitude of the update on θ_D^*\nincreases too. Note that we show in Lemma C.2 that the Hessians of the two functions in Assumption III\nin the discriminator and the generator space respectively are proportional to K_DD and K_DG^T K_DG.\nThe above relations involving the two functions and the GAN objective, together with Assumption III,\nbasically allow us to consider systems with reasonable strong curvature properties, while also\nallowing many equilibria in a local neighborhood in a specific sense. In particular, if the curvature\nof the first function is flat along a direction u (which also means that K_DD u = 0) we can perturb\nθ_D^* slightly along u and still have an ‘equilibrium discriminator’ as defined in Assumption I, i.e.,\n∀ x ∈ supp(p_{θ_G^*}), D_{θ_D}(x) = 0. Similarly, for any direction v along which the curvature of the\nsecond function is flat (i.e., K_DG v = 0), we can perturb θ_G^* slightly along that direction such that\nθ_G remains an ‘equilibrium generator’ as defined in Assumption I, i.e., p_{θ_G} = p_data. We prove this\nformally in Lemma C.2. Perturbations along any other directions do not yield equilibria because\nthen, either θ_D is no longer in an all-zero state or θ_G does not match the true distribution. 
Thus, we\nconsider a setup where the rank deficiencies of K_DD and K_DG^T K_DG, if any, correspond to equivalent\nequilibria (which typically exist for neural networks, though in practice they may not correspond to\n‘linear’ perturbations as modeled here).\nOur final assumption is on the supports of the true and generated distributions: we require that all the\ngenerators in a sufficiently small neighborhood of the equilibrium have distributions with the same\nsupport as the true distribution. Following this, we briefly discuss a relaxation of this assumption.\n\nAssumption IV. ∃ ε_G > 0 such that ∀ θ_G ∈ B_{ε_G}(θ_G^*), supp(p_{θ_G}) = supp(p_data).\n\nThis may typically hold if the support covers the whole space X; but when the true distribution has\nsupport in some smaller disjoint parts of the space X, nearby generators may correspond to slightly\ndisplaced versions of this distribution with a different support. For the latter scenario, we show in\nAppendix C.1 that local exponential stability holds under a certain smoothness condition on the\ndiscriminator. Specifically, we require that D_{θ_D^*}(·) be zero not only on the support of θ_G^* but also\non the support of small perturbations of θ_G^*, as otherwise the generator will not be at equilibrium.\n(Additionally, we also require this property from the discriminators that lie within a small perturbation\nof θ_D^* in the null space of K_DD so that they correspond to equilibrium discriminators.) We note\nthat while this relaxed assumption accounts for a larger class of examples, it is still strong in that\nit also restricts us from certain simple systems. Due to space constraints, we state and discuss the\nimplications of this assumption in greater detail in Appendix C.1.\nWe now state our main result.\n\nTheorem 3.1. 
The dynamical system defined by the GAN objective in Equation 2 and the updates in\nEquation 3 is locally exponentially stable with respect to an equilibrium point (θ_D^*, θ_G^*) when\nAssumptions I, II, III, IV hold for (θ_D^*, θ_G^*) and other equilibria in a small neighborhood around it.\nFurthermore, the rate of convergence is governed only by the eigenvalues λ of the Jacobian J of the\nsystem at equilibrium with a strictly negative real part, upper bounded as:\n\n• If Im(λ) = 0, then Re(λ) ≤ 2f′′(0) f′(0)² λ_min^(+)(K_DD) λ_min^(+)(K_DG^T K_DG) / ( 4f′′(0)² λ_min^(+)(K_DD) λ_max(K_DD) + f′(0)² λ_min^(+)(K_DG^T K_DG) )\n• If Im(λ) ≠ 0, then Re(λ) ≤ f′′(0) λ_min^(+)(K_DD)\n\nThe vast majority of our proofs are deferred to the appendix, but we briefly describe the intuition\nhere. It is straightforward to show that the Jacobian J of the system at equilibrium can be written as:\n\nJ = [ J_DD  J_DG ; J_DG^T  J_GG ] = [ 2f′′(0)K_DD  −f′(0)K_DG ; f′(0)K_DG^T  0 ].\n\nRecall that we wish to show this is Hurwitz. First note that J_DD (the Hessian of the objective with\nrespect to the discriminator) is negative semi-definite if and only if f′′(0) < 0. Next, a crucial\nobservation is that J_GG = 0, i.e., the Hessian term w.r.t. the generator vanishes because for the all-zero\ndiscriminator, all generators result in the same objective value. Fortunately, this means at equilibrium\nwe do not have non-convexity in θ_G precluding local stability. Then, we make use of the crucial\nLemma G.2 we prove in the appendix, showing that any matrix of the form [ Q  P ; −P^T  0 ] is\nHurwitz provided that Q is strictly negative definite and P has full column rank.\nHowever, this property holds only when K_DD is positive definite and K_DG is full column rank.\nNow, if K_DD or K_DG do not have this property, recall that the rank deficiency is due to a subspace\nof equilibria around (θ_D^*, θ_G^*). 
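A quick numeric sanity check of this structure (ours, not the paper's proof of Lemma G.2): random matrices of the form [Q P; −Pᵀ 0] with Q strictly negative definite and P of full column rank come out Hurwitz:

```python
import numpy as np

rng = np.random.default_rng(2)
ok = True
for _ in range(20):
    A = rng.normal(size=(4, 4))
    Q = -(A @ A.T) - 0.1 * np.eye(4)   # Q strictly negative definite
    P = rng.normal(size=(4, 3))        # full column rank almost surely
    J = np.block([[Q, P], [-P.T, np.zeros((3, 3))]])
    ok = ok and bool(np.all(np.linalg.eigvals(J).real < 0))
print(ok)  # True: every sampled matrix of this form is Hurwitz
```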
Consequently, we can analyze the stability of the system projected to a subspace orthogonal to these equilibria (Theorem A.4). Additionally, we also prove stability using Lyapunov's stability theorem (Theorem A.1), by showing that the squared L2 distance to the subspace of equilibria always either decreases or only instantaneously remains constant.

Additional results. In order to illustrate our assumptions in Theorem 3.1, in Appendix D we consider a simple GAN that learns a multi-dimensional Gaussian using a quadratic discriminator and a linear generator. In a similar setup, in Appendix E, we consider the case where f(x) = x, i.e., the Wasserstein GAN, so that f''(x) = 0, and we show that the system can perennially cycle around an equilibrium point without converging. A simple two-dimensional example is visualized in Section 4. Thus, gradient descent WGAN optimization is not necessarily asymptotically stable.

3.4 Stabilizing optimization via gradient-based regularization

Motivated by the considerations above, in this section we propose a regularization penalty for the generator update, which uses a term based upon the gradient of the discriminator. Crucially, the regularization term does not change the parameter values at the equilibrium point, and at the same time enhances the local stability of the optimization procedure, both in theory and practice. Although these update equations do require that we differentiate with respect to a function of another gradient term, such "double backprop" terms (see e.g., Drucker and Le Cun [1992]) are easily computed by modern automatic differentiation tools.
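To preview the effect of such a penalty, consider the two-dimensional example visualized in Section 4: the true and latent distributions are uniform on [−1, 1], the discriminator is D(x) = w₂x², and the generator is G(z) = az, so that for the WGAN (f(x) = x) the objective reduces to V(w₂, a) = (w₂/3)(1 − a²), with equilibrium (w₂, a) = (0, 1). The sketch below (our illustration under these assumptions, not the paper's experiment; the helper names are ours) Euler-integrates the simultaneous gradient updates, with the generator additionally descending η‖∇_{w₂}V‖²; the "double" gradient of the penalty is expanded analytically here rather than by automatic differentiation:

```python
def simulate(eta, steps=5000, dt=0.01, w2=0.5, a=1.5):
    """Simultaneous gradient ascent/descent on the toy WGAN objective
    V(w2, a) = (w2/3)*(1 - a**2), whose equilibrium is (w2, a) = (0, 1).
    The generator additionally descends eta * (dV/dw2)**2."""
    for _ in range(steps):
        dV_dw2 = (1.0 - a * a) / 3.0               # discriminator gradient (ascent)
        dV_da = -2.0 * w2 * a / 3.0                # generator gradient (descent)
        dpen_da = 2.0 * dV_dw2 * (-2.0 * a / 3.0)  # d/da of (dV/dw2)**2, chain rule
        w2, a = w2 + dt * dV_dw2, a - dt * (dV_da + eta * dpen_da)
    return w2, a

def dist_to_eq(w2, a):
    return (w2 ** 2 + (a - 1.0) ** 2) ** 0.5

# Vanilla WGAN updates (eta = 0) orbit the equilibrium, so the distance to
# (0, 1) stays large; the penalty damps the rotation and the iterates converge.
assert dist_to_eq(*simulate(eta=0.0)) > 0.3
assert dist_to_eq(*simulate(eta=1.0)) < 0.05
```

In a real GAN, the same penalty gradient would be obtained by differentiating ‖∇_θ_D V‖² with respect to θ_G via double backprop, as described above.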
Specifically, we propose the regularized update

θ_G := θ_G − α∇_θ_G ( V(D_θ_D, G_θ_G) + η‖∇_θ_D V(D_θ_D, G_θ_G)‖² ).    (4)

Local Stability. The intuition behind this regularizer is perhaps most easily understood by considering how it changes the Jacobian at equilibrium (though there are other means of motivating the update as well, discussed further in Appendix F.2). In the Jacobian of the new update, although there are now non-antisymmetric off-diagonal blocks, the block diagonal terms are negative definite:

[ J_DD, J_DG ; −J_DG^T(I + 2ηJ_DD), −2ηJ_DG^T J_DG ].

As we show below in Theorem 3.2 (proved in Appendix F), as long as we choose η small enough so that I + 2ηJ_DD ≻ 0, this guarantees the updates are locally asymptotically stable for any concave f.

In addition to its stability properties, this regularization term also addresses a well-known failure state in GANs called mode collapse, by lending more "foresight" to the generator. The way our updates provide this foresight is very similar to the unrolled updates proposed in Metz et al. [2017], although our regularization is much simpler and provides more flexibility to leverage the foresight. In practice, we see that our method can be as powerful as the more complex and slower 10-unrolled GANs. We discuss this and other intuitive ways of motivating our regularizer in Appendix F.

Theorem 3.2. The dynamical system defined by the GAN objective in Equation 2 and the updates in Equation 4 is locally exponentially stable at the equilibrium, under the same conditions as in Theorem 3.1, if η < 1/(2λ_max(−J_DD)). Further, under appropriate conditions similar to these, the WGAN system is locally exponentially stable at the equilibrium for any η.
The rate of convergence for the WGAN is governed only by the eigenvalues λ of the Jacobian at equilibrium, with a strict negative real part upper bounded as:

• If Im(λ) = 0, then Re(λ) ≤ −2f'(0)²η λ_min^(+)(K_DG^T K_DG) / (4f'(0)²η² λ_max(K_DG^T K_DG) + 1)

• If Im(λ) ≠ 0, then Re(λ) ≤ −ηf'(0)² λ_min^(+)(K_DG^T K_DG)

4 Experimental results

We very briefly present experimental results that demonstrate that our regularization term also has substantial practical promise.⁴ In Figure 1, we compare our gradient regularization to 10-unrolled GANs on the same architecture and dataset (a mixture of eight Gaussians) as in Metz et al. [2017]. Our system quickly spreads out all the points instead of first exploring only a few modes and then gradually redistributing its mass over all the modes. Note that the conventional GAN updates are known to enter mode collapse for this setup. We see similar results (see Figure 2 here, and Figure 4 in the Appendix for a more detailed figure) in the case of a stacked MNIST dataset using a DCGAN [Radford et al., 2016], i.e., three random digits from MNIST are stacked together so as to create a distribution over 1000 modes. Finally, Figure 3 presents streamline plots for a 2D system where both the true and the latent distribution are uniform over [−1, 1], the discriminator is D(x) = w₂x², and the generator is G(z) = az. Observe that while the WGAN system goes in orbits as expected, the original GAN system converges. With our updates, both these systems converge quickly to the true equilibrium.

5 Conclusion

In this paper, we presented a theoretical analysis of the local asymptotic stability of GAN optimization under proper conditions.
We further showed that the recently proposed WGAN is not asymptotically stable under the same conditions, but we introduced a gradient-based regularizer which stabilizes both traditional GANs and WGANs, and can improve convergence speed in practice.

The results here provide substantial insight into the nature of GAN optimization, perhaps even offering some clues as to why these methods have worked so well despite not being convex-concave. However, we also emphasize that there are substantial limitations to the analysis, and directions for future work. Perhaps most notably, the analysis here only provides an understanding of what happens locally, close to an equilibrium point. For non-convex architectures this may be all that is possible, but it seems plausible that much stronger global convergence results could hold for simple settings like the linear quadratic GAN (indeed, as the streamline plots show, we observe this in practice for simple domains). Second, the analysis here does not show that equilibrium points necessarily exist, but only illustrates convergence if there do exist points that satisfy certain criteria: the existence question has been addressed by previous work [Arora et al., 2017], but much more analysis remains to be done here. GANs are rapidly becoming a cornerstone of deep learning methods, and the theoretical and practical understanding of these methods will prove crucial in moving the field forward.

⁴We provide an implementation of this technique at https://github.com/locuslab/gradient_regularized_gan

Figure 1: Gradient regularized GAN, η = 0.5 (top row) vs. 10-unrolled GAN with η = 10⁻⁴ (bottom row), shown at iterations 0, 3000, 8000, 50000, and 70000.

Figure 2: Gradient regularized (left) and traditional (right) DCGAN architectures on stacked MNIST examples, after 1, 4 and 20 epochs.

Figure 3: Streamline plots around the equilibrium (0, 1) for the conventional GAN (top) and the WGAN (bottom) for η = 0 (vanilla updates) and η = 0.25, 0.5, 1 (left to right).

Acknowledgements. We thank Lars Mescheder for pointing out a missing condition in the relaxed version of Assumption IV (see Appendix C.1) in earlier versions of this manuscript.

References

Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations (ICLR), 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, 2017.

Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232, 2017.

Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.
SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. In Fifth International Conference on Learning Representations (ICLR), 2017.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28, pages 1486–1494, 2015.

Harris Drucker and Yann Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.

Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.

Hassan K. Khalil. Non-linear Systems. Prentice-Hall, New Jersey, 1996.

Harold Kushner and George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35 of Stochastic Modelling and Applied Probability. Springer-Verlag New York, 2003.

Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics.
1995.

Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In Fourth International Conference on Learning Representations (ICLR), 2016.

L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In Fifth International Conference on Learning Representations (ICLR), 2017.

Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, and Anelia Angelova. Improved generator objectives for GANs. arXiv preprint arXiv:1612.02780, 2016.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Fourth International Conference on Learning Representations (ICLR), 2016.

K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pages 2234–2242, 2016.

Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems 29, pages 82–90,
2016.