{"title": "Learning Temporal Point Processes via Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10781, "page_last": 10791, "abstract": "Social goods, such as healthcare, smart city, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. 
This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well on both synthetic and real data.", "full_text": "Learning Temporal Point Processes via Reinforcement Learning

Shuang Li∗1, Shuai Xiao2, Shixiang Zhu1, Nan Du3, Yao Xie1, and Le Song1,2

1Georgia Institute of Technology
2Ant Financial
3Google Brain

Abstract

Social goods, such as healthcare, smart city, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation.
This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well on both synthetic and real data.

1 Introduction

Many natural and artificial systems produce a large volume of discrete events occurring in continuous time, for example, the occurrence of crime events, earthquakes, patient visits to hospitals, financial transactions, and user behavior in mobile applications [5]. It is essential to understand and model these complex and intricate event dynamics so that accurate prediction, recommendation or intervention can be carried out subsequently depending on the context.

Temporal point processes offer an elegant mathematical framework for modeling the generative processes of these event data. Typically, parametric (or semi-parametric) assumptions are made on the intensity function [11, 9] based on prior knowledge of the processes, and maximum-likelihood estimation (MLE) is used to fit the model parameters from data. These models often work well when the parametric assumptions are correct. However, in many cases where the real event generative process is unknown, these parametric assumptions may be too restrictive and fail to reflect reality. Thus there have been recent efforts to increase the expressiveness of the intensity function using nonparametric forms [7] and recurrent neural networks [6, 19]. However, these more sophisticated models still rely on maximizing the likelihood, which now involves intractable integrals and needs to be approximated.
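The intractable-integral issue can be made concrete with a short, self-contained sketch (our own illustration, not code from the paper): the point-process log-likelihood is Σ_i log λ(t_i|s_{t_i}) − ∫_0^T λ(τ|s_τ)dτ, and for a Hawkes intensity with an exponential kernel the integral (compensator) happens to have a closed form, whereas a general intensity would force numerical approximation. The parameter names `mu`, `alpha`, `beta` are assumptions for the sketch, not the paper's notation.

```python
import numpy as np

def hawkes_intensity(t, events, mu=1.0, alpha=0.5, beta=1.0):
    """Conditional intensity lambda(t | history) with an exponential kernel."""
    past = events[events < t]
    return mu + alpha * np.sum(np.exp(-beta * (t - past)))

def hawkes_loglik(events, T, mu=1.0, alpha=0.5, beta=1.0):
    """log-likelihood = sum_i log lambda(t_i) - integral_0^T lambda(t) dt.
    For the exponential kernel the compensator integral is closed-form;
    for a general intensity it must be approximated numerically."""
    log_term = sum(np.log(hawkes_intensity(t, events, mu, alpha, beta))
                   for t in events)
    compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - events)))
    return log_term - compensator
```

Swapping in any intensity without a closed-form compensator (e.g., an RNN-parameterized one) is exactly where the MLE objective becomes expensive to evaluate.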
Most recently, [27] proposed to bypass the problem of maximum likelihood by adopting a generative adversarial network (GAN) framework, where a recurrent neural network is learned to transform event sequences from a Poisson process into the target event sequences. However, this approach is rather computationally intensive, since it requires fitting another recurrent neural network as the discriminator, and it takes many iterations and careful tuning for both neural networks to reach equilibrium.

∗Correspondence to: Shuang Li, Yao Xie, and Le Song

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we take a new perspective and establish an under-explored connection between temporal point processes and reinforcement learning: the generation of each event can be treated as the action taken by a stochastic policy, and the intensity-function learning problem in temporal point processes can be viewed as the policy learning problem in reinforcement learning.

More specifically, we parameterize a stochastic policy π using a recurrent neural network over the event history and learn the unknown reward function via inverse reinforcement learning [1, 20, 28, 15]. Our algorithm for policy optimization iterates between learning the reward function and learning the stochastic policy π. Inverse reinforcement learning is known to be time-consuming, as it requires solving a reinforcement learning problem in every inner loop. To tackle this problem, we convert the inverse reinforcement learning step into a minimization problem over the discrepancy between the expert point process and the learner point process. By choosing the function class of the reward to be the unit ball in a reproducing kernel Hilbert space (RKHS) [13, 16, 8], we can obtain an explicit nonparametric closed form for the optimal reward function.
Then the stochastic policy can be learned by a customized policy gradient, with the optimal reward function having an analytical expression. An illustration of our modeling framework is shown in Figure 1.

We conducted experiments on various synthetic and real sequences of event data and showed that our approach outperforms the state-of-the-art regarding both data description and computational efficiency.

2 Preliminaries

Temporal Point Processes. A temporal point process is a stochastic process whose realization is a sequence of discrete events {ti} with ti ∈ R+ and i ∈ Z+, abstracted as points on a timeline [5]. Let the history st = {t1, t2, . . . , tn | tn < t} be the sequence of event times up to but not including time t. The intensity function (rate function) λ(t|st) conditioned on the history st uniquely characterizes the generative process of the events. Different functional forms of λ(t|st)dt capture different generating patterns of events. For example, a plain homogeneous Poisson process has λ(t|st) = λ0 ≥ 0, implying that each event occurs independently of the others and uniformly on the timeline. A Hawkes process has λ(t|st) = λ0 + α Σ_{ti∈st} exp(−(t − ti)), where the occurrences of past events boost future occurrences. Given the intensity function, the survival function, defined as S(t|st) = exp(−∫_{tn}^{t} λ(τ|sτ) dτ), is the conditional probability that no event occurs in the window [tn, t), and the likelihood of observing an event at time t is f(t|st) = λ(t|st)S(t|st). Then we can express the joint likelihood of observing a sequence of events sT = {t1, t2, . . . , tn | tn < T} up to an observation window T as

f(sT) = Π_{ti∈sT} λ(ti|s_{ti}) · exp(−∫_0^T λ(τ|sτ) dτ).  (1)

Figure 1: Illustration of our modeling framework. The observed trajectories of events will be viewed as the actions generated by an expert policy πE. The goal is to learn a policy, which we call the learner, that mimics the distribution of the observed expert event sequences.
The learner policy π(a|st) provides the probability of the next event occurring at a after t, and st := {ti}ti<t denotes the history of events before t.

Reproducing Kernel Hilbert Space (RKHS). Let k(t, t′) be a positive-definite kernel on a domain T; common examples are the Gaussian RBF kernel k(t, t′) = exp(−‖t − t′‖²/(2σ²)), where σ > 0 is the kernel bandwidth, and the polynomial kernel k(t, t′) = (⟨t, t′⟩ + a)^d, where a > 0 and d ∈ N [23, 3, 13]. In this paper, if not otherwise stated, we will assume that the Gaussian RBF kernel is used. Let P be a measure on T, and define μP := ∫_{t∈T} k(t, ·) dP(t) as the Hilbert-space embedding of P [24]. Then for all f ∈ H, EP[f(t)] = ⟨f, μP⟩H by the reproducing property. Similarly, one can embed another measure Q on T into the RKHS as μQ. Then a distance between the measures P and Q can be defined as ‖μP − μQ‖H := sup_{‖f‖H≤1} ⟨f, μP − μQ⟩H. A characteristic RKHS is one for which the embedding is injective: that is, each measure has a unique embedding [25], and ‖μP − μQ‖H = 0 if and only if P = Q. This property holds for many commonly used kernels; for T = R^d, it includes the Gaussian kernel.

3 A Reinforcement Learning Framework

Suppose we are interested in modeling daily crime patterns or the monthly occurrences of disease for patients; the data are then collected as trajectories of events within a predefined time window T. We regard the observed paths as actions taken by an expert (nature). Let ξ = {τ1, τ2, . . . , τ_{N_T^ξ}} denote one such trajectory, where N_T^ξ is the total number of events up to T, which can differ across sequences. Then each trajectory ξ ∼ πE can be seen as an expert demonstration sampled from the expert policy πE. Hence, on a high level, given a set of expert demonstrations D = {ξ1, ξ2, . . . , ξj, . . . | ξj ∼ πE}, we can treat fitting a temporal point process to D as searching for a learner policy πθ which can generate another set of sequences D̃ = {η1, η2, . . . , ηj, . . . | ηj ∼ πθ} with patterns similar to those of D. We elaborate on this reinforcement learning framework below.

Reinforcement Learning Formulation (RL). Given a sequence of past events st = {ti}ti<t, the learner takes an action a > 0, the waiting time until the next event, according to a stochastic policy π(a|st). We parameterize the policy with a recurrent neural network (RNN): a hidden state hi embeds the history up to the i-th event, and the next action is sampled from a density π(a|Θ(hi−1)) whose parameter Θ(hi−1) > 0 is a nonlinear function of the previous hidden state. The choice of the model π is quite flexible, with the only constraint that the random variable must be positive, since a is always positive. Common distributions such as the exponential and Rayleigh distributions satisfy this constraint, leading to π(a|Θ(hi−1)) = Θ(h)e^{−Θ(h)a} and π(a|Θ(hi−1)) = Θ(h)a e^{−Θ(h)a²/2}, respectively. In this way, we specify a nonlinear and flexible dependency on the history.

The architecture of our model in (5) is shown in Figure 2. Different from a traditional RNN, the outputs ai are sampled from π rather than obtained by deterministic transformations; this is what a "stochastic" policy means. Random sampling allows the policy to explore the temporal space. Furthermore, each sampled time point is fed back into the RNN. The proposed model captures the fact that the state hi is determined by two parts: the deterministic influence of the previous hidden state hi−1, and the stochastic influence of the latest sampled action ai. The action ai is sampled from the distribution π(a|Θ(hi−1)), whose parameter Θ(hi−1) depends on the previous hidden state, and is fed back to influence the current hidden state hi. In some sense, our RNN with stochastic neurons mimics the event-generating mechanism of doubly stochastic point processes, such as the Hawkes process and the self-correcting process.
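To make the stochastic policy concrete, here is a minimal sketch (our own illustrative code, not the authors' implementation): a tiny hand-rolled RNN cell whose hidden state parameterizes an exponential density over the next inter-event time. The weight names, dimensions, and the softplus link for Θ(h) are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed, not from the paper): a tiny RNN cell.
DIM = 8
W_a = rng.normal(scale=0.1, size=(DIM,))      # input weights for the last action
W_h = rng.normal(scale=0.1, size=(DIM, DIM))  # recurrent weights
w_out = rng.normal(scale=0.1, size=(DIM,))    # maps hidden state to Theta(h)

def rnn_step(h_prev, a_prev):
    """Deterministic hidden-state update from history and the last sampled action."""
    return np.tanh(W_a * a_prev + W_h @ h_prev)

def theta(h):
    """Positive policy parameter Theta(h) > 0 via a softplus link."""
    return np.log1p(np.exp(w_out @ h)) + 1e-3

def sample_trajectory(T):
    """Roll out the stochastic policy: inter-event times a_i ~ Exp(Theta(h_{i-1}))."""
    h, t, events = np.zeros(DIM), 0.0, []
    while True:
        a = rng.exponential(1.0 / theta(h))  # stochastic action (exponential case)
        t += a
        if t >= T:
            return np.array(events)
        events.append(t)
        h = rnn_step(h, a)  # feed the sampled action back into the RNN
```

Replacing the exponential draw with a Rayleigh draw gives the other positive-density choice mentioned above; the feedback of each sampled `a` into `rnn_step` is what makes the intensity history-dependent.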
For these types of point processes, the intensity is stochastic, depending on the history, and the intensity function controls the occurrence rate of the next event.

Reward Function Class. The reward function directly quantifies the discrepancy between πE and πθ, and it guides the learning of the optimal policy π∗θ. On the one hand, we want its function class r ∈ F to be sufficiently flexible so that it can represent reward functions of various shapes. On the other hand, it should be restrictive enough to be efficiently learned with finite samples [3, 13]. With these competing considerations, we choose F to be the unit ball in an RKHS H, ‖r‖H ≤ 1. An immediate benefit of this function class is that we can show the optimal policy can be learned directly via the minimization formulation given in Theorem 1 instead of the original minimax formulation (3). A sketch of the proof is as follows. For short notation, we denote

φ(η) := ∫_{[0,T)} k(t, ·) dN_t^{(η)}   (the feature mapping from the data space to the RKHS)

and

μπθ := E_{η∼πθ}[φ(η)]   (the mean embedding of the intensity function in the RKHS),

where N_t^{(η)} is the counting process associated with the sample path η, and k(t, t′) is a universal RKHS kernel. Then, using the reproducing property,

J(πθ) := E_{η∼πθ}[ Σ_{i=1}^{N_T^{(η)}} r(ti) ] = E_{η∼πθ}[ ∫_{[0,T)} ⟨r, k(t, ·)⟩H dN_t^{(η)} ] = ⟨r, μπθ⟩H.

Similarly, we can obtain J(πE) = ⟨r, μπE⟩H.

Figure 2: Illustration of the generator πθ.
From (3), r∗ is obtained by

max_{‖r‖H≤1} min_{πθ∈G} ⟨r, μπE − μπθ⟩H = min_{πθ∈G} max_{‖r‖H≤1} ⟨r, μπE − μπθ⟩H = min_{πθ∈G} ‖μπE − μπθ‖H,

where the first equality is guaranteed by the minimax theorem, and

r∗(·|πE, πθ) = (μπE − μπθ) / ‖μπE − μπθ‖H ∝ μπE − μπθ  (6)

can be empirically evaluated from data. In this way, we change the original minimax formulation for solving π∗θ into a simple minimization problem, which is more efficient and stable to solve in practice. We summarize the formulation in Theorem 1.

Theorem 1 Let the family of reward functions be the unit ball in an RKHS H, i.e., ‖r‖H ≤ 1. Then the optimal policy obtained by (4) can also be obtained by solving

π∗θ = argmin_{πθ∈G} D(πE, πθ, H),  (7)

where D(πE, πθ, H) is the maximum expected cumulative reward discrepancy between πE and πθ,

D(πE, πθ, H) := max_{‖r‖H≤1} ( E_{ξ∼πE}[ Σ_{i=1}^{N_T^{(ξ)}} r(τi) ] − E_{η∼πθ}[ Σ_{i=1}^{N_T^{(η)}} r(ti) ] ).  (8)

Theorem 1 implies that we can transform the inverse reinforcement learning procedure of (4) into a simple minimization problem, which minimizes the maximum expected cumulative reward discrepancy between πE and πθ.
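Since r∗ ∝ μπE − μπθ, the unnormalized reward at any time t is simply the average kernel mass that expert events place at t minus that of learner events. Here is a minimal sketch of this empirical evaluation with a Gaussian kernel (our own illustrative code; the names `reward_hat` and `sigma` are assumptions):

```python
import numpy as np

def gauss_k(t, s, sigma=1.0):
    """Gaussian RBF kernel k(t, s) on the time axis."""
    return np.exp(-(t - s) ** 2 / (2.0 * sigma ** 2))

def reward_hat(t, expert_seqs, learner_seqs, sigma=1.0):
    """Unnormalized optimal reward r*(t): average kernel mass of expert
    events at t minus that of learner events at t."""
    mu_e = np.mean([np.sum(gauss_k(t, np.asarray(seq), sigma))
                    for seq in expert_seqs])
    mu_l = np.mean([np.sum(gauss_k(t, np.asarray(seq), sigma))
                    for seq in learner_seqs])
    return mu_e - mu_l
```

If expert events cluster near t = 1 and learner events near t = 5, `reward_hat` is positive near 1 and negative near 5, so maximizing cumulative reward pushes the learner to shift events toward the expert's pattern; when the two event distributions match, the reward collapses to zero everywhere.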
This enables us to sidestep the expensive computation of (4) caused by repeatedly solving the inner RL problem. More interestingly, we can derive an analytical solution to (8), given by (6).

Finite Sample Estimation. Given L trajectories of expert point processes and M trajectories of events generated by πθ, the mean embeddings μπE and μπθ can be estimated by their respective empirical means:

μ̂πE = (1/L) Σ_{l=1}^{L} Σ_{i=1}^{N_T^{(l)}} k(τ_i^{(l)}, ·)   and   μ̂πθ = (1/M) Σ_{m=1}^{M} Σ_{i=1}^{N_T^{(m)}} k(t_i^{(m)}, ·).

Then, for any t ∈ [0, T), the estimated optimal reward (without normalization) is

r̂∗(t) ∝ (1/L) Σ_{l=1}^{L} Σ_{i=1}^{N_T^{(l)}} k(τ_i^{(l)}, t) − (1/M) Σ_{m=1}^{M} Σ_{i=1}^{N_T^{(m)}} k(t_i^{(m)}, t).  (9)

Note that this empirical estimator is biased at the points τ_i^{(l)} and t_i^{(m)}; for simplicity, an unbiased estimator is provided later in Algorithm RLPP.

Kernel Choice. The unit ball in an RKHS is dense and expressive. Fundamentally, our proposed framework and theoretical results are general and can be directly applied to other types of kernels. For example, we can use the Matérn kernel, which generates spaces of differentiable functions known as Sobolev spaces [10, 2]. In our later experiments, we used the Gaussian kernel and obtained promising results.

5 Learning Algorithm

Learning via Policy Gradient. In practice, instead of minimizing D(πE, πθ, H) as in (7), we can equivalently minimize D(πE, πθ, H)², since squaring is a monotonic transformation. Now we can learn π∗θ from the RL formulation (2) using policy gradient with variance reduction. First, with the
First, with the\nlikelihood ratio trick, the gradient of \u2207\u03b8D(\u03c0E, \u03c0\u03b8,H)2 can be computed as\n\n(cid:18)(cid:88)N \u03b7\n\nT\n\n\u02c6r\u2217(ti)\n\ni=1\n\n(cid:19)\uf8f9\uf8fb ,\n\n(10)\n\n\u2207\u03b8D(\u03c0E, \u03c0\u03b8,H)2 = E\u03b7\u223c\u03c0\u03b8\n\n(\u2207\u03b8 log \u03c0\u03b8(ai|\u0398(hi\u22121))) \u00b7\n\nwhere (cid:80)N \u03b7\n\nT\n\ni=1 (\u2207\u03b8 log \u03c0\u03b8(ai|\u0398(hi\u22121))) is the gradient of the log-likelihood of a roll-out sample\n\n\u03b7 = {t1, . . . , tN \u03b7\n\n} using the learner policy \u03c0\u03b8.\n\nT\n\n\uf8ee\uf8f0 N \u03b7\nT(cid:88)\n\ni=1\n\n5\n\n\fAlgorithm RLPP: Mini-batch Reinforcement Learning\nfor Learning Point Processes\n1. Initialize model parameters \u03b8;\n2. For number of training iterations do\n\n1\n\n};\n\n};\n\nN\n\n(l)\nT\n\n, . . . , t(m)\nNT\n\n1 , . . . , \u03c4 (l)\n\n(cid:88)M\n\n(cid:18)(cid:88)N\n\n\u2022 Sample minibatch of L trajectories of events\n{\u03be(1), . . . , \u03be(L)} from expert policy \u03c0E, where\n\u03be(l) = {\u03c4 (l)\n\u2022 Sample minibatch of M trajectories of events\n{\u03b7(1), . . . , \u03b7(M )} from learner policy \u03c0\u03b8, where\n\u03b7(m) = {t(m)\n(cid:19)\n\u2022 Estimate policy gradient \u2207\u03b8D(\u03c0E, \u03c0\u03b8,H)2 as\nwhere log p\u03b8(\u03b7(m)) =(cid:80)N \u03b7\n\u2207\u03b8\n\u2217\n) log p\u03b8(\u03b7(m))\n\u02c6r\ni=1 (log \u03c0\u03b8(ai|\u0398(hi\u22121)))\nis the log-likelihood of the sample \u03b7(m), and\nr\u2217(t(m)\n) can be estimated by L expert trajectories\n(cid:88)N\nand (M \u2212 1) roll-out samples without \u03b7(m)\n\u2217\nM(cid:88)\n\u02c6r\n\n(cid:88)N\n\n(cid:88)L\n\nk(\u03c4 (l)\ni\n(m(cid:48) )\n\n(t(m)) =\n\n1\nM\n\n(t(m)\n\ni\n\n(m)\nT\n\nl=1\n\ni=1\n\nm=1\n\ni=1\n\n, t)\nk(t(m(cid:48))\n\nj\n\n, t);\n\n1\nL\n\u2212 1\n\nM \u2212 1\n\nT\n\n(l)\nT\n\ni\n\n(a)\n\n(b)\n\nFigure 3: The reward function \u02c6r\u2217(t) is estimated us-\ning 100 sampled sequences from \u03c0E and \u03c0\u03b8. 
In (a), r̂∗(t) > 0 when the expert's intensity is above the learner's intensity, and r̂∗(t) < 0 when the expert's intensity is below the learner's intensity. To maximize the cumulative reward given the current reward, the learner should generate more events in the regions where r̂∗(t) > 0 and fewer events where r̂∗(t) < 0. By our formulation, the optimal reward function always quantifies the discrepancy between the expert and the current learner by considering the worst case. As a result, once the learner changes, the current optimal reward r̂∗(t) is updated accordingly, and r̂∗(t) guides the learner to update its policy towards mimicking the expert's behavior until they exactly match in (b), where r̂∗(t) becomes zero.

To reduce the variance of the gradient, we can exploit the observation that future actions do not depend on past rewards. This leads to the variance-reduced gradient estimate

∇θD(πE, πθ, H)² = E_{η∼πθ}[ Σ_{i=1}^{N_T^η} ∇θ log πθ(ai|Θ(hi−1)) · ( Σ_{l=i}^{N_T^η} [r̂∗(tl) − bl] ) ],

where Σ_{l=i}^{N_T^η} r̂∗(tl) is referred to as the "reward to go" and bl is a baseline that further reduces the variance. The overall procedure is given in Algorithm RLPP. In the algorithm, after we sample M trajectories from the current policy, we use one trajectory η(m) for evaluation and the remaining M − 1 samples to estimate the reward function. Example reward functions learned at different stages of the algorithm are also illustrated in Figure 3.

Comparison with MLE.
During training, our generative model directly compares the generated temporal events with the observed events to iteratively correct its mistakes, which can effectively avoid model misspecification. Since the training only involves the policy gradient, it bypasses the intractability of the log-survival term in the likelihood (Eq. (1)). On the other hand, because the learned policy is in fact the conditional density of a point process, our approach still resembles the form of MLE in the RL reformulation and can thus be interpreted in a statistically principled way.

Comparison with GAN and GAIL. By Theorem 1, our policy is learned directly by minimizing the discrepancy between πE and πθ, which has a closed-form expression. Thus, we convert the original IRL problem into a minimization problem with only one set of parameters, those of the policy. In each training iteration with the policy gradient, we have an unbiased estimator of the gradient, and the estimated reward function also depends on the current policy πθ. In contrast, the GAN and GAIL formulations have two sets of parameters, for the generator and the discriminator. Their gradient estimators are biased because each min/max problem is in fact nonconvex and cannot be solved in one shot. Thus, our framework is more stable and efficient than the minimax formulation for learning point processes.

6 Experiments

We evaluate our algorithm by comparing with the state-of-the-art on both synthetic and real datasets.

Synthetic datasets. To show the robustness of our approach to model misspecification, we propose the following four different point processes as the ground truth: (I) Inhomogeneous Poisson (IP) with λ(t) = at + b, where a = −0.2 and b = 3.5; here we omit st since λ(t) does not depend on ti
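The inhomogeneous Poisson baseline (I) can be simulated with standard Ogata-style thinning. The sketch below is our own illustration (the sampler and its parameter names are not from the paper), using λ(t) = at + b with a = −0.2, b = 3.5 and clamping the intensity at zero where the line goes negative.

```python
import numpy as np

def thinning_sample(a=-0.2, b=3.5, T=10.0, seed=0):
    """Sample one trajectory of an inhomogeneous Poisson process with
    lambda(t) = max(a*t + b, 0) on [0, T] via thinning: propose candidates
    from a homogeneous process with rate lam_max, accept each candidate t
    with probability lambda(t) / lam_max."""
    rng = np.random.default_rng(seed)
    lam = lambda t: max(a * t + b, 0.0)
    lam_max = max(lam(0.0), lam(T))  # a linear intensity peaks at an endpoint
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)   # candidate from dominating process
        if t > T:
            return np.array(events)
        if rng.uniform() < lam(t) / lam_max:  # thinning (acceptance) step
            events.append(t)
```

The same thinning scheme works for any intensity with a computable upper bound, which is how one would generate expert trajectories for the other synthetic ground-truth processes as well.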