1 Introduction
Hierarchical planners have been widely studied in the artificial intelligence community. One of the main reasons is that hierarchical planners can divide complex planning problems, which flat planners cannot solve, into series of simpler subproblems by using high-level knowledge about the planning problem (e.g.,
[Nilsson1984, Choi and Amir2009, Kaelbling and Lozano-Pérez2011]). A hierarchical planner is composed of multiple planner layers, typically divided into two types: high-level and low-level. A low-level planner performs micro-level planning, and deals with raw information about an environment. In contrast, a high-level planner performs macro-level planning, and deals with more abstract symbolic information. The raw and abstract symbolic information are mapped to each other by symbol grounding functions. Imagine that a hierarchical planner is used to control a humanoid robot that must put a lemon on a board. Here, the high-level planner makes a plan such as "Pick a lemon up, and then put it on a board." The low-level planner makes a plan for controlling the robot's motors according to sensor inputs, to achieve the subgoals given by the high-level planner (e.g., "Pick a lemon up"). As the low-level planner cannot understand what "Pick a lemon up" means, the symbol grounding function converts it into actual values in the environment that the low-level planner can understand.
Hierarchical planners are often used for supporting human decision making (e.g., in supply chains [Özdamar et al.1998] or clinical operations [Fdez-Olivares et al.2011]). In such cases, people make decisions on the basis of a plan, and thus it is necessary that 1) they understand the plan (especially that of the high-level planner) and 2) they can reach satisfying outcomes by following the plan (i.e., the hierarchical planner gives appropriate plans).
In many previous studies on hierarchical planners, symbol grounding functions and high-level planners were designed manually [Nilsson1984, Malcolm and Smithers1990, Cambon et al.2009, Choi and Amir2009, Dornhege et al.2009, Wolfe et al.2010, Kaelbling and Lozano-Pérez2011]. Although this makes it possible for people to understand the plans easily, considerable human effort is needed to carefully design a hierarchical planner that provides appropriate plans.
Konidaris et al. [Konidaris et al.2014, Konidaris et al.2015, Konidaris2016] have proposed frameworks for automatically constructing symbol grounding functions and high-level planners, but they require a human to carefully analyze the constructed modules to understand the plans. These modules are often complicated and, in such cases, the analysis becomes a burden.
In this paper, we propose a framework that automatically refines manually-designed symbol grounding functions and high-level planners with a policy gradient method. Our framework differs from those proposed in the aforementioned previous studies in the following ways:

Unlike hierarchical planners based solely on manually-designed symbol grounding functions and high-level planners [Nilsson1984, Malcolm and Smithers1990, Cambon et al.2009, Choi and Amir2009, Dornhege et al.2009, Wolfe et al.2010, Kaelbling and Lozano-Pérez2011], our framework refines these modules without human intervention. This automated refinement reduces the design workload for the modules.

Unlike the frameworks that automatically construct symbol grounding functions and high-level planners [Konidaris et al.2014, Konidaris et al.2015, Konidaris2016], our framework refines these modules while attempting to keep the resulting symbol grounding as consistent as possible with prior knowledge of the definitions of the symbols (see Section 4). Therefore, a person can understand the plans that the high-level planner outputs without carefully analyzing the refined modules.
In this paper, we first explain our hierarchical planner (including the high-level planner and symbol grounding functions) and how its modules are designed (Section 3). Then, we introduce the framework designed to refine them (Section 4). Finally, we experimentally demonstrate the effectiveness of our framework (Section 5).
2 Preliminaries
Our framework, introduced in Section 4, is based on semi-Markov decision processes (SMDPs) and policy gradient methods.
2.1 Semi-Markov Decision Processes
SMDPs are a framework for modeling a decision problem in an environment where the sojourn time in each state is a random variable. An SMDP is defined as a tuple ⟨S, O, R, P, γ⟩. S is the d-dimensional continuous state space; O(s) is a function that returns the finite set of options [Sutton et al.1999] available in the environment's state s; R(s', τ | s, o) is the reward received when option o, executed at s, results in arriving in state s' after τ time steps; P(s', τ | s, o) is the probability of arriving in state s' after τ time steps when executing o in s; and γ is a discount factor. Given an SMDP, our interest is to find an optimal policy over options π*:
(1) π* = argmax_π J(π),
(2) J(π) = E[ Σ_{k=0}^{∞} γ^{t_k} R(s_{k+1}, τ_k | s_k, o_k) ],
where t_k = Σ_{k'<k} τ_{k'}, and (s_k, o_k, τ_k, s_{k+1}) are transitions of a state, an option, the time steps elapsed while executing the option, and the arriving state after executing the option.
2.2 Policy Gradient
To find π*, we use a policy gradient method [Sutton et al.2000]. In a policy gradient method, a policy π_θ parameterized by θ is introduced to approximate π*, and the approximation is performed by updating θ with a gradient. Although there are many policy gradient implementations (e.g., [Kakade2002, Silver et al.2014, Schulman et al.2015]), we use REINFORCE [Williams1992]. In REINFORCE, θ is updated as follows:
(3) θ ← θ + α Δθ,
(4) Δθ = Σ_{k=0}^{K} ∇_θ log π_θ(o_k | s_k) Σ_{k'=k}^{K} γ^{t_{k'}} R(s_{k'+1}, τ_{k'} | s_{k'}, o_{k'}),
where α is a learning rate and (s_k, o_k, τ_k, s_{k+1}) are transitions of a state, the executed option, the elapsed time steps, and the arriving state, which are sampled on the basis of π_θ in a time horizon of K options. Other variables and functions are the same as those introduced in Section 2.1. We decided to use REINFORCE for our work because it has successfully worked in recent work [Silver et al.2016, Das et al.2017].
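As a concrete sketch of this update, the following implements REINFORCE for a state-independent softmax policy over a small set of options, with SMDP-style discounting by sojourn times. The state independence and the episode format (option index, sojourn time, reward) are simplifying assumptions for illustration, not the implementation used later in the paper.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, episode, alpha=0.1, gamma=0.99):
    """One REINFORCE update for a softmax policy over options.

    theta:   option preferences (the policy parameters)
    episode: list of (option_index, sojourn_time, reward) transitions
    """
    probs = softmax(theta)          # policy is state-independent in this sketch
    grad = [0.0] * len(theta)
    t = 0                           # time steps elapsed before option k starts
    for k, (o, tau, _) in enumerate(episode):
        # SMDP return from option k onward, discounted by sojourn times
        g, tt = 0.0, 0
        for (_, tau2, r2) in episode[k:]:
            g += (gamma ** tt) * r2
            tt += tau2
        for i in range(len(theta)):
            ind = 1.0 if i == o else 0.0
            grad[i] += (gamma ** t) * (ind - probs[i]) * g  # d log pi / d theta_i
        t += tau
    return [th + alpha * gd for th, gd in zip(theta, grad)]
```

For an episode in which option 0 received a positive reward, the update raises the preference for option 0 relative to the others.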
3 Hierarchical Planner with Symbol Grounding Functions
In this section, we first describe the outline of a hierarchical planner (including the high-level planner) with symbol grounding functions, which are manually designed. We then provide concrete examples. The high-level planner and symbol grounding functions described here are refined by the framework proposed in Section 4.
The hierarchical planner (Figure 1) is composed of two symbol grounding functions (one for abstraction and the other for concretization), a high-level planner, a low-level planner, and two knowledge bases (one each for the high-level and low-level planners). These modules work as follows:
 Step 1

: The symbol grounding function for abstraction receives raw information, abstracts it into symbolic information on the basis of its knowledge base, and then outputs the abstract symbolic information.
 Step 2

: The high-level planner receives the abstract symbolic information, makes a plan using its knowledge base, and then outputs abstract symbolic information as a subgoal, which indicates the next abstract state to be achieved.
 Step 3

: The symbol grounding function for concretization receives the abstract symbolic information, concretizes it into raw information about a subgoal, which specifies an actual state to be achieved, and then outputs the raw information about the subgoal. This module performs the concretization on the basis of its knowledge base.
 Step 4

: The low-level planner receives the raw information about the subgoal and then interacts with the environment to achieve it. In the interaction, the low-level planner outputs primitive actions in accordance with the raw information given by the environment. The interaction continues until the low-level planner achieves the given subgoal, or until the total number of elapsed time steps reaches a given threshold.
 Step 5

: If the raw information from the environment does not represent a goal or terminal state, return to Step 1.
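The five steps above can be sketched as a control loop. The module interfaces (`abstract`, `plan_next_subgoal`, `concretize`, `low_level_control`) and the environment API are illustrative assumptions, not the paper's implementation:

```python
def run_hierarchical_planner(env, abstract, plan_next_subgoal, concretize,
                             low_level_control, max_low_level_steps=20):
    """Steps 1-5 as a control loop. `env` is assumed to expose
    reset()/step(action)/is_terminal(state)/achieved(state, subgoal)."""
    s = env.reset()
    while not env.is_terminal(s):
        x = abstract(s)                       # Step 1: raw -> abstract symbol
        x_goal = plan_next_subgoal(x)         # Step 2: next symbolic subgoal
        s_goal = concretize(x_goal)           # Step 3: symbol -> raw subgoal
        for _ in range(max_low_level_steps):  # Step 4: low-level interaction
            a = low_level_control(s, s_goal)
            s = env.step(a)
            if env.achieved(s, s_goal) or env.is_terminal(s):
                break
        # Step 5: loop back to Step 1 unless a goal/terminal state is reached
    return s
```

A toy environment (e.g., integer states on a line with unit subgoals) is enough to exercise the loop's structure.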
The knowledge bases for the symbol grounding functions and the high-level planner are designed manually.
 Knowledge base for the high-level planner

is described in a simple planning domain definition language (PDDL) [McDermott et al.1998]. In PDDL, objects, predicates, goals, and operators are manually specified. The objects and predicates are for building logical formulae, which specify the possible states in the planning domain. The operators are represented as pairs of preconditions and effects. The preconditions represent the states required for applying the operator, and the effects represent the arriving states after applying the operator. We use PDDL in this work because it is widely used for describing knowledge bases for symbolic planners.
 Knowledge base for symbol grounding functions

is described as a list of mappings between abstract symbolic information and corresponding raw information. In this paper, to simplify the problem, we assume that each item of abstract symbolic information is mapped to one interval of raw information. Despite its simplicity, this representation is useful for representing, for example, typical spatial information.
Here, we describe the knowledge bases and how the hierarchical planner works to solve the mountain car problem [Moore1991] (Figure 2). In this problem, a car is placed within a deep valley, and its goal is to drive out by going up the right side hill. However, as the car’s engine is not strong enough, it needs to first drive back and forth between the two hills to generate momentum. In this problem, the hierarchical planner receives raw information (the position and velocity of the car) from the environment and is required to make a plan to move it to the goal (the top of the right side hill).
An example of the knowledge base for the high-level planner is shown in Table 1. In this example, the objects are composed of only a "Car." The predicates are composed of four instances ("Bottom_of_hills(), On_right_side_hill(), On_left_side_hill(), and At_top_of_right_side_hill()"). For example, "On_right_side_hill(Car)" means that the car is on the right side hill. The operators are composed of three types that refer to transitions of objects on the hills. For example, "Opr.1" refers to the transition in which an object moves from the bottom of the hills to the right side hill.
Objects  Car
Predicates  Bottom_of_hills(), On_right_side_hill(), On_left_side_hill(), At_top_of_right_side_hill()
Goals  At_top_of_right_side_hill(Car)
Operators  Preconditions  Effects
Opr.1  Bottom_of_hills()  On_right_side_hill()
Opr.2  On_right_side_hill()  On_left_side_hill()
Opr.3  On_left_side_hill()  At_top_of_right_side_hill()
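The knowledge base of Table 1 can be encoded as data, with a small breadth-first search standing in for a PDDL planner. Real planners handle multi-literal states; this single-literal version is a deliberate simplification:

```python
from collections import deque

# Knowledge base of Table 1, encoded as named (precondition, effect) operators.
OPERATORS = {
    "Opr.1": ("Bottom_of_hills", "On_right_side_hill"),
    "Opr.2": ("On_right_side_hill", "On_left_side_hill"),
    "Opr.3": ("On_left_side_hill", "At_top_of_right_side_hill"),
}
GOAL = "At_top_of_right_side_hill"

def plan(state, goal=GOAL, operators=OPERATORS):
    """Breadth-first search over chains of applicable operators."""
    queue = deque([[state]])
    visited = {state}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for pre, eff in operators.values():
            if pre == path[-1] and eff not in visited:
                visited.add(eff)
                queue.append(path + [eff])
    return None
```

Starting from the bottom of the hills, this search recovers the four-step plan described later in the worked example.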
An example of the knowledge base for the symbol grounding functions is shown in Table 2. This example shows mappings between abstract symbolic information (the location of the car) and corresponding intervals of raw information (the actual value of the car's position). For example, "Bottom_of_hills(Car)" is mapped to the interval [-0.6, -0.4] of the car's position.
Abstract symbolic information  Interval of raw information
Bottom_of_hills(Car)  position ∈ [-0.6, -0.4]
On_right_side_hill(Car)  position ∈ [·, ·]
On_left_side_hill(Car)  position ∈ [·, ·]
At_top_of_right_side_hill(Car)  position ∈ [·, ·]
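A minimal sketch of how such an interval map can back the abstraction function of Step 1. Only the Bottom_of_hills interval is stated in the text; the other bounds below are illustrative placeholders:

```python
# Interval map in the spirit of Table 2. Only the Bottom_of_hills interval is
# given in the text; the other bounds are illustrative placeholders.
GROUNDING = {
    "Bottom_of_hills(Car)":           (-0.6, -0.4),
    "On_right_side_hill(Car)":        (-0.4, 0.2),    # placeholder bounds
    "On_left_side_hill(Car)":         (-1.2, -0.6),   # placeholder bounds
    "At_top_of_right_side_hill(Car)": (0.2, 0.6),     # placeholder bounds
}

def abstract(position):
    """Symbol grounding for abstraction: return the first symbol whose
    interval contains the raw position."""
    for symbol, (lo, hi) in GROUNDING.items():
        if lo <= position <= hi:
            return symbol
    return None
```

With this map, the starting position of the car falls into the Bottom_of_hills interval.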
Given the knowledge described in Tables 1 and 2, an example of how the hierarchical planner works is shown as follows:
 Example of Step 1:

The symbol grounding function for abstraction receives raw information (position = -0.5 and velocity = 0). The position is in the interval [-0.6, -0.4], which corresponds to "Bottom_of_hills(Car)" in Table 2. Therefore, the symbol grounding function outputs "Bottom_of_hills(Car)."
 Example of Step 2:

The high-level planner receives "Bottom_of_hills(Car)" and makes a plan to achieve the goal ("At_top_of_right_side_hill(Car)"). By using the knowledge in Table 1, the high-level planner makes the plan [Bottom_of_hills(Car) → On_right_side_hill(Car) → On_left_side_hill(Car) → At_top_of_right_side_hill(Car)], which means "Starting at the bottom of the hills, visit, in order, the right side hill, the left side hill, and the top of the right side hill." Following the plan, the high-level planner outputs "On_right_side_hill(Car)."
 Example of Step 3:

The symbol grounding function receives "On_right_side_hill(Car)" and concretizes it into raw information about the subgoal (position = -0.1, velocity = *). Here, the position in the raw information is determined as the mean of the corresponding interval in Table 2. In addition, the mask (represented by "*") is applied to filter out factors in the raw information that are irrelevant to the subgoal (i.e., velocity in this example).
 Example of Step 4:

The low-level planner receives position = -0.1 and the mask. To move the car to the given subgoal (position = -0.1), the low-level planner makes a plan to accelerate the car. This planning is performed by model predictive control [Camacho and Alba2013]. The low-level planner terminates when the car arrives at the given subgoal (position = -0.1), or when it has taken a primitive action 20 times.
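The mask semantics of Steps 3 and 4 can be sketched as a subgoal test that ignores masked dimensions; the tolerance value is an illustrative assumption:

```python
def achieved(state, subgoal, tol=0.05):
    """True if every unmasked dimension of the raw subgoal is (approximately)
    reached; dimensions masked with '*' are ignored. The tolerance is an
    illustrative choice, not a value from the paper."""
    return all(g == "*" or abs(s - g) <= tol for s, g in zip(state, subgoal))
```

For the mountain car subgoal (position = -0.1, velocity = *), any velocity is accepted as long as the position is close enough.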
4 Framework for Refining Grounding Functions and High-Level Planner
In this section, we propose a framework for refining the symbol grounding functions and the high-level planner introduced in the previous section. In our framework, symbol grounding and high-level planning, which are based on manually-designed knowledge bases, are modeled with SMDPs. Refinement of the symbol grounding functions and the high-level planner is achieved by applying policy gradients to the model. We first introduce an abstract model, then provide an example of its implementation in the mountain car problem, and finally explain how the policy gradient method is applied to the model.
4.1 Modeling Symbol Grounding and High-Level Planning with SMDPs
We model symbol grounding and high-level planning, which are based on manually-designed knowledge bases, with SMDPs. The symbol grounding functions and the high-level planner are modeled as components of a parameterized policy. In addition, the knowledge bases are modeled as priors on the policy's parameters.
We first assume that the information and modules that appear in hierarchical planning are represented as random variables and probability functions, respectively (Figure 1). Suppose that X is the set of all possible symbols that the symbol grounding functions and the high-level planner deal with, raw information is represented as a d-dimensional vector, and A is the set of all possible primitive actions. We denote raw information by s (the denotation is the same as that of the state described in Section 2.1, because raw information is modeled as the state), abstract symbolic information by x, abstract symbolic information about a subgoal by x_g, raw information about a subgoal by s_g, and a primitive action by a. In addition, we denote the symbol grounding function for abstraction by p_ab(x | s; θ_sg), the symbol grounding function for concretization by p_co(s_g | x_g; θ_sg), the high-level planner by p_hp(x_g | x; θ_hp), the low-level planner by π_ll, the environment by E, the knowledge base for the symbol grounding functions by K_sg, and the knowledge base for the high-level planner by K_hp. Here, θ_sg and θ_hp are the parameters of the symbol grounding functions and the high-level planner, respectively. High-level planning and symbol grounding based on the knowledge bases are modeled as an SMDP (Figure 3). In this model, the components of the SMDP (i.e., an option, a state, a reward, and a transition probability) are implemented as follows:
 Option o:

o is implemented as a tuple (x, x_g, s_g) of abstract symbolic information x, abstract symbolic information about a subgoal x_g, and raw information about a subgoal s_g.
 State s:

s is implemented as raw information.
 Reward R:

R is the cumulative reward given by the environment E while the low-level planner is interacting with E.
 Transition probability P:

P is implemented as a function that represents the state transition produced by the interaction between the low-level planner and the environment E. Note that although the transition probability receives the option o = (x, x_g, s_g), only s_g is used in the transition probability.
In this model, the parameterized policy π_θ is implemented to control the abstraction of raw information, high-level planning, and the concretization of abstract symbolic information, in accordance with the knowledge bases. Formally, π_θ is implemented as follows:
(5) π_θ(o | s) = p(x, x_g, s_g | s; θ) = p_ab(x | s; θ_sg) p_hp(x_g | x; θ_hp) p_co(s_g | x_g; θ_sg).
The right-hand side can be derived by decomposing the joint probability p(x, x_g, s_g | s; θ) in accordance with the probabilistic dependency shown in Figure 3. Note that, in this equation, θ is represented as (θ_sg, θ_hp), i.e., a concatenation of θ_sg and θ_hp. By using this representation for θ, we can derive an update expression that refines θ_sg and θ_hp while keeping them consistent with K_sg and K_hp. See Section 4.3 for details.
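The factorization of Eq. (5) amounts to multiplying the three module probabilities. A minimal sketch, with the three distributions passed in as callables (the toy distributions in any usage are illustrative assumptions):

```python
def policy(option, s, p_ab, p_hp, p_co):
    """Probability of option o = (x, x_goal, s_goal) given raw state s,
    factored as abstraction * high-level planning * concretization."""
    x, x_goal, s_goal = option
    return p_ab(x, s) * p_hp(x_goal, x) * p_co(s_goal, x_goal)
```

Gradients of log π_θ then split into a sum over the three modules, which is what makes the per-module refinement in Section 4.3 possible.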
Priors p(θ_sg; ψ_sg) and p(θ_hp; ψ_hp) are needed to reflect the manually-designed knowledge bases. To do so, p(θ_sg; ψ_sg) and p(θ_hp; ψ_hp) are first implemented as parametric distributions, and their hyperparameters ψ_sg and ψ_hp are determined to replicate the manually-designed symbol grounding functions and high-level planner. More formally, we use ψ*_sg and ψ*_hp as the optimal parameters of p(θ_sg; ψ_sg) and p(θ_hp; ψ_hp), respectively, acquired by the following equations:
(6) ψ*_sg = argmin_{ψ_sg} D_sg(p(θ_sg; ψ_sg), K_sg),
(7) ψ*_hp = argmin_{ψ_hp} D_hp(p(θ_hp; ψ_hp), K_hp),
where D_sg and D_hp are divergences (e.g., KL divergence) from the manually-designed symbol grounding functions and high-level planner, respectively. D_sg and D_hp are abstract criteria, and thus there are many possible implementations of the functionals "argmin D_sg" and "argmin D_hp."
4.2 An Example of Model Implementation to Solve the Mountain Car Problem
We introduced an abstract model for symbol grounding and high-level planning with knowledge bases in the previous section. In this section, we provide an example of an implementation of the model for solving the mountain car problem.
First, X and A are implemented as follows:
(8) X = {Bottom_of_hills(Car), On_right_side_hill(Car), On_left_side_hill(Car), At_top_of_right_side_hill(Car)},
(9) A = {a_1, ..., a_n},
X is implemented in accordance with the knowledge shown in Table 2. A is implemented in accordance with the definition of actions for solving the mountain car problem, and is represented as a set of values for the acceleration of the car.
Second, the probabilities of the modules in the hierarchical planner are implemented as follows:
(10) p_ab(x | s; θ_sg) = N(s_pos; μ_x, σ_x) / Σ_{x'∈X} N(s_pos; μ_{x'}, σ_{x'}),
(11) p_co(s_g | x_g; θ_sg) = N(s_g,pos; μ_{x_g}, σ_{x_g}),
(12) p_hp(x_g | x; θ_hp) = exp(w · φ(x, x_g)) / Σ_{x'_g∈X} exp(w · φ(x, x'_g)),
p_ab is implemented as the normalized likelihood of a normal distribution (Eq. (10)), and p_co is implemented as a normal distribution (Eq. (11)). In Eq. (10) and Eq. (11), N(·; μ_x, σ_x) represents a normal distribution for a symbol x, parameterized by mean μ_x and standard deviation σ_x, s.t. x ∈ X, and s_pos and s_g,pos denote the position components of s and s_g, respectively. p_hp is implemented as a softmax function (Eq. (12)). In Eq. (12), φ is a base function that returns a one-hot vector in which only the element corresponding to the value of (x, x_g) is set to 1, and the other elements are set to 0. w is a weight vector, s.t. w ∈ R^{|X|·|X|}. In this implementation, θ_sg is a vector composed of the μ_x and σ_x, s.t. x ∈ X, and θ_hp is the vector w. The low-level planner and the environment are implemented as deterministic functions, which represent the model predictive controller and the simulator of the environment, respectively. Third, the reward function is implemented as follows:
(13) R(s', τ | s, o) = Σ_{t=1}^{τ} γ^t r(s_t, a_t),
(14) r(s_t, a_t) = r_low(pos_t, a_t),
where s_t and a_t are a state and a primitive action sampled from the environment t time steps after option o starts executing. Eq. (14) represents the "low-level" reward r, which is fed in accordance with a_t and the car position pos_t included in s_t.
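The module distributions of Eqs. (10) and (12), a normalized normal likelihood and a softmax over subgoal symbols, can be sketched as follows. The symbol names and parameter values used in any example are illustrative, and only the position component of the state is used, as in the text:

```python
import math

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def p_abstraction(position, params):
    """Normalized normal likelihood over symbols (in the spirit of Eq. (10));
    params maps symbol -> (mean, std)."""
    lik = {x: normal_pdf(position, mu, sd) for x, (mu, sd) in params.items()}
    z = sum(lik.values())
    return {x: v / z for x, v in lik.items()}

def p_highlevel(x, weights, symbols):
    """Softmax over subgoal symbols (in the spirit of Eq. (12)); weights maps
    (x, x_goal) -> preference, corresponding to w with a one-hot base function."""
    prefs = [weights.get((x, xg), 0.0) for xg in symbols]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return dict(zip(symbols, (e / z for e in exps)))
```

Both functions return proper distributions (they sum to 1), and a position close to a symbol's mean, or a larger weight for a subgoal, raises the corresponding probability.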
Fourth, p(θ_sg; ψ_sg) and p(θ_hp; ψ_hp) are implemented as follows:
(15) p(θ_sg; ψ_sg) = Π_{x∈X} N(μ_x; ψ_sg,x, 1) U(σ_x),
(16) p(w_i; ψ_hp,i) = N(w_i; ψ_hp,i, 1),
Eq. (15) represents a distribution for the μ_x and σ_x. The component for μ_x is a normal distribution, which has mean ψ_sg,x and standard deviation 1, and the component for σ_x is a uniform distribution U. In addition, Eq. (16) represents the normal distribution for w_i, the i-th element of w. This distribution has mean ψ_hp,i and standard deviation 1. Note that, in this implementation, ψ_sg and ψ_hp are the vectors composed of the ψ_sg,x and the ψ_hp,i, respectively. Finally, the functionals in Eq. (6) and Eq. (7) are implemented as follows:
 Implementation of "argmin D_sg":

Using Eq. (6), ψ_sg is determined so that the prior mean ψ_sg,x for each μ_x replicates the manually-designed grounding, i.e., the center of the corresponding interval in Table 2.
 Implementation of "argmin D_hp":

Using Eq. (7), ψ_hp is determined by Algorithm 1. The algorithm is outlined as follows: first, every element of ψ_hp is initialized with a low value (lines 1–3), and if the operator in which x refers to the preconditions and x_g refers to the effects is contained in the knowledge base K_hp, the corresponding weight is initialized with a high value (lines 4–11). K_hp is initialized in accordance with Table 1 before it is passed to the algorithm.
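The initialization performed by Algorithm 1 can be sketched as follows. The names `w_high` and `w_low` stand for the two initial weight values, and the defaults follow the values reported in the experiment section (1.3 and 0.02); both the names and this reading of the algorithm are assumptions:

```python
def init_hp_prior(symbols, kb_operators, w_high=1.3, w_low=0.02):
    """Sketch of Algorithm 1: every (precondition, effect) pair starts at a
    low prior weight; pairs listed as operators in the knowledge base get a
    high one."""
    psi = {(pre, eff): w_low for pre in symbols for eff in symbols}
    for pre, eff in kb_operators:
        psi[(pre, eff)] = w_high
    return psi
```

Operators present in the knowledge base thus start with a strong prior preference, while unseen transitions remain available but unlikely.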
4.3 Refining Symbol Grounding and High-Level Planning with Policy Gradients
Refining the high-level planner and symbol grounding functions (i.e., θ_hp and θ_sg) is achieved by the parameter update in Eq. (17):
(17) θ ← θ + α ( Δθ + ∇_θ log p(θ_sg; ψ_sg) + ∇_θ log p(θ_hp; ψ_hp) ),
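Assuming the unit-variance normal priors of Section 4.2, the gradient of the log-prior reduces to (ψ - θ), and one step of an Eq. (17)-style update can be sketched as follows (the penalty coefficient `lam` mirrors the small coefficient mentioned in the experiments):

```python
def update_with_penalty(theta, psi, reinforce_grad, alpha=0.01, lam=1.0):
    """Gradient step combining a REINFORCE term with a penalty term. For a
    unit-variance normal prior p(theta; psi), grad log p(theta) = psi - theta,
    which pulls each parameter back toward its manually-designed prior value."""
    return [t + alpha * (g + lam * (p - t))
            for t, p, g in zip(theta, psi, reinforce_grad)]
```

With a zero reinforcement gradient, the update strictly moves each parameter toward its prior, which is the behavior the penalty term is meant to provide.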
Eq. (17) contains two unique terms: a reinforcement term Δθ and a penalty term ∇_θ log p(θ_sg; ψ_sg) + ∇_θ log p(θ_hp; ψ_hp). The reinforcement term contributes to updating the parameters to maximize the expected cumulative reward, as in standard reinforcement learning. The penalty term contributes to keeping the parameters consistent with the priors (i.e., the manually-designed knowledge bases). This update is derived by substituting π_θ p(θ_sg; ψ_sg) p(θ_hp; ψ_hp) and Eq. (5) into Eq. (3). Using the example described in Section 4.2, μ_x, σ_x, and w are updated by this equation. In this case, the penalty term prevents μ_x and w_i, for all x ∈ X and i, from moving far away from ψ_sg,x and ψ_hp,i, respectively.
5 Experiments
In this section, we perform an experimental evaluation to investigate whether the symbol grounding functions and the high-level planner are refined successfully by the framework proposed in the previous section. In Section 5.1, we focus on refining the symbol grounding functions only. Then, in Section 5.2, we evaluate the effect of jointly refining the symbol grounding functions and the high-level planner.
5.1 Refinement of Symbol Grounding
We evaluate how the symbol grounding functions are refined by our framework for solving the mountain car problem. The experimental setup for implementing the planner and our framework is the same as that of the example introduced in Sections 3 and 4.
For the evaluation, we prepared three methods:
 Baseline:

A hierarchical planner that uses the manually-designed symbol grounding functions and high-level planner. This planner is identical to the one introduced in the example in Section 3.
 NoPenalty:

The framework that refines the symbol grounding functions without the penalty term in Eq. (17). In this method, the high-level planner is the same as that in Baseline.
 Proposed:

The framework that refines the symbol grounding functions with the penalty term. In this method, the high-level planner is the same as that in Baseline.
These methods were evaluated on the basis of two criteria: the average cumulative reward over episodes, and a parameter divergence. The former evaluates whether the hierarchical planner produces more appropriate plans by refining its modules, and the latter evaluates the interpretability of the refined modules. The parameter divergence represents how much the policy's parameters θ_sg (we assume θ_sg dominatingly determines the behaviors of the symbol grounding functions) refined by the framework differ from the initial parameters. In this paper, this divergence is measured by the Euclidean distance between the refined parameter θ_sg and its initial value. The initial values for the means μ_x and standard deviations σ_x are shown as "Init" in Table 3. Each μ_x is initialized with ψ_sg,x, which is determined on the basis of the implementation of the functional in Eq. (6) (see Section 4.2), and each σ_x is manually determined. We consider 50 episodes as one epoch and performed refinement over 2000 epochs.
The experimental results (shown in Figures 4 and 5) show that 1) refining the grounding functions improves the performance (average cumulative reward) of the hierarchical planner, and 2) considering the penalty term keeps the refined parameters within a certain distance of the initial parameters. Regarding 1), Figure 4 shows that the methods in which the grounding functions are refined (NoPenalty and Proposed) outperform Baseline. This result indicates that the refinement of the grounding functions successfully improves performance. Regarding 2), Figure 5 shows that the parameters in NoPenalty move away from the original parameters during refinement, while in Proposed, the parameters stay close to the original ones.
An example of the refined parameters of the grounding functions for Proposed is shown in Table 3, which indicates that the parameters were updated to achieve high-performance planning while staying close to the original parameters. In this example, the mean and standard deviation for "On_right_side_hill(Car)" changed significantly through refinement: the mean was biased toward a more negative position, and the distribution was flattened, to make the car climb up the left side hill quickly (Figure 7). As a result, the refined symbol grounding function considers the center position to be "On_right_side_hill(Car)." The main interpretation of this result is that the symbol grounding function was refined to reduce redundancy in high-level planning. With the original symbol grounding functions, the center position is grounded to "Bottom_of_hills(Car)," and the high-level planner makes the plan [Bottom_of_hills(Car) → On_right_side_hill(Car) → On_left_side_hill(Car) → At_top_of_right_side_hill(Car)], which means "Starting at the bottom of the hills, visit, in order, the right side hill, the left side hill, and the top of the right side hill." However, this plan is redundant; the car does not need to visit the right side hill first. With the refined symbol grounding function, the center position is grounded to "On_right_side_hill(Car)," and thus the high-level planner produces the plan [On_right_side_hill(Car) → On_left_side_hill(Car) → At_top_of_right_side_hill(Car)], in which the redundancy is removed. It should also be noted that the order of the refined means is intuitively correct; for example, the refined mean for "On_right_side_hill(Car)" is higher than that for "Bottom_of_hills(Car)" (i.e., it refers to a position further to the right). This property cannot be seen in the Baseline and NoPenalty cases. This result supports the claim that our framework refines the modules while maintaining their interpretability.
  Bottom_of_hills  At_top_of_right_side_hill  On_right_side_hill  On_left_side_hill
μ Init  -0.5  0.6  -0.2  -1.1
μ Refined  -0.5  0.46  -0.39  -1.1
σ Init  0.4  0.1  0.4  0.3
σ Refined  0.4  0.12  1.42  0.11
5.2 Joint Refinement of Symbol Grounding and High-Level Planning
In this section, we refine both the symbol grounding functions and the high-level planner. The setup of the hierarchical planner and the problem are the same as those of the previous section, except for the knowledge base for the high-level planner. We removed "Opr.2" (shown in Table 1) and used this degraded version as the knowledge base for the experiment. This degradation makes room for refining the knowledge base for the high-level planner. In addition, we put a small coefficient on the penalty term for θ_hp, because we found in a preliminary experiment that weighting this term too heavily makes the refinement worse. As long as the results of the symbol grounding functions are interpretable, the result of the high-level planner is interpretable as well. θ_hp is initialized with ψ_hp, which is determined by Algorithm 1, where we set 1.3 as the initial weight for operators contained in the knowledge base and 0.02 as the initial weight for the others. The resulting ψ_hp is shown as "Init" in Table 4.
We prepared three methods:
 NoRefining:

A hierarchical planner with the degraded version of the knowledge base for the high-level planner. The knowledge base for the symbol grounding functions is the same as that shown in Table 2.
 RefiningHP:

The framework that refines the high-level planner only. In this method, the symbol grounding functions are the same as those in NoRefining.
 RefiningHPSGF:

The framework that refines both the symbol grounding functions and the high-level planner.
From the experimental results (Figure 7), we can confirm that our framework successfully refines both the symbol grounding functions and the high-level planner from the viewpoint of performance. RefiningHP outperforms NoRefining, and RefiningHPSGF outperforms the other methods.
Table 4 provides an example of how the high-level planner was refined. It indicates that the dropped knowledge (i.e., Opr.2) was successfully reacquired during refinement, and that knowledge making high-level planning more efficient was newly discovered. Considering the form of Eq. (12), an operator whose corresponding weight element has a higher value contributes more to high-level planning; such operators are therefore worthwhile as knowledge for high-level planning. In Table 4, the refined weight of the operator (preconditions = On_right_side_hill, effects = On_left_side_hill) is higher than those of the other operators whose preconditions contain On_right_side_hill. This operator was removed initially and later reacquired through refinement. Similarly, the operator (preconditions = Bottom_of_hills, effects = On_left_side_hill), which is not shown in Table 1, was newly acquired.
Refined (Init)  Bottom_of_hills  At_top_of_right_side_hill  On_right_side_hill  On_left_side_hill 

Bottom_of_hills  5.88 (1.3)  6.34 (1.3)  3.15 (1.3)  6.65 (1.3) 
At_top_of_right_side_hill  9.04 (1.3)  9.75 (1.3)  4.76 (1.3)  2.5 (0.02) 
On_right_side_hill  0.98 (0.02)  1 (1.3)  2.03 (1.3)  1.34 (1.3) 
On_left_side_hill  0.85 (1.3)  2.12 (1.3)  1.74 (1.3)  11.71 (1.3) 
6 Conclusion
In this paper, we proposed a framework that refines manually-designed symbol grounding functions and a high-level planner. Our framework refines these modules with policy gradients. Unlike standard policy gradient implementations, our framework additionally considers a penalty term to keep the parameters close to the prior parameters derived from the manually-designed modules. Experimental results showed that our framework successfully refined the parameters of the modules; it improves the performance (cumulative reward) of the hierarchical planner and keeps the parameters close to those derived from the manually-designed modules.
One limitation of our framework is that it deals only with predefined symbols (such as "Bottom_of_hills") and does not discover new symbols. We plan to address this drawback in future work. We also plan to evaluate our framework in a more complex domain where primitive actions and states are high-dimensional and the knowledge base is represented with a more complex description (e.g., preconditions containing multiple states).
References
 [Camacho and Alba2013] Eduardo F. Camacho and Carlos Bordons Alba. Model predictive control. Springer Science & Business Media, 2013.
 [Cambon et al.2009] Stéphane Cambon, Rachid Alami, and Fabien Gravot. A hybrid approach to intricate motion, manipulation and task planning. The International Journal of Robotics Research, 28(1):104–126, 2009.
 [Choi and Amir2009] Jaesik Choi and Eyal Amir. Combining planning and motion planning. In Proc. of ICRA-09, pages 238–244. IEEE, 2009.
 [Das et al.2017] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv:1703.06585, 2017.
 [Dornhege et al.2009] Christian Dornhege, Marc Gissler, Matthias Teschner, and Bernhard Nebel. Integrating symbolic and geometric planning for mobile manipulation. In Proc. of SSRR-09, pages 1–6. IEEE, 2009.
 [Fdez-Olivares et al.2011] Juan Fdez-Olivares, Luis Castillo, Juan A. Cózar, and Oscar García Pérez. Supporting clinical processes and decisions by hierarchical planning and scheduling. Computational Intelligence, 27(1):103–122, 2011.
 [Kaelbling and Lozano-Pérez2011] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In Proc. of ICRA-11, pages 1470–1477. IEEE, 2011.
 [Kakade2002] Sham M. Kakade. A natural policy gradient. In Proc. of NIPS-02, pages 1531–1538, 2002.
 [Konidaris et al.2014] George Konidaris, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Constructing symbolic representations for high-level planning. In Proc. of AAAI-14, 2014.
 [Konidaris et al.2015] George Konidaris, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Symbol acquisition for probabilistic high-level planning. In Proc. of IJCAI-15, 2015.
 [Konidaris2016] George Konidaris. Constructing abstraction hierarchies using a skill-symbol loop. In Proc. of IJCAI-16, 2016.
 [Malcolm and Smithers1990] Chris Malcolm and Tim Smithers. Symbol grounding via a hybrid architecture in an autonomous assembly system. Robotics and Autonomous Systems, 6(1-2):123–144, 1990.
 [McDermott et al.1998] Drew McDermott, Malik Ghallab, Adele Howe, Craig Knoblock, Ashwin Ram, Manuela Veloso, Daniel Weld, and David Wilkins. PDDL: the planning domain definition language. 1998.
 [Moore1991] Andrew Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, March 1991.
 [Nilsson1984] Nils J. Nilsson. Shakey the robot. Technical report, SRI International, Menlo Park, CA, 1984.
 [Özdamar et al.1998] Linet Özdamar, M. Ali Bozyel, and S. Ilker Birbil. A hierarchical decision support system for production planning (with case study). European Journal of Operational Research, 104(3):403–422, 1998.
 [Schulman et al.2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proc. of ICML-15, pages 1889–1897, 2015.
 [Silver et al.2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proc. of ICML-14, 2014.
 [Silver et al.2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [Sutton et al.1999] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 [Sutton et al.2000] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. of NIPS-00, pages 1057–1063, 2000.
 [Williams1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
 [Wolfe et al.2010] Jason Andrew Wolfe, Bhaskara Marthi, and Stuart J. Russell. Combined task and motion planning for mobile manipulation. In Proc. of ICAPS-10, 2010.