A New Learning Algorithm for Optimal Stopping


Discrete Event Dyn Syst (2009) 19:91–113

DOI 10.1007/s10626-008-0055-2

Vivek S. Borkar · Jervis Pinto · Tarun Prabhu

Received: 1 July 2007 / Accepted: 14 October 2008 / Published online: 1 November 2008

© Springer Science+Business Media, LLC 2008

[…] Using this formulation, a reinforcement learning scheme based on a primal-dual method and incorporating a sampling device called 'split sampling' is proposed and analyzed. An illustrative example from option pricing is also included.

Keywords Learning algorithm · Optimal stopping · Linear programming · Primal-dual methods · Split sampling · Option pricing

Research supported in part by a grant from General Motors India Pvt. Ltd. This author thanks Arnab Basu for some useful inputs.

V. S. Borkar (corresponding author)
Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005, India
e-mail: borkar@tifr.res.in

J. Pinto · T. Prabhu
St. Francis Institute of Technology, Mumbai 400103, India

Present Address:
J. Pinto
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA
e-mail: pinto@eecs.oregonstate.edu

Present Address:
T. Prabhu
School of Computing, University of Utah, Salt Lake City, UT 84112, USA
e-mail: tarunp@cs.utah.edu

1 Introduction

In recent years, there has been much activity in developing reinforcement learning algorithms for approximate dynamic programming for Markov decision processes (MDPs), based on real or simulated data. This is useful when an exact analytical or numerical solution is either infeasible or expensive. See Bertsekas and Tsitsiklis (1996) and Sutton and Barto (1998) for book length treatments of this issue and an overview. Some examples of such schemes are Q-learning, actor-critic, and TD(λ) (Bertsekas and Tsitsiklis 1996). These can be viewed as being derived from the traditional iterative schemes for MDPs, such as value or policy iteration, with additional structure (such as an approximation architecture or additional averaging) built on top. There is, however, a third computational scheme for classical MDPs such as finite/infinite horizon discounted cost or average cost, viz., the linear programming approach. The 'primal' formulation of this is a linear program (LP for short) over the so called 'occupation measures'. Its 'dual' is an LP over functions. An extensive account of these developments may be found in Hernández-Lerma and Lasserre (1996). Our aim here is to formulate a learning scheme for the optimal stopping problem based on an old formulation of optimal stopping as a linear program in the spirit of the aforementioned.

Some relevant literature is as follows:

• The LP formulation of optimal stopping is the same as the 'minimal excessive majorant function' characterization of the value function (for maximization problems) that dates back to Dynkin (1963), or the 'maximal subsolution' (for minimization problems) as in Bensoussan (1982), Chapter III, Section 5.1. The computational implications have been explored in Cho and Stockbridge (2002). Our scheme leads to an alternative to those proposed in, e.g., Choi and Van Roy (2006), Longstaff and Schwartz (2001), Tsitsiklis and Van Roy (1999, 2001), Yu and Bertsekas (2007), for option pricing, which is perhaps the prime application area for optimal stopping at present. These are based on the classical learning algorithms such as Q-learning, not on the LP formulation.

• Also motivated by finance problems, Andersen and Broadie (2004), Haugh and Kogan (2004), and Rogers (2002) arrive at a formulation akin to ours via an abstract duality result, but their emphasis is on finding bounds on the solution via Monte Carlo. Our scheme differs from a 'pure' Monte Carlo in that it is a reinforcement learning scheme. As observed in Ahamed et al. (2006), such a scheme can be viewed as a cross between pure Monte Carlo and pure (deterministic) numerical schemes, with its per iterate computation more than the former but less than the latter, and its fluctuations (variance) more than the latter (which is zero) and less than the former. The key difference with pure Monte Carlo is that our scheme is based on one step conditional averaging rather than averaging, which leads to the differences mentioned above.

We present the LP formulation and the algorithm in the next section. Section 3 provides the mathematical justification for the scheme. The focus of this work is primarily theoretical. Nevertheless, Section 4 describes numerical experiments for a simple illustrative option pricing model. Section 5 sketches an extension to the infinite horizon discounted cost problem to indicate the broader applicability of the approach.

2 The algorithm

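Consider a discrete time Markov chain $\{X_n\}$ taking values in a compact metric space $S$ with the transition probability kernel $p(dy|x)$. Let $N > 0$ be a prescribed integer. Given a bounded continuous function $g : S \to \mathbb{R}$ and a discount factor $0 < \alpha < 1$, our objective is to find the optimal stopping time $\tau^*$ that maximizes

$$E\left[\alpha^{N \wedge \tau}\, g(X_{N \wedge \tau})\right] \tag{1}$$

over all stopping times $\tau$ w.r.t. the natural filtration of $\{X_n\}$. A standard dynamic programming argument then tells us that the value function

$$V^*_n(x) \stackrel{\mathrm{def}}{=} \sup E\left[\alpha^{(N \wedge \tau) - n}\, g(X_{N \wedge \tau}) \,\middle|\, X_n = x\right],$$

where the supremum is over all stopping times $\geq n$, satisfies

$$V^*_n(x) = g(x) \vee \alpha \int V^*_{n+1}(y)\, p(dy|x), \quad 0 \leq n < N, \tag{2}$$

$$V^*_N(x) = g(x). \tag{3}$$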
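The recursion in Eqs. 2 and 3 can be illustrated on a small finite state space, where the integral becomes a sum over next states. The following is a minimal sketch under assumptions that are not in the paper: the five-state reflecting random walk, the put-like payoff `g`, and the helper names `backward_induction` and `random_walk_kernel` are all hypothetical choices made for illustration.

```python
def backward_induction(P, g, alpha, N):
    """Return [V_0, ..., V_N] with V_N = g and V_n = max(g, alpha * P V_{n+1})."""
    S = len(g)
    V = [None] * (N + 1)
    V[N] = list(g)                                   # Eq. 3: terminal condition
    for n in range(N - 1, -1, -1):                   # Eq. 2: step backwards in time
        cont = [alpha * sum(P[x][y] * V[n + 1][y] for y in range(S))
                for x in range(S)]                   # discounted continuation value
        V[n] = [max(g[x], cont[x]) for x in range(S)]
    return V


def random_walk_kernel(S):
    """Reflecting nearest-neighbour walk on {0, ..., S-1} (illustrative assumption)."""
    P = [[0.0] * S for _ in range(S)]
    for x in range(S):
        P[x][max(x - 1, 0)] += 0.5
        P[x][min(x + 1, S - 1)] += 0.5
    return P


P = random_walk_kernel(5)
g = [max(2.0 - x, 0.0) for x in range(5)]            # put-like payoff (assumption)
V = backward_induction(P, g, alpha=0.95, N=10)
```

In this toy setting it is optimal to stop at time n in state x whenever `V[n][x] == g[x]`, which is how the recursion encodes the stopping rule.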

Our scheme will be based on the following observation that essentially goes back to Dynkin, whose proof is included for the sake of completeness:

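Theorem 1 $\{V^*_n\}$ above is given by the solution to the LP:

Minimize $V_0(x_0)$, where $x_0$ is the initial state, s.t.

$$V_n(x) \geq g(x), \quad 0 \leq n \leq N,$$

$$V_n(x) \geq \alpha \int p(dy|x)\, V_{n+1}(y), \quad 0 \leq n < N.$$

Proof Note that $\{V^*_n\}$ is feasible for this LP. At the same time, if $\{V_n\}$ is any other feasible solution, then

$$V_n(x) = \zeta_n(x) + g(x) \vee \alpha \int V_{n+1}(y)\, p(dy|x), \quad 0 \leq n < N,$$

$$V_N(x) = \zeta_N(x) + g(x),$$

for some $\zeta_n(\cdot) \geq 0$, $0 \leq n \leq N$. This is simply the dynamic programming equation for the optimal stopping problem with reward

$$E\left[\sum_{m=n}^{N \wedge \tau} \alpha^m \zeta_m(X_m) + \alpha^{N \wedge \tau}\, g(X_{N \wedge \tau})\right].$$

(Note that for the decision to stop at time $n$ in state $x$, the 'running reward' $\zeta_n(x)$ will also be granted in addition to the 'stopping reward' $g(x)$.) A standard argument then shows that

$$V_0(x) = \sup E\left[\sum_{m=0}^{N \wedge \tau} \alpha^m \zeta_m(X_m) + \alpha^{N \wedge \tau}\, g(X_{N \wedge \tau}) \,\middle|\, X_0 = x\right] \geq V^*_0(x),$$

where the supremum is over all stopping times. Thus the solution of the above LP does indeed coincide with the value function. □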
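The two facts used in the proof can be checked numerically on a finite chain: the dynamic programming solution satisfies the LP constraints, and any solution built with nonnegative slack ζ dominates it pointwise. This is only a sketch; the three-state kernel, the payoff, the random slack and all function names below are assumptions made for illustration, and the code checks the inequalities rather than solving the LP.

```python
import random


def cont_value(P, V_next, alpha, x):
    """Discounted one-step conditional expectation: alpha * sum_y p(y|x) V_next(y)."""
    return alpha * sum(p * v for p, v in zip(P[x], V_next))


def value_function(P, g, alpha, N):
    """Backward recursion: V_N = g, V_n = max(g, alpha P V_{n+1})."""
    V = [list(g)]
    for _ in range(N):
        nxt = V[0]
        V.insert(0, [max(g[x], cont_value(P, nxt, alpha, x)) for x in range(len(g))])
    return V


def is_feasible(V, P, g, alpha, tol=1e-12):
    """Check both families of LP constraints at every time and state."""
    N, S = len(V) - 1, len(g)
    stop_ok = all(V[n][x] >= g[x] - tol for n in range(N + 1) for x in range(S))
    cont_ok = all(V[n][x] >= cont_value(P, V[n + 1], alpha, x) - tol
                  for n in range(N) for x in range(S))
    return stop_ok and cont_ok


def with_slack(P, g, alpha, zeta):
    """Build V_N = zeta_N + g and V_n = zeta_n + max(g, alpha P V_{n+1})."""
    N, S = len(zeta) - 1, len(g)
    V = [None] * (N + 1)
    V[N] = [zeta[N][x] + g[x] for x in range(S)]
    for n in range(N - 1, -1, -1):
        V[n] = [zeta[n][x] + max(g[x], cont_value(P, V[n + 1], alpha, x))
                for x in range(S)]
    return V


# Hypothetical three-state example.
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
g = [1.0, 0.0, 2.0]
alpha, N = 0.9, 4
rng = random.Random(0)
zeta = [[rng.random() for _ in g] for _ in range(N + 1)]   # nonnegative slack

V_star = value_function(P, g, alpha, N)
V_slack = with_slack(P, g, alpha, zeta)
```

Since every feasible point can be written in the `with_slack` form by taking ζ to be the constraint slack, the pointwise domination of `V_star` by `V_slack` is the finite-state analogue of the inequality at the end of the proof.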

It is worth noting that: (i) the constraints above need hold only a.s. with respect to the law of $X_n$ for each $n$, and (ii) the nonnegativity constraints $V_n(\cdot) \geq 0$ do not have to be explicitly incorporated. By Lagrange multiplier theory (Luenberger 1968, p. 216), the above optimization problem becomes

$$\min_{V} \max_{\Lambda}\, L(V, \Lambda) \tag{4}$$

where $\Lambda = [\Lambda_1(N, dx), \Lambda_2(n, dx), \Lambda_3(n, dx), 0 \leq n < N]$ is the Lagrange multiplier (a string of positive measures on $S$) and the Lagrangian $L(V, \Lambda)$ is given by

$$L(V, \Lambda) \stackrel{\mathrm{def}}{=} V_0(x_0) + \int \Lambda_1(N, dx)\,\big(g(x) - V_N(x)\big) + \sum_{n=0}^{N-1} \int \Lambda_2(n, dx)\,\big(g(x) - V_n(x)\big) + \sum_{n=0}^{N-1} \int \Lambda_3(n, dx)\left(\alpha \int p(dy|x)\, V_{n+1}(y) - V_n(x)\right). \tag{5}$$
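To make the structure of Eq. 5 concrete, the sketch below evaluates the Lagrangian on a finite state space, where each multiplier measure reduces to a list of nonnegative weights. The three-state chain, the payoff and the names `lagrangian`, `lam1`, `lam2`, `lam3` are assumptions for illustration only. For a feasible V every penalty term is nonpositive, so the Lagrangian cannot exceed V_0(x_0); for an infeasible V the inner maximization can push it above V_0(x_0).

```python
def lagrangian(V, lam1, lam2, lam3, P, g, alpha, x0):
    """Finite-state version of Eq. 5; lam1[x], lam2[n][x], lam3[n][x] are weights >= 0."""
    N, S = len(V) - 1, len(g)
    total = V[0][x0]
    total += sum(lam1[x] * (g[x] - V[N][x]) for x in range(S))       # terminal constraint
    for n in range(N):
        for x in range(S):
            cont = alpha * sum(P[x][y] * V[n + 1][y] for y in range(S))
            total += lam2[n][x] * (g[x] - V[n][x])                    # stopping constraint
            total += lam3[n][x] * (cont - V[n][x])                    # continuation constraint
    return total


# Hypothetical three-state example.
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
g = [1.0, 0.0, 2.0]
alpha, N, x0, S = 0.9, 4, 1, 3

# Feasible V: the dynamic programming solution of Eqs. 2 and 3.
V_star = [None] * (N + 1)
V_star[N] = list(g)
for n in range(N - 1, -1, -1):
    V_star[n] = [max(g[x], alpha * sum(P[x][y] * V_star[n + 1][y] for y in range(S)))
                 for x in range(S)]

ones1 = [1.0] * S
ones = [[1.0] * S for _ in range(N)]
zeros1 = [0.0] * S
zeros = [[0.0] * S for _ in range(N)]

L_feasible = lagrangian(V_star, ones1, ones, ones, P, g, alpha, x0)
L_no_penalty = lagrangian(V_star, zeros1, zeros, zeros, P, g, alpha, x0)

# Infeasible V (identically zero violates V_n >= g wherever g > 0).
V_zero = [[0.0] * S for _ in range(N + 1)]
L_infeasible = lagrangian(V_zero, ones1, ones, ones, P, g, alpha, x0)
```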

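Now we shall describe a gradient scheme for estimating $V^*_n(x)$ using linear function approximations for $\{V^*_n\}$ and the square-roots of the Lagrange multipliers. For this, first suppose that $\Lambda_i(n, dx) = \lambda_i(n, x)\, m(dx)$, $1 \leq i \leq 3$, for some probability measure $m(dx)$ on $S$ with full support.¹ In particular, the $\lambda_i(n, \cdot)$'s are nonnegative. Approximate $V^*_n(x)$ and $\sqrt{\lambda_1}$, $\sqrt{\lambda_2}$ and $\sqrt{\lambda_3}$ as follows. Let $r \in \mathbb{R}^t$, $q_1 \in \mathbb{R}^{s_1}$, $q_2 \in \mathbb{R}^{s_2}$ and $q_3 \in \mathbb{R}^{s_3}$, with

$$V^*_n(x) \approx \sum_{k'=1}^{t} r(k')\, \phi_{k'}(n, x),$$

$$\sqrt{\lambda_1(n, x)} \approx \sum_{j=1}^{s_1} q_1(j)\, \varphi_{1j}(n, x), \quad \sqrt{\lambda_2(n, x)} \approx \sum_{j=1}^{s_2} q_2(j)\, \varphi_{2j}(n, x), \quad \sqrt{\lambda_3(n, x)} \approx \sum_{j=1}^{s_3} q_3(j)\, \varphi_{3j}(n, x),$$

where $\{\phi_k, \varphi_{ij}\}$ are basis functions or 'features' selected a priori. Squaring the last three expressions above gives an approximation to $\lambda_1(n, x)$, $\lambda_2(n, x)$ and $\lambda_3(n, x)$, ensuring their nonnegativity automatically.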
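The parameterization above can be sketched in a few lines. The feature maps below (low-order polynomials in x plus a normalized time index) are purely hypothetical stand-ins, since the choice of features is left open at this point; the names `phi`, `varphi`, `V_hat` and `lam_hat` are likewise assumptions. The point of the sketch is that squaring the linear combination keeps the approximate multiplier nonnegative for any weight vector q, including one with negative entries.

```python
def phi(n, x, N):
    """Hypothetical value-function features phi_k(n, x)."""
    return [1.0, x, x * x, n / N]


def varphi(n, x, N):
    """Hypothetical multiplier features varphi_j(n, x)."""
    return [1.0, x, n / N]


def V_hat(r, n, x, N):
    """Linear approximation of V_n(x): sum_k r(k) * phi_k(n, x)."""
    return sum(rk * fk for rk, fk in zip(r, phi(n, x, N)))


def lam_hat(q, n, x, N):
    """Approximate multiplier: the square of sum_j q(j) * varphi_j(n, x)."""
    root = sum(qj * fj for qj, fj in zip(q, varphi(n, x, N)))
    return root * root


N = 10
r = [0.5, -1.0, 2.0, 0.3]
q = [-1.0, 0.4, 2.0]                      # negative entries are allowed
grid = [(n, x / 4.0) for n in range(N + 1) for x in range(5)]
lam_values = [lam_hat(q, n, x, N) for n, x in grid]
```

A design consequence of this choice is that no projection or clipping step is needed to keep the multipliers nonnegative during gradient updates, at the cost of making the Lagrangian quadratic rather than linear in q.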

¹ This itself can be the first step of the approximation procedure if the $\Lambda_i(n, dx)$ are not absolutely continuous w.r.t. $m(dx)$, the latter usually being some 'natural' candidate such as the normalized Lebesgue measure. For example, when $m(dx)$ is the normalized Lebesgue measure, convolution of $\Lambda_i(n, dx)$ with a smooth approximation of the Dirac measure would give the desired approximation.


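Then the original Lagrangian Eq. 5 is approximated in terms of $r, q_1, q_2, q_3$ by

$$L(r, q) = \sum_{k'} r(k')\, \phi_{k'}(0, x_0) + \int \Big(\sum_{l} q_1(l)\, \varphi_{1l}(N, x)\Big)^2 \Big(g(x) - \sum_{k'} r(k')\, \phi_{k'}(N, x)\Big)\, m(dx) + \sum_{n=0}^{N-1} \int \Big(\sum_{l} q_2(l)\, \varphi_{2l}(n, x)\Big)^2 \cdots$$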