
Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources


Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack W. Davidson, and Mary Lou Soffa

Department of Computer Science

University of Virginia, Charlottesville, VA 22904

Email: {wwang, td8h, jom5x, lt8f, jwd, soffa}@virginia.edu

Abstract: With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome the difficulties of simultaneously managing multiple resources with thread mapping, the interaction between various microarchitectural resources and thread characteristics must be well understood.

This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate the interaction of various thread mappings with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units and the cores. For each resource, the analysis provides guidelines for how to improve its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies depending on the workloads. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers and off-chip memory interconnects as one resource, and we present a new metric called L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.

I. INTRODUCTION

Compared to traditional uniprocessors, chip multiprocessors (CMPs) greatly improve system throughput by offering computational resources that allow multiple threads to execute in parallel. To realize the full potential of these powerful platforms, efficiently managing the resources that are shared by these simultaneously executing threads has become a critical issue.

In this paper, we focus on managing CMP shared resources through thread mapping. Previous research has shown that thread mapping is a powerful tool for managing resources [6, 8, 17, 22, 26]. However, despite the intensive and extensive research on this topic, properly mapping threads to achieve the optimal performance for an arbitrary workload is still an open question. Finding the optimal thread mapping is extremely difficult because one must consider all relevant resources, and the interaction between these many resources is workload dependent. Previous research has shown that L2 caches, the front-side bus and prefetchers have to be considered when managing memory hierarchy resources [29]. However, it remains unclear whether there are additional resources that should be considered, and how to holistically improve their utilization based on the workload characteristics.

Holistically managing multiple resources with thread mapping poses three major challenges.

1) The first challenge is to identify the key resources that need to be considered by thread mapping algorithms. Neglecting the key resources would result in suboptimal performance.

2) The second challenge is to determine how to map threads to improve the utilization of each key resource. The best thread mapping also depends on the thread run-time characteristics. For each key resource, we need to identify the related thread run-time characteristics and determine how to map threads when they exhibit these characteristics.

3) The third challenge is to handle situations where no thread mapping can improve the utilization of all key resources. Under such circumstances, thread mapping algorithms must prioritize the resources and focus on improving the utilization of resources that can provide the maximum benefit.

Previous research on thread mapping focused on improving the utilization of the resources within the memory hierarchy [29] or focused only on an individual resource [6, 17, 26]. Although the proposed approaches are successful in improving the utilization of these resources, the best application performance is not always guaranteed [28]. Moreover, most previous work has been done using single-threaded workloads, while emerging workloads increasingly include multi-threaded programs. Multi-threaded workloads have different run-time characteristics and thus require different mapping strategies.

As a first step towards overcoming the challenges of holistically managing multiple resources, we provide an in-depth performance analysis of all possible thread mappings for a set of workloads created using applications from the multi-threaded PARSEC benchmark suite [4]. While other work has looked at the memory hierarchy, in this work we take a holistic look at both the memory resources and processor resources (e.g., branch predictors, memory disambiguation unit, etc.), and evaluate their relative importance. In this analysis, we identify the key resources that are responsible for performance differences.


related threads with these characteristics to improve the utilization of the key resources. Additionally, to help make trade-off decisions, we analyze the relative importance of the key resources for each workload, and investigate the reason for prioritizing some resources over the others. We observe that, by focusing on multiple resources, proper thread mapping can improve an application's performance by up to 28% over the current Linux scheduler, while consideration of only memory resources provides an improvement of only 14%.

Specifically, the contributions of this paper include:

1) An in-depth analysis that identifies the key hardware resources that must be considered by thread mapping algorithms, as well as the less important resources that do not need to be considered. Unlike previous work that considered only shared memory resources for mapping single-threaded applications, our paper demonstrates that for multi-threaded applications, thread mapping has to consider more resources and thread characteristics for better performance.

2) An analysis of how to improve each key resource's utilization with thread mapping when managing threads with different run-time characteristics. To the best of our knowledge, this analysis is the first that investigates the characteristics of multi-threaded workloads and their implications for managing both memory and processor resources with thread mapping.

3) An analysis showing that L2 caches, prefetchers and memory interconnects should be considered as one resource because of their complex interactions. We also propose a new metric, L2-misses-memory-latency-product (L2MP), to measure their aggregated performance impact.

4) An analysis that identifies the ranking of the key resources for each workload, and the reason for the ranking.

The remainder of this paper is organized as follows: Section II provides an overview of hardware resources and the thread characteristics considered in our analysis. Section III identifies the key resources for thread mapping algorithms. Section IV analyzes how to improve the utilization of individual resources via thread mapping. Section V discusses using the resource rankings to simultaneously manage multiple resources. Section VI summarizes the performance results. Section VII discusses related work and Section VIII concludes the paper.

II. PERFORMANCE ANALYSIS OVERVIEW

To address the challenges mentioned above, we perform a comprehensive analysis of how an application's performance is affected when threads with various characteristics share multiple hardware resources. This section gives an overview of the resources, the metrics, the run-time characteristics, and the thread mappings that are covered in this analysis.

A. Hardware Resources

We address the resources that are commonly available on current CMP processors. Fig. 1 gives a schematic view of the resources provided by an Intel quad-core processor. The resources we consider can be classified into two categories: the memory hierarchy resources and the processor resources.

Memory hierarchy resources include L1 instruction caches (I-cache), L1 data caches (D-cache), instruction and data translation look-aside buffers (I/D TLB), L2 caches, hardware prefetchers and the off-chip memory interconnect. We use an Intel Core 2 processor, which has two prefetch mechanisms, Data Prefetch Logic (DPL) and L2 Streaming Prefetch [16]. DPL fetches a stream of instructions and data from memory if a stride memory access pattern is detected. L2 Streaming Prefetch brings two adjacent cache lines into an L2 cache.

Processor resources include the cores and components that require training to function, such as branch predictors and memory disambiguation units. In this paper, we use the term Resource Cores when we discuss the core as a resource.

B. Metrics

The number of cache misses, the number of memory transactions and the memory access latency are used to evaluate the utilization of the memory hierarchy resources. The number of mispredictions and the stalled CPU cycles due to these mispredictions are used to evaluate the training-based processor resources.

Processor utilization is used to evaluate the utilization of Resource Cores. Note that this processor utilization is viewed from the OS perspective. For example, suppose there is one thread that runs solely on a core. Due to I/O operations or synchronizations, this thread spends half of its execution time in the OS waiting queue. Then the core that executes this thread has a processor utilization of only 50%. In this paper, the processor utilization refers to the overall processor utilization of all cores in the system.

We use two metrics to evaluate the performance of the applications: the number of CPU cycles consumed and the execution time. Memory resources and the training-based processor resources impact the total cycles consumed by an application. Accordingly, we use executed CPU cycles to estimate the performance impact of these resources. The execution time, on the other hand, is affected by both processor utilization and the executed cycles. We use execution time when evaluating the performance impact of all the resources. The relation of execution time, executed cycles and processor utilization is described by equation (1).

$$\text{ExecTime} = \frac{\text{ExecCycles}}{\text{Num} \times \text{Proc} \times \text{Frequency}} \qquad (1)$$

Num and Frequency refer to the number of cores and the processor frequency, respectively, and Proc is the processor utilization. Suppose there is an application with four threads running on a 1 GHz quad-core processor. Each of the four threads requires 500 million CPU cycles to execute, and each thread has 50% processor utilization due to I/O operations and synchronizations. Then the execution time of this application is (500M (cycles) × 4 (threads)) / (4 (cores) × 50% × 1 GHz) = 1 second.
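The arithmetic of equation (1) can be checked with a few lines of Python. This is only an illustrative sketch: the function and parameter names are hypothetical and not part of the paper's tooling.

```python
def exec_time_seconds(exec_cycles, num_cores, proc_util, frequency_hz):
    """Equation (1): ExecTime = ExecCycles / (Num * Proc * Frequency).

    exec_cycles  -- total CPU cycles consumed by all threads
    num_cores    -- number of cores (Num)
    proc_util    -- overall processor utilization as a fraction in (0, 1]
    frequency_hz -- processor frequency in Hz
    """
    if not 0.0 < proc_util <= 1.0:
        raise ValueError("proc_util must be a fraction in (0, 1]")
    return exec_cycles / (num_cores * proc_util * frequency_hz)


# Worked example from the text: four threads of 500 million cycles each,
# four cores, 50% processor utilization, 1 GHz clock.
total_cycles = 4 * 500e6
example_time = exec_time_seconds(total_cycles, num_cores=4,
                                 proc_util=0.5, frequency_hz=1e9)
print(example_time)  # 1.0 second
```

Halving the utilization doubles the execution time for the same cycle count, which is why cycles alone cannot capture the impact of Resource Cores.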

All of the metrics mentioned in this section can be acquired from performance monitoring units (PMUs) [16]. TABLE I gives the names of the PMUs we used in our analysis.

We compute memory access latency from PMUs using equation (2), proposed by Eranian [13]. Essentially, in equation (2), memory latency is computed by dividing the total cycles of all memory read transactions by the number of memory reads.

$$\text{Mem\_latency} = \frac{\text{BUS\_REQUEST\_OUTSTANDING}}{\text{BUS\_TRANS\_BRD} - \text{BUS\_TRANS\_IFETCH}} \qquad (2)$$
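As a minimal sketch of how equation (2) is evaluated, the function below divides the cycles accumulated by outstanding read requests by the number of data reads. The parameter names are hypothetical stand-ins for the numerator and the two denominator counters of equation (2), and the sample counter values are invented purely for illustration.

```python
def memory_latency_cycles(outstanding_read_cycles, burst_read_transactions,
                          ifetch_transactions):
    """Average memory read latency in cycles, following equation (2).

    The denominator removes instruction fetches from the burst read
    transactions, leaving the number of memory data reads.
    """
    data_reads = burst_read_transactions - ifetch_transactions
    if data_reads <= 0:
        raise ValueError("no data read transactions were counted")
    return outstanding_read_cycles / data_reads


# Invented counter readings, for illustration only.
example_latency = memory_latency_cycles(outstanding_read_cycles=1_200_000,
                                        burst_read_transactions=10_000,
                                        ifetch_transactions=2_000)
print(example_latency)  # 150.0 cycles per read
```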

C. Thread Characteristics

Thread characteristics include the properties of a single thread and the interactions among multiple threads.

For single-thread properties, we consider a thread's cache demand, memory bandwidth demand and I/O frequency. Additionally, to describe how threads utilize prefetchers, we introduce three metrics: prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction. We define a thread's prefetcher effectiveness as the percentage of the L2 cache misses that are eliminated when the prefetchers are turned on compared to when they are turned off. We define a thread's prefetcher excessiveness as the percentage of additional cache lines that are brought into the L2 cache when the prefetchers are turned on compared to when they are turned off. Prefetch/memory fraction is defined as the fraction of prefetching transactions in the total memory transactions when prefetchers are on. Prefetcher effectiveness measures how much the application benefits from the prefetcher; prefetcher excessiveness measures how much extra pressure is put on memory bandwidth due to prefetching activity; and prefetch/memory fraction illustrates the overall impact of prefetchers on memory bandwidth.
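The three prefetcher metrics can be written down as simple ratios. The sketch below is one plausible reading of the definitions: it assumes both percentages are taken relative to the prefetchers-off measurement, which the text does not state explicitly, and all names and sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrefetchCounts:
    l2_misses_off: int             # L2 misses with prefetchers disabled
    l2_misses_on: int              # L2 misses with prefetchers enabled
    l2_lines_in_off: int           # cache lines brought into L2, prefetchers off
    l2_lines_in_on: int            # cache lines brought into L2, prefetchers on
    prefetch_transactions_on: int  # prefetch transactions, prefetchers on
    memory_transactions_on: int    # all memory transactions, prefetchers on


def prefetcher_effectiveness(c):
    """Percentage of L2 misses removed by turning the prefetchers on."""
    return 100.0 * (c.l2_misses_off - c.l2_misses_on) / c.l2_misses_off


def prefetcher_excessiveness(c):
    """Percentage of extra cache lines fetched into L2 with prefetchers on."""
    return 100.0 * (c.l2_lines_in_on - c.l2_lines_in_off) / c.l2_lines_in_off


def prefetch_memory_fraction(c):
    """Share of memory transactions that are prefetches (prefetchers on)."""
    return c.prefetch_transactions_on / c.memory_transactions_on


# Invented counts, for illustration only.
counts = PrefetchCounts(l2_misses_off=1000, l2_misses_on=600,
                        l2_lines_in_off=1000, l2_lines_in_on=1500,
                        prefetch_transactions_on=500,
                        memory_transactions_on=1500)
print(prefetcher_effectiveness(counts))   # 40.0
print(prefetcher_excessiveness(counts))   # 50.0
print(prefetch_memory_fraction(counts))
```

Under this reading, a negative effectiveness simply means the thread incurred more L2 misses with the prefetchers on than off.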

For multiple-thread interactions, we consider data sharing, instruction sharing, and the frequency of synchronization operations. These interactions usually happen among threads from the same application. We call such threads sibling threads.

[TABLE II: thread mappings. a1 and a2 are threads from application 1 and application 2, respectively. Each application has four threads. Threads from the same application are assumed to have similar characteristics. Note that we do not consider SMT here, so when two threads are pinned to one core, they share that core in a time-multiplexing manner.]

D. Thread Mappings

In our experiments, we examined all possible thread mappings when running two multi-threaded applications, each with four threads. Therefore, there are eight threads in total. Using more threads than cores allows thorough evaluation of the resources, including L1 caches, TLBs, branch predictors, memory disambiguation units and Resource Cores. Moreover, because real application threads have synchronizations and I/O operations, they cannot use all of the cores allocated to them. Thus using more threads than cores can improve the overall processor utilization.

To guide our analysis, we choose four thread mappings that cover all resource sharing configurations (either sibling threads share a resource or non-sibling threads share a resource). All other thread mappings could be viewed as hybrid versions of these four mappings. TABLE II shows the four thread mappings on a quad-core processor running Linux. Except for the OS mapping, all mappings are done by statically pinning threads to cores using the processor affinity system call. How these four thread mappings use resources is described below.

OS Mapping (OSMap): This thread mapping is determined by the Linux scheduler. The OS tries to evenly spread the threads across the cores in the system to ensure fair processor time allocation and low processor idle time. Under this mapping, as long as there is an available core and a runnable thread, that thread is mapped to run on that core. As a result, any thread can run on any core and share the resources associated with these cores. The OSMap is used as the baseline for performance comparison.

Isolation-mapping (IsoMap): Under IsoMap, sibling threads are mapped to run on the two cores that share one L2 cache. In other words, they are isolated on that L2 cache. L1 caches, TLBs, L2 caches, hardware prefetchers and cores are shared by siblings.

Interleaving-mapping (IntMap): Under IntMap, threads from different applications are mapped to the cores in an interleaved fashion. L1 caches, TLBs and cores are still only shared by sibling threads. L2 caches and prefetchers are shared by threads from different applications.

Spreading-mapping (SprMap): Under SprMap, the four threads of each application are evenly spread across the four cores. As a result, every core executes two threads, which come from two different applications. L1 caches, TLBs, L2 caches and cores are shared by threads from two applications.
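The three static mappings can be expressed as thread-to-core tables. The sketch below is illustrative only: it assumes cores 0 and 1 share one L2 cache and cores 2 and 3 share the other (the actual numbering depends on the machine), and the thread labels and helper names are hypothetical. The `pin_thread` helper uses Python's `os.sched_setaffinity`, which wraps the Linux affinity system call; it is defined here but not invoked.

```python
import os

# Assumption: cores {0, 1} share one L2 cache and cores {2, 3} share the other.
APP1 = ["a1_t0", "a1_t1", "a1_t2", "a1_t3"]
APP2 = ["a2_t0", "a2_t1", "a2_t2", "a2_t3"]


def iso_map():
    """Siblings isolated on one L2: app 1 on cores 0-1, app 2 on cores 2-3."""
    mapping = {t: i % 2 for i, t in enumerate(APP1)}
    mapping.update({t: 2 + i % 2 for i, t in enumerate(APP2)})
    return mapping


def int_map():
    """Each core runs siblings only, but each L2 serves both applications."""
    mapping = {t: 2 * (i % 2) for i, t in enumerate(APP1)}          # cores 0, 2
    mapping.update({t: 1 + 2 * (i % 2) for i, t in enumerate(APP2)})  # cores 1, 3
    return mapping


def spr_map():
    """Every core runs one thread from each application."""
    mapping = {t: i for i, t in enumerate(APP1)}
    mapping.update({t: i for i, t in enumerate(APP2)})
    return mapping


def apps_per_core(mapping):
    """Return {core: set of application labels scheduled on that core}."""
    result = {}
    for thread, core in mapping.items():
        result.setdefault(core, set()).add(thread.split("_")[0])
    return result


def pin_thread(tid, core):
    """Pin the Linux thread with kernel thread id `tid` to one core.

    Passing 0 as `tid` selects the caller. Linux-only.
    """
    os.sched_setaffinity(tid, {core})


print(apps_per_core(iso_map()))
print(apps_per_core(int_map()))
print(apps_per_core(spr_map()))
```

Printing `apps_per_core` for each table makes the sharing pattern visible: only SprMap places threads of both applications on the same core.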

Note that, although we assume that sibling threads are identical here, some PARSEC benchmarks have sibling threads with different characteristics. However, we observe that the difference between the sibling threads is negligible. Therefore, thread mappings that only differ in the placement of sibling threads usually have similar performance.

III. KEY RESOURCES IDENTIFICATION

This section identifies the key resources for thread mapping algorithms. A resource should satisfy two criteria to be considered as a key resource for thread mapping algorithms:

1) The utilization of this resource varies considerably among different thread mappings.

2) Thread-mapping-induced utilization variations of this resource result in considerable variations in an application's performance.

The first criterion can be checked directly using PMUs. However, the second criterion requires two approaches. Although experimenting on a real machine provides a more accurate understanding of thread mappings, the ability to precisely account for each resource's performance impact is limited by the types of PMUs available in the hardware. For example, for branch predictors, there are PMUs that count the number of mispredictions, as well as the number of cycles stalled due to these mispredictions. However, for the L1 D-cache, there are only PMUs that give the number of L1 D-cache misses. There is no PMU that reports the number of cycles spent on L1 D-cache misses. Therefore, for different resources, different approaches have to be taken:

1) Direct Approach: For resources that have PMUs to measure their performance impacts, we use the readings from these PMUs directly.

2) Indirect Approach: For L1 D-caches, L2 caches and off-chip memory interconnects, there are no PMUs to directly measure their performance impact. For these resources, we first verify with PMUs that the performance variations across mappings are caused by memory stalls. Then we compare the performance of the thread mappings. If the application's performance is improved in one mapping, and only one resource's utilization is improved in this mapping, then we can conclude that it is this resource that causes the performance improvement.

A. Experimental Design

To find the key resources, we performed experiments on a real CMP machine. Here we introduce the experimental design. We use the PARSEC benchmark suite version 2.1 (with the native input set) to create our workloads because these benchmarks have a large variety of thread characteristics. TABLE III gives the run-time characteristics of the PARSEC benchmarks.

In TABLE III, data sharing, working set and synchronization operations are collected with a simulator by the PARSEC authors [4]. The amount of data sharing (high or low) refers to the percentage (high or low) of the cache lines that are shared among sibling threads. The working set here is an architectural concept which means the total size of memory touched by a benchmark. We use working set size to estimate the cache demand of a benchmark. The synchronization operations value is the total number of locks, barriers and conditions executed by a benchmark. All other characteristics are collected on an Intel Q9550 processor. The I/O time is collected by instrumenting the I/O functions. The bandwidth requirement of a benchmark is acquired by dividing the total amount of memory accessed by the execution time of the benchmark. The total amount of memory accessed equals the total number of memory transactions times the size of each transaction, which is 64 bytes. Prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction are computed following their definitions in Section II-C. The negative values of prefetcher effectiveness for swaptions and x264 suggest that these two benchmarks experience more L2 cache misses when hardware prefetchers are turned on.
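The bandwidth requirement described above reduces to a one-line computation. The following sketch uses hypothetical names and invented sample values; only the 64-byte transaction size comes from the text.

```python
TRANSACTION_BYTES = 64  # size of each memory transaction, as stated in the text


def bandwidth_bytes_per_second(memory_transactions, exec_time_seconds):
    """Total bytes accessed divided by the benchmark's execution time."""
    if exec_time_seconds <= 0:
        raise ValueError("execution time must be positive")
    return memory_transactions * TRANSACTION_BYTES / exec_time_seconds


# Invented numbers: 50 million transactions over a 10-second run.
example_bandwidth = bandwidth_bytes_per_second(50_000_000, 10)
print(example_bandwidth / 1e6, "MB/s")  # 320.0 MB/s
```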

Each workload consists of a pair of benchmarks. Therefore, we can compare the mappings where sibling threads share the resources with the mappings where non-sibling threads share the resources. We use nine PARSEC benchmarks, and thus there are 36 pairs (workloads) in total. Four benchmarks (ferret, dedup, freqmine and raytrace) are not used in our analysis due to compilation errors, configuration errors or execution errors. For each mapping, each workload is executed until the longest benchmark has finished three runs. Shorter benchmarks are restarted if the longest benchmark has not finished. The average of the results of the first three runs is presented. The variation of the results for the same mapping and workload is very small. For IsoMap, IntMap, and SprMap, the variation is less than 2%. For OSMap, the variation is higher, usually between 2% and 4%. However, since we only use OSMap as a baseline, the higher variation would not affect our conclusions.

All experiments are conducted on a platform that has an Intel quad-core Q9550 processor. Each core of this processor has one 32 KB L1 I-cache and one 32 KB L1 D-cache. Every two cores share one 6 MB L2 cache (Fig. 1). This platform has 2 GB of memory and runs Linux kernel 2.6.25. Readings from PMUs are collected with PerfMon2 [12].
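The workload count stated above follows from pairing nine benchmarks without repetition: C(9, 2) = 36. The sketch below enumerates the pairs. The benchmark list is an assumption, obtained by removing the four excluded programs from the thirteen benchmarks of PARSEC 2.1; the section itself names only some of them.

```python
from itertools import combinations

# Assumed list: PARSEC 2.1 minus ferret, dedup, freqmine and raytrace.
BENCHMARKS = [
    "blackscholes", "bodytrack", "canneal", "facesim", "fluidanimate",
    "streamcluster", "swaptions", "vips", "x264",
]

# Each workload is an unordered pair of two different benchmarks.
WORKLOADS = list(combinations(BENCHMARKS, 2))


def workloads_containing(benchmark):
    """All workloads in which the given benchmark takes part."""
    return [pair for pair in WORKLOADS if benchmark in pair]


print(len(WORKLOADS))                     # 36
print(len(workloads_containing("x264")))  # 8
```

Each benchmark appears in eight workloads, one for every other benchmark in the set.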


Fig. 2: Average L1 D-cache misses each benchmark experiences under the four mappings, normalized to OSMap. The result of swaptions is not shown here (nor in the following memory resource figures) because it has very few memory accesses.

Fig. 4: Comparison of the performance of streamcluster running with swaptions under IntMap and SprMap. Lower bar is better.

Fig. 5: Average processor utilization of each benchmark running under the four mappings, normalized to OSMap.

Fig. 3: … Comparing four mappings does not change the conclusion.

B. L1 D-Cache

We first evaluate the importance of the L1 D-cache. Fig. 2 shows the normalized average L1 D-cache misses of each benchmark under the four mappings. For each PARSEC benchmark B, there are eight workloads (or pairs) that contain benchmark B. For each mapping, we run the eight workloads, and read the L1 D-cache misses of B. Then we compute the average of the L1 D-cache misses of B for each mapping, and report the results in Fig. 2. We repeat the same process for all benchmarks and all four mappings. As Fig. 2 shows, L1 D-cache misses vary from 2% to 14%, depending on the thread mapping.
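The averaging procedure above can be sketched as follows. This is a hedged illustration, not the authors' scripts: it assumes the per-mapping average is computed first and then divided by the OSMap average, and the data layout, names and miss counts are all invented.

```python
from statistics import mean


def average_misses(samples, benchmark, mapping):
    """Average the misses of `benchmark` over all its workloads under `mapping`.

    `samples` maps (mapping, workload_pair) to {benchmark: miss_count}.
    """
    values = [per_bench[benchmark]
              for (m, pair), per_bench in samples.items()
              if m == mapping and benchmark in pair]
    return mean(values)


def normalized_misses(samples, benchmark, mapping, baseline="OSMap"):
    """Average misses under `mapping`, relative to the baseline mapping."""
    return (average_misses(samples, benchmark, mapping)
            / average_misses(samples, benchmark, baseline))


# Invented miss counts for two workloads that contain x264.
samples = {
    ("OSMap", ("x264", "vips")): {"x264": 100, "vips": 80},
    ("OSMap", ("x264", "canneal")): {"x264": 120, "canneal": 300},
    ("IsoMap", ("x264", "vips")): {"x264": 90, "vips": 70},
    ("IsoMap", ("x264", "canneal")): {"x264": 108, "canneal": 310},
}
ratio = normalized_misses(samples, "x264", "IsoMap")
print(round(ratio, 3))  # 0.9
```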

We evaluated the L1 D-cache's impact on performance with the indirect approach. Fig. 3 shows the CPU cycles and memory resource utilization of x264 running with streamcluster under IntMap and SprMap. Because no other resource's utilization changes from one mapping to the other, only memory resources are shown in the figure. Fig. 3 shows that although IntMap has more L2 misses and higher memory latency, its performance is still better than SprMap due to fewer L1 misses. Therefore, thread-mapping-induced variation of L1 D-cache misses can cause considerable performance variation.

In conclusion, L1 D-cache misses vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, the L1 D-cache should be considered as a key resource.

C. L2 Cache, Hardware Prefetchers and Off-chip Memory Interconnect

Previous research has demonstrated that thread mapping can significantly impact the utilization of L2 caches, hardware prefetchers and the off-chip memory interconnect, and consequently impact application performance [6, 17, 22, 26, 29]. Thus, these resources should be considered as key resources. Results of our experiments corroborate this conclusion. However, unlike previous research, our experiment results show that these memory resources are better viewed as one resource by thread mapping algorithms, and we provide a new metric called L2MP for evaluating their aggregated impact. The detailed discussion on this subject can be found in Section V-A.

D. Branch Predictors

In our experiments, we observe that one thread mapping could have 15 times more branch mispredictions than another mapping, which suggests that branch mispredictions vary significantly depending on thread mappings.

We evaluate the performance impact of branch predictors with the direct approach. Fig. 4 shows the performance of streamcluster and swaptions running together. Streamcluster consumes 48% more CPU cycles under SprMap than under IntMap, and swaptions consumes 8% more CPU cycles under SprMap. Fig. 4 also shows that 99% of the increased CPU cycles are caused by branch mispredictions. Therefore, the variation of branch mispredictions can produce considerable application performance variation.

In conclusion, branch mispredictions vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, branch predictors should be considered as a key resource.

E. L1 I-cache, I/D TLBs and Memory Disambiguation Units

Different thread mappings have a great impact on the utilization of L1 I-caches, I/D TLBs and memory disambiguation units. One thread mapping can have more than ten times more misses/mispredictions from these resources than another mapping. Yet the absolute amount of time spent in serving these extra misses/mispredictions (acquired from PMUs directly) accounts for less than 2% (in most cases less than 0.5%) of the total execution time. Therefore, we conclude that these resources should receive low priority when mapping threads of the PARSEC benchmarks.
