Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources



Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack W. Davidson, and Mary Lou Soffa

Department of Computer Science

University of Virginia, Charlottesville, VA 22904

Email: {wwang, td8h, jom5x, lt8f, jwd, soffa}@virginia.edu

Abstract: With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome the difficulties of simultaneously managing multiple resources with thread mapping, the interaction between various microarchitectural resources and thread characteristics must be well understood.

This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate the interaction of various thread mappings with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units and the cores. For each resource, the analysis provides guidelines for how to improve its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies depending on the workloads. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers and off-chip memory interconnects as one resource, and we present a new metric called L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.

I. INTRODUCTION

Compared to traditional uniprocessors, chip multiprocessors (CMPs) greatly improve system throughput by offering computational resources that allow multiple threads to execute in parallel. To realize the full potential of these powerful platforms, efficiently managing the resources that are shared by these simultaneously executing threads has become a critical issue.

In this paper, we focus on managing CMP shared resources through thread mapping. Previous research has shown that thread mapping is a powerful tool for managing resources [6, 8, 17, 22, 26]. However, despite the intensive and extensive research on this topic, properly mapping threads to achieve the optimal performance for an arbitrary workload is still an open question. Finding the optimal thread mapping is extremely difficult because one must consider all relevant resources and the interaction between these many resources is workload dependent. Previous research has shown that L2 caches, front-side-bus and prefetchers have to be considered when managing memory hierarchy resources [29]. However, it remains unclear whether there are additional resources that should be considered, and how to holistically improve their utilization based on the workload characteristics.

To holistically manage multiple resources with thread mapping, there are three major challenges.

1) The first challenge is to identify the key resources that need to be considered by thread mapping algorithms. Neglecting the key resources would result in suboptimal performance.

2) The second challenge is to determine how to map threads to improve the utilization of each key resource. The best thread mapping also depends on the thread run-time characteristics. For each key resource, we need to identify the related thread run-time characteristics and determine how to map threads when they exhibit these characteristics.

3) The third challenge is to handle situations where no thread mapping can improve the utilization of all key resources. Under such circumstances, thread mapping algorithms must prioritize the resources and focus on improving the utilization of resources that can provide the maximum benefit.

Previous research on thread mapping focused on improving the utilization of the resources within the memory hierarchy [29] or only on individual resources [6, 17, 26]. Although the proposed approaches are successful in improving the utilization of these resources, the best application performance is not always guaranteed [28]. Moreover, most previous work has been done using single-threaded workloads, while emerging workloads increasingly include multi-threaded programs. Multi-threaded workloads have different run-time characteristics and thus require different mapping strategies.

As a first step towards overcoming the challenges of holistically managing multiple resources, we provide an in-depth performance analysis of all possible thread mappings for a set of workloads created using applications from the multi-threaded PARSEC benchmark suite [4]. While other work has looked at the memory hierarchy, in this work we take a holistic look at both the memory resources and processor resources (e.g., branch predictors, memory disambiguation unit, etc.), and evaluate their relative importance. In this analysis, we identify the key resources that are responsible for performance differences.


[...] related threads with these characteristics to improve the utilization of the key resources. Additionally, to help make trade-off decisions, we analyze the relative importance of the key resources for each workload, and investigate the reason for prioritizing some resources over the others. We observe that, by focusing on multiple resources, proper thread mapping can improve an application's performance by up to 28% over the current Linux scheduler, while consideration of only memory resources provides an improvement of only 14%.

Specifically, the contributions of this paper include:

1) An in-depth analysis that identifies the key hardware resources that must be considered by thread mapping algorithms, as well as the less important resources that do not need to be considered. Unlike previous work that considered only shared memory resources for mapping single-threaded applications, our paper demonstrates that for multi-threaded applications, thread mapping has to consider more resources and thread characteristics for better performance.

2) An analysis of how to improve each key resource's utilization with thread mapping when managing threads with different run-time characteristics. To the best of our knowledge, this analysis is the first that investigates the characteristics of multi-threaded workloads and their implications for managing both memory and processor resources with thread mapping.

3) An analysis showing that L2 caches, prefetchers and memory interconnections should be considered as one resource because of their complex interactions. We also propose a new metric, L2-misses-memory-latency-product (L2MP), to measure their aggregated performance impact.

4) An analysis that identifies the ranking of the key resources for each workload, and the reason for the ranking.

The remainder of this paper is organized as follows: Section II provides an overview of hardware resources and the thread characteristics considered in our analysis. Section III identifies the key resources for thread mapping algorithms. Section IV analyzes how to improve the utilization of individual resources via thread mapping. Section V discusses using the resource rankings to simultaneously manage multiple resources. Section VI summarizes the performance results. Section VII discusses related work and Section VIII concludes the paper.

II. PERFORMANCE ANALYSIS OVERVIEW

To address the challenges mentioned above, we perform a comprehensive analysis of how an application's performance is affected when threads with various characteristics share multiple hardware resources. This section gives an overview of the resources, the metrics, the run-time characteristics, and the thread mappings that are covered in this analysis.

A. Hardware Resources

We address the resources that are commonly available on current CMP processors. Fig. 1 gives a schematic view of the resources provided by an Intel quad-core processor. The resources we consider can be classified into two categories: the memory hierarchy resources and the processor resources.

Memory hierarchy resources include L1 instruction caches (I-cache), L1 data caches (D-cache), instruction and data translation look-aside buffers (I/D TLB), L2 caches, hardware prefetchers and off-chip memory interconnect. We use an Intel Core 2 processor, which has two prefetch mechanisms, Data Prefetch Logic (DPL) and L2 Streaming Prefetch [16]. DPL fetches a stream of instructions and data from memory if a stride memory access pattern is detected. L2 Streaming Prefetch brings two adjacent cache lines into an L2 cache.

Processor resources include the cores and components that require training to function, such as branch predictors and memory disambiguation units. In this paper, we use the term Resource Cores when we discuss the core as a resource.

B. Metrics

The number of cache misses, the amount of memory transactions and memory access latency are used to evaluate the utilization of the memory hierarchy resources. The number of mispredictions and stalled CPU cycles due to these mispredictions are used to evaluate the training-based processor resources.

Processor utilization is used to evaluate the utilization of Resource Cores. Note that this processor utilization is viewed from the OS perspective. For example, suppose there is one thread that runs solely on a core. Due to I/O operations or synchronizations, this thread spends half of its execution time in the OS waiting queue. Then the core that executes this thread has a processor utilization of only 50%. In this paper, the processor utilization refers to the overall processor utilization of all cores in the system.

We use two metrics to evaluate the performance of the applications: the number of CPU cycles consumed and the execution time. Memory resources and the training-based processor resources impact the total cycles consumed by an application. Accordingly, we use executed CPU cycles to estimate the performance impact of these resources. The execution time, on the other hand, is affected by both processor utilization and the executed cycles. We use execution time when evaluating the performance impact of all the resources. The relation of execution time, executed cycles and processor utilization is described by equation (1).

\[
\text{ExecTime} = \frac{\text{ExecCycles}}{\text{Num} \times \text{Proc} \times \text{Frequency}} \tag{1}
\]

Num and Frequency refer to the number of cores and the processor frequency, respectively, and Proc is the overall processor utilization. Suppose there is an application with four threads running on a 1 GHz quad-core processor. Each of the four threads requires 500 million CPU cycles to execute, and each thread has 50% processor utilization due to I/O operations and synchronizations. Then the execution time of this application is (500M (cycles) × 4 (threads)) / (4 (cores) × 50% × 1 GHz) = 1 second.
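The worked example can be checked with a minimal sketch of equation (1). The function and parameter names are illustrative and do not come from the paper; the cycle count is assumed to be the total over all threads, as the example implies.

```python
def exec_time_seconds(exec_cycles, num_cores, proc_utilization, frequency_hz):
    """Equation (1): execution time from total executed cycles.

    exec_cycles is assumed to be summed over all threads, and
    proc_utilization is a fraction between 0 and 1.
    """
    return exec_cycles / (num_cores * proc_utilization * frequency_hz)


# Four threads of 500 million cycles each on a 1 GHz quad-core processor
# with 50% processor utilization.
total_cycles = 4 * 500_000_000
t = exec_time_seconds(total_cycles, num_cores=4,
                      proc_utilization=0.5, frequency_hz=1e9)
print(t)  # 1.0
```

The sketch also makes the role of utilization visible: halving the utilization doubles the execution time even when the executed cycles stay the same.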

All of the metrics mentioned in this section can be acquired from performance monitoring units (PMUs) [16]. TABLE I gives the names of the PMUs we used in our analysis.

We compute memory access latency from PMUs using equation (2), proposed by Eranian [13]. Essentially, in equation (2), memory latency is computed by dividing the total cycles of all memory read transactions by the number of memory reads.

\[
\text{Mem latency} = \frac{\text{BUS\_REQ\_OUTSTANDING}}{\text{BUS\_TRANS\_BRD} - \text{BUS\_TRANS\_IFETCH}} \tag{2}
\]
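A minimal sketch of this computation follows, assuming the denominator of equation (2) is the difference between the two bus-transaction counts (read transactions minus instruction fetches). The parameter names and the counter values in the example are hypothetical, not measurements from the paper.

```python
def memory_latency_cycles(outstanding_cycles, read_transactions, ifetch_transactions):
    """Average cycles per memory data read, following the shape of equation (2).

    outstanding_cycles:  total cycles spent with read requests outstanding
    read_transactions:   number of bus read transactions
    ifetch_transactions: number of those transactions that were instruction fetches
    """
    data_reads = read_transactions - ifetch_transactions
    if data_reads <= 0:
        raise ValueError("no data read transactions were counted")
    return outstanding_cycles / data_reads


# Hypothetical counter readings, for illustration only.
latency = memory_latency_cycles(3_000_000, 12_000, 2_000)
print(latency)  # 300.0
```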

C. Thread Characteristics

Thread characteristics include the properties of a single thread and the interactions among multiple threads.

For single thread properties, we consider a thread's cache demand, memory bandwidth demand and I/O frequency. Additionally, to describe how threads utilize prefetchers, we introduce three metrics: prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction. We define a thread's prefetcher effectiveness as the percentage of the L2 cache misses that are reduced when the prefetchers are turned on compared to when they are turned off. We define a thread's prefetcher excessiveness as the percentage of the additional cache lines that are brought into the L2 cache when the prefetchers are turned on compared to when they are turned off. Prefetch/memory fraction is defined as the fraction of prefetching transactions in the total memory transactions when prefetchers are on. Prefetcher effectiveness measures how much the application benefits from the prefetcher; prefetcher excessiveness measures how much extra pressure is put on memory bandwidth due to prefetching activity; and prefetch/memory fraction illustrates the overall impact of prefetchers on memory bandwidth.
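The three prefetcher metrics can be sketched as follows. This is one reading of the definitions above: both percentages are assumed to be taken relative to the prefetchers-off measurement, which the text does not state explicitly, and all names and sample values are illustrative.

```python
def prefetcher_effectiveness(l2_misses_off, l2_misses_on):
    """Percentage of L2 misses removed by turning the prefetchers on.

    Negative when the prefetchers increase the number of L2 misses.
    """
    return 100.0 * (l2_misses_off - l2_misses_on) / l2_misses_off


def prefetcher_excessiveness(l2_lines_in_off, l2_lines_in_on):
    """Percentage of additional cache lines brought into L2 with prefetchers on."""
    return 100.0 * (l2_lines_in_on - l2_lines_in_off) / l2_lines_in_off


def prefetch_memory_fraction(prefetch_transactions, total_transactions):
    """Fraction of memory transactions that are prefetches (prefetchers on)."""
    return prefetch_transactions / total_transactions


# Hypothetical counts, for illustration only.
print(prefetcher_effectiveness(1000, 600))   # 40.0
print(prefetcher_effectiveness(1000, 1200))  # -20.0
print(prefetcher_excessiveness(1000, 1500))  # 50.0
print(prefetch_memory_fraction(300, 1200))   # 0.25
```

Under this reading, a thread with high effectiveness and low excessiveness benefits from prefetching at little bandwidth cost, while negative effectiveness indicates that the prefetchers hurt the thread's L2 behavior.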

For multiple thread interactions, we consider data sharing, instruction sharing, and the frequency of synchronization operations. These interactions usually happen among threads from the same application. We call such threads sibling threads.

TABLE II: a1 and a2 are threads from application 1 and application 2, respectively. Each application has four threads. Threads from the same application are assumed to have similar characteristics. Note that we do not consider SMT here, so when two threads are pinned to one core, they share that core in a time-multiplexing manner.

D. Thread Mappings

In our experiments, we examined all possible thread mappings when running two multi-threaded applications, each with four threads. Therefore, there are eight threads in total, [...]. Using more threads than cores allows thorough evaluation of the resources, including L1 caches, TLBs, branch predictors, memory disambiguation units and Resource Cores. Moreover, because real application threads have synchronizations and I/O operations, they cannot use all of the cores allocated to them. Thus using more threads than cores can improve the overall processor utilization.

To guide our analysis, we choose four thread mappings that cover all resource sharing configurations (either sibling threads share a resource or non-sibling threads share a resource). All other thread mappings could be viewed as hybrid versions of these four mappings. TABLE II shows the four thread mappings on a quad-core processor running Linux. Except for the OS mapping, all mappings are done by statically pinning threads to cores using the processor affinity system call. How these four thread mappings use resources is described below.

OS Mapping (OSMap): This thread mapping is determined by the Linux scheduler. The OS tries to evenly spread the threads across the cores in the system to ensure fair processor time allocation and low processor idle time. Under this mapping, as long as there is an available core and a runnable thread, that thread is mapped to run on that core. As a result, any thread can run on any core and share the resources associated with these cores. The OSMap is used as the baseline for performance comparison.

Isolation-mapping (IsoMap): Under IsoMap, sibling threads are mapped to run on the two cores that share one L2 cache. In other words, they are isolated on that L2 cache. L1 caches, TLBs, L2 caches, hardware prefetchers and cores are shared by siblings.

Interleaving-mapping (IntMap): Under IntMap, threads from different applications are mapped to the cores in an interleaved fashion. L1 caches, TLBs and cores are still only shared by sibling threads. L2 caches and prefetchers are shared by threads from different applications.

Spreading-mapping (SprMap): Under SprMap, the four threads of each application are evenly spread on the four cores. As a result, every core executes two threads which come from two different applications. L1 caches, TLBs, L2 caches and cores are shared by threads from two applications.

Note that, although we assume that sibling threads are identical here, some PARSEC benchmarks have sibling threads with different characteristics. However, we observe that the difference between the sibling threads is negligible. Therefore, thread mappings that only differ in the placement of sibling threads usually have similar performance.
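The three static mappings can be written down as thread-to-core assignments. The sketch below is only an illustration of the descriptions above: it assumes that cores 0 and 1 share one L2 cache and cores 2 and 3 share the other (real core numbering depends on the machine), and the labels and function names are hypothetical.

```python
# Assumed topology: cores 0 and 1 share one L2 cache, cores 2 and 3 the other.
L2_DOMAINS = [(0, 1), (2, 3)]
APPS = ("a1", "a2")
THREADS_PER_APP = 4


def iso_map():
    """IsoMap: all threads of an application run on the two cores of one L2."""
    mapping = {}
    for app, cores in zip(APPS, L2_DOMAINS):
        for t in range(THREADS_PER_APP):
            mapping[(app, t)] = cores[t % 2]
    return mapping


def int_map():
    """IntMap: each core runs siblings only, but each L2 serves both applications."""
    mapping = {}
    for idx, app in enumerate(APPS):
        cores = [domain[idx] for domain in L2_DOMAINS]  # one core per L2 cache
        for t in range(THREADS_PER_APP):
            mapping[(app, t)] = cores[t % 2]
    return mapping


def spr_map():
    """SprMap: every core runs one thread from each application."""
    return {(app, t): t for app in APPS for t in range(THREADS_PER_APP)}


def apps_per_core(mapping):
    """Return {core: set of applications with a thread pinned to that core}."""
    result = {}
    for (app, _), core in mapping.items():
        result.setdefault(core, set()).add(app)
    return result


print(apps_per_core(iso_map()))
print(apps_per_core(int_map()))
print(apps_per_core(spr_map()))
```

On Linux, each assignment could then be enforced with the processor affinity system call, which Python exposes as `os.sched_setaffinity`; that step is omitted here so the sketch runs on any platform.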

III. KEY RESOURCES IDENTIFICATION

This section identifies the key resources for thread mapping algorithms. A resource should satisfy two criteria to be considered as a key resource for thread mapping algorithms:

1) The utilization of this resource varies considerably among different thread mappings.

2) Thread-mapping-induced utilization variations of this resource result in considerable variations in an application's performance.

Criterion one can easily be determined directly using PMUs. However, the second criterion requires two approaches. Although experimenting on a real machine provides a more accurate understanding of thread mappings, the ability to precisely account for each resource's performance impact is limited by the types of PMUs available in the hardware. For example, for branch predictors, there are PMUs that count the number of mispredictions, as well as the number of cycles stalled due to these mispredictions. However, for L1 D-cache, there are only PMUs that give the number of L1 D-cache misses. There is no PMU that tells the number of cycles spent on L1 D-cache misses. Therefore, for different resources, different approaches have to be taken:

1) Direct Approach: For resources that have PMUs to measure their performance impacts, we use the readings from these PMUs directly.

2) Indirect Approach: For L1 D-caches, L2 caches and off-chip memory interconnects, there are no PMUs to directly measure their performance impact. For these resources, we first verify with PMUs that the performance variations across mappings are caused by memory stalls. Then we compare the performance of the thread mappings. If the application's performance is improved in one mapping, and only one resource's utilization is improved in this mapping, then we can conclude that it is this resource that causes the performance improvement.

A. Experimental Design

To find the key resources, we performed experiments on a real CMP machine. Here we introduce the experimental design. We use the PARSEC benchmark suite version 2.1 (with the native input set) to create our workloads because these benchmarks have a large variety of thread characteristics. TABLE III gives the run-time characteristics of the PARSEC benchmarks.

In TABLE III, data sharing, working set and synchronization operations are collected with a simulator by the PARSEC authors [4]. The amount of data sharing (high or low) refers to the percentage (high or low) of the cache lines that are shared among sibling threads. The working set here is an architectural concept which means the total size of memory touched by a benchmark. We use working set size to estimate the cache demand of a benchmark. Synchronization operations measures the total number of locks, barriers and conditions executed by a benchmark. All other characteristics are collected on an Intel Q9550 processor. The I/O time is collected by instrumenting the I/O functions. The bandwidth requirement of a benchmark is acquired by dividing the total amount of memory accessed by the execution time of the benchmark. The total amount of memory accessed equals the total number of memory transactions times the size of each transaction, which is 64 bytes. Prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction are computed following their definitions in Section II-C. The negative values of prefetcher effectiveness for swaptions and x264 suggest that these two benchmarks experience more L2 cache misses when hardware prefetchers are turned on.
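The bandwidth requirement described above is a short calculation. The following sketch uses hypothetical names and a made-up transaction count; only the 64-byte transaction size comes from the text.

```python
TRANSACTION_BYTES = 64  # size of each memory transaction, as stated in the text


def bandwidth_bytes_per_second(memory_transactions, exec_time_seconds):
    """Total memory accessed divided by the benchmark's execution time."""
    return memory_transactions * TRANSACTION_BYTES / exec_time_seconds


# Hypothetical benchmark: 50 million memory transactions in 10 seconds.
bw = bandwidth_bytes_per_second(50_000_000, 10.0)
print(bw / 1e6)  # 320.0 (MB/s)
```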

Each workload consists of a pair of benchmarks. Therefore, we can compare the mappings where sibling threads share the resources with the mappings where non-sibling threads share the resources. We use nine PARSEC benchmarks and thus there are 36 pairs (workloads) in total. Four benchmarks, ferret, dedup, freqmine and raytrace, are not used in our analysis due to compilation errors, configuration errors or execution errors. For each mapping, each workload is executed until the longest benchmark has finished three runs. Shorter benchmarks are restarted if the longest benchmark has not finished. The average of the results of the first three runs is presented. The variation of the results for the same mapping and workload is very small. For IsoMap, IntMap, and SprMap, the variation is less than 2%. For OSMap, the variation is higher, usually between 2% and 4%. However, since we only use OSMap as a baseline, the higher variation would not affect our conclusions.

All experiments are conducted on a platform that has an Intel quad-core Q9550 processor. Each core of this processor has one 32KB L1 I-cache and one 32KB L1 D-cache. Every two cores share one 6MB L2 cache (Fig. 1). This platform has 2GB memory and runs Linux kernel 2.6.25. Readings from PMUs are collected with PerfMon2 [12].
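The workload count can be reproduced by enumerating unordered pairs of the benchmarks that remain after the four exclusions. In the sketch below, the full benchmark list is taken from the PARSEC 2.1 distribution rather than from this paper, and the variable names are illustrative.

```python
from itertools import combinations

# Benchmarks shipped with PARSEC 2.1 (an assumption drawn from the suite itself).
PARSEC_2_1 = [
    "blackscholes", "bodytrack", "canneal", "dedup", "facesim", "ferret",
    "fluidanimate", "freqmine", "raytrace", "streamcluster", "swaptions",
    "vips", "x264",
]
# Benchmarks the paper reports as unused.
EXCLUDED = {"ferret", "dedup", "freqmine", "raytrace"}

benchmarks = [b for b in PARSEC_2_1 if b not in EXCLUDED]
workloads = list(combinations(benchmarks, 2))  # unordered pairs of distinct benchmarks

print(len(benchmarks))  # 9
print(len(workloads))   # 36
```

Each benchmark appears in eight of these pairs, which matches the eight workloads per benchmark used when averaging results.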


Fig. 2: Average L1 D-cache misses each benchmark experiences under the four mappings. Normalized to OSMap. Result of swaptions is not shown here (nor in the following memory resources figures) because it has very few memory accesses.

Fig. 4: Comparison of the performance of streamcluster running with swaptions under IntMap and SprMap. Lower bar is better.

Fig. 5: Average processor utilization of each benchmark running under the four mappings. Normalized to OSMap.

Fig. 3: [...] Comparing four mappings does not change the conclusion.

B. L1 D-Cache

We first evaluate the importance of L1 D-cache. Fig. 2 shows the normalized average L1 D-cache misses of each benchmark under the four mappings. For each PARSEC benchmark B, there are eight workloads (or pairs) that contain benchmark B. For each mapping, we run the eight workloads, and read the L1 D-cache misses of B. Then we compute the average of the L1 D-cache misses of B for each mapping, and report the results in Fig. 2. We repeat the same process for all benchmarks and all four mappings. As Fig. 2 shows, L1 D-cache misses vary from 2% to 14%, depending on the thread mapping.
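The averaging procedure can be sketched as below. It assumes that the per-mapping average is computed first and then divided by the OSMap average; the text does not say whether normalization happens before or after averaging. The data layout, names and miss counts are all hypothetical.

```python
from statistics import mean


def normalized_misses(samples, benchmark, baseline="OSMap"):
    """Average a benchmark's miss counts per mapping, normalized to the baseline.

    samples maps (mapping, benchmark) to a list of miss counts, one per
    workload that contains the benchmark.
    """
    base = mean(samples[(baseline, benchmark)])
    return {
        mapping: mean(counts) / base
        for (mapping, bench), counts in samples.items()
        if bench == benchmark
    }


# Made-up counts for two workloads only; the paper uses eight per benchmark.
samples = {
    ("OSMap", "x264"): [100, 120],
    ("IntMap", "x264"): [90, 108],
    ("SprMap", "x264"): [110, 132],
}
result = normalized_misses(samples, "x264")
print(result)
```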

We evaluated L1 D-cache's impact on performance with the indirect approach. Fig. 3 shows the CPU cycles and memory resource utilization of x264 running with streamcluster under IntMap and SprMap. Because no other resource's utilization has changed from one mapping to another, only memory resources are shown in the figure. Fig. 3 shows that although IntMap has more L2 misses and higher memory latency, its performance is still better than SprMap due to fewer L1 misses. Therefore, thread-mapping-induced variation of L1 D-cache misses can cause considerable performance variation.

In conclusion, L1 D-cache misses vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, L1 D-cache should be considered as a key resource.

C. L2 Cache, Hardware Prefetchers and Off-chip Memory Interconnect

Previous research has demonstrated that thread mapping can significantly impact the utilization of L2 caches, hardware prefetchers and off-chip memory interconnect, and consequently impact application performance [6, 17, 22, 26, 29]. Thus, these resources should be considered as key resources. Results of our experiments corroborate this conclusion. However, unlike previous research, our experiment results show that these memory resources are better viewed as one resource by thread mapping algorithms, and we provide a new metric called L2MP for evaluating their aggregated impact. The detailed discussion on this subject can be found in Section V-A.

D. Branch Predictors

In our experiments, we observe that one thread mapping could have 15 times more branch mispredictions than another mapping, which suggests that branch predictors' mispredictions vary significantly depending on thread mappings.

We evaluate the performance impact of branch predictors with the direct approach. Fig. 4 shows the performance of streamcluster and swaptions running together. Streamcluster consumes 48% more CPU cycles under SprMap than under IntMap, and swaptions consumes 8% more CPU cycles under SprMap. Fig. 4 also shows that 99% of the increased CPU cycles are caused by branch mispredictions. Therefore, the variation of branch mispredictions can produce considerable application performance variation.

In conclusion, branch mispredictions vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, branch predictors should be considered as a key resource.

E. L1 I-cache, I/D TLBs and Memory Disambiguation Units

Different thread mappings have a great impact on the utilization of L1 I-caches, I/D TLBs and memory disambiguation units. One thread mapping can have more than ten times more misses/mispredictions from these resources than another mapping. Yet the absolute amount of time spent in serving these extra misses/mispredictions (acquired from PMUs directly) accounts for less than 2% (in most cases less than 0.5%) of the total execution time. Therefore, we conclude that these resources should receive low priority when mapping threads of the PARSEC benchmarks.