Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Uploaded by: 戴文刚 | Upload date: 2015-05-07
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack W. Davidson, and Mary Lou Soffa
Department of Computer Science
University of Virginia, Charlottesville, VA 22904
Email: {wwang, td8h, jom5x, lt8f, jwd, soffa}@virginia.edu
Abstract: With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome the difficulties of simultaneously managing multiple resources with thread mapping, the interaction between various microarchitectural resources and thread characteristics must be well understood.
This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate the interaction of various thread mappings with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units and the cores. For each resource, the analysis provides guidelines for how to improve its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies depending on the workloads. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers and off-chip memory interconnects as one resource, and we present a new metric called L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.
I. INTRODUCTION
Compared to traditional uniprocessors, chip multiprocessors (CMPs) greatly improve system throughput by offering computational resources that allow multiple threads to execute in parallel. To realize the full potential of these powerful platforms, efficiently managing the resources that are shared by these simultaneously executing threads has become a critical issue.
In this paper, we focus on managing CMP shared resources through thread mapping. Previous research has shown that thread mapping is a powerful tool for managing resources [6, 8, 17, 22, 26]. However, despite the intensive and extensive research on this topic, properly mapping threads to achieve the optimal performance for an arbitrary workload is still an open question. Finding the optimal thread mapping is extremely difficult because one must consider all relevant resources, and the interaction between these many resources is workload dependent. Previous research has shown that L2 caches, the front-side bus and prefetchers have to be considered when managing memory hierarchy resources [29]. However, it remains unclear whether there are additional resources that should be considered, and how to holistically improve their utilization based on the workload characteristics.
To holistically manage multiple resources with thread mapping, there are three major challenges.
1) The first challenge is to identify the key resources that need to be considered by thread mapping algorithms. Neglecting the key resources would result in suboptimal performance.
2) The second challenge is to determine how to map threads to improve the utilization of each key resource. The best thread mapping also depends on the thread run-time characteristics. For each key resource, we need to identify the related thread run-time characteristics and determine how to map threads when they exhibit these characteristics.
3) The third challenge is to handle situations where no thread mapping can improve the utilization of all key resources. Under such circumstances, thread mapping algorithms must prioritize the resources and focus on improving the utilization of resources that can provide the maximum benefit.
Previous research on thread mapping focused on improving the utilization of the resources within the memory hierarchy [29] or focused only on individual resources [6, 17, 26]. Although the proposed approaches are successful in improving the utilization of these resources, the best application performance is not always guaranteed [28]. Moreover, most previous work has been done using single-threaded workloads, while emerging workloads increasingly include multi-threaded programs. Multi-threaded workloads have different run-time characteristics and thus require different mapping strategies.
As a first step towards overcoming the challenges of holistically managing multiple resources, we provide an in-depth performance analysis of all possible thread mappings for a set of workloads created using applications from the multi-threaded PARSEC benchmark suite [4]. While other work has looked at the memory hierarchy, in this work we take a holistic look at both the memory resources and processor resources (e.g., branch predictors, memory disambiguation unit, etc.), and evaluate their relative importance. In this analysis, we identify the key resources that are responsible for performance differences.
[...] related threads with these characteristics to improve the utilization of the key resources. Additionally, to help make trade-off decisions, we analyze the relative importance of the key resources for each workload, and investigate the reason for prioritizing some resources over the others. We observe that, by focusing on multiple resources, proper thread mapping can improve an application's performance by up to 28% over the current Linux scheduler, while consideration of only memory resources provides an improvement of only 14%.
Specifically, the contributions of this paper include:
1) An in-depth analysis that identifies the key hardware resources that must be considered by thread mapping algorithms, as well as the less important resources that do not need to be considered. Unlike previous work that considered only shared memory resources for mapping single-threaded applications, our paper demonstrates that for multi-threaded applications, thread mapping has to consider more resources and thread characteristics for better performance.
2) An analysis of how to improve each key resource's utilization with thread mapping when managing threads with different run-time characteristics. To the best of our knowledge, this analysis is the first that investigates the characteristics of multi-threaded workloads and their implications for managing both memory and processor resources with thread mapping.
3) An analysis showing that L2 caches, prefetchers and memory interconnections should be considered as one resource because of their complex interactions. We also propose a new metric, L2-misses-memory-latency-product (L2MP), to measure their aggregated performance impact.
4) An analysis that identifies the ranking of the key resources for each workload, and the reason for the ranking.
The remainder of this paper is organized as follows: Section II provides an overview of hardware resources and the thread characteristics considered in our analysis. Section III identifies the key resources for thread mapping algorithms. Section IV analyzes how to improve the utilization of individual resources via thread mapping. Section V discusses using the resource rankings to simultaneously manage multiple resources. Section VI summarizes the performance results. Section VII discusses related work and Section VIII concludes the paper.
II. PERFORMANCE ANALYSIS OVERVIEW
To address the challenges mentioned above, we perform a comprehensive analysis of how an application's performance is affected when threads with various characteristics share multiple hardware resources. This section gives an overview of the resources, the metrics, the run-time characteristics, and the thread mappings that are covered in this analysis.
A. Hardware Resources
We address the resources that are commonly available on current CMP processors. Fig. 1 gives a schematic view of the resources provided by an Intel quad-core processor. The resources we consider can be classified into two categories: the memory hierarchy resources and the processor resources.
Memory hierarchy resources include L1 instruction caches (I-cache), L1 data caches (D-cache), instruction and data translation look-aside buffers (I/D TLB), L2 caches, hardware prefetchers and off-chip memory interconnect. We use an Intel Core 2 processor, which has two prefetch mechanisms, Data Prefetch Logic (DPL) and L2 Streaming Prefetch [16]. DPL fetches a stream of instructions and data from memory if a stride memory access pattern is detected. L2 Streaming Prefetch brings two adjacent cache lines into an L2 cache.
Processor resources include the cores and components that require training to function, such as branch predictors and memory disambiguation units. In this paper, we use the term Resource Cores when we discuss the core as a resource.
B. Metrics
The number of cache misses, the number of memory transactions and memory access latency are used to evaluate the utilization of the memory hierarchy resources. The number of mispredictions and stalled CPU cycles due to these mispredictions are used to evaluate the training-based processor resources.
Processor utilization is used to evaluate the utilization of Resource Cores. Note that this processor utilization is viewed from the OS perspective. For example, suppose there is one thread that runs solely on a core. Due to I/O operations or synchronizations, this thread spends half of its execution time in the OS waiting queue. Then the core that executes this thread has a processor utilization of only 50%. In this paper, the processor utilization refers to the overall processor utilization of all cores in the system.
We use two metrics to evaluate the performance of the applications: the number of CPU cycles consumed and the execution time. Memory resources and the training-based processor resources impact the total cycles consumed by an application. Accordingly, we use executed CPU cycles to estimate the performance impact of these resources. The execution time, on the other hand, is affected by both processor utilization and the executed cycles. We use execution time when evaluating the performance impact of all the resources. The relation of execution time, executed cycles and processor utilization is described by equation (1).
\[ \text{ExecTime} = \frac{\text{ExecCycles}}{\text{Num} \times \text{Proc} \times \text{Frequency}} \qquad (1) \]
Num and Frequency refer to the number of cores and the processor frequency, respectively, and Proc is the overall processor utilization. Suppose there is an application with four threads running on a 1 GHz quad-core processor. Each of the four threads requires 500 million CPU cycles to execute, and each thread has 50% processor utilization due to I/O operations and synchronizations. Then the execution time of this application is (500M (cycles) × 4 (threads)) / (4 (cores) × 50% × 1 GHz) = 1 second.
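The worked example for equation (1) can be reproduced with a few lines of Python. This is only a sketch: the function and parameter names are ours, not the authors', and the utilization is passed as a fraction between 0 and 1.

```python
def exec_time_seconds(exec_cycles: float, num_cores: int,
                      proc_util: float, frequency_hz: float) -> float:
    """Equation (1): ExecTime = ExecCycles / (Num x Proc x Frequency)."""
    if not 0.0 < proc_util <= 1.0:
        raise ValueError("proc_util must be a fraction in (0, 1]")
    return exec_cycles / (num_cores * proc_util * frequency_hz)


# Worked example from the text: four threads, 500 million cycles each,
# on a 1 GHz quad-core with 50% overall processor utilization.
total_cycles = 4 * 500e6
example_time = exec_time_seconds(total_cycles, num_cores=4,
                                 proc_util=0.5, frequency_hz=1e9)
print(example_time)  # 1.0 (seconds)
```

Halving the utilization doubles the execution time for the same cycle count, which is why execution time rather than cycles is used when Resource Cores are part of the comparison.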
All of the metrics mentioned in this section can be acquired from performance monitoring units (PMUs) [16]. TABLE I gives the names of the PMUs we used in our analysis.
We compute memory access latency from PMUs using equation (2), proposed by Eranian [13]. Essentially, in equation (2), memory latency is computed by dividing the total cycles of all memory read transactions by the number of memory reads.
\[ \text{Mem latency} = \frac{\text{BUS\_REQUEST\_OUTSTANDING}}{\text{BUS\_TRANS\_BRD} - \text{BUS\_TRANS\_IFETCH}} \qquad (2) \]
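As an illustration of equation (2), the sketch below computes the latency from three raw counter readings. It assumes the three counters are the Intel Core 2 events BUS_REQUEST_OUTSTANDING, BUS_TRANS_BRD and BUS_TRANS_IFETCH; the function name and the sample readings are hypothetical and not taken from the paper's measurements.

```python
def memory_latency_cycles(bus_request_outstanding: int,
                          bus_trans_brd: int,
                          bus_trans_ifetch: int) -> float:
    """Equation (2): outstanding-request cycles divided by the number of
    memory reads, where reads = burst reads minus instruction fetches."""
    memory_reads = bus_trans_brd - bus_trans_ifetch
    if memory_reads <= 0:
        raise ValueError("no memory reads were counted")
    return bus_request_outstanding / memory_reads


# Hypothetical counter readings, for illustration only.
latency = memory_latency_cycles(bus_request_outstanding=6_000_000,
                                bus_trans_brd=25_000,
                                bus_trans_ifetch=5_000)
print(latency)  # 300.0 cycles per memory read
```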
C. Thread Characteristics
Thread characteristics include the properties of a single thread and the interactions among multiple threads.
For single thread properties, we consider a thread's cache demand, memory bandwidth demand and I/O frequency. Additionally, to describe how threads utilize prefetchers, we introduce three metrics: prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction. We define a thread's prefetcher effectiveness as the percentage of the L2 cache misses that are eliminated when the prefetchers are turned on compared to when they are turned off. We define a thread's prefetcher excessiveness as the percentage of additional cache lines that are brought into the L2 cache when the prefetchers are turned on compared to when they are turned off. Prefetch/memory fraction is defined as the fraction of prefetching transactions in the total memory transactions when prefetchers are on. Prefetcher effectiveness measures how much the application benefits from the prefetcher; prefetcher excessiveness measures how much extra pressure is put on memory bandwidth due to prefetching activity; and prefetch/memory fraction illustrates the overall impact of prefetchers on memory bandwidth.
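The three prefetcher metrics can be written down directly from these definitions. The sketch below is one reading of them: it assumes both percentages are taken relative to the prefetchers-off measurement, and the class, field and function names as well as the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrefetchCounts:
    l2_misses_off: int             # L2 misses with prefetchers off
    l2_misses_on: int              # L2 misses with prefetchers on
    l2_lines_in_off: int           # lines brought into L2, prefetchers off
    l2_lines_in_on: int            # lines brought into L2, prefetchers on
    prefetch_transactions_on: int  # prefetch transactions, prefetchers on
    memory_transactions_on: int    # all memory transactions, prefetchers on


def prefetcher_effectiveness(c: PrefetchCounts) -> float:
    """Percentage of L2 misses removed by turning the prefetchers on."""
    return 100.0 * (c.l2_misses_off - c.l2_misses_on) / c.l2_misses_off


def prefetcher_excessiveness(c: PrefetchCounts) -> float:
    """Percentage of extra lines brought into L2 with the prefetchers on."""
    return 100.0 * (c.l2_lines_in_on - c.l2_lines_in_off) / c.l2_lines_in_off


def prefetch_memory_fraction(c: PrefetchCounts) -> float:
    """Share of memory transactions that are prefetches (prefetchers on)."""
    return c.prefetch_transactions_on / c.memory_transactions_on


helped = PrefetchCounts(1000, 600, 1000, 1500, 500, 1500)
hurt = PrefetchCounts(1000, 1100, 1000, 1500, 500, 1500)
print(prefetcher_effectiveness(helped))  # 40.0
print(prefetcher_effectiveness(hurt))    # -10.0
```

A negative effectiveness, as in the second sample, corresponds to a thread that sees more L2 misses with the prefetchers turned on.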
For multiple thread interactions, we consider data sharing, instruction sharing, and the frequency of synchronization operations. These interactions usually happen among threads from the same application. We call such threads sibling threads.
TABLE II: [table body not recoverable] a1 and a2 are threads from application 1 and application 2, respectively. Each application has four threads. Threads from the same application are assumed to have similar characteristics. Note that we do not consider SMT here, so when two threads are pinned to one core, they share that core in a time-multiplexing manner.
D. Thread Mappings
In our experiments, we examined all possible thread mappings when running two multi-threaded applications, each with four threads. Therefore, there are eight threads in total [...]. Using more threads than cores allows thorough evaluation of the resources, including L1 caches, TLBs, branch predictors, memory disambiguation units and Resource Cores. Moreover, because real application threads have synchronizations and I/O operations, they cannot use all of the cores allocated to them. Thus, using more threads than cores can improve the overall processor utilization.
To guide our analysis, we choose four thread mappings that cover all resource sharing configurations (either sibling threads share a resource or non-sibling threads share a resource). All other thread mappings can be viewed as hybrid versions of these four mappings. TABLE II shows the four thread mappings on a quad-core processor running Linux. Except for the OS mapping, all mappings are done by statically pinning threads to cores using the processor affinity system call. How these four thread mappings use resources is described below.
OS Mapping (OSMap): This thread mapping is determined by the Linux scheduler. The OS tries to evenly spread the threads across the cores in the system to ensure fair processor time allocation and low processor idle time. Under this mapping, as long as there is an available core and a runnable thread, that thread is mapped to run on that core. As a result, any thread can run on any core and share the resources associated with these cores. The OSMap is used as the baseline for performance comparison.
Isolation-mapping (IsoMap): Under IsoMap, sibling threads are mapped to run on the two cores that share one L2 cache. In other words, they are isolated on that L2 cache. L1 caches, TLBs, L2 caches, hardware prefetchers and cores are shared by siblings.
Interleaving-mapping (IntMap): Under IntMap, threads from different applications are mapped to the cores in an interleaved fashion. L1 caches, TLBs and cores are still only shared by sibling threads. L2 caches and prefetchers are shared by threads from different applications.
Spreading-mapping (SprMap): Under SprMap, the four threads of each application are evenly spread across the four cores. As a result, every core executes two threads, which come from two different applications. L1 caches, TLBs, L2 caches and cores are shared by threads from two applications.
Note that, although we assume that sibling threads are identical here, some PARSEC benchmarks have sibling threads with different characteristics. However, we observe that the difference between the sibling threads is negligible. Therefore, thread mappings that only differ in the placement of sibling threads usually have similar performance.
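The three pinned mappings can be sketched as thread-to-core assignments. The code below is an illustration under stated assumptions: cores 0 and 1 are taken to share one L2 cache and cores 2 and 3 the other (the real core numbering depends on the machine), and all names are hypothetical. OSMap is omitted because it is decided by the scheduler at run time.

```python
from collections import defaultdict

# Assumed topology: cores 0 and 1 share L2 cache 0, cores 2 and 3 share L2 cache 1.
L2_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}
THREADS = [(app, t) for app in ("a1", "a2") for t in range(4)]


def iso_map():
    # Siblings stay on one L2: a1 on cores 0 and 1, a2 on cores 2 and 3.
    return {(app, t): (0 if app == "a1" else 2) + t % 2 for app, t in THREADS}


def int_map():
    # Applications alternate across cores: a1 on cores 0 and 2, a2 on 1 and 3.
    return {(app, t): (0 if app == "a1" else 1) + 2 * (t % 2) for app, t in THREADS}


def spr_map():
    # Thread t of each application runs on core t.
    return {(app, t): t for app, t in THREADS}


def apps_sharing(mapping, level):
    """Return, per core or per L2 cache, the set of applications placed there."""
    groups = defaultdict(set)
    for (app, _), core in mapping.items():
        key = core if level == "core" else L2_OF_CORE[core]
        groups[key].add(app)
    return dict(groups)


for name, build in (("IsoMap", iso_map), ("IntMap", int_map), ("SprMap", spr_map)):
    m = build()
    print(name, apps_sharing(m, "core"), apps_sharing(m, "l2"))
```

On Linux, each placement could then be applied with `os.sched_setaffinity(tid, {core})`, which is one way to issue a processor affinity call from Python; that step is left out here so the sketch runs without changing any process state.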
III. KEY RESOURCES IDENTIFICATION
This section identifies the key resources for thread mapping algorithms. A resource should satisfy two criteria to be considered as a key resource for thread mapping algorithms:
1) The utilization of this resource varies considerably among different thread mappings.
2) The utilization variations of this resource caused by thread mapping result in considerable variations in an application's performance.
Criterion one can easily be determined directly using PMUs. However, the second criterion requires two approaches. Although experimenting on a real machine provides a more accurate understanding of thread mappings, the ability to precisely account for each resource's performance impact is limited by the types of PMUs available in the hardware. For example, for branch predictors, there are PMUs that count the number of mispredictions, as well as the number of cycles stalled due to these mispredictions. However, for the L1 D-cache, there are only PMUs that give the number of L1 D-cache misses. There is no PMU that reports the number of cycles spent on L1 D-cache misses. Therefore, for different resources, different approaches have to be taken:
1) Direct Approach: For resources that have PMUs to measure their performance impacts, we use the readings from these PMUs directly.
2) Indirect Approach: For L1 D-caches, L2 caches and off-chip memory interconnects, there are no PMUs to directly measure their performance impact. For these resources, we first verify with PMUs that the performance variations across mappings are caused by memory stalls. Then we compare the performance of the thread mappings. If the application's performance is improved in one mapping, and only one resource's utilization is improved in this mapping, then we can conclude that it is this resource that causes the performance improvement.
A. Experimental Design
To find the key resources, we performed experiments on a real CMP machine. Here we introduce the experimental design. We use the PARSEC benchmark suite version 2.1 (with the native input set) to create our workloads because these benchmarks have a large variety of thread characteristics. TABLE III gives the run-time characteristics of the PARSEC benchmarks.
In TABLE III, data sharing, working set and synchronization operations are collected with a simulator by the PARSEC authors [4]. The amount of data sharing (high or low) refers to the percentage (high or low) of the cache lines that are shared among sibling threads. The working set here is an architectural concept which means the total size of memory touched by a benchmark. We use working set size to estimate the cache demand of a benchmark. Synchronization operations measures the total number of locks, barriers and conditions executed by a benchmark. All other characteristics are collected on an Intel Q9550 processor. The I/O time is collected by instrumenting the I/O functions. The bandwidth requirement of a benchmark is acquired by dividing the total amount of memory accessed by the execution time of the benchmark. The total amount of memory accessed equals the total number of memory transactions times the size of each transaction, which is 64 bytes. Prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction are computed following their definitions in Section II-C. The negative values of prefetcher effectiveness for swaptions and x264 suggest that these two benchmarks experience more L2 cache misses when hardware prefetchers are turned on.
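The bandwidth requirement described above reduces to a single expression. The following sketch uses hypothetical names and a made-up transaction count purely to show the arithmetic.

```python
TRANSACTION_BYTES = 64  # size of each memory transaction, as stated in the text


def bandwidth_bytes_per_second(memory_transactions: int,
                               exec_time_seconds: float) -> float:
    """Total memory accessed (transactions x 64 bytes) divided by execution time."""
    return memory_transactions * TRANSACTION_BYTES / exec_time_seconds


# Hypothetical benchmark: one billion transactions over a 32-second run.
bandwidth = bandwidth_bytes_per_second(1_000_000_000, 32.0)
print(bandwidth / 1e9)  # 2.0 (GB/s)
```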
Each workload consists of a pair of benchmarks. Therefore, we can compare the mappings where sibling threads share the resources with the mappings where non-sibling threads share the resources. We use nine PARSEC benchmarks and thus there are 36 pairs (workloads) in total. Four benchmarks, ferret, dedup, freqmine and raytrace, are not used in our analysis due to compilation errors, configuration errors or execution errors. For each mapping, each workload is executed until the longest benchmark has finished three runs. Shorter benchmarks are restarted if the longest benchmark has not finished. The average of the results of the first three runs is presented. The variation of the results for the same mapping and workload is very small. For IsoMap, IntMap, and SprMap, the variation is less than 2%. For OSMap, the variation is higher, usually between 2% and 4%. However, since we only use OSMap as a baseline, the higher variation would not affect our conclusions.
All experiments are conducted on a platform that has an Intel quad-core Q9550 processor. Each core of this processor has one 32KB L1 I-cache and one 32KB L1 D-cache. Every two cores share one 6MB L2 cache (Fig. 1). This platform has 2GB memory and runs Linux kernel 2.6.25. Readings from PMUs are collected with PerfMon2 [12].
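The workload count can be checked by enumerating the pairs. The benchmark list below is an assumption: it is the 13-program PARSEC 2.1 suite with the four excluded benchmarks removed, which leaves nine names.

```python
from itertools import combinations

# Assumed list: PARSEC 2.1 minus ferret, dedup, freqmine and raytrace.
BENCHMARKS = [
    "blackscholes", "bodytrack", "canneal", "facesim", "fluidanimate",
    "streamcluster", "swaptions", "vips", "x264",
]

# Every unordered pair of distinct benchmarks forms one workload.
WORKLOADS = list(combinations(BENCHMARKS, 2))

# Number of workloads that contain each benchmark.
per_benchmark = {b: sum(b in pair for pair in WORKLOADS) for b in BENCHMARKS}

print(len(WORKLOADS))              # 36
print(set(per_benchmark.values())) # {8}
```

Each benchmark therefore appears in eight workloads, which is the number of runs available per benchmark when averaging results for one mapping.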
Fig. 2: Average L1 D-cache misses each benchmark experiences under the four mappings. Normalized to OSMap. The result of swaptions is not shown here (nor in the following memory resources figures) because it has very few memory accesses.
Fig. 4: Comparison of the performance of streamcluster running with swaptions under IntMap and SprMap. Lower bar is better.
Fig. 5: Average processor utilization of each benchmark running under the four mappings. Normalized to OSMap.
Fig. 3: [...] Comparing four mappings does not change the conclusion.
B. L1 D-Cache
We first evaluate the importance of the L1 D-cache. Fig. 2 shows the normalized average L1 D-cache misses of each benchmark under the four mappings. For each PARSEC benchmark B, there are eight workloads (or pairs) that contain benchmark B. For each mapping, we run the eight workloads and read the L1 D-cache misses of B. Then we compute the average of the L1 D-cache misses of B for each mapping, and report the results in Fig. 2. We repeat the same process for all benchmarks and all four mappings. As Fig. 2 shows, L1 D-cache misses vary from 2% to 14%, depending on the thread mapping.
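The averaging and normalization procedure described above can be sketched as follows. The function name is ours, and the miss counts are invented placeholders rather than measured values; each list stands for one benchmark's L1 D-cache misses in its eight workloads under one mapping.

```python
from statistics import mean


def normalized_average(misses_by_mapping, baseline="OSMap"):
    """Average one benchmark's misses over its workloads for each mapping,
    then divide every average by the baseline mapping's average."""
    averages = {m: mean(values) for m, values in misses_by_mapping.items()}
    base = averages[baseline]
    return {m: avg / base for m, avg in averages.items()}


# Placeholder data for one benchmark B (eight workloads per mapping).
sample = {
    "OSMap":  [100, 110, 90, 100, 105, 95, 100, 100],
    "IsoMap": [92, 88, 90, 90, 91, 89, 90, 90],
    "IntMap": [95, 95, 94, 96, 95, 95, 95, 95],
    "SprMap": [112, 108, 110, 110, 111, 109, 110, 110],
}
normalized = normalized_average(sample)
print(normalized)
```

With this normalization OSMap is always 1.0, so a value below 1.0 means the mapping produced fewer misses than the Linux scheduler for that benchmark.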
We evaluated the L1 D-cache's impact on performance with the indirect approach. Fig. 3 shows the CPU cycles and memory resource utilization of x264 running with streamcluster under IntMap and SprMap. Because no other resource's utilization changes from one mapping to the other, only memory resources are shown in the figure. Fig. 3 shows that although IntMap has more L2 misses and higher memory latency, its performance is still better than SprMap due to fewer L1 misses. Therefore, thread mapping induced variation of L1 D-cache misses can cause considerable performance variation.
In conclusion, L1 D-cache misses vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, the L1 D-cache should be considered as a key resource.
C. L2 Cache, Hardware Prefetchers and Off-chip Memory Interconnect
Previous research has demonstrated that thread mapping can significantly impact the utilization of L2 caches, hardware prefetchers and off-chip memory interconnect, and consequently impact application performance [6, 17, 22, 26, 29]. Thus, these resources should be considered as key resources. Results of our experiments corroborate this conclusion. However, unlike previous research, our experiment results show that these memory resources are better viewed as one resource by thread mapping algorithms, and we provide a new metric called L2MP for evaluating their aggregated impact. The detailed discussion on this subject can be found in Section V-A.
D. Branch Predictors
In our experiments, we observe that one thread mapping could have 15 times more branch mispredictions than another mapping, which suggests that branch predictors' mispredictions vary significantly depending on thread mappings.
We evaluate the performance impact of branch predictors with the direct approach. Fig. 4 shows the performance of streamcluster and swaptions running together. Streamcluster consumes 48% more CPU cycles under SprMap than under IntMap, and swaptions consumes 8% more CPU cycles under SprMap. Fig. 4 also shows that 99% of the increased CPU cycles are caused by branch mispredictions. Therefore, the variation of branch mispredictions can produce considerable application performance variation.
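The direct approach amounts to comparing two PMU-derived differences: the growth in misprediction stall cycles and the growth in total cycles between two mappings. The sketch below shows that ratio; the function name is hypothetical and the cycle counts are invented so that they mirror the 48% and 99% figures quoted above, not measured data.

```python
def misprediction_share(cycles_base: int, cycles_other: int,
                        stalls_base: int, stalls_other: int) -> float:
    """Fraction of the extra CPU cycles that is explained by extra
    branch-misprediction stall cycles when moving between two mappings."""
    extra_cycles = cycles_other - cycles_base
    if extra_cycles <= 0:
        raise ValueError("the second mapping does not consume more cycles")
    return (stalls_other - stalls_base) / extra_cycles


# Invented counts (in millions of cycles): 48% more cycles under the second
# mapping, with almost all of the increase spent in misprediction stalls.
share = misprediction_share(cycles_base=10_000, cycles_other=14_800,
                            stalls_base=100, stalls_other=4_852)
print(round(share, 2))  # 0.99
```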
In conclusion, branch mispredictions vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, branch predictors should be considered as a key resource.
E. L1 I-cache, I/D TLBs and Memory Disambiguation Units
Different thread mappings have a great impact on the utilization of L1 I-caches, I/D TLBs and memory disambiguation units. One thread mapping can have more than ten times more misses/mispredictions from these resources than another mapping. Yet the absolute amount of time spent in serving these extra misses/mispredictions (acquired from PMUs directly) accounts for less than 2% (in most cases less than 0.5%) of the total execution time. Therefore, we conclude that these resources should receive low priority when mapping threads of the PARSEC benchmarks.